35:[["$","audio",null,{"id":"tts"}],["$","$L3a",null,{"paperID":"0912.3995","publisher":"arxiv","paperJSON":{"title":"Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design","paperID":"0912.3995","avgLineHeight":11.95,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Many applications require optimizing an unknown, ","element":"span"},{"text":"noisy function that is expensive to evaluate. ","element":"span"},{"text":"We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. ","element":"span"},{"text":"We analyze ","element":"span"},{"text":"GP-UCB","element":"span"},{"text":", an intuitive upper-confidence based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. ","element":"span"},{"text":"In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"compares favorably with other heuristic GP optimization approaches.","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"In most stochastic optimization settings, evaluating the unknown function is expensive, and sampling is to be minimized. 
","element":"span"},{"text":"Examples include choosing advertisements ","element":"span"},{"text":"in ","element":"span"},{"text":"sponsored ","element":"span"},{"text":"search ","element":"span"},{"text":"to ","element":"span"},{"text":"maximize profit in a click-through model (","element":"span"},{"href":"#id-0","referenceIndex":25,"text":"Pandey & Olston","element":"a"},{"href":"#id-0","referenceIndex":25,"text":", ","element":"a"},{"href":"#id-0","referenceIndex":25,"text":"2007","element":"a"},{"text":") or learning optimal control strategies for robots (","element":"span"},{"href":"#id-1","referenceIndex":20,"text":"Lizotte et al.","element":"a"},{"href":"#id-1","referenceIndex":20,"text":", ","element":"a"},{"href":"#id-1","referenceIndex":20,"text":"2007","element":"a"},{"text":"). ","element":"span"},{"text":"Predominant approaches to this problem include the multi-armed bandit paradigm (","element":"span"},{"href":"#id-2","referenceIndex":27,"text":"Robbins","element":"a"},{"href":"#id-2","referenceIndex":27,"text":", ","element":"a"},{"href":"#id-2","referenceIndex":27,"text":"1952","element":"a"},{"text":"), ","element":"span"},{"text":"where the goal is to maximize cumulative reward by optimally balancing exploration and exploitation, and experimental design (","element":"span"},{"href":"#id-3","referenceIndex":6,"text":"Chaloner & Verdinelli","element":"a"},{"href":"#id-3","referenceIndex":6,"text":", ","element":"a"},{"href":"#id-3","referenceIndex":6,"text":"1995","element":"a"},{"text":"), where the function is to be explored globally with as few evaluations as possible, for example by maximizing information gain. The challenge in both approaches is twofold: we have to estimate an unknown function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"from noisy samples, and we must optimize our estimate over some high-dimensional input space. 
For the former, much progress has been made in machine learning through kernel methods and Gaussian process (GP) models (","element":"span"},{"href":"#id-4","referenceIndex":26,"text":"Rasmussen & Williams","element":"a"},{"href":"#id-4","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-4","referenceIndex":26,"text":"2006","element":"a"},{"text":"), where smoothness assumptions about ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"are encoded through the choice of kernel in a flexible nonparametric fashion. Beyond Euclidean spaces, kernels can be defined on diverse domains such as spaces of graphs, sets, or lists.","element":"span"}],[{"text":"We are concerned with GP optimization in the multi-armed bandit setting, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is sampled from a GP distribution or has low “complexity” measured in terms of its RKHS norm under some kernel. We provide the first sublinear regret bounds in this nonparametric setting, which imply convergence rates for GP optimization. 
In particular, we analyze the Gaussian Process Upper Confidence Bound (","element":"span"},{"text":"GP-UCB","element":"span"},{"text":") algorithm, a simple and intuitive Bayesian method (","element":"span"},{"href":"#id-5","referenceIndex":3,"text":"Auer ","element":"a"},{"href":"#id-5","referenceIndex":3,"text":"et al.","element":"a"},{"href":"#id-5","referenceIndex":3,"text":", ","element":"a"},{"href":"#id-5","referenceIndex":3,"text":"2002","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":2,"text":"Auer","element":"a"},{"href":"#id-6","referenceIndex":2,"text":", ","element":"a"},{"href":"#id-6","referenceIndex":2,"text":"2002","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":9,"text":"Dani et al.","element":"a"},{"href":"#id-7","referenceIndex":9,"text":", ","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"2008","element":"a"},{"text":"). ","element":"span"},{"text":"While objectives are different in the multi-armed bandit and experimental design paradigms, our results draw a close technical connection between them: our regret bounds come in terms of an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"information gain ","element":"span"},{"text":"quantity, measuring how fast ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"can be learned in an information-theoretic sense. ","element":"span"},{"text":"The submodularity of this function allows us to prove sharp regret bounds for particular covariance functions, which we demonstrate for the commonly used Squared Exponential and Matérn kernels.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Related Work. ","element":"span"},{"text":"Our work generalizes stochastic ","element":"span"},{"style":{"fontStyle":"italic"},"text":"linear ","element":"span"},{"text":"optimization in a bandit setting, where the unknown function comes from a finite-dimensional linear space. 
","element":"span"},{"text":"GPs are nonlinear random functions, which can be represented in an infinite-dimensional linear space. ","element":"span"},{"text":"For the standard linear setting, ","element":"span"},{"href":"#id-7","referenceIndex":9,"text":"Dani ","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"et al. ","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"(","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"2008","element":"a"},{"text":") provide a near-complete characterization1","element":"span"}],[{"text":"(also see ","element":"span"},{"href":"#id-6","referenceIndex":2,"text":"Auer ","element":"a"},{"href":"#id-6","referenceIndex":2,"text":"2002","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":8,"text":"Dani et al. ","element":"a"},{"href":"#id-8","referenceIndex":8,"text":"2007","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":1,"text":"Abernethy et al. ","element":"a"},{"href":"#id-9","referenceIndex":1,"text":"2008","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":28,"text":"Rusmevichientong & Tsitsiklis ","element":"a"},{"href":"#id-10","referenceIndex":28,"text":"2008","element":"a"},{"text":"), explicitly ","element":"span"},{"id":"id-14","text":"dependent on the dimensionality. In the GP setting, ","element":"span"},{"text":"the challenge is to characterize complexity in a different manner, through properties of the kernel function. 
Our technical contributions are twofold: ","element":"span"},{"text":"first, we show how to analyze the nonlinear setting by focusing on the concept of information gain, and second, we explicitly bound this information gain measure using the concept of submodularity (","element":"span"},{"href":"#id-11","referenceIndex":24,"text":"Nemhauser et al.","element":"a"},{"href":"#id-11","referenceIndex":24,"text":", ","element":"a"},{"href":"#id-11","referenceIndex":24,"text":"1978","element":"a"},{"text":") and knowledge about kernel operator spectra.","element":"span"}],[{"href":"#id-12","referenceIndex":16,"text":"Kleinberg et al. ","element":"a"},{"href":"#id-12","referenceIndex":16,"text":"(","element":"a"},{"href":"#id-12","referenceIndex":16,"text":"2008","element":"a"},{"text":") provide regret bounds under weaker and less configurable assumptions (only Lipschitz-continuity ","element":"span"},{"text":"w.r.t. ","element":"span"},{"text":"a ","element":"span"},{"text":"metric ","element":"span"},{"text":"is ","element":"span"},{"text":"assumed; ","element":"span"},{"href":"#id-13","referenceIndex":5,"text":"Bubeck et al. ","element":"a"},{"href":"#id-13","referenceIndex":5,"text":"2008 ","element":"a"},{"text":"consider arbitrary topological spaces), which however degrade rapidly with the dimensionality of the problem (Ω(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"}],[{"text":"linearity w.r.t. a fixed basis is often too stringent","element":"span"}],[{"text":"an assumption, while Lipschitz-continuity can be too coarse-grained, leading to poor rate bounds. Adopting GP assumptions, we can model levels of smoothness in a fine-grained way. 
For example, our rates for the frequently used Squared Exponential kernel, enforcing a high degree of smoothness, have weak dependence on the dimensionality: ","element":"span"},{"style":{"height":19.2},"width":117.18,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-0.png","element":"img","alt":" O(√T","inline":true},{"text":"(log ","element":"span"},{"style":{"height":16.2},"width":101.37,"height":40.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-1.png","element":"img","alt":" T)^{d+1}","inline":true},{"text":") (see Fig. ","element":"span"},{"href":"#id-14","text":"1","element":"a"},{"text":").","element":"span"}],[{"text":"There is a large literature on GP (response surface) optimization. Several heuristics for trading off exploration and exploitation in GP optimization have been proposed (such as Expected Improvement, ","element":"span"},{"href":"#id-15","referenceIndex":23,"text":"Mockus ","element":"a"},{"href":"#id-15","referenceIndex":23,"text":"et al. ","element":"a"},{"href":"#id-15","referenceIndex":23,"text":"1978","element":"a"},{"text":", and Most Probable Improvement, ","element":"span"},{"href":"#id-16","referenceIndex":22,"text":"Mockus ","element":"a"},{"href":"#id-16","referenceIndex":22,"text":"1989","element":"a"},{"text":") and successfully applied in practice (","element":"span"},{"style":{"fontStyle":"italic"},"text":"cf.","element":"span"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":20,"text":"Lizotte ","element":"a"},{"href":"#id-1","referenceIndex":20,"text":"et al. ","element":"a"},{"href":"#id-1","referenceIndex":20,"text":"2007","element":"a"},{"text":"). ","element":"span"},{"href":"#id-17","referenceIndex":4,"text":"Brochu et al. 
","element":"a"},{"href":"#id-17","referenceIndex":4,"text":"(","element":"a"},{"href":"#id-17","referenceIndex":4,"text":"2009","element":"a"},{"text":") provide a comprehensive review of and motivation for Bayesian optimization using GPs. ","element":"span"},{"text":"The Efficient Global Optimization (EGO) algorithm for optimizing expensive black-box functions was proposed by ","element":"span"},{"href":"#id-18","referenceIndex":15,"text":"Jones et al. ","element":"a"},{"href":"#id-18","referenceIndex":15,"text":"(","element":"a"},{"href":"#id-18","referenceIndex":15,"text":"1998","element":"a"},{"text":") and extended to GPs by ","element":"span"},{"href":"#id-19","referenceIndex":14,"text":"Huang et al. ","element":"a"},{"href":"#id-19","referenceIndex":14,"text":"(","element":"a"},{"href":"#id-19","referenceIndex":14,"text":"2006","element":"a"},{"text":"). Little is known about the theoretical performance of GP optimization. While convergence of EGO is established by ","element":"span"},{"href":"#id-20","referenceIndex":33,"text":"Vazquez ","element":"a"},{"href":"#id-20","referenceIndex":33,"text":"& Bect ","element":"a"},{"href":"#id-20","referenceIndex":33,"text":"(","element":"a"},{"href":"#id-20","referenceIndex":33,"text":"2007","element":"a"},{"text":"), convergence rates have remained elusive. ","element":"span"},{"href":"#id-21","referenceIndex":13,"text":"Grünewälder et al. ","element":"a"},{"href":"#id-21","referenceIndex":13,"text":"(","element":"a"},{"href":"#id-21","referenceIndex":13,"text":"2010","element":"a"},{"text":") consider the pure exploration problem for GPs, where the goal is to find the optimal decision over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"rounds, rather than maximize cumulative reward (with no exploration/exploitation dilemma). They provide sharp bounds for this exploration problem. Note that this methodology would not lead to bounds for minimizing the cumulative regret. 
Our cumulative regret bounds translate to the first performance guarantees (rates) for GP optimization.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Summary. ","element":"span"},{"text":"Our main contributions are:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"We analyze ","element":"span"},{"text":"GP-UCB","element":"span"},{"text":", an intuitive algorithm for GP optimization, when the function is either sam-","element":"span"}],[{"style":{"width":"100%"},"width":937,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 1. ","element":"figcaption","subtype":"caption"},{"text":"Our regret bounds (up to polylog factors) for linear, radial basis, and Matérn kernels; ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"d ","element":"figcaption","subtype":"caption"},{"text":"is the dimension, ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T ","element":"figcaption","subtype":"caption"},{"text":"is the time horizon, and ","element":"figcaption","subtype":"caption"},{"style":{"height":6.4},"width":20,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-3.png","element":"img","alt":" ν","inline":true,"padRight":true},{"text":"is a Matérn parameter.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"89%"},"width":836,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"We bound the cumulative regret for ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"in terms of the information gain due to sampling, establishing a novel connection between experimental design and GP optimization.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• 
","element":"span"},{"text":"By bounding the information gain for popular classes of kernels, we establish sublinear regret bounds for GP optimization for the first time. Our bounds depend on kernel choice and parameters in a fine-grained fashion.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"We evaluate ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"on sensor network data, demonstrating that it compares favorably to existing algorithms for GP optimization.","element":"span"}]]},{"heading":"2. Problem Statement and Background","paragraphs":[[{"text":"Consider the problem of sequentially optimizing an unknown reward function ","element":"span"},{"style":{"height":14},"width":182.1,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-5.png","element":"img","alt":" f : D → R","inline":true},{"text":": in each round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", we choose a point ","element":"span"},{"style":{"height":13.19},"width":122,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-6.png","element":"img","alt":" xt ∈ D","inline":true,"padRight":true},{"text":"and get to see the function value there, perturbed by noise: ","element":"span"},{"style":{"height":16},"width":251.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-7.png","element":"img","alt":" yt = f(xt)+ϵt","inline":true},{"text":". 
Our goal is to maximize the sum of rewards ","element":"span"},{"style":{"height":20.4},"width":180.64,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-8.png","element":"img","alt":"∑_{t=1}^{T} f(x_t","inline":true},{"text":"), thus to ","element":"span"},{"text":"perform essentially as well as ","element":"span"},{"style":{"height":16.7},"width":382.12,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-9.png","element":"img","alt":" x∗ = argmax_{x∈D} f(x","inline":true},{"text":") (as rapidly as possible). For example, we might want to find locations of highest temperature in a building by sequentially activating sensors in a spatial network and regressing on their measurements. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"consists of all sensor locations, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":") is the temperature at ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":", and sensor accuracy is quantified by the noise variance. Each activation draws battery power, so we want to sample from as few sensors as possible.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Regret. ","element":"span"},{"text":"A natural performance metric in this context is cumulative regret, the loss in reward due to not knowing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"’s maximum points beforehand. 
","element":"span"},{"text":"Suppose the unknown function is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", its maximum point","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-10.png","element":"img","alt":"1","inline":true},{"style":{"height":16.7},"width":388.2,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-11.png","element":"img","alt":"x∗ = argmax_{x∈D} f(x","inline":true},{"text":"). ","element":"span"},{"text":"For our choice ","element":"span"},{"style":{"height":9.59},"width":38.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-12.png","element":"img","alt":" xt","inline":true,"padRight":true},{"text":"in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", we incur instantaneous regret ","element":"span"},{"style":{"height":16},"width":324.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-13.png","element":"img","alt":" rt = f(x∗) − f(xt","inline":true},{"text":"). The ","element":"span"},{"style":{"height":14},"width":372.06,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-14.png","element":"img","alt":" cumulative regret RT","inline":true,"padRight":true},{"text":"after ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"rounds is the sum of instantaneous regrets: ","element":"span"},{"style":{"height":20.4},"width":248.53,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-15.png","element":"img","alt":" R_T = ∑_{t=1}^{T} r_t","inline":true},{"text":". 
A desirable ","element":"span"},{"text":"asymptotic property of an algorithm is to be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"no-regret","element":"span"},{"text":": lim","element":"span"},{"style":{"height":16},"width":292.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-16.png","element":"img","alt":"T →∞ RT /T = 0.","inline":true,"padRight":true},{"text":"Note that neither ","element":"span"},{"style":{"height":9.19},"width":29.98,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-17.png","element":"img","alt":" rt","inline":true,"padRight":true},{"text":"nor ","element":"span"},{"style":{"height":13.19},"width":53.26,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-18.png","element":"img","alt":" RT","inline":true,"padRight":true},{"text":"is ever revealed to the algorithm. Bounds on the average regret ","element":"span"},{"style":{"height":16},"width":104.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-19.png","element":"img","alt":" RT /T","inline":true,"padRight":true},{"text":"translate to convergence rates for GP optimization: the maximum max","element":"span"},{"style":{"height":16.79},"width":146.25,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-20.png","element":"img","alt":"t≤T f(xt","inline":true},{"text":") in the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"rounds is no further from ","element":"span"},{"style":{"height":16},"width":81.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/1-21.png","element":"img","alt":" f(x∗","inline":true},{"text":") than the average.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1. Gaussian Processes and RKHS’s","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gaussian Processes. 
","element":"span"},{"text":"Some assumptions on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"are required to guarantee no-regret. While rigid parametric assumptions such as linearity may not hold in practice, a certain degree of smoothness is often warranted. In our sensor network, temperature readings at nearby locations are highly correlated (see Figure ","element":"span"},{"href":"#id-22","text":"2(a)","element":"a"},{"text":"). We can enforce implicit properties like smoothness without relying on any parametric assumptions, modeling ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"as a sample from a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Gaussian process ","element":"span"},{"text":"(GP): a collection of dependent random variables, one for each ","element":"span"},{"style":{"height":11.6},"width":127.92,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-0.png","element":"img","alt":"x ∈ D","inline":true},{"text":", every finite subset of which is multivariate Gaussian distributed in an overall consistent way (","element":"span"},{"href":"#id-4","referenceIndex":26,"text":"Rasmussen & Williams","element":"a"},{"href":"#id-4","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-4","referenceIndex":26,"text":"2006","element":"a"},{"text":"). 
","element":"span"},{"text":"A ","element":"span"},{"style":{"height":16},"width":298.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-1.png","element":"img","alt":" GP(µ(x), k(x, x′","inline":true},{"text":")) is specified by its mean function ","element":"span"},{"style":{"height":16},"width":258.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-2.png","element":"img","alt":" µ(x) = E[f(x","inline":true},{"text":")] and covariance (or kernel) function ","element":"span"},{"style":{"height":16},"width":374.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-3.png","element":"img","alt":" k(x, x′) = E[(f(x) −","inline":true},{"style":{"height":16},"width":65.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-4.png","element":"img","alt":"µ(x","inline":true},{"text":"))(","element":"span"},{"style":{"height":16},"width":225.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-5.png","element":"img","alt":"f(x′) − µ(x′","inline":true},{"text":"))]. ","element":"span"},{"text":"For GPs not conditioned on data, we assume","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-6.png","element":"img","alt":"2","inline":true,"padRight":true},{"text":"that ","element":"span"},{"style":{"height":10.8},"width":71.07,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-7.png","element":"img","alt":" µ ≡","inline":true,"padRight":true},{"text":"0. 
Moreover, we restrict ","element":"span"},{"style":{"height":16},"width":165.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-8.png","element":"img","alt":"k(x, x) ≤","inline":true,"padRight":true},{"text":"1, ","element":"span"},{"style":{"height":11.6},"width":107.97,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-9.png","element":"img","alt":" x ∈ D","inline":true},{"text":", i.e., we assume bounded variance. By fixing the correlation behavior, the covariance function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"encodes smoothness properties of sample functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"drawn from the GP. A range of commonly used kernel functions is given in Section ","element":"span"},{"href":"#id-23","text":"5.2","element":"a"},{"text":".","element":"span"}],[{"text":"In this work, GPs play multiple roles. First, some of our results hold when the unknown target function is a sample from a known GP distribution GP(0","element":"span"},{"style":{"height":16},"width":139.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-10.png","element":"img","alt":", k(x, x′","inline":true},{"text":")). Second, the Bayesian algorithm we analyze generally uses GP(0","element":"span"},{"style":{"height":16},"width":139.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-11.png","element":"img","alt":", k(x, x′","inline":true},{"text":")) as prior distribution over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":". ","element":"span"},{"text":"A major advantage of working with GPs is the existence of simple analytic formulae for mean and covariance of the posterior distribution, which allows easy implementation of algorithms. 
For a noisy sample ","element":"span"},{"style":{"height":17.38},"width":316.41,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-12.png","element":"img","alt":" yT = [y1 . . . yT ]T","inline":true,"padRight":true},{"text":"at points ","element":"span"},{"style":{"height":16},"width":345.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-13.png","element":"img","alt":" AT = {x1, . . . , xT }","inline":true},{"text":", ","element":"span"},{"style":{"height":16},"width":244.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-14.png","element":"img","alt":"yt = f(xt)+ϵt","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":17.38},"width":213.04,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-15.png","element":"img","alt":" ϵt ∼ N(0, σ2","inline":true},{"text":") i.i.d. Gaussian noise, the posterior over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is a GP distribution again, with mean ","element":"span"},{"style":{"height":16},"width":90.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-16.png","element":"img","alt":" µT (x","inline":true},{"text":"), covariance ","element":"span"},{"style":{"height":16},"width":145.58,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-17.png","element":"img","alt":" kT (x, x′","inline":true},{"text":") and variance ","element":"span"},{"style":{"height":17.78},"width":45.77,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-18.png","element":"img","alt":" 
σ2T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":"):","element":"span"}],[{"style":{"height":18.18},"width":289.66,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-19.png","element":"img","alt":"µT (x) = kT (x)T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":18.18},"width":300.22,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-20.png","element":"img","alt":"KT + σ2I)−1yT ,","inline":true,"padRight":true},{"text":"(1) ","element":"span"},{"style":{"height":18.18},"width":524.67,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-21.png","element":"img","alt":"kT (x, x′) = k(x, x′) − kT (x)T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":18.18},"width":368.46,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-22.png","element":"img","alt":"KT + σ2I)−1kT (x′),","inline":true}],[{"style":{"width":"92%"},"width":865,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-23.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.38},"width":592.16,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-24.png","element":"img","alt":" kT (x) = [k(x1, x) . . . 
k(xT , x)]T","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":64.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-25.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":"is the positive definite kernel matrix [","element":"span"},{"style":{"height":16.79},"width":271.31,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-26.png","element":"img","alt":"k(x, x′)]x,x′∈AT","inline":true,"padRight":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"RKHS. ","element":"span"},{"text":"Instead of the Bayes case, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is sampled from a GP prior, we also consider the more agnostic case where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"has low “complexity” as measured under an RKHS norm (and distribution free assumptions on the noise process). The notion of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reproducing kernel Hilbert spaces ","element":"span"},{"text":"(RKHS, ","element":"span"},{"href":"#id-24","referenceIndex":34,"text":"Wahba ","element":"a"},{"href":"#id-24","referenceIndex":34,"text":"1990","element":"a"},{"text":") is intimately related to GPs and their covariance functions ","element":"span"},{"style":{"height":16},"width":121.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-27.png","element":"img","alt":" k(x, x′","inline":true},{"text":"). 
The RKHS ","element":"span"},{"style":{"height":16},"width":101.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-28.png","element":"img","alt":" Hk(D","inline":true},{"text":") is a complete subspace of ","element":"span"},{"style":{"height":16},"width":93.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-29.png","element":"img","alt":" L2(D","inline":true},{"text":") of nicely behaved functions, with an inner product ","element":"span"},{"style":{"height":16},"width":87.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-30.png","element":"img","alt":" ⟨·, ·⟩k","inline":true,"padRight":true},{"text":"obeying the reproducing property: ","element":"span"},{"style":{"height":16},"width":324.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-31.png","element":"img","alt":"⟨f, k(x, ·)⟩k = f(x","inline":true},{"text":") for all ","element":"span"},{"style":{"height":16},"width":182.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-32.png","element":"img","alt":" f ∈ Hk(D","inline":true},{"text":"). It is literally constructed by completing the set of mean functions ","element":"span"},{"style":{"height":10},"width":47.01,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-33.png","element":"img","alt":"µT","inline":true,"padRight":true},{"text":"for all possible ","element":"span"},{"style":{"height":16},"width":141.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-34.png","element":"img","alt":" T, {xt}","inline":true},{"text":", and ","element":"span"},{"style":{"height":11.1},"width":48,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-35.png","element":"img","alt":" yT","inline":true,"padRight":true},{"text":". 
","element":"span"},{"text":"The induced ","element":"span"}],[{"text":"RKHS norm ","element":"span"},{"style":{"height":19.2},"width":287.34,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-36.png","element":"img","alt":" ∥f∥k = √⟨f, f⟩k","inline":true,"padRight":true},{"text":"measures smoothness of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"w.r.t. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":": in much the same way as ","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-37.png","element":"img","alt":" k1","inline":true,"padRight":true},{"text":"would generate smoother samples than ","element":"span"},{"style":{"height":13.19},"width":36.75,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-38.png","element":"img","alt":" k2","inline":true,"padRight":true},{"text":"as GP covariance functions, ","element":"span"},{"style":{"height":16},"width":89.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-39.png","element":"img","alt":"∥·∥k1","inline":true,"padRight":true},{"text":"assigns larger penalties than ","element":"span"},{"style":{"height":16},"width":209.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-40.png","element":"img","alt":" ∥·∥k2. 
⟨·, ·⟩k","inline":true,"padRight":true},{"text":"can be extended to all of ","element":"span"},{"style":{"height":16},"width":93.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-41.png","element":"img","alt":" L2(D","inline":true},{"text":"), in which case ","element":"span"},{"style":{"height":16},"width":180.11,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-42.png","element":"img","alt":" ∥f∥k < ∞","inline":true,"padRight":true},{"text":"iff ","element":"span"},{"style":{"height":16},"width":174.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-43.png","element":"img","alt":"f ∈ Hk(D","inline":true},{"text":"). For most kernels discussed in Section ","element":"span"},{"href":"#id-23","text":"5.2","element":"a"},{"text":", members of ","element":"span"},{"style":{"height":16},"width":101.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-44.png","element":"img","alt":" Hk(D","inline":true},{"text":") can uniformly approximate any continuous function on any compact subset of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2. Information Gain & Experimental Design","element":"span"}],[{"text":"One approach to maximizing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is to first choose points ","element":"span"},{"style":{"height":9.59},"width":38.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-45.png","element":"img","alt":" xt","inline":true,"padRight":true},{"text":"so as to estimate the function globally well, then play the maximum point of our estimate. How can we learn about ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"as rapidly as possible? 
This question comes down to Bayesian Experimental Design (henceforth “ED”; see ","element":"span"},{"href":"#id-3","referenceIndex":6,"text":"Chaloner & Verdinelli ","element":"a"},{"href":"#id-3","referenceIndex":6,"text":"1995","element":"a"},{"text":"), where the informativeness of a set of sampling points ","element":"span"},{"style":{"height":12.4},"width":116.02,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-46.png","element":"img","alt":" A ⊂ D","inline":true,"padRight":true},{"text":"about ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is measured by the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"information gain ","element":"span"},{"text":"(cf. ","element":"span"},{"href":"#id-25","referenceIndex":7,"text":"Cover & Thomas ","element":"a"},{"href":"#id-25","referenceIndex":7,"text":"1991","element":"a"},{"text":"), which is the mutual information between ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"and observations ","element":"span"},{"style":{"height":14.7},"width":235.38,"height":36.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-47.png","element":"img","alt":" yA = f A+ϵA","inline":true,"padRight":true},{"text":"at these points:","element":"span"}],[{"id":"id-34","style":{"width":"76%"},"width":719,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-48.png","element":"img"}],[{"text":"quantifying the reduction in uncertainty about ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"from revealing ","element":"span"},{"style":{"height":11.1},"width":49,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-49.png","element":"img","alt":" yA","inline":true},{"text":". 
","element":"span"},{"text":"Here, ","element":"span"},{"style":{"height":16},"width":320.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-50.png","element":"img","alt":" f A = [f(x)]x∈A","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":285.36,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-51.png","element":"img","alt":"ϵA ∼ N(0, σ2I","inline":true},{"text":"). ","element":"span"},{"text":"For a Gaussian, H(","element":"span"},{"style":{"height":16},"width":221.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-52.png","element":"img","alt":"N(µ, Σ)) =","inline":true}],[{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-53.png","element":"img","alt":"1/2","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":119.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-54.png","element":"img","alt":" |2πeΣ|","inline":true},{"text":", ","element":"span"},{"text":"so that in our setting I(","element":"span"},{"style":{"height":16},"width":179.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-55.png","element":"img","alt":"yA; f) =","inline":true,"padRight":true},{"text":"I(","element":"span"},{"style":{"height":19.37},"width":268.47,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-56.png","element":"img","alt":"yA; f A) = 1/2","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":17.38},"width":251.62,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-57.png","element":"img","alt":" |I + σ−2KA|","inline":true},{"text":", ","element":"span"},{"text":"where 
","element":"span"},{"style":{"height":13.19},"width":138.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-58.png","element":"img","alt":" KA =","inline":true,"padRight":true},{"text":"[","element":"span"},{"style":{"height":16.79},"width":252.34,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-59.png","element":"img","alt":"k(x, x′)]x,x′∈A","inline":true},{"text":". ","element":"span"},{"text":"While finding the information gain maximizer among ","element":"span"},{"style":{"height":16},"width":309.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-60.png","element":"img","alt":" A ⊂ D, |A| ≤ T","inline":true,"padRight":true},{"text":"is NP-hard (","element":"span"},{"href":"#id-26","referenceIndex":17,"text":"Ko ","element":"a"},{"href":"#id-26","referenceIndex":17,"text":"et al.","element":"a"},{"href":"#id-26","referenceIndex":17,"text":", ","element":"a"},{"href":"#id-26","referenceIndex":17,"text":"1995","element":"a"},{"text":"), it can be approximated by an efficient greedy algorithm. 
If ","element":"span"},{"style":{"height":16},"width":272.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-61.png","element":"img","alt":" F(A) = I(yA; f","inline":true},{"text":"), this algorithm picks ","element":"span"},{"style":{"height":16.7},"width":526.01,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-62.png","element":"img","alt":" xt = argmaxx∈D F(At−1∪{x}","inline":true},{"text":") in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", which can be shown to be equivalent to","element":"span"}],[{"id":"id-29","style":{"width":"69%"},"width":652,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-63.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":432.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-64.png","element":"img","alt":" At−1 = {x1, . . . , xt−1}","inline":true},{"text":". ","element":"span"},{"text":"Importantly, this simple algorithm is guaranteed to find a near-optimal solution: for the set ","element":"span"},{"style":{"height":13.99},"width":52.89,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-65.png","element":"img","alt":" AT","inline":true,"padRight":true},{"text":"obtained after ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"rounds, we have that","element":"span"}],[{"id":"id-28","style":{"width":"78%"},"width":731,"height":65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/2-66.png","element":"img"}],[{"text":"at least a constant fraction of the optimal information gain value. 
","element":"span"},{"text":"This is because ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":") satisfies a diminishing returns property called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"submodularity ","element":"span"},{"text":"(","element":"span"},{"href":"#id-27","referenceIndex":19,"text":"Krause & Guestrin","element":"a"},{"href":"#id-27","referenceIndex":19,"text":", ","element":"a"},{"href":"#id-27","referenceIndex":19,"text":"2005","element":"a"},{"text":"), and the greedy approximation guarantee (","element":"span"},{"href":"#id-28","text":"5","element":"a"},{"text":") holds for any submodular function (","element":"span"},{"href":"#id-11","referenceIndex":24,"text":"Nemhauser et al.","element":"a"},{"href":"#id-11","referenceIndex":24,"text":", ","element":"a"},{"href":"#id-11","referenceIndex":24,"text":"1978","element":"a"},{"text":").","element":"span"}],[{"text":"While sequentially optimizing Eq. ","element":"span"},{"href":"#id-29","text":"4 ","element":"a"},{"text":"is a provably good way to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"explore ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"globally, it is not well suited for function optimization. For the latter, we only need to identify points ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":") is large, in order to concentrate sampling there as rapidly as possible, thus ","element":"span"},{"style":{"fontStyle":"italic"},"text":"exploit ","element":"span"},{"text":"our knowledge about maxima. 
In fact, the ED rule (","element":"span"},{"href":"#id-29","text":"4","element":"a"},{"text":") does not even depend on observations ","element":"span"},{"style":{"height":10},"width":31.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-0.png","element":"img","alt":" yt","inline":true,"padRight":true},{"text":"obtained along the way. Nevertheless, the maximum information gain after ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"rounds will play a prominent role in our regret bounds, forging an important connection between GP optimization and experimental design.","element":"span"}]]},{"heading":"3. GP-UCB Algorithm","paragraphs":[[{"text":"For sequential optimization, the ED rule (","element":"span"},{"href":"#id-29","text":"4","element":"a"},{"text":") can be wasteful: it aims at decreasing uncertainty globally, not just where maxima might be. Another idea is to pick points as ","element":"span"},{"style":{"height":16.7},"width":430.52,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-1.png","element":"img","alt":" xt = argmaxx∈D µt−1(x","inline":true},{"text":"), maximizing the expected reward based on the posterior so far. However, this rule is too greedy too soon and tends to get stuck in shallow local optima. ","element":"span"},{"text":"A combined strategy is to choose","element":"span"}],[{"id":"id-30","style":{"width":"83%"},"width":782,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-3.png","element":"img","alt":" βt","inline":true,"padRight":true},{"text":"are appropriate constants. 
This latter objective prefers both points ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is uncertain (large ","element":"span"},{"style":{"height":16},"width":104.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-4.png","element":"img","alt":"σt−1(·","inline":true},{"text":")) and points where we expect to achieve high rewards (large ","element":"span"},{"style":{"height":16},"width":105.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-5.png","element":"img","alt":" µt−1(·","inline":true},{"text":")): it implicitly negotiates the exploration–exploitation tradeoff. A natural interpretation of this sampling rule is that it greedily selects points ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"such that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":") should be a reasonable upper bound on ","element":"span"},{"style":{"height":16},"width":81.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-6.png","element":"img","alt":" f(x∗","inline":true},{"text":"), since the argument in (","element":"span"},{"href":"#id-30","text":"6","element":"a"},{"text":") is an upper quantile of the marginal posterior ","element":"span"},{"style":{"height":16},"width":216.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-7.png","element":"img","alt":" P(f(x)|yt−1","inline":true},{"text":"). 
We call this choice the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Gaussian process upper confidence bound ","element":"span"},{"text":"rule (","element":"span"},{"text":"GP-UCB","element":"span"},{"text":"), where ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-8.png","element":"img","alt":" βt","inline":true,"padRight":true},{"text":"is specified depending on the context (see Section ","element":"span"},{"href":"#id-30","text":"4","element":"a"},{"text":"). ","element":"span"},{"text":"Pseudocode for the ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"algorithm is provided in Algorithm ","element":"span"},{"text":"1","element":"span"},{"text":". Figure ","element":"span"},{"href":"#id-31","text":"2 ","element":"a"},{"text":"illustrates two successive iterations, where ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"both explores (Figure ","element":"span"},{"href":"#id-31","text":"2(b)","element":"a"},{"text":") by sampling an input ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"with large ","element":"span"},{"style":{"height":17.38},"width":119.09,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-9.png","element":"img","alt":" σ2t−1(x","inline":true},{"text":") and exploits (Figure ","element":"span"},{"href":"#id-31","text":"2(c)","element":"a"},{"text":") ","element":"span"},{"text":"by sampling ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"with large ","element":"span"},{"style":{"height":16},"width":120.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-10.png","element":"img","alt":" µt−1(x","inline":true},{"text":").","element":"span"}],[{"text":"The ","element":"span"},{"text":"GP-UCB 
","element":"span"},{"text":"selection rule Eq. ","element":"span"},{"href":"#id-30","text":"6 ","element":"a"},{"text":"is motivated by the UCB algorithm for the classical multi-armed bandit problem (","element":"span"},{"href":"#id-5","referenceIndex":3,"text":"Auer et al.","element":"a"},{"href":"#id-5","referenceIndex":3,"text":", ","element":"a"},{"href":"#id-5","referenceIndex":3,"text":"2002","element":"a"},{"text":"; ","element":"span"},{"href":"#id-32","referenceIndex":18,"text":"Kocsis & Szepesvári","element":"a"},{"href":"#id-32","referenceIndex":18,"text":", ","element":"a"},{"href":"#id-32","referenceIndex":18,"text":"2006","element":"a"},{"text":"). Among competing criteria for GP optimization (see Section ","element":"span"},{"text":"1","element":"span"},{"text":"), a variant of the ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"rule has been demonstrated to be effective for this application (","element":"span"},{"href":"#id-33","referenceIndex":10,"text":"Dorard et al.","element":"a"},{"href":"#id-33","referenceIndex":10,"text":", ","element":"a"},{"href":"#id-33","referenceIndex":10,"text":"2009","element":"a"},{"text":"). ","element":"span"},{"text":"To our knowledge, strong theoretical results of the kind provided for ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"in this paper have not been given for any of these search heuristics. 
","element":"span"},{"text":"In Section ","element":"span"},{"text":"6","element":"span"},{"text":", we show that in practice ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"compares favorably with these alternatives.","element":"span"}],[{"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"is infinite, finding ","element":"span"},{"style":{"height":9.59},"width":38.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-11.png","element":"img","alt":" xt","inline":true,"padRight":true},{"text":"in (","element":"span"},{"href":"#id-30","text":"6","element":"a"},{"text":") may be hard: the upper confidence index is multimodal in general. However, global search heuristics are very effective in practice (","element":"span"},{"href":"#id-17","referenceIndex":4,"text":"Brochu et al.","element":"a"},{"href":"#id-17","referenceIndex":4,"text":", ","element":"a"},{"href":"#id-17","referenceIndex":4,"text":"2009","element":"a"},{"text":"). 
It is generally assumed ","element":"span"}],[{"style":{"width":"100%"},"width":937,"height":415,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-12.png","element":"img"}],[{"text":"that evaluating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is more costly than maximizing the UCB index.","element":"span"}],[{"text":"UCB algorithms (and GP optimization techniques in general) have been applied to a large number of problems in practice (","element":"span"},{"href":"#id-32","referenceIndex":18,"text":"Kocsis & Szepesvári","element":"a"},{"href":"#id-32","referenceIndex":18,"text":", ","element":"a"},{"href":"#id-32","referenceIndex":18,"text":"2006","element":"a"},{"text":"; ","element":"span"},{"href":"#id-0","referenceIndex":25,"text":"Pandey & Olston","element":"a"},{"href":"#id-0","referenceIndex":25,"text":", ","element":"a"},{"href":"#id-0","referenceIndex":25,"text":"2007","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":20,"text":"Lizotte et al.","element":"a"},{"href":"#id-1","referenceIndex":20,"text":", ","element":"a"},{"href":"#id-1","referenceIndex":20,"text":"2007","element":"a"},{"text":"). Their performance is well characterized in both the finite-arm setting and the linear optimization setting, but no convergence rates for GP optimization are known.","element":"span"}]]},{"heading":"4. 
Regret Bounds","paragraphs":[[{"text":"We now establish cumulative regret bounds for GP optimization, treating three settings: ","element":"span"},{"style":{"height":14},"width":69.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-13.png","element":"img","alt":"f ∼","inline":true,"padRight":true},{"text":"GP(0","element":"span"},{"style":{"height":16},"width":139.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-14.png","element":"img","alt":", k(x, x′","inline":true},{"text":")) for finite ","element":"span"},{"style":{"height":14},"width":130.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-15.png","element":"img","alt":" D, f ∼","inline":true,"padRight":true},{"text":"GP(0","element":"span"},{"style":{"height":16},"width":139.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-16.png","element":"img","alt":", k(x, x′","inline":true},{"text":")) for general compact ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", and the agnostic case of arbitrary ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"with bounded RKHS norm.","element":"span"}],[{"text":"GP optimization generalizes stochastic linear optimization, where a function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"from a finite-dimensional linear space is optimized. For the linear case, ","element":"span"},{"href":"#id-7","referenceIndex":9,"text":"Dani ","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"et al. 
","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"(","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"2008","element":"a"},{"text":") provide regret bounds that explicitly depend on the dimensionality","element":"span"},{"style":{"height":13.38},"width":56.1,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-17.png","element":"img","alt":"3 d","inline":true},{"text":". GPs can be seen as random functions in some infinite-dimensional linear space, so their results do not apply in this case. This problem is circumvented in our regret bounds. The quantity governing them is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"maximum information ","element":"span"},{"style":{"height":14},"width":130.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-18.png","element":"img","alt":"gain γT","inline":true,"padRight":true},{"text":"after ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"rounds, defined as:","element":"span"}],[{"style":{"width":"74%"},"width":699,"height":65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-19.png","element":"img"}],[{"text":"where I(","element":"span"},{"style":{"height":16},"width":323.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-20.png","element":"img","alt":"yA; f A) = I(yA; f","inline":true},{"text":") is defined in (","element":"span"},{"href":"#id-34","text":"3","element":"a"},{"text":"). 
Recall that I(","element":"span"},{"style":{"height":19.37},"width":232.44,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-21.png","element":"img","alt":"yA; f A) = 1/2","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":17.38},"width":237.22,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-22.png","element":"img","alt":" |I + σ−2KA|","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.19},"width":120.2,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-23.png","element":"img","alt":" KA =","inline":true,"padRight":true},{"text":"[","element":"span"},{"style":{"height":16.79},"width":256.32,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-24.png","element":"img","alt":"k(x, x′)]x,x′∈A","inline":true,"padRight":true},{"text":"is the covariance matrix of ","element":"span"},{"style":{"height":14.69},"width":112.86,"height":36.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-25.png","element":"img","alt":" f A =","inline":true,"padRight":true},{"text":"[","element":"span"},{"style":{"height":16},"width":162.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-26.png","element":"img","alt":"f(x)]x∈A","inline":true,"padRight":true},{"text":"associated with the samples ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". 
Our regret bounds are of the form ","element":"span"},{"style":{"height":16.84},"width":219.96,"height":42.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-27.png","element":"img","alt":" O∗(√TβT γT","inline":true,"padRight":true},{"text":"), where ","element":"span"},{"style":{"height":14.4},"width":45.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-28.png","element":"img","alt":" βT","inline":true,"padRight":true},{"text":"is the confidence parameter in Algorithm ","element":"span"},{"text":"1","element":"span"},{"text":", while the bounds of ","element":"span"},{"href":"#id-7","referenceIndex":9,"text":"Dani et al. ","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"(","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"2008","element":"a"},{"text":") are of the form ","element":"span"},{"style":{"height":16.83},"width":197.32,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-29.png","element":"img","alt":" O∗(√TβT d","inline":true},{"text":") (where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is the dimensionality of the linear function space). Here and below, the ","element":"span"},{"style":{"height":10.98},"width":48.83,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/3-30.png","element":"img","alt":" O∗","inline":true,"padRight":true},{"text":"notation is a variant of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":", where log factors are suppressed. While our proofs, all provided in the Appendix, use techniques similar to those of ","element":"span"},{"href":"#id-7","referenceIndex":9,"text":"Dani et al. 
","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"(","element":"a"},{"href":"#id-7","referenceIndex":9,"text":"2008","element":"a"},{"text":"), we face a number of additional ","element":"span"}],[{"id":"id-22","style":{"width":"100%"},"width":1945,"height":569,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-0.png","element":"img"}],[{"id":"id-31","style":{"fontStyle":"italic"},"text":"Figure 2. ","element":"figcaption","subtype":"caption"},{"text":"(a) Example of temperature data collected by a network of 46 sensors at Intel Research Berkeley. (b,c) Two iterations of the ","element":"figcaption","subtype":"caption"},{"text":"GP-UCB ","element":"figcaption","subtype":"caption"},{"text":"algorithm. It samples points that are either uncertain (b) or have high posterior mean (c).","element":"figcaption","subtype":"caption"}],[{"text":"significant technical challenges. Besides avoiding the finite-dimensional analysis, we must handle confidence issues, which are more delicate for nonlinear random functions.","element":"span"}],[{"id":"id-37","text":"Importantly, note that the information gain is a ","element":"span"},{"text":"problem-dependent quantity: properties of both the kernel and the input space will determine the growth of regret. In Section ","element":"span"},{"text":"5","element":"span"},{"text":", we provide general methods for bounding ","element":"span"},{"style":{"height":10.4},"width":43.63,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-1.png","element":"img","alt":" γT","inline":true,"padRight":true},{"text":", either by efficient auxiliary computations or by direct expressions for specific kernels of interest. 
Our results match known lower bounds (up to log factors) in both the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":"-armed bandit and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"-dimensional linear optimization cases.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Bounds for a GP Prior. ","element":"span"},{"text":"For finite ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", we obtain the following bound.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":244.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-2.png","element":"img","alt":"δ ∈ (0,","inline":true,"padRight":true},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.4},"width":143.72,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-3.png","element":"img","alt":"βt =","inline":true,"padRight":true},{"text":"2 log(","element":"span"},{"style":{"height":17.38},"width":189.38,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-4.png","element":"img","alt":"|D| t^2 π^2/(6δ)","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Running ","element":"span"},{"style":{"fontStyle":"italic"},"text":"GP-UCB ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-5.png","element":"img","alt":" βt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for a sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of a GP with mean function zero and covariance function ","element":"span"},{"style":{"height":16},"width":123.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-6.png","element":"img","alt":" k(x, x′","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", we obtain a regret bound of ","element":"span"},{"style":{"height":19.2},"width":178.95,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-7.png","element":"img","alt":" O∗(√TγT","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with high probability. 
Precisely,","element":"span"}],[{"style":{"width":"78%"},"width":730,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-8.png","element":"img"}],[{"style":{"height":16},"width":251.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-9.png","element":"img","alt":"where C1 = 8/","inline":true,"padRight":true},{"text":"log(1 + ","element":"span"},{"style":{"height":17.39},"width":94.48,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-10.png","element":"img","alt":" σ−2).","inline":true}],[{"text":"The proof methodology follows ","element":"span"},{"href":"#id-8","referenceIndex":8,"text":"Dani et al. ","element":"a"},{"href":"#id-8","referenceIndex":8,"text":"(","element":"a"},{"href":"#id-8","referenceIndex":8,"text":"2007","element":"a"},{"text":") in that we relate the regret to the growth of the log volume of the confidence ellipsoid — a novelty in our proof is showing how this growth is characterized by the information gain.","element":"span"}],[{"text":"This theorem shows that, with high probability over samples from the GP, the cumulative regret is bounded in terms of the maximum information gain, forging a novel connection between GP optimization and experimental design. This link is of fundamental technical importance, allowing us to generalize Theorem ","element":"span"},{"href":"#id-35","text":"1 ","element":"a"},{"text":"to infinite decision spaces. Moreover, the submodularity of I(","element":"span"},{"style":{"height":14.7},"width":119.7,"height":36.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-11.png","element":"img","alt":"yA; f A","inline":true},{"text":") allows us to derive sharp a priori bounds, depending on choice and parameterization of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"(see Section ","element":"span"},{"text":"5","element":"span"},{"text":"). 
","element":"span"},{"text":"In the following theorem, we generalize our result to any compact and convex ","element":"span"},{"style":{"height":14.19},"width":137.4,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-12.png","element":"img","alt":" D ⊂ Rd","inline":true,"padRight":true},{"text":"under mild assumptions on the kernel function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":11.6},"width":81.28,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-13.png","element":"img","alt":" D ⊂","inline":true,"padRight":true},{"text":"[0","element":"span"},{"style":{"height":17.38},"width":64.87,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-14.png","element":"img","alt":", r]d","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be compact and convex, ","element":"span"},{"style":{"height":14},"width":183.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-15.png","element":"img","alt":"d ∈ N, r >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Suppose that the kernel ","element":"span"},{"style":{"height":16},"width":123.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-16.png","element":"img","alt":" k(x, x′","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfies the following high probability bound on the derivatives of GP sample paths ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"style":{"fontStyle":"italic"},"text":": for some constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a, b > ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"text":"Pr ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"sup","element":"span"},{"style":{"height":16.7},"width":88.59,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-17.png","element":"img","alt":"x∈D |","inline":true},{"style":{"fontStyle":"italic"},"text":"∂f/∂x","element":"span"},{"style":{"height":16.79},"width":27.79,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-18.png","element":"img","alt":"j|","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"> L","element":"span"},{"style":{"height":16},"width":62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-19.png","element":"img","alt":"} ≤","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"ae","element":"span"},{"style":{"height":13.21},"width":116.02,"height":33.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-20.png","element":"img","alt":"−(L/b)2","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, d.","element":"span"}],[{"style":{"width":"48%"},"width":451,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-21.png","element":"img"}],[{"style":{"height":18.18},"width":351.81,"height":45.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-22.png","element":"img","alt":"βt = 2 log(t22π2/(3δ","inline":true},{"text":")) + 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"log","element":"span"},{"style":{"height":28.8},"width":153,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-23.png","element":"img","alt":"�t2dbr�","inline":true},{"text":"log(4","element":"span"},{"style":{"height":28.8},"width":137.89,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-24.png","element":"img","alt":"da/δ)�.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Running the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"GP-UCB ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-25.png","element":"img","alt":" βt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for a sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of a ","element":"span"},{"id":"id-35","style":{"fontStyle":"italic"},"text":"GP with mean function zero and covariance function ","element":"span"},{"style":{"height":16},"width":123.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-26.png","element":"img","alt":"k(x, x′","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", we obtain a regret bound of 
","element":"span"},{"style":{"height":16.83},"width":193.06,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-27.png","element":"img","alt":" O∗(√dTγT","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with high probability. Precisely, with ","element":"span"},{"style":{"height":16},"width":143.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-28.png","element":"img","alt":" C1 = 8/","inline":true,"padRight":true},{"text":"log(1 + ","element":"span"},{"style":{"height":13.39},"width":65.1,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-29.png","element":"img","alt":" σ−2","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"we have","element":"span"}],[{"style":{"width":"85%"},"width":799,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-30.png","element":"img"}],[{"text":"The main challenge in our proof (provided in the Appendix) is to lift the regret bound in terms of the confidence ellipsoid to general ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". ","element":"span"},{"text":"The smoothness assumption on ","element":"span"},{"style":{"height":16},"width":123.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-31.png","element":"img","alt":" k(x, x′","inline":true},{"text":") disqualifies GPs with highly erratic sample paths. 
It holds for stationary kernels ","element":"span"},{"style":{"height":16},"width":369.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-32.png","element":"img","alt":"k(x, x′) = k(x − x′","inline":true},{"text":") which are four times differentiable (Theorem 5 of ","element":"span"},{"href":"#id-36","referenceIndex":12,"text":"Ghosal & Roy ","element":"a"},{"href":"#id-36","referenceIndex":12,"text":"(","element":"a"},{"href":"#id-36","referenceIndex":12,"text":"2006","element":"a"},{"text":")), such as the Squared Exponential and Matérn kernels with ","element":"span"},{"style":{"height":9.6},"width":68.14,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-33.png","element":"img","alt":" ν >","inline":true,"padRight":true},{"text":"2 (see Section ","element":"span"},{"href":"#id-23","text":"5.2","element":"a"},{"text":"), while it is violated for the Ornstein-Uhlenbeck kernel (Matérn with ","element":"span"},{"style":{"height":16},"width":118.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/4-34.png","element":"img","alt":" ν = 1/","inline":true},{"text":"2; a stationary variant of the Wiener process). For the latter, sample paths ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"are nondifferentiable almost everywhere with probability one and come with independent increments. We conjecture that a result of the form of Theorem ","element":"span"},{"href":"#id-37","text":"2 ","element":"a"},{"text":"does not hold in this case.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Bounds for Arbitrary ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontWeight":"bold"},"text":"in the RKHS. 
","element":"span"},{"text":"Thus far, we have assumed that the target function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is sampled from a GP prior and that the noise is ","element":"span"},{"style":{"height":17.38},"width":129.7,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/5-0.png","element":"img","alt":" N(0, σ2","inline":true},{"text":") with known variance ","element":"span"},{"style":{"height":13.39},"width":40.2,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/5-1.png","element":"img","alt":" σ2","inline":true},{"text":". We now analyze ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"in an agnostic setting, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is an arbitrary function from the RKHS corresponding to kernel ","element":"span"},{"style":{"height":16},"width":123.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/5-2.png","element":"img","alt":" k(x, x′","inline":true},{"text":"). 
Moreover, we allow the noise variables ","element":"span"},{"style":{"height":9.59},"width":30.58,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/5-3.png","element":"img","alt":" εt","inline":true,"padRight":true},{"text":"to be an arbitrary martingale difference sequence (meaning that ","element":"span"},{"style":{"height":16},"width":461.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/5-4.png","element":"img","alt":"E[εt | ε 1: γT = O�T d(d+1)/(2ν+d(d+1))","inline":true},{"text":"(log ","element":"span"},{"style":{"height":19.2},"width":74.58,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-21.png","element":"img","alt":" T)�.","inline":true}],[{"text":"A proof of Theorem ","element":"span"},{"href":"#id-41","text":"5 ","element":"a"},{"text":"is given in the Appendix; we only sketch the idea here. ","element":"span"},{"style":{"height":10.4},"width":43.63,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-22.png","element":"img","alt":"γT","inline":true,"padRight":true},{"text":"is bounded by Theorem ","element":"span"},{"text":"4 ","element":"span"},{"text":"in terms of the eigendecay of the kernel matrix ","element":"span"},{"style":{"height":13.19},"width":67.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-23.png","element":"img","alt":"KD","inline":true},{"text":". ","element":"span"},{"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"is infinite or very large, we can use the operator spectrum of ","element":"span"},{"style":{"height":16},"width":121.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-24.png","element":"img","alt":" k(x, x′","inline":true},{"text":"), which likewise decays rapidly. 
For the kernels of interest here, asymptotic expressions for the operator eigenvalues are given in ","element":"span"},{"href":"#id-42","referenceIndex":29,"text":"Seeger et al. ","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"(","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"2008","element":"a"},{"text":"), who derived bounds on the information gain for fixed and random designs (in contrast to the worst-case information gain considered here, ","element":"span"},{"text":"which is substantially more challenging to bound). The main challenge in the proof is to ensure the existence of discretizations ","element":"span"},{"style":{"height":13.19},"width":147.5,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-25.png","element":"img","alt":" DT ⊂ D","inline":true},{"text":", dense in the limit, for which tail sums ","element":"span"},{"style":{"height":16},"width":171.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-26.png","element":"img","alt":" B(T∗)/nT","inline":true,"padRight":true},{"text":"in Theorem ","element":"span"},{"text":"4 ","element":"span"},{"text":"are close to corresponding operator spectra tail sums.","element":"span"}],[{"text":"Together with Theorems ","element":"span"},{"href":"#id-37","text":"2 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-38","text":"3","element":"a"},{"text":", this result guarantees sublinear regret of ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"for any dimension (see Figure ","element":"span"},{"href":"#id-14","text":"1","element":"a"},{"text":"). 
For the Squared Exponential kernel, the dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"appears as exponent of log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"only, so that the regret grows at most as ","element":"span"},{"style":{"height":18.3},"width":128.86,"height":45.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-27.png","element":"img","alt":" O∗(√T","inline":true},{"text":"(log ","element":"span"},{"style":{"height":20.32},"width":98.32,"height":50.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-28.png","element":"img","alt":" T)d+12","inline":true,"padRight":true},{"text":") – the high degree of smoothness of the sample paths effectively combats the curse of dimensionality.","element":"span"}]]},{"heading":"6. Experiments","paragraphs":[[{"text":"We compare ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"with heuristics such as the Expected ","element":"span"},{"text":"Improvement ","element":"span"},{"text":"(EI) ","element":"span"},{"text":"and ","element":"span"},{"text":"Most ","element":"span"},{"text":"Probable Improvement (MPI), and with naive methods which choose points of maximum mean or variance only, both on synthetic and real sensor network data.","element":"span"}],[{"text":"For synthetic data, we sample random functions from a squared exponential kernel with lengthscale parameter 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"2. The sampling noise variance ","element":"span"},{"style":{"height":13.39},"width":40.2,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-29.png","element":"img","alt":" σ2","inline":true,"padRight":true},{"text":"was set to 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"025 or 5% of the signal variance. 
Our decision set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"= [0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1] is uniformly discretized into 1000 points. ","element":"span"},{"text":"We run each algorithm for ","element":"span"},{"style":{"height":12},"width":566.42,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-30.png","element":"img","alt":" T = 1000 iterations with δ = 0.","inline":true},{"text":"1, averaging over 30 trials (samples from the kernel). While the choice of ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-31.png","element":"img","alt":" βt","inline":true,"padRight":true},{"text":"as recommended by Theorem ","element":"span"},{"href":"#id-35","text":"1 ","element":"a"},{"text":"leads to competitive performance of ","element":"span"},{"text":"GP-UCB","element":"span"},{"text":", we find (using cross-validation) that the algorithm is improved by scaling ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/6-32.png","element":"img","alt":" βt","inline":true,"padRight":true},{"text":"down by a factor 5. Note that we did not optimize constants in our regret bounds.","element":"span"}],[{"text":"Next, we use temperature data collected from 46 sensors deployed at Intel Research Berkeley over 5 days at 1 minute intervals, pertaining to the example in Section ","element":"span"},{"text":"2","element":"span"},{"text":". We take the first two-thirds of the data set to compute the empirical covariance of the sensor readings, and use it as the kernel matrix. 
The functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"for optimization consist of one set of observations from all the sensors taken from the remaining third of the","element":"span"}],[{"style":{"width":"100%"},"width":1945,"height":476,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/7-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 4. ","element":"figcaption","subtype":"caption"},{"text":"Sample functions drawn from a GP with linear, squared exponential and Matérn kernels (","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":141.01,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/7-1.png","element":"img","alt":"ν = 2.5.)","inline":true}],[{"id":"id-40","style":{"width":"95%"},"width":1849,"height":522,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/7-2.png","element":"img"}],[{"id":"id-43","style":{"fontStyle":"italic"},"text":"Figure 5. 
","element":"figcaption","subtype":"caption"},{"text":"Comparison of performance: ","element":"figcaption","subtype":"caption"},{"text":"GP-UCB ","element":"figcaption","subtype":"caption"},{"text":"and various heuristics on synthetic (a), and sensor network data (b, c).","element":"figcaption","subtype":"caption"}],[{"text":"data set, and the results (for ","element":"span"},{"style":{"height":16.58},"width":283.58,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/7-3.png","element":"img","alt":" T = 46, σ2 = 0.","inline":true},{"text":"5 or 5% noise, ","element":"span"},{"style":{"height":11.6},"width":138.43,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/7-4.png","element":"img","alt":" δ = 0.","inline":true},{"text":"1) were averaged over 2000 possible choices of the objective function.","element":"span"}],[{"text":"Lastly, we take data from traffic sensors deployed along the highway I-880 South in California. The goal was to find the point of minimum speed in order to identify the most congested portion of the highway; we used traffic speed data for all working days from 6 AM to 11 AM for one month, from 357 sensors. We again use the covariance matrix from two-thirds of the data set as kernel matrix, and test on the other third. 
The results (for ","element":"span"},{"style":{"height":16.58},"width":306.79,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/7-5.png","element":"img","alt":" T = 357, σ2 = 4.","inline":true},{"text":"78 or 5% noise, ","element":"span"},{"style":{"height":11.6},"width":106.72,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/7-6.png","element":"img","alt":" δ = 0.","inline":true},{"text":"1) were averaged over 900 runs.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-43","text":"5 ","element":"a"},{"text":"compares the mean average regret incurred by the different heuristics and the ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"algorithm on synthetic and real data. ","element":"span"},{"text":"For temperature data, the ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"algorithm ","element":"span"},{"text":"and ","element":"span"},{"text":"EI ","element":"span"},{"text":"heuristic ","element":"span"},{"text":"clearly outperform the others, and do not differ significantly from each other. On synthetic and traffic data, MPI does equally well. In summary, ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"performs at least on par with the existing approaches which are not equipped with regret bounds.","element":"span"}]]},{"heading":"7. Conclusions","paragraphs":[[{"text":"We prove the first sublinear regret bounds for GP optimization with commonly used kernels (see Figure ","element":"span"},{"href":"#id-14","text":"1","element":"a"},{"text":"), both for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"sampled from a known GP and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"of low RKHS norm. We analyze ","element":"span"},{"text":"GP-UCB","element":"span"},{"text":", an intuitive, Bayesian upper confidence bound based sampling rule. 
Our regret bounds crucially depend on the information gain due to sampling, establishing a novel connection between bandit optimization and experimental design. We bound the information gain in terms of the kernel spectrum, providing a general methodology for obtaining regret bounds with kernels of interest. Our experiments on real sensor network data indicate that ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"performs at least on par with competing criteria for GP optimization, for which no regret bounds are known at present. Our results provide an interesting step towards understanding exploration–exploitation tradeoffs with complex utility functions.","element":"span"}]]},{"heading":"Acknowledgements","paragraphs":[[{"text":"We thank Marcus Hutter for insightful comments on an earlier version of this paper. ","element":"span"},{"text":"This research was partially supported by ONR grant N00014-09-1-1044, NSF grant CNS-0932392, a gift from Microsoft Corporation and the Excellence Initiative of the German research foundation (DFG).","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-9","text":"Abernethy, J., Hazan, E., and Rakhlin, A. ","element":"span"},{"text":"An efficient algorithm for linear bandit optimization, 2008. COLT.","element":"span"}],[{"id":"id-6","text":"Auer, ","element":"span"},{"text":"P. ","element":"span"},{"text":"Using confidence bounds for exploitation-exploration trade-offs. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"JMLR","element":"span"},{"text":", 3:397–422, 2002.","element":"span"}],[{"id":"id-5","text":"Auer, P., Cesa-Bianchi, N., and Fischer, P. ","element":"span"},{"text":"Finite-time","element":"span"}],[{"text":"analysis of the multiarmed bandit problem. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mach. Learn.","element":"span"},{"text":", 47(2-3):235–256, 2002.","element":"span"}],[{"id":"id-17","text":"Brochu, E., Cora, M., and de Freitas, N. 
A tutorial on ","element":"span"},{"text":"Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"TR-2009-23, UBC","element":"span"},{"text":", 2009.","element":"span"}],[{"id":"id-13","text":"Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. ","element":"span"},{"text":"Online optimization in X-armed bandits. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NIPS","element":"span"},{"text":", 2008.","element":"span"}],[{"id":"id-3","text":"Chaloner, K. and Verdinelli, I. Bayesian experimental ","element":"span"},{"text":"design: A review. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stat. Sci.","element":"span"},{"text":", 10(3):273–304, 1995.","element":"span"}],[{"id":"id-25","text":"Cover, T. M. and Thomas, J. A. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Elements of Information Theory","element":"span"},{"text":". Wiley Interscience, 1991.","element":"span"}],[{"id":"id-8","text":"Dani, V., Hayes, T. P., and Kakade, S. The price of bandit ","element":"span"},{"text":"information for online optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NIPS","element":"span"},{"text":", 2007.","element":"span"}],[{"id":"id-7","text":"Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear ","element":"span"},{"text":"optimization under bandit feedback. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"COLT","element":"span"},{"text":", 2008.","element":"span"}],[{"id":"id-33","text":"Dorard, L., Glowacka, D., and Shawe-Taylor, J. Gaussian ","element":"span"},{"text":"process modelling of dependencies in multi-armed bandit problems. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Int. Symp. Op. Res.","element":"span"},{"text":", 2009.","element":"span"}],[{"id":"id-54","text":"Freedman, D. A. 
On tail probabilities for martingales. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ann. Prob.","element":"span"},{"text":", 3(1):100–118, 1975.","element":"span"}],[{"id":"id-36","text":"Ghosal, S. and Roy, A. Posterior consistency of Gaussian ","element":"span"},{"text":"process prior for nonparametric binary regression. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ann. Stat.","element":"span"},{"text":", 34(5):2413–2429, 2006.","element":"span"}],[{"id":"id-21","text":"Grünewälder, S., Audibert, J-Y., Opper, M., and Shawe-","element":"span"},{"text":"Taylor, J. ","element":"span"},{"text":"Regret bounds for Gaussian process bandit problems. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AISTATS","element":"span"},{"text":", 2010.","element":"span"}],[{"id":"id-19","text":"Huang, D., Allen, T. T., Notz, W. I., and Zeng, N. Global ","element":"span"},{"text":"optimization of stochastic black-box systems via sequential kriging meta-models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J Glob. Opt.","element":"span"},{"text":", 34:441–466, 2006.","element":"span"}],[{"id":"id-18","text":"Jones, D. R., Schonlau, M., and Welch, W. J. ","element":"span"},{"text":"Efficient global optimization of expensive black-box functions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J Glob. Opt.","element":"span"},{"text":", 13:455–492, 1998.","element":"span"}],[{"id":"id-12","text":"Kleinberg, R., Slivkins, A., and Upfal, E. ","element":"span"},{"text":"Multi-armed bandits in metric spaces. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"STOC","element":"span"},{"text":", pp. 681–690, 2008.","element":"span"}],[{"id":"id-26","text":"Ko, C., Lee, J., and Queyranne, M. An exact algorithm ","element":"span"},{"text":"for maximum entropy sampling. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ops Res","element":"span"},{"text":", 43(4):684–691, 1995.","element":"span"}],[{"id":"id-32","text":"Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo ","element":"span"},{"text":"planning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECML","element":"span"},{"text":", 2006.","element":"span"}],[{"id":"id-27","text":"Krause, A. and Guestrin, C. Near-optimal nonmyopic value ","element":"span"},{"text":"of information in graphical models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"UAI","element":"span"},{"text":", 2005.","element":"span"}],[{"id":"id-1","text":"Lizotte, D., Wang, T., Bowling, M., and Schuurmans, D. ","element":"span"},{"text":"Automatic gait optimization with Gaussian process regression. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IJCAI","element":"span"},{"text":", pp. 944–949, 2007.","element":"span"}],[{"id":"id-55","text":"McDiarmid, C. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics","element":"span"},{"text":". Springer, 1998.","element":"span"}],[{"id":"id-16","text":"Mockus, J. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bayesian Approach to Global Optimization","element":"span"},{"text":". Kluwer Academic Publishers, 1989.","element":"span"}],[{"id":"id-15","text":"Mockus, J., Tiesis, V., and Zilinskas, A. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Toward Global Optimization","element":"span"},{"text":", volume 2, chapter Bayesian Methods for Seeking the Extremum, pp. 117–128. 1978.","element":"span"}],[{"id":"id-11","text":"Nemhauser, G., Wolsey, L., and Fisher, M. An analysis ","element":"span"},{"text":"of the approximations for maximizing submodular set functions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Math. 
Prog.","element":"span"},{"text":", 14:265–294, 1978.","element":"span"}],[{"id":"id-0","text":"Pandey, S. and Olston, C. Handling advertisements of ","element":"span"},{"text":"unknown quality in search advertising. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NIPS","element":"span"},{"text":", 2007.","element":"span"}],[{"id":"id-4","text":"Rasmussen, C. E. and Williams, C. K. I. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Gaussian Processes for Machine Learning","element":"span"},{"text":". MIT Press, 2006.","element":"span"}],[{"id":"id-2","text":"Robbins, H. Some aspects of the sequential design of ","element":"span"},{"text":"experiments. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bull. Am. Math. Soc.","element":"span"},{"text":", 58:527–535, 1952.","element":"span"}],[{"id":"id-10","text":"Rusmevichientong, P. and Tsitsiklis, J. N. Linearly ","element":"span"},{"text":"parameterized bandits. abs/0812.3465, 2008.","element":"span"}],[{"id":"id-42","text":"Seeger, M. W., Kakade, S. M., and Foster, D. P. ","element":"span"},{"text":"Information consistency of nonparametric Gaussian process methods. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Trans. Inf. Theo.","element":"span"},{"text":", 54(5):2376–2382, 2008.","element":"span"}],[{"id":"id-64","text":"Shawe-Taylor, J., Williams, C., Cristianini, N., and ","element":"span"},{"text":"Kandola, J. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Trans. Inf. Theo.","element":"span"},{"text":", 51(7):2510–2522, 2005.","element":"span"}],[{"text":"Srinivas, N., Krause, A., Kakade, S., and Seeger, M. ","element":"span"},{"text":"Gaussian process optimization in the bandit setting: No regret and experimental design. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2010.","element":"span"}],[{"id":"id-53","text":"Stein, M. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Interpolation of Spatial Data: Some Theory for Kriging","element":"span"},{"text":". Springer, 1999.","element":"span"}],[{"id":"id-20","text":"Vazquez, E. and Bect, J. Convergence properties of the ","element":"span"},{"text":"expected improvement algorithm, 2007.","element":"span"}],[{"id":"id-24","text":"Wahba, G. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Spline Models for Observational Data","element":"span"},{"text":". SIAM, 1990.","element":"span"}]]},{"heading":"A. Regret Bounds for Target Function Sampled from GP","paragraphs":[[{"text":"In this section, we provide details for the proofs of Theorem ","element":"span"},{"href":"#id-35","text":"1 ","element":"a"},{"text":"and Theorem ","element":"span"},{"href":"#id-37","text":"2","element":"a"},{"text":". 
In both cases, the strategy is to show that ","element":"span"},{"style":{"height":20.68},"width":534.58,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-0.png","element":"img","alt":" |f(x) − µt−1(x)| ≤ β1/2t σt−1(x","inline":true},{"text":") for all ","element":"span"},{"style":{"height":11.6},"width":99.08,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-1.png","element":"img","alt":"t ∈ N","inline":true,"padRight":true},{"text":"and all ","element":"span"},{"style":{"height":11.6},"width":116.95,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-2.png","element":"img","alt":" x ∈ D","inline":true},{"text":", or in the infinite case, all ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"in a discretization of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"which becomes dense as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"gets large.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.1. 
Finite Decision Set","element":"span"}],[{"id":"id-44","text":"We begin with the finite case, ","element":"span"},{"style":{"height":16},"width":149.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-3.png","element":"img","alt":" |D| < ∞","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5.1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pick ","element":"span"},{"style":{"height":16},"width":197.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-4.png","element":"img","alt":"δ ∈ (0,","inline":true,"padRight":true},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"set ","element":"span"},{"style":{"height":14.4},"width":120.1,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-5.png","element":"img","alt":"βt =","inline":true,"padRight":true},{"text":"2 log(","element":"span"},{"style":{"height":16},"width":131.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-6.png","element":"img","alt":"|D|πt/δ","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":21.6},"width":348.8,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-7.png","element":"img","alt":"∑t≥1 π−1t = 1, πt >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then,","element":"span"}],[{"style":{"width":"89%"},"width":840,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"holds with probability ","element":"span"},{"style":{"height":14},"width":129.7,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/8-9.png","element":"img","alt":" ≥ 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"Fix ","element":"span"},{"style":{"height":12.8},"width":56.79,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-0.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"1 and ","element":"span"},{"style":{"height":11.6},"width":110.61,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-1.png","element":"img","alt":" x ∈ D","inline":true},{"text":". Conditioned on ","element":"span"},{"style":{"height":11.1},"width":122.22,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-2.png","element":"img","alt":" yt−1 =","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":502.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-3.png","element":"img","alt":"y1, . . . , yt−1), {x1, . . . , xt−1}","inline":true,"padRight":true},{"text":"are deterministic, and ","element":"span"},{"style":{"height":17.39},"width":480.96,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-4.png","element":"img","alt":"f(x) ∼ N(µt−1(x), σ2t−1(x","inline":true},{"text":")). 
","element":"span"},{"text":"Now, if ","element":"span"},{"style":{"height":16},"width":173.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-5.png","element":"img","alt":" r ∼ N(0,","inline":true,"padRight":true},{"text":"1), then","element":"span"}],[{"style":{"width":"93%"},"width":872,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-6.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c > ","element":"span"},{"text":"0, since ","element":"span"},{"style":{"height":16.58},"width":188.72,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-7.png","element":"img","alt":" e−c(r−c) ≤","inline":true,"padRight":true},{"text":"1 for ","element":"span"},{"style":{"height":12.8},"width":104.7,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-8.png","element":"img","alt":" r ≥ c","inline":true},{"text":". Therefore, Pr","element":"span"},{"style":{"height":20.68},"width":773.44,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-9.png","element":"img","alt":"{|f(x) − µt−1(x)| > β1/2t σt−1(x)} ≤ e−βt/2","inline":true},{"text":", using ","element":"span"},{"style":{"height":16},"width":500.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-10.png","element":"img","alt":"r = (f(x)−µt−1(x))/σt−1(x","inline":true},{"text":") and ","element":"span"},{"style":{"height":20.6},"width":143.24,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-11.png","element":"img","alt":" c = β1/2t","inline":true,"padRight":true},{"text":". 
Applying the union bound,","element":"span"}],[{"style":{"width":"77%"},"width":724,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-12.png","element":"img"}],[{"text":"holds with probability ","element":"span"},{"style":{"height":18.18},"width":297.94,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-13.png","element":"img","alt":" ≥ 1 − |D|e−βt/2","inline":true},{"text":". ","element":"span"},{"text":"Choosing ","element":"span"},{"style":{"height":18.18},"width":326.77,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-14.png","element":"img","alt":"|D|e−βt/2 = δ/πt","inline":true,"padRight":true},{"text":"and using the union bound for ","element":"span"},{"style":{"height":11.6},"width":96.99,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-15.png","element":"img","alt":"t ∈ N","inline":true},{"text":", the statement holds. For example, we can use ","element":"span"},{"style":{"height":17.39},"width":184.17,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-16.png","element":"img","alt":"πt = π2t2/","inline":true},{"text":"6.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5.2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Fix ","element":"span"},{"style":{"height":12.8},"width":76.3,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-17.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":16},"width":369.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-18.png","element":"img","alt":" |f(x) − µt−1(x)| ≤","inline":true},{"style":{"height":20.68},"width":193.82,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-19.png","element":"img","alt":"β1/2t σt−1(x","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":11.6},"width":144.25,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-20.png","element":"img","alt":" x ∈ D","inline":true},{"style":{"fontStyle":"italic"},"text":", then the regret ","element":"span"},{"style":{"height":9.19},"width":29.98,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-21.png","element":"img","alt":" rt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is bounded by ","element":"span"},{"text":"2","element":"span"},{"style":{"height":20.68},"width":206.09,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-22.png","element":"img","alt":"β1/2t σt−1(xt","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"By definition of ","element":"span"},{"style":{"height":20.68},"width":523.1,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-23.png","element":"img","alt":" xt: µt−1(xt)+β1/2t σt−1(xt) ≥","inline":true},{"style":{"height":16},"width":136.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-24.png","element":"img","alt":"µt−1(x∗","inline":true},{"text":") + 
","element":"span"},{"style":{"height":20.68},"width":362.6,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-25.png","element":"img","alt":" β1/2t σt−1(x∗) ≥ f(x∗","inline":true},{"text":"). Therefore,","element":"span"}],[{"style":{"height":20.68},"width":587.54,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-26.png","element":"img","alt":"rt = f(x∗) − f(xt) ≤ β1/2t σt−1(xt","inline":true},{"text":") + ","element":"span"},{"style":{"height":16},"width":276.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-27.png","element":"img","alt":" µt−1(xt) − f(xt","inline":true},{"text":")","element":"span"}],[{"style":{"width":"95%"},"width":890,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-28.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5.3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The information gain for the points selected can be expressed in terms of the predictive variances. 
If ","element":"span"},{"style":{"height":17.38},"width":331.85,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-29.png","element":"img","alt":" f T = (f(xt)) ∈ RT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":":","element":"span"}],[{"id":"id-50","style":{"width":"81%"},"width":767,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-30.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"Recall ","element":"span"},{"text":"that ","element":"span"},{"text":"I(","element":"span"},{"style":{"height":16},"width":447.58,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-31.png","element":"img","alt":"yT ; f T ) = H(yT ) −","inline":true,"padRight":true},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2) log ","element":"span"},{"style":{"height":17.38},"width":150.24,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-32.png","element":"img","alt":" |2πeσ2I|","inline":true},{"text":". 
","element":"span"},{"text":"Now, ","element":"span"},{"text":"H(","element":"span"},{"style":{"height":16},"width":311.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-33.png","element":"img","alt":"yT ) = H(yT −1","inline":true},{"text":") + H(","element":"span"},{"style":{"height":16},"width":349.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-34.png","element":"img","alt":"yT |yT −1) = H(yT −1","inline":true},{"text":") + log(2","element":"span"},{"style":{"height":17.39},"width":98.4,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-35.png","element":"img","alt":"πe(σ2","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":17.38},"width":142.36,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-36.png","element":"img","alt":" σ2t−1(xT","inline":true,"padRight":true},{"text":")))","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2. ","element":"span"},{"text":"Here, we use that ","element":"span"},{"style":{"height":10.4},"width":181.96,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-37.png","element":"img","alt":" x1, . . . 
, xT","inline":true,"padRight":true},{"text":"are deterministic conditioned on ","element":"span"},{"style":{"height":11.1},"width":89.02,"height":27.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-38.png","element":"img","alt":" yT −1","inline":true},{"text":", and that the conditional variance ","element":"span"},{"style":{"height":17.78},"width":153.43,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-39.png","element":"img","alt":"σ2T −1(xT","inline":true,"padRight":true},{"text":") does not depend on ","element":"span"},{"style":{"height":11.1},"width":89.02,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-40.png","element":"img","alt":" yT −1","inline":true},{"text":". The result ","element":"span"},{"text":"follows by induction.","element":"span"}],[{"id":"id-47","style":{"fontWeight":"bold"},"text":"Lemma 5.4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pick ","element":"span"},{"style":{"height":12.4},"width":57.29,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-41.png","element":"img","alt":" δ ∈","inline":true,"padRight":true},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and let ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-42.png","element":"img","alt":" βt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be defined as in Lemma ","element":"span"},{"href":"#id-44","style":{"fontStyle":"italic"},"text":"5.1","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
Then, the following holds with probability ","element":"span"},{"style":{"height":14},"width":129.69,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-43.png","element":"img","alt":"≥ 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":":","element":"span"}],[{"style":{"width":"88%"},"width":833,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-44.png","element":"img"}],[{"style":{"height":16},"width":263.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-45.png","element":"img","alt":"where C1 := 8/","inline":true,"padRight":true},{"text":"log(1 + ","element":"span"},{"style":{"height":17.38},"width":209.62,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-46.png","element":"img","alt":" σ−2) ≥ 8σ2.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"By Lemma ","element":"span"},{"href":"#id-44","text":"5.1 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-45","text":"5.2","element":"a"},{"text":", we have that ","element":"span"},{"style":{"height":17.39},"width":474.19,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-47.png","element":"img","alt":"{r2t ≤ 4βtσ2t−1(xt) ∀t ≥ 1}","inline":true,"padRight":true},{"text":"with probability ","element":"span"},{"style":{"height":14},"width":136.22,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-48.png","element":"img","alt":" ≥ 1 − δ","inline":true},{"text":". 
","element":"span"},{"text":"Now, ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-49.png","element":"img","alt":" βt","inline":true,"padRight":true},{"text":"is nondecreasing, so that","element":"span"}],[{"style":{"width":"83%"},"width":785,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-50.png","element":"img"}],[{"text":"with ","element":"span"},{"style":{"height":17.38},"width":268.54,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-51.png","element":"img","alt":"C2 = σ−2/","inline":true,"padRight":true},{"text":"log(1 + ","element":"span"},{"style":{"height":17.38},"width":165.58,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-52.png","element":"img","alt":" σ−2) ≥","inline":true,"padRight":true},{"text":"1, ","element":"span"},{"text":"since ","element":"span"},{"style":{"height":15.79},"width":207.85,"height":39.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-53.png","element":"img","alt":"s2 ≤ C2","inline":true,"padRight":true},{"text":"log(1 + ","element":"span"},{"style":{"height":13.39},"width":34.68,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-54.png","element":"img","alt":" s2","inline":true},{"text":") ","element":"span"},{"text":"for ","element":"span"},{"style":{"height":17.39},"width":254.88,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-55.png","element":"img","alt":"s ∈ [0, σ−2","inline":true},{"text":"], ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":17.39},"width":664.92,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-56.png","element":"img","alt":"σ−2σ2t−1(xt) ≤ σ−2k(xt, xt) ≤ σ−2","inline":true},{"text":". 
","element":"span"},{"text":"Noting that ","element":"span"},{"style":{"height":15.77},"width":229.44,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-57.png","element":"img","alt":"C1 = 8σ2C2","inline":true},{"text":", the result follows by plugging in the representation of Lemma ","element":"span"},{"href":"#id-46","text":"5.3","element":"a"},{"text":".","element":"span"}],[{"text":"Finally, ","element":"span"},{"text":"Theorem ","element":"span"},{"href":"#id-35","text":"1 ","element":"a"},{"text":"is ","element":"span"},{"text":"a ","element":"span"},{"text":"simple ","element":"span"},{"text":"consequence ","element":"span"},{"text":"of Lemma ","element":"span"},{"href":"#id-47","text":"5.4","element":"a"},{"text":", since ","element":"span"},{"style":{"height":20.4},"width":301.88,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-58.png","element":"img","alt":" R2T ≤ T ∑Tt=1 r2t","inline":true,"padRight":true},{"text":"by the Cauchy-","element":"span"},{"text":"Schwarz inequality.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.2. General Decision Set","element":"span"}],[{"id":"id-45","text":"Theorem ","element":"span"},{"href":"#id-37","text":"2 ","element":"a"},{"text":"extends the statement of Theorem ","element":"span"},{"href":"#id-35","text":"1 ","element":"a"},{"text":"to the general case of ","element":"span"},{"style":{"height":14.18},"width":157.75,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-59.png","element":"img","alt":" D ⊂ Rd","inline":true,"padRight":true},{"text":"compact. ","element":"span"},{"text":"We cannot expect this generalization to work without any assumptions on the kernel ","element":"span"},{"style":{"height":16},"width":123.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-60.png","element":"img","alt":" k(x, x′","inline":true},{"text":"). 
","element":"span"},{"text":"For example, if ","element":"span"},{"style":{"height":18.19},"width":344.15,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-61.png","element":"img","alt":"k(x, x′) = e−∥x−x′∥","inline":true,"padRight":true},{"text":"(Ornstein-Uhlenbeck), while sample paths ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"are a.s. continuous, they are still very erratic: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is a.s. nondifferentiable almost everywhere, and the process comes with independent increments, a stationary variant of Brownian motion. The additional assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"in Theorem ","element":"span"},{"href":"#id-37","text":"2 ","element":"a"},{"text":"is rather mild and is satisfied by several common kernels, as discussed in Section ","element":"span"},{"href":"#id-30","text":"4","element":"a"},{"text":".","element":"span"}],[{"id":"id-46","text":"Recall that the finite case proof is based on Lemma ","element":"span"},{"href":"#id-44","text":"5.1 ","element":"a"},{"text":"paving the way for Lemma ","element":"span"},{"href":"#id-45","text":"5.2","element":"a"},{"text":". However, Lemma ","element":"span"},{"href":"#id-44","text":"5.1 ","element":"a"},{"text":"does not hold for infinite ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". 
First, observe that the confidence bound holds for all decisions actually chosen.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5.5 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pick ","element":"span"},{"style":{"height":12.4},"width":58.54,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-62.png","element":"img","alt":" δ ∈","inline":true,"padRight":true},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and set ","element":"span"},{"style":{"height":16},"width":261.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-63.png","element":"img","alt":" βt = 2 log(πt/δ","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":21.6},"width":348.81,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-64.png","element":"img","alt":"∑t≥1 π−1t = 1, πt >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then,","element":"span"}],[{"style":{"width":"78%"},"width":737,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-65.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"holds with probability ","element":"span"},{"style":{"height":14},"width":129.7,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-66.png","element":"img","alt":" ≥ 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"Fix ","element":"span"},{"style":{"height":12.8},"width":71.08,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-67.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"1 and ","element":"span"},{"style":{"height":11.6},"width":139.21,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-68.png","element":"img","alt":" x ∈ D","inline":true},{"text":". ","element":"span"},{"text":"Conditioned on ","element":"span"},{"style":{"height":16},"width":674.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-69.png","element":"img","alt":"yt−1 = (y1, . . . , yt−1), {x1, . . . , xt−1}","inline":true,"padRight":true},{"text":"are deterministic, and ","element":"span"},{"style":{"height":17.38},"width":484.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-70.png","element":"img","alt":" f(x) ∼ N(µt−1(x), σ2t−1(x","inline":true},{"text":")). ","element":"span"},{"text":"As before, Pr","element":"span"},{"style":{"height":20.68},"width":880.32,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-71.png","element":"img","alt":"{|f(xt) − µt−1(xt)| > β1/2t σt−1(xt)} ≤ e−βt/2","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":18.18},"width":251.76,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-72.png","element":"img","alt":" e−βt/2 = δ/πt","inline":true,"padRight":true},{"text":"and using the union bound for ","element":"span"},{"style":{"height":11.6},"width":92.1,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-73.png","element":"img","alt":"t ∈ N","inline":true},{"text":", the statement holds.","element":"span"}],[{"text":"Purely for the sake of analysis, we use a set of discretizations ","element":"span"},{"style":{"height":13.19},"width":145.32,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-74.png","element":"img","alt":" Dt ⊂ D","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.19},"width":44.99,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/9-75.png","element":"img","alt":" Dt","inline":true,"padRight":true},{"text":"will be used at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"in the analysis. Essentially, we use this to obtain a valid confidence interval on ","element":"span"},{"style":{"height":10.99},"width":42.26,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-0.png","element":"img","alt":" x∗","inline":true},{"text":". 
The following lemma provides a confidence bound for these subsets.","element":"span"}],[{"id":"id-49","style":{"fontWeight":"bold"},"text":"Lemma 5.6 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pick ","element":"span"},{"style":{"height":16},"width":197.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-1.png","element":"img","alt":"δ ∈ (0,","inline":true,"padRight":true},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"set ","element":"span"},{"style":{"height":14.4},"width":120.11,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-2.png","element":"img","alt":"βt =","inline":true,"padRight":true},{"text":"2 log(","element":"span"},{"style":{"height":16},"width":144.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-3.png","element":"img","alt":"|Dt|πt/δ","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":21.6},"width":348.81,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-4.png","element":"img","alt":"∑t≥1 π−1t = 1, πt >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then,","element":"span"}],[{"style":{"width":"93%"},"width":871,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"holds with probability ","element":"span"},{"style":{"height":14},"width":129.7,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-6.png","element":"img","alt":" ≥ 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"The proof is identical to that in Lemma ","element":"span"},{"href":"#id-44","text":"5.1","element":"a"},{"text":", except now we use ","element":"span"},{"style":{"height":13.19},"width":44.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-7.png","element":"img","alt":" Dt","inline":true,"padRight":true},{"text":"at each timestep.","element":"span"}],[{"text":"Now by assumption and the union bound, we have that","element":"span"}],[{"style":{"width":"93%"},"width":873,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-8.png","element":"img"}],[{"text":"which implies that, with probability greater than 1 ","element":"span"},{"style":{"height":4.4},"width":31,"height":11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-9.png","element":"img","alt":" −","inline":true},{"style":{"height":16.2},"width":167.03,"height":40.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-10.png","element":"img","alt":"dae−L2/b2","inline":true},{"text":", we have that","element":"span"}],[{"id":"id-48","style":{"width":"84%"},"width":792,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-11.png","element":"img"}],[{"text":"This allows us to obtain confidence on 
","element":"span"},{"style":{"height":7.2},"width":40.26,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-12.png","element":"img","alt":" x⋆","inline":true,"padRight":true},{"text":"as follows.","element":"span"}],[{"text":"Now let us choose a discretization ","element":"span"},{"style":{"height":13.19},"width":45,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-13.png","element":"img","alt":" Dt","inline":true,"padRight":true},{"text":"of size (","element":"span"},{"style":{"height":17.38},"width":63.95,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-14.png","element":"img","alt":"τt)d","inline":true,"padRight":true},{"text":"so that for all ","element":"span"},{"style":{"height":13.19},"width":121.96,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-15.png","element":"img","alt":" x ∈ Dt","inline":true}],[{"style":{"width":"35%"},"width":336,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-16.png","element":"img"}],[{"text":"where [","element":"span"},{"style":{"height":16},"width":51.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-17.png","element":"img","alt":"x]t","inline":true,"padRight":true},{"text":"denotes the closest point in ","element":"span"},{"style":{"height":13.19},"width":44.99,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-18.png","element":"img","alt":" Dt","inline":true,"padRight":true},{"text":"to ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":". 
A sufficient discretization has each coordinate with ","element":"span"},{"style":{"height":9.19},"width":29.42,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-19.png","element":"img","alt":" τt","inline":true,"padRight":true},{"text":"uniformly spaced points.","element":"span"}],[{"style":{"width":"99%"},"width":935,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-20.png","element":"img"}],[{"style":{"height":21.6},"width":373.64,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-21.png","element":"img","alt":"∑t≥1 π−1t = 1, πt >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":19.2},"width":227.24,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-22.png","element":"img","alt":" τt = dt2br√","inline":true},{"text":"log(2","element":"span"},{"style":{"height":16},"width":80.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-23.png","element":"img","alt":"da/δ","inline":true},{"text":"). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"text":"[","element":"span"},{"style":{"height":16},"width":67.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-24.png","element":"img","alt":"x∗]t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denote the closest point in ","element":"span"},{"style":{"height":13.19},"width":44.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-25.png","element":"img","alt":" Dt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"to ","element":"span"},{"style":{"height":10.98},"width":42.26,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-26.png","element":"img","alt":" 
x∗","inline":true},{"style":{"fontStyle":"italic"},"text":". Then,","element":"span"}],[{"style":{"height":20.68},"width":660.48,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-27.png","element":"img","alt":"|f(x∗) − µt−1([x∗]t)| ≤ β1/2t σt−1([x∗]t","inline":true},{"text":") + ","element":"span"},{"text":"1","element":"span"},{"style":{"height":21.73},"width":163.02,"height":54.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-28.png","element":"img","alt":"t2 ∀t ≥","inline":true,"padRight":true},{"text":"1","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"holds with probability ","element":"span"},{"style":{"height":14},"width":129.7,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-29.png","element":"img","alt":" ≥ 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"Using (","element":"span"},{"href":"#id-48","text":"9","element":"a"},{"text":"), we have that with probability greater than 1 ","element":"span"},{"style":{"height":16},"width":79.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-30.png","element":"img","alt":" − δ/","inline":true},{"text":"2,","element":"span"}],[{"style":{"width":"92%"},"width":866,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-31.png","element":"img"}],[{"text":"Hence,","element":"span"}],[{"style":{"width":"88%"},"width":832,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-32.png","element":"img"}],[{"text":"Now by choosing ","element":"span"},{"style":{"height":19.2},"width":213.78,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-33.png","element":"img","alt":" τt = 
dt2br√","inline":true},{"text":"log(2","element":"span"},{"style":{"height":16},"width":80.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-34.png","element":"img","alt":"da/δ","inline":true},{"text":"), we have that","element":"span"}],[{"style":{"width":"55%"},"width":523,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-35.png","element":"img"}],[{"text":"This implies that ","element":"span"},{"style":{"height":19.2},"width":271.14,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-36.png","element":"img","alt":" |Dt| = (dt2br√","inline":true},{"text":"log(2","element":"span"},{"style":{"height":17.38},"width":128.95,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-37.png","element":"img","alt":"da/δ))d","inline":true},{"text":". Using ","element":"span"},{"style":{"height":16},"width":39.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-38.png","element":"img","alt":"δ/","inline":true},{"text":"2 in Lemma ","element":"span"},{"href":"#id-49","text":"5.6","element":"a"},{"text":", we can apply the confidence bound to [","element":"span"},{"style":{"height":16},"width":67.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-39.png","element":"img","alt":"x∗]t","inline":true,"padRight":true},{"text":"(as this lives in ","element":"span"},{"style":{"height":13.19},"width":44.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-40.png","element":"img","alt":" Dt","inline":true},{"text":") to obtain the result.","element":"span"}],[{"text":"Now we are able to bound the regret.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5.8 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pick 
","element":"span"},{"style":{"height":16},"width":197.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-41.png","element":"img","alt":"δ ∈ (0,","inline":true,"padRight":true},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"set ","element":"span"},{"style":{"height":14.4},"width":120.11,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-42.png","element":"img","alt":"βt =","inline":true,"padRight":true},{"text":"2 log(4","element":"span"},{"style":{"height":16},"width":75.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-43.png","element":"img","alt":"πt/δ","inline":true},{"text":") ","element":"span"},{"text":"+ ","element":"span"},{"text":"4","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"log(","element":"span"},{"style":{"height":19.2},"width":111.32,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-44.png","element":"img","alt":"dtbr√","inline":true},{"text":"log(4","element":"span"},{"style":{"height":16},"width":80.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-45.png","element":"img","alt":"da/δ","inline":true},{"text":"))","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":21.6},"width":354.65,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-46.png","element":"img","alt":"∑t≥1 π−1t = 1, πt >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then, with probability greater than ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":11.6},"width":63.06,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-47.png","element":"img","alt":" − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", for all ","element":"span"},{"style":{"height":11.6},"width":113.14,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-48.png","element":"img","alt":" t ∈ N","inline":true},{"style":{"fontStyle":"italic"},"text":", the regret is bounded as follows:","element":"span"}],[{"style":{"width":"46%"},"width":433,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-49.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"We use ","element":"span"},{"style":{"height":16},"width":39.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-50.png","element":"img","alt":" δ/","inline":true},{"text":"2 in both Lemma ","element":"span"},{"href":"#id-50","text":"5.5 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-51","text":"5.7","element":"a"},{"text":", so that these events hold with probability greater than 1 ","element":"span"},{"style":{"height":11.6},"width":61.6,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-51.png","element":"img","alt":" − δ","inline":true},{"text":". 
Note that the specification of ","element":"span"},{"style":{"height":14.4},"width":34.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-52.png","element":"img","alt":" βt","inline":true,"padRight":true},{"text":"in the above lemma is greater than the specification used in Lemma ","element":"span"},{"href":"#id-50","text":"5.5 ","element":"a"},{"text":"(with ","element":"span"},{"style":{"height":16},"width":39.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-53.png","element":"img","alt":" δ/","inline":true},{"text":"2), so this choice is valid.","element":"span"}],[{"text":"By definition of ","element":"span"},{"style":{"height":16},"width":231.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-54.png","element":"img","alt":" xt: µt−1(xt","inline":true},{"text":") + ","element":"span"},{"style":{"height":20.68},"width":290.14,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-55.png","element":"img","alt":" β1/2t σt−1(xt) ≥","inline":true},{"style":{"height":20.68},"width":478.1,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-56.png","element":"img","alt":"µt−1([x∗]t)+β1/2t σt−1([x∗]t","inline":true},{"text":"). 
Also, by Lemma ","element":"span"},{"href":"#id-51","text":"5.7","element":"a"},{"text":", we have that ","element":"span"},{"style":{"height":20.68},"width":473.84,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-57.png","element":"img","alt":" µt−1([x∗]t)+β1/2t σt−1([x∗]t","inline":true},{"text":")+1","element":"span"},{"style":{"height":17.38},"width":186.88,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-58.png","element":"img","alt":"/t2 ≥ f(x∗","inline":true},{"text":"), which implies ","element":"span"},{"style":{"height":20.68},"width":677.79,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-59.png","element":"img","alt":" µt−1(xt)+β1/2t σt−1(xt) ≥ f(x∗)−1/t2","inline":true},{"text":". Therefore,","element":"span"}],[{"style":{"width":"82%"},"width":768,"height":185,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-60.png","element":"img"}],[{"id":"id-51","text":"which completes the proof.","element":"span"}],[{"text":"Now we are ready to complete the proof of Theorem ","element":"span"},{"href":"#id-37","text":"2","element":"a"},{"text":". 
As shown in the proof of Lemma ","element":"span"},{"href":"#id-47","text":"5.4","element":"a"},{"text":", we have that with probability greater than 1 ","element":"span"},{"style":{"height":11.6},"width":58.85,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-61.png","element":"img","alt":" − δ","inline":true},{"text":",","element":"span"}],[{"style":{"width":"73%"},"width":687,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-62.png","element":"img"}],[{"text":"so that by Cauchy-Schwarz:","element":"span"}],[{"style":{"width":"84%"},"width":795,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-63.png","element":"img"}],[{"text":"Hence,","element":"span"}],[{"style":{"width":"76%"},"width":713,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-64.png","element":"img"}],[{"text":"(since ","element":"span"},{"style":{"height":16},"width":42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-65.png","element":"img","alt":"∑","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"height":17.38},"width":167.35,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-66.png","element":"img","alt":"/t2 = π2/","inline":true},{"text":"6). 
Theorem ","element":"span"},{"href":"#id-37","text":"2 ","element":"a"},{"text":"now follows.","element":"span"}],[{"style":{"width":"100%"},"width":936,"height":181,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/10-67.png","element":"img"}],[{"text":"states that if derivatives up to fourth order exist for (","element":"span"},{"style":{"height":16},"width":300.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-0.png","element":"img","alt":"x, x′) ↦ k(x, x′","inline":true},{"text":"), then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is almost surely continuously differentiable, with ","element":"span"},{"style":{"height":16.79},"width":141.74,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-1.png","element":"img","alt":" ∂f/(∂xj","inline":true},{"text":") distributed as Gaussian processes again. ","element":"span"},{"text":"Moreover, there are constants ","element":"span"},{"style":{"height":15.59},"width":114.74,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-2.png","element":"img","alt":" a, bj >","inline":true,"padRight":true},{"text":"0 such that","element":"span"}],[{"id":"id-52","style":{"width":"87%"},"width":819,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-3.png","element":"img"}],[{"id":"id-56","text":"Picking ","element":"span"},{"style":{"height":16},"width":320.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-4.png","element":"img","alt":" L = [log(da2/δ)/","inline":true,"padRight":true},{"text":"min","element":"span"},{"style":{"height":18.98},"width":116.61,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-5.png","element":"img","alt":"j bj]1/2","inline":true},{"text":", we have that 
","element":"span"},{"style":{"height":20.2},"width":307.66,"height":50.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-6.png","element":"img","alt":"ae−bjL2 ≤ δ/(2d","inline":true},{"text":") for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , d","element":"span"},{"text":", so that for ","element":"span"},{"style":{"height":16.58},"width":224.91,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-7.png","element":"img","alt":"K1 = d1/2L","inline":true},{"text":", by the mean value theorem, we have Pr","element":"span"},{"style":{"height":16},"width":862.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-8.png","element":"img","alt":"{|f(x)−f(x′)| ≤ K1∥x−x′∥ ∀ x, x′ ∈ D} ≥ 1−δ/","inline":true},{"text":"2.","element":"span"}],[{"text":"Also, note that ","element":"span"},{"style":{"height":13.19},"width":136.86,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-9.png","element":"img","alt":" K1 = O","inline":true},{"text":"((log ","element":"span"},{"style":{"height":18.18},"width":125.71,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-10.png","element":"img","alt":" δ−1)1/2","inline":true},{"text":").","element":"span"}],[{"text":"This statement is about the joint distribution of ","element":"span"},{"style":{"height":16},"width":50.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-11.png","element":"img","alt":" f(·","inline":true},{"text":") and its partial derivatives w.r.t. each component. 
For a certain event in this sample space, all ","element":"span"},{"style":{"height":16.79},"width":141.74,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-12.png","element":"img","alt":" ∂f/(∂xj","inline":true},{"text":") exist, are continuous, and the complement of (","element":"span"},{"href":"#id-52","text":"10","element":"a"},{"text":") holds for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":". Theorem 5 of ","element":"span"},{"href":"#id-36","referenceIndex":12,"text":"Ghosal & Roy ","element":"a"},{"href":"#id-36","referenceIndex":12,"text":"(","element":"a"},{"href":"#id-36","referenceIndex":12,"text":"2006","element":"a"},{"text":"), together with the union bound, implies that this event has probability ","element":"span"},{"style":{"height":16},"width":147.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-13.png","element":"img","alt":" ≥ 1 − δ/","inline":true},{"text":"2. Derivatives up to fourth order exist for the Gaussian covariance function, and for Matérn kernels with ","element":"span"},{"style":{"height":9.6},"width":64.28,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-14.png","element":"img","alt":" ν >","inline":true,"padRight":true},{"text":"2 (","element":"span"},{"href":"#id-53","referenceIndex":32,"text":"Stein","element":"a"},{"href":"#id-53","referenceIndex":32,"text":", ","element":"a"},{"href":"#id-53","referenceIndex":32,"text":"1999","element":"a"},{"text":").","element":"span"}]]},{"heading":"B. Regret Bound for Target Function in RKHS","paragraphs":[[{"text":"In this section, we detail a proof of Theorem ","element":"span"},{"href":"#id-38","text":"3","element":"a"},{"text":". 
Recall that in this setting, we do not know the generator of the target function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", but only a bound on its RKHS norm ","element":"span"},{"style":{"height":16},"width":80.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-15.png","element":"img","alt":" ∥f∥k","inline":true},{"text":".","element":"span"}],[{"text":"Recall the posterior mean function ","element":"span"},{"style":{"height":16},"width":75.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-16.png","element":"img","alt":" µT (·","inline":true},{"text":") and posterior covariance function ","element":"span"},{"style":{"height":16},"width":101.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-17.png","element":"img","alt":" kT (·, ·","inline":true},{"text":") from Section ","element":"span"},{"text":"2","element":"span"},{"text":", conditioned on data (","element":"span"},{"style":{"height":16},"width":335.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-18.png","element":"img","alt":"xt, yt), t = 1, . . . , T","inline":true},{"text":". 
It is easy to see that the RKHS norm corresponding to ","element":"span"},{"style":{"height":13.19},"width":43.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-19.png","element":"img","alt":" kT","inline":true,"padRight":true},{"text":"is given by","element":"span"}],[{"style":{"width":"63%"},"width":598,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-20.png","element":"img"}],[{"text":"This implies that ","element":"span"},{"style":{"height":16},"width":300.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-21.png","element":"img","alt":" Hk(D) = HkT (D","inline":true},{"text":") for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", while the RKHS inner products are different: ","element":"span"},{"style":{"height":16},"width":238.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-22.png","element":"img","alt":" ∥f∥kT ≥ ∥f∥k","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":16},"width":407.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-23.png","element":"img","alt":" ⟨f(·), kT (·, x)⟩kT = f(x","inline":true},{"text":") for any ","element":"span"},{"style":{"height":16},"width":195.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-24.png","element":"img","alt":" f ∈ HkT (D","inline":true},{"text":") by the reproducing property, then","element":"span"}],[{"id":"id-57","style":{"width":"89%"},"width":841,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-25.png","element":"img"}],[{"text":"by the Cauchy-Schwarz inequality.","element":"span"}],[{"id":"id-61","text":"Compared to our other results, Theorem ","element":"span"},{"href":"#id-38","text":"3 ","element":"a"},{"text":"is an agnostic statement, in that the assumptions the Bayesian UCB algorithm bases its predictions on differ from how ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"and data ","element":"span"},{"style":{"height":10},"width":31.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-26.png","element":"img","alt":" yt","inline":true,"padRight":true},{"text":"are generated. ","element":"span"},{"text":"First, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is not drawn from a GP, but can be an arbitrary function from ","element":"span"},{"style":{"height":16},"width":101.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-27.png","element":"img","alt":" Hk(D","inline":true},{"text":"). 
Second, while the UCB method assumes that the noise ","element":"span"},{"style":{"height":16},"width":245.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-28.png","element":"img","alt":" εt = yt − f(xt","inline":true},{"text":") is drawn independently from ","element":"span"},{"style":{"height":17.39},"width":129.7,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-29.png","element":"img","alt":" N(0, σ2","inline":true},{"text":"), the true sequence of noise variables ","element":"span"},{"style":{"height":9.59},"width":30.58,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-30.png","element":"img","alt":" εt","inline":true,"padRight":true},{"text":"can be a uniformly bounded martingale difference sequence: ","element":"span"},{"style":{"height":12.8},"width":108.74,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-31.png","element":"img","alt":" εt ≤ σ","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":11.6},"width":92.1,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-32.png","element":"img","alt":" t ∈ N","inline":true},{"text":". 
All we have to do in order to lift the proof of Theorem ","element":"span"},{"href":"#id-35","text":"1 ","element":"a"},{"text":"to the agnostic setting is to establish an analogue to Lemma ","element":"span"},{"href":"#id-44","text":"5.1","element":"a"},{"text":", by way of the following concentration result.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 6 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":12.4},"width":64.44,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-33.png","element":"img","alt":" δ ∈","inline":true,"padRight":true},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1)","element":"span"},{"style":{"fontStyle":"italic"},"text":". Assume the noise variables ","element":"span"},{"style":{"height":9.59},"width":30.58,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-34.png","element":"img","alt":" εt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are uniformly bounded by ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-35.png","element":"img","alt":" σ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Define:","element":"span"}],[{"style":{"width":"51%"},"width":486,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-36.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then","element":"span"}],[{"text":"Pr","element":"span"},{"style":{"height":29.2},"width":368.92,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-37.png","element":"img","alt":"{∀T, ∀x ∈ D, |µT (x","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"− ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"| ≤ ","element":"span"},{"style":{"height":28.8},"width":220.12,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-38.png","element":"img","alt":" β1/2T +1σT (x)}","inline":true},{"style":{"fontStyle":"italic"},"text":"≥ ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"−","element":"span"},{"style":{"height":11.6},"width":27.98,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-39.png","element":"img","alt":"δ.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"B.1. 
Concentration of Martingales","element":"span"}],[{"text":"In our analysis, we use the following Bernstein-type concentration inequality for martingale differences, due to ","element":"span"},{"href":"#id-54","referenceIndex":11,"text":"Freedman ","element":"a"},{"href":"#id-54","referenceIndex":11,"text":"(","element":"a"},{"href":"#id-54","referenceIndex":11,"text":"1975","element":"a"},{"text":") (see also Theorem 3.15 of ","element":"span"},{"href":"#id-55","referenceIndex":21,"text":"Mc","element":"a"},{"href":"#id-55","referenceIndex":21,"text":"Diarmid ","element":"a"},{"href":"#id-55","referenceIndex":21,"text":"1998","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 7 (Freedman) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"height":14},"width":195.46,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-40.png","element":"img","alt":" X1, . . . , XT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a martingale difference sequence, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a uniform upper bound on the steps ","element":"span"},{"style":{"height":13.19},"width":44.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-41.png","element":"img","alt":" Xi","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"style":{"fontStyle":"italic"},"text":"denote the sum of conditional variances,","element":"span"}],[{"style":{"width":"66%"},"width":626,"height":70,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-42.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then, for every ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a, v > ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"93%"},"width":875,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-43.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"B.2. Proof of Theorem ","element":"span"},{"href":"#id-56","style":{"fontWeight":"bold"},"text":"6","element":"a"}],[{"text":"We will show that:","element":"span"}],[{"style":{"width":"72%"},"width":674,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-44.png","element":"img"}],[{"text":"Theorem ","element":"span"},{"href":"#id-56","text":"6 ","element":"a"},{"text":"then follows from (","element":"span"},{"href":"#id-57","text":"11","element":"a"},{"text":"). Recall that ","element":"span"},{"style":{"height":9.59},"width":80.09,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-45.png","element":"img","alt":" εt =","inline":true},{"style":{"height":16},"width":174.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-46.png","element":"img","alt":"yt − f(xt","inline":true},{"text":"). 
","element":"span"},{"text":"We will analyze the quantity ","element":"span"},{"style":{"height":13.19},"width":112.67,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-47.png","element":"img","alt":" ZT =","inline":true},{"style":{"height":19.5},"width":205.39,"height":48.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-48.png","element":"img","alt":"∥µT − f∥2kT","inline":true,"padRight":true},{"text":", measuring the error of ","element":"span"},{"style":{"height":10},"width":47.01,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-49.png","element":"img","alt":" µT","inline":true,"padRight":true},{"text":"as an approximation ","element":"span"},{"text":"to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"under the RKHS norm of ","element":"span"},{"style":{"height":16},"width":123.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-50.png","element":"img","alt":" HkT (D","inline":true},{"text":"). The following lemma provides the connection with the information gain. 
","element":"span"},{"text":"This lemma is important since our concentration argument is inductive: roughly speaking, we condition on getting concentration in the past in order to achieve good concentration in the future.","element":"span"}],[{"style":{"width":"50%"},"width":473,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-51.png","element":"img"}],[{"style":{"height":29.96},"width":110.06,"height":74.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-52.png","element":"img","alt":"∑Tt=1","inline":true,"padRight":true},{"text":"min","element":"span"},{"style":{"height":25.58},"width":468.5,"height":63.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-53.png","element":"img","alt":"{σ−2σ2t−1(xt), α} ≤ (2α/","inline":true},{"text":"log(1 + ","element":"span"},{"style":{"height":25.73},"width":259.8,"height":64.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/11-54.png","element":"img","alt":" α)) γT , α > 0.","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"We have that min","element":"span"},{"style":{"height":16},"width":263.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-0.png","element":"img","alt":"{r, α} ≤ (α/","inline":true,"padRight":true},{"text":"log(1 + ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-1.png","element":"img","alt":"α","inline":true},{"text":")) log(1+","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"). 
The statement follows from Lemma ","element":"span"},{"href":"#id-46","text":"5.3","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"2%"},"width":27,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-2.png","element":"img"}],[{"text":"The next lemma bounds the growth of ","element":"span"},{"style":{"height":13.19},"width":50.2,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-3.png","element":"img","alt":" ZT","inline":true,"padRight":true},{"text":". It is ","element":"span"},{"id":"id-59","text":"formulated in terms of normalized quantities: ","element":"span"},{"style":{"height":16},"width":162.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-4.png","element":"img","alt":" ε̃t = εt/σ","inline":true},{"text":", ","element":"span"},{"style":{"height":16},"width":566.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-5.png","element":"img","alt":"f̃ = f/σ, µ̃t = µt/σ, σ̃t = σt/σ","inline":true},{"text":". 
Also, to ease notation, we will use ","element":"span"},{"style":{"height":10},"width":177.65,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-6.png","element":"img","alt":" µt−1, σt−1","inline":true,"padRight":true},{"text":"as shorthand for ","element":"span"},{"style":{"height":16},"width":132.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-7.png","element":"img","alt":" µt−1(xt","inline":true},{"text":"), ","element":"span"},{"style":{"height":16},"width":131.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-8.png","element":"img","alt":"σt−1(xt","inline":true},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 7.2 ","element":"span"},{"style":{"height":14},"width":249.73,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-9.png","element":"img","alt":" For all T ∈ N,","inline":true}],[{"style":{"width":"68%"},"width":641,"height":224,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-10.png","element":"img"}],[{"style":{"height":17.39},"width":640.17,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-11.png","element":"img","alt":"Proof If αt = (Kt + σ2I)−1yt","inline":true},{"text":", then ","element":"span"},{"style":{"height":16},"width":154.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-12.png","element":"img","alt":" µt(x) =","inline":true},{"style":{"height":17.38},"width":135.74,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-13.png","element":"img","alt":"αTt kt(x","inline":true},{"text":"). 
","element":"span"},{"text":"Then, ","element":"span"},{"style":{"height":18.73},"width":351.78,"height":46.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-14.png","element":"img","alt":"⟨µT , f⟩k = f TT αT","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":17.9},"width":176.62,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-15.png","element":"img","alt":"∥µT ∥2k =","inline":true},{"style":{"height":17.77},"width":314.94,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-16.png","element":"img","alt":"yTT αT − σ2∥αT ∥2","inline":true},{"text":". ","element":"span"},{"text":"Moreover, for ","element":"span"},{"style":{"height":16},"width":314.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-17.png","element":"img","alt":" t ≤ T, µT (xt) =","inline":true},{"style":{"height":18.67},"width":111.92,"height":46.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-18.png","element":"img","alt":"δTt KT","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":13.19},"width":64.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-19.png","element":"img","alt":"KT","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":17.38},"width":418.12,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-20.png","element":"img","alt":" σ2I)−1yT = yt − σ2αt","inline":true},{"text":". 
","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":13.19},"width":104.81,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-21.png","element":"img","alt":" ZT =","inline":true},{"style":{"height":20.57},"width":387.82,"height":51.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-22.png","element":"img","alt":"∥µT −f∥k +σ−2 �t≤T","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":17.39},"width":286.9,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-23.png","element":"img","alt":"µT (xt)−f(xt))2","inline":true},{"text":", we have that","element":"span"}],[{"style":{"width":"76%"},"width":718,"height":204,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-24.png","element":"img"}],[{"text":"Now, ","element":"span"},{"style":{"height":17.78},"width":78.99,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-25.png","element":"img","alt":" −yTT","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":17.39},"width":514.3,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-26.png","element":"img","alt":"KT +σ2I)−1yT .= 2 log P(yT","inline":true,"padRight":true},{"text":"), where “ ","element":"span"},{"style":{"height":11.04},"width":51,"height":27.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-27.png","element":"img","alt":" .=”","inline":true,"padRight":true},{"text":"means that we drop determinant terms, thus concentrate on quadratic functions. 
","element":"span"},{"text":"Since log ","element":"span"},{"style":{"height":16},"width":162.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-28.png","element":"img","alt":" P(yT ) =","inline":true},{"style":{"height":16.74},"width":54.06,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-29.png","element":"img","alt":"�t","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":17.1},"width":292.39,"height":42.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/12-30.png","element":"img","alt":" P(yt|y nT","inline":true,"padRight":true},{"text":", define ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":13.99},"width":202.77,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-32.png","element":"img","alt":"λt = 0 for","inline":true},{"style":{"height":12.39},"width":139.74,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-33.png","element":"img","alt":"t = nT","inline":true,"padRight":true},{"text":"+ 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , T","element":"span"},{"text":". 
","element":"span"},{"text":"Information gain maximization over a finite ","element":"span"},{"style":{"height":13.19},"width":55.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-34.png","element":"img","alt":" DT","inline":true,"padRight":true},{"text":"can be described in terms of a simple linear-Gaussian model over the unknown ","element":"span"},{"style":{"height":14},"width":145.22,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-35.png","element":"img","alt":" f ∈ RnT","inline":true,"padRight":true},{"text":", with prior ","element":"span"},{"style":{"height":16},"width":331.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-36.png","element":"img","alt":" P(f ) = N(0, KDT","inline":true,"padRight":true},{"text":") and likelihood potentials ","element":"span"},{"style":{"height":17.38},"width":383.82,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-37.png","element":"img","alt":" P(yt|f ) = N(vTt f , σ2","inline":true},{"text":") with unit-norm features, ","element":"span"},{"style":{"height":16},"width":935.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-38.png","element":"img","alt":"∥vt∥ = 1. 
With the following lemma, we upper-bound","inline":true,"padRight":true},{"text":"˜","element":"span"},{"style":{"height":10.4},"width":43.63,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-39.png","element":"img","alt":"γT","inline":true,"padRight":true},{"text":"by way of two relaxations.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 7.6 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":13.2},"width":70.89,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-40.png","element":"img","alt":" T ≥","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", we have that","element":"span"}],[{"style":{"width":"88%"},"width":825,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-41.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"subject to ","element":"span"},{"style":{"height":13.19},"width":130.23,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-42.png","element":"img","alt":" mt ∈ N","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":16.74},"width":208.47,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-43.png","element":"img","alt":"�t mT = T","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":17.4},"width":241.95,"height":43.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-44.png","element":"img","alt":"λ1 ≥ ˆλ2 ≥ . . 
.","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the spectrum of the kernel matrix ","element":"span"},{"style":{"height":14.8},"width":86.54,"height":36.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-45.png","element":"img","alt":" KDT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". Here, if ","element":"span"},{"style":{"height":13.19},"width":128.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-46.png","element":"img","alt":"T > nT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":13.19},"width":122.17,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-47.png","element":"img","alt":" mt = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":12.39},"width":114.44,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-48.png","element":"img","alt":" t > nT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"As shown by ","element":"span"},{"href":"#id-27","referenceIndex":19,"text":"Krause & Guestrin ","element":"a"},{"href":"#id-27","referenceIndex":19,"text":"(","element":"a"},{"href":"#id-27","referenceIndex":19,"text":"2005","element":"a"},{"text":"), the function ","element":"span"},{"style":{"height":16},"width":301.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-49.png","element":"img","alt":" F(A) = I(yA; f","inline":true,"padRight":true},{"text":") is submodular. 
","element":"span"},{"text":"In the particular case considered here, this can be seen as follows: ","element":"span"},{"style":{"height":16},"width":340.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-50.png","element":"img","alt":"F(A) = H(yA) −","inline":true,"padRight":true},{"text":"H(","element":"span"},{"style":{"height":16},"width":146.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-51.png","element":"img","alt":"yA | f","inline":true,"padRight":true},{"text":"), where the entropy H(","element":"span"},{"style":{"height":11.1},"width":49,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-52.png","element":"img","alt":"yA","inline":true},{"text":") is a (not-necessarily monotonic) submodular function in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", and since the noise is conditionally independent given ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"f ","element":"span"},{"text":", H(","element":"span"},{"style":{"height":16},"width":149.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-53.png","element":"img","alt":"yA | f","inline":true,"padRight":true},{"text":") is an additive (modular) function in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". ","element":"span"},{"text":"Subtracting a modular function preserves submodularity, thus ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":") is submodular. 
","element":"span"},{"text":"Furthermore, the information gain is monotonic in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"(i.e., ","element":"span"},{"style":{"height":16},"width":229.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-54.png","element":"img","alt":" F(A) ≤ F(B","inline":true},{"text":") whenever ","element":"span"},{"style":{"height":14},"width":140.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-55.png","element":"img","alt":"A ⊆ B","inline":true},{"text":") (","element":"span"},{"href":"#id-25","referenceIndex":7,"text":"Cover & Thomas","element":"a"},{"href":"#id-25","referenceIndex":7,"text":", ","element":"a"},{"href":"#id-25","referenceIndex":7,"text":"1991","element":"a"},{"text":"). ","element":"span"},{"text":"Thus, we can apply the result of ","element":"span"},{"href":"#id-11","referenceIndex":24,"text":"Nemhauser et al. ","element":"a"},{"href":"#id-11","referenceIndex":24,"text":"(","element":"a"},{"href":"#id-11","referenceIndex":24,"text":"1978","element":"a"},{"text":")","element":"span"},{"style":{"height":8},"width":16,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-56.png","element":"img","alt":"5","inline":true,"padRight":true},{"text":"which guarantees that ˜","element":"span"},{"style":{"height":10.4},"width":43.63,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-57.png","element":"img","alt":"γT","inline":true,"padRight":true},{"text":"is upper-bounded by 1","element":"span"},{"style":{"height":16},"width":166.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-58.png","element":"img","alt":"/(1 − 1/e","inline":true},{"text":") times the value the greedy maximization algorithm attains. 
","element":"span"},{"text":"The latter chooses features of the form ","element":"span"},{"style":{"height":17.68},"width":369.9,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-59.png","element":"img","alt":"vt = δxt = [I{x=xt}","inline":true},{"text":"] in each round, ","element":"span"},{"style":{"height":13.19},"width":159.4,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-60.png","element":"img","alt":" xt ∈ DT","inline":true,"padRight":true},{"text":". We upper-bound the greedy maximum once more by relaxing these constraints to ","element":"span"},{"style":{"height":16},"width":267.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-61.png","element":"img","alt":" ∥vt∥ = 1 only.","inline":true,"padRight":true},{"text":"In the remainder of the proof, we concentrate on this relaxed greedy procedure. Suppose that up to round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", it chose ","element":"span"},{"style":{"height":10.8},"width":207.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-62.png","element":"img","alt":"v1, . . . , vt−1","inline":true},{"text":". 
","element":"span"},{"text":"The posterior ","element":"span"},{"style":{"height":16},"width":164.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-63.png","element":"img","alt":" P(f |yt−1","inline":true},{"text":") has inverse covariance matrix ","element":"span"},{"style":{"height":20.71},"width":262.71,"height":51.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-64.png","element":"img","alt":" Σ−1t−1 = K−1DT","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":18.54},"width":249.14,"height":46.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-65.png","element":"img","alt":" σ−2V t−1V Tt−1","inline":true},{"text":", ","element":"span"},{"style":{"height":16},"width":394.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-66.png","element":"img","alt":"V t−1 = [v1 . . . vt−1","inline":true},{"text":"], and the greedy procedure selects ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"v ","element":"span"},{"text":"so to maximize the variance ","element":"span"},{"style":{"height":15.77},"width":160.08,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-67.png","element":"img","alt":" vT Σt−1v","inline":true},{"text":": the eigenvector corresponding to ","element":"span"},{"style":{"height":13.19},"width":86.04,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-68.png","element":"img","alt":" Σt−1","inline":true},{"text":"’s largest eigenvalue (by the Rayleigh-Ritz theorem). 
Since ","element":"span"},{"style":{"height":14.79},"width":206.25,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-69.png","element":"img","alt":" Σ0 = KDT","inline":true,"padRight":true},{"text":", then ","element":"span"},{"style":{"height":9.99},"width":164.4,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-70.png","element":"img","alt":" v1 = u1","inline":true},{"text":". ","element":"span"},{"text":"Moreover, if all ","element":"span"},{"style":{"height":13.2},"width":201.98,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-71.png","element":"img","alt":" vt′, t′ < t","inline":true},{"text":", have been chosen among ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"U ","element":"span"},{"text":"’s columns, then by the inverse covariance expression just given, ","element":"span"},{"style":{"height":14.79},"width":86.54,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-72.png","element":"img","alt":" KDT","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":86.04,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-73.png","element":"img","alt":" Σt−1","inline":true,"padRight":true},{"text":"have the same eigenvectors, so that ","element":"span"},{"style":{"height":9.99},"width":36.06,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-74.png","element":"img","alt":" vt","inline":true,"padRight":true},{"text":"is a column of ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"U ","element":"span"},{"text":"as well. 
For example, if ","element":"span"},{"style":{"height":12.39},"width":139.51,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-75.png","element":"img","alt":" vt = uj","inline":true},{"text":", then comparing ","element":"span"},{"style":{"height":13.19},"width":86.04,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-76.png","element":"img","alt":" Σt−1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":45.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-77.png","element":"img","alt":" Σt","inline":true},{"text":", all eigenvalues other than the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th remain the same, while the latter is shrunk. ","element":"span"},{"text":"Therefore, after ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"rounds of the relaxed greedy procedure: ","element":"span"},{"style":{"height":17.68},"width":696.02,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-78.png","element":"img","alt":"vt ∈ {u1, . . . , umin{T,nT }}, t = 1, . . . , T","inline":true},{"text":": at most the leading ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"eigenvectors of ","element":"span"},{"style":{"height":14.79},"width":86.54,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-79.png","element":"img","alt":" KDT","inline":true,"padRight":true},{"text":"can have been selected (possibly multiple times). 
If ","element":"span"},{"style":{"height":9.19},"width":46.99,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-80.png","element":"img","alt":" mt","inline":true,"padRight":true},{"text":"denotes the number of times the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"-th column of ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"U ","element":"span"},{"text":"has been selected, we obtain the lemma statement by a final bounding step.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.2. From Empirical to Process Eigenvalues","element":"span"}],[{"text":"The final step will be to relate the empirical ","element":"span"},{"id":"id-63","text":"spectrum ","element":"span"},{"style":{"height":19.01},"width":77.2,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-81.png","element":"img","alt":" {ˆλt}","inline":true,"padRight":true},{"text":"to the kernel operator spectrum. ","element":"span"},{"text":"Since log(1 + ","element":"span"},{"style":{"height":19},"width":373.54,"height":47.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-82.png","element":"img","alt":" σ−2mtˆλt) ≤ σ−2mtˆλt","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-63","text":"7.6","element":"a"},{"text":", we will mainly be interested in relating the tail sums of the spectra. 
Let ","element":"span"},{"style":{"height":19.06},"width":404.48,"height":47.65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-83.png","element":"img","alt":" µ(x) = V(D)−1I{x∈D}","inline":true,"padRight":true},{"text":"be the uniform distribution on ","element":"span"},{"style":{"height":19.31},"width":355.07,"height":48.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-84.png","element":"img","alt":" D, V(D) =∫x∈D dx","inline":true},{"text":", and assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"is continuous. Note that ","element":"span"},{"style":{"height":18},"width":441.23,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/14-85.png","element":"img","alt":"∫k(x, x)µ(x) dx = 1 by","inline":true,"padRight":true},{"text":"our assumption ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":") = 1, so that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"is Hilbert-Schmidt on ","element":"span"},{"style":{"height":16},"width":84.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-0.png","element":"img","alt":" L2(µ","inline":true},{"text":"). 
Then, Mercer’s theorem (","element":"span"},{"href":"#id-24","referenceIndex":34,"text":"Wahba","element":"a"},{"href":"#id-24","referenceIndex":34,"text":", ","element":"a"},{"href":"#id-24","referenceIndex":34,"text":"1990","element":"a"},{"text":") states that the corresponding kernel operator has a discrete eigenspectrum ","element":"span"},{"style":{"height":16},"width":211.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-1.png","element":"img","alt":" {(λs, φs(·))}","inline":true},{"text":", and","element":"span"}],[{"style":{"width":"60%"},"width":569,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.2},"width":333.91,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-3.png","element":"img","alt":" λ1 ≥ λ2 ≥ · · · ≥","inline":true,"padRight":true},{"text":"0, and ","element":"span"},{"style":{"height":16.79},"width":320.58,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-4.png","element":"img","alt":" Eµ[φs(x)φt(x)] =","inline":true},{"style":{"height":16.39},"width":54.2,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-5.png","element":"img","alt":"δs,t","inline":true},{"text":". 
","element":"span"},{"text":"Moreover, ","element":"span"},{"style":{"height":20.57},"width":283.08,"height":51.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-6.png","element":"img","alt":"�s≥1 λ2s < ∞","inline":true},{"text":", and the expansion of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"converges absolutely and uniformly on ","element":"span"},{"style":{"height":10.8},"width":75.24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-7.png","element":"img","alt":" D ×","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". ","element":"span"},{"text":"Note that ","element":"span"},{"style":{"height":20.57},"width":636.96,"height":51.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-8.png","element":"img","alt":"�s≥1 λs = �s≥1 λs Eµ[φs(x)2] =","inline":true},{"style":{"height":18},"width":935.77,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-9.png","element":"img","alt":"�K(x, x)µ(x) dx = 1. 
In order to proceed from The-","inline":true,"padRight":true},{"text":"orem ","element":"span"},{"href":"#id-63","text":"7.6","element":"a"},{"text":", we have to pick a discretization ","element":"span"},{"style":{"height":13.19},"width":55.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-10.png","element":"img","alt":" DT","inline":true,"padRight":true},{"text":"for which (","element":"span"},{"href":"#id-62","text":"13","element":"a"},{"text":") holds, and for which ","element":"span"},{"style":{"height":18.33},"width":112.84,"height":45.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-11.png","element":"img","alt":"∑t>T∗","inline":true,"padRight":true},{"text":"ˆ","element":"span"},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-12.png","element":"img","alt":"λt","inline":true,"padRight":true},{"text":"is not much larger ","element":"span"},{"text":"than ","element":"span"},{"style":{"height":18.33},"width":158.37,"height":45.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-13.png","element":"img","alt":"∑t>T∗ λt","inline":true},{"text":". 
With the following lemma, we determine ","element":"span"},{"text":"sizes ","element":"span"},{"style":{"height":9.19},"width":46.92,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-14.png","element":"img","alt":" nT","inline":true,"padRight":true},{"text":"for which such discretizations exist.","element":"span"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"Lemma 7.7 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Fix ","element":"span"},{"style":{"height":14.8},"width":225.33,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-15.png","element":"img","alt":" T ∈ N, δ >","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.6},"width":69.14,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-16.png","element":"img","alt":" ε >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists a discretization ","element":"span"},{"style":{"height":13.19},"width":144.23,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-17.png","element":"img","alt":" DT ⊂ D","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"of size","element":"span"}],[{"style":{"height":19.16},"width":360.31,"height":47.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-18.png","element":"img","alt":"nT = V(D)(ε/√d)−d","inline":true},{"text":"[log(1","element":"span"},{"style":{"height":16},"width":106.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-19.png","element":"img","alt":"/δ)+d","inline":true,"padRight":true},{"text":"log(","element":"span"},{"style":{"height":19.16},"width":92.9,"height":47.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-20.png","element":"img","alt":"√d/ε","inline":true},{"text":")+log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":")]","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"which fulfils the following requirements:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-21.png","element":"img","alt":" ε","inline":true},{"style":{"fontStyle":"italic"},"text":"-denseness: For any ","element":"span"},{"style":{"height":11.6},"width":113.96,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-22.png","element":"img","alt":" x ∈ D","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists 
","element":"span"},{"text":"[","element":"span"},{"style":{"height":16},"width":104.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-23.png","element":"img","alt":"x]T ∈","inline":true},{"style":{"height":13.19},"width":55.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-24.png","element":"img","alt":"DT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":264.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-25.png","element":"img","alt":" ∥x − [x]T ∥ ≤ ε","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"text":"spec(","element":"span"},{"style":{"height":16},"width":190.27,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-26.png","element":"img","alt":"KDT ) = {","inline":true},{"text":"ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"λ","element":"span"},{"style":{"height":12.8},"width":64.98,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-27.png","element":"img","alt":"1 ≥","inline":true,"padRight":true},{"text":"ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"λ","element":"span"},{"style":{"height":12.8},"width":64.98,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-28.png","element":"img","alt":"2 ≥","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". . . 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"style":{"fontStyle":"italic"},"text":", then for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"height":13.19},"width":91.47,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-29.png","element":"img","alt":"∗ = 1","inline":true},{"style":{"fontStyle":"italic"},"text":", . . . , n","element":"span"},{"style":{"height":9.19},"width":38.1,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-30.png","element":"img","alt":"T :","inline":true}],[{"style":{"width":"54%"},"width":511,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-31.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"First, if we draw ","element":"span"},{"style":{"height":9.19},"width":46.92,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-32.png","element":"img","alt":" nT","inline":true,"padRight":true},{"text":"samples ˜","element":"span"},{"style":{"height":17.59},"width":174.62,"height":43.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-33.png","element":"img","alt":"xj ∼ µ(x","inline":true},{"text":") independently at random, then ","element":"span"},{"style":{"height":17.59},"width":209.33,"height":43.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-34.png","element":"img","alt":" DT = {˜xj}","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-35.png","element":"img","alt":" ε","inline":true},{"text":"-dense with probability ","element":"span"},{"style":{"height":14},"width":156.62,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-36.png","element":"img","alt":" ≥ 1 − 
δ","inline":true},{"text":". ","element":"span"},{"text":"Namely, cover ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":18.39},"width":360.26,"height":45.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-37.png","element":"img","alt":"N = V(D)(ε/√d)^{−d}","inline":true,"padRight":true},{"text":"hypercubes of side length ","element":"span"},{"style":{"height":18.39},"width":92.71,"height":45.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-38.png","element":"img","alt":" ε/√d","inline":true},{"text":", within which the maximum Euclidean distance is ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-39.png","element":"img","alt":" ε","inline":true},{"text":". The probability of not hitting at least one cell is upper-bounded by ","element":"span"},{"style":{"height":16},"width":256.97,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-40.png","element":"img","alt":" N(1 − 1/N)^{n_T}","inline":true,"padRight":true},{"text":". 
","element":"span"},{"text":"Since log(1 ","element":"span"},{"style":{"height":16},"width":185.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-41.png","element":"img","alt":" − 1/N) ≤","inline":true},{"style":{"height":16},"width":104.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-42.png","element":"img","alt":"−1/N","inline":true},{"text":", this is upper-bounded by ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-43.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":13.2},"width":138.16,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-44.png","element":"img","alt":" n_T ≥ N","inline":true,"padRight":true},{"text":"log(","element":"span"},{"style":{"height":16},"width":72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-45.png","element":"img","alt":"N/δ","inline":true},{"text":").","element":"span"}],[{"text":"Now, let ","element":"span"},{"style":{"height":20.4},"width":285.18,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-46.png","element":"img","alt":" S = n_T^{-1} ∑_{t=1}^{T_*}","inline":true,"padRight":true},{"text":"ˆ","element":"span"},{"style":{"height":13.19},"width":35.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-47.png","element":"img","alt":"λ_t","inline":true},{"text":". ","element":"span"},{"href":"#id-64","referenceIndex":30,"text":"Shawe-Taylor et al. 
","element":"a"},{"href":"#id-64","referenceIndex":30,"text":"(","element":"a"},{"href":"#id-64","referenceIndex":30,"text":"2005","element":"a"},{"text":") show that ","element":"span"},{"style":{"height":20.4},"width":309.44,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-48.png","element":"img","alt":" E[S] ≥ ∑_{t=1}^{T_*} λ_t","inline":true},{"text":". ","element":"span"},{"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"is the event ","element":"span"},{"style":{"height":16},"width":75.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-49.png","element":"img","alt":" {D_T","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":7.2},"width":49.58,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-50.png","element":"img","alt":" ε-","inline":true},{"text":"dense ","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", then Pr(","element":"span"},{"style":{"height":16},"width":208.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-51.png","element":"img","alt":"C) ≥ 1 − δ","inline":true},{"text":". 
","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":19.1},"width":180.72,"height":47.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-52.png","element":"img","alt":"S ≤ n_T^{-1}","inline":true,"padRight":true},{"text":"tr","element":"span"},{"style":{"height":14.79},"width":722.81,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-53.png","element":"img","alt":"K_{D_T} = 1 in any case, we have that","inline":true},{"style":{"height":16},"width":316.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-54.png","element":"img","alt":"E[S|C] ≥ E[S] −","inline":true,"padRight":true},{"text":"Pr(","element":"span"},{"style":{"height":20.4},"width":360.88,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-55.png","element":"img","alt":"C^c) ≥ ∑_{t=1}^{T_*} λ_t − δ","inline":true},{"text":". ","element":"span"},{"text":"By the probabilistic method, there must exist some ","element":"span"},{"style":{"height":13.19},"width":55.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-56.png","element":"img","alt":" D_T","inline":true,"padRight":true},{"text":"for which both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"and the latter inequality hold.","element":"span"}],[{"text":"The following lemma, the equivalent of Theorem ","element":"span"},{"text":"4 ","element":"span"},{"text":"in the context here, is a direct consequence of Lemma ","element":"span"},{"href":"#id-63","text":"7.6","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 7.8 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":13.19},"width":55.99,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-57.png","element":"img","alt":" 
D_T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be some discretization of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"height":16},"width":559.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-58.png","element":"img","alt":"T = |D_T|. Then, for any T_* = 1","inline":true},{"style":{"fontStyle":"italic"},"text":", . . . , ","element":"span"},{"text":"min","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"T, n","element":"span"},{"style":{"height":16},"width":58.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-59.png","element":"img","alt":"T }:","inline":true}],[{"style":{"width":"78%"},"width":736,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-60.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"We split the right-hand side in Lemma ","element":"span"},{"href":"#id-63","text":"7.6 ","element":"a"},{"text":"at ","element":"span"},{"style":{"height":13.19},"width":144.2,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-61.png","element":"img","alt":" t = T_*","inline":true},{"text":". ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":19.14},"width":279.72,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-62.png","element":"img","alt":" r = ∑_{t≤T_*} m_t","inline":true},{"text":". 
","element":"span"},{"text":"For ","element":"span"},{"style":{"height":13.2},"width":144.2,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-63.png","element":"img","alt":" t ≤ T_*","inline":true},{"text":": log(1 + ","element":"span"},{"style":{"height":19.01},"width":208.69,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-64.png","element":"img","alt":" m_t ˆλ_t/σ^2) ≤","inline":true,"padRight":true},{"text":"log(","element":"span"},{"style":{"height":17.39},"width":128.24,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-65.png","element":"img","alt":"r n_T/σ^2","inline":true},{"text":"), since ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":13.2},"width":142.98,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-66.png","element":"img","alt":"λ_t ≤ n_T","inline":true,"padRight":true},{"text":". For ","element":"span"},{"style":{"height":13.19},"width":106.81,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-67.png","element":"img","alt":"t > T_*","inline":true},{"text":": log(1+","element":"span"},{"style":{"height":19.01},"width":512.5,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-68.png","element":"img","alt":"m_t ˆλ_t/σ^2) ≤ m_t ˆλ_t/σ^2 ≤ (T−r","inline":true},{"text":")","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":17.38},"width":97.4,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-69.png","element":"img","alt":"λ_t/σ^2","inline":true},{"text":".","element":"span"}],[{"text":"The following theorem describes our “recipe” for obtaining bounds on ","element":"span"},{"style":{"height":10.4},"width":43.63,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-70.png","element":"img","alt":" γ_T","inline":true,"padRight":true},{"text":"for a particular kernel 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", given that tail bounds on ","element":"span"},{"style":{"height":18.38},"width":339.95,"height":45.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-71.png","element":"img","alt":" B_k(T_*) = ∑_{s>T_*} λ_s","inline":true,"padRight":true},{"text":"are known.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 8 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that ","element":"span"},{"style":{"height":14.18},"width":151.22,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-72.png","element":"img","alt":" D ⊂ R^d","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is compact, and ","element":"span"},{"style":{"height":16},"width":121.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-73.png","element":"img","alt":"k(x, x′","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a covariance function for which the additional assumption of Theorem ","element":"span"},{"href":"#id-37","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Moreover, let ","element":"span"},{"style":{"height":18.38},"width":359.42,"height":45.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-74.png","element":"img","alt":" B_k(T_*) = ∑_{s>T_*} λ_s","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":16},"width":80.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-75.png","element":"img","alt":" {λ_s}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"spectrum of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with respect to the uniform distribution over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"style":{"fontStyle":"italic"},"text":". Pick ","element":"span"},{"style":{"height":9.6},"width":67.98,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-76.png","element":"img","alt":" τ >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", and let ","element":"span"},{"style":{"height":13.19},"width":202.34,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-77.png","element":"img","alt":" n_T = C_4 T^τ","inline":true},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":16},"width":195.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-78.png","element":"img","alt":"C_4 = 
2V(D","inline":true},{"text":")(2","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-79.png","element":"img","alt":"τ","inline":true,"padRight":true},{"text":"+ 1)","element":"span"},{"style":{"fontStyle":"italic"},"text":". Then, the following bound holds true:","element":"span"}],[{"style":{"width":"93%"},"width":875,"height":244,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-80.png","element":"img"}],[{"style":{"height":16},"width":250.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-81.png","element":"img","alt":"for any T_* ∈ {","inline":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , n","element":"span"},{"style":{"height":16},"width":57.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-82.png","element":"img","alt":"T }.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":14.18},"width":293.38,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-83.png","element":"img","alt":" ε = d^{1/2} T^{−τ/d}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":254.67,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-84.png","element":"img","alt":" δ = T^{−(τ+1)}","inline":true},{"text":". 
Lemma ","element":"span"},{"href":"#id-65","text":"7.7 ","element":"a"},{"text":"provides ","element":"span"},{"text":"the ","element":"span"},{"text":"existence ","element":"span"},{"text":"of ","element":"span"},{"text":"a ","element":"span"},{"text":"discretization ","element":"span"},{"style":{"height":13.19},"width":55.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-85.png","element":"img","alt":"D_T","inline":true,"padRight":true},{"text":"of ","element":"span"},{"text":"size ","element":"span"},{"style":{"height":9.19},"width":46.92,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-86.png","element":"img","alt":"n_T","inline":true,"padRight":true},{"text":"which ","element":"span"},{"text":"is ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-87.png","element":"img","alt":"ε","inline":true},{"text":"-dense, and ","element":"span"},{"text":"for ","element":"span"},{"text":"which ","element":"span"},{"style":{"height":20.4},"width":167.91,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-88.png","element":"img","alt":"n_T^{-1} ∑_{t=1}^{T_*}","inline":true,"padRight":true},{"text":"ˆ","element":"span"},{"style":{"height":20.4},"width":418.58,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-89.png","element":"img","alt":"λ_t ≥ ∑_{t=1}^{T_*} λ_t − δ","inline":true},{"text":". 
","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":19.2},"width":167.91,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-90.png","element":"img","alt":"n_T^{-1} ∑_{t=1}^{n_T}","inline":true,"padRight":true},{"text":"ˆ","element":"span"},{"style":{"height":19.14},"width":488.71,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-91.png","element":"img","alt":"λ_t = 1 = ∑_{t≥1} λ_t","inline":true},{"text":", ","element":"span"},{"text":"then ","element":"span"},{"style":{"height":18.33},"width":112.83,"height":45.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-92.png","element":"img","alt":"∑_{t>T_*}","inline":true,"padRight":true},{"text":"ˆ","element":"span"},{"style":{"height":16},"width":235.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-93.png","element":"img","alt":"λ_t ≤ B_k(T_*","inline":true},{"text":") + ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-94.png","element":"img","alt":" δ","inline":true},{"text":". ","element":"span"},{"text":"The statement follows by using Lemma ","element":"span"},{"href":"#id-66","text":"7.8 ","element":"a"},{"text":"with these bounds, and finally employing Lemma ","element":"span"},{"href":"#id-67","text":"7.5","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.3. 
Proof of Theorem ","element":"span"},{"href":"#id-41","style":{"fontWeight":"bold"},"text":"5","element":"a"}],[{"text":"In this section, we instantiate Theorem ","element":"span"},{"href":"#id-65","text":"8 ","element":"a"},{"text":"in order to obtain bounds on ","element":"span"},{"style":{"height":10.4},"width":43.63,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-95.png","element":"img","alt":" γ_T","inline":true,"padRight":true},{"text":"for Squared Exponential and Matérn kernels, results which are summarized in Theorem ","element":"span"},{"href":"#id-41","text":"5","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"63%"},"width":596,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-96.png","element":"img"}],[{"id":"id-66","text":"For the Squared Exponential kernel ","element":"span"},{"style":{"height":16},"width":149.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-97.png","element":"img","alt":" k, B_k(T_*","inline":true},{"text":") is given by ","element":"span"},{"href":"#id-42","referenceIndex":29,"text":"Seeger et al. ","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"(","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"2008","element":"a"},{"text":"). ","element":"span"},{"text":"While ","element":"span"},{"style":{"height":16},"width":65.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/15-98.png","element":"img","alt":" µ(x","inline":true},{"text":") was Gaussian there, the same decay rate holds for ","element":"span"},{"style":{"height":13.19},"width":38.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-0.png","element":"img","alt":" λ_s","inline":true,"padRight":true},{"text":"w.r.t. 
uniform ","element":"span"},{"style":{"height":16},"width":65.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-1.png","element":"img","alt":"µ(x","inline":true},{"text":"), although the constants might change. In hindsight, it turns out that ","element":"span"},{"style":{"height":10.8},"width":116.53,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-2.png","element":"img","alt":" τ = d","inline":true,"padRight":true},{"text":"is the optimal choice for the discretization size, rendering the second term in Theorem ","element":"span"},{"href":"#id-41","text":"5 ","element":"a"},{"text":"to be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1), which is subdominant and will be neglected in the sequel. ","element":"span"},{"text":"We have that ","element":"span"},{"style":{"height":19},"width":217.39,"height":47.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-3.png","element":"img","alt":" λ_s ≤ c B^{s^{1/d}}","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B < ","element":"span"},{"text":"1. Following their analysis,","element":"span"}],[{"style":{"width":"71%"},"width":666,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":6.8},"width":113.53,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-5.png","element":"img","alt":" α = −","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":19.88},"width":248.08,"height":49.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-6.png","element":"img","alt":" B, β = α T_*^{1/d}","inline":true,"padRight":true},{"text":". 
Therefore, ","element":"span"},{"style":{"height":16},"width":166.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-7.png","element":"img","alt":" B_k(T_*) =","inline":true},{"style":{"height":20.68},"width":425.24,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-8.png","element":"img","alt":"O(e^{−β} β^{d−1}), β = α T_*^{1/d}","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"We have to pick ","element":"span"},{"style":{"height":13.19},"width":39.29,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-9.png","element":"img","alt":" T_*","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":13.78},"width":61.46,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-10.png","element":"img","alt":" e^{−β}","inline":true,"padRight":true},{"text":"is not much larger than (","element":"span"},{"style":{"height":17.38},"width":134.25,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-11.png","element":"img","alt":"T n_T)^{−1}","inline":true},{"text":". Suppose that ","element":"span"},{"style":{"height":17.38},"width":345.63,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-12.png","element":"img","alt":" T_* = [log(T n_T)/α]^d","inline":true},{"text":", so that ","element":"span"},{"style":{"height":17.78},"width":531.45,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-13.png","element":"img","alt":" e^{−β} = (T n_T)^{−1}, β = log(T n_T","inline":true,"padRight":true},{"text":"). 
The bound becomes","element":"span"}],[{"style":{"width":"88%"},"width":824,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-14.png","element":"img"}],[{"text":"with ","element":"span"},{"style":{"height":15.77},"width":214.85,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-15.png","element":"img","alt":" n_T = C_4 T^d","inline":true},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"). ","element":"span"},{"text":"The first part dominates, so that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":140.31,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-16.png","element":"img","alt":" γ_T = O","inline":true},{"text":"([log(","element":"span"},{"style":{"height":13.38},"width":85.88,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-17.png","element":"img","alt":"T^{d+1}","inline":true},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"))]","element":"span"},{"style":{"height":17.38},"width":121.21,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-18.png","element":"img","alt":"^{d+1}) =","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"((log ","element":"span"},{"style":{"height":17.38},"width":101.37,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-19.png","element":"img","alt":" T)^{d+1}","inline":true},{"text":"). 
","element":"span"},{"text":"This ","element":"span"},{"text":"should ","element":"span"},{"text":"be ","element":"span"},{"text":"compared ","element":"span"},{"text":"with ","element":"span"},{"text":"E","element":"span"},{"text":"[I(","element":"span"},{"style":{"height":16},"width":248.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-20.png","element":"img","alt":"y_T ; f_T)] = O","inline":true},{"text":"((log ","element":"span"},{"style":{"height":17.38},"width":101.37,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-21.png","element":"img","alt":" T)^{d+1}","inline":true},{"text":") given by ","element":"span"},{"href":"#id-42","referenceIndex":29,"text":"Seeger et al. ","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"(","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"2008","element":"a"},{"text":"), where the ","element":"span"},{"style":{"height":9.59},"width":38.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-22.png","element":"img","alt":" x_t","inline":true,"padRight":true},{"text":"are drawn independently from a Gaussian base distribution. 
","element":"span"},{"text":"At least restricted to a compact set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", we obtain the same expression to leading order for max","element":"span"},{"style":{"height":11.2},"width":66.12,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-23.png","element":"img","alt":"{x_t}","inline":true,"padRight":true},{"text":"I(","element":"span"},{"style":{"height":14.7},"width":117.85,"height":36.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-24.png","element":"img","alt":"y_T ; f_T","inline":true,"padRight":true},{"text":").","element":"span"}],[{"style":{"width":"35%"},"width":332,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-25.png","element":"img"}],[{"text":"For Matérn kernels ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"with roughness parameter ","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-26.png","element":"img","alt":" ν","inline":true},{"text":", ","element":"span"},{"style":{"height":16},"width":104.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-27.png","element":"img","alt":"B_k(T_*","inline":true},{"text":") is given by ","element":"span"},{"href":"#id-42","referenceIndex":29,"text":"Seeger et al. ","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"(","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"2008","element":"a"},{"text":") for the uniform base distribution ","element":"span"},{"style":{"height":16},"width":65.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-28.png","element":"img","alt":" µ(x","inline":true},{"text":") on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". 
","element":"span"},{"text":"Namely, ","element":"span"},{"style":{"height":13.2},"width":95.29,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-29.png","element":"img","alt":" λ_s ≤","inline":true},{"style":{"height":14.18},"width":193.83,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-30.png","element":"img","alt":"c s^{−(2ν+d)/d}","inline":true,"padRight":true},{"text":"for almost all ","element":"span"},{"style":{"height":11.6},"width":131.88,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-31.png","element":"img","alt":" s ∈ N","inline":true},{"text":", and ","element":"span"},{"style":{"height":16},"width":182.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-32.png","element":"img","alt":" B_k(T_*) =","inline":true},{"style":{"height":20.68},"width":250.94,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-33.png","element":"img","alt":"O(T_*^{1−(2ν+d)/d}","inline":true,"padRight":true},{"text":"). 
","element":"span"},{"text":"To match terms in the ˜","element":"span"},{"style":{"height":10.4},"width":43.63,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-34.png","element":"img","alt":"γ_T","inline":true,"padRight":true},{"text":"bound, we choose ","element":"span"},{"style":{"height":18.18},"width":345.51,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-35.png","element":"img","alt":" T_* = (T n_T)^{d/(2ν+d)}","inline":true},{"text":"(log(","element":"span"},{"style":{"height":13.19},"width":75.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-36.png","element":"img","alt":"T n_T","inline":true,"padRight":true},{"text":"))","element":"span"},{"style":{"height":5.2},"width":19,"height":13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-37.png","element":"img","alt":"^κ","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-38.png","element":"img","alt":"κ","inline":true,"padRight":true},{"text":"chosen below), so that the bound becomes","element":"span"}],[{"style":{"width":"95%"},"width":893,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-39.png","element":"img"}],[{"text":"with ","element":"span"},{"style":{"height":13.19},"width":194.57,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-40.png","element":"img","alt":" n_T = C_4 T^τ","inline":true},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"). 
For ","element":"span"},{"style":{"height":16},"width":275.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-41.png","element":"img","alt":" κ = −d/(2ν + d","inline":true},{"text":"), we obtain that the maximum over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"is ","element":"span"},{"style":{"height":16},"width":87.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-42.png","element":"img","alt":" O(T∗","inline":true,"padRight":true},{"text":"log(","element":"span"},{"style":{"height":16},"width":158.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-43.png","element":"img","alt":"TnT )) =","inline":true},{"style":{"height":18.18},"width":292.48,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-44.png","element":"img","alt":"O(T^{(τ+1)d/(2ν+d)}","inline":true},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":")). ","element":"span"},{"text":"Finally, we choose ","element":"span"},{"style":{"height":6.8},"width":82.2,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-45.png","element":"img","alt":" τ =","inline":true,"padRight":true},{"text":"2","element":"span"},{"style":{"height":16},"width":211.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-46.png","element":"img","alt":"νd/(2ν+d(d","inline":true},{"text":"+1)) to match this term with ","element":"span"},{"style":{"height":18.18},"width":169.2,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-47.png","element":"img","alt":" O(T^{1−τ/d}","inline":true},{"text":"). 
Plugging this in, we have ","element":"span"},{"style":{"height":17.38},"width":282.48,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-48.png","element":"img","alt":" γT = O(T^{1−2η}","inline":true},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":")), where ","element":"span"},{"style":{"height":19.37},"width":237.3,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-49.png","element":"img","alt":"η = ν/(2ν+d(d+1))","inline":true},{"text":". Together with Theorem ","element":"span"},{"href":"#id-37","text":"2 ","element":"a"},{"text":"(for ","element":"span"},{"style":{"height":9.6},"width":65.27,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-50.png","element":"img","alt":" ν >","inline":true,"padRight":true},{"text":"2), ","element":"span"},{"text":"we have that ","element":"span"},{"style":{"height":17.39},"width":278.44,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-51.png","element":"img","alt":" RT = O∗(T^{1−η}","inline":true},{"text":") (suppressing log factors): for any ","element":"span"},{"style":{"height":9.6},"width":71.65,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-52.png","element":"img","alt":" ν >","inline":true,"padRight":true},{"text":"2 and any dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", the ","element":"span"},{"text":"GP-UCB ","element":"span"},{"text":"algorithm is guaranteed to be no-regret in this case with arbitrarily high probability.","element":"span"}],[{"text":"How does this bound compare to the bound on ","element":"span"},{"text":"E","element":"span"},{"text":"[I(","element":"span"},{"style":{"height":14.7},"width":117.85,"height":36.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-53.png","element":"img","alt":"yT ; f 
T","inline":true,"padRight":true},{"text":")] given by ","element":"span"},{"href":"#id-42","referenceIndex":29,"text":"Seeger et al. ","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"(","element":"a"},{"href":"#id-42","referenceIndex":29,"text":"2008","element":"a"},{"text":")? Here, ","element":"span"},{"style":{"height":10.4},"width":87.81,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-54.png","element":"img","alt":" γT =","inline":true},{"style":{"height":18.19},"width":372.99,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-55.png","element":"img","alt":"O(T^{d(d+1)/(2ν+d(d+1))}","inline":true},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":")), while ","element":"span"},{"text":"E","element":"span"},{"text":"[I(","element":"span"},{"style":{"height":16},"width":206.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-56.png","element":"img","alt":"yT ; f T )] =","inline":true},{"style":{"height":18.19},"width":209.3,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-57.png","element":"img","alt":"O(T^{d/(2ν+d)}","inline":true},{"text":"(log ","element":"span"},{"style":{"height":18.19},"width":193.6,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-58.png","element":"img","alt":" T)^{2ν/(2ν+d)}","inline":true},{"text":").","element":"span"}],[{"style":{"width":"31%"},"width":293,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-59.png","element":"img"}],[{"text":"For linear kernels ","element":"span"},{"style":{"height":17.38},"width":419.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-60.png","element":"img","alt":" k(x, x′) = x^⊤ x′, x ∈ R^d","inline":true,"padRight":true},{"text":"with 
","element":"span"},{"style":{"height":16},"width":108.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-61.png","element":"img","alt":" ∥x∥ ≤","inline":true,"padRight":true},{"text":"1, we can bound ","element":"span"},{"style":{"height":10.4},"width":43.63,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-62.png","element":"img","alt":" γT","inline":true,"padRight":true},{"text":"directly. Let ","element":"span"},{"style":{"height":16},"width":369.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-63.png","element":"img","alt":" XT = [x1 . . . , xT ] ∈","inline":true},{"style":{"height":13.38},"width":93.28,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-64.png","element":"img","alt":"Rd×T","inline":true,"padRight":true},{"text":"with all ","element":"span"},{"style":{"height":16},"width":122.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-65.png","element":"img","alt":" ∥xt∥ ≤","inline":true,"padRight":true},{"text":"1. 
Now,","element":"span"}],[{"text":"log ","element":"span"},{"style":{"height":18.6},"width":697.58,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-66.png","element":"img","alt":" |I + σ^{−2} X_T^⊤ X_T| = log |I + σ^{−2} X_T X_T^⊤|","inline":true},{"style":{"height":12.8},"width":31,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-67.png","element":"img","alt":"≤","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":18.18},"width":201.88,"height":45.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-68.png","element":"img","alt":" |I + σ^{−2} D|","inline":true}],[{"text":"with ","element":"span"},{"style":{"height":18.73},"width":454.58,"height":46.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-69.png","element":"img","alt":" D = diag diag^{−1}(X_T X_T^⊤","inline":true,"padRight":true},{"text":"), by Hadamard’s ","element":"span"},{"text":"inequality. The largest eigenvalue ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":13.19},"width":39.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-70.png","element":"img","alt":"λ1","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":18.54},"width":129.8,"height":46.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-71.png","element":"img","alt":" X_T X_T^⊤","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"), ","element":"span"},{"text":"so that","element":"span"}],[{"style":{"width":"74%"},"width":699,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-72.png","element":"img"}],[{"text":"and 
","element":"span"},{"style":{"height":16},"width":168.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/0912.3995/images/16-73.png","element":"img","alt":" γT = O(d","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":").","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]