Corruption-Tolerant Gaussian Process Bandit Optimization

2020·Arxiv

Abstract

Abstract

We consider the problem of optimizing an unknown (typically non-convex) function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS), based on noisy bandit feedback. We consider a novel variant of this problem in which the point evaluations are not only corrupted by random noise, but also adversarial corruptions. We introduce an algorithm Fast-Slow GP-UCB based on Gaussian process methods, randomized selection between two instances labeled “fast” (but non-robust) and “slow” (but robust), enlarged con-fidence bounds, and the principle of optimism under uncertainty. We present a novel theoretical analysis upper bounding the cumulative regret in terms of the corruption level, the time horizon, and the underlying kernel, and we argue that certain dependencies cannot be improved. We observe that distinct algorithmic ideas are required depending on whether one is required to perform well in both the corrupted and non-corrupted settings, and whether the corruption level is known or not.

1 Introduction

Bandit optimization problems on large or continuous domains have far-reaching applications in modern machine learning and data science, including robotics [Lizotte et al., 2007], hyperparameter tuning [Snoek et al., 2012], recommender systems [Vanchinathan et al., 2014], environmental monitoring [Srinivas et al., 2010], and more. To make such problems tractable, one needs to exploit correlations between the rewards of “similar” actions. In the kernelized multi-armed bandit (MAB) problem, this is done by utilizing smoothness in the form of a low function norm in some Reproducing Kernel Hilbert Space (RKHS), permitting the application of Gaussian process (GP) methods [Srinivas et al.,

Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

2010,Chowdhury and Gopalan, 2017]. See [Rasmussen and Williams, 2006, Ch. 6] for an introduction to the connections between GPs and RKHS functions.

Key theoretical developments for the RKHS optimization problem have included both upper and lower bounds on the performance, measured via some notion of regret [Srinivas et al., 2010, Chowdhury and Gopalan, 2017,Scarlett et al., 2017]. The vast majority of these results have focused only on zero-mean additive noise in the point evaluations, and as a result, it is unclear to what extent the performance degrades under adversarial corruptions. Such considerations are of significant interest under erratic or unpredictable sources of corruption, and particularly arise when the samples may be perturbed by a malicious adversary. As we argue in Section 2, prominent algorithms such as GP-UCB [Srinivas et al., 2010] can be quite brittle in the face of such corruptions.

In this paper, we study the optimization of RKHS functions with both random noise and adversarial corruptions. We propose a novel algorithm and regret analysis building on recently-proposed techniques for the finite-arm stochastic MAB setting [Lykouris et al., 2018]. Specifically, we present a randomized algorithm Fast-Slow GP-UCB based on randomly choosing between a “fast” non-robust instance, and a “slow” robust instance. We bound the cumulative regret of Fast-Slow GP-UCB in terms of the adversarial corruption level, time horizon, and underlying kernel.

The kernelized setting comes with highly non-trivial additional challenges compared to the finite-arm setting, primarily due to the infinite action space and correlations between their associated function values. In particular, while correlations are undoubtedly benefi-cial in the non-corrupted setting (taking a given action permits learning something about similar actions), this benefit can lead to a hindrance in the corrupted setting: An adversary that corrupts a given sample can potentially damage our belief regarding many nearby function values. Moving beyond independent arms was posed as a open problem in [Gupta et al., 2019, Sec. 5.3].

Related work on GP optimization. Numerous

GP-based bandit optimization algorithms have been proposed in recent years [Srinivas et al., 2010,Hennig and Schuler, 2012,Hernández-Lobato et al., 2014,Bo- gunovic et al., 2016b,Wang and Jegelka, 2017,Shekhar and Javidi, 2018, Ru et al., 2017]. Beyond the standard setting, several important extensions have been considered, including multi-fidelity [Bogunovic et al., 2016b,Kandasamy et al., 2017,Song et al., 2019], contextual and time-varying settings [Krause and Ong, 2011,Valko et al., 2013,Bogunovic et al., 2016a], safety requirements [Sui et al., 2015], high-dimensional settings [Djolonga et al., 2013,Kandasamy et al., 2015,Rol- land et al., 2018], and many more.

Certain types of corruption-tolerant GP-based optimization algorithms have been explored previously, with the defining features including (i) whether the corruption applies to the input (i.e., action) or the output (i.e., reward function), (ii) whether all samples are corrupted, or only a final reported point is corrupted, and (iii) whether the corruptions are random or adversarial. The case of random input noise on all samples was studied in [Beland and Nair, 2017,Nogueira et al., 2016,Dai Nguyen et al., 2017]. Perhaps closer to our work is [Martinez-Cantin et al., 2018], considering function outliers; however, no specific corruption model was adopted, and no theoretical regret bounds were given.

In [Bogunovic et al., 2018a], bounds on the simple regret are given for the case that the final reported input is adversarially perturbed, whereas the selected inputs are only subject to random output noise. This makes it desirable to seek broad peaks, which bears some similarity to the input noise viewpoint [Beland and Nair, 2017,Nogueira et al., 2016,Dai Nguyen et al., 2017] and level-set estimation [Gotovos et al., 2013, Bogunovic et al., 2016b]. Our goal of attaining small cumulative regret under input perturbations requires very different techniques from these previous works. Another distinct notion of robustness is considered in [Bogunovic et al., 2018b], in which some experiments in a batch may fail to produce an outcome. None of the preceding works provide regret bounds in the case of non-stochastic corrupted observations.

Related work on corrupted bandits. Adversarially corrupted observations have recently been considered in the finite-arm stochastic MAB problem under various corruption models [Lykouris et al., 2018,Gupta et al., 2019, Kapoor et al., 2019]. As mentioned above, [Lykouris et al., 2018] adopted a “fast-slow” algorithmic approach; this led to regret bounds of the form , where is a standard regret bound for the non-corrupted MAB setting. In [Gupta et al., 2019], this bound was improved to using an epoch-based approach in which the estimates of the arms’ means are reset after each epoch, and the previous epoch guides which arms are selected in the next one.

Our algorithmic approach is based on that of [Lyk- ouris et al., 2018]; however, the bulk of the theoretical analysis requires novel ideas. In particular, our need to handle an infinite action space with correlated rewards between actions poses considerable challenges, as discussed above. In more detail, we note the following:

• Even when studying the case of a known corruption level (which is done as a stepping stone towards our main results), it is non-trivial to characterize the effect of the corruptions (see Lemma 2 below);

• Characterizing that certain suboptimal points are never sampled after a certain time requires signifi-cant technical effort (see Lemmas 7 and 8 below);

• We adopt a UCB-style approach (Alg. 2) complementary to the elimination-style approach of [Lyk- ouris et al., 2018], and the former kind may be of independent interest even in the finite-arm setting.

In a parallel independent work [Li et al., 2019], cumulative regret bounds were given for stochastic linear bandits, which are a special case of the GP setting (with a linear kernel). The algorithm of [Li et al., 2019] is in fact more akin to that of [Gupta et al., 2019], which is potentially preferable due the latter attaining better bounds in the finite-arm setting. However, the algorithm and results of [Li et al., 2019] crucially rely on the notion of gaps between the function values of corner points in the domain, and the idea of exploiting these gaps for linear bandits has no apparent generalization to the GP setting with general kernels. In addition, even when we specialize to the linear kernel, neither our results nor those of [Li et al., 2019] imply each other, and the two both have benefits not provided by the other; see Appendix K for details.

Outline. We introduce the corruption-tolerant kernelized MAB problem in Section 2, and then present algorithms for three settings with increasing difficulty: Known corruption level (Section 3), simultaneous handing of no corruption and a known corruption level (Section 4), and unknown corruption level (Section 5).

2 Problem Statement

We consider the problem of sequentially maximizing a fixed unknown function , where is a compact set and . We assume that D is endowed with a kernel function de-fined on , and the kernel is normalized to satisfy for all . We also assume that f has a bounded norm in the corresponding Reproducing Kernel Hilbert Space (RKHS) , i.e., . This assumption permits the construction of confidence bounds via Gaussian process (GP) methods (see Lemma 1 below).

In the non-corrupted setting, at every time step t, we choose , and observe a noisy function value . In this work, we consider the corrupted setting, where we only observe an adversarially corrupted sample . Formally, for each t = 1, . . . , T:

• Based on the previous decisions and corresponding corrupted observations , the player selects a probability distribution

• Based on the knowledge of the true function the previous decisions and corresponding observations , and the player’s distribu- tion , the adversary chooses the corruptions

• The agent draws at random from observes the noisy and corrupted observation:

where is the noisy non-corrupted observation: , where with independence between times t.

Note that the adversary is allowed to be adaptive, i.e., the corruptions may depend on the agent’s previously selected points and corresponding stochastic observations, as well as the distribution of the player’s next choice, but not its specific realization

We say that the problem instance is C-corrupted (i.e., the corruption level is C) if

Clearly, when C = 0, we recover the standard non-corrupted setting. We measure the performance using the cumulative regret, which is also typically used in the non-corrupted bandit setting [Srinivas et al., 2010]:

where . As noted in [Lykouris et al., 2018], one could alternatively define the cumulative regret with respect to the corrupted values ; the two notions coincide to within at most 2C, and such a difference will be negligible in our regret bound anyway. In Appendix C, we outline how our results can be adapted for simple regret (i.e., the regret of a point reported at the end of T rounds).

2.1 Standard (non-corrupted) setting

In the non-corrupted setting, existing algorithms use Gaussian likelihood models for the observations and zero-mean GP priors for modeling the uncertainty in f. Posterior updates are performed according to a “fictitious” model in which the noise variables are drawn independently across t from , where is a hyperparameter that may differ from the true noise variance . Given a sequence of inputs and their noisy observations , the posterior distribution under this GP(0, k) prior is also Gaussian, with the mean and variance

where is the kernel matrix. Common kernels include the linear, squared exponential (SE) and Matérn kernels.

The main quantity that characterizes the regret bounds in the non-corrupted setting [Srinivas et al., 2010, Chowdhury and Gopalan, 2017] is the maximum information gain, defined at time t as

For compact and convex domains, is sublinear in t for various classes of kernels, e.g., for the SE kernel, and for the Matérn kernel with [Srinivas et al., 2010].

The following well-known result of [Abbasi-Yadkori, 2013] provides confidence bounds around the unknown function in the non-corrupted setting.

Lemma 1. Fix with , and consider the sampling model , with independent noise . Under the choice

the following holds with probability at least

where are given in (4) and (5).

This lemma follows directly from [Abbasi-Yadkori, 2013, Theorem 3.11] (and [Abbasi-Yadkori, 2013, Remark 3.13]) and the definition (6) of

tions. In the noisy non-corrupted setting, several algorithms have been developed and analyzed. A particularly well-known example is GP-UCB, which selects . GP-UCB achieves sublinear cumulative regret with

Figure 1: (Left) Function f, its global maximizer , a local maximizer , and the corruption region. (Middle) GP-UCB eliminates the optimal region (and ) early on due to the corruptions, and continues sampling points in the suboptimal region around . (Right) Our corruption-aware algorithm (see Algorithm 1) does not eliminate the optimal region, and after the corruption budget is exhausted, it identifies the true maximizer

high probability [Srinivas et al., 2010,Chowdhury and Gopalan, 2017], for a suitably chosen (e.g., as in (7)). Despite this success in the non-corrupted setting, these algorithms can fail under adversarial corruptions.

An illustrative example is provided in Figure 1. Observations that correspond to the points sampled in the shaded region around the global maximizer rupted by the value , up to a total corruption budget (C = 3.5). In Figure 1 (Middle), the points selected by GP-UCB for t = 50 time steps are shown. GP-UCB eliminates the global maximizer early on due to corruptions, and later on, it only selects points from the suboptimal region and consequently suffers linear cumulative regret. In the subsequent sections, we design algorithms that are robust to corruptions, and are able to identify the true maximizer after the corruption budget C is exhausted (see Figure 1 (Right)).

3 Known Corruption Setting

We first consider the case that the total corruption C in (2) is known. Given a sequence of inputs and their corrupted observations (with ), we form a posterior mean according to a prior and sampling noise as follows:

where . Note that this matches the posterior mean formed in the non-corrupted setting, simply replacing by . In addition, we form the same posterior standard deviation as in the non-corrupted setting. The role of the parameter is discussed in Appendix I.

The following lemma provides an upper bound on the difference between the non-corrupted and corrupted posterior means, and is proved using the definitions of and along with RKHS function properties. All proofs can be found in the supplementary material.

Lemma 2. For any and , we have , where and are given in (4) and (5), and is given in (9), with

By combining Lemmas 1 and 2, we obtain the following.

. Under the choice with given in (7) and , we have with probability at least

where are given in (9) and (5).

In Algorithm 1 (), we present an upper confi- dence bound based algorithm with enlarged confidence bounds in accordance with Lemma 3. We explicitly define these confidence bounds as follows:

Once the validity of these confidence bounds is established via (10), one can use standard analysis techniques [Srinivas et al., 2010] to bound the cumulative regret. This is formally stated in the following.

Lemma 4. Under the choice of in Lemma 3 and , conditioned on the event (10), the cumulative regret incurred by Algorithm 1 satisfies

The main theorem of this section is now obtained via a direct combination of Lemmas 3 and 4.

Theorem 5. In the C-corrupted setting, Algorithm 1 with and set as in Lemma 3, attains, with probability at least , cumulative regret

Note that when C = 0, this result recovers known non-corrupted cumulative regret bounds (cf. [Chowdhury and Gopalan, 2017, Theorem 3]). More generally, we can decompose the obtained regret bound into two terms: behaves as

(14) The obtained regret bound can be made more explicit by substituting the bound on for particular kernels [Srinivas et al., 2010], e.g., for the SE kernel we obtain . In Appendix J, we argue that the linear dependence on C is unavoidable for any algorithm, and discuss cases where the dependence on T is near-optimal. However, we do not necessarily claim that the joint dependence on C and T is optimal; this is left for future work.

4 Known-or-Zero Corruption Setting

In the previous section, we assumed that the upper bound C on the total corruption is known and the problem is C-corrupted. In this section, we also assume that C is known, but we consider a scenario in which the problem may be either C-corrupted or non-corrupted (i.e., the standard setting). Our goal is to develop an algorithm that has a similar guarantee to the previous section in the corrupted case, while also attaining a similar guarantee to GP-UCB [Srinivas et al., 2010] in the non-corrupted case, and thus obtaining strong guarantees in the two settings simultaneously. Theorem 5 fails to achieve this goal, since the regret depends on C even if the problem is non-corrupted.

Our algorithm Fast-Slow GP-UCB is described in Algorithm 2. It makes use of two instances labeled F (fast; Line 6) and S (slow; Line 8). The S instance is played with probability 1/C, while the rest of the time F is played. The intuition is that F shrinks the confi-dence bounds faster but is not robust to corruptions, while S is slower but robust to corruptions. We formalize this intuition below in Lemma 6 and (20)–(21).

The instances use the following confidence bounds depending on an exploration parameter and an additional parameter whose role is discussed in Appendix I and after Lemma 8 below:

where is the number of times an instance has been selected at a given time instant. We also make use of the following intersected confidence bounds, which have the convenient feature of being monotone:

In Fast-Slow GP-UCB, we check if the following condition (Line 12) holds:

In the non-corrupted setting, under the high-probability event in Lemma 1 (for both F and S), this condition never holds. Hence, when it does hold, we have detected that the problem is C-corrupted.In such a case, the algorithm permanently switches to running Algorithm 1 with C as the input. Note that we can check the condition in (19) by using a global optimizer to find a minimizer of , and checking whether its value is smaller than 0.

Finally, the inner minimization over F instance, together with the validity of the condition (19), ensures that F does not select a point that is already “ruled out” by the robust instance S. We make this statement precise in Lemma 7 below.

4.1 Analysis

First, we provide a high-probability bound on the total corruption that is observed by the S instance. Specifi-cally, we show that because it is sampled with probability 1/C, the total corruption observed by S is constant with high probability, i.e., it is upper bounded by a value not depending on T.

Lemma 6. The S instance in Fast-Slow GPUCB observes, with probability at least , a total corruption of at most

We now fix a constant and condition on three high-probability events:

1. If setting is non-corrupted, the following holds with probability at least

for all and . This claim follows from Lemma 1 by setting the corresponding failure probability to

, then the following holds in both the non-corrupted and corrupted settings with probability at least

for all and . This follows from Lemmas 3 and 6 (using in place of C in Lemma 3), by setting the corresponding failure probabilities to in both. Taking the union bound over these two events establishes the claim. Note that (21) corresponds to , but directly implies an analogous condition for all (since increasing widens the confidence region (15)–(16)).

3. If the condition in (19) is detected at any time instant, then Algorithm 2 permanently switches to running Algorithm 1. If Algorithm 1 is run with

for all and , under the definitions in (11). This is by Lemma 3 with in place of

By the union bound, (20)–(22) all hold with probability at least . In addition, by the definitions in (17), these properties remain true when and are replaced by

The confidence bounds of F are only valid in the non-corrupted case, and hence, in the case of corruptions we rely on the confidence bounds of S. Specifically, we show that F never queries a point that is strictly suboptimal according to the confidence bounds of S.

Lemma 7. Suppose that (20) and (21) hold. For any time , if in Fast-Slow GP-UCB, then the selected point

represents the set of strictly suboptimal points according to the intersected S-confidence bounds.

By the monotonicity of and , the set is non-shrinking in t. The proof shows that F always favors from (23) over , i.e., has a higher value of (see Line 6 of Algorithm 2). To show this, we upper bound in terms of and lower bound in terms of via the confidence bounds and condition (19).

The next lemma characterizes the number of queries made by the S instance before a suboptimal point becomes “eliminated”, i.e., the time after which the point belongs to

Lemma 8. Suppose that the S instance is run with corresponding to (21) and . Then, conditioned on the high-probability confidence bounds in (21), for any given suboptimal point such that , it holds that

This lemma’s proof is perhaps the trickiest, and crucially relies on the fact that . We show that by the time given in (24), we have encountered a round i in which a -optimal point is queried with the con- fidence width also being at most . This means that is much closer to optimal than the -suboptimal point x in the lemma statement. Using the fact that had a higher UCB score than x, we can also deduce that the posterior standard deviation at x was not too large. Since replacing by (as done in the definition of in (23)) halves the confidence width, we can combine the above findings to deduce that the confidence bounds indeed rule out x at time , and hence also for all subsequent times due to the monotonicity of the confidence bounds.

Finally, we state the main theorem of this section, whose proof combines the preceding lemmas.

Theorem 9. For any with , let , and consider Fast-Slow GP-UCB run with

and set in Algorithm 1 as . Then, after T rounds, with probability at least the cumulative regret satisfies

(27) in the non-corrupted case and

in the corrupted case.

The non-corrupted case is straightforward to prove, essentially applying standard arguments separately to F and S. The corrupted case requires more effort. Lemma 8 characterizes the time after which points with a given regret are no longer sampled by S, which permits bounding the cumulative regret incurred by S. By Lemma 7, the points in are also not sampled by F, and on average F is played at most C times more frequently than S. Converting this average to a high-probability bound using basic concentration, this factor of , and we obtain (28).

Using the notation to hide logarithmic factors, the bound obtained in the non-corrupted case simplifies to , and unlike the result from Theorem 5, it does not depend on C. The obtained bound is only a constant factor away from the standard non-corrupted one (cf. (14)), while at the same time our algorithm achieves in the C-corrupted case. As before, we can make the results obtained in this theorem more explicit by substituting the bounds for for various kernels of interest [Srinivas et al., 2010].

5 Unknown Corruption Setting

In this section, we assume that the total corruption C defined in (2) is unknown to the algorithm. Despite this additional challenge, most of the details are similar to the known-or-zero setting, so to avoid repetition, we omit some details and focus on the key differences.

Algorithm. Our corruption-agnostic algorithm is shown in Algorithm 3. We again take inspiration from the finite-arm counterpart [Lykouris et al., 2018], con- sidering that are sampled with probability (with any remaining probability going to layer 1). The idea is that any layer with is robust, for the same reason that the S instance is robust in Fast-Slow GP-UCB (Algorithm 2).

Each instance makes use of confidence bounds defined as follows for some parameters to be chosen later:

where denotes the number of times instance has been selected by time t, and . Similarly to the Section 4, we define intersected confidence bounds:

Each instance selects a point according to , where represents a set of potential maximizers at time t, i.e., a set of points that could still be the global maximizer according to the confidence bounds. More formally, these sets are defined recursively as follows:

Two key properties of these sets are: (i) for every and due to the monotonicity of the confidence bounds; and (ii) . The latter property

implies that once a point is eliminated at some layer it is also eliminated from all , while the former property ensures that it remains eliminated for all subsequent time steps {t + 1, . . . , T}.

Similarly to Fast-Slow GP-UCB, each layer uses

suffices) in its acquisition function, while replacing by 1 in the confidence bounds when constructing the set of potential maximizers. This is done to permit the application of Lemma 8; the intuition behind doing so is discussed in Appendix I and following Lemma 8.

In the case that corresponding to the selected at time t is empty, the algorithm finds the lowest layer i for which , and selects the point that maximizes that layer’s upper confidence bound. In this case, the algorithm makes no changes to the confidence bounds or the sets of potential maximizers.

theoretical analysis in place, we can also obtain a nearidentical regret bound in the case of unknown C. We only provide a brief outline here, with further details in the supplementary material.

We let the robust layer play the role of F and eliminate suboptimal points. Since regret incurred in the lower layers is at most a factor 2C higher than that of layer on average, and this leads to a similar analysis to that used in the proof of Theorem 9. Our final main result is stated as follows.

Theorem 10. For any with , and any , under the parameters

we have that for any unknown corruption level C > 0, the cumulative regret of Algorithm 3 satisfies

with probability at least

This has the same form as (28), with in place of (since there are

6 Conclusion

We have introduced the kernelized MAB problem with adversarially corrupted samples. We provided novel algorithms based on enlarged confidence bounds and randomly-selected fast/slow instances that are provably robust against such corruptions, with the regret bounds being linear in the corruption level. To our knowledge, we are the first to handle this form of adversarial corruption in any bandit problem with an infinite action space and correlated rewards, which are two key notions that significantly complicate the analysis.

An immediate direction for further research is to better understand the joint dependence on the corruption level C and time horizon T. The linear O(C) dependence is unavoidable (see Appendix J), and the dependence matches well-known bounds for the non-corrupted setting [Srinivas et al., 2010, Chowdhury and Gopalan, 2017] (in some cases having near-matching lower bounds [Scarlett et al., 2017]), but it is unclear whether the product of these two terms is unavoidable.

Acknowledgments

This work was gratefully supported by Swiss National Science Foundation, under the grant SNSF NRP 75 407540_167189, ERC grant 815943 and ETH Zürich Postdoctoral Fellowship 19-2 FEL-47. J. Scarlett was supported by the Singapore National Research Foundation (NRF) under grant number R-252-000-A74-281.

References

[Abbasi-Yadkori, 2013] Abbasi-Yadkori, Y. (2013). Online learning for linearly parametrized control problems.

[Abbasi-Yadkori et al., 2011] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.

[Beland and Nair, 2017] Beland, J. J. and Nair, P. B. (2017). Bayesian optimization under uncertainty. NIPS BayesOpt 2017 workshop.

[Beygelzimer et al., 2011] Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26.

[Bogunovic et al., 2016a] Bogunovic, I., Scarlett, J., and Cevher, V. (2016a). Time-varying Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 314–323.

[Bogunovic et al., 2018a] Bogunovic, I., Scarlett, J., Jegelka, S., and Cevher, V. (2018a). Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), pages 5760–5770.

[Bogunovic et al., 2016b] Bogunovic, I., Scarlett, J., Krause, A., and Cevher, V. (2016b). Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems (NIPS), pages 1507–1515.

[Bogunovic et al., 2018b] Bogunovic, I., Zhao, J., and Cevher, V. (2018b). Robust maximization of nonsubmodular objectives. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 890–899.

[Chowdhury and Gopalan, 2017] Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. In International Conference on Machine Learning (ICML), pages 844–853.

[Dai Nguyen et al., 2017] Dai Nguyen, T., Gupta, S., Rana, S., and Venkatesh, S. (2017). Stable Bayesian optimization. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 578–591. Springer.

[Djolonga et al., 2013] Djolonga, J., Krause, A., and Cevher, V. (2013). High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems (NIPS), pages 1025–1033.

[Freedman et al., 1975] Freedman, D. A. et al. (1975). On tail probabilities for martingales. Annals of Probability, 3(1):100–118.

[Gotovos et al., 2013] Gotovos, A., Casati, N., Hitz, G., and Krause, A. (2013). Active learning for level set estimation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1344–1350.

[Gupta et al., 2019] Gupta, A., Koren, T., and Talwar, K. (2019). Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory (COLT).

[Hennig and Schuler, 2012] Hennig, P. and Schuler, C. J. (2012). Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837.

[Hernández-Lobato et al., 2014] Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems (NIPS), pages 918–926.

[Kandasamy et al., 2017] Kandasamy, K., Dasarathy, G., Schneider, J., and Póczos, B. (2017). Multi-fidelity bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1799–1808. JMLR. org.

[Kandasamy et al., 2015] Kandasamy, K., Schneider, J., and Póczos, B. (2015). High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning (ICML), pages 295–304.

[Kapoor et al., 2019] Kapoor, S., Patel, K. K., and Kar, P. (2019). Corruption-tolerant bandit learning. Machine Learning, 108(4):687–715.

[Krause and Ong, 2011] Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2447–2455.

[Li et al., 2019] Li, Y., Lou, E. Y., and Shan, L. (2019). Stochastic linear optimization with adversarial corruption. arXiv preprint arXiv:1909.02109.

[Lizotte et al., 2007] Lizotte, D. J., Wang, T., Bowling, M. H., and Schuurmans, D. (2007). Automatic

gait optimization with Gaussian process regression. In International Joint Conference on Artificial Intelligence (IJCAI), pages 944–949.

[Lykouris et al., 2018] Lykouris, T., Mirrokni, V., and Paes Leme, R. (2018). Stochastic bandits robust to adversarial corruptions. In ACM Symposium on Theory of Computing (STOC), pages 114–122. ACM.

[Martinez-Cantin et al., 2018] Martinez-Cantin, R., Tee, K., and McCourt, M. (2018). Practical Bayesian optimization in the presence of outliers. In International Conference on Artificial Intelligence and Statistics (AISTATS).

[Nogueira et al., 2016] Nogueira, J., Martinez-Cantin, R., Bernardino, A., and Jamone, L. (2016). Unscented Bayesian optimization for safe robot grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[Rasmussen and Williams, 2006] Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning, volume 1. MIT press Cambridge.

[Rolland et al., 2018] Rolland, P., Scarlett, J., Bogunovic, I., and Cevher, V. (2018). High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Ar-tificial Intelligence and Statistics (AISTATS), pages 298–307.

[Ru et al., 2017] Ru, B., Osborne, M., and McLeod, M. (2017). Fast information-theoretic Bayesian optimisation. arXiv preprint arXiv:1711.00673.

[Scarlett et al., 2017] Scarlett, J., Bogunovic, I., and Cevher, V. (2017). Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory (COLT).

[Shekhar and Javidi, 2018] Shekhar, S. and Javidi, T. (2018). Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829–3874.

[Snoek et al., 2012] Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural information Processing Systems (NIPS), pages 2951–2959.

[Song et al., 2019] Song, J., Chen, Y., and Yue, Y. (2019). A general framework for multi-fidelity bayesian optimization with gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS).

[Srinivas et al., 2010] Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), pages 1015–1022.

[Sui et al., 2015] Sui, Y., Gotovos, A., Burdick, J., and Krause, A. (2015). Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning (ICML), pages 997–1005.

[Valko et al., 2013] Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. Uncertainty In Artificial Intelligence (UAI).

[Vanchinathan et al., 2014] Vanchinathan, H. P., Nikolic, I., De Bona, F., and Krause, A. (2014). Explore-exploit in top-n recommender systems via Gaussian processes. In ACM Conference on Recommender Systems, pages 225–232.

[Wang and Jegelka, 2017] Wang, Z. and Jegelka, S. (2017). Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning (ICML), pages 3627–3635.

All citations below are to the reference list in the main document.

A Proof of Lemma 2 (Corrupted vs. Non-Corrupted Posterior Mean)

Our analysis uses techniques from [Chowdhury and Gopalan, 2017, Appendix C]. Let x be any point in D, and fix a time index . From the definitions of (Eq. (4), (9) and (1)), we have

where and . We proceed by upper bounding the absolute difference between

Let denote the RKHS associated with the kernel k and domain D. We define , where maps to the RKHS associated with the kernel. For any two functions in , we write to denote the kernel inner product , which implies that . By the RKHS reproducing property, i.e., , and the fact that can write

for all . It also follows that where , and . (Here and subsequently, the notation similarly extends to matrix multiplication operations.)

Using these properties, we can characterize the second term of (39) as follows:

where:

• Eq. (42) is by Cauchy-Schwartz.

• The first term in (44) follows from the following identity:

• In (46), denotes the largest eigenvalue of

• Eq. (47) follows since ), and since the norm is always an upper bound on the

• Eq. (48) follows since

This follows since all eigenvectors of are also eigenvectors of , and hence, the eigenvalues of are of the form , all the eigenvalues of

B Proof of Lemma 4 (Regret Bound with Known Corruption)

Conditioned on the confidence bounds (10) being valid according to Lemma 3, we have

where (52) uses the lower confidence bound from (10), (53) uses the upper confidence bound from (10), and (54) uses the selection rule in (13).

When , we have from [Chowdhury and Gopalan, 2017, Lemma 4] that

This is a variant of a more widely-used upper bound on in terms of from [Srinivas et al., 2010].

We set in accordance with the lemma statement, and sum over the time steps:

where (58) uses (55) and the monotonicity of substitutes the choice of applies . Hence, we have , which establishes the lemma.

C Bounding the Simple Regret

While we have focused exclusively on the cumulative regret in our exposition, we can easily adapt our analysis to handle the simple regret similarly to the idea used in the proof of [Bogunovic et al., 2018a, Theorem 1]. We outline this procedure for Theorem 5, since all of the other results can be adapted in the same manner.

We claim that under the setup of Theorem 5, for a given , Algorithm 1 achieves rounds, where the reported point is defined as

To prove this claim, we continue from the end of Appendix B. We set as before, and define

Using (52), we have . From the definition of the reported point in (61), we have that is the time index with the smallest value of . It follows that

where (62) upper bounds the minimum by the average, (63) uses the monotonicity of

Re-arranging (64), we find that after time steps, , which further implies that

D Proof of Lemma 6 (Total Corruption Observed by S)

We follow the proof of [Lykouris et al., 2018, Lemma 3.3], making use of the following martingale concentration inequality.

Lemma 11. [Beygelzimer et al., 2011, Lemma 1] Let be a sequence of real-valued random variables forming a martingale with respect to a filtration , i.e., , and suppose that almost surely. Then for any , the following holds:

where

Let be the point that would be selected at time t if instance S were chosen. We let denote the amount of corruption observed by instance S at time t in Algorithm 2.

Let denote the history (i.e., all selected instances , inputs , and observations ) prior to round t. Noting that is deterministic given , we find that is a random variable equaling with probability otherwise. As a result, we can define the following martingale sequence:

where as stated above. Since for all t and (see Section 2), we have . Hence, we can set in Lemma 11.

Next, we note the following:

where the two inequalities use and respectively. By summing over all the rounds and using the definition of C in (2), we obtain

since . Applying Lemma 11, we have with probability at least

Finally, we complete the proof of Lemma 6 by adding the total expected corruption:

where we have used (65) and

E Proof of Lemma 7 (Characterizing the Points Not Sampled by F)

Consider any round and any point ). We wish to show that F never selects x, i.e., . To establish this, it suffices to prove that

for some ; this means that is favored over x according to the selection rule of F.

To show (66), we first trivially write

Since , by the definition of in (23), there exists

Moreover, the following two equations provide upper bounds on

where (69) follows from the validity of the confidence bounds (see (21)), and (70) is due to , which means that the condition (19) used in Fast-Slow GP-UCB (Line 12) is not satisfied and thus it cannot hold that

From (69) and (70) we have , and from (67) and (68) we have which together prove that (66) holds.

F Proof of Lemma 8 (Characterizing the Points Ruled Out via S)

Although we consider the S instance run with , we are interested in how long it takes before the following (corresponding to ) is observed for the given suboptimal

meaning that are monotone, (72) holding for some it continues to hold for all . Hence, to establish the lemma, it suffices to show that after rounds (with given in (24)), there exists a point such that (71) holds.

Since this proof only concerns points selected by S, we abuse notation slightly and let denote the i-th point queried by S. We use the fact that the instant regret incurred by the S instance satisfies

(via an identical argumentto (55)), and the sum of posterior standard deviations satisfies

when we set (by a direct application of (56)). Combining these gives

where . It is useful to “invert” the right-hand side of (75); to do this, we define the function

Since (75) and (76) state that the “average” value of is at most , we deduce that

That is, at least one time index i yields a value less than or equal to the average.

Now consider the given with instant regret satisfying in accordance with the lemma statement. Setting the parameter in (77) gives

where (79) follows from (73). This means that is much closer to optimal than x is. The properties in (78) and (79) allow us to characterize the confidence bounds of

where (80) uses the definition of the confidence bounds in (15)–(16), (81) uses the validity of the confidence bounds in (21), and (82) uses (78). Similarly,

where (83) is the same as (80), (84) uses the UCB selection rule, (85) uses the validity of the confidence bounds, and (86) uses (78). Combining (82) and (86), we find that the confidence interval within the range

by the UCB rule used in the S instance. For this fixed and , there are then two possible cases that we need to consider:

because we chose , and decreasing only makes decrease and increase (see (15)–(16)). Hence, the condition in (71) holds as required.

Since the confidence bounds (21) are valid and (i.e., ), we must also have . Comparing this with I above, we notice a gap of at least between the upper and lower confidence bounds at x. Let this gap be denoted by

The confidence bounds and are equal to , where is shorthand for the corrupted posterior mean. When we compare to and , the value remains unchanged, but we have ; see (15)–(16). Therefore, we have

since . Substituting gives . Since the width of the

On the other hand, using (82) and (86), we see that the entire confidence interval for lies within I (recall that replacing only shrinks this interval). Hence, , as required.

Recall that the above findings all correspond to some time index . Hence, (24) follows by setting

G Proof of Theorem 9 (Regret Bound in the Known-or-Zero Setting)

Throughout the proof, we condition on the events (20)–(22) that simultaneously hold with probability at least

G.1 Non-corrupted case

Recall that at time t, the chosen instance and input are denoted by and , respectively, and we use to denote the number of times an instance has been chosen up to time t.

In the non-corrupted case, the condition (19) cannot hold (conditioned on the events (20) and (21)), since the confidence bounds for both S and F are valid and hence can never be smaller than . Consequently, Algorithm 2 selects only S or F, and never switches permanently to Algorithm 1.

First, we consider the case that is used to select

where (89) and (91) use the validity of the confidence bounds, (90) follows from the selection rule of S, and (92) uses the definitions (15)–(16).

Next, we consider the case that is used to select

where (93) and (96) use the validity of the confidence bounds, (94) uses the selection rule of F, and (97) follows similarly to (92) by noting that the intersected confidence bounds are at least as tight as the non-intersected ones.

The regret of Algorithm 2 after T rounds can be trivially bounded by the sum regret of instance A when run for T rounds in the non-corrupted case:

where (99) follows from (92) and (97), (100) follows since both and are non-decreasing in the time index, (101) follows from (56) by setting , and (102) follows since (see (25)–(26)).

Substituting , we arrive at the regret bound,

G.2 C-corrupted case

Similarly to the non-corrupted ca