Bandit optimization problems on large or continuous domains have far-reaching applications in modern machine learning and data science, including robotics [Lizotte et al., 2007], hyperparameter tuning [Snoek et al., 2012], recommender systems [Vanchinathan et al., 2014], environmental monitoring [Srinivas et al., 2010], and more. To make such problems tractable, one needs to exploit correlations between the rewards of “similar” actions. In the kernelized multi-armed bandit (MAB) problem, this is done by utilizing smoothness in the form of a low function norm in some Reproducing Kernel Hilbert Space (RKHS), permitting the application of Gaussian process (GP) methods [Srinivas et al.,
Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).
2010,Chowdhury and Gopalan, 2017]. See [Rasmussen and Williams, 2006, Ch. 6] for an introduction to the connections between GPs and RKHS functions.
Key theoretical developments for the RKHS optimization problem have included both upper and lower bounds on the performance, measured via some notion of regret [Srinivas et al., 2010, Chowdhury and Gopalan, 2017,Scarlett et al., 2017]. The vast majority of these results have focused only on zero-mean additive noise in the point evaluations, and as a result, it is unclear to what extent the performance degrades under adversarial corruptions. Such considerations are of significant interest under erratic or unpredictable sources of corruption, and particularly arise when the samples may be perturbed by a malicious adversary. As we argue in Section 2, prominent algorithms such as GP-UCB [Srinivas et al., 2010] can be quite brittle in the face of such corruptions.
In this paper, we study the optimization of RKHS functions with both random noise and adversarial corruptions. We propose a novel algorithm and regret analysis building on recently-proposed techniques for the finite-arm stochastic MAB setting [Lykouris et al., 2018]. Specifically, we present a randomized algorithm Fast-Slow GP-UCB based on randomly choosing between a “fast” non-robust instance, and a “slow” robust instance. We bound the cumulative regret of Fast-Slow GP-UCB in terms of the adversarial corruption level, time horizon, and underlying kernel.
The kernelized setting comes with highly non-trivial additional challenges compared to the finite-arm setting, primarily due to the infinite action space and correlations between their associated function values. In particular, while correlations are undoubtedly benefi-cial in the non-corrupted setting (taking a given action permits learning something about similar actions), this benefit can lead to a hindrance in the corrupted setting: An adversary that corrupts a given sample can potentially damage our belief regarding many nearby function values. Moving beyond independent arms was posed as a open problem in [Gupta et al., 2019, Sec. 5.3].
Related work on GP optimization. Numerous
GP-based bandit optimization algorithms have been proposed in recent years [Srinivas et al., 2010,Hennig and Schuler, 2012,Hernández-Lobato et al., 2014,Bo- gunovic et al., 2016b,Wang and Jegelka, 2017,Shekhar and Javidi, 2018, Ru et al., 2017]. Beyond the standard setting, several important extensions have been considered, including multi-fidelity [Bogunovic et al., 2016b,Kandasamy et al., 2017,Song et al., 2019], contextual and time-varying settings [Krause and Ong, 2011,Valko et al., 2013,Bogunovic et al., 2016a], safety requirements [Sui et al., 2015], high-dimensional settings [Djolonga et al., 2013,Kandasamy et al., 2015,Rol- land et al., 2018], and many more.
Certain types of corruption-tolerant GP-based optimization algorithms have been explored previously, with the defining features including (i) whether the corruption applies to the input (i.e., action) or the output (i.e., reward function), (ii) whether all samples are corrupted, or only a final reported point is corrupted, and (iii) whether the corruptions are random or adversarial. The case of random input noise on all samples was studied in [Beland and Nair, 2017,Nogueira et al., 2016,Dai Nguyen et al., 2017]. Perhaps closer to our work is [Martinez-Cantin et al., 2018], considering function outliers; however, no specific corruption model was adopted, and no theoretical regret bounds were given.
In [Bogunovic et al., 2018a], bounds on the simple regret are given for the case that the final reported input is adversarially perturbed, whereas the selected inputs are only subject to random output noise. This makes it desirable to seek broad peaks, which bears some similarity to the input noise viewpoint [Beland and Nair, 2017,Nogueira et al., 2016,Dai Nguyen et al., 2017] and level-set estimation [Gotovos et al., 2013, Bogunovic et al., 2016b]. Our goal of attaining small cumulative regret under input perturbations requires very different techniques from these previous works. Another distinct notion of robustness is considered in [Bogunovic et al., 2018b], in which some experiments in a batch may fail to produce an outcome. None of the preceding works provide regret bounds in the case of non-stochastic corrupted observations.
Related work on corrupted bandits. Adversarially corrupted observations have recently been considered in the finite-arm stochastic MAB problem under various corruption models [Lykouris et al., 2018,Gupta et al., 2019, Kapoor et al., 2019]. As mentioned above, [Lykouris et al., 2018] adopted a “fast-slow” algorithmic approach; this led to regret bounds of the form , where is a standard regret bound for the non-corrupted MAB setting. In [Gupta et al., 2019], this bound was improved to using an epoch-based approach in which the estimates of the arms’ means are reset after each epoch, and the previous epoch guides which arms are selected in the next one.
Our algorithmic approach is based on that of [Lyk- ouris et al., 2018]; however, the bulk of the theoretical analysis requires novel ideas. In particular, our need to handle an infinite action space with correlated rewards between actions poses considerable challenges, as discussed above. In more detail, we note the following:
• Even when studying the case of a known corruption level (which is done as a stepping stone towards our main results), it is non-trivial to characterize the effect of the corruptions (see Lemma 2 below);
• Characterizing that certain suboptimal points are never sampled after a certain time requires signifi-cant technical effort (see Lemmas 7 and 8 below);
• We adopt a UCB-style approach (Alg. 2) complementary to the elimination-style approach of [Lyk- ouris et al., 2018], and the former kind may be of independent interest even in the finite-arm setting.
In a parallel independent work [Li et al., 2019], cumulative regret bounds were given for stochastic linear bandits, which are a special case of the GP setting (with a linear kernel). The algorithm of [Li et al., 2019] is in fact more akin to that of [Gupta et al., 2019], which is potentially preferable due the latter attaining better bounds in the finite-arm setting. However, the algorithm and results of [Li et al., 2019] crucially rely on the notion of gaps between the function values of corner points in the domain, and the idea of exploiting these gaps for linear bandits has no apparent generalization to the GP setting with general kernels. In addition, even when we specialize to the linear kernel, neither our results nor those of [Li et al., 2019] imply each other, and the two both have benefits not provided by the other; see Appendix K for details.
Outline. We introduce the corruption-tolerant kernelized MAB problem in Section 2, and then present algorithms for three settings with increasing difficulty: Known corruption level (Section 3), simultaneous handing of no corruption and a known corruption level (Section 4), and unknown corruption level (Section 5).
We consider the problem of sequentially maximizing a fixed unknown function , where is a compact set and . We assume that D is endowed with a kernel function de-fined on , and the kernel is normalized to satisfy for all . We also assume that f has a bounded norm in the corresponding Reproducing Kernel Hilbert Space (RKHS) , i.e., . This assumption permits the construction of confidence bounds via Gaussian process (GP) methods (see Lemma 1 below).
In the non-corrupted setting, at every time step t, we choose , and observe a noisy function value . In this work, we consider the corrupted setting, where we only observe an adversarially corrupted sample . Formally, for each t = 1, . . . , T:
• Based on the previous decisions and corresponding corrupted observations , the player selects a probability distribution
• Based on the knowledge of the true function the previous decisions and corresponding observations , and the player’s distribu- tion , the adversary chooses the corruptions
• The agent draws at random from observes the noisy and corrupted observation:
where is the noisy non-corrupted observation: , where with independence between times t.
Note that the adversary is allowed to be adaptive, i.e., the corruptions may depend on the agent’s previously selected points and corresponding stochastic observations, as well as the distribution of the player’s next choice, but not its specific realization
We say that the problem instance is C-corrupted (i.e., the corruption level is C) if
Clearly, when C = 0, we recover the standard non-corrupted setting. We measure the performance using the cumulative regret, which is also typically used in the non-corrupted bandit setting [Srinivas et al., 2010]:
where . As noted in [Lykouris et al., 2018], one could alternatively define the cumulative regret with respect to the corrupted values ; the two notions coincide to within at most 2C, and such a difference will be negligible in our regret bound anyway. In Appendix C, we outline how our results can be adapted for simple regret (i.e., the regret of a point reported at the end of T rounds).
2.1 Standard (non-corrupted) setting
In the non-corrupted setting, existing algorithms use Gaussian likelihood models for the observations and zero-mean GP priors for modeling the uncertainty in f. Posterior updates are performed according to a “fictitious” model in which the noise variables are drawn independently across t from , where is a hyperparameter that may differ from the true noise variance . Given a sequence of inputs and their noisy observations , the posterior distribution under this GP(0, k) prior is also Gaussian, with the mean and variance
where is the kernel matrix. Common kernels include the linear, squared exponential (SE) and Matérn kernels.
The main quantity that characterizes the regret bounds in the non-corrupted setting [Srinivas et al., 2010, Chowdhury and Gopalan, 2017] is the maximum information gain, defined at time t as
For compact and convex domains, is sublinear in t for various classes of kernels, e.g., for the SE kernel, and for the Matérn kernel with [Srinivas et al., 2010].
The following well-known result of [Abbasi-Yadkori, 2013] provides confidence bounds around the unknown function in the non-corrupted setting.
Lemma 1. Fix with , and consider the sampling model , with independent noise . Under the choice
the following holds with probability at least
where are given in (4) and (5).
This lemma follows directly from [Abbasi-Yadkori, 2013, Theorem 3.11] (and [Abbasi-Yadkori, 2013, Remark 3.13]) and the definition (6) of
tions. In the noisy non-corrupted setting, several algorithms have been developed and analyzed. A particularly well-known example is GP-UCB, which selects . GP-UCB achieves sublinear cumulative regret with
Figure 1: (Left) Function f, its global maximizer , a local maximizer , and the corruption region. (Middle) GP-UCB eliminates the optimal region (and ) early on due to the corruptions, and continues sampling points in the suboptimal region around . (Right) Our corruption-aware algorithm (see Algorithm 1) does not eliminate the optimal region, and after the corruption budget is exhausted, it identifies the true maximizer
high probability [Srinivas et al., 2010,Chowdhury and Gopalan, 2017], for a suitably chosen (e.g., as in (7)). Despite this success in the non-corrupted setting, these algorithms can fail under adversarial corruptions.
An illustrative example is provided in Figure 1. Observations that correspond to the points sampled in the shaded region around the global maximizer rupted by the value , up to a total corruption budget (C = 3.5). In Figure 1 (Middle), the points selected by GP-UCB for t = 50 time steps are shown. GP-UCB eliminates the global maximizer early on due to corruptions, and later on, it only selects points from the suboptimal region and consequently suffers linear cumulative regret. In the subsequent sections, we design algorithms that are robust to corruptions, and are able to identify the true maximizer after the corruption budget C is exhausted (see Figure 1 (Right)).
We first consider the case that the total corruption C in (2) is known. Given a sequence of inputs and their corrupted observations (with ), we form a posterior mean according to a prior and sampling noise as follows:
where . Note that this matches the posterior mean formed in the non-corrupted setting, simply replacing by . In addition, we form the same posterior standard deviation as in the non-corrupted setting. The role of the parameter is discussed in Appendix I.
The following lemma provides an upper bound on the difference between the non-corrupted and corrupted posterior means, and is proved using the definitions of and along with RKHS function properties. All proofs can be found in the supplementary material.
Lemma 2. For any and , we have , where and are given in (4) and (5), and is given in (9), with
By combining Lemmas 1 and 2, we obtain the following.
. Under the choice with given in (7) and , we have with probability at least
where are given in (9) and (5).
In Algorithm 1 (), we present an upper confi- dence bound based algorithm with enlarged confidence bounds in accordance with Lemma 3. We explicitly define these confidence bounds as follows:
Once the validity of these confidence bounds is established via (10), one can use standard analysis techniques [Srinivas et al., 2010] to bound the cumulative regret. This is formally stated in the following.
Lemma 4. Under the choice of in Lemma 3 and , conditioned on the event (10), the cumulative regret incurred by Algorithm 1 satisfies
The main theorem of this section is now obtained via a direct combination of Lemmas 3 and 4.
Theorem 5. In the C-corrupted setting, Algorithm 1 with and set as in Lemma 3, attains, with probability at least , cumulative regret
Note that when C = 0, this result recovers known non-corrupted cumulative regret bounds (cf. [Chowdhury and Gopalan, 2017, Theorem 3]). More generally, we can decompose the obtained regret bound into two terms: behaves as
(14) The obtained regret bound can be made more explicit by substituting the bound on for particular kernels [Srinivas et al., 2010], e.g., for the SE kernel we obtain . In Appendix J, we argue that the linear dependence on C is unavoidable for any algorithm, and discuss cases where the dependence on T is near-optimal. However, we do not necessarily claim that the joint dependence on C and T is optimal; this is left for future work.
In the previous section, we assumed that the upper bound C on the total corruption is known and the problem is C-corrupted. In this section, we also assume that C is known, but we consider a scenario in which the problem may be either C-corrupted or non-corrupted (i.e., the standard setting). Our goal is to develop an algorithm that has a similar guarantee to the previous section in the corrupted case, while also attaining a similar guarantee to GP-UCB [Srinivas et al., 2010] in the non-corrupted case, and thus obtaining strong guarantees in the two settings simultaneously. Theorem 5 fails to achieve this goal, since the regret depends on C even if the problem is non-corrupted.
Our algorithm Fast-Slow GP-UCB is described in Algorithm 2. It makes use of two instances labeled F (fast; Line 6) and S (slow; Line 8). The S instance is played with probability 1/C, while the rest of the time F is played. The intuition is that F shrinks the confi-dence bounds faster but is not robust to corruptions, while S is slower but robust to corruptions. We formalize this intuition below in Lemma 6 and (20)–(21).
The instances use the following confidence bounds depending on an exploration parameter and an additional parameter whose role is discussed in Appendix I and after Lemma 8 below:
where is the number of times an instance has been selected at a given time instant. We also make use of the following intersected confidence bounds, which have the convenient feature of being monotone:
In Fast-Slow GP-UCB, we check if the following condition (Line 12) holds:
In the non-corrupted setting, under the high-probability event in Lemma 1 (for both F and S), this condition never holds. Hence, when it does hold, we have detected that the problem is C-corrupted.In such a case, the algorithm permanently switches to running Algorithm 1 with C as the input. Note that we can check the condition in (19) by using a global optimizer to find a minimizer of , and checking whether its value is smaller than 0.
Finally, the inner minimization over F instance, together with the validity of the condition (19), ensures that F does not select a point that is already “ruled out” by the robust instance S. We make this statement precise in Lemma 7 below.
4.1 Analysis
First, we provide a high-probability bound on the total corruption that is observed by the S instance. Specifi-cally, we show that because it is sampled with probability 1/C, the total corruption observed by S is constant with high probability, i.e., it is upper bounded by a value not depending on T.
Lemma 6. The S instance in Fast-Slow GPUCB observes, with probability at least , a total corruption of at most
We now fix a constant and condition on three high-probability events:
1. If setting is non-corrupted, the following holds with probability at least
for all and . This claim follows from Lemma 1 by setting the corresponding failure probability to
, then the following holds in both the non-corrupted and corrupted settings with probability at least
for all and . This follows from Lemmas 3 and 6 (using in place of C in Lemma 3), by setting the corresponding failure probabilities to in both. Taking the union bound over these two events establishes the claim. Note that (21) corresponds to , but directly implies an analogous condition for all (since increasing widens the confidence region (15)–(16)).
3. If the condition in (19) is detected at any time instant, then Algorithm 2 permanently switches to running Algorithm 1. If Algorithm 1 is run with
for all and , under the definitions in (11). This is by Lemma 3 with in place of
By the union bound, (20)–(22) all hold with probability at least . In addition, by the definitions in (17), these properties remain true when and are replaced by
The confidence bounds of F are only valid in the non-corrupted case, and hence, in the case of corruptions we rely on the confidence bounds of S. Specifically, we show that F never queries a point that is strictly suboptimal according to the confidence bounds of S.
Lemma 7. Suppose that (20) and (21) hold. For any time , if in Fast-Slow GP-UCB, then the selected point
represents the set of strictly suboptimal points according to the intersected S-confidence bounds.
By the monotonicity of and , the set is non-shrinking in t. The proof shows that F always favors from (23) over , i.e., has a higher value of (see Line 6 of Algorithm 2). To show this, we upper bound in terms of and lower bound in terms of via the confidence bounds and condition (19).
The next lemma characterizes the number of queries made by the S instance before a suboptimal point becomes “eliminated”, i.e., the time after which the point belongs to
Lemma 8. Suppose that the S instance is run with corresponding to (21) and . Then, conditioned on the high-probability confidence bounds in (21), for any given suboptimal point such that , it holds that
This lemma’s proof is perhaps the trickiest, and crucially relies on the fact that . We show that by the time given in (24), we have encountered a round i in which a -optimal point is queried with the con- fidence width also being at most . This means that is much closer to optimal than the -suboptimal point x in the lemma statement. Using the fact that had a higher UCB score than x, we can also deduce that the posterior standard deviation at x was not too large. Since replacing by (as done in the definition of in (23)) halves the confidence width, we can combine the above findings to deduce that the confidence bounds indeed rule out x at time , and hence also for all subsequent times due to the monotonicity of the confidence bounds.
Finally, we state the main theorem of this section, whose proof combines the preceding lemmas.
Theorem 9. For any with , let , and consider Fast-Slow GP-UCB run with
and set in Algorithm 1 as . Then, after T rounds, with probability at least the cumulative regret satisfies
(27) in the non-corrupted case and
in the corrupted case.
The non-corrupted case is straightforward to prove, essentially applying standard arguments separately to F and S. The corrupted case requires more effort. Lemma 8 characterizes the time after which points with a given regret are no longer sampled by S, which permits bounding the cumulative regret incurred by S. By Lemma 7, the points in are also not sampled by F, and on average F is played at most C times more frequently than S. Converting this average to a high-probability bound using basic concentration, this factor of , and we obtain (28).
Using the notation to hide logarithmic factors, the bound obtained in the non-corrupted case simplifies to , and unlike the result from Theorem 5, it does not depend on C. The obtained bound is only a constant factor away from the standard non-corrupted one (cf. (14)), while at the same time our algorithm achieves in the C-corrupted case. As before, we can make the results obtained in this theorem more explicit by substituting the bounds for for various kernels of interest [Srinivas et al., 2010].
In this section, we assume that the total corruption C defined in (2) is unknown to the algorithm. Despite this additional challenge, most of the details are similar to the known-or-zero setting, so to avoid repetition, we omit some details and focus on the key differences.
Algorithm. Our corruption-agnostic algorithm is shown in Algorithm 3. We again take inspiration from the finite-arm counterpart [Lykouris et al., 2018], con- sidering that are sampled with probability (with any remaining probability going to layer 1). The idea is that any layer with is robust, for the same reason that the S instance is robust in Fast-Slow GP-UCB (Algorithm 2).
Each instance makes use of confidence bounds defined as follows for some parameters to be chosen later:
where denotes the number of times instance has been selected by time t, and . Similarly to the Section 4, we define intersected confidence bounds:
Each instance selects a point according to , where represents a set of potential maximizers at time t, i.e., a set of points that could still be the global maximizer according to the confidence bounds. More formally, these sets are defined recursively as follows:
Two key properties of these sets are: (i) for every and due to the monotonicity of the confidence bounds; and (ii) . The latter property
implies that once a point is eliminated at some layer it is also eliminated from all , while the former property ensures that it remains eliminated for all subsequent time steps {t + 1, . . . , T}.
Similarly to Fast-Slow GP-UCB, each layer uses
suffices) in its acquisition function, while replacing by 1 in the confidence bounds when constructing the set of potential maximizers. This is done to permit the application of Lemma 8; the intuition behind doing so is discussed in Appendix I and following Lemma 8.
In the case that corresponding to the selected at time t is empty, the algorithm finds the lowest layer i for which , and selects the point that maximizes that layer’s upper confidence bound. In this case, the algorithm makes no changes to the confidence bounds or the sets of potential maximizers.
theoretical analysis in place, we can also obtain a nearidentical regret bound in the case of unknown C. We only provide a brief outline here, with further details in the supplementary material.
We let the robust layer play the role of F and eliminate suboptimal points. Since regret incurred in the lower layers is at most a factor 2C higher than that of layer on average, and this leads to a similar analysis to that used in the proof of Theorem 9. Our final main result is stated as follows.
Theorem 10. For any with , and any , under the parameters
we have that for any unknown corruption level C > 0, the cumulative regret of Algorithm 3 satisfies
with probability at least
This has the same form as (28), with in place of (since there are
We have introduced the kernelized MAB problem with adversarially corrupted samples. We provided novel algorithms based on enlarged confidence bounds and randomly-selected fast/slow instances that are provably robust against such corruptions, with the regret bounds being linear in the corruption level. To our knowledge, we are the first to handle this form of adversarial corruption in any bandit problem with an infinite action space and correlated rewards, which are two key notions that significantly complicate the analysis.
An immediate direction for further research is to better understand the joint dependence on the corruption level C and time horizon T. The linear O(C) dependence is unavoidable (see Appendix J), and the dependence matches well-known bounds for the non-corrupted setting [Srinivas et al., 2010, Chowdhury and Gopalan, 2017] (in some cases having near-matching lower bounds [Scarlett et al., 2017]), but it is unclear whether the product of these two terms is unavoidable.
This work was gratefully supported by Swiss National Science Foundation, under the grant SNSF NRP 75 407540_167189, ERC grant 815943 and ETH Zürich Postdoctoral Fellowship 19-2 FEL-47. J. Scarlett was supported by the Singapore National Research Foundation (NRF) under grant number R-252-000-A74-281.
[Abbasi-Yadkori, 2013] Abbasi-Yadkori, Y. (2013). Online learning for linearly parametrized control problems.
[Abbasi-Yadkori et al., 2011] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.
[Beland and Nair, 2017] Beland, J. J. and Nair, P. B. (2017). Bayesian optimization under uncertainty. NIPS BayesOpt 2017 workshop.
[Beygelzimer et al., 2011] Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26.
[Bogunovic et al., 2016a] Bogunovic, I., Scarlett, J., and Cevher, V. (2016a). Time-varying Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 314–323.
[Bogunovic et al., 2018a] Bogunovic, I., Scarlett, J., Jegelka, S., and Cevher, V. (2018a). Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), pages 5760–5770.
[Bogunovic et al., 2016b] Bogunovic, I., Scarlett, J., Krause, A., and Cevher, V. (2016b). Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems (NIPS), pages 1507–1515.
[Bogunovic et al., 2018b] Bogunovic, I., Zhao, J., and Cevher, V. (2018b). Robust maximization of nonsubmodular objectives. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 890–899.
[Chowdhury and Gopalan, 2017] Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. In International Conference on Machine Learning (ICML), pages 844–853.
[Dai Nguyen et al., 2017] Dai Nguyen, T., Gupta, S., Rana, S., and Venkatesh, S. (2017). Stable Bayesian optimization. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 578–591. Springer.
[Djolonga et al., 2013] Djolonga, J., Krause, A., and Cevher, V. (2013). High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems (NIPS), pages 1025–1033.
[Freedman et al., 1975] Freedman, D. A. et al. (1975). On tail probabilities for martingales. Annals of Probability, 3(1):100–118.
[Gotovos et al., 2013] Gotovos, A., Casati, N., Hitz, G., and Krause, A. (2013). Active learning for level set estimation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1344–1350.
[Gupta et al., 2019] Gupta, A., Koren, T., and Talwar, K. (2019). Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory (COLT).
[Hennig and Schuler, 2012] Hennig, P. and Schuler, C. J. (2012). Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837.
[Hernández-Lobato et al., 2014] Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems (NIPS), pages 918–926.
[Kandasamy et al., 2017] Kandasamy, K., Dasarathy, G., Schneider, J., and Póczos, B. (2017). Multi-fidelity bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1799–1808. JMLR. org.
[Kandasamy et al., 2015] Kandasamy, K., Schneider, J., and Póczos, B. (2015). High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning (ICML), pages 295–304.
[Kapoor et al., 2019] Kapoor, S., Patel, K. K., and Kar, P. (2019). Corruption-tolerant bandit learning. Machine Learning, 108(4):687–715.
[Krause and Ong, 2011] Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2447–2455.
[Li et al., 2019] Li, Y., Lou, E. Y., and Shan, L. (2019). Stochastic linear optimization with adversarial corruption. arXiv preprint arXiv:1909.02109.
[Lizotte et al., 2007] Lizotte, D. J., Wang, T., Bowling, M. H., and Schuurmans, D. (2007). Automatic
gait optimization with Gaussian process regression. In International Joint Conference on Artificial Intelligence (IJCAI), pages 944–949.
[Lykouris et al., 2018] Lykouris, T., Mirrokni, V., and Paes Leme, R. (2018). Stochastic bandits robust to adversarial corruptions. In ACM Symposium on Theory of Computing (STOC), pages 114–122. ACM.
[Martinez-Cantin et al., 2018] Martinez-Cantin, R., Tee, K., and McCourt, M. (2018). Practical Bayesian optimization in the presence of outliers. In International Conference on Artificial Intelligence and Statistics (AISTATS).
[Nogueira et al., 2016] Nogueira, J., Martinez-Cantin, R., Bernardino, A., and Jamone, L. (2016). Unscented Bayesian optimization for safe robot grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[Rasmussen and Williams, 2006] Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning, volume 1. MIT press Cambridge.
[Rolland et al., 2018] Rolland, P., Scarlett, J., Bogunovic, I., and Cevher, V. (2018). High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Ar-tificial Intelligence and Statistics (AISTATS), pages 298–307.
[Ru et al., 2017] Ru, B., Osborne, M., and McLeod, M. (2017). Fast information-theoretic Bayesian optimisation. arXiv preprint arXiv:1711.00673.
[Scarlett et al., 2017] Scarlett, J., Bogunovic, I., and Cevher, V. (2017). Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory (COLT).
[Shekhar and Javidi, 2018] Shekhar, S. and Javidi, T. (2018). Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829–3874.
[Snoek et al., 2012] Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural information Processing Systems (NIPS), pages 2951–2959.
[Song et al., 2019] Song, J., Chen, Y., and Yue, Y. (2019). A general framework for multi-fidelity bayesian optimization with gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS).
[Srinivas et al., 2010] Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), pages 1015–1022.
[Sui et al., 2015] Sui, Y., Gotovos, A., Burdick, J., and Krause, A. (2015). Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning (ICML), pages 997–1005.
[Valko et al., 2013] Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. Uncertainty In Artificial Intelligence (UAI).
[Vanchinathan et al., 2014] Vanchinathan, H. P., Nikolic, I., De Bona, F., and Krause, A. (2014). Explore-exploit in top-n recommender systems via Gaussian processes. In ACM Conference on Recommender Systems, pages 225–232.
[Wang and Jegelka, 2017] Wang, Z. and Jegelka, S. (2017). Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning (ICML), pages 3627–3635.
All citations below are to the reference list in the main document.
A Proof of Lemma 2 (Corrupted vs. Non-Corrupted Posterior Mean)
Our analysis uses techniques from [Chowdhury and Gopalan, 2017, Appendix C]. Let x be any point in D, and fix a time index . From the definitions of (Eq. (4), (9) and (1)), we have
where and . We proceed by upper bounding the absolute difference between
Let denote the RKHS associated with the kernel k and domain D. We define , where maps to the RKHS associated with the kernel. For any two functions in , we write to denote the kernel inner product , which implies that . By the RKHS reproducing property, i.e., , and the fact that can write
for all . It also follows that where , and . (Here and subsequently, the notation similarly extends to matrix multiplication operations.)
Using these properties, we can characterize the second term of (39) as follows:
where:
• Eq. (42) is by Cauchy-Schwartz.
• The first term in (44) follows from the following identity:
• In (46), denotes the largest eigenvalue of
• Eq. (47) follows since ), and since the norm is always an upper bound on the
• Eq. (48) follows since
This follows since all eigenvectors of are also eigenvectors of , and hence, the eigenvalues of are of the form , all the eigenvalues of
B Proof of Lemma 4 (Regret Bound with Known Corruption)
Conditioned on the confidence bounds (10) being valid according to Lemma 3, we have
where (52) uses the lower confidence bound from (10), (53) uses the upper confidence bound from (10), and (54) uses the selection rule in (13).
When , we have from [Chowdhury and Gopalan, 2017, Lemma 4] that
This is a variant of a more widely-used upper bound on in terms of from [Srinivas et al., 2010].
We set in accordance with the lemma statement, and sum over the time steps:
where (58) uses (55) and the monotonicity of substitutes the choice of applies . Hence, we have , which establishes the lemma.
C Bounding the Simple Regret
While we have focused exclusively on the cumulative regret in our exposition, we can easily adapt our analysis to handle the simple regret similarly to the idea used in the proof of [Bogunovic et al., 2018a, Theorem 1]. We outline this procedure for Theorem 5, since all of the other results can be adapted in the same manner.
We claim that under the setup of Theorem 5, for a given , Algorithm 1 achieves rounds, where the reported point is defined as
To prove this claim, we continue from the end of Appendix B. We set as before, and define
Using (52), we have . From the definition of the reported point in (61), we have that is the time index with the smallest value of . It follows that
where (62) upper bounds the minimum by the average, (63) uses the monotonicity of
Re-arranging (64), we find that after time steps, , which further implies that
D Proof of Lemma 6 (Total Corruption Observed by S)
We follow the proof of [Lykouris et al., 2018, Lemma 3.3], making use of the following martingale concentration inequality.
Lemma 11. [Beygelzimer et al., 2011, Lemma 1] Let be a sequence of real-valued random variables forming a martingale with respect to a filtration , i.e., , and suppose that almost surely. Then for any , the following holds:
where
Let be the point that would be selected at time t if instance S were chosen. We let denote the amount of corruption observed by instance S at time t in Algorithm 2.
Let denote the history (i.e., all selected instances , inputs , and observations ) prior to round t. Noting that is deterministic given , we find that is a random variable equaling with probability otherwise. As a result, we can define the following martingale sequence:
where as stated above. Since for all t and (see Section 2), we have . Hence, we can set in Lemma 11.
Next, we note the following:
where the two inequalities use and respectively. By summing over all the rounds and using the definition of C in (2), we obtain
since . Applying Lemma 11, we have with probability at least
Finally, we complete the proof of Lemma 6 by adding the total expected corruption:
where we have used (65) and
E Proof of Lemma 7 (Characterizing the Points Not Sampled by F)
Consider any round and any point ). We wish to show that F never selects x, i.e., . To establish this, it suffices to prove that
for some ; this means that is favored over x according to the selection rule of F.
To show (66), we first trivially write
Since , by the definition of in (23), there exists
Moreover, the following two equations provide upper bounds on
where (69) follows from the validity of the confidence bounds (see (21)), and (70) is due to , which means that the condition (19) used in Fast-Slow GP-UCB (Line 12) is not satisfied and thus it cannot hold that
From (69) and (70) we have , and from (67) and (68) we have which together prove that (66) holds.
F Proof of Lemma 8 (Characterizing the Points Ruled Out via S)
Although we consider the S instance run with , we are interested in how long it takes before the following (corresponding to ) is observed for the given suboptimal
meaning that are monotone, (72) holding for some it continues to hold for all . Hence, to establish the lemma, it suffices to show that after rounds (with given in (24)), there exists a point such that (71) holds.
Since this proof only concerns points selected by S, we abuse notation slightly and let denote the i-th point queried by S. We use the fact that the instant regret incurred by the S instance satisfies
(via an identical argumentto (55)), and the sum of posterior standard deviations satisfies
when we set (by a direct application of (56)). Combining these gives
where . It is useful to “invert” the right-hand side of (75); to do this, we define the function
Since (75) and (76) state that the “average” value of is at most , we deduce that
That is, at least one time index i yields a value less than or equal to the average.
Now consider the given with instant regret satisfying in accordance with the lemma statement. Setting the parameter in (77) gives
where (79) follows from (73). This means that is much closer to optimal than x is. The properties in (78) and (79) allow us to characterize the confidence bounds of
where (80) uses the definition of the confidence bounds in (15)–(16), (81) uses the validity of the confidence bounds in (21), and (82) uses (78). Similarly,
where (83) is the same as (80), (84) uses the UCB selection rule, (85) uses the validity of the confidence bounds, and (86) uses (78). Combining (82) and (86), we find that the confidence interval within the range
by the UCB rule used in the S instance. For this fixed and , there are then two possible cases that we need to consider:
because we chose , and decreasing only makes decrease and increase (see (15)–(16)). Hence, the condition in (71) holds as required.
Since the confidence bounds (21) are valid and (i.e., ), we must also have . Comparing this with I above, we notice a gap of at least between the upper and lower confidence bounds at x. Let this gap be denoted by
The confidence bounds and are equal to , where is shorthand for the corrupted posterior mean. When we compare to and , the value remains unchanged, but we have ; see (15)–(16). Therefore, we have
since . Substituting gives . Since the width of the
On the other hand, using (82) and (86), we see that the entire confidence interval for lies within I (recall that replacing only shrinks this interval). Hence, , as required.
Recall that the above findings all correspond to some time index . Hence, (24) follows by setting
G Proof of Theorem 9 (Regret Bound in the Known-or-Zero Setting)
Throughout the proof, we condition on the events (20)–(22) that simultaneously hold with probability at least
G.1 Non-corrupted case
Recall that at time t, the chosen instance and input are denoted by and , respectively, and we use to denote the number of times an instance has been chosen up to time t.
In the non-corrupted case, the condition (19) cannot hold (conditioned on the events (20) and (21)), since the confidence bounds for both S and F are valid and hence can never be smaller than . Consequently, Algorithm 2 selects only S or F, and never switches permanently to Algorithm 1.
First, we consider the case that is used to select
where (89) and (91) use the validity of the confidence bounds, (90) follows from the selection rule of S, and (92) uses the definitions (15)–(16).
Next, we consider the case that is used to select
where (93) and (96) use the validity of the confidence bounds, (94) uses the selection rule of F, and (97) follows similarly to (92) by noting that the intersected confidence bounds are at least as tight as the non-intersected ones.
The regret of Algorithm 2 after T rounds can be trivially bounded by the sum regret of instance A when run for T rounds in the non-corrupted case:
where (99) follows from (92) and (97), (100) follows since both and are non-decreasing in the time index, (101) follows from (56) by setting , and (102) follows since (see (25)–(26)).
Substituting , we arrive at the regret bound,
G.2 C-corrupted case
Similarly to the non-corrupted case, we condition on (20)–(22), and we set . We first address the two parts of Fast-Slow GP-UCB whose contributions to the cumulative regret are the simplest to handle: That from Algorithm 1, and that from the slow instance S.
Supposing that Algorithm 1 is run for rounds, we simply use the confidence bounds (22) and apply Lemma 4 (with in place of ): If , then the cumulative regret after
rounds satisfies
The regret obtained by S is analyzed in the same way via Lemma 4, but with in place of C, and the confidence bounds (21) in place of (22). Lemma 4 then implies that the regret coming from S for a total of
rounds satisfies
which is the same as (103) but with in place of C (and possibly a different value). It now only remains to bound the regret of the F instance in the corrupted case.
Regret incurred by the F instance. First, we recall a few facts. The F-confidence bounds in (20) are only valid when there is no corruption, and hence they cannot be used to characterize the regret of the F instance in the corrupted case. Unlike the F-confidence bounds, the S-confidence bounds in (21) are valid even in the corrupted case, and they are useful since the F rule explicitly depends on them (Fast-Slow GP-UCB, Line 6). In Lemma 7, we have shown that no point that is suboptimal according to the S-confidence bounds is sampled by the F instance. Subsequently, in Lemma 8, we have characterized how many points need to be queried in S before this occurs. More formally, the results of Lemmas 7 and 8 (with ) state that by time
all -suboptimal points are ruled out and are not sampled by F in the subsequent time steps. We observe that the following two statements are equivalent:
• After time , the instant regret of each point selected by F is at most • After time , the instant regret of each point selected by F is at most
This is by a simple inversion; if we set in (105) then it trivially holds that
We now seek to characterize how many times F is selected in between successive selections of this is trivial, since S is always selected, so in the following we focus on C > 1. We will establish that with probability at least , in between any two selections of S (or prior to the first such selection), there are at most selections of F with probability at least . We henceforth denote this event by A.
To establish the preceding claim, fix an integer N > 0, and observe that after any given selection of S, the probability of selecting F for the next . Hence, if , then the probability is at most . The number of selections of S is trivially at most T, so taking a union bound over at most T associated events, we obtain
By the union bound, the event A and the events in (20)–(22) hold simultaneously with probability at least . Conditioned on these events, when Fast-Slow GP-UCB is run for T rounds, the cumulative regret of the points selected by
where:
• (106) is established using the equivalence stated after (105) and the definition of A as follows: First, the instant regret bound is upper bounded by are monotone. Then, when summing this weakened upper bound over all time instants, the conditioning on A means that the worst case (i.e., giving the highest upper bound) is that there are exactly N selections of F before each selection of S. The first such selection incurs cumulative regret at most the subsequent selections indexed by incur at most
Substituting (stated above (21)) into (108), we obtain
We could slightly improve this bound by replacing , but we proceed with the former since it is simpler and
Overall corrupted regret bound. The obtained regret bounds (103), (104), and (109) hold simultaneously with probability at least . We obtain our final bound by noting that the cumulative regret of Fast-Slow GP-UCB after T rounds can be trivially upper bounded by the sum of the individual regrets in (103), (104), and (109), where in both (103) and (104) we upper bound by T. Therefore, with probability at least , after T rounds, we obtain
terms, since a term is already present in (103).
H Further Details on the Proof of Theorem 10 (Regret Bound with Unknown C)
As stated in Theorem 10, we set the exploration parameter for each layer as follows:
This ensures the following confidence bound for each , with probability at least
This follows from Lemmas 3 and 6 (with in place of C in Lemma 3, and ), by setting the corresponding failure probabilities to in both. By a union bound over the two events in the lemmas, followed by a union bound over , we obtain (112). Once again, (112) remains true when are replaced by
There are at most “corruption-tolerant” layers (i.e., layers such that ), and their regret is analyzed via Lemma 4, but with in place of C, and the confidence bounds (112) in place of (22). Lemma 4 then implies that the total regret coming from these layers for a total of T rounds is upper bounded according to the following analog of (104):
with probability at least
It remains to characterize the regret coming from the layers that are not corruption-tolerant, i.e., the layers such that . By the algorithm design (i.e., by the established properties of the sets of potential maximizers) and similarly to Lemma 7, it holds that if a point becomes suboptimal at time step t according to the confidence bounds of some layer ), then it is not sampled by any layer in the subsequent time steps {t + 1, . . . , T}. If we denote the minimum layer that is robust to corruption as
then we can use this layer to characterize the number of queries before a suboptimal point becomes “eliminated” from this and all the lower layers . This can be done by using Lemma 8 (where plays the role of the S instance), using the confidence bounds from (112) instead of (21).
We can then repeat the arguments of Theorem 9 (Section G.2; Regret incurred by the F instance) and obtain the regret bounds. First, we characterize how many times layers are selected in between successive selections of . We can establish that with probability at least , in between any two selections of (or prior to the first such selection), there are at most selections of layers (combined) with probability at least . This is done via the same arguments used in the proof of Theorem 9, and the fact that layer is chosen with probability at least by the definition of
By taking the union bound over the previous event and the one in (112), we have that with probability at least , the regret coming from the points selected by the layers is at most given by the following analog of (109):
The following overall regret bound dominates both (113) and (116), and therefore holds for Algorithm 3 with probability at least
This matches the expression given in Theorem 10.
I Discussion on the Parameters and
Recall that our posterior updates are done assuming a sampling noise variance that may differ from the true variance . In the absence of corruptions, one may be inclined to set , as was done (for example) in [Srinivas et al., 2010]. However, a problem with this approach in the corrupted setting is that if is small, the posterior mean will follow the corrupted samples very closely even though they are unreliable. More generally, increasing generally increases robustness against corruptions, but if is too high then the model essentially places no trust in any of the sampled points, which prevents effective learning. In our theoretical analysis, we set as a mathematically convenient choice controlling this trade-off, though other values may also work well in practice.
Next, we discuss the parameter . The idea is that if we set everywhere, it becomes difficult or impossible to establish that suboptimal points are “ruled out” by the S instance (in the sense of Lemma 7) after a certain amount of time. This is because regardless of the suboptimality of a given point x, the posterior variance may be just high enough for its upper confidence bound to be just below the maximal function value will be favored over x according to the UCB rule, and the algorithm may fail to reduce the uncertainty in f(x).
In contrast, if we are using the UCB rule with and the preceding “unlucky” scenario is encountered, then upon halving the confidence width (i.e., considering the confidence bounds with instead of point x will correctly be ruled out as suboptimal. Lemma 8 formalizes this intuition.
J Optimal Dependence on C and T
We first argue that a linear dependence on the corruption C is unavoidable in any cumulative regret bound. However, we do not make any claims of optimality regarding the joint dependence on (C, T).
Let the domain be the unit interval [0, 1], and let and be functions taking values inand satisfying the RKHS norm bound, as well as the following property: Any point within of optimality for one function (e.g., ) is at least -far away from optimality for the other function (e.g., ). Such functions can easily be constructed (at least when the RKHS norm B is not too small), for example, via the approach in [Scarlett et al., 2017].
Now suppose that the the true function is known to be either , but the exact one of the two is unknown. Consider an adversary that, for the first C rounds, simply perturbs the function value to zero. This can be done within the adversary’s budget, since . Given such corruptions, the player cannot learn anything about the function, so at best can randomly guess whether the function is . However, by the property of
Hence, regardless of the sampling algorithm, there exist functions in the function class for which incurred.
As for the dependence on T, we recall from (14) that when C is constant, the dependence on T matches well-known bounds from the non-corrupted setting [Srinivas et al., 2010, Chowdhury and Gopalan, 2017]. Recent lower bounds [Chowdhury and Gopalan, 2017] reveal that this dependence is near-optimal for the SE kernel, though some gaps still remain for the Matérn kernel. Closing these gaps remains a significant challenge even in the non-corrupted setting.
K Comparison to Stochastic Linear Bandits
Regret bounds for corrupted stochastic linear bandits were given in the parallel independent work of Li et al. [Li et al., 2019]. While the stochastic linear setting corresponds to our problem setting with a linear kernel, care should be taken in comparing our results to those of [Li et al., 2019], since the results of [Li et al., 2019] are instance-dependent (i.e., depend on certain gaps associated with the underlying function) and ours hold for an arbitrary (e.g., worst-case) instance satisfying the RKHS norm constraint.
For a polytope-shaped domain in any constant dimension, the cumulative regret bound in [Li et al., 2019] is logarithmic in T with a constant of is the gap between the best action (necessarily a corner point of the domain) and the second-best corner point. By comparison, for fixed B > 0, Theorem 10 yields cumulative regret factors. This is obtained using the fact that for the linear kernel [Srinivas et al., 2010, Theorem 5], and the fact that we are focusing on the case d = O(1) in this discussion.
Naturally, the results of [Li et al., 2019] are stronger when the gaps are constant (i.e., ), attaining log T regret instead of. On the other extreme, the “worst-case” gap used to convert instance-dependent guarantees to worst-case guarantees is [Abbasi-Yadkori et al., 2011], and in this case the bound of [Li et al., 2019] becomes trivial (higher than linear), whereas ours remains sublinear for . More generally, our bound is tighter whenever , and the bound of [Li et al., 2019] is tighter whenever
Overall, however, we believe that the main advantage of our work is the ability to handle general kernels (e.g., SE and Matérn), thereby allowing the underlying function to be highly non-linear.