b

DiscoverSearch
About
My stuff
Semi-bandit Optimization in the Dispersed Setting
2019·arXiv
Abstract
Abstract

In this work, we study the problem of online optimization of piecewise Lipschitz functions with semi-bandit feedback. This challenging class of non-convex optimization problems often arises in algorithm selection problems for combinatorial settings, where the goal is to find the best algorithm from a large algorithm family for a specific application domain. In these settings, each evaluation of the loss functions in the optimization problem can be computationally expensive, often requiring the learner to run a combinatorial algorithm to measure its performance. Combined with the fact that small differences between similar algorithms in the family can lead to cascading changes in algorithm behavior, efficient online optimization in these settings is a challenging problem. However, we show that in many applications, evaluating the loss function for one algorithm choice can sometimes reveal the loss for a range of similar algorithms, essentially for free. We develop online optimization algorithms capable of using this kind of extra information by working in the semi-bandit feedback setting. Our algorithms achieve regret bounds that are essentially as good as algorithms under full-information feedback and are significantly more computationally efficient.

We apply our semi-bandit optimization results to obtain online algorithm selection procedures for two rich families of combinatorial algorithms. We provide the first provable guarantees for online algorithm selection for clustering problems using a family of clustering algorithms containing classic linkage procedures. We also show how to select algorithms from a family of greedy knapsack algorithms with simultaneously lower computational complexity and stronger regret bounds than the best algorithm selection procedures from prior work.

Overview In this work, we consider the problem of online optimization of piecewise Lipschitz functions with semi-bandit feedback. This is an important family of non-convex optimization problems that arises in algorithm selection problems for combinatorial settings, where the goal is to decide in a data-driven way what algorithm to use from a large family of algorithms for a given problem domain. For example, we may want to decide which clustering algorithm to use from a large family of clustering procedures in order to obtain the highest quality results. In the online version of the algorithm selection problem, on each round the learner chooses an algorithm from the family and receives a new instance of the problem. The problem is characterized by a loss function that measures the performance of each algorithm in the family for the given instance. The goal is to select algorithms so that the cumulative performance of the learner is nearly as good as the best algorithm in hindsight for that sequence of problems. The major challenge in these settings is that it is potentially computationally expensive for the learner to characterize the loss function for each round, since each run of the chosen algorithm reveals the value of the loss function at just the single point corresponding to the algorithm selected. Moreover, for combinatorial problems, small differences between two algorithms can lead to a cascade of changes in their behavior and dramatically change their performance. However, when the algorithm family is parameterized, it can often be shown that the loss function is piecewise Lipschitz in the algorithm parameters, so we can phrase the problem as online optimization of piecewise Lipschitz functions.

Prior work on piecewise Lipschitz optimization was limited to two more extreme feedback regimes: Either the learner carries out a computationally expensive process to obtain full-information feedback (i.e., it observes the loss of every algorithm in the family on each instance), or accepts suboptimal regret bounds to work in the bandit feedback setting (i.e., it only observes the loss of one algorithm for each instance). This creates a tradeoff between computational efficiency and good regret bounds. However, many algorithm selection problems exhibit rich additional structure that is ignored by these two approaches. We show that, surprisingly, evaluating the loss function for a single algorithm can sometimes reveal the loss for a range of similar algorithms, essentially for free; in the context of the loss function, we show that an entire Lipshitz region can often be learned at once. This motivates us to define a new learning model, which we call the semi-bandit feedback setting for learning piecewise Lispchitz functions. Our new results in this model achieve the best of both worlds: we can efficiently obtain the necessary feedback while also having regret bounds that are nearly as good as under full-information.

We apply our results to two online algorithm selection problems. Our results for optimizing over a family of greedy knapsack algorithms improve over the procedures of Balcan et al. [11], Gupta and Roughgarden [19], and Cohen-Addad and Kanade [16] by simultaneously being more efficient and having tighter regret bounds. We also provide the first online algorithm selection procedures for a rich family of linkage based clustering algorithms introduced by Balcan et al. [9] that interpolates between single and complete linkage, which are algorithms that are widely used in practice [6, 28, 34] and known to perform optimally in many settings [5, 8, 7, 18]. Balcan et al. [9] consider the algorithm selection problem for this family of algorithms in the statistical batch setting, rather than the online setting, where they model the application domain as a distribution over problem instances, the goal is to find the algorithm with the highest expected performance, and we are given an i.i.d. sample from the distribution as training data.

Problem Setup. We study the problem of online piecewise Lipschitz optimization. The learning protocol is as follows: on each round t, the learner chooses a parameter  ρtbelonging to a d-dimensional parameter space C ⊂ Rd, the adversary chooses a piecewise Lipschitz loss function  ℓt : C → [0, 1], and the learner incurs a loss equal to  ℓt(ρt). A function  ℓt : C → [0, 1]is piecewise L-Lipchitz if we can partition the parameter space C into regions such that  ℓt is L-Lipschitz when restricted to each region. Many important algorithm selection problems require optimizing piecewise Lipschitz functions, including greedy combinatorial algorithms [19], clustering algorithms and SDP-rounding schemes [9], branch and bound mixed integer program solvers [10], initialization procedures for k-means clustering [12], and various auction design problems [13]. In these problems, the family of algorithms is parameterized and each parameter  ρ ∈ Ccorresponds to one algorithm. We suppose that on each round t there is a partition  A(t)1 , . . . , A(t)M of the parameter space C, called the feedback system. If the learner’s parameter  ρtbelongs to the set  A(t)i , then they observe both the set  A(t)ias well as the loss  ℓt(ρ)for every  ρ ∈ A(t)i. We consider the uninformed setting, where the learner does not know the feedback system for round t in advance of selecting a parameter. For simplicity, we consider oblivious adversaries that choose their sequence of loss functions  ℓ1, ℓ2, . . .adversarially, but before the interaction with the learner begins. The learner’s goal is to minimize regret, which is the difference between their total accumulated loss and that of the best parameter in hindsight: �Tt=1 ℓt(ρt) − minρ∈C�Tt=1 ℓt(ρ).Throughout the paper, we use the notation ˜O(·)to optionally suppress all logarithmic terms and dependence on parameters other than the time horizon T and the dimension of the parameter space d.

Main Results and Techniques.

Semi-bandit Regret Bounds in the Dispersed Setting. It is not always possible to achieve sub-linear regret for piecewise Lipschitz loss functions [24, 14, 26]. Balcan et al. [11] provide regret bounds in the full-information and bandit feedback settings under a dispersion condition that roughly measures the number of discontinuous functions in any ball of a given radius, and which is satisfied for a diverse collection of combinatorial algorithm configuration problems. In this paper, we introduce a simplified version of this condition that captures what is asymptotically important for our regret bounds.

Definition 1. The sequence of loss functions  ℓ1, ℓ2, . . .is  β-dispersed for the Lipshitz constant L if for all T and for all  ϵ ≥ T −β, we have that, in expectation, the maximum number of functions among  ℓ1, . . . , ℓTwhich are not L-Lipshitz in any  ϵ-ball contained in C is at most ˜O(ϵT). That is, for all T and for all  ϵ ≥ T −β,we have

image

In our applications, the sequence of loss functions will be chosen by a smoothed adversary, in the sense of Spielman and Teng [30]. Informally, the discontinuity locations of the functions chosen by a smoothed adversary are randomly perturbed. The expectation in Definition 1 is over this randomness in the sequence of loss functions. (Balcan et al. [11] also show examples where sufficient randomness can arise from the algorithm itself, rather than smoothness constraints on the adversary.) In all of our applications, we prove β-dispersion with  β = 1/2. In Definition 1, when a function  ℓt is not L-Lipschitz on the ball  B(ρ, ϵ), it canhave arbitrarily many discontinuities. However, our regret bounds require that the loss functions are bounded, so these discontinuities cannot introduce arbitrarily large losses within the ball. We provide an algorithm for online piecewise Lipschitz optimization under semi-bandit feedback whose regret is characterized by the β-dispersion parameter of the adversarially chosen functions. In Section 2, we prove the following result:

Theorem 2. Let  C ⊂ Rdbe a bounded parameter space and  ℓ1, ℓ2, · · · : C → [0, 1]be piecewise Lipschitz functions that are  β-dispersed. Running the continuous Exp3-SET algorithm (Algorithm 1) under semi-bandit feedback with an appropriate parameter  λhas expected regret bounded by

image

In comparison, the bandit-feedback algorithm of Balcan et al. [11] has expected regret bounded by ˜O(dTd+1d+2 3d + T 1−β). Even in one-dimensional problems, this leads to a regret of  ˜O(T 2/3 + T 1−β), which issignificantly worse than our results. Under different assumptions, the bandit algorithm of Cohen-Addad and Kanade [16] also has ˜O(T 2/3)regret for the special case of one-dimensional piecewise constant functions.

The proof of Theorem 2 is motivated by the analysis of Exp3-SET for finite-armed bandits given by Alon et al. [1], however, there is one significant new challenge. On each round, the Exp3-SET algorithm uses importance weighting and the partially observed loss to construct an unbiased estimate of the entire loss function, which it then passes into the exponentially weighted forecaster full-information algorithm. The regret bound of Alon et al. [1] in the finite-armed setting then follows from applying a second-order analysis of the exponentially weighted forecaster algorithm when run on the sequence of estimated loss functions. The key challenge in our continuous setting is that the estimated loss functions need not be dispersed, even if the original losses chosen by the adversary were. This is because each estimated loss has additional discontinuities at the boundary of the feedback set observed by the learner. Rather than make restrictive assumptions on how the feedback sets are generated (to guarantee that the estimated losses will also be dispersed) we “open up” the full-information regret bound and use a careful application of Jensen’s inequality that allows us to leverage the dispersion of the true loss functions.

General Tools for Verifying Dispersion. We also provide general tools for proving that a sequence of piecewise Lipschitz functions satisfies dispersion. When the sequence  ℓ1, ℓ2, . . .is random, we can usually directly bound the expected number of loss functions that are not L-Lipschitz on any specific ball of radius  ϵby ˜O(Tϵ). However, this does not imply that the functions are  β-dispersed, since the expected number of non-Lipschitz functions in the worst ball of radius  ϵwill typically be larger than the expected number in any specific ball. Building on uniform convergence from learning theory [29], we show that if each loss function has a one-dimensional domain, at most K discontinuities, and any single ball of radius  ϵhas at most ˜O(Tϵ)non-Lipschitz functions in expectation, then the expected number of non-Lipschitz losses on the worst ball of radius  ϵis at most ˜O(Tϵ +�T log(TK))This leads to  β-dispersion with  β = 1/2. Our result gives an exponential improvement in the dependence on K compared to the results of Balcan et al. [11], who upper bound the expected number of non-Lipschitz losses in the worst ball of radius  ϵby ˜O(TKϵ + K�T log(TK)). Our improvement comes from using a more subtle VC-based uniform convergence argument involving the class of functions that counts the number of non-Lipschitz functions on a given  L2 ball in R, as opposed to the class of functions that counts the total number of discontinuities in a given ball (i.e., counting multiple discontinuities from each loss function).

Semi-bandit Online Algorithm Selection. In Section 4, we combine our general regret analysis from Theorem 2 together with application-specific dispersion analysis to design practical algorithm selection procedures for linkage-based clustering and the knapsack problem. In both applications, we show that the discontinuities of each loss function are the roots of polynomials depending on the corresponding problem instance, and that the roots are dispersed under mild smoothness assumptions on the adversary. We obtain the first online algorithm selection procedures for linkage based clustering, and we give algorithm selection procedures for the knapsack problem with substantial computational improvements over the prior work, while at the same time achieving nearly the same regret bound.

Explicit Comparison for Knapsack. To highlight the benefits of our new learning model and results applied to algorithm selection, we give an explicit comparison of the computational complexity for obtaining different types of feedback and the corresponding regret bounds for the family of algorithms introduced in Section 4.1. This is a family of greedy algorithms for the knapsack problem with a single parameter  ρ. The algorithm with parameter  ρassigns the score  σρ(i) = vi/sρito item i, where the value and size of item i are  viand  si, respectively. Then, the algorithm greedily selects items in decreasing order of their score until the knapsack is full. In each round of the online game, the algorithm chooses a parameter  ρ, a new knapsack instance with n items arrives, and our goal is for the total value of items selected by the learner to be close to the total value of the best fixed parameter  ρin hindsight. We compare our results to the best prior full-information and

bandit feedback procedures for this problem.

Full-information. Balcan et al. [11] show that the exponentially weighted forecaster with full-informationfeedback achieves a regret bound of ˜O(n2√T). Our tighter analysis improves the bound to ˜O(√T). Obtaining full-information feedback involves running the greedy algorithm  O(n2)times per round, each costing O(n log n) time, leading to a total cost of  O(n3 log n)per round.

Bandit Feedback. The discretization-based bandit algorithm of Balcan et al. [11] achieves a regret bound of ˜O(T 2/3n2), but only requires a single run of the greedy algorithm per round, costing O(n log n) time.

Semi-bandit Feedback. Finally, in this paper we give an algorithm whose regret is bounded by ˜O(n√T)using semi-bandit feedback obtainable in time O(n log n) per round. Note that our algorithm is as efficient as the bandit-feedback algorithm, yet achieves essentially the same regret as the full-information algorithm.

Related Work. For online optimization of one-dimensional piecewise constant functions, Cohen-Addad and Kanade [16] provide full-information and bandit online optimization procedures. Balcan et al. [11] consider the more general setting of multi-dimensional piecewise Lipschitz functions. They introduce a dispersion condition that roughly measures how many functions are not Lipschitz in any ball, and provide algorithms with dispersion-dependent full-information and bandit regret bounds. They also verify that dispersion is satisfied for a diverse collection of algorithm selection problems.

Prior work on semi-bandit feedback has focused predominantly on finite-armed bandits. Semi-bandit feedback was first considered for online shortest path problems, where on each round the learner selects a path through a graph and observes the length of the edges along that path (but not for other edges) [20, 21]. Audibert et al. [3] obtain minimax bounds for a generalization to combinatorial bandits, where the learner’s action space is described by boolean vectors in  {0, 1}d, the losses are linear, and the on each round the learner observes the entries of the loss vector corresponding to the non-zero entries in their action. Alon et al. [2] introduce the Exp3-SET algorithm for semi-bandit feedback for finite-armed bandits. They consider the graph-feedback setting introduced by Mannor and Shamir [25], where on each round t, there is a feedback graph  Gtover the arms of the bandit and playing arm i reveals the loss for arm i and all arms adjacent in the graph  Gt. We extend the Exp3-SET algorithm to online optimization problems where there are infinitely many arms and where the feedback system on each round is a partition of the parameter space C.

There is a rich literature on data-driven algorithm selection. Most prior work focuses on the statistical setting, where the learner is given a large iid sample of problem instances from some distribution, and the goal is to find the algorithm with the best performance in expectation. Gupta and Roughgarden [19] introduced this formal setting and provide sample complexity results for several families of greedy algorithms. Balcan et al. [9] consider semidefinite rounding schemes for integer quadratic programs and linkage based clustering algorithms, Balcan et al. [10] consider learning the best branch and bound algorithms for mixed integer programs, and Balcan et al. [12] consider learning the best initialization procedures for k-means clustering. In addition to these formal results, this statistical setting has been the predominant model for data-driven algorithm configuration in artificial intelligence [27], combinatorial auctions [23], numerical linear algebra [17], vehicle routing [15], and SAT solving [35].

Another related line of work focuses on the problem of selecting the algorithm with the shortest running time over a distribution of problem instances [22, 33, 32]. This work makes minimal assumptions about the algorithm family and instead designs procedures that can avoid running every algorithm to completion, since this may be very expensive. Our work, on the other hand, explores special structure in algorithm families and can be used to optimize more general performance measures in the online rather than stochastic setting.

In this section we provide an algorithm for online piecewise Lispchitz optimization and analyze its regret under dispersion. Our results are for the following continuous semi-bandit setting.

Definition 3 (Uninformed Semi-bandit Feedback.). An online optimization problem with loss functions ℓ1, ℓ2, . . .has semi-bandit feedback if for each time t, there is partition  A(t)1 , . . . , A(t)M of the parameter space C, called a feedback system, such that when the learner plays point  ρt ∈ A(t)i, they observe the set  A(t)iand ℓt(ρ) for all ρ ∈ A(t)i . For any ρ ∈ C, we let A(t)(ρ)denote the feedback set that contains  ρ.

We analyze a continuous version of the Exp3-SET algorithm of Alon et al. [2]. This algorithm uses importance weighting to construct unbiased estimates of the complete loss function on each round, which it passes as input to a continuous version of the exponentially weighted forecaster. Pseudocode is given in Algorithm 1. Unlike the Exp3 algorithm of Auer et al. [4], the finite-armed Exp3-SET algorithm and our continuous version do not include an explicit exploration term (i.e., we do not mix the distribution  pt with auniform distribution over C). Stoltz [31] was the first to show that mixing with the uniform distribution is unnecessary for the Exp3 algorithm to have optimal expected regret.

In Appendix A.2, we show how to implement this algorithm with O(log T) per round computational complexity (in addition to the cost of collecting feedback) for the case of piecewise constant losses in one dimension. This implementation uses the interval tree data structure of Cohen-Addad and Kanade [16].

image

Given the learner’s observations on round t, Algorithm 1 uses importance weighting to estimate the complete loss function by ˆℓt(ρ) = I{ρ∈A(t)(ρt)}pt(A(t)(ρt)) ℓt(ρ). The estimate  ˆℓt(ρ)is only non-zero for parameters ρthat belong to the feedback set observed by the algorithm at round t. The key property of ˆℓtis that it is an unbiased estimate of the true loss function conditioned on the history until the beginning of round t. More formally, let  Et[·] = E[·|ρ1, . . . , ρt−1, ℓ1, . . . , ℓt]denote the conditional expectation given the learner’s choices until round  t − 1and the first t loss functions. This expectation is only over the randomness of the learner’s choice of  ρtat time t. For clarity, we also use the notation  E<t[·]to denote the expectation of any random variable that is a function of only  ρ1, . . . , ρt−1 and ℓ1, . . . , ℓtso that for any random quantity X, we have  E[X] = E<t�Et[X]�. For any ρ ∈ C and t, a straight forward calculation shows that  Et[ˆℓt(ρ)] = ℓt(ρ).

To simplify presentation, we assume that the sequence of loss functions has an  r0-interior minimizer: with probability one, for all times T there exists  ρ∗ ∈ argminC�Tt=1 ℓt(ρ) such that B(ρ∗, r0) ⊂ C. We canusually modify a sequence of loss functions to obtain an equivalent optimization problem that is guaranteed to have an  r0-interior minimizer. In Appendix A we discuss such a transformation that works whenever the parameter space C is convex.

We bound the regret of Algorithm 1 under a generalization of both both  β-dispersion and the (w, k)-dispersion definition of Balcan et al. [11], leading to more precise bounds and broader applicability.

Definition 4. The sequence of loss functions  ℓ1, ℓ2, . . .is f-dispersed for the Lipshitz constant L and dispersion function  f : N × [0, ∞) → Rif for all T and for all  ϵ > 0, we have

image

We can express both  β-dispersion and (w, k)-dispersion in terms of f-dispersion. In particular, let ℓ1, ℓ2, . . .be a sequence of piecewise L-Lipschitz functions. For any  T ∈ Nand  ϵ > 0, let  D(T, ϵ) =E�maxρ∈C(��{1 ≤ t ≤ T | ℓt is not L-Lipshitz in  B(ρ, ϵ)}���be the expected number of non-Lipschitz functions among  ℓ1, . . . , ℓTon the worst ball of radius  ϵ. If the functions are  β-dispersed, then we know that for all T and all  ϵ ≥ T −β, we have  D(T, ϵ) = ˜O(Tϵ). Since  D(T, ϵ)is a non-decreasing function of the radius  ϵ, we are guaranteed that for any  ϵ < T −βwe have  D(T, ϵ) ≤ D(T, T −β) = ˜O(T 1−β). It follows that the functions are also f-dispersed for  f(T, ϵ) = ˜O(Tϵ + T 1−β). Similarly, the functions are (w, k)-dispersed if every ball of radius w has at most k non-Lipschitz functions. This implies that for all ϵ ≤ w, we have  D(T, ϵ) ≤ k, but for  ϵ > w, we could have  D(T, ϵ)as large as T. It follows that the functions are f-dispersed where  f(T, ϵ) = k for all ϵ ≤ w and f(T, ϵ) = Totherwise.

We bound the regret of Algorithm 1 in terms of the f-dispersion of the losses.

Theorem 5. Let  C ⊂ Rdbe contained in a ball of radius R and  ℓ1, ℓ2, · · · : C → [0, 1]be piecewise L-Lipschitz functions that are f-dispersed with an  r0-interior minimizer. Moreover, suppose the learner gets semi-bandit feedback and, on each round t, the feedback system  A(t)1 , . . . , A(t)M has Mfeedback sets. For any r ∈ (0, r0], running Algorithm 1 with  λ =�d log(R/r)/(TM)satisfies the following regret bound:

image

Proof sketch. Following the proof for the finite-armed case of Alon et al. [2], we upper and lower bound the quantity  E[log(WT+1/W1)], where Wt =�C wt(ρ) dρis the normalizing constant at round t. The upper bound is in terms of the expected total loss of the algorithm, while the lower bound depends on the expected total loss of the best parameter  ρ∗in hindsight. The main additional challenges are in proving the lower bound, which relies crucially on the dispersion properties of  ℓ1, . . . , ℓT. Combining the inequalities leads to the claimed regret bound.

Following a continuous analogue of the arguments used by Alon et al. [2], we arrive at the following upper bound:

image

where the second term roughly quantifies our dependence on the second moment of the estimated losses ˆℓt(ρ). We show that for any fixed  ρ ∈ Cwe have that  Et[ˆℓt(ρ)2] ≤ 1pt(A(t)(ρ)). This implies that

image

where the first equality follows from the fact that the feedback system  A(t)1 , . . . , A(t)Mis a partition of C and the interval of  pt(ρ)/pt(A(t)i ) over A(t)iis equal to 1. Substituting this into our upper bound, we have

image

In the finite-armed setting, lower bounding  E�log WT +1W1 �in terms of the loss of  ρ∗is straightforward, since  W1is the number of arms and  E[log(WT+1)] ≥ E[log(wt+1(ρ∗))] = −λ �Tt=1 ℓt(ρ∗). When the parameter space C is continuous, the lower bound is more involved. First, since  WT+1is an integral rather than a sum,  WT+1is no longer lower bounded by  wt+1(ρ∗). To overcome this, Balcan et al. [11] use dispersion to argue that an entire ball of points close to  ρ∗ has low loss. However, applying this analysis to the estimated losses ˆℓ1, . . . , ˆℓTwill fail, since the estimated loss functions might not be f-dispersed. Moreover, even when ˆℓ1, . . . , ˆℓTare f-dispersed, their Lipschitz constants and range are not bounded unless we make restrictive assumptions on the volume of the feedback sets. Instead, we use a careful application of Jensen’s inequality to postpone the application of dispersion until after we have already taken the expectation of the estimated losses, which allows us to use the dispersion properties of the true loss functions instead.

Formally, let  ρ∗ ∈ argminρ�Tt=1 ℓt(ρ)be a minimizer with  B(ρ∗, r0) ⊂ C, let rbe any radius satisfying r ≤ r0 and let B = B(ρ∗, r)be the ball of radius  r about ρ∗. We lower bound  log WT+1by integrating  wT+1over just the ball B consisting of parameters close to  ρ∗:

image

Using the fact that log is concave, applying Jensen’s inequality to the right hand side gives  log WT+1 ≥log(Vol(B)) − λVol(B)�B�Tt=1 ˆℓt(ρ) dρ. Let  Eℓ[·] = E[·|ℓ1, . . . , ℓT ]denote the expectation conditioned on the sequence of loss functions (i.e., only taking the expectation over the randomness of the algorithm). Then, since  W1 = Vol(C), we have that  Eℓ[ WT +1W1 ] ≥ log Vol(B)Vol(C) − λVol(B)�B�Tt=1 ℓt(ρ). Here, we used the fact that ˆℓtis an unbiased estimate of  ℓt. Finally, letting D denote the number of non-Lipschitz functions in  ℓ1, . . . , ℓTon the ball B, we can upper bound the loss of all  ρ ∈ Bby �Tt=1 ℓt(ρ) ≤ �Tt=1 ℓt(ρ∗) + TLr + D. Since the functions  ℓ1, ℓ2, . . . are f-dispersed, we know that the expected number of non-Lipschitz functions in the worst ball of radius r is at most f(T, r). In particular, this implies that  E[D] ≤ f(T, r). Therefore, taking the expectation of our lower bound gives  E[ WT +1W1 ] ≥ log Vol(B)Vol(C) − λ(�Tt=1 ℓt(ρ∗) + TLr + f(T, r)).

Combining the upper and lower bounds on  E[log(WT+1/W1)], using the fact that Vol(B)Vol(C) ≥ ( rR)d, andrearranging gives

image

Substituting the given value of  λcompletes the proof.

Our regret bound for  β-dispersed losses given in Theorem 2 follows immediately from Theorem 5.

Note that our results are also applicable in two closely related settings: maximizing dispersed piecewise Lipschitz utility functions, and the case when losses are bounded in [0, H] for some known bound H instead of [0, 1]. A discussion of the necessary transformations can be found in Appendix A.1.

In this section we provide a general tool for demonstrating that a sequence of loss functions  ℓ1, ℓ2, . . .is dispersed. For many sequences of loss functions and any fixed ball  B(ρ, ϵ), we are able to bound the expected number of functions among  ℓ1, . . . , ℓTthat are not L-Lipschitz on that ball. However, this does not directly imply that the functions are dispersed, since in general the expected number of non-Lipschitz functions in the worst ball will be larger than the expected number of non-Lipschitz functions in any specific ball. Our main result in this section shows that for sequences of independent loss functions  ℓ1, ℓ2, . . .that are defined over a one-dimensional parameter space with at most K discontinuities each, to prove dispersion for the sequence, it is sufficient to bound the expected number of non-Lipschitz functions for any fixed ball.

Theorem 6. Let  ℓ1, ℓ2, · · · : R → Rbe independent piecewise L-Lipschitz functions, each having at most K discontinuities. Let

image

be the number of functions in  ℓ1, . . . , ℓTthat are not L-Lipschitz on the ball  [ρ − ϵ, ρ + ϵ]. Then we have

image

Proof sketch. For simplicity, we assume that every function has exactly K discontinuities and let  α(t) ∈ RKbe the vector whose entries are the discontinuity locations of  ℓt. The vectors  α(1), α(2), . . .are independent.

For any interval  I ⊂ R, define the function  fI : RK → {0, 1}by  fI(α) = 1if any component of  αbelongs to  I and fI(α) = 0otherwise. The sum �Tt=1 fI(α(t))counts the number of vectors  α(1), . . . , α(T)that have a component in the interval I or, equivalently, the number of functions  ℓ1, . . . , ℓTthat are not L-Lipschitz on I. The main result will follow from VC-dimension based uniform convergence arguments applied to the class  F = {fI : RK → {0, 1} | I ⊂ Ris an interval}.

First, we bound the VC-dimension of F by O(log K). The key insight is the following connection between F and the class of indicator functions for intervals: let  S = {x(1), . . . , x(n)} ⊂ RKbe any collection of n vectors in  RK and let P = {x(1)1 , . . . , x(1)K , . . . , x(n)1 , . . . , x(n)K } ⊂ Rdenote the set containing the union of their combined nK component values. Now consider any pair of intervals I and  I′. If the indicator functions for  I and I′ agree on all the points in P (i.e., the intervals contain exactly the same subset of P), then we must have that  fI and fI′agree on every vector in S. This is because if  I and I′ contain exactly the same subset of P, then for each vector  x(i), both intervals contain the same subset of its component values and we must have  fI(x(i)) = fI′(x(i)). Therefore, that the number of distinct ways that the class F can label the set of vectors S is at most the number of ways that intervals can label the set of points P. The bound on the VC-dimension of F follows from the fact that, by Sauer’s Lemma, intervals can only label the points in P in O((Kn)2)distinct ways, which limits the size of the largest set S that is shatterable by F.

Applying VC-dimension based uniform convergence arguments to the class F and using the fact that D(T, ϵ, ρ) = �Tt=1 fI(α(t)), where I = [ρ − ϵ, ρ + ϵ], it follows that for all time horizons T and all radiuses ϵ, with probability at least  1 − δ we have maxρ∈R D(T, ϵ, ρ) ≤ maxρ∈R E[D(T, ϵ, ρ)] + O(�T log(K/δ)).Converting this high-probability bound into a bound in expectation completes the proof.

If we can show that for all times T, radiuses  ϵ > 0and any fixed interval I of radius  ϵ, the expected number of non-Lipschitz functions on interval I is at most ˜O(Tϵ), then Theorem 6 guarantees that the functions are  β-dispersed with  β = 1/2. Similarly, if the expected number of non-Lipschitz functions on the interval I is at most  g(T, ϵ), then the functions are f-dispersed for  f(T, ϵ) = g(T, ϵ) + O(�T log(TK)).

Improvement Over the Prior Work. Balcan et al. [11] also provide general tools for proving (w, k)-dispersion that can be adapted to  β and f-dispersion. Our bounds provide an exponential improvement in the dependence on K, the number of discontinuities per function. Under the same assumptions as Theorem 6, they are able to show  E[maxρ∈R D(T, ϵ, ρ)] ≤ maxρ∈R E[ ˜D(T, ϵ, ρ)] + O(K�T log(TK)), where ˜D(T, ϵ, ρ) isthe total number of discontinuities across the functions  ℓ1, . . . , ℓTthat fall in the interval of radius  ϵ centeredat  ρ(i.e., it counts multiple discontinuities from each loss function  ℓt) (see Lemma 15 in Appendix B). The dependence of our additive term on K is exponentially smaller and leads to improved dispersion analysis.

In this section we apply our semi-bandit optimization results to online algorithm selection for two rich parameterized families of algorithms. For both families, we show how to obtain semi-bandit feedback by running a single algorithm fom the family. We also analyze dispersion for these problems under the assumption that the adversary is smoothed. In both cases, we obtain ˜O(√T)regret bounds in the semi-bandit feedback setting. Finally, in Appendix C.1 we show how to use binary search to obtain semi-bandit feedback for a large class single-parameter algorithm families.

Smoothed adversaries. We consider adversaries that are smoothed in the sense of Spielman and Teng [30], where their decisions are corrupted by small random perturbations. Formally, we say that a parameter chosen by the adversary is  κ-smooth if it is a random variable whose density is bounded by  κ. After the adversary chooses the density for each smoothed parameter, nature samples each parameter value independently from their corresponding distributions. Small values of  κcorrespond to larger random perturbations of the problem parameters, while in the limit as  κ → ∞, the adversary is able to choose the parameters deterministically. In each application, we will specify which problem parameters are smoothed, together with the bound  κ on theirdensity. For simplicity, we assume that all  κ-smooth random variables are independent (i.e., the corruption of the adversary’s choices is not correlated across variables), though many of our results can be exteneded to allow for some correlation between the parameters of each instance.

4.1 Greedy Algorithms for Knapsack

First, we consider selecting the best algorithm from a parameterized family of a greedy algorithms for the knapsack problem. An instance of the knapsack problem consists of n items, where item i has a value  vi anda size  si, and a knapsack capacity C. Our goal is to find the most valuable subset of items whose total size does not exceed C. Gupta and Roughgarden [19] propose using the following parameterized family of greedy knapsack algorithms: for a given parameter  ρ ∈ [0, R], set the score of item  i to be σρ(i) = vi/sρi . Then, indecreasing order of score, add each item to the knapsack if there is enough capacity left. This algorithm runs in time O(n log n). In our analysis, we assume that the adversary’s item values are  κ-smooth.

First, we show how to obtain semi-bandit feedback for this family of greedy knapsack algorithms by running a single algorithm in the family. Pseudocode is given in Algorithm 2.

image

Lemma 7. Consider a knapsack instance with capacity C and n items with values  v1, . . . , vnand sizes s1, . . . , sn. Algorithm 2 runs in time O(n log n). Moreover, there is a feedback system  A1, . . . , AMpartitioning C into  M = O(n2)intervals such that set of items output by the algorithm is constant for  ρ ∈ Ai. When run with parameter  ρ, in addition to the item set S, the algorithm outputs the interval  Aicontaining  ρ.Proof. Computing the permutation  πin step 1 of Algorithm 2 takes O(n log n) time. Finding the item set S in steps 2 and 3 only requires a linear pass through the items. Similarly, finding the interval A in steps 4 and 5 also only requires a linear pass through the items. Therefore, the total running time is O(n log n).

Gupta and Roughgarden [19] show that for any knapsack instance, the algorithm’s output is a piecewise constant function of the parameter  ρwith at most  O(n2)discontinuities. In particular, for each pair of items i and j, there is a critical parameter value  cij = log(vi/vj)/ log(si/sj)such that the relative order of items i and j only changes at  ρ = cij. These critical parameter values partition C into  M = O(n2)sets A1, . . . , AMsuch that the item ordering is fixed for all  ρ ∈ Ai. Algorithm 2 computes the critical values for each consecutive pair of items  π(i)and  π(i + 1)and outputs the largest interval A containing  ρand none of these critical values. For all  ρ′ ∈ A, we must have  σρ′(π(i)) ≥ σρ′(π(i + 1))for  i = 1, . . . , n − 1, and therefore the item ordering is constant for  ρ′ ∈ A. It follows that that A does not contain  cijfor any pair of items i and j. On the other hand, the end points of A are critical values, so A must be equal to one of the M sets  Ai.

In contrast to Algorithm 2, the most direct approach to obtaining full-information feedback for this family of knapsack algorithms is to first compute a set of  O(n2)critical parameter values arising from all pairs of points and to run the algorithm once for each cell in the corresponding partition, taking  O(n3 log n) time.

Next, we provide a dispersion analysis for selecting the parameter  ρ ∈ [0, R]in order to maximize the value of items selected. We assume that each instance has the same capacity C, item sizes are in [1, C], and the item values are in  [0, 1] and κ-smooth. The corresponding loss function is  ℓ(ρ) = C − �i∈Sρ vi ∈ [0, C],where  Sρis the set of items selected by Algorithm 2 when run with parameter  ρ.

Lemma 8. Consider an adversary choosing knapsack instances with a fixed knapsack capacity C where the tthinstance has item sizes  s(t)1 , . . . , s(t)n ∈ [1, C], and  κ-smooth item values  v(t)1 , . . . , v(t)n ∈ [0, 1]. The loss functions  ℓ1, ℓ2, . . .defined above are piecewise constant and f-dispersed for  f(T, ϵ) = Tϵn2κ2 ln(C) +O(�T log(Tn)) and β-dispersed for  β = 1/2.

Proof. Let  c(t)ij = log(v(t)i /v(t)j )/ log(s(t)i /s(t)j )be the critical parameter value such that at  ρ = c(t)ij, items i and j swap their relative order in the  tthinstance. Balcan et al. [11] show that each critical value  c(t)ijis random and has a density function bounded by  κ2 ln(C)/2. It follows that for any interval I of radius ϵ, the expected total number of critical values  c(t)ijsummed over all pairs of items and t = 1, . . . , T is at most  Tϵn2κ2 ln(C). This is also an upper bound on the expected number of loss functions in  ℓ1, . . . , ℓTthat are not constant on I. Applying Theorem 6, it follows that the functions are f-dispersed for  f(T, ϵ) =Tϵn2κ2 ln(C) + O(�T log(Tn)) = ˜O(Tϵ +√T), which implies  β-dispersion with  β = 1/2.

Running Algorithm 1 using the semi-bandit feedback returned by Algorithm 2, we obtain the following.

Corollary 9. Under the same conditions as Lemma 8, using Algorithm 1 to tune the parameter  ρ ∈ [0, R] ofAlgorithm 2 under semi-bandit feedback has expected regret bounded by  O(Cn�T log(RTnκ log(C))).

The full-information regret bound obtained by Balcan et al. [11] is ˜O(Cn2√T), which is worse than our semi-bandit bound (but can be improved to ˜O(C√T)using our tighter dispersion analysis).

4.2 Interpolating between Single and Complete Linkage Clustering

Next, we consider a rich family of linkage-based clustering algorithms introduced by Balcan et al. [9] that interpolates between the classic single and complete linkage procedures. Clustering instances are described by a matrix  D = (dij) ∈ Rn×ngiving the pairwise distances between a collection of n data points and the goal is to organize the points into a hierarchy or cluster tree. We provide the first dispersion analysis and online configuration procedures for this class of algorithms. We assume that each distance  dij is κ-smooth.

The algorithm family we consider, called  ρ-linkage, is family of agglomerative clustering algorithms with a single parameter  ρ ∈ [0, 1]. These algorithms take as input a distance matrix  D ∈ Rn×nwith entries dijand the parameter value  ρ ∈ [0, 1]and output a cluster tree, which is a binary tree where each node corresponds to a cluster in the data. The leaves of the tree are the individual data points, while the root node corresponds to the entire dataset. The children of each node subdivide that cluster into two subclusters. The ρ-linkage algorithm starts with each point belonging to its own cluster. Then, it repeatedly merges the closest pair of clusters according the distance defined by  dρ(A, B) = (1 − ρ) dmin(A, B) + ρ dmax(A, B), where Aand B are clusters (i.e., subsets of  [n]), dmin(A, B) = mina∈A,b∈B dab and dmax(A, B) = maxa∈A,b∈B dab.When there is only a single cluster remaining, the algorithm outputs the constructed cluster tree. Running this algorithm with  ρ = 0 and ρ = 1recovers single and complete linkage, respectively.

For any pair of candidate cluster merges  (C1, C2)and  (C′1, C′2), where  C1, C2, C′1and  C′2are clusters, there is a critical parameter value c such that  dρ(C1, C2) = dρ(C′1, C′2)only when  ρ = c. To simplify notation in the rest of this section, we let  c(C1, C2, C′1, C′2) = ∆min/(∆min − ∆max), where  ∆min =dmin(C′1, C′2) − dmin(C1, C2) and ∆max = dmax(C′1, C′2) − dmax(C1, C2), denote the critical parameter.

First, we show how to obtain semi-bandit feedback for this family of linkage algorithms by running a single algorithm in the family. Our modified algorithm maintains an interval  (ρmin, ρmax)with the invariant that at any iteration, for all parameters  ρ′ ∈ (ρmin, ρmax), the algorithm would make the same merges that have been made so far. Pseudocode for this procedure is given in Algorithm 3

image

Lemma 10. Consider a clustering instance with distance matrix  D ∈ Rn×n. Algorithm 3 runs in time  O(n3).Moreover, there is a feedback system  A1, . . . , AMpartitioning  [0, 1] into M = O(n8)intervals such that the cluster tree output by the algorithm is constant for  ρ ∈ Ai. When run with parameter  ρ, in addition to the cluster tree T, the algorithm outputs the interval  Aicontaining  ρ.

Proof. The algorithm performs  n − 1merges. For each merge, the algorithm makes two passes through the O(n2)clusters in order to find the closest pair, as well as to update the interval  (ρmin, ρmax). These passes both require us to compute the  dρdistance between all pairs of clusters. However, starting from the input matrix D, we can maintain two distance matrix  Dmin and Dmax storing the minimum and maximum distances between the current set of clusters, respectively. After merging two clusters, these distance matrices can be updated in O(n) time, since at most O(n) distances change. It follows that finding the closest pair of clusters and updating the interval  (ρmin, ρmax)can be done in  O(n2)time per merge. This leads to a total running time of  O(n3).

Balcan et al. [9] prove that there exists a partition  A1, . . . , AM of C into M = O(n8)intervals such that the algorithm output is constant for  ρ ∈ Ai. In particular, for any pair of possible cluster merges  (C1, C2)and  (C′1, C′2) with dmin(C1, C2) < dmin(C′1, C′2), the algorithm prefers to merge  C1 and C2 over C′1 and C′2for all values of the parameter  ρ < c(C1, C2, C′1, C′2). Moreover, since  c(C1, C2, C′1, C′2)only depends on 8 points—the closest and farthest pairs of points between  C1and  C2and between  C′1and  C′2—and there are only  O(n8)ways to select 8 points, these critical parameter values partition  C into the M = O(n8)intervals. For  ρ ∈ Ai, the ordering on all possible merges is fixed, so the algorithm will output the same cluster tree.

Finally, on each iteration of the algorithm, we iterate through all  O(n2)pairs of clusters  (C′1, C′2)that the algorithm did not merge. For each, we calculate the critical parameter value  c(C1, C2, C′1, C′2), which isthe value of  ρat which the algorithm would prefer to merge  (C′1, C′2)over  (C1, C2). We shrink the interval (ρmin, ρmax)so that it does not contain any of these critical values. It follows that the interval  (ρmin, ρmax)satisfies the following invariant: for all  ρ′ ∈ (ρmin, ρmax), the sequence of cluster merges made by the algorithm with parameter  ρ′up until the current iteration would match those made by the algorithm with parameter  ρ. In particular, when the algorithm returns, we are guaranteed that the same cluster tree would be output for all parameter values  ρ′ ∈ (ρmin, ρmax). Since the endpoints  ρminand  ρmaxalways belong to the M = O(n8)critical parameter values, there are at most  M = O(n8)intervals the algorithm might output for a fixed clustering instance.

Similarly to the knapsack example, the most direct approach for obtaining full-information feedback is to first calculate a set of  O(n8)critical parameter values arising from all  O(n8)subsets of 8 points and to run ρ-linkage once for each interval in the corresponding partition. By using a priority queue to maintain the distances between clusters, it is possible to implement  ρ-linkage in  O(n2 log n)time. This leads to a total running time of  O(n10 log n), which is significantly higher than the  O(n3)running time in Lemma 10. Note that using a priority queue in Algorithm 3 does not reduce the running time to  O(n2 log n), since updating the interval  (ρmin, ρmax)requires a linear pass through all  O(n2)pairs of clusters, so finding the closest pair faster does not reduce the running time.

Next, we provide a dispersion analysis for selecting the parameter  ρof Algorithm 3 when the clustering instances are chosen by a smoothed adversary. In particular, we suppose that on each round the adversary chooses a distance matrix  D(t)where each distance  d(t)ijis  κ-smooth and takes values in [0, B], for some maximum distance B. The quantity  B/(1/κ) = Bκroughly captures the scale of the perturbations relative to the true distances. Our analysis leads to regret that depends on  Bκonly logarithmically and give good bounds even for exponentially small perturbations.

Fix any clustering loss function  g : Rn×n × CLUSTERTREES → [0, 1], where g(D, T) measures the cost of cluster tree T for distance matrix D. For example, g(D, T) could be the k-means cost of the best k-pruning of the tree T or the distance to a ground-truth target clustering. We study the loss functions given by ℓt(ρ) = g(D(t), A(D(t); ρ)), where  A(D; ρ)denotes the output cluster tree of Algorithm 3 run on distance matrix D with parameter  ρ.

Lemma 11. Consider an adversary choosing clustering instances where the  tthinstance has symmetric distance matrix  D(t) ∈ [0, B]n×nand for all  i ≤ j, d(t)ijis  κ-smooth. The loss functions  ℓ1, ℓ2, . . .defined above are piecewise constant and f-dispersed for  f(T, ϵ) = 32Tϵn8κ2M2 + O(�T log(Tn))and  β-dispersed for  β = 1/2.

Proof sketch. In the proof of Lemma 10, we showed that for each time t, there are  O(n8)critical parameter values partitioning C into regions so that the algorithm output is constant on each region. Since the loss  ℓtonly depends on  ρthrough the algorithm output,  ℓtis also piecewise constant with at most  O(n8) pieces.

Moreover, we argued that every discontinuity of  ℓtoccurs at a critical parameter value of the form c = (d(t)rr′ − d(t)ii′ )/(d(t)jj′ − d(t)ii′ + d(t)rr′ − d(t)ss′)where  i, i′, j, j′, r, r′, s, s′are 8 point indices. Similarly to the knapsack example, we show that each critical parameter value is random and has a density function bounded by  16(κB)2. From this, it follows that for any interval I of radius  ϵ, summing over all times t = 1, . . . , T and all subsets of 8 points, we have that the expected total number of critical values that land in interval I is at most  32Tϵ(κB)2. This also bounds the expected number of functions  ℓ1, . . . , ℓTthat are not constant on I. By Theorem 6, the functions are f-dispersed for  f(T, ϵ) = 32Tϵ(κB)2 +�T log(Tn) = ˜O(Tϵ +√T),also implying 12-dispersion.

There are several cases when bounding the density of the critical value c, depending on whether any of the 4 distances correspond to the same entry in the distance matrix D. We give the argument for the case when all 4 distances are distinct entries and therefore independent. The remaining cases are similar and considered in Appendix C. Let  X = drr′ − dii′and  Y = djj′ − dss′so that c = X/(X + Y ). The variables X and Y are independent. Since X and Y are each the sum of  κ-smooth random variables, Lemma 20 implies that they are each have  κ-bounded densities. Using the fact that  |X + Y | ≤ 2B, applying Lemma 22 implies that the ratio  c = X/(X + Y ) has a 16(κB)2 bounded density, as required.

Running Algorithm 1 using the semi-bandit feedback returned by Algorithm 3, we obtain the following:

Corollary 12. Under the same conditions as Lemma 11, using Algorithm 1 to tune the parameter  ρ ∈ [0, 1]of Algorithm 3 under semi-bandit feedback has expected regret bounded by

image

In this work, we provide the first online optimization algorithm for piecewise Lipschitz functions under semi-bandit feedback with regret bounds that depend on the dispersion of the loss functions. We also give general tools for verifying dispersion in applications with exponentially tighter bounds than prior work. Finally, we apply our results to two algorithm selection problems. We obtain the first online algorithm selection procedure for a family of linkage-based clustering algorithms, and an online algorithm selection procedure for a greedy family of knapsack algorithms that is more efficient and has better regret bounds than prior work. A cornerstone of our results is that, for many algorithm selection problems, semi-bandit feedback can be obtained as efficiently as bandit-feedback and is sufficient for our algorithms to achieve nearly the same regret bounds as under full-information feedback. Our results largely mitigate the tradeoff between computational efficiency and good regret bounds suffered by prior online algorithm selection approaches, making online algorithm selection more practical.

Acknowledgements. This work was supported in part by NSF grants CCF-1535967, IIS-1618714, DMS-1700365, an Amazon Research Award, a Microsoft Research Faculty Fellowship, and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.

[1] Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Annual Conference on Learning Theory, volume 40. Microtome Publishing, 2015.

[2] Noga Alon, Nicol`o Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput., 46(6):1785–1826, 2017. doi: 10.1137/140989455. URL https://doi.org/10.1137/140989455.

[3] Jean-Yves Audibert, S´ebastien Bubeck, and G´abor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, March 2014.

[4] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.

[5] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. In Information Processing Letters, 2012.

[6] Pranjal Awasthi, Maria-Florina Balcan, and Konstantin Voevodski. Local algorithms for interactive clustering. In ICML, 2014.

[7] Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. In SIAM Journal on Computing, 2016.

[8] Maria-Florina Balcan, Nika Haghtalab, and Colin White. k-center clustering under perturbation resilience. In Proceedings of the Annual International Colloquium on Automata, Languages, and Programming (ICALP), 2016.

[9] Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. Proceedings of the Conference on Learning Theory (COLT), 2017.

[10] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In ICML, 2018.

[11] Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In FOCS, 2018.

[12] Maria-Florina Balcan, Travis Dick, and Colin White. Data-driven clustering via parameterized lloyd’s families. In NeurIPS, 2018.

[13] Maria-Florina Balcan, Tuomas Sandholm, and Ellen Vitercik. A general theory of sample complexity for multi-item profit maximization. In ACM EC, 2018.

[14] Shai Ben-David, David Pal, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009.

[15] Yves Caseau, Franc¸ois Laburthe, and Glenn Silverstein. A meta-heuristic factory for vehicle routing problems. In International Conference on Principles and Practice of Constraint Programming, pages 144–158. Springer, 1999.

[16] Vincent Cohen-Addad and Varun Kanade. Online Optimization of Smoothed Piecewise Constant Functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[17] Jim Demmel, Jack Dongarra, Victor Eijkhout, Erika Fuentes, Antoine Petitet, Rich Vuduc, R Clint Whaley, and Katherine Yelick. Self-adapting linear algebra algorithms and software. Proceedings of the IEEE, 93(2):293–312, 2005.

[18] Anna Grosswendt and Heiko Roeglin. Improved analysis of complete linkage clustering. In European Symposium of Algorithms, 2015.

[19] Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992–1017, 2017.

[20] Andr´as Gy¨orgy, Tam´as Linder, and Gy¨orgy Ottucs´ak. The shortest path problem under partial monitoring. In COLT, 2006.

[21] Satyen Kale, Lev Reyzin, and Robert E. Shapire. Non-stochastic bandit slate problems. In NeurIPS, 2010.

[22] Robert Kleinberg, Kevin Leyton-Brown, and Brendan Lucier. Efficiency through procrastination: Approximately optimal algorithm configuration with runtime guarantees. In IJCAI, 2017.

[23] Kevin Leyton-Brown, Eugene Nudelman, and Yoav Shoham. Empirical hardness models: Methodology and a case study on combinatorial auctions. Journal of the ACM (JACM), 56(4):22, 2009.

[24] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. In Machine Learning, 1998.

[25] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.

[26] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NeurIPS, 2011.

[27] John R Rice. The algorithm selection problem. In Advances in computers, volume 15, pages 65–118. Elsevier, 1976.

[28] Mehreen Saeed, Onaiza Maqbool, Haroon Atique Babri, Syed Zahoor Hassan, and S. Mansoor Sarwar. Software clustering techniques and the use of combined algorithm. 2003.

[29] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

[30] Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.

[31] Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. PhD thesis, Universit´e Paris Sud-Paris XI, 2005.

[32] Gell´ert Weisz, Andr´as Gy¨orgy, and Csaba Szepesv´ari. CapsAndRuns: An improved method for approximately optimal algorithm configuration. In ICML 2018 AutoML Workshop, 2018.

[33] Gell´ert Weisz, Andr´as Gy¨orgy, and Csaba Szepesv´ari. Leapsandbounds: A method for approximately optimal algorithm configuration. In ICML, 2018.

[34] James R. White, Saket Navlakha, Niranjan Nagarajan, Mohammad-Reza Ghodsi, Carl Kingsford, and Mihai Pop. Alignment and clustering of phylogenetic markers—implications for microbial diversity studies. In BCM Bioinformatics, 2010.

[35] Lin Xu, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Satzilla: portfolio-based algorithm selection for sat. Journal of artificial intelligence research, 32:565–606, 2008.

Problem Transformations to Obtain  r0-interior Minimizers. Recall that a sequence of loss functions ℓ1, ℓ2, . . .has an  r0-interior minimizer if with probability 1, for all times T we have that there exists  ρ∗ ∈argminρ∈C�Tt=1 ℓt(ρ) such that B(ρ∗, r0) ⊂ C. We can usually modify a sequence of loss functions to obtain an equivalent optimization problem that is guaranteed to have an  r0-interior minimizer. For example, when the parameter space C is convex (e.g., a cube in  Rd, which covers most algorithm configuration applications), we can apply the following transformation: define an enlarged parameter space  C′ = �ρ∈C B(ρ, r0)and a modified sequence of loss functions  ℓ′t : C → [0, 1]given by  ℓ′t(ρ′) = ℓt(ΠC(ρ′)), where  ΠCdenotes the Euclidean projection onto C. Using the fact that projections onto convex sets are contractions, it follows that the sequence  ℓ′1, ℓ′2, . . .is also L-Lispchitz and f-dispersed. Moreover, it has an  r0-interior minimizer and any sequence of parameters  ρ′1, ρ′2, · · · ∈ C′can be converted into  ρ1, ρ2, · · · ∈ Cby taking  ρt = ΠC(ρ′t). This guarantees that  ℓt(ρt) = ℓ′t(ρ′t)for all t. In particular, an algorithm with low regret playing against ℓ′1, ℓ′2, . . .can be converted into one that plays against  ℓ1, ℓ2, . . .with an identical regret bound. The cost of this transformation is that it increases the diameter of the parameter space C by 2r. Our regret bounds have logarithmic dependence on the diameter of C.

Theorem 5. Let  C ⊂ Rdbe contained in a ball of radius R and  ℓ1, ℓ2, · · · : C → [0, 1]be piecewise L-Lipschitz functions that are f-dispersed with an  r0-interior minimizer. Moreover, suppose the learner gets semi-bandit feedback and, on each round t, the feedback system  A(t)1 , . . . , A(t)M has Mfeedback sets. For any r ∈ (0, r0], running Algorithm 1 with  λ =�d log(R/r)/(TM)satisfies the following regret bound:

image

Proof of Theorem 5. For the majority of the proof we consider an arbitrary sequence of piecewise Lipschitz loss functions  ℓ1, . . . , ℓTwith an  r0-interior minimizer. We will only suppose they are f-dispersed in the final steps of the proof.

Following the proof of the Exp3-Set algorithm of Alon et al. [2], we will upper and lower bound the quantity  E[log(WT+1/W1)]. Our upper bound will be in terms of the learner’s total expected loss, while the lower bound will be in terms of the expected total loss of the optimal parameter in hindsight. Dispersion plays a crucial role in the lower bound, since it allows us to guarantee that a set of parameters with non-trivial volume has nearly optimal total loss. Combining these bounds and then finally taking the expectation of the bound for a sequence of losses  ℓ1, . . . , ℓT that are f-dispersed will give the final bound.

Upper Bound. Consider the ratio of consecutive normalizing constants  Wt+1/Wt. Using the definition of

image

Next, using that  e−z ≤ 1 − z + z2/2 for all z ≥ 0, we have

image

Using the fact that  1 − z ≤ exp(−z) for all z ≥ 0and taking the product over t = 1, . . . , T, we have

image

Taking logs, we have

image

Next, we will take the expectation of the above bound to simplify the two integrals. Recall that for each time t, we let  A(t)1 , . . . , A(t)Mbe the feedback system and for any  ρ ∈ Cand let  A(t)(ρ)denote the set  A(t)isuch that  ρ ∈ A(t)i. Recall that the importance-weighted losses  ˆℓtwere constructed to ensure that for any time t and any fixed  ρ ∈ C, we have Et[ˆℓt(ρ)] = ℓt(ρ). Therefore,

image

The integral in the final expectation is the definition of  Et[ℓt(ρt)], which gives  E��C pt(ρ)ˆℓt(ρ) dρ� =E<t[Et[ℓt(ρt)]] = E[ℓt(ρt)]. Therefore, we have

image

which is the total expected loss of the algorithm on the first T rounds. Now we turn to simplifying the expectation of the second integral in (1). For any  ρ ∈ Cand any time t, we have

image

Using the fact that  ρ ∈ A(t)(ρ′)if and only if  ρ′ ∈ A(t)(ρ), we can upper bound the integral as follows:

image

This implies that

image

Finally, we evaluate the integral by writing it as the sum of integrals over the feedback sets  A(t)1 , . . . , A(t)M, which is justified since these sets partition C. In particular, we have

image

Putting it together, we have

image

Taking the expectation of (1) and using the calculations given by (2) and (3), we have

image

Lower Bound. Next, let  ρ∗ ∈ argminρ∈C�Tt=1 ℓt(ρ)be such that  B(ρ∗, r0) ⊂ Cand fix any radius  r ≤ r0. Using the fact that  W1 =�C 1 dρ = Vol(C)and the weights  wT+1(ρ)are positive, we have

image

Taking the log of this bounds gives

image

At this point it is tempting to apply dispersion to lower bound the term  exp�−λ �t ˆℓt(ρ)�in terms of exp�−λ �t ˆℓt(ρ∗)�. In particular, if at each time t the feedback system  A(t)1 , . . . , A(t)Mcorresponds to the piecewise Lispchitz partitioning of C for the loss function  ℓt, then the estimated loss function ˆℓthas a subset of the discontinuities of  ℓt. In this case, the estimated losses ˆℓ1, ˆℓ2, . . . are also f-dispersed for the same function f as the true losses. However, when the feedback system at around t does not match the piecewise Lipschitz partition, we would require a separate dispersion analysis for the sequence of estimated losses ˆℓ1, ˆℓ2, . . . . Themore serious challenge, however, is that the importance weight  1/pt(A(t)(ρt))in the definition of ˆℓt causes itto take values in the range  [0, 1/pt(A(t)(ρt))], which is much larger than the true losses which take values in [0, 1]. Moreover, the Lipschitz parameter of the estimated loss  ℓtis  L′ = L/pt(A(t)(ρt)). This larger loss range and Lipschitz constant lead to a worse final regret bound. Instead, we defer applying dispersion until after taking expectations so that we can use the dispersion properties of the true losses  ℓ1, ℓ2, . . . directly.

Towards this end, we first use Jensen’s inequality to simplify the above bound. Let  h : C → Rbe any function and  S ⊂ Cbe any subset of the parameter space. Then, using the fact that log is concave, we can apply Jensen’s inequality to obtain the following bound:

image

Applying this inequality to our lower bound on  log WT +1W1with  h(ρ) = −λ �Tt=1 ˆℓt(ρ)and  S = B(ρ∗, r)gives

image

Next, since C is contained in a ball of radius R and the volume of a ball of radius  R in Rd is proportional to Rd, it follows that the volume ratio is at least  (r/R)d. Taking expectations, we have

image

where we used the fact that for any fixed  ρ ∈ C, we have  E[ˆℓt(ρ)] = E<t[Et[ˆℓt(ρ)]] = ℓt(ρ). Finally, we will upper bound the sum of losses �Tt=1 ℓt(ρ)for points in the ball  B(ρ∗, r)in terms of the number of non- Lipschitz functions on that ball. let  D =��{1 ≤ t ≤ T | ℓt is not L-Lipschitz on  B(ρ∗, r)}��be the number of functions among  ℓ1, . . . , ℓTthat are not L-Lipschitz on the ball  B(ρ∗, r). Then for any  ρ ∈ B(ρ∗, r), wehave

image

The integral�B(ρ∗,r) 1Vol(B(ρ∗,r))�Tt=1 ℓt(ρ) dρis the average total loss of the parameters  ρ ∈ B(ρ∗, r), which is at most �tt=1 ℓt(ρ∗) + TLr + D. Substituting this into our bound gives

image

Combined bound. Combining the upper and lower bounds and rearranging, we have

image

Finally, now suppose that the functions  ℓ1, . . . , ℓTare a random sequence that satisfy f-dispersion. To bound E[D], let ℓ1, . . . , ℓT be f-dispersed and  ρ∗ be any r0-interior minimizer. Then we have

image

where this expectation is taken over the draw of the loss functions  ℓ1, . . . , ℓT. This leads to the bound

image

where now the expectation is taken over both the algorithm’s randomness and the random sampling of the loss functions  ℓt. The specific bounds given in the theorem statement follow by substituting the chosen value of  λ.

A.1 Optimizing Utilities and H-Bounded Losses

We note that the regret bound obtained in Theorem 5 for Algorithm 1 can also be used to obtain similar results in two closely related settings. First, if we instead have piecewise Lipschitz utility functions  u1, u2, · · · :C → [0, 1]and our goal is to maximize utility rather than minimize loss, we can transform this into a loss-minimization problem by minimizing the losses given by  ℓt(ρ) = 1 − ut(ρ). This transformation preserves the regret of any algorithm, the feedback system at each round, and the piecewise Lipschitz and dispersion properties of the functions. Similarly, if the losses take values in [0, H] for some known maximum loss H, instead of [0, 1], the learner can preprocess the losses to fall in [0, 1] by dividing them by H. The rescaled functions take values in [0, 1] and have Lipschitz constant  L′ = L/H. Then expected regret of Algorithm 1 with respect to the unscaled loss functions is  O(H�TMd log(R/r) + Hf(T, r) + TLr).

Lemma 13. Let  u1, u2, · · · : C → [0, H]be a sequence of utility functions that are each piecewise L-Lipschitz and f-dispersed. Define a corresponding sequence of losses  ℓ1, ℓ2, · · · : C → [0, H]given by ℓt(ρ) = H − ut(ρ). The functions  ℓ1, ℓ2, . . .are also piecewise L-Lispchitz and f-dispersed.

Proof. First, consider any time t. Since  ut : C → [0, H]is piecewise L Lipschitz, by definition we know that there is a partition  C1, . . . , CMof C so that for each  i ∈ [M]and any  ρ, ρ′ ∈ Ci, we have |ut(ρ) − ut(ρ′)| ≤ L · ∥ρ − ρ′∥2. We will argue that the loss function  ℓtis also piecewise L-lispschitz on the same partition. Fix any  i ∈ [M]and any pair of points  ρ, ρ′ ∈ Ci. Then we have that

image

where the last inequality follows from the fact that  ut is L-Lipschitz restricted to  Ci. It follows that  ℓt is alsopiecewise L-Lipschitz and has the same piecewise Lipschitz partition. This holds for all times t.

Next, we argue that whenever the utility functions  u1, u2, . . .are f-dispersed, so are the loss functions ℓ1, ℓ2, . . .. For any time horizon  T, radius ϵ > 0, and parameter  ρ ∈ C, define

image

Following an identical argument as in the first part, with probability 1, whenever  ut is L-Lipschitz on the ball B(ρ, ϵ), so is the function  ℓt. From this, it follows that  Du(T, ϵ, ρ) = Dℓ(T, ϵ, ρ)for all  T ∈ N, ϵ > 0, and ρ ∈ C. Finally, since the functions  u1, u2, . . . were f-dispersed, we have that for all  T ∈ Nand all radiuses

image

and the loss functions  ℓ1, ℓ2, . . . are also f-dispersed.

A.2 Efficient Implementations via Interval Trees

In this section we show how to use the modified interval tree data structure of Cohen-Addad and Kanade [16] to implement the continuous Exp3-SET algorithm efficiently for one-dimensional problems with piecewise constant loss functions. In particular, the per-round cost of updating the algorithm weights and sampling from them at time t is only O(log(t)), while a direct implementation has running time given by O(t) instead. We also show how to use interval trees to implement the Exp3 algorithm on a set of N arms with per-round running time that is O(log N), which implies that a discretization-based algorithm in the bandit setting can be efficiently implemented even in high dimensions.

Interval Tree Summary. Cohen-Addad and Kanade [16] introduce a modified interval tree data structure used for representing piecewise constant functions mapping R to R. Their data structure represents the function as a balanced tree with one leaf corresponding to each constant interval of the function. It supports two main operations called DRAW and UPDATE:

The DRAW procedure returns a sample from the density function that is proportional to the represented piecewise constant function f.

The UPDATE procedure takes an interval [a, b) and an update u, and updates the represented piecewise function by multiplying the function values in [a, b) by u. That is, if the represented function was originally  f : R → R, after executing UPDATE with interval [a, b) and update u, the resulting function is

image

Cohen-Addad and Kanade [16] show that the operations DRAW and UPDATE can be implemented in O(log P) time, where P is the number of constant pieces in the represented function. The data structure also makes it possible to implement a third procedure INTEGRATE in O(log P) time, which takes an interval [a, b) and returns the integral of the represented represented function on the interval [a, b).

Exp3-Set for Piecewise Constant One Dimensional Problems. First, we show how to efficiently implement Algorithm 1 efficiently for one-dimensional optimization problems with piecewise constant loss functions. We simply use the interval tree datastructure of Cohen-Addad and Kanade [16] to represent the weight function at each round. Pseudocode is given in Algorithm 4.

Lemma 14. Consider an online optimization problem with loss functions  ℓ1, ℓ2 : [0, 1] → [0, 1]that are piecewise constant. Moreover, suppose that on each round t, the loss  ℓtis constant on each of the feedback sets  A(t)i. For such a problem, Algorithm 4 is equivalent to Algorithm 1. The overhead of sampling and updating the weights on round t takes O(log t) time.

Proof. On each round we run UPDATE once to update the interval tree. This at most increases the number of constant intervals in the weights by 2, since the only constant intervals that might get split are the two containing the end points of the feedback set  At. Therefore, on round t, the weight function is piecewise constant with at most O(2t) intervals. It follows that the sampling, integration, and update operations all take O(log t) time, giving a total per-round cost of O(log t).

image

Theorem 6. Let  ℓ1, ℓ2, · · · : R → Rbe independent piecewise L-Lipschitz functions, each having at most K discontinuities. Let

image

be the number of functions in  ℓ1, . . . , ℓTthat are not L-Lipschitz on the ball  [ρ − ϵ, ρ + ϵ]. Then we have

image

Proof. For simplicity, we assume that every function has exactly K discontinuities. For each function  ℓt, letα(t) ∈ RK denote the vector whose entries are the discontinuity locations of  ℓt. That is, ℓthas discontinuities at  α(t)1 , . . . , α(t)K, but is otherwise L-Lispchitz. Since the functions  ℓ1, ℓ2, . . .are independent, the vectors α(1), α(2), . . .are also independent.

For any interval  I ⊂ R, define the function  fI : RK → {0, 1} by

image

The sum �Tt=1 fI(α(t))counts the number of vectors  α(1), . . . , α(T)that have a component in the interval I or, equivalently, the number of functions  ℓ1, . . . , ℓTthat are not L-Lipschitz on I. We will apply VCdimension uniform convergence arguments to the class  F = {fI : RK → {0, 1} | I ⊂ Ris an interval}. In particular, we will show that for an independent set of vectors  α(1), . . . , α(T), with high probability we have that 1T�Tt=1 fI(α(t))is close to  E� 1T�Tt=1 fI(α(t)�for all intervals I. This uniform convergence argument will lead to the desired bounds.

We begin by bounding the VC-dimension of F by O(log K). The key observation is the following connection between F and the class of indicator functions for intervals: let  S = {x(1), . . . , x(n)} ⊂ RKbe any collection of n vectors in  RKand let  P = {x(1)1 , . . . , x(1)K , . . . , x(n)1 , . . . , x(n)K } ⊂ Rdenote the set containing the union of their combined nK component values. Now consider any pair of intervals I and I′. If the indicator functions for I and  I′agree on all the points in P (i.e., the intervals contain exactly the same subset of P), then we must have that  fIand  fI′agree on every vector in S. This is because if I and I′contain exactly the same subset of P, then for each vector  x(i), both intervals contain the same subset of its component values. In particular, either they both contain none of the components, or they both contain at least one. In either case, we have that  fI(x(i)) = fI′(x(i)). This shows that the number of distinct ways that functions from the class F can label the set of vectors S is at most the number of ways that indicator functions for intervals can label the set of points P.

Now suppose that the VC-dimension of F is V . Then there exists a set  S ⊂ RK of vectors of size V that is shattered by  F. Let P ⊂ Rbe the set containing the union of their combined V K components (as above). From Sauer’s Lemma together with the fact that the VC-dimension of intervals is 2, we are guaranteed that indicator functions for intervals can label the set P of points in at most  (eV K)2 distinct ways. By the above reasoning, it follows that F can label the set S of vectors in at most  (eV K)2 distinct ways. On the other hand, since F shatters S, we know that it can label S in all  2Vpossible ways, and it follows that  2V ≤ (eV K)2. Taking logs on both sides and rearranging, we have  V ≤ 2ln 2 ln(V ) + 2 ln(eK)ln 2. Using the fact that for any a ≥ 1and  b ≥ 0, the inequality  y ≤ a ln(y) + bimplies that  y ≤ 4a ln(2a) + 2b, we further have that V ≤ 8 ln(4/ ln 2)ln 2 + 4 ln(eK)ln 2 = O(log K), as required.

Applying VC-dimension uniform convergence arguments for the class F, for any failure probability δ > 0, if  x(1), . . . , x(T) ∈ RKare independent random vectors (but not necessarily identically distributed), then following holds with probability at least  1 − δsimultaneously for all  fI ∈ F:

image

In particular, for any point  ρand any radius  ϵ, we have that  D(T, ϵ, ρ) = �Tt=1 fI(α(t)), where I = [ρ − ϵ, ρ + ϵ]. Therefore, uniform convergence for F implies that for all  T ∈ N and all ϵ > 0, and any failure probability  δ > 0, we have that with probability at least  1 − δthe following holds for all  ρ ∈ R:

image

Multiplying both sides by T and rearranging gives

image

Taking the maximum of both sides over  ρ ∈ R, we have

image

This is a high probability bound on the maximum number of non-Lipschitz functions among  ℓ1, . . . , ℓTfor any interval of radius  ϵ. All that remains is to convert this into a bound in expectation. Let  δ = 1/√T and let

G denote the high-probability uniform convergence event above. Then we have

image

where the last inequality uses the facts that  Pr(G) ≤ 1and  E[maxρ∈R D(T, ϵ, ρ) | G] Pr(G) ≤ Tδ =√T. This argument holds for all  T and ϵ, proving the claim.

Next, we prove a weaker bound that follows from the analysis techniques of Balcan et al. [11].

Lemma 15. Let  ℓ1, ℓ2, · · · : R → Rbe independent piecewise L-Lipschitz functions, each having at most K discontinuities. Let  D(T, ϵ, ρ) = ��{1 ≤ t ≤ T | ℓt is not L-Lipschitz on  [ρ − ϵ, ρ + ϵ]}��be the (random) number of functions in  ℓ1, . . . , ℓTthat are not L-Lipschitz on the ball  [ρ − ϵ, ρ + ϵ]. Moreover, let ˜D(T, ϵ, ρ) =��{(t, i) ∈ [T] × [K] | α(t)i ∈ [ρ − ϵ, ρ + ϵ]}��, where  α(t) ∈ RKis the vector of discontinuities of the loss  ℓt. That is, ˜D(T, ϵ, ρ)is the number of discontinuities of the functions  ℓ1, . . . , ℓTin the ball [ρ − ϵ, ρ + ϵ]. Then we have

image

Note that, using the notation of Lemma 15, we always have  D(T, ϵ, ρ) ≤ ˜D(T, ϵ, ρ) ≤ KD(T, ϵ, ρ). It follows that Lemma 15 is looser than Theorem 6 in two ways: first, the error term is a factor K larger. Second, the upper bound of Lemma 15 multiply-counts functions that have repeated discontinuities in the same ball, while our sharper bound does not.

Proof. For simplicity, we assume that every function  ℓthas exactly K discontinuities. The proof techniques can be generalized to the case where each functions has at most K discontinuities.

For each time t, let  α(t) ∈ RKbe the vector of discontinuities of  ℓt. That is,  ℓthas discontinuities at the points  α(t)1 , . . . , α(t)Kand is otherwise L-Lipschitz. The key challenge is that the discontinuity locations α(t)1 , . . . , α(t)K are not independent.

Fix any discontinuity index  i ∈ [K]and define ˜Di(T, ϵ, ρ) =��{1 ≤ t ≤ T | α(t)i ∈ [ρ − ϵ, ρ + ϵ]}��. That is, ˜Di(T, ϵ, ρ)counts the number of times t for which the  ithdiscontinuity  α(t)i of ℓtlands in the interval of radius  ϵcentered on  ρ. Then we have that ˜D(T, ϵ, ρ) = �i ˜Di(T, ϵ, ρ)counts the total number of discontinuities that belong to the interval of radius  ϵcentered on  ρ. Since the function  ℓtis not L-Lipschitz on an interval I only when I contains some discontinuity for  ℓt, we have

image

Next we will apply uniform convergence arguments to obtain high probability bounds on each ˜Di(T, ϵ, ρ)in terms of their expectations. Fix a discontinuity index  i ∈ [K]. The set of discontinuity locations α(1)i , . . . , α(T)iare independent and, since intervals have VC-dimension 2, applying standard uniform convergence guarantees implies that for any  δ > 0, with probability at least  1 − δthe following holds for all ρ:

image

Setting the failure probability to be  1/(K√T), taking the union bound over all K discontinuities, and summing the resulting bounds, the following holds with probability at least  1 − 1/√T for all ρ:

image

Using the fact that  D(T, ϵ, ρ) ≤ ˜D(T, ϵ, ρ)and taking the supremum over  ρ, the following holds with probability at elast  1 − 1/√T.

image

Let G denote the high-probility uniform convergence event above. Then we have

image

as required.

Lemma 11. Consider an adversary choosing clustering instances where the  tthinstance has symmetric distance matrix  D(t) ∈ [0, B]n×nand for all  i ≤ j, d(t)ijis  κ-smooth. The loss functions  ℓ1, ℓ2, . . .defined above are piecewise constant and f-dispersed for  f(T, ϵ) = 32Tϵn8κ2M2 + O(�T log(Tn))and  β-dispersed for  β = 1/2.

Proof. The key insight of Balcan et al. [9] for this family of algorithms is that for a fixed distance matrix D, the function  ρ �→ Aρ(D)is piecewise constant with at most  O(n8)pieces. That is, the algorithm will only output at most  O(n8)different cluster trees, and each is produced for some subinterval of the parameter space. Their argument is as follows: for any pair of candidate cluster merges, say merging clusters  C1 and C2versus  C′1and  C′2, we can determine the values of the parameter  ρ ∈ [0, 1]for which the algorithm would prefer to merge  (C1, C2)instead of merging  (C′1, C′2)(i.e., the values of  ρso that the  dρdistance between C1and  C2is smaller than between  C′1and  C′2). In particular, the algorithm will merge clusters  C1and  C2instead of  C′1 and C′2 if dρ(C1, C2) ≤ dρ(C′1, C′2)or, equivalently, when

image

Since the above inequality is linear in  ρ, there is a single critical value of the parameter, given by

image

such that the relative preference of merging  C1and  C2or  C′1and  C′2changes only at  ρ = c. Moreover, the definition of c only depends on a collection of 8 points: the closest and farthest pair between  C1 and C2 andbetween  C′1 and C′2. In particular, every such critical parameter value c is given by

image

where  i, i′, j, j′, r, r′, s, s′ ∈ [n]are the indices of 8 points. Similarly to the knapsack example, we show that each critical parameter value is random and has a density function bounded by  16(κB)2. From this, it follows that for any interval I of radius  ϵ, the expected total number of critical values summing over all instances t = 1, . . . , T that land in interval I is at most  32Tϵ(κB)2. This also bounds the expected number of functions  ℓ1, . . . , ℓTthat are not constant on I. By Theorem 6, the functions are f-dispersed for f(T, ϵ) = 32Tϵ(κB)2 +�T log(Tn) = ˜O(Tϵ +√T), also implying 12-dispersion.

When the four distances present in the equation for c are distinct entries of the distance matrix D, then they are independent. However, it is possible that the closest and furthest pair of points between a pair of clusters can be the same, for example, when both clusters consist of just a single point. In this case, the corresponding distances are no longer independent, and we will need to modify our analysis slightly. Note a critical parameter c only arises for competing pairs of merges  (C1, C2)and  (C′1, C′2)that differ on at least one cluster (since otherwise both merges are identical). Moreover, since the set of clusters at any given round of the algorithm partition the data, any pair of clusters the algorithm encounters are either equal or disjoint. From this it follows that there are only four cases to consider depending on whether the closest and farthest pairs of points between  C1 and C2are the same, and whether the closest and farthest pairs of points between C′1 and C′2 are the same. That is, whether  (i, i′) = (j, j′)and whether  (s, s′) = (r, r′).

Case 1:  (i, i′) ̸= (j, j′) and (r, r′) ̸= (s, s′). Let X = drr′ − dii′ and Y = djj′ − dss′. Rewriting expression for c given in (4), we have that c = X/(X + Y ). Moreover, both X and Y are the sum of two independent random variables having  κ-bounded densities, so from Lemma 20, it follows that X and Y also have densities bounded by  κ. Next, since X and Y are independent, take values in  [−2M, 2M], and have  κ-bounded densities, Lemma 22 ensures that the ratio  X/(X + Y ) has an 16(κM)2 bounded density.

Case 2:  (i, i′) = (j, j′)and  (r, r′) ̸= (s, s′). In this case, we are guaranteed that  dii′ = djj′, and the expression for c simplifies to

image

Defining  X = −dii′, Y = −dss′, and  Z = drr′, we have that  β = (X + Z)/(Y + Z). The variables X, Y , and Z are independent, each have  κ-bounded densities, and  |Y | ≤ Mand  |Z| ≤ Mwith probability 1. Applying Lemma 23 to these random variables guarantees that the density function for  β is 4(κM)2-bounded.

Case 3:  (i, i′) ̸= (j, j′)and  (r, r′) = (s, s′). This case is symmetric to case 2 and an identical argument applies.

Case 4:  (i, i′) = (j, j′)and  (r, r′) = (s, s′). In this case, the  dρdistance between  C1and  C2is constant, as is the  dρdistance between  C′1 and C′2. Therefore, for all values of  ρwe will prefer to merge the same pair of clusters and there is no critical parameter value where we switch from one merge to the other.

In every case, the density of the critical parameter value  βis upper bounded by  16κ2M2, completing the proof.

image

Figure 1: Relationship between the binary search intervals [a, b] and [c, d] and the true interval  [ρ∗min, ρ∗max]on which  A(x, ρ′) outputs yρ.

C.1 Single Parameter Piecewise Unique Algorithms.

Next we provide a general approach for obtaining semi-bandit feedback that applies to many single-parameter algorithms. This enables semi-bandit feedback, but we still rely on problem-specific dispersion analysis. This approach applies to any algorithm with a single real-valued parameter whose output is both a piecewise constant function of the parameter for any instance, and such that no output value is repeated across any distinct intervals in the piecewise decomposition. We call such an algorithm single-parameter piecewise-unique. Without loss of generality, we assume that the parameter space is given by  C = [0, 1]. Let A : Π × [0, 1] → Ybe an algorithm mapping problem instances  x ∈ Πand parameters  ρ ∈ [0, 1]to outputs in some space Y. Given a parameter  ρ ∈ [0, 1]and a problem instance x, and an accuracy parameter  ϵ > 0, we will return both  A(x, ρ), together with an interval  I = [ρmin, ρmax]such that for all  ρ′ ∈ I we have A(x, ρ′) = A(x, ρ).Moreover, for any point  ρ′ ̸∈ [ρmin − ϵ, ρmin + ϵ], we have A(x, ρ′) ̸= A(x, ρ). In other words, the interval I output by the algorithm is nearly the largest piecewise constant interval containing  ρ. The high level idea of our approach is to run binary search twice to determine the upper and lower bounds  ρmaxand  ρmin, respectively. Each search will require that we run the algorithm  A at most O(log 1/ϵ)times. In cases where the algorithm parameters are specified using b bits of precision, then this procedure exactly determines the interval using O(b) invocations of the base algorithm. Pseudocode is given in Algorithm 5. Steps 3 and 4 perform binary search to find the upper bound on the constant interval, while steps 5 and 6 perform binary search to find the lower bound.

Lemma 16. Let  A : Π × [0, 1] → Ybe any single-parameter piecewise-unique algorithm and suppose  yρand  I = [ρmin, ρmax]is output by Algorithm 5 when run on A with problem instance  x ∈ Π, parameter ρ ∈ [0, 1], and target accuracy  ϵ. Then Algorithm 5 runs the base algorithm  A at most O(log 1/ϵ) times andwe have that  A(x, ρ′) = yρ for all ρ′ ∈ I, ρ ∈ I, and for all  ρ′ ̸∈ [ρmin − ϵ, ρmax + ϵ] we have A(x, ρ′) ̸= yρ.

Proof. From step 1 of the algorithm, we know that  A(x, ρ) = yρ, by definition. Since the algorithm is single-parameter and piecewise-unique, we know that  A(x, ρ′)will output  yρfor all  ρ′belonging to some interval  [ρ∗min, ρ∗max]containing  ρ, and it will not output  yρfor any point outside that interval. In particular, restricted to the interval  [ρ, 1], there is exactly one critical parameter value, namely  ρ∗maxbelow which the algorithm always outputs  yρand above which the algorithm always outputs something different. The binary search performed in step 3 guarantees that  ρ∗max is always contained in the interval [a, b], yet on each iteration the length of the interval is halved. Similarly, each iteration of the binary search in step 6 guarantees that ρ∗min ∈ [c, d], and the width of the interval halves on each iteration. Each iteration of both binary search instances requires us to run the base algorithm A once, and we will require  O(log 1/ϵ)iterations to guarantee the width of both intervals is less than  ϵ.

Since  a ≤ ρ∗maxand  d ≥ ρ∗min, we have  [a, d] ⊂ [ρ∗min, ρ∗max]and it follows that  A(x, ρ′) = yρfor all ρ′ ∈ [a, d], as required. Moreover, we know that  a + ϵ ≥ b ≥ ρ∗maxand  d − ϵ ≤ c ≤ ρ∗min, implying that A(x, ρ′) ̸= yρfor all  ρ′ ̸∈ [a − ϵ, d + ϵ], as required. Figure 1 depicts the relation between [a, b], [c, d], and [ρ∗min, ρ∗max]at the end of the algorithm.

image

In this section we summarize several useful results that provide upper bounds on the density of random variables that are obtained as functions of other random variables with bounded density functions. These results allow us to reason about the distribution of discontinuity locations that arise as transformations of random problem parameters in algorithm configuration instances.

In many cases, we make use of the following result:

Theorem 17 (Density Function Change of Variables). Let  X ∈ Rdbe a random vector with joint probability density function  fX : Rd → [0, ∞)and let  φ : Rd → Rnbe any bijective differentiable function. Then the random vector  Y = φ(X)also has a density function  fY : Rn → [0, ∞)given by fY (y) = | det(Jφ−1(y))|fX(φ−1(y)), where Jφ−1(y)denotes the Jacobian of  φ−1 evaluated at y.

Lemma 18 (Lemma 6 from [11]). Suppose X and Y are random variables taking values in (0, 1] and suppose that their joint distribution is  κ-bounded. Then the distribution of  Z = ln(X/Y ) is κ/2-bounded.

Lemma 19 (Lemma 8 from [11]). Suppose X is a random variable with a  κ-bounded density and suppose c is a constant. Then  Z = X/c has a cκ-bounded density

Lemma 20. Let X and Y be two independent random variables each having densities upper bounded by  κ.The random variable U = X + Y has density  fUsatisfying  fU(u) ≤ κ for all u.

Proof. Let  fX and fYbe the density functions for X and Y , respectively. The density for U is the convolution of  fX and fY. With this, we have

image

It follows that U = X + Y has a density that is upper bounded by  κ.

Lemma 21. Let X and Y be random variables with joint density  fXYthat is  κ-bounded and such that |Y | ≤ Mwith probability 1 and let U = X/Y . Then the density function  fU is κM2-bounded.

Proof. Consider the change of variables given by U = X/Y and V = Y . This corresponds to the transformation function  φ(x, y) = (x/y, y). The inverse of  φis given by  φ−1(u, v) = (uv, v). The Jacobian of  φ−1 is

image

whose determinant is always equal to v. Therefore, the joint density of U and V is given by

image

To get the marginal density for U, we integrate over v and use the fact that the density  fXY (x, y) = 0whenever |y| > M. This gives

image

It follows that the density for  U satisfies fU(u) ≤ κM2 for all u, as required.

Lemma 22. Let X and Y be independent random variables with  κ-bounded densities so that  |X| ≤ M and|Y | ≤ Mwith probability one and define Z = X/(X + Y ). The random variable Z has a density function fZ that is 4κ2M2-bounded.

Proof. Consider the change of variables given by U = X and V = X + Y . We will argue that the joint density  fUVis  κ2-bounded. Then, since  |X + Y | ≤ 2Mwith probability 1, we can apply Lemma 21 to ensure that the density of Z = U/V is bounded by  κ2(2M)2 = 4κ2M2, as required.

It remains to bound the joint density of U = X and V = X + Y . This change of variables corresponds to the transformation function  φ(x, y) = (x, x + y), whose inverse is given by  φ−1(u, v) = (u, v − u). TheJacobian of  φ−1 is given by

image

whose determinant is always 1. It follows that the joint density for (U, V ) is given by  fUV (u, v) =fXY (u, v − u) = fX(u)fY (v − u) ≤ κ2, as required.

Lemma 23. Let X, Y , and Z be independent random variables with  κ-bounded densities such that  |Y | ≤ M,and  |Z| ≤ Mwith probability one. Then the random variable  R = X+YZ+Yhas a density  fRthat satisfies fR(u) ≤ 4κ2M2.

Proof. Consider the change of variables given by U = X + Y , V = Z + Y . We will argue that the joint density  fUV for U and V is κ2-bounded. Then, since  |V | = |Z + Y | ≤ 2Mwith probability 1, we can apply Lemma 21 to ensure that the density of R = U/V is bounded by  4κ2M2, as required.

It remains to bound the joint density of U = X + Y and V = Z + Y . Consider the change of variables given by U = X + Y , V = Z + Y , and W = Y . This corresponds to the transformation function φ(x, y, z) = (x + y, z + y, y), and has inverse  φ−1(u, v, w) = (u − w, w, v − w). The Jacobian of  φ−1is given by

image

which always has determinant given by  −1. It follows that the joint density for (U, V, W) is given by

image

To get the joint density over only U and V we integrate over w:

image

as required.

In this section we provide a single algorithm that has regret bounds for the full-information, semi-bandit feedback, and bandit-feedback settings. The high level idea is to reduce the problem to a finite-armed bandit and apply the Exp3-SET algorithm of Alon et al. [2], which enjoys regret bounds in all three of these settings. In particular, we choose a value of r > 0 and construct a r-net for the parameter space C (which can be done using at most  (3R/r)dpoints when C is contained in a ball of radius R in d dimensions. Then we run the Exp3-SET algorithm over this finite set). Exp3-SET is guaranteed to have bounded regret compared to the best point from the r-net, and we use f-dispersion to bound the expected difference in total loss between the best discretized point and the best point in all of C by TLr + f(T, r). We then get bounds for each of the three feedback regimes by tuning the parameters  r and λof this single algorithm.

On each round, we assume that the learner observes a feedback set  At, together with the loss  ℓt(ρ)for all points  ρ ∈ At. In the full information setting, we take  At = Cto be the entire parameter space. In the bandit feedback setting, we take  At = {ρt}, where ρtis the point played by the learner in round t. And in the semi-bandit feedback setting, we take  Atto be the piece of  ℓtthat contains the point  ρtplayed by the learner. Using this notation, pseudocode is given in Algorithm 6.

image

Theorem 24. Let  C ⊂ Rdbe contained in a ball of radius R and  ℓ1, . . . , ℓTbe possibly random piecewise L-Lipschitz functions that are f-dispersed and have an  r0-interior optimum. Then running Algorithm 6 with discretization parameter  r ≤ r0, the following claims hold:

image

In each of the above bounds, the role of the dispersion parameter f is clean. For both the full-information and semi-bandit feedback settings, the only term in the bound that depends on  f is the f(T, 1/(L√T)) term.Similarly, for bandit feedback, the bound depends on f only through the  f(T, 1/T 1/(d−2))term. In either case, it is clear that the regret bound after T rounds depends on the dispersion properties in neighborhoods of radius  1/(L√T) and 1/T 1/(d−2), respectively. We know that the function  f(T, ϵ)should be non-decreasing in the parameter  ϵ, so this also shows that in one sense we require “less” dispersion in the bandit setting, since the bandit bounds depend on  f(T, 1/T 1/(d+2)) ≤ f(T, 1/√T)).

Proof. Let  ℓ1, . . . , ℓTbe the sequence of loss functions. By assumption, with probability one there is an optimal parameter in hindsight  ρ∗such that  B(ρ∗, r) ⊂ C. It follows that at least one of the discretization points  ˆρ1, . . . , ˆρNmust belong to this ball. Let  ˆρ∗ be such a point. We use f-dispersion to argue that the total loss of  ˆρ∗ is not much more than the total loss of  ρ∗ in expectation over the loss functions  ℓt. Let

image

be the number of loss functions that are not L-Lipschitz on  B(ρ∗, r). We know that �Tt=1 ℓt(ˆρ∗) − ℓt(ρ∗) ≤TLr + D, since D of the functions have discontinuities and the remaining  T − D < Tare L-Lipschitz. Taking the expectation over  ℓ1, . . . , ℓTand using the fact that  E[D] ≤ f(T, r), we have that

image

Therefore, it is sufficient to show that Algorithm 6 is competitive with  ρ∗in expectation over the algorithm randomness.

Algorithm 6 runs a special case of the Exp3-SET algorithm of Alon et al. [2] over the discretized set of parameters  ˆρ1, . . . , ˆρN. In particular, we have assumed that the feedback system takes a particular form: on each round the arms are partitioned into sets such that playing any arm from one set reveals the loss for all arms in the same set. Alon et al. [2] bound the regret of this algorithm for a broader class of feedback systems, where on every round there is an undirected graphs defined over the arms fo the bandit and playing an arm reveals not only its loss, but also the loss of all adjacent arms in the feedback graph  Gt. Our special case corresponds to the setting where  Gtis the union of several cliques (in the full-information setting,  Gtis the complete graph, in the semi-bandit setting,  Gtis the union of M cliques, where M is the number of pieces, and in the bandit setting,  Gtis the union of N cliques, each consisting of a single node). Corollary 3 of [2] bound the regret of the algorithm by

image

where  α(Gt)is the size of the largest independent set in the feedback graph  Gt at round tand the expectation is over the randomness of the algorithm, and holds for any oblivious choice of loss functions  ℓ1, . . . , ℓT andfeedback graphs  G1, . . . , GT. In particular, if  α(Gt) ≤ αfor some number  αon all time steps, then setting

image

Finally, taking expectations over the randomness in the loss functions  ℓ1, . . . , ℓT , using (5), the above choice of  λ, and the fact that  N ≤ (3R/r)d, we have

image

Full information. In the full information setting, the feedback graph  Gtis the complete graph, for which the

largest independent set is of size  α = 1. In this case, choosing  r = 1/(L√T) leads to λ =�ln(RL√T)/T

and gives expected regret bounded by  O(�dT log(RLT) + f(T, 1/(L√T)).

Semi-bandit feedback. Next suppose that each function  ℓtis piecewise defined with at most M and when the learner plays a point  ρt, they learn the loss for all points belonging to the same piece. In this case, the feedback graph  Gtconsists of the union of M cliques, and the largest independent set has size at most α = M. Again, choosing  r = 1/(L√T)leads to  λ =�d ln(RL√T)/(MT)and gives expected regret

bound by O(�dTM log(RLT) + f(T, 1/(L√T)).

Bandit feedback. Finally, consider the bandit feedback setting. In this case, the feedback graph  Gthas the empty edge-set and therefore the largest independent set is of size  α = N ≤ (3R/r)d(the total number of arms). Choosing  r = Td−1d−2 −1 leads to

image

and expected regret bounded by  O�Td+1d+2 (d(3R)d ln(RT) + L) + f(T, 1T 1/(d+2) )�.

Note, the condition that  B(ρ∗, r0) ⊂ Cwith probability one is only for technical convenience. We can always modify the optimization problem so that this condition is satisfied. In particular, we define an enlarged parameter space  C′ = �ρ∈C B(ρ, r0)and replace the utility function  utwith its Lipschitz extension to  C′. Onthis modified problem we are guaranteed that there exists an optimal parameter in the  r0-interior of  C′, and the Lipschitz-extended functions are still piecewise Lipschitz and f-dispersed for the same f.


Designed for Accessibility and to further Open Science