Semi-bandit Optimization in the Dispersed Setting

2019·arXiv

Abstract

1 Introduction

Overview In this work, we consider the problem of online optimization of piecewise Lipschitz functions with semi-bandit feedback. This is an important family of non-convex optimization problems that arises in algorithm selection problems for combinatorial settings, where the goal is to decide in a data-driven way what algorithm to use from a large family of algorithms for a given problem domain. For example, we may want to decide which clustering algorithm to use from a large family of clustering procedures in order to obtain the highest quality results. In the online version of the algorithm selection problem, on each round the learner chooses an algorithm from the family and receives a new instance of the problem. The problem is characterized by a loss function that measures the performance of each algorithm in the family for the given instance. The goal is to select algorithms so that the cumulative performance of the learner is nearly as good as the best algorithm in hindsight for that sequence of problems. The major challenge in these settings is that it is potentially computationally expensive for the learner to characterize the loss function for each round, since each run of the chosen algorithm reveals the value of the loss function at just the single point corresponding to the algorithm selected. Moreover, for combinatorial problems, small differences between two algorithms can lead to a cascade of changes in their behavior and dramatically change their performance. However, when the algorithm family is parameterized, it can often be shown that the loss function is piecewise Lipschitz in the algorithm parameters, so we can phrase the problem as online optimization of piecewise Lipschitz functions.

Prior work on piecewise Lipschitz optimization was limited to two more extreme feedback regimes: Either the learner carries out a computationally expensive process to obtain full-information feedback (i.e., it observes the loss of every algorithm in the family on each instance), or accepts suboptimal regret bounds to work in the bandit feedback setting (i.e., it only observes the loss of one algorithm for each instance). This creates a tradeoff between computational efficiency and good regret bounds. However, many algorithm selection problems exhibit rich additional structure that is ignored by these two approaches. We show that, surprisingly, evaluating the loss function for a single algorithm can sometimes reveal the loss for a range of similar algorithms, essentially for free; in the context of the loss function, we show that an entire Lipshitz region can often be learned at once. This motivates us to define a new learning model, which we call the semi-bandit feedback setting for learning piecewise Lispchitz functions. Our new results in this model achieve the best of both worlds: we can efficiently obtain the necessary feedback while also having regret bounds that are nearly as good as under full-information.

We apply our results to two online algorithm selection problems. Our results for optimizing over a family of greedy knapsack algorithms improve over the procedures of Balcan et al. [11], Gupta and Roughgarden [19], and Cohen-Addad and Kanade [16] by simultaneously being more efficient and having tighter regret bounds. We also provide the first online algorithm selection procedures for a rich family of linkage based clustering algorithms introduced by Balcan et al. [9] that interpolates between single and complete linkage, which are algorithms that are widely used in practice [6, 28, 34] and known to perform optimally in many settings [5, 8, 7, 18]. Balcan et al. [9] consider the algorithm selection problem for this family of algorithms in the statistical batch setting, rather than the online setting, where they model the application domain as a distribution over problem instances, the goal is to find the algorithm with the highest expected performance, and we are given an i.i.d. sample from the distribution as training data.

Problem Setup. We study the problem of online piecewise Lipschitz optimization. The learning protocol is as follows: on each round t, the learner chooses a parameter belonging to a d-dimensional parameter space , the adversary chooses a piecewise Lipschitz loss function , and the learner incurs a loss equal to . A function is piecewise L-Lipchitz if we can partition the parameter space C into regions such that -Lipschitz when restricted to each region. Many important algorithm selection problems require optimizing piecewise Lipschitz functions, including greedy combinatorial algorithms [19], clustering algorithms and SDP-rounding schemes [9], branch and bound mixed integer program solvers [10], initialization procedures for k-means clustering [12], and various auction design problems [13]. In these problems, the family of algorithms is parameterized and each parameter corresponds to one algorithm. We suppose that on each round t there is a partition of the parameter space C, called the feedback system. If the learner’s parameter belongs to the set , then they observe both the set as well as the loss for every . We consider the uninformed setting, where the learner does not know the feedback system for round t in advance of selecting a parameter. For simplicity, we consider oblivious adversaries that choose their sequence of loss functions adversarially, but before the interaction with the learner begins. The learner’s goal is to minimize regret, which is the difference between their total accumulated loss and that of the best parameter in hindsight: Throughout the paper, we use the notation to optionally suppress all logarithmic terms and dependence on parameters other than the time horizon T and the dimension of the parameter space d.

Main Results and Techniques.

Semi-bandit Regret Bounds in the Dispersed Setting. It is not always possible to achieve sub-linear regret for piecewise Lipschitz loss functions [24, 14, 26]. Balcan et al. [11] provide regret bounds in the full-information and bandit feedback settings under a dispersion condition that roughly measures the number of discontinuous functions in any ball of a given radius, and which is satisfied for a diverse collection of combinatorial algorithm configuration problems. In this paper, we introduce a simplified version of this condition that captures what is asymptotically important for our regret bounds.

Definition 1. The sequence of loss functions is -dispersed for the Lipshitz constant L if for all T and for all , we have that, in expectation, the maximum number of functions among which are not L-Lipshitz in any -ball contained in C is at most . That is, for all T and for all we have

In our applications, the sequence of loss functions will be chosen by a smoothed adversary, in the sense of Spielman and Teng [30]. Informally, the discontinuity locations of the functions chosen by a smoothed adversary are randomly perturbed. The expectation in Definition 1 is over this randomness in the sequence of loss functions. (Balcan et al. [11] also show examples where sufficient randomness can arise from the algorithm itself, rather than smoothness constraints on the adversary.) In all of our applications, we prove -dispersion with . In Definition 1, when a function -Lipschitz on the ball have arbitrarily many discontinuities. However, our regret bounds require that the loss functions are bounded, so these discontinuities cannot introduce arbitrarily large losses within the ball. We provide an algorithm for online piecewise Lipschitz optimization under semi-bandit feedback whose regret is characterized by the -dispersion parameter of the adversarially chosen functions. In Section 2, we prove the following result:

Theorem 2. Let be a bounded parameter space and be piecewise Lipschitz functions that are -dispersed. Running the continuous Exp3-SET algorithm (Algorithm 1) under semi-bandit feedback with an appropriate parameter has expected regret bounded by

In comparison, the bandit-feedback algorithm of Balcan et al. [11] has expected regret bounded by . Even in one-dimensional problems, this leads to a regret of significantly worse than our results. Under different assumptions, the bandit algorithm of Cohen-Addad and Kanade [16] also has regret for the special case of one-dimensional piecewise constant functions.

The proof of Theorem 2 is motivated by the analysis of Exp3-SET for finite-armed bandits given by Alon et al. [1], however, there is one significant new challenge. On each round, the Exp3-SET algorithm uses importance weighting and the partially observed loss to construct an unbiased estimate of the entire loss function, which it then passes into the exponentially weighted forecaster full-information algorithm. The regret bound of Alon et al. [1] in the finite-armed setting then follows from applying a second-order analysis of the exponentially weighted forecaster algorithm when run on the sequence of estimated loss functions. The key challenge in our continuous setting is that the estimated loss functions need not be dispersed, even if the original losses chosen by the adversary were. This is because each estimated loss has additional discontinuities at the boundary of the feedback set observed by the learner. Rather than make restrictive assumptions on how the feedback sets are generated (to guarantee that the estimated losses will also be dispersed) we “open up” the full-information regret bound and use a careful application of Jensen’s inequality that allows us to leverage the dispersion of the true loss functions.

General Tools for Verifying Dispersion. We also provide general tools for proving that a sequence of piecewise Lipschitz functions satisfies dispersion. When the sequence is random, we can usually directly bound the expected number of loss functions that are not L-Lipschitz on any specific ball of radius by . However, this does not imply that the functions are -dispersed, since the expected number of non-Lipschitz functions in the worst ball of radius will typically be larger than the expected number in any specific ball. Building on uniform convergence from learning theory [29], we show that if each loss function has a one-dimensional domain, at most K discontinuities, and any single ball of radius has at most non-Lipschitz functions in expectation, then the expected number of non-Lipschitz losses on the worst ball of radius is at most This leads to -dispersion with . Our result gives an exponential improvement in the dependence on K compared to the results of Balcan et al. [11], who upper bound the expected number of non-Lipschitz losses in the worst ball of radius by . Our improvement comes from using a more subtle VC-based uniform convergence argument involving the class of functions that counts the number of non-Lipschitz functions on a given , as opposed to the class of functions that counts the total number of discontinuities in a given ball (i.e., counting multiple discontinuities from each loss function).

Semi-bandit Online Algorithm Selection. In Section 4, we combine our general regret analysis from Theorem 2 together with application-specific dispersion analysis to design practical algorithm selection procedures for linkage-based clustering and the knapsack problem. In both applications, we show that the discontinuities of each loss function are the roots of polynomials depending on the corresponding problem instance, and that the roots are dispersed under mild smoothness assumptions on the adversary. We obtain the first online algorithm selection procedures for linkage based clustering, and we give algorithm selection procedures for the knapsack problem with substantial computational improvements over the prior work, while at the same time achieving nearly the same regret bound.

Explicit Comparison for Knapsack. To highlight the benefits of our new learning model and results applied to algorithm selection, we give an explicit comparison of the computational complexity for obtaining different types of feedback and the corresponding regret bounds for the family of algorithms introduced in Section 4.1. This is a family of greedy algorithms for the knapsack problem with a single parameter . The algorithm with parameter assigns the score to item i, where the value and size of item i are and , respectively. Then, the algorithm greedily selects items in decreasing order of their score until the knapsack is full. In each round of the online game, the algorithm chooses a parameter , a new knapsack instance with n items arrives, and our goal is for the total value of items selected by the learner to be close to the total value of the best fixed parameter in hindsight. We compare our results to the best prior full-information and

bandit feedback procedures for this problem.

• Full-information. Balcan et al. [11] show that the exponentially weighted forecaster with full-informationfeedback achieves a regret bound of . Our tighter analysis improves the bound to . Obtaining full-information feedback involves running the greedy algorithm times per round, each costing O(n log n) time, leading to a total cost of per round.

• Bandit Feedback. The discretization-based bandit algorithm of Balcan et al. [11] achieves a regret bound of , but only requires a single run of the greedy algorithm per round, costing O(n log n) time.

• Semi-bandit Feedback. Finally, in this paper we give an algorithm whose regret is bounded by using semi-bandit feedback obtainable in time O(n log n) per round. Note that our algorithm is as efficient as the bandit-feedback algorithm, yet achieves essentially the same regret as the full-information algorithm.

Related Work. For online optimization of one-dimensional piecewise constant functions, Cohen-Addad and Kanade [16] provide full-information and bandit online optimization procedures. Balcan et al. [11] consider the more general setting of multi-dimensional piecewise Lipschitz functions. They introduce a dispersion condition that roughly measures how many functions are not Lipschitz in any ball, and provide algorithms with dispersion-dependent full-information and bandit regret bounds. They also verify that dispersion is satisfied for a diverse collection of algorithm selection problems.

Prior work on semi-bandit feedback has focused predominantly on finite-armed bandits. Semi-bandit feedback was first considered for online shortest path problems, where on each round the learner selects a path through a graph and observes the length of the edges along that path (but not for other edges) [20, 21]. Audibert et al. [3] obtain minimax bounds for a generalization to combinatorial bandits, where the learner’s action space is described by boolean vectors in , the losses are linear, and the on each round the learner observes the entries of the loss vector corresponding to the non-zero entries in their action. Alon et al. [2] introduce the Exp3-SET algorithm for semi-bandit feedback for finite-armed bandits. They consider the graph-feedback setting introduced by Mannor and Shamir [25], where on each round t, there is a feedback graph over the arms of the bandit and playing arm i reveals the loss for arm i and all arms adjacent in the graph . We extend the Exp3-SET algorithm to online optimization problems where there are infinitely many arms and where the feedback system on each round is a partition of the parameter space C.

There is a rich literature on data-driven algorithm selection. Most prior work focuses on the statistical setting, where the learner is given a large iid sample of problem instances from some distribution, and the goal is to find the algorithm with the best performance in expectation. Gupta and Roughgarden [19] introduced this formal setting and provide sample complexity results for several families of greedy algorithms. Balcan et al. [9] consider semidefinite rounding schemes for integer quadratic programs and linkage based clustering algorithms, Balcan et al. [10] consider learning the best branch and bound algorithms for mixed integer programs, and Balcan et al. [12] consider learning the best initialization procedures for k-means clustering. In addition to these formal results, this statistical setting has been the predominant model for data-driven algorithm configuration in artificial intelligence [27], combinatorial auctions [23], numerical linear algebra [17], vehicle routing [15], and SAT solving [35].

Another related line of work focuses on the problem of selecting the algorithm with the shortest running time over a distribution of problem instances [22, 33, 32]. This work makes minimal assumptions about the algorithm family and instead designs procedures that can avoid running every algorithm to completion, since this may be very expensive. Our work, on the other hand, explores special structure in algorithm families and can be used to optimize more general performance measures in the online rather than stochastic setting.

2 Semi-bandit Optimization of Piecewise Lipschitz Functions

In this section we provide an algorithm for online piecewise Lispchitz optimization and analyze its regret under dispersion. Our results are for the following continuous semi-bandit setting.

Definition 3 (Uninformed Semi-bandit Feedback.). An online optimization problem with loss functions has semi-bandit feedback if for each time t, there is partition of the parameter space C, called a feedback system, such that when the learner plays point , they observe the set and denote the feedback set that contains

We analyze a continuous version of the Exp3-SET algorithm of Alon et al. [2]. This algorithm uses importance weighting to construct unbiased estimates of the complete loss function on each round, which it passes as input to a continuous version of the exponentially weighted forecaster. Pseudocode is given in Algorithm 1. Unlike the Exp3 algorithm of Auer et al. [4], the finite-armed Exp3-SET algorithm and our continuous version do not include an explicit exploration term (i.e., we do not mix the distribution uniform distribution over C). Stoltz [31] was the first to show that mixing with the uniform distribution is unnecessary for the Exp3 algorithm to have optimal expected regret.

In Appendix A.2, we show how to implement this algorithm with O(log T) per round computational complexity (in addition to the cost of collecting feedback) for the case of piecewise constant losses in one dimension. This implementation uses the interval tree data structure of Cohen-Addad and Kanade [16].

Given the learner’s observations on round t, Algorithm 1 uses importance weighting to estimate the complete loss function by . The estimate is only non-zero for parameters that belong to the feedback set observed by the algorithm at round t. The key property of is that it is an unbiased estimate of the true loss function conditioned on the history until the beginning of round t. More formally, let denote the conditional expectation given the learner’s choices until round and the first t loss functions. This expectation is only over the randomness of the learner’s choice of at time t. For clarity, we also use the notation to denote the expectation of any random variable that is a function of only so that for any random quantity X, we have , a straight forward calculation shows that

To simplify presentation, we assume that the sequence of loss functions has an -interior minimizer: with probability one, for all times T there exists usually modify a sequence of loss functions to obtain an equivalent optimization problem that is guaranteed to have an -interior minimizer. In Appendix A we discuss such a transformation that works whenever the parameter space C is convex.

We bound the regret of Algorithm 1 under a generalization of both both -dispersion and the (w, k)-dispersion definition of Balcan et al. [11], leading to more precise bounds and broader applicability.

Definition 4. The sequence of loss functions is f-dispersed for the Lipshitz constant L and dispersion function if for all T and for all

We can express both -dispersion and (w, k)-dispersion in terms of f-dispersion. In particular, let be a sequence of piecewise L-Lipschitz functions. For any and , let -Lipshitz in be the expected number of non-Lipschitz functions among on the worst ball of radius . If the functions are -dispersed, then we know that for all T and all , we have . Since is a non-decreasing function of the radius , we are guaranteed that for any we have . It follows that the functions are also f-dispersed for . Similarly, the functions are (w, k)-dispersed if every ball of radius w has at most k non-Lipschitz functions. This implies that for all , we have , but for , we could have as large as T. It follows that the functions are f-dispersed where otherwise.

We bound the regret of Algorithm 1 in terms of the f-dispersion of the losses.

Theorem 5. Let be contained in a ball of radius R and be piecewise L-Lipschitz functions that are f-dispersed with an -interior minimizer. Moreover, suppose the learner gets semi-bandit feedback and, on each round t, the feedback system feedback sets. For any , running Algorithm 1 with satisfies the following regret bound:

Proof sketch. Following the proof for the finite-armed case of Alon et al. [2], we upper and lower bound the quantity is the normalizing constant at round t. The upper bound is in terms of the expected total loss of the algorithm, while the lower bound depends on the expected total loss of the best parameter in hindsight. The main additional challenges are in proving the lower bound, which relies crucially on the dispersion properties of . Combining the inequalities leads to the claimed regret bound.

Following a continuous analogue of the arguments used by Alon et al. [2], we arrive at the following upper bound:

where the second term roughly quantifies our dependence on the second moment of the estimated losses . We show that for any fixed we have that . This implies that

where the first equality follows from the fact that the feedback system is a partition of C and the interval of is equal to 1. Substituting this into our upper bound, we have

In the finite-armed setting, lower bounding in terms of the loss of is straightforward, since is the number of arms and . When the parameter space C is continuous, the lower bound is more involved. First, since is an integral rather than a sum, is no longer lower bounded by . To overcome this, Balcan et al. [11] use dispersion to argue that an entire ball of points close to has low loss. However, applying this analysis to the estimated losses will fail, since the estimated loss functions might not be f-dispersed. Moreover, even when are f-dispersed, their Lipschitz constants and range are not bounded unless we make restrictive assumptions on the volume of the feedback sets. Instead, we use a careful application of Jensen’s inequality to postpone the application of dispersion until after we have already taken the expectation of the estimated losses, which allows us to use the dispersion properties of the true loss functions instead.

Formally, let be a minimizer with be any radius satisfying be the ball of radius . We lower bound by integrating over just the ball B consisting of parameters close to

Using the fact that log is concave, applying Jensen’s inequality to the right hand side gives . Let denote the expectation conditioned on the sequence of loss functions (i.e., only taking the expectation over the randomness of the algorithm). Then, since , we have that . Here, we used the fact that is an unbiased estimate of . Finally, letting D denote the number of non-Lipschitz functions in on the ball B, we can upper bound the loss of all by . Since the functions -dispersed, we know that the expected number of non-Lipschitz functions in the worst ball of radius r is at most f(T, r). In particular, this implies that . Therefore, taking the expectation of our lower bound gives

Combining the upper and lower bounds on , using the fact that , andrearranging gives

Substituting the given value of completes the proof.

Our regret bound for -dispersed losses given in Theorem 2 follows immediately from Theorem 5.

Note that our results are also applicable in two closely related settings: maximizing dispersed piecewise Lipschitz utility functions, and the case when losses are bounded in [0, H] for some known bound H instead of [0, 1]. A discussion of the necessary transformations can be found in Appendix A.1.

3 General Tools for Verifying Dispersion

In this section we provide a general tool for demonstrating that a sequence of loss functions is dispersed. For many sequences of loss functions and any fixed ball , we are able to bound the expected number of functions among that are not L-Lipschitz on that ball. However, this does not directly imply that the functions are dispersed, since in general the expected number of non-Lipschitz functions in the worst ball will be larger than the expected number of non-Lipschitz functions in any specific ball. Our main result in this section shows that for sequences of independent loss functions that are defined over a one-dimensional parameter space with at most K discontinuities each, to prove dispersion for the sequence, it is sufficient to bound the expected number of non-Lipschitz functions for any fixed ball.

Theorem 6. Let be independent piecewise L-Lipschitz functions, each having at most K discontinuities. Let

be the number of functions in that are not L-Lipschitz on the ball . Then we have

Proof sketch. For simplicity, we assume that every function has exactly K discontinuities and let be the vector whose entries are the discontinuity locations of . The vectors are independent.

For any interval , define the function by if any component of belongs to otherwise. The sum counts the number of vectors that have a component in the interval I or, equivalently, the number of functions that are not L-Lipschitz on I. The main result will follow from VC-dimension based uniform convergence arguments applied to the class is an interval}.

First, we bound the VC-dimension of F by O(log K). The key insight is the following connection between F and the class of indicator functions for intervals: let be any collection of n vectors in denote the set containing the union of their combined nK component values. Now consider any pair of intervals I and . If the indicator functions for agree on all the points in P (i.e., the intervals contain exactly the same subset of P), then we must have that agree on every vector in S. This is because if contain exactly the same subset of P, then for each vector , both intervals contain the same subset of its component values and we must have . Therefore, that the number of distinct ways that the class F can label the set of vectors S is at most the number of ways that intervals can label the set of points P. The bound on the VC-dimension of F follows from the fact that, by Sauer’s Lemma, intervals can only label the points in distinct ways, which limits the size of the largest set S that is shatterable by F.

Applying VC-dimension based uniform convergence arguments to the class F and using the fact that , it follows that for all time horizons T and all radiuses , with probability at least Converting this high-probability bound into a bound in expectation completes the proof.

If we can show that for all times T, radiuses and any fixed interval I of radius , the expected number of non-Lipschitz functions on interval I is at most , then Theorem 6 guarantees that the functions are -dispersed with . Similarly, if the expected number of non-Lipschitz functions on the interval I is at most , then the functions are f-dispersed for

Improvement Over the Prior Work. Balcan et al. [11] also provide general tools for proving (w, k)-dispersion that can be adapted to -dispersion. Our bounds provide an exponential improvement in the dependence on K, the number of discontinuities per function. Under the same assumptions as Theorem 6, they are able to show the total number of discontinuities across the functions that fall in the interval of radius at (i.e., it counts multiple discontinuities from each loss function ) (see Lemma 15 in Appendix B). The dependence of our additive term on K is exponentially smaller and leads to improved dispersion analysis.

4 Online Algorithm Selection with Semi-bandit Feedback

In this section we apply our semi-bandit optimization results to online algorithm selection for two rich parameterized families of algorithms. For both families, we show how to obtain semi-bandit feedback by running a single algorithm fom the family. We also analyze dispersion for these problems under the assumption that the adversary is smoothed. In both cases, we obtain regret bounds in the semi-bandit feedback setting. Finally, in Appendix C.1 we show how to use binary search to obtain semi-bandit feedback for a large class single-parameter algorithm families.

Smoothed adversaries. We consider adversaries that are smoothed in the sense of Spielman and Teng [30], where their decisions are corrupted by small random perturbations. Formally, we say that a parameter chosen by the adversary is -smooth if it is a random variable whose density is bounded by . After the adversary chooses the density for each smoothed parameter, nature samples each parameter value independently from their corresponding distributions. Small values of correspond to larger random perturbations of the problem parameters, while in the limit as , the adversary is able to choose the parameters deterministically. In each application, we will specify which problem parameters are smoothed, together with the bound density. For simplicity, we assume that all -smooth random variables are independent (i.e., the corruption of the adversary’s choices is not correlated across variables), though many of our results can be exteneded to allow for some correlation between the parameters of each instance.

4.1 Greedy Algorithms for Knapsack

First, we consider selecting the best algorithm from a parameterized family of a greedy algorithms for the knapsack problem. An instance of the knapsack problem consists of n items, where item i has a value a size , and a knapsack capacity C. Our goal is to find the most valuable subset of items whose total size does not exceed C. Gupta and Roughgarden [19] propose using the following parameterized family of greedy knapsack algorithms: for a given parameter , set the score of item decreasing order of score, add each item to the knapsack if there is enough capacity left. This algorithm runs in time O(n log n). In our analysis, we assume that the adversary’s item values are

First, we show how to obtain semi-bandit feedback for this family of greedy knapsack algorithms by running a single algorithm in the family. Pseudocode is given in Algorithm 2.

Lemma 7. Consider a knapsack instance with capacity C and n items with values and sizes . Algorithm 2 runs in time O(n log n). Moreover, there is a feedback system partitioning C into intervals such that set of items output by the algorithm is constant for . When run with parameter , in addition to the item set S, the algorithm outputs the interval containing Proof. Computing the permutation in step 1 of Algorithm 2 takes O(n log n) time. Finding the item set S in steps 2 and 3 only requires a linear pass through the items. Similarly, finding the interval A in steps 4 and 5 also only requires a linear pass through the items. Therefore, the total running time is O(n log n).

Gupta and Roughgarden [19] show that for any knapsack instance, the algorithm’s output is a piecewise constant function of the parameter with at most discontinuities. In particular, for each pair of items i and j, there is a critical parameter value such that the relative order of items i and j only changes at . These critical parameter values partition C into sets such that the item ordering is fixed for all . Algorithm 2 computes the critical values for each consecutive pair of items and and outputs the largest interval A containing and none of these critical values. For all , we must have for , and therefore the item ordering is constant for . It follows that that A does not contain for any pair of items i and j. On the other hand, the end points of A are critical values, so A must be equal to one of the M sets

In contrast to Algorithm 2, the most direct approach to obtaining full-information feedback for this family of knapsack algorithms is to first compute a set of critical parameter values arising from all pairs of points and to run the algorithm once for each cell in the corresponding partition, taking

Next, we provide a dispersion analysis for selecting the parameter in order to maximize the value of items selected. We assume that each instance has the same capacity C, item sizes are in [1, C], and the item values are in -smooth. The corresponding loss function is where is the set of items selected by Algorithm 2 when run with parameter

Lemma 8. Consider an adversary choosing knapsack instances with a fixed knapsack capacity C where the instance has item sizes , and -smooth item values . The loss functions defined above are piecewise constant and f-dispersed for -dispersed for

Proof. Let be the critical parameter value such that at , items i and j swap their relative order in the instance. Balcan et al. [11] show that each critical value is random and has a density function bounded by . It follows that for any interval I of radius , the expected total number of critical values summed over all pairs of items and t = 1, . . . , T is at most . This is also an upper bound on the expected number of loss functions in that are not constant on I. Applying Theorem 6, it follows that the functions are f-dispersed for , which implies -dispersion with

Running Algorithm 1 using the semi-bandit feedback returned by Algorithm 2, we obtain the following.

Corollary 9. Under the same conditions as Lemma 8, using Algorithm 1 to tune the parameter Algorithm 2 under semi-bandit feedback has expected regret bounded by

The full-information regret bound obtained by Balcan et al. [11] is , which is worse than our semi-bandit bound (but can be improved to using our tighter dispersion analysis).

4.2 Interpolating between Single and Complete Linkage Clustering

Next, we consider a rich family of linkage-based clustering algorithms introduced by Balcan et al. [9] that interpolates between the classic single and complete linkage procedures. Clustering instances are described by a matrix giving the pairwise distances between a collection of n data points and the goal is to organize the points into a hierarchy or cluster tree. We provide the first dispersion analysis and online configuration procedures for this class of algorithms. We assume that each distance

The algorithm family we consider, called -linkage, is family of agglomerative clustering algorithms with a single parameter . These algorithms take as input a distance matrix with entries and the parameter value and output a cluster tree, which is a binary tree where each node corresponds to a cluster in the data. The leaves of the tree are the individual data points, while the root node corresponds to the entire dataset. The children of each node subdivide that cluster into two subclusters. The -linkage algorithm starts with each point belonging to its own cluster. Then, it repeatedly merges the closest pair of clusters according the distance defined by and B are clusters (i.e., subsets of When there is only a single cluster remaining, the algorithm outputs the constructed cluster tree. Running this algorithm with recovers single and complete linkage, respectively.

For any pair of candidate cluster merges and , where and are clusters, there is a critical parameter value c such that only when . To simplify notation in the rest of this section, we let , where , denote the critical parameter.

First, we show how to obtain semi-bandit feedback for this family of linkage algorithms by running a single algorithm in the family. Our modified algorithm maintains an interval with the invariant that at any iteration, for all parameters , the algorithm would make the same merges that have been made so far. Pseudocode for this procedure is given in Algorithm 3

Lemma 10. Consider a clustering instance with distance matrix . Algorithm 3 runs in time Moreover, there is a feedback system partitioning intervals such that the cluster tree output by the algorithm is constant for . When run with parameter , in addition to the cluster tree T, the algorithm outputs the interval containing

Proof. The algorithm performs merges. For each merge, the algorithm makes two passes through the clusters in order to find the closest pair, as well as to update the interval . These passes both require us to compute the distance between all pairs of clusters. However, starting from the input matrix D, we can maintain two distance matrix storing the minimum and maximum distances between the current set of clusters, respectively. After merging two clusters, these distance matrices can be updated in O(n) time, since at most O(n) distances change. It follows that finding the closest pair of clusters and updating the interval can be done in time per merge. This leads to a total running time of

Balcan et al. [9] prove that there exists a partition intervals such that the algorithm output is constant for . In particular, for any pair of possible cluster merges and , the algorithm prefers to merge for all values of the parameter . Moreover, since only depends on 8 points—the closest and farthest pairs of points between and and between and —and there are only ways to select 8 points, these critical parameter values partition intervals. For , the ordering on all possible merges is fixed, so the algorithm will output the same cluster tree.

Finally, on each iteration of the algorithm, we iterate through all pairs of clusters that the algorithm did not merge. For each, we calculate the critical parameter value the value of at which the algorithm would prefer to merge over . We shrink the interval so that it does not contain any of these critical values. It follows that the interval satisfies the following invariant: for all , the sequence of cluster merges made by the algorithm with parameter up until the current iteration would match those made by the algorithm with parameter . In particular, when the algorithm returns, we are guaranteed that the same cluster tree would be output for all parameter values . Since the endpoints and always belong to the critical parameter values, there are at most intervals the algorithm might output for a fixed clustering instance.

Similarly to the knapsack example, the most direct approach for obtaining full-information feedback is to first calculate a set of critical parameter values arising from all subsets of 8 points and to run -linkage once for each interval in the corresponding partition. By using a priority queue to maintain the distances between clusters, it is possible to implement -linkage in time. This leads to a total running time of , which is significantly higher than the running time in Lemma 10. Note that using a priority queue in Algorithm 3 does not reduce the running time to , since updating the interval requires a linear pass through all pairs of clusters, so finding the closest pair faster does not reduce the running time.

Next, we provide a dispersion analysis for selecting the parameter of Algorithm 3 when the clustering instances are chosen by a smoothed adversary. In particular, we suppose that on each round the adversary chooses a distance matrix where each distance is -smooth and takes values in [0, B], for some maximum distance B. The quantity roughly captures the scale of the perturbations relative to the true distances. Our analysis leads to regret that depends on only logarithmically and give good bounds even for exponentially small perturbations.

Fix any clustering loss function , where g(D, T) measures the cost of cluster tree T for distance matrix D. For example, g(D, T) could be the k-means cost of the best k-pruning of the tree T or the distance to a ground-truth target clustering. We study the loss functions given by , where denotes the output cluster tree of Algorithm 3 run on distance matrix D with parameter

Lemma 11. Consider an adversary choosing clustering instances where the instance has symmetric distance matrix and for all is -smooth. The loss functions defined above are piecewise constant and f-dispersed for and -dispersed for

Proof sketch. In the proof of Lemma 10, we showed that for each time t, there are critical parameter values partitioning C into regions so that the algorithm output is constant on each region. Since the loss only depends on through the algorithm output, is also piecewise constant with at most

Moreover, we argued that every discontinuity of occurs at a critical parameter value of the form where are 8 point indices. Similarly to the knapsack example, we show that each critical parameter value is random and has a density function bounded by . From this, it follows that for any interval I of radius , summing over all times t = 1, . . . , T and all subsets of 8 points, we have that the expected total number of critical values that land in interval I is at most . This also bounds the expected number of functions that are not constant on I. By Theorem 6, the functions are f-dispersed for also implying -dispersion.

There are several cases when bounding the density of the critical value c, depending on whether any of the 4 distances correspond to the same entry in the distance matrix D. We give the argument for the case when all 4 distances are distinct entries and therefore independent. The remaining cases are similar and considered in Appendix C. Let and so that c = X/(X + Y ). The variables X and Y are independent. Since X and Y are each the sum of -smooth random variables, Lemma 20 implies that they are each have -bounded densities. Using the fact that , applying Lemma 22 implies that the ratio bounded density, as required.

Running Algorithm 1 using the semi-bandit feedback returned by Algorithm 3, we obtain the following:

Corollary 12. Under the same conditions as Lemma 11, using Algorithm 1 to tune the parameter of Algorithm 3 under semi-bandit feedback has expected regret bounded by

5 Conclusion

In this work, we provide the first online optimization algorithm for piecewise Lipschitz functions under semi-bandit feedback with regret bounds that depend on the dispersion of the loss functions. We also give general tools for verifying dispersion in applications with exponentially tighter bounds than prior work. Finally, we apply our results to two algorithm selection problems. We obtain the first online algorithm selection procedure for a family of linkage-based clustering algorithms, and an online algorithm selection procedure for a greedy family of knapsack algorithms that is more efficient and has better regret bounds than prior work. A cornerstone of our results is that, for many algorithm selection problems, semi-bandit feedback can be obtained as efficiently as bandit-feedback and is sufficient for our algorithms to achieve nearly the same regret bounds as under full-information feedback. Our results largely mitigate the tradeoff between computational efficiency and good regret bounds suffered by prior online algorithm selection approaches, making online algorithm selection more practical.

Acknowledgements. This work was supported in part by NSF grants CCF-1535967, IIS-1618714, DMS-1700365, an Amazon Research Award, a Microsoft Research Faculty Fellowship, and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.

References

[1] Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Annual Conference on Learning Theory, volume 40. Microtome Publishing, 2015.

[2] Noga Alon, Nicol`o Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput., 46(6):1785–1826, 2017. doi: 10.1137/140989455. URL https://doi.org/10.1137/140989455.

[3] Jean-Yves Audibert, S´ebastien Bubeck, and G´abor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, March 2014.

[4] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.

[5] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. In Information Processing Letters, 2012.

[6] Pranjal Awasthi, Maria-Florina Balcan, and Konstantin Voevodski. Local algorithms for interactive clustering. In ICML, 2014.

[7] Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. In SIAM Journal on Computing, 2016.

[8] Maria-Florina Balcan, Nika Haghtalab, and Colin White. k-center clustering under perturbation resilience. In Proceedings of the Annual International Colloquium on Automata, Languages, and Programming (ICALP), 2016.

[9] Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. Proceedings of the Conference on Learning Theory (COLT), 2017.

[10] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In ICML, 2018.

[11] Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In FOCS, 2018.

[12] Maria-Florina Balcan, Travis Dick, and Colin White. Data-driven clustering via parameterized lloyd’s families. In NeurIPS, 2018.

[13] Maria-Florina Balcan, Tuomas Sandholm, and Ellen Vitercik. A general theory of sample complexity for multi-item profit maximization. In ACM EC, 2018.

[14] Shai Ben-David, David Pal, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009.

[15] Yves Caseau, Franc¸ois Laburthe, and Glenn Silverstein. A meta-heuristic factory for vehicle routing problems. In International Conference on Principles and Practice of Constraint Programming, pages 144–158. Springer, 1999.

[16] Vincent Cohen-Addad and Varun Kanade. Online Optimization of Smoothed Piecewise Constant Functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[17] Jim Demmel, Jack Dongarra, Victor Eijkhout, Erika Fuentes, Antoine Petitet, Rich Vuduc, R Clint Whaley, and Katherine Yelick. Self-adapting linear algebra algorithms and software. Proceedings of the IEEE, 93(2):293–312, 2005.

[18] Anna Grosswendt and Heiko Roeglin. Improved analysis of complete linkage clustering. In European Symposium of Algorithms, 2015.

[19] Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992–1017, 2017.

[20] Andr´as Gy¨orgy, Tam´as Linder, and Gy¨orgy Ottucs´ak. The shortest path problem under partial monitoring. In COLT, 2006.

[21] Satyen Kale, Lev Reyzin, and Robert E. Shapire. Non-stochastic bandit slate problems. In NeurIPS, 2010.

[22] Robert Kleinberg, Kevin Leyton-Brown, and Brendan Lucier. Efficiency through procrastination: Approximately optimal algorithm configuration with runtime guarantees. In IJCAI, 2017.

[23] Kevin Leyton-Brown, Eugene Nudelman, and Yoav Shoham. Empirical hardness models: Methodology and a case study on combinatorial auctions. Journal of the ACM (JACM), 56(4):22, 2009.

[24] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. In Machine Learning, 1998.

[25] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.

[26] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NeurIPS, 2011.

[27] John R Rice. The algorithm selection problem. In Advances in computers, volume 15, pages 65–118. Elsevier, 1976.

[28] Mehreen Saeed, Onaiza Maqbool, Haroon Atique Babri, Syed Zahoor Hassan, and S. Mansoor Sarwar. Software clustering techniques and the use of combined algorithm. 2003.

[29] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

[30] Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.

[31] Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. PhD thesis, Universit´e Paris Sud-Paris XI, 2005.

[32] Gell´ert Weisz, Andr´as Gy¨orgy, and Csaba Szepesv´ari. CapsAndRuns: An improved method for approximately optimal algorithm configuration. In ICML 2018 AutoML Workshop, 2018.

[33] Gell´ert Weisz, Andr´as Gy¨orgy, and Csaba Szepesv´ari. Leapsandbounds: A method for approximately optimal algorithm configuration. In ICML, 2018.

[34] James R. White, Saket Navlakha, Niranjan Nagarajan, Mohammad-Reza Ghodsi, Carl Kingsford, and Mihai Pop. Alignment and clustering of phylogenetic markers—implications for microbial diversity studies. In BCM Bioinformatics, 2010.

[35] Lin Xu, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Satzilla: portfolio-based algorithm selection for sat. Journal of artificial intelligence research, 32:565–606, 2008.

A Appendix for Online Optimization (Section 2)

Problem Transformations to Obtain -interior Minimizers. Recall that a sequence of loss functions has an -interior minimizer if with probability 1, for all times T we have that there exists . We can usually modify a sequence of loss functions to obtain an equivalent optimization problem that is guaranteed to have an -interior minimizer. For example, when the parameter space C is convex (e.g., a cube in , which covers most algorithm configuration applications), we can apply the following transformation: define an enlarged parameter space and a modified sequence of loss functions given by , where denotes the Euclidean projection onto C. Using the fact that projections onto convex sets are contractions, it follows that the sequence is also L-Lispchitz and f-dispersed. Moreover, it has an -interior minimizer and any sequence of parameters can be converted into by taking . This guarantees that for all t. In particular, an algorithm with low regret playing against can be converted into one that plays against with an identical regret bound. The cost of this transformation is that it increases the diameter of the parameter space C by 2r. Our regret bounds have logarithmic dependence on the diameter of C.

Theorem 5. Let be contained in a ball of radius R and be piecewise L-Lipschitz functions that are f-dispersed with an -interior minimizer. Moreover, suppose the learner gets semi-bandit feedback and, on each round t, the feedback system feedback sets. For any , running Algorithm 1 with satisfies the following regret bound:

Proof of Theorem 5. For the majority of the proof we consider an arbitrary sequence of piecewise Lipschitz loss functions with an -interior minimizer. We will only suppose they are f-dispersed in the final steps of the proof.

Following the proof of the Exp3-Set algorithm of Alon et al. [2], we will upper and lower bound the quantity . Our upper bound will be in terms of the learner’s total expected loss, while the lower bound will be in terms of the expected total loss of the optimal parameter in hindsight. Dispersion plays a crucial role in the lower bound, since it allows us to guarantee that a set of parameters with non-trivial volume has nearly optimal total loss. Combining these bounds and then finally taking the expectation of the bound for a sequence of losses -dispersed will give the final bound.

Upper Bound. Consider the ratio of consecutive normalizing constants . Using the definition of

Next, using that

Using the fact that and taking the product over t = 1, . . . , T, we have

Taking logs, we have

Next, we will take the expectation of the above bound to simplify the two integrals. Recall that for each time t, we let be the feedback system and for any and let denote the set such that . Recall that the importance-weighted losses were constructed to ensure that for any time t and any fixed . Therefore,

The integral in the final expectation is the definition of , which gives . Therefore, we have

which is the total expected loss of the algorithm on the first T rounds. Now we turn to simplifying the expectation of the second integral in (1). For any and any time t, we have

Using the fact that if and only if , we can upper bound the integral as follows:

This implies that

Finally, we evaluate the integral by writing it as the sum of integrals over the feedback sets , which is justified since these sets partition C. In particular, we have

Putting it together, we have

Taking the expectation of (1) and using the calculations given by (2) and (3), we have

Lower Bound. Next, let be such that and fix any radius . Using the fact that and the weights are positive, we have

Taking the log of this bounds gives

At this point it is tempting to apply dispersion to lower bound the term in terms of . In particular, if at each time t the feedback system corresponds to the piecewise Lispchitz partitioning of C for the loss function , then the estimated loss function has a subset of the discontinuities of . In this case, the estimated losses -dispersed for the same function f as the true losses. However, when the feedback system at around t does not match the piecewise Lipschitz partition, we would require a separate dispersion analysis for the sequence of estimated losses more serious challenge, however, is that the importance weight in the definition of to take values in the range , which is much larger than the true losses which take values in [0, 1]. Moreover, the Lipschitz parameter of the estimated loss is . This larger loss range and Lipschitz constant lead to a worse final regret bound. Instead, we defer applying dispersion until after taking expectations so that we can use the dispersion properties of the true losses

Towards this end, we first use Jensen’s inequality to simplify the above bound. Let be any function and be any subset of the parameter space. Then, using the fact that log is concave, we can apply Jensen’s inequality to obtain the following bound:

Applying this inequality to our lower bound on with and gives

Next, since C is contained in a ball of radius R and the volume of a ball of radius is proportional to , it follows that the volume ratio is at least . Taking expectations, we have

where we used the fact that for any fixed , we have . Finally, we will upper bound the sum of losses for points in the ball in terms of the number of non- Lipschitz functions on that ball. let -Lipschitz on be the number of functions among that are not L-Lipschitz on the ball . Then for any have

The integralis the average total loss of the parameters , which is at most . Substituting this into our bound gives

Combined bound. Combining the upper and lower bounds and rearranging, we have

Finally, now suppose that the functions are a random sequence that satisfy f-dispersion. To bound -dispersed and -interior minimizer. Then we have

where this expectation is taken over the draw of the loss functions . This leads to the bound

where now the expectation is taken over both the algorithm’s randomness and the random sampling of the loss functions . The specific bounds given in the theorem statement follow by substituting the chosen value of

A.1 Optimizing Utilities and H-Bounded Losses

We note that the regret bound obtained in Theorem 5 for Algorithm 1 can also be used to obtain similar results in two closely related settings. First, if we instead have piecewise Lipschitz utility functions and our goal is to maximize utility rather than minimize loss, we can transform this into a loss-minimization problem by minimizing the losses given by . This transformation preserves the regret of any algorithm, the feedback system at each round, and the piecewise Lipschitz and dispersion properties of the functions. Similarly, if the losses take values in [0, H] for some known maximum loss H, instead of [0, 1], the learner can preprocess the losses to fall in [0, 1] by dividing them by H. The rescaled functions take values in [0, 1] and have Lipschitz constant . Then expected regret of Algorithm 1 with respect to the unscaled loss functions is

Lemma 13. Let be a sequence of utility functions that are each piecewise L-Lipschitz and f-dispersed. Define a corresponding sequence of losses given by . The functions are also piecewise L-Lispchitz and f-dispersed.

Proof. First, consider any time t. Since is piecewise L Lipschitz, by definition we know that there is a partition of C so that for each and any , we have . We will argue that the loss function is also piecewise L-lispschitz on the same partition. Fix any and any pair of points . Then we have that

where the last inequality follows from the fact that -Lipschitz restricted to . It follows that piecewise L-Lipschitz and has the same piecewise Lipschitz partition. This holds for all times t.

Next, we argue that whenever the utility functions are f-dispersed, so are the loss functions . For any time horizon , and parameter

Following an identical argument as in the first part, with probability 1, whenever -Lipschitz on the ball , so is the function . From this, it follows that for all , and . Finally, since the functions -dispersed, we have that for all and all radiuses

and the loss functions -dispersed.

A.2 Efficient Implementations via Interval Trees

In this section we show how to use the modified interval tree data structure of Cohen-Addad and Kanade [16] to implement the continuous Exp3-SET algorithm efficiently for one-dimensional problems with piecewise constant loss functions. In particular, the per-round cost of updating the algorithm weights and sampling from them at time t is only O(log(t)), while a direct implementation has running time given by O(t) instead. We also show how to use interval trees to implement the Exp3 algorithm on a set of N arms with per-round running time that is O(log N), which implies that a discretization-based algorithm in the bandit setting can be efficiently implemented even in high dimensions.

Interval Tree Summary. Cohen-Addad and Kanade [16] introduce a modified interval tree data structure used for representing piecewise constant functions mapping R to R. Their data structure represents the function as a balanced tree with one leaf corresponding to each constant interval of the function. It supports two main operations called DRAW and UPDATE:

• The DRAW procedure returns a sample from the density function that is proportional to the represented piecewise constant function f.

• The UPDATE procedure takes an interval [a, b) and an update u, and updates the represented piecewise function by multiplying the function values in [a, b) by u. That is, if the represented function was originally , after executing UPDATE with interval [a, b) and update u, the resulting function is

Cohen-Addad and Kanade [16] show that the operations DRAW and UPDATE can be implemented in O(log P) time, where P is the number of constant pieces in the represented function. The data structure also makes it possible to implement a third procedure INTEGRATE in O(log P) time, which takes an interval [a, b) and returns the integral of the represented represented function on the interval [a, b).

Exp3-Set for Piecewise Constant One Dimensional Problems. First, we show how to efficiently implement Algorithm 1 efficiently for one-dimensional optimization problems with piecewise constant loss functions. We simply use the interval tree datastructure of Cohen-Addad and Kanade [16] to represent the weight function at each round. Pseudocode is given in Algorithm 4.

Lemma 14. Consider an online optimization problem with loss functions that are piecewise constant. Moreover, suppose that on each round t, the loss is constant on each of the feedback sets . For such a problem, Algorithm 4 is equivalent to Algorithm 1. The overhead of sampling and updating the weights on round t takes O(log t) time.

Proof. On each round we run UPDATE once to update the interval tree. This at most increases the number of constant intervals in the weights by 2, since the only constant intervals that might get split are the two containing the end points of the feedback set . Therefore, on round t, the weight function is piecewise constant with at most O(2t) intervals. It follows that the sampling, integration, and update operations all take O(log t) time, giving a total per-round cost of O(log t).

B Appendix for Disperion Tools (Section 3)

Theorem 6. Let be independent piecewise L-Lipschitz functions, each having at most K discontinuities. Let

be the number of functions in that are not L-Lipschitz on the ball . Then we have

Proof. For simplicity, we assume that every function has exactly K discontinuities. For each function denote the vector whose entries are the discontinuity locations of has discontinuities at , but is otherwise L-Lispchitz. Since the functions are independent, the vectors are also independent.

For any interval , define the function

The sum counts the number of vectors that have a component in the interval I or, equivalently, the number of functions that are not L-Lipschitz on I. We will apply VCdimension uniform convergence arguments to the class is an interval}. In particular, we will show that for an independent set of vectors , with high probability we have that is close to for all intervals I. This uniform convergence argument will lead to the desired bounds.

We begin by bounding the VC-dimension of F by O(log K). The key observation is the following connection between F and the class of indicator functions for intervals: let be any collection of n vectors in and let denote the set containing the union of their combined nK component values. Now consider any pair of intervals I and . If the indicator functions for I and agree on all the points in P (i.e., the intervals contain exactly the same subset of P), then we must have that and agree on every vector in S. This is because if I and contain exactly the same subset of P, then for each vector , both intervals contain the same subset of its component values. In particular, either they both contain none of the components, or they both contain at least one. In either case, we have that . This shows that the number of distinct ways that functions from the class F can label the set of vectors S is at most the number of ways that indicator functions for intervals can label the set of points P.

Now suppose that the VC-dimension of F is V . Then there exists a set of vectors of size V that is shattered by be the set containing the union of their combined V K components (as above). From Sauer’s Lemma together with the fact that the VC-dimension of intervals is 2, we are guaranteed that indicator functions for intervals can label the set P of points in at most distinct ways. By the above reasoning, it follows that F can label the set S of vectors in at most distinct ways. On the other hand, since F shatters S, we know that it can label S in all possible ways, and it follows that . Taking logs on both sides and rearranging, we have . Using the fact that for any and , the inequality implies that , we further have that , as required.

Applying VC-dimension uniform convergence arguments for the class F, for any failure probability , if are independent random vectors (but not necessarily identically distributed), then following holds with probability at least simultaneously for all

In particular, for any point and any radius , we have that , where I = . Therefore, uniform convergence for F implies that for all , and any failure probability , we have that with probability at least the following holds for all

Multiplying both sides by T and rearranging gives

Taking the maximum of both sides over

This is a high probability bound on the maximum number of non-Lipschitz functions among for any interval of radius . All that remains is to convert this into a bound in expectation. Let

G denote the high-probability uniform convergence event above. Then we have

where the last inequality uses the facts that and . This argument holds for all , proving the claim.

Next, we prove a weaker bound that follows from the analysis techniques of Balcan et al. [11].

Lemma 15. Let be independent piecewise L-Lipschitz functions, each having at most K discontinuities. Let -Lipschitz on be the (random) number of functions in that are not L-Lipschitz on the ball . Moreover, let , where is the vector of discontinuities of the loss . That is, is the number of discontinuities of the functions in the ball . Then we have

Note that, using the notation of Lemma 15, we always have . It follows that Lemma 15 is looser than Theorem 6 in two ways: first, the error term is a factor K larger. Second, the upper bound of Lemma 15 multiply-counts functions that have repeated discontinuities in the same ball, while our sharper bound does not.

Proof. For simplicity, we assume that every function has exactly K discontinuities. The proof techniques can be generalized to the case where each functions has at most K discontinuities.

For each time t, let be the vector of discontinuities of . That is, has discontinuities at the points and is otherwise L-Lipschitz. The key challenge is that the discontinuity locations are not independent.

Fix any discontinuity index and define . That is, counts the number of times t for which the discontinuity lands in the interval of radius centered on . Then we have that counts the total number of discontinuities that belong to the interval of radius centered on . Since the function is not L-Lipschitz on an interval I only when I contains some discontinuity for

Next we will apply uniform convergence arguments to obtain high probability bounds on each in terms of their expectations. Fix a discontinuity index . The set of discontinuity locations are independent and, since intervals have VC-dimension 2, applying standard uniform convergence guarantees implies that for any , with probability at least the following holds for all

Setting the failure probability to be , taking the union bound over all K discontinuities, and summing the resulting bounds, the following holds with probability at least

Using the fact that and taking the supremum over , the following holds with probability at elast

Let G denote the high-probility uniform convergence event above. Then we have

as required.

C Appendix for Applications (Section 4)

Lemma 11. Consider an adversary choosing clustering instances where the instance has symmetric distance matrix and for all is -smooth. The loss functions defined above are piecewise constant and f-dispersed for and -dispersed for

Proof. The key insight of Balcan et al. [9] for this family of algorithms is that for a fixed distance matrix D, the function is piecewise constant with at most pieces. That is, the algorithm will only output at most different cluster trees, and each is produced for some subinterval of the parameter space. Their argument is as follows: for any pair of candidate cluster merges, say merging clusters versus and , we can determine the values of the parameter for which the algorithm would prefer to merge instead of merging (i.e., the values of so that the distance between and is smaller than between and ). In particular, the algorithm will merge clusters and instead of or, equivalently, when

Since the above inequality is linear in , there is a single critical value of the parameter, given by

such that the relative preference of merging and or and changes only at . Moreover, the definition of c only depends on a collection of 8 points: the closest and farthest pair between between . In particular, every such critical parameter value c is given by

where are the indices of 8 points. Similarly to the knapsack example, we show that each critical parameter value is random and has a density function bounded by . From this, it follows that for any interval I of radius , the expected total number of critical values summing over all instances t = 1, . . . , T that land in interval I is at most . This also bounds the expected number of functions that are not constant on I. By Theorem 6, the functions are f-dispersed for , also implying -dispersion.

When the four distances present in the equation for c are distinct entries of the distance matrix D, then they are independent. However, it is possible that the closest and furthest pair of points between a pair of clusters can be the same, for example, when both clusters consist of just a single point. In this case, the corresponding distances are no longer independent, and we will need to modify our analysis slightly. Note a critical parameter c only arises for competing pairs of merges and that differ on at least one cluster (since otherwise both merges are identical). Moreover, since the set of clusters at any given round of the algorithm partition the data, any pair of clusters the algorithm encounters are either equal or disjoint. From this it follows that there are only four cases to consider depending on whether the closest and farthest pairs of points between are the same, and whether the closest and farthest pairs of points between are the same. That is, whether and whether

Case 1: . Rewriting expression for c given in (4), we have that c = X/(X + Y ). Moreover, both X and Y are the sum of two independent random variables having -bounded densities, so from Lemma 20, it follows that X and Y also have densities bounded by . Next, since X and Y are independent, take values in , and have -bounded densities, Lemma 22 ensures that the ratio bounded density.

Case 2: and . In this case, we are guaranteed that , and the expression for c simplifies to

Defining , and , we have that . The variables X, Y , and Z are independent, each have -bounded densities, and and with probability 1. Applying Lemma 23 to these random variables guarantees that the density function for

Case 3: and . This case is symmetric to case 2 and an identical argument applies.

Case 4: and . In this case, the distance between and is constant, as is the distance between . Therefore, for all values of we will prefer to merge the same pair of clusters and there is no critical parameter value where we switch from one merge to the other.

In every case, the density of the critical parameter value is upper bounded by , completing the proof.

Figure 1: Relationship between the binary search intervals [a, b] and [c, d] and the true interval on which

C.1 Single Parameter Piecewise Unique Algorithms.

Next we provide a general approach for obtaining semi-bandit feedback that applies to many single-parameter algorithms. This enables semi-bandit feedback, but we still rely on problem-specific dispersion analysis. This approach applies to any algorithm with a single real-valued parameter whose output is both a piecewise constant function of the parameter for any instance, and such that no output value is repeated across any distinct intervals in the piecewise decomposition. We call such an algorithm single-parameter piecewise-unique. Without loss of generality, we assume that the parameter space is given by be an algorithm mapping problem instances and parameters to outputs in some space Y. Given a parameter and a problem instance x, and an accuracy parameter , we will return both , together with an interval such that for all Moreover, for any point . In other words, the interval I output by the algorithm is nearly the largest piecewise constant interval containing . The high level idea of our approach is to run binary search twice to determine the upper and lower bounds and , respectively. Each search will require that we run the algorithm times. In cases where the algorithm parameters are specified using b bits of precision, then this procedure exactly determines the interval using O(b) invocations of the base algorithm. Pseudocode is given in Algorithm 5. Steps 3 and 4 perform binary search to find the upper bound on the constant interval, while steps 5 and 6 perform binary search to find the lower bound.

Lemma 16. Let be any single-parameter piecewise-unique algorithm and suppose and is output by Algorithm 5 when run on A with problem instance , parameter , and target accuracy . Then Algorithm 5 runs the base algorithm we have that , and for all

Proof. From step 1 of the algorithm, we know that , by definition. Since the algorithm is single-parameter and piecewise-unique, we know that will output for all belonging to some interval containing , and it will not output for any point outside that interval. In particular, restricted to the interval , there is exactly one critical parameter value, namely below which the algorithm always outputs and above which the algorithm always outputs something different. The binary search performed in step 3 guarantees that is always contained in the interval [a, b], yet on each iteration the length of the interval is halved. Similarly, each iteration of the binary search in step 6 guarantees that , and the width of the interval halves on each iteration. Each iteration of both binary search instances requires us to run the base algorithm A once, and we will require iterations to guarantee the width of both intervals is less than

Since and , we have and it follows that for all , as required. Moreover, we know that and , implying that for all , as required. Figure 1 depicts the relation between [a, b], [c, d], and at the end of the algorithm.

D Transformations of Bounded Densities

In this section we summarize several useful results that provide upper bounds on the density of random variables that are obtained as functions of other random variables with bounded density functions. These results allow us to reason about the distribution of discontinuity locations that arise as transformations of random problem parameters in algorithm configuration instances.

In many cases, we make use of the following result:

Theorem 17 (Density Function Change of Variables). Let be a random vector with joint probability density function and let be any bijective differentiable function. Then the random vector also has a density function given by denotes the Jacobian of evaluated at y.

Lemma 18 (Lemma 6 from [11]). Suppose X and Y are random variables taking values in (0, 1] and suppose that their joint distribution is -bounded. Then the distribution of

Lemma 19 (Lemma 8 from [11]). Suppose X is a random variable with a -bounded density and suppose c is a constant. Then -bounded density

Lemma 20. Let X and Y be two independent random variables each having densities upper bounded by The random variable U = X + Y has density satisfying

Proof. Let be the density functions for X and Y , respectively. The density for U is the convolution of . With this, we have

It follows that U = X + Y has a density that is upper bounded by

Lemma 21. Let X and Y be random variables with joint density that is -bounded and such that with probability 1 and let U = X/Y . Then the density function

Proof. Consider the change of variables given by U = X/Y and V = Y . This corresponds to the transformation function . The inverse of is given by . The Jacobian of

whose determinant is always equal to v. Therefore, the joint density of U and V is given by

To get the marginal density for U, we integrate over v and use the fact that the density whenever |y| > M. This gives

It follows that the density for , as required.

Lemma 22. Let X and Y be independent random variables with -bounded densities so that with probability one and define Z = X/(X + Y ). The random variable Z has a density function

Proof. Consider the change of variables given by U = X and V = X + Y . We will argue that the joint density is -bounded. Then, since with probability 1, we can apply Lemma 21 to ensure that the density of Z = U/V is bounded by , as required.

It remains to bound the joint density of U = X and V = X + Y . This change of variables corresponds to the transformation function , whose inverse is given by Jacobian of is given by

whose determinant is always 1. It follows that the joint density for (U, V ) is given by , as required.

Lemma 23. Let X, Y , and Z be independent random variables with -bounded densities such that and with probability one. Then the random variable has a density that satisfies

Proof. Consider the change of variables given by U = X + Y , V = Z + Y . We will argue that the joint density -bounded. Then, since with probability 1, we can apply Lemma 21 to ensure that the density of R = U/V is bounded by , as required.

It remains to bound the joint density of U = X + Y and V = Z + Y . Consider the change of variables given by U = X + Y , V = Z + Y , and W = Y . This corresponds to the transformation function , and has inverse . The Jacobian of is given by

which always has determinant given by . It follows that the joint density for (U, V, W) is given by

To get the joint density over only U and V we integrate over w:

as required.

E Discretization Based Algorithm

In this section we provide a single algorithm that has regret bounds for the full-information, semi-bandit feedback, and bandit-feedback settings. The high level idea is to reduce the problem to a finite-armed bandit and apply the Exp3-SET algorithm of Alon et al. [2], which enjoys regret bounds in all three of these settings. In particular, we choose a value of r > 0 and construct a r-net for the parameter space C (which can be done using at most points when C is contained in a ball of radius R in d dimensions. Then we run the Exp3-SET algorithm over this finite set). Exp3-SET is guaranteed to have bounded regret compared to the best point from the r-net, and we use f-dispersion to bound the expected difference in total loss between the best discretized point and the best point in all of C by TLr + f(T, r). We then get bounds for each of the three feedback regimes by tuning the parameters of this single algorithm.

On each round, we assume that the learner observes a feedback set , together with the loss for all points . In the full information setting, we take to be the entire parameter space. In the bandit feedback setting, we take is the point played by the learner in round t. And in the semi-bandit feedback setting, we take to be the piece of that contains the point played by the learner. Using this notation, pseudocode is given in Algorithm 6.

Theorem 24. Let be contained in a ball of radius R and be possibly random piecewise L-Lipschitz functions that are f-dispersed and have an -interior optimum. Then running Algorithm 6 with discretization parameter , the following claims hold:

In each of the above bounds, the role of the dispersion parameter f is clean. For both the full-information and semi-bandit feedback settings, the only term in the bound that depends on Similarly, for bandit feedback, the bound depends on f only through the term. In either case, it is clear that the regret bound after T rounds depends on the dispersion properties in neighborhoods of radius , respectively. We know that the function should be non-decreasing in the parameter , so this also shows that in one sense we require “less” dispersion in the bandit setting, since the bandit bounds depend on

Proof. Let be the sequence of loss functions. By assumption, with probability one there is an optimal parameter in hindsight such that . It follows that at least one of the discretization points must belong to this ball. Let be such a point. We use f-dispersion to argue that the total loss of is not much more than the total loss of in expectation over the loss functions

be the number of loss functions that are not L-Lipschitz on . We know that TLr + D, since D of the functions have discontinuities and the remaining are L-Lipschitz. Taking the expectation over and using the fact that , we have that

Therefore, it is sufficient to show that Algorithm 6 is competitive with in expectation over the algorithm randomness.

Algorithm 6 runs a special case of the Exp3-SET algorithm of Alon et al. [2] over the discretized set of parameters . In particular, we have assumed that the feedback system takes a particular form: on each round the arms are partitioned into sets such that playing any arm from one set reveals the loss for all arms in the same set. Alon et al. [2] bound the regret of this algorithm for a broader class of feedback systems, where on every round there is an undirected graphs defined over the arms fo the bandit and playing an arm reveals not only its loss, but also the loss of all adjacent arms in the feedback graph . Our special case corresponds to the setting where is the union of several cliques (in the full-information setting, is the complete graph, in the semi-bandit setting, is the union of M cliques, where M is the number of pieces, and in the bandit setting, is the union of N cliques, each consisting of a single node). Corollary 3 of [2] bound the regret of the algorithm by

where is the size of the largest independent set in the feedback graph and the expectation is over the randomness of the algorithm, and holds for any oblivious choice of loss functions feedback graphs . In particular, if for some number on all time steps, then setting

Finally, taking expectations over the randomness in the loss functions , the above choice of , and the fact that

Full information. In the full information setting, the feedback graph is the complete graph, for which the

largest independent set is of size . In this case, choosing

and gives expected regret bounded by

Semi-bandit feedback. Next suppose that each function is piecewise defined with at most M and when the learner plays a point , they learn the loss for all points belonging to the same piece. In this case, the feedback graph consists of the union of M cliques, and the largest independent set has size at most . Again, choosing leads to and gives expected regret

bound by O

Bandit feedback. Finally, consider the bandit feedback setting. In this case, the feedback graph has the empty edge-set and therefore the largest independent set is of size (the total number of arms). Choosing

and expected regret bounded by

Note, the condition that with probability one is only for technical convenience. We can always modify the optimization problem so that this condition is satisfied. In particular, we define an enlarged parameter space and replace the utility function with its Lipschitz extension to this modified problem we are guaranteed that there exists an optimal parameter in the -interior of , and the Lipschitz-extended functions are still piecewise Lipschitz and f-dispersed for the same f.

Designed for Accessibility and to further Open Science