b

DiscoverSearch
About
My stuff
Optimal Contextual Pricing and Extensions
2020·arXiv
Abstract
Abstract

In the contextual pricing problem a seller repeatedly obtains products described by an adversarially chosen feature vector in  Rd and only observes the purchasing decisions of a buyer with a fixed but unknown linear valuation over the products. The regret measures the difference between the revenue the seller could have obtained knowing the buyer valuation and what can be obtained by the learning algorithm.

We give a poly-time algorithm for contextual pricing with O(d log log T +d log d) regret which matches the Ω(d log log T ) lower bound up to the d log d additive factor. If we replace pricing loss by the symmetric loss, we obtain an algorithm with nearly optimal regret of O(d log d) matching the Ω(d) lower bound up to log d. These algorithms are based on a novel technique of bounding the value of the Steiner polynomial of a convex region at various scales. The Steiner polynomial is a degree d polynomial with intrinsic volumes as the coefficients.

We also study a generalized version of contextual search where the hidden linear function over the Euclidean space is replaced by a hidden function  f : X → Yin a certain hypothesis class H. We provide a generic algorithm with  O(d2) regret where d is the covering dimension of this class. This leads in particular to a ˜O(s2) regret algorithm for linear contextual search if the linear function is guaranteed to be s-sparse. Finally we also extend our results to the noisy feedback model, where each round our feedback is flipped with a fixed probability p < 1/2.

In the contextual search problem a learner tries to learn a hidden linear function  x ∈ Rd �→ ⟨v, x⟩ forsome unknown  v ∈ Rd. In every round, the learner is presented with an adversarially chosen vector xt ∈ Rd, and is asked to provide a guess  yt ∈ Rfor the dot-product  ⟨v, xt⟩, subsequently learning whether  yt ≤ ⟨v, xt⟩ or yt > ⟨v, xt⟩and incurring a loss  ℓ(yt, ⟨v, xt⟩). The goal of the learner is to minimize the total loss (the regret), which is given by �t ℓ(yt, ⟨v, xt⟩).A special case of this problem is contextual pricing [2, 8, 11, 14, 16, 17, 19, 23]. In this setup, the vectors  xtare features representing differentiated products and the learner is a seller whose decision at round t is how to price item  xt. Given a price  yta buyer with valuation  ut = ⟨v, xt⟩buys the product if  ut ≥ ytand doesn’t buy otherwise. The seller only observes the purchase or no-purchase decision. The loss in each round is the difference between the revenue made by the seller  yt · 1{yt ≤ ut}and the revenue the seller could have made if v was known. Formally, the pricing loss is given by:

image

A second important case is called symmetric contextual search where the loss is the difference between the guess  ytand the actual dot product  ⟨v, xt⟩. This loss function arises in personalized medicine [4] where the learner chooses the dosage of a medicine and observes whether the patient was over-dosed or under-dosed. Another application is one-bit compressed sensing [9,22] where the learner only observes the sign of a measurement. In either case, we will consider the following loss:

image

Optimal Regret Bounds for Contextual Search In one dimension, both contextual pricing and symmetric contextual search reduce to non-contextual problems and are well understood. For the symmetric loss the optimal regret in the one-dimensional case is Θ(1) using binary search. For pricing, the optimal regret in the one-dimensional case is Θ(log log T) using the algorithm of Kleinberg and Leighton [14]. These results immediately imply Ω(d) and Ω(d log log T) lower bounds for the general contextual case (see Section 2.4 for details on optimality).

In this paper, we design polynomial-time algorithms with regret O(d log log T + d log d) for contextual pricing and O(d log d) for symmetric contextual search, matching these lower bounds up to a log d factor. This improves over the previously known bounds in [17], which are  O(d4 log log T)for pricing and  O(d4) for symmetric contextual search.

Steiner polynomial The main technique driving these results is a new potential function based on the Steiner polynomial. This is an object from integral geometry that is closely connected with the notion of intrinsic volumes which were used in [17] to derive the previously known bounds for this problem.

Given a convex set  S ⊆ Rd, Steiner showed that the volume of the Minkowski sum Vol(S + zB) is a polynomial of degree d in t (where B is the unit ball). The intrinsic volumes  Vj of Scorrespond to the coefficients of this polynomial after normalization by volume  κj of the j-dimensional ball:

image

Both our algorithm and the the algorithm in [17] keep track of the set  Stof vectors consistent with observations seen so far. The intrinsic volumes approach keeps track of  Vj(St) and shows that the loss incurred in round t is proportional to the decrease of one of the suitably normalized intrinsic volumes (i.e.  Vj(Sj)1/j for some index  j ∈ {1, . . . , d}). In this paper, instead of keeping track of each coefficient individually, we control the value of the Steiner polynomial itself at different values of z. Specifically, we show that for some set of  d values {z1, z2, . . . , zd}it is possible to always choose an i (based on the current width of our set) so that  Vol(S + ziB) decreases by a constant fraction. This leads to nearly optimal bounds on regret (via a much simpler proof than that in [17]).

Framework for learning with binary feedback While the Steiner polynomial technique largely resolves the classical problem of contextual search, there is a wide class of learning problems with binary feedback that either do not fit within the framework of learning a linear function, or which impose additional constraints on the linear function that one would hope to leverage.

One example is sparse contextual search, where the hidden vector  v ∈ Rd is guaranteed to be s-sparse, i.e.,  ∥v∥0 ≤ s. This captures settings where we expect few features to matter to the buyer. Another interesting problem is when we are asked to guess maxi vixiinstead of  ⟨v, x⟩. Thiscorresponds to learning the valuation of an unit-demand buyer. This problem is challenging since the set of vectors  v ∈ Rd consistent with observations seen so far is not necessarily convex.

Both of these examples are special cases of a general framework for online learning problems under binary feedback. In our general setup, the learner is trying to learn a function f in a hypothesis class H containing functions mapping from a context space X to an outcome space Y. In each step a context  xt ∈ Xis chosen adversarially and the learner is asked to submit a guess  ytfor the value of  f(xt) and incurs a loss  ℓ(yt, f(xt)). The goal of the learner is to minimize the total loss �Tt=1 ℓ(yt, f(xt)). The original setup corresponds to the case where X is some subset of  Rd e.g.the unit ball  B ⊆ Rd, Y = [−1, 1], His the class of all linear functions  fv(x) = ⟨v, x⟩ for v ∈ B.

Our main result in this space is an algorithm with regret  O(d2) where dis the covering dimension of the hypothesis class H (see Definition 3.3). This result immediately improves the regret of symmetric contextual search from ˜O(d) to ˜O(s2) 1. Similarly, this result immediately implies an O(d2) regret algorithm for the unit-demand buyer problem. We accomplish this by generalizing the Steiner polynomial idea for linear contextual search to a general “Steiner potential” defined for any hypothesis class (see the Techniques subsection below).

We contrast these results in Section 6 with the full feedback case in which the algorithm learns f(xt) each round. In this full-feedback setting, we give matching upper and lower bounds (up to constant factors) on the achievable regret. Our results here are based on a notion we introduce of tree-dimension of a hypothesis class, which is a continuous analogue of Littlestone dimension.

Techniques in the general case The Steiner polynomial is defined for convex sets living in the Euclidean space. Intriguingly, it is possible to generalize (in some sense) this geometric technique to arbitrary classes H of hypotheses. Instead of keeping track of all functions in the hypothesis class that are consistent with the feedback so far, we keep track of an expanded set of functions that don’t violate the feedback up to a certain margin (in the linear case, this is exactly  S + λB). Thishas the effect of regularizing the set of consistent hypotheses and allows for faster progress. Instead of volume, in the general case we control the size of an  ǫ-net of the set of these approximately valid hypotheses.

A second technique we use is adaptive scaling, which involves keeping track of multiple levels of discretization. For the linear case, this boils down to controlling the value of the Steiner polynomial at different values of  λ. More generally, at each step, we can estimate the maximum possible loss achievable in this round given the previous feedback. Based on this value, we will choose a scale, which will dictate the granularity of the  ǫ-net and the margin with which we prune inconsistent hypotheses. After picking the scale we show that it is possible to pick a (random) cut that will either: (i) reduce the number of valid hypotheses in the chosen granularity by half; or (ii) eliminate one valid hypothesis at a much coarser granularity. This will require a careful coupling between the discretizations at two different levels. This coupling between two levels is what allows us to overcome the fact that in the general case we can’t rely on techniques from convex geometry. See Section 4.2 for details.

One important feature of all our algorithms (not shared by previous algorithms) will be our use of randomness, in particular perturbed guesses. Every round, we compute the median  mt ofthe set  f(xt) where franges over the set of approximately valid hypotheses. However, instead of guessing the median  mtdirectly we guess one of the two values (chosen uniformly at random) in {mt − δ, mt + δ}, where the size of perturbation  δdepends on our current scale  λ. Our guarantee is that the potential function will decrease significantly for one of the two choices (and thus in expectation).

Noisy Contextual Search The final direction in which we extend the original contextual search problem is by considering noisy binary feedback, i.e., the feedback of the algorithm is flipped with probability p < 1/2. In this setting we move from keeping track of a set of approximately valid hypotheses to a pseudo-Bayesian approach, where we maintain a distribution w over approximately valid hypotheses and update it as we receive feedback. By carefully bounding the weight of hypotheses within a ball of radius 1/T, this results in an algorithm with regret O(d log T).

Ideally, it would be possible to combine this algorithm with the adaptive scaling technique of the noiseless setting, resulting in an O(poly(d)) regret algorithm for general hypothesis classes. One such approach is to replace the notion of width with a fuzzier notion, based on how tightly concentrated the distribution is along the current context vector (e.g. the width of the smallest strip in this direction which contains 1  − ǫof the mass of the distribution). We can then choose a scale based on this distributional width, and choose the size of the perturbation based on this scale (as in the deterministic case).

This type of approach works, conditional on being able to show that when the distribution concentrates along a thin strip, the true hypothesis is close to this thin strip with high probability. Unfortunately, doing this for general hypothesis classes seems hard – fortunately, it is possible to do this for the specific case of symmetric contextual search by leveraging the Euclidean geometry of the ambient space (see Section 5.2 for more details). This leads to an algorithm for noisy linear contextual search which gets O(poly(d)) regret, and is the first algorithm we are aware of for contextual search in the noisy setting which gets any regret independent of T for d > 1.

Summary of main results To summarize, our results include:

Algorithms with regret O(d log d) for symmetric contextual search (Section 2.1) and O(d log log T+ d log d) for contextual pricing (Section 2.2). Both algorithms are optimal (up to log d) and have only O(d) overhead with respect to the non-contextual case. Both algorithms can be implemented efficiently in poly(d, T) time.

General algorithm for learning a function from a hypothesis class H under binary feedback with regret  O(d2) where dis the covering dimension of the hypothesis class (Section 4.3).

An algorithm for symmetric contextual search with noisy binary feedback with O(poly(d)) regret (Section 5.2).

Related work Core to our results is the idea of coupling together potentials at many different scales. Similar ideas of “adaptive discretization”, “zooming”, and “chaining” exist throughout the online learning literature [6,15,24] and statistical learning theory literature [7,10]. Algorithms in these works also often construct several layers of discretizations and have learning rates parameterized by the covering dimension of the ambient space. However, these algorithms are usually designed for settings where (1) one cannot hope for better than  O(√T) regret (let alone regret independent of T), (2) feedback is not binary but rather zeroth-order ( [21] study a pricing setting where feedback is binary, but where the hypothesis class is large enough that one must incur O(poly(T)) regret). In particular, we believe our technique of coupling together the potentials for different scales in the analysis of Theorem 4.3 is novel.

Our results in the full feedback case (Section 6) – parameterizing the optimal regret in terms of the tree dimension – can be seen as a generalization of similar results for Littlestone dimension [18] (indeed, in the case where Y = {0, 1}, our notion of tree dimension reduces to Littlestone dimension). While there do exist measures which capture the learnability of functions taking values over a metric space (for example, the fat-shattering dimension [3] for real-valued functions), as far as we are aware the notion of tree dimension we introduce does not currently exist in the literature. It is an interesting open direction to connect the notion of tree dimension we present with previously studied measures.

We start by describing the contextual search setup and establishing some useful notation. The hidden object is a vector  v0belonging to the unit ball  B = {v ∈ Rd; ∥v∥2 ≤ 1}. In each round t ∈ {1, . . . , T}the learner is provided an (adversarially chosen) vector  xt ∈ Band asked to provide a guess  yt ∈ R 2. Upon guessing, the learner incurs loss  ℓ(yt, ⟨v0, xt⟩) and receives feedback  σt ∈{−1, +1}corresponding to whether  yt > ⟨v0, xt⟩ (σt = +1) or yt < ⟨v0, xt⟩ (σt = −1). If yt =⟨v0, xt⟩, then the feedback is arbitrary. In other words:

image

This allows the learner to keep track of the set  Stof vectors consistent with observations seen so far:

image

It is clear from the above setup that for both pricing and symmetric loss, it suffices to consider when  ||xt||2= 1 for all rounds.

Throughout the execution of the algorithm we will keep track of the Steiner potential  Vol(St+zB)where  Vol(·) is the standard volume in  Rd and the sum is the Minkowski sum:

image

We will evaluate the potential at different points depending on the width of  Stin the direction xt. We define the width as:

image

2.1 Symmetric loss with O(d log d) regret

We start with the symmetric loss function  ℓ(yt, ut) = |yt − ut|, where we will show it is possible to obtain O(d log d) regret. The main idea of this algorithm is to choose a value  zibased on the width of  Stin the direction of the current context and then choose a guess  ytthat splits the set St + ziBin two parts of equal volume. By doing this, we will show that  Vol(St + ziB) (the “Steiner potential”) decreases by a constant multiplicative fraction. Since  Vol(St + ziB) is bounded below by  Vol(ziB), we can only do this some number of times (roughly  d log(1/zi) times), from which our regret bound will follows.

We describe the algorithm below (ignore for now issues of computational efficiency; we will address these in Section 2.3):

image

Proof. Assume the feedback is  σt= +1 (the other case is analogous). Then  St+1 = {v ∈ St; ⟨v, xt⟩ ≥yt}. In Figure 1 we depict the set  St + ziB and St+1 + ziB. The part of  St+1 + ziB with ⟨v, xt⟩ ≥ ytis exactly  {v ∈ St + ziB; ⟨v, xt⟩ ≥ yt}which has volume 12Vol(St + ziB).

To bound the remaining part, let C be the largest volume of a section of  St+ziBin the direction xt(see the right part of Figure 1). The total volume of  St+ziBcan be bounded below by comparing it to the two cones formed by taking the convex hull of the section of largest volume inside the band and the extreme points  q1 and q2, which are at least 2−(i+1) apart in the  xtdirection. The volume of the two cones is at least:

image

Finally, note that the region of  St+1+ziB with ⟨v, xt⟩ ≤ ythas cross-section with volume at most C and width zi in the xtdirection so its volume is at most  Czi ≤ 14Vol(St + ziB), thus completing the proof. ■Theorem 2.2. The regret of the Multiscale Steiner Potential algorithm is at most O(d log d).

Proof. Every time we choose index i, the loss at is most 2−i and the volume of  Vol(St+ziB) decreases by a constant factor. The set  Stis never empty since  v0 ∈ St for all t, therefore  Vol(St + ziB) ≥Vol(ziB) = zdi Vol(B). For this reason we can’t pick index i by more than  O(d log(1/zi)) times, so the total regret is at most:

image

Figure 1: Illustration of the proof of Lemma 2.1

2.1.1 Comparison with other approaches

Is the Steiner potential necessary? One natural algorithm for this problem is to query  yt such thatVol({v ∈ St; ⟨v, xt⟩ ≥ yt}) = 12Vol(St) (i.e. guess the median without inflating the set). The best upper bound from [17] shows only that this has regret at most 2O(d log d) and a lower bound given in the example in Section 8 of [19] shows that this algorithm has regret at least Ω(d2). Inflating the set by taking the Minkowski sum with a ball seems to be the appropriate regularization that allows us to overcome the  d2 lower bound.

Another natural algorithm is to guess  yt = 12 (minv∈St⟨v, xt⟩ + maxv∈St⟨v, xt⟩). This algorithm was shown to have regret at least 2Ω(d) in [8].

2.2 Pricing loss with O(d log log T + d log d) regret

We now study the pricing loss  ℓ(yt, ut) = ut − yt · 1{yt ≤ ut}. Unlike the previous case the loss function is discontinuous. While the loss when under-estimating  utis small, the loss when over-estimating is very large. In the one-dimensional setting, [14] obtains a O(log log T)-regret algorithm by a conservative variant of binary search that avoids over-estimating the actual value as much as possible.

As before, our algorithm will keep track of  Vol(St + ziB) for different values  zi. This time, however, we will guess more conservatively so that in the case of a no-purchase event, the potential will decrease by a large amount. We will do this in a way that each  zican lead to a no-purchase event approximately O(d) times.

image

Proof. Note that in this case, the set  {v ∈ St + ziB; ⟨v, xt⟩ > mt}is disjoint from  St+1 + ziB. Thedefinition of  mtimmediately gives the desired result. ■

image

Proof. First we upper bound the volume V of the strip  Vol({v ∈ St +ziB; mt −2zi ≤ ⟨v, xt⟩ ≤ mt}).Let C be the largest volume of a section of  St + ziBin the direction  xt(see right part of Figure 1). Then  V ≤ 2Czi. On the other hand since  St + ziBis a convex set and has width at least 2−2i+1,

image

Thus

image

Since the entire set  {v ∈ St + ziB; ⟨v, xt⟩ ≤ mt − 2zi}is disjoint from  St+1 + ziB, this completes the proof of the lemma. ■

Theorem 2.5. The regret of the Multiscale Steiner Potential for Pricing algorithm is at most O(d log log T + d log d).

Proof. First, note that the regret contributed from all rounds where  width(St; xt) ≤ 1/T is O(1).Next, we consider rounds with  width(St; xt) > 1/T. Note that for all of these rounds  i ≤ 2 log log T.If we choose index  i and yt > ⟨v0, xt⟩(leading to a no-purchase), the volume of  St + ziB is cut bya factor of 2−2i−1. Since Vol(St + ziB) ≥ zdi Vol(B) always, this can happen at most

image

times. The loss from each such query is at most 1. If we choose index  i and yt ≤ ⟨v0, xt⟩, thevolume of  St + ziBis cut by a factor of�1 − 110·22i−1�. This can happen at most

image

times. The loss from each such query is at most 2−2i. Thus, our regret is at most

image

2.3 Polynomial time implementation

We note that  ytcan be approximated using binary search as long as we can compute the volume of (St + ziB) ∩ Hfor any half-space H. It is enough to notice that we only need a constant approximation of the volume in the previous proof and in order to approximate the volume we only need access to a separation oracle [5,13,20]. Since  Stis a ball intersected with at most t halfspaces, it is trivial to obtain a separation oracle for it.

To obtain a separation oracle for  St + ziBit is enough to solve the problem of computing the distance from a query point to the convex set  Stwhich is itself a convex problem. Technically, this requires  Stto not be too small since the guarantee of cutting plane methods (like the ellipsoid algorithm) tells us that (given an initial ellipsoid  E ⊇ St), it is possible to compute an  ǫ-optimalsolution in time  O�T · poly(d) · log�Vol(E)ǫVol(St)��.

We can take E to be the unit ball; then this is algorithm is efficient as long as  Vol(St) isnever too small (anything at least exp(−poly(T)) is fine). Here we present a simple modification of Algorithms 1 and 2 that makes sure that the volume of  Ststays large enough throughout by preserving a small ball around  v0.

Initialize  S1 = (1 + T −4)Bto ensure that  B(v0, T −4) ⊆ S1 (where B(c, r) denotes the ball with center c and radius r). Now change the guess  yt to yt − δt where δtis sampled from the uniform distribution over [0, T −2].The total additional loss from this perturbation is O(1). Since the perturbation is smaller than  ziwe can use the same argument in Lemmas 2.1 and 2.4 with 2ziinstead of  zito bound the volume of the band making sure there is constant progress in the Steiner potential.

The advantage of this perturbation is that the probability that the cut passes through the ball of radius 1/T 4 around v0is at most 1/T 2 per period. So with probability 1−1/T, B(v0, 1/T 4) ⊆ St forall periods t. It follows that  Vol(St) ≥ 1T 4d Vol(B), and with this, the convex minimization problem can be solved in poly(d, T) time.

2.4 Optimality

Here we briefly discuss the optimality of our results, namely Theorem 2.2 and Theorem 2.5. Note that we set up the problem by assuming that the hidden vector  v0and all of the adversarially chosen contexts  xtare drawn from the  L2 unit ball. We may alternatively set up the problem by assuming that the hidden vector v is drawn from the cube [−1, 1]d and the contexts  xtare chosen from the  L1 ball i.e. ∥xt∥ ≤ 1 for all t. With this setup, it follows from a direct reduction to the one-dimensional case that we cannot do better than O(d) for symmetric loss and O(d log log T) for pricing loss.

Now we show how our proofs can be modified to achieve the same results, O(d log d) for symmetric loss and O(d log log T +d log d) for pricing loss, which are optimal in this modified setting up to logarithmic factors. We will run exactly the same algorithms. To see that the analysis remains the same, it suffices to note that our maximum loss in any round is still bounded by O(1) and the volumes are scaled by a factor of poly(d)d which becomes an additive d log d factor after taking the logarithm.

We will now consider a general model of learning with binary feedback that has contextual search as a special case. We will derive regret bounds based on the covering dimension of the hypothesis class. The driving technique will still be the Steiner potential together with two new ideas: (i) a new adaptive scaling that will either make a lot of progress in a finer granularity or slower progress in coarser granularity; (ii) randomized cuts that will reduce the potential with constant probability.

Consider a hypothesis class H consisting of functions mapping X to Y. We refer to X as the context space and Y as the output space. We assume that the output space Y is a totally ordered set, i.e, for each  y1, y2 ∈ Y with y1 ̸= y2we have either  y1 < y2 or y2 < y1.

The learning protocol is as follows: an adversary chooses some  f0 ∈ Hand in each round they choose some  xt ∈ X. The learner makes a prediction  yt ∈ Yand incurs loss  ℓ(yt, f0(xt)) for some loss function  ℓ : Y ×Y → [0,1]. Upon making a prediction, the learner receives feedback on whether yt ≤ f0(xt) or yt ≥ f0(xt) (the feedback is arbitrary in case of equality). It will be convenient to represent the feedback as a variable  σt ∈ {−1, +1} such that σt = +1 if yt > f0(xt) and σt = −1otherwise.

We make the following assumptions about the loss function throughout the paper:

 Reflexive: ℓ(y, y) = 0 for all  y ∈ Y.

 Symmetry: ℓ(y1, y2) = ℓ(y2, y1) for all y1, y2 ∈ Y.

 Triangle inequality: ℓ(y1, y2) ≤ ℓ(y1, y′) + ℓ(y′, y2) for all y1, y2, y′ ∈ Y.

 Order consistency: If y1 < y2 < y3 then max{ℓ(y1, y2), ℓ(y2, y3)} ≤ ℓ(y1, y3).

 Continuity: If 0 < ℓ < ℓ(y1, y2) then there are  y′, y′′ ∈ Y such that ℓ = ℓ(y1, y′) = ℓ(y′′, y2).

If Y = R, then for any continuous increasing function  φ : R → [0,1] and parameter  α ≤ 1, theloss  ℓ(y1, y2) = |φ(y1) − φ(y2)|α satisfies the desired properties. Note that while the symmetric loss function is easily cast in this framework the pricing loss is not captured. In Section 4.4 we give an impossibility result showing it is impossible to obtain covering-dimension bounds for the pricing loss.

3.1 Covering Dimension

Our loss bounds will be in terms of the covering dimension of the hypothesis class H. We start by defining a metric  d∞(·, ·) on Hinduced by the loss function:

Definition 3.1. For two hypotheses  f1, f2 ∈ H, let d∞(f1, f2) = supx∈X (ℓ(f1(x), f2(x)))

We can now introduce the notions of  ǫ-net and covering dimension.

Definition 3.2 (ǫ-net). For an ǫ, we say that a set  S ⊆ H is an ǫ-net of H under the d∞ metricif for every  h ∈ H, there is h′ ∈ S with d∞(h, h′) ≤ ǫ. Let Nǫ(H) be an ǫ-net of Hof minimum cardinality.

Definition 3.3 (Covering Dimension). Define the covering dimension H as

image

Note that this definition of Cdim differs from Hausdorff dimension in that we care not just about the limit  ǫ →0, but the largest value for any  ǫ ∈ [0, 1/2]); importantly, this guarantees us that, for any  ǫ ∈ (0, 1),

image

Note that we specify  ǫ ≤ 12 to avoid issues when  ǫis close to 1 - any other fixed constant upper bound p only changes this dimension by a constant factor of at most 1/ log(1/p).

We give a few quick examples to give intuition about covering dimension.

Example 3.4. The space of functions  f : [d] → {0, 1}with loss function  ℓ(y1, y2) = 1y1̸=y2 hascovering dimension d.

Example 3.5 (Contextual Search). Let B be the unit-ball in  Rd, i.e. B = {x ∈ Rd; ∥x∥2 ≤ 1}. Foreach  v ∈ B, let fv : B → Rbe defined by the dot product  fv(x) = ⟨v, x⟩. The linear contextual search problem is defined as the learning problem for class  H = {fv; v ∈ B}with loss function ℓ(y1, y2) = |y1 − y2|. This class has covering dimension O(d).

image

show that for any 0  < ǫ ≤ 12, there is an  ǫ-net of the sphere of size (1/ǫ)O(d). To do this, we can greedily place points in the unit ball such that no two are within  ǫof each other. If we draw an

2-radius ball around each point, these balls must be disjoint and contained in a ball centered at the origin of radius 1 + ǫ2. Thus, the maximum number of points we will place is

image

giving us an  ǫ-net of the same size.

Example 3.6 (Sparse Contextual Search). The sparse version of the contextual search problem is given by class  H = {fv; v ∈ B, ∥v∥0 ≤ s} where ∥v∥0 := |{i; vi ̸= 0}|. The loss function is still ℓ(y1, y2) = |y1 − y2|. The covering dimension of this class is O(s log d) (where we treat s as a constant and d as tending to  ∞). To see that the covering dimension is O(s log d), note that for any  ǫwe can use the result in the previous example to obtain an  ǫ-net of size

image

Example 3.7 (Unit demand). In the unit demand version of contextual search the set  X = {0, 1}dand the hypothesis class consists of functions  fw(x) = maxi∈[d] wixi for w ∈ [0, 1]d. The covering dimension of this class is O(d). This example corresponds to the puzzle in the introduction and corresponds to the economic situation where a seller wants to price a bundle of goods (represented by the context) but the buyer has an unit-demand valuation, i.e., only cares about the highestvalued item in the bundle.

To see that the covering dimension is O(d), note that the set  S = {ǫx|x ∈ {0, 1, . . . , ⌊1/ǫ⌋}d}forms an  ǫ-net and |S| ∼� 1ǫ�d.

For simplicity, in the following theorems we will assume our hypotheses map  X → Rand that our loss is given by  ℓ(y, y′) = |y − y′|. We will then remark on how to generalize our proof to other loss functions.

4.1 O(d log(T)) regret via Single-scale Steiner Potential

Bounds that depend on log(T) are typically easy for learning with binary feedback and can be obtained using different algorithms. The interesting question in this setting is how to obtain bounds that are constant in T. Nevertheless, it is instructive to start with a simpler algorithm with regret O(d log T) for d = Cdim(H). It will illustrate how the Steiner potential can be generalized to an abstract setting. Instead of keeping track of the hypotheses that are consistent with the feedback so far, we will keep an inflated version of that set.

This algorithm starts with a  T −1-net of the hypothesis class and keeps a set of candidate hypotheses that are approximately consistent with observations seen so far. For each  xt, it queries a random point around the median, halving the set of hypotheses with at least half probability.

image

The analysis is based on the following lemma:

image

Proof. Assume f0(xt) > mt + 2/T(the other case is analogous), then with probability half, the algorithm guesses  yt = mt + 2/Tand gets the feedback that  f0(xt) ≥ yt, which eliminates all the hypotheses  f such that f(xt) < mt + 1/T. These hypotheses constitute at least half of  Ht. ■

Corollary 4.2. The Single-scale Steiner Potential algorithm obtains regret O(d log T) in expectation.

Proof. The regret from periods where  |mt − f0(xt)| ≤ 2/Tis at most O(1). For the remaining periods, the size of  |Ht|is halved with at least 12 probability. Note that for such a t,

image

Next,  |Ht| ≥ 1 for all tsince there is some element in  H1 that is 1T -close to f0which is never eliminated. However, 1|H1| ≥ 1T d and 1|Ht| ≤ 1 for all t. Thus for any integer c, the probability that there are c periods with loss greater than 2T is at most T d( 32)c. Thus, the expected number of periods

with loss larger than 2/T is at most

image

4.2 Strategy for achieving constant (in T) regret

In this subsection we provide some intuition on how to improve the regret of our algorithm from O(d log T) to O(d2). In Single-scale Steiner Potentiala loss larger than 1/T causes Ht tohalf in size, but whenever it halves in size the only bound we can get for the loss is 1. To improve this bound, we need to guarantee that a loss of 1 can’t occur very often. Our strategy for doing that involves keeping multiple levels of discretization. Given  ytand the feedback  σt ∈ {−1, +1} wewill keep for each z > 0:

image

In other words, we keep an z-discretization of hypotheses along with all the hypotheses that are consistent with the feedback so far with an z-margin. The z-margin is important to guarantee that any hypothesis that is  z-close to f0will never be eliminated. We will also refer to  H0t as theset of hypotheses consistent with observations so far without any discretization or margin.

Our strategy will be to choose in each round some discretization level z based on the maximum possible loss achievable in this round. We will divide the space of losses in exponentially sized buckets and define i to be the index of the bucket where the maximum loss falls:

image

Now we choose  z = zi based on i, compute the median  mt of {f(xt); f ∈ Hzit }and guess either yt = mt − 2zi or mt + 2ziwith half probability each. Now one of two things can happen:

If the loss is larger than 2zi, the set Hzitwill decrease by a factor of 2 with half probability. This should happen at most log  |Nzi|times in expectation, generating loss 10  · 2−i log |Nzi| =O(2−i · d log(1/zi)).

If the loss is smaller than 2ziwe will show that the set  H2−itwill decrease by at least 1 element in expectation (Lemma 4.5), so we get loss  zi · |N2−i| = O(zi2di)

This leads to a regret of:

image

4.3 Analysis of the O(d2) algorithm

Theorem 4.3. Let d = Cdim(H). The Multi-scale Steiner Potential algorithm incurs expected regret  O(d2) in the binary feedback model.

image

Note if there does not exist an index  i such that width(H0t ; xt) ≤ 10 · 2−ithen we must actually have maxf∈H0t f(xt) = minf∈H0t f(xt) = f0(xt). In this case we know the value of  f0(xt) for sure so we simply query this value and incur 0 loss.

Lemma 4.4. If i is the index chosen in the t-th step and  |mt − f0(xt)| > 2zi then |Hzit | ≤ 12|Hzit+1|with probability at least 1/2.

Proof. The same as the proof of Lemma 4.1 replacing 1/T by zi. ■

The new ingredient is a “potential” argument when the loss is small:

Lemma 4.5. If i is the index chosen in the t-th step and  |mt − f0(xt)| ≤ 2zi, then with probability at least 1/2, |Hrt | ≤ |Hrt+1| − 1 for r = 2−(i+1)

Proof. By the choice of the index  i, width(H0t ; xt) > 10 · 2−(i+1) = 10r, so there must exist  f ∈ Ht0such that  |f(xt) − mt| ≥ 5r. Let’s assume that  f(xt) ≥ mt + 5r(the other case is analogous). The algorithm will query  mt + 2ziwith half probability and learn that  f0(xt) ≤ mt + 2zi (by theassumption that  |mt − f0(xt)| ≤ 2zi).

Such a query must eliminate some hypothesis  f ′ ∈ Hrt since there must be some  f ′ ∈ Hrt withd∞(f, f ′) ≤ r, so this hypothesis must satisfy  f ′(xt) ≥ mt + 4 · rand hence will be ruled out by the information from querying  mt + 2zi. ■

We can now proceed to prove Theorem 4.3.

Proof of Theorem 4.3. Let Aibe the number of times that index i is chosen by the algorithm and |mt − f0(xt)| > 2zi. Let Bibe the number of times that index i is chosen and  |mt − f0(xt)| ≤ 2zi.Combining the previous two claims (Lemmas 4.4 and 4.5), we have that

image

where  ri = 2−i.For each query with index i, the loss is at most 10ri.Also for queries with |mt − f0(xt)| ≤ 2zi, the loss is at most 2zi. Thus the total loss is at most �i(10riAi + 2ziBi). Itremains to note that

image

Corollary 4.6. In the Contextual Search with Symmetric Loss, if the hidden vector  v ∈ Rd isguaranteed to be s-sparse then there is an algorithm with total regret  O(s2 log2(d)).

Proof. By combining Theorem 4.3 and Example 3.6. ■

Remark. To adjust our proof to deal with any loss functions satisfying the assumptions outlined in Section 3 we make the following adjustments. We replace maxf∈H0t f(xt) − minf∈H0t f(xt) withL(maxf∈H0t f(xt), minf∈H0t f(xt)). Also, we replace  mt + 2zit with any y ∈ Y such that mt < y andL(mt, y) = 2zitand similar for  mt − 2zit(note this y exists by the continuity of the loss function).

4.4 Impossibility Results for Pricing Loss

The results from the previous section apply for loss functions that are somewhat well-behaved i.e. satisfying the conditions outlined at the beginning of Section 3. Clearly, the pricing loss function does not satisfy these assumptions. While one may hope to guarantee poly(d) log log T total loss (where d is the covering dimension of the hypothesis class), here we show that for the pricing loss function, it is actually impossible to guarantee regret in this case that is polynomial in the covering dimension.

Claim 4.7. Let B be the unit ball in  Rd. Consider the domain X = B and the hypothesis class H = {fv|v ∈ Rd, ||v||0 = 1, ||v||2 ≤ 1} where fv(x) = ⟨v, x⟩(note this is the same as the hypothesis class in Example 3.6 with s = 1). Any learner must incur at least �√d�regret over  d2 roundswith the pricing loss function.

Remark. Note that the covering dimension of H is O(log d) so the above claim implies that there is an exponential separation between the regret (in the pricing loss setting) and covering dimension.

Proof. Choose v uniformly at random from the d points (1, 0, . . . , 0), (0, 1, . . . , 0), . . . (0, . . . , 0, 1) and let the true function be  fv. Now let

image

Let  x2, . . . , xdbe obtained by cyclically permuting the coordinates of  x1. Now the adversary randomly permutes  x1, . . . , xdto obtain a sequence  xi1, . . . , xidand presents the points in that order to the learner. Note the learner gains no information when it guesses a value  y ≤ 1√2(d−1). If all of the

learner’s guesses through the first d rounds are at most 1√2(d−1) then the learner incurs loss at least

3 over the first d rounds and has gained no information. The adversary can then repeat this process.

Now it remains to consider when the learner guesses a value above 1√2(d−1) at some point within the

first d rounds. Say the first time this occurs is at round  j. With d−1dprobability, the learner incurs

√2(d−1) loss this round and is able to eliminate one of the d possible hypotheses. The problem

then effectively reduces to  d −1 dimensions. Repeating this argument inductively, we see that the adversary can guarantee regret

image

We now consider the binary feedback model with noise, where each round the feedback is (independently) flipped with probability p < 1/2. We will no longer be able to eliminate a hypothesis based on the feedback since it is always possible that the feedback was flipped, instead we will keep a weight function expressing the likelihood of each hypothesis given observations.

We start by giving a O(d log T) algorithm for general hypothesis classes. The algorithmic techniques will be standard and inspired by both Bayesian Inference and by algorithms in the Multiplicative Weight Updates family (such a Hedge or Weighted Majority). The analysis, however, will deviate from the usual analysis of multiplicative weights given the type of feedback. We don’t have access to the actual loss of the arm we pulled nor an unbiased estimator thereof. This will require both a new potential function as well as a modification of the multiplicative weights framework: instead of sampling from the distribution induced by the weights, we will get the (weighted) median advice on what is the right guess for this context.

Our main innovation is in Section 5.2 where we obtain an algorithm with O(poly(d)) regret (independent of T) for the linear contextual search case. Instead of one weight function, we keep a family of weight functions. Each weight function will correspond to different levels of uncertainty about the inner product of the hidden point with the current context. By using geometric techniques to analyze the stochastic evolution of the weights, we show that they must concentrate near the true hypothesis.

5.1 O(d log T) algorithm for a general hypothesis class

The usual approach in Bayesian inference is to start with a uniform prior over the set of hypotheses and given each observation, compute the posterior. It is important to emphasize that the true hypothesis  f0in our model is still chosen adversarially. The Bayesian inference only serves to provide the intuition.

The algorithm will be as follows: as before we will start with a discretized version of the hypothesis class  NT −2(H) which we will call N for short in this section. We will keep a weight function  wt : N → Rwhich roughly expresses the likelihood that a hypothesis is close to the true hypothesis. Our guess will be a perturbed version of the weighted median  mtof the set {f(xt); f ∈ N}. Formally, the weighted median  mtis a number that satisfies:

image

After receiving the feedback, we will update the weights in the following way (we will choose  yt atrandom such that  yt = f(xt) occurs with zero probability):

image

for some parameter  p′ and re-normalizing afterwards:

image

In standard Bayesian inference, we would normally use  p = p′. For this algorithm, we will choose any parameter  p′ with p < p′ < 1/2. The actual choice of parameter will only affect the constants. Also note that unlike Bayesian inference we don’t choose the guess with largest likelihood but a perturbed version of the median.

image

Theorem 5.1. In the noisy feedback model, the above algorithm incurs expected regret O(d log T).

We will denote the true hypothesis by  f0as usual. Since  f0might not belong to the discretized set N, we will control the weight that is placed on the closest hypothesis. Let  f1be a hypothesis in  N with d∞(f1, f0) ≤ T −2.

Lemma 5.2. If f1(xt), f0(xt)are on the same side of  yt and W −tis the total weight mass on the other side of  yt, then we have the following equality:

image

Here the expectation is taken over the randomness in the feedback, and c is a constant satisfying 0 < c < 1 given by

image

Proof. Assume wlog that  f1(xt), f0(xt) ≥ yt and let W −t = �f∈N;f(xt)<yt wt(f) be the weight on hypotheses in the opposite direction and  W +t = 1 − W −tthe remaining weight. Since  yt is chosenfrom a continuous distribution, the event that some  f has f(xt) = ytoccurs with zero probability. With probability 1  − p we have

image

with the remaining probability p we have:

image

Averaging them we have:

image

Lemma 5.3. In expectation over both the randomness in the choice of  ytand the randomness in the feedback, we have:

image

and the magnitude of the perturbation is 1/T. In this case, by the previous lemma, we have: E [1/wt+1(f1)] = 1/wt(f1). With the remaining probability  f1(xt) and f0(xt) are on opposite sides of  ytand we can use the trivial bound in the equation below.

image

Combining, we get the desired inequality. ■

Lemma 5.4. If |f0(xt) − mt| > 2T then for the constant c in Lemma 5.2, then in expectation over both the randomness in the choice of  ytand the randomness in the feedback, we have:

image

Proof. Assume without loss of generality that  f0(xt) > mt + 2T .Since the magnitude of the perturbation is 1/T, f1(xt) will be on the same side of our guess as  f0(xt) and hence we can apply Lemma 5.2. With probability at least 1/2 we have yt > mt and hence W −t ≥ 1/2. With theremaining probability we use the trivial bound  W −t ≥0. Combining those we get the bound in the statement. ■

The previous lemmas imply that  wt(f) grows by a constant factor (in expectation) whenever the median is far from the true point. We conclude the proof by showing that this can’t happen too often since weights are bounded.

Proof of Theorem 5.1. The regret bound follows directly from the fact that the probability of having |f0(xt) − mt| > 2/Tfor more than Ω(d log T) periods is at most O(1/T). Our strategy for proving this is to define a random process  Ytthat is a super-martingale, i.e.  E[Yt+1] ≤ Ytand argue that if |f0(xt) − mt| > 2/Thappens too often, then  YTwill be much larger than  Y1. This happens with small probability by Markov’s inequality.

image

Now define the following stochastic process:

image

It is simple to see that  Y1= 1. The previous lemmas imply that  Ytis a super-martingale, i.e. E[Yt+1] ≤ Yt and hence E[YT ] ≤ 1.Now, in the case that  |f0(xt) − mt| > 2/Tfor more than Ω(d log T) periods, we have

image

but since  E[YT ] ≤1, this can happen with at most O(1/T) probability by Markov’s inequality. ■

Remark. We’ve assumed here that p is a constant bounded away from 1/2. How does the regret of this algorithm depend on p as p approaches 1/2? If p = 12 − δ, then we can set  p′ = 12 − δ′ whereδ′ = δ/2. This leads to  c = O(δ2) – adapting the proof of Theorem 5.1 then shows we can have at most  O�d log Tδ2 �inaccurate rounds, for a total of at most  O�d log Tδ2 �regret.

Comparison with other approaches It is worth comparing our algorithm with other learning techniques in the multiplicative weights family. The ‘experts’ in our problem form a continuous set with a linear structure, which resembles the settings of Kalai and Vempala [12] and Abernathy et al [1]. In their setting, however, the optimal achievable regret is  O(√T) while in our case we achieve O(log T). Another feature of our model is the stochasticity of the losses. With stochastic losses, Wei and Luo [25] recently showed that multiplicative weight update algorithms achieve O(log T) regret when the learning rate is tuned properly, but their guarantees depend on the inverse of the gap between the two best arms. An important difference, however, is that in our setting we don’t have access to the loss. We only learn whether our guess was too large or too small, which doesn’t allow us to apply any of those algorithms.

5.2 O(poly(d)) algorithm for Noisy Contextual Search

In this section we study contextual search in the noisy feedback model. We show that here we can achieve total loss O(poly(d)) independent of T by exploiting the geometry of the Euclidean space. Throughout this section we will use  q0 ∈ Bto denote the true point, i.e.  f0(x) = ⟨q0, x⟩.

Our approach (Algorithm 6) builds off the Bayesian inference approach in the previous section (Algorithm 5) by combining it with the multi-scale discretization ideas in Section 4.3. At a high (and slightly inaccurate) level, Algorithm 6 works as follows. Throughout the algorithm, we maintain a distribution w over the unit ball B(0, 1) where w(q) represents the likelihood that q is our true point  q0. Each round t, we are provided a direction  xtby the adversary. We begin by measuring the “width” of our distribution in the direction  xt– i.e., the length of the smallest interval in this direction which contains almost all of the mass of our distribution w. Then (similarly as in Algorithm 5), we will guess a perturbed version of the median of w in the direction  xt,where the size of the perturbation depends on the width. Finally, we multiplicatively update the distribution w, penalizing points on the wrong side of our guess (again, similarly as in Algorithm 5).

In the actual algorithm, we maintain a separate distribution  wifor each possible scale  γi forthe width (in particular, we are in scale  wiif almost all of the mass of  wi−1is concentrated in a small strip in direction  xt). This aids analysis in letting us guarantee we operate in each scale for at most a bounded number of rounds, which lets us bound the total loss of this algorithm.

image

Theorem 5.5. Algorithm 6 incurs O(poly(d)) expected total loss for the problem of noisy linear contextual search.

The proof of Theorem 5.5 is structured roughly as follows. Let  Libe the (expected) loss sustained at scale i. We wish to show that �i Li = poly(d). To bound  Li, we’ll start by roughly following the analysis in Theorem 5.1. Specifically, we’ll look at the total weight (according to  wi)of a tiny ball surrounding the true point  q0. Let the weight of this ball at time  t be Wi,t. We’ll again show that 1/Wi,twhen suitably scaled is a super-martingale: it decreases in expectation by a large amount whenever our guess is far from accurate and cannot increase very much in expectation even if our guess is close to accurate. Since 1/Wi,tcannot decrease below 1, this lets us upper bound the number of rounds where we are far from accurate.

Now, even when we are far from accurate, we know that since we are in scale i, almost all of the mass of  wiis concentrated on some thin strip in direction  xt. If the true point  q0is located in or near this strip, this lets us bound the loss each round when we are far from accurate (since the median will lie in this strip). So it suffices to show that if a weight function  wiconcentrates on some thin strip, then with high probability, the true point  q0lies close to this strip.

To prove this, we again look at the weight of a small ball  Bα = B(q0, α) with radius  α around

q0(see left side of Figure 2). If we know that the weight on some strip is at least some threshold τ, then if wi,t(Bα) + τ >1, we know that the ball and strip intersect, and therefore  q0is at most distance  αaway from this strip. It thus suffices to show that  wi,t(Bα) > 1−τwith high probability.

image

Now, we will choose  α and τlarge enough so that this inequality is satisfied at time t = 0. We therefore only need to show that this is still true with high probability for all times t. Intuitively, this should be true – the amount of weight on a ball centered at the true point  q0should only increase as time goes on and we get more feedback (the feedback is noisy, so we might occasionally decrease the weight of this ball, but overall the increases should drown out the decreases). Proving this formally, however, is technically challenging and where we need to use the Euclidean geometry specific to linear contextual search.

To show this, we use the following lemma. Choose two points  q1 and q2on a line through  q0so that  q1lies between  q0 and q2. We claim that with high probability (for all times  t), wi,t(q1) ≥κ · wi,t(q2) for some constant  κ. To show this, observe that there is no half space which contains both  q0 and q2 but not q1. This means that the only way  wi,t(q2) can increase relative to  wi,t(q1)is if a guess separates  q2 from q1and if the feedback on this guess is noisy (right side of Figure 2). This occurs with probability p < 1/2 and is unlikelier than the alternative (which increases the weight of  q1relative to  q2). We can thus bound the ratio of  wi,t(q1)/wi,t(q2) from below with high probability over all rounds.

If we could union bound over all points in  Bαwe would be done (this inequality allows us to relate the weight of all the points outside  Bαto the weight of points inside  Bα). Unfortunately there are infinitely many points inside  Bαso we cannot apply a naive union bound. Luckily, we can show that nearby points are very likely to have similar weights: the only way the relative weight of two nearby points  q and q′ changes is if we guess a hyperplane separating  q and q′ – and since we add a perturbation to our guess every round, we can bound the probability of this happening. This allows us to repeat the previous geometric argument with  ǫ-nets instead of single points, which completes the proof.

Notation. Below, we will use  q0to denote the hidden point. We let B(0, 1) denote the unit ball and in general B(q, r) to denote the ball of radius r centered at q. We will use  wi,tto denote the

weight function  wi at round t. For a set  S ⊂ B(0,1), we use the notation

image

Let  αi = 12104d2i. Define the set  Sαi to be the αi-net consisting of all points in the unit ball whose coordinates are integer multiples of αid . Note that  |Sαi| ≤�2dαi�d. For all i, let Γi = B(q0, αi)∩B(0, 1)be the ball of radius  αicentered at  q0. For simplicity, throughout this proof we will assume that the feedback noise is fixed at p = 1/3 (it is straightforward to adapt this proof for any other p < 1/2; doing so only affects the constant factor of the loss bound).

image

As in the analysis of Algorithm 5, we begin by understanding how the reciprocal of our weight function 1/wi,t(p) evolves over time. This will allow us to construct various helpful super-martingales (for example, allowing us to bound the number of rounds we spend in each scale).

The following claim relates how 1/wi,t(q1) changes when  q1 and q0are on the same side of the hyperplane  ⟨xt, q⟩ = ˆy(i.e. is more likely to be consistent with feedback).

Claim 5.6. Consider a round t. Say the adversary picks direction  xtand the algorithm queries ˆy. Let  q1be a point such that  q1 and q0are on the same side of the hyperplane  ⟨xt, q⟩ = ˆy. Let

image

Proof. Recall that points that violate feedback have their weight multiplied by (1  − η) (and then the distribution is renormalized). With probability 1  − p = 2/3 (when the feedback is not flipped), we thus have that

image

Likewise, with probability p = 1/3 (when the feedback is flipped), we have that

image

Taking expectations over the feedback, we therefore have that

image

Points very close to  q0are likely to be on the same side of the hyperplane as  q0, allowing us to apply Claim 5.6.

Claim 5.7. Let q1 ∈ B(0, 1) such that ∥q1 − q0∥ ≤ αi. Then, in expectation both over the randomness in the feedback and the algorithm,

image

Proof. Since ∥q1 − q0∥ ≤ αi, and since ˆy is chosen by adding a uniform  βirandom variable to y, the probability that  q1 and q0are on opposite sides of the plane  ⟨xt, q⟩ = ˆyis at most αiβi . Combining this with Claim 5.6 gives us the desired result. ■

We now use Claims 5.7 and 5.6 to understand how 1/wi,t(Γi) changes over time (generalizing from single points to small balls). This first claim bounds the decrease in 1/wit+1,t+1(Γit+1) whenour guess is close to accurate.

Claim 5.8. Assume |y − ⟨xt, q0⟩| ≤ βit. Then, in expectation both over the randomness in feedback and in our algorithm,

image

Proof. Recall that y is the median of  wit,tin direction  xtas computed by our algorithm. Now, define the two quantities

image

These quantities represent the mass of  wit+1above and below the strip of width 2βit aroundthe median. Note that by the maximality of  it, either X− ≥ γ4dit+1/2 or X+ ≥ γ4dit+1/2; if not, then there exists a strip of width 2βit ≤ 10γit+1containing at least 1−γ4it+1d. Without loss of generality, assume  X− ≥ γ4dit+1/2.

Now, recall that ˆy is chosen uniformly in the interval [y − 2βit, y + 2βit]. We will divide the expectation in the theorem statement into three cases, based on where ˆy lies.

image

This case occurs with probability 14 − αit+12βit . Note that since  |y − ⟨xt, q0⟩| ≤ βit, in this case we also have that ˆy ≤ ⟨xt, q⟩ − αit+1. Therefore in this case we know that the ball Γit+1lies entirely to the left of the hyperplane  ⟨xt, q⟩ = ˆy. By applying Claim 5.6, we know that, conditioned on being in this case,

image

This case covers the ˆy where the hyperplane  ⟨xt, q⟩ = ˆyintersects the ball Γit+1. For ˆy in thiscase, we pessimistically bound the change in weight via

image

Case 3: remainder of the interval [y − 2βit, y + 2βit].

Since case 1 and case 2 together cover at least 1/4 of the interval, this case occurs with probability at most 3/4. In this case the ball Γit+1does not intersect the hyperplane (since all such ˆy are covered by case 2). We can therefore apply (the weaker variant of) Claim 5.6 to show that, conditioned on being in this case,

image

Combining these three cases, we have that

image

When our guess is far from accurate, we can instead (more strongly) bound the decrease in 1/wit,t(Γt).

Claim 5.9. Assume |y − ⟨xt, q0⟩| > βit. Then, in expectation both over the randomness in feedback and in our algorithm,

image

Proof. We essentially repeat the logic from the proof of Claim 5.8, with the change that we can more strongly lower bound  X−. Without loss of generality, assume that  y < ⟨xt, q0⟩ − βit. Define

image

Note that since y is the weighted median of  wit,tin the direction  xt, X− ≥ 1/2.

Now, we again have three cases. To begin, with probability 14 − αitβit , ˆylies in the interval [y, y + βit − αit]. Since y + βit < ⟨xt, q0⟩, the hyperplane  ⟨xt, q⟩ = ˆydoes not intersect Γit, andtherefore we can apply Claim 5.6 to show that

image

Likewise, the probability that the hyperplane  ⟨xt, q⟩ = ˆyintersects Γitis at most αitβit (in whichcase we can pessimistically bound the decrease in weight as in the proof of Claim 5.8), and with the remaining 3/4 probability the hyperplane  ⟨xt, q⟩ = ˆydoes not intersect Γit, and we can apply the weaker variant of Claim 5.6. Combining these observations, we get that

image

In this step we will bound the total number of rounds in each scale i. Specifically, our algorithm ensures that we move onto the next scale once either the weight concentrates on a strip or once Cigrows large enough - we will show with high probability that this is always due to the weight concentrating on a small strip.

Let  Aibe the number of rounds  t such that it = i and |y − ⟨xt, q0⟩| ≤ βi(i.e. the number of rounds where we are “accurate”). Let  Bibe the number of rounds  t such that it = i and|y − ⟨xt, q0⟩| > βi(i.e. the number of rounds where we are “inaccurate”). Note that  Ai + Bi = Ci.Also, recall that our algorithm ensures that  Ci ≤ 100( d4iγ10di + d25i) + 1 for all rounds t.

We first show that with high probability,  Biwill be no larger than O(poly(d)i).

Claim 5.10. For any constant c > 0, with probability at least 1  − 2−d4ic we have throughout all rounds that

image

Proof. We will construct a sequence  Zt so that Ztwi,t(Γi) is a super-martingale. Consider the sequence Ztdefined as follows.

 Z1 =� αi2�d.

If  it ̸∈ {i, i − 1} then Zt+1 = Zt.

If  it = i and |y − ⟨xt, q⟩| ≤ βi or it = i − 1 then Zt+1 =�1 − αiβi

If  it = i and |y − ⟨xt, q⟩| > βi then Zt+1 =�1 + 1d21�Zt.

image

(Here we have used the fact that  Vol(B(q0, αi)∩B(0,1)) must contain a ball of radius  αi/2.) Since Ytis a non-negative super-martingale, by Doob’s martingale inequality it holds that for any constant

image

However note that if there exists a round where  Bi ≥ 100d25i(1 + c), then for t sufficiently large

image

(Here in the last inequality we have used the fact that 2d4i ≥ (2/αi)d). Since wi,t(Γi) ≤ 1 forall t, this implies that  Yt ≥ 2d4ic, which immediately implies the desired claim. ■

Recall that in our algorithm, we check the following two conditions for determining the scale i we use for the current query:

There exists  a, b ∈ R such that |a − b| ≤ 10γi and�a≤⟨xt,q⟩≤b wi(q)dq ≥ 1 − γ4di .

 Ci−1 > 100(d4(i−1)γ10di−1 + d25(i − 1)).

We now show that with high probability, only the first condition is ever relevant.

image

Proof. Again, we will construct a sequence  Zt so that Ztwi+1,t(Γi+1) is a super-martingale. Consider the sequence  Ztdefined as follows.

 Z1 =αdi+12d

If  it ̸∈ {i, i + 1} then Zt+1 = Zt.

If  it = i and |y − ⟨xt, q0⟩| > βi or it = i + 1 then Zt+1 =�1 − αi+1βi+1

If  it = i and |y − ⟨xt, q0⟩| ≤ βi then Zt+1 =�1 + γ10di �Zt

Consider the ratio  Yt = Ztwi+1,t(Γi+1). Similarly as in the proof of Claim 5.10,  Y1 ≤1. Note that Claim 5.7 and Claim 5.8 imply that

image

so  Ytis a super-martingale. Now assume that  Ci ≥ 100�d4iγ10di + d25i�. By the constraints of our algorithm, we are guaranteed that

image

Also by Claim 5.10, with probability at least 1  − 1210d4i , Bi ≤ 1100d25iover all rounds t. This implies that eventually

image

Thus, for sufficiently large t, we have that

image

However note  Y1 ≤ 1 and Ytis a supermartingale. Also,  wi+1,t(Γi+1) ≤1 for all rounds t. Thus, by Doob’s martingale inequality, the probability we ever have  Ci ≥ 100�d4iγ10di + d25i�is at most

image

(the first term is from the probability that at some point  Bi ≥ 1100d25i). ■

image

We now aim to show that with high probability, if the weight function  wiis concentrated on a thin strip, this strip must be close to the true point  q0(this is necessary to bound the total regret we incur each round in scale i). To do this, we will argue that we can “round” points to the  αi-net  Sαiwithout significantly affecting their weight. We will then rely on the geometric observation mentioned earlier: that for points  q1, q2 ∈ Sαi for some i such that q0, q1, and q2are nearly collinear, we can relate the weights  wi,t(q1) and wi,t(q2). We begin by relating the weights of collinear points.

Claim 5.12. Fix an index  i. If q0, q1, and q2are collinear in that order, then with probability at least 1  − 2−d10i we have that for all rounds t

image

Proof. Consider a time step  t where it = i or it = i−1. We say a point is on the “good” side of the hyperplane  ⟨xt, q⟩ = ˆyif it is on the same side as  q0. Otherwise we say the point is on the “bad” side. Note for  q1, q2satisfying the conditions of the claim, one of the following statements must be true:

 Case 1: q1, q2are on the same side of the hyperplane  ⟨xt, q⟩ = ˆy.

 Case 2: q1is on the good side of the hyperplane and  q2is on the bad side of the hyperplane.

We will now consider the quantity  Rt =�wi,t(q2)wi,t(q1)�d9. Note that in Case 1, then  Rt+1 = Rt (therelative weights remain unchanged if both  q1 and q2are on the same side of the hyperplane). In Case 2,

image

By Doob’s martingale inequality, the probability that this ever happens is at most  γd10i ≤ 2−d10i,as desired. ■

We next relate the weights of nearby points  q1 and q2. If q1 and q2are close together, then it is unlikely they are ever separated by a hyperplane, and their weights should be similar. The following claim captures this intuition.

Claim 5.13. Fix an index  i. If q1, and q2 satisfy ∥q1 − q2∥ ≤ 2β10i , then with probability at least 1  − 2−d10i we have that for all rounds t

image

Proof. We will bound the number of rounds  t such that it = i or it = i −1 and the hyperplane ⟨xt, q⟩ = ˆyintersects the segment connecting  q1 and q2. We will show that with high probability, this quantity is at most  d9i. Note that since  wi,t(q1)/wi,t(q2) is unchanged when  q1 and q2 both lieon the same side of the hyperplane, and decreases by at most a factor of (1  − η) when they lie on different sides, this will show that with high probability

image

indices  t for which it = i or it = i−1. The probability that at least  d9iof the hyperplanes  ⟨xt, q⟩ = ˆyintersect the segment connecting  q2 and q1is at most

image

which implies our desired result.

image

Finally, we apply Claims 5.12 and 5.13 to bound the relative weights for all approximately collinear pairs of points in  Sα.

Claim 5.14. Fix an index i. With probability at least 1  − 2−d9i the following claim holds: for all

image

Proof. Fix a pair of points  q1, q2 ∈ Sαisatisfying the conditions in the statement. Let  q⊥ be thefoot of the perpendicular from  q1to the segment connecting  q2 and q0. Note that

image

Therefore, by the conditions of Claim 5.13, with probability at least 1  − 2−d10i, for all rounds t,

image

q0 and q2on this line. By Claim 5.12, this means that with probability at least 1  − 2−d10i, for allrounds t,

image

image

This is for a specific pair of points in  Sαi. Union bounding over all  |Sαi|2 pairs of points, we have that the theorem statement fails with probability at most

image

Finally, we prove that if  wiconcentrates on a strip,  q0 is within γiof this strip. To show this, it suffices to show that the weight of  B(q0, γi) is large enough that it must intersect a sufficiently concentrated strip. We do this by using Claim 5.14 to relate the weight of points of  Sαiinside and outside  B(q0, γi).

Claim 5.15. Fix an index i. With probability at least 1  − 2−d8i, the following statement holds for all t:

image

Proof. First for each point  q ∈ Sαi, consider an axis-parallel box centered at that point with side length αid . Now consider all rounds  t with it = i or it = i −1 and all planes of the form  ⟨xt, q⟩ = ˆyfor these rounds. We show that with high probability, all boxes intersect at most  d9iof these planes. Using essentially the same argument as in Claim 5.13, we find that this probability is at least

image

In both of these inequalities we are using the fact that if at most  d9iplanes intersect any box, then the weights of any two points in the same box are within a factor of (1  − η)d9i.

Next, consider the ball  B(q0, γi − αi). Let Ti = {B(q0, γi − αi) ∩ Sαi}. Consider the following two transformations:  f : Sαi → B(q0, γi − αi), which sends a point q to

image

and the transformation  g : B(q0, γi − αi) → Ti, where g(q) is the point obtained by rounding the coordinates of q to the nearest integer multiple of αid (note that g(q) ∈ Ti). If we consider the map q → q′ = g(f(q)) given by the above, the number of points  q ∈ Sαthat map to a fixed point  q′ ∈ Tiis at most

image

To see this, note that  g−1(q) is an axis-parallel box with side-length αid , and thus f −1(g−1(q))contains all the points in  Sαcontained within an axis aligned box with side-length 2αidγi , whichcontains at least (2/γi + 1)d < (10/γi)d points.

Now, note that  q and q′ = g(f(q)) satisfy the conditions of Claim 5.14. Thus, by Claim 5.14, with probability at least 1  − 12d9iwe have that

image

Combining the above with equations (1) and (2), we conclude that with probability at least

image

This implies that the ball  B(q0, γi) must intersect the strip  a ≤ ⟨xt, q⟩ ≤ b. If this happens then the desired condition is clearly satisfied. ■

image

Finally, we can proceed to prove the main theorem.

Proof of Theorem 5.5. First, for each  i, let Libe the total loss at scale i. We will bound  E[Li].

By Claim 5.11, with probability at least 1  − 12d4(i−1) , Ci−1 ≤ 100(d4(i−1)γ10di−1 + d25(i −1)) for all rounds. If this is true, then the only time we query at level i, there must be some strip given by a ≤ ⟨xt, q⟩ ≤ bof width at most 10γithat contains 1  − γ4diof the total weight of  wi. Thus, byClaim 5.15, with at least 1  − 12d4(i−1) − 12d8(i−1)probability, all queries at level i incur loss at most

12γi + 2βi ≤ 14γi. Now, by using Claim 5.10, we can bound the expected total loss at level i as

image

It follows that

image

Remark. Naively, one can implement Algorithm 6 with time complexity  T O(d), via the observation that T hyperplanes divide B(0, 1) into at most  O(T d) pieces, so we can simply compute this division and the weight of each distribution  wi(we care about at most T scales) on each component of this division.

It is an interesting open question if it is possible to implement Algorithm 6 (or otherwise achieve O(poly(d)) regret) with time complexity poly(d, T). To do so, it would suffice to be able to efficiently sample from the distributions  wi.

We also study the problem where the learner has full feedback, i.e, after the prediction  yt the feed-back is the actual value of  f0(xt). We show that the optimal regret can be completely characterized (up to constant factors) by a continuous analogue of the Littlestone dimension.

For this section we don’t require the assumption that Y is ordered, only that the loss function forms a valid metric (i.e. is symmetric and satisfies the triangle inequality).

6.1 Tree Dimension

Definition 6.1. A (X, Y)-tree of cost c is a labeled binary tree with the following properties

There is a root node and each interior node has two children

Each interior node is labeled with a triple (x, y1, y2) where x ∈ X, y1, y2 ∈ Y

For each leaf, the sum of  ℓ(y1, y2) over all nodes on the path from the root to the leaf is at least c.

Definition 6.2. We say a (X, Y)-tree T is H-satisfiable if we can label each leaf with some  f ∈ Hsuch that for each node (x, y1, y2) ∈ T, all leaves of the left subtree satisfy  f(x) = y1and all leaves of the right subtree satisfy  f(x) = y2.

Definition 6.3 (Tree dimension). We define τ(H), the tree dimension of H, to be the maximum cost of a (X, Y)-tree that is H-satisfiable.

Remark. Note we can naturally extend the above definition to any subset  H′ ⊂ H.

It is worth noting that covering dimension is “more restrictive” than tree dimension in the sense that bounded covering dimension implies bounded tree dimension.

Theorem 6.4. Let H be a hypothesis class consisting of functions mapping  X → Y and let L bea loss function that defines a metric on Y. If Cdim(H) is finite then

image

Before proving the above we prove a few preliminary lemmas.

Lemma 6.5. Let H be a hypothesis class and  ℓbe a loss function that defines a metric on Y. Let T be an H-satisfiable (X, Y)-tree where all leaves have depth d and such that for each internal node (x, y1, y2), ℓ(y1, y2) > 2−i. Then

image

Proof. Since the tree is H-satisfiable, we can label the leaves with functions  f ∈ H. Any twoof these functions  f1, f2must satisfy  d∞(f1, f2) ≥ 2−i since there must be some internal node (x, y1, y2) where f1(x) = y1 and f2(x) = y2. Therefore, there are 2d functions in H such that any two have  d∞distance bigger than 2−i. This implies that

image

Now by the definition of covering dimension, we conclude  d ≤ (i + 1) · Cdim(H). ■

Given a rooted binary tree T, we say a rooted binary tree  T′ is contained in T if all of the nodes of  T′ are nodes of T and the nodes of  T′ form a binary tree where each interior node has two children under the topology given by T.

Lemma 6.6. Consider a rooted binary tree T (where all interior nodes have exactly two children) and say its nodes are colored with colors 1, 2, . . . , c. We say the colored tree satisfies property (x1, . . . , xc)if for each  i ∈ [c], it does not contain a monochromatic complete binary tree of color i and depth  xi. If the coloring of T satisfies property (x1, . . . , xc), there exists a leaf such that on the path from the root to the leaf, there are at most  xinodes of color  i for all i ∈ [c].

Proof. We prove the lemma by induction on  x1 + · · · + xc. The base cases are obvious. Now say the root of T is colored with color i. Clearly we must have  xi >0. Then either the left or right subtree of the root must satisfy property (x1, . . . , xi − 1, . . . , xc). Using the inductive hypothesis, we get the desired. ■Proof of Theorem 6.4. Assume for the sake of contradiction that  τ(H) > 6 · Cdim(H). Consider a (X, Y)-tree T that is H-satisfiable and has cost larger than 6  · Cdim(H). Note we can assume that there are no nodes in  T where ℓ(y1, y2) = 0 since otherwise, we can delete that node and keep only its left subtree. Let c be an integer such that for all nodes (x, y1, y2), we have ℓ(y1, y2) > 2−c.

Now color the internal nodes of T with c colors {1, 2, . . . c} where a node (x, y1, y2) is color i

image

Note by Lemma 6.5, T does not contain any monochromatic, complete binary trees of color i with depth at least (i + 1) · Cdim(H). By Lemma 6.6, this implies the total cost of T is at most

image

which completes the proof. ■

However, we cannot hope for any sort of converse to Theorem 6.4 as evidenced by the following example. Let X = Y = [0, 1] and let  ℓ(y1, y2) = |y1 − y2|. Let H = {1x=c | c ∈ [0, 1]}be the set of all indicator functions of points in [0, 1]. H has infinite covering dimension but its tree dimension is just 1.

6.2 Regret Bounds from Tree Dimension

Theorem 6.7. In the full feedback model there exists an algorithm with regret  O(τ(H)). Furthermore, no algorithm can guarantee less than  τ(H)/2 regret.

First we prove that the algorithm below achieves the upper bound.

image

Proof. Assume for the sake of contradiction that this is false. Then we can construct a  St-satisfiable (X, Y, si)-tree with (xt, y1, y2) as its root node and cost bigger than  τ(St). ■

image

Proof. Since ǫis the smallest value such that  Aǫ,tis non-empty, then for every  y ∈ Aǫ,t we musthave  τ (St ∩ {f|f(xt) = y}) = τ (St) − ǫ. It follows that if  ℓ(yt, f0(xt)) ≤ ǫwe are done.

Consider now the case where  L := ℓ(yt, f0(xt)) > ǫ.For this case, we want to argue that f0(xt) /∈ AL′,t for any L′ < L, since after we get the feedback, we will update  St+1 = {f ∈St; f(xt) = f0(xt)}. Therefore  f0(xt) /∈ AL′,t for all L′ < Limplies that:  τ(St+1) ≤ τ(St) − L.

were the case that  f0(xt) ∈ AL′,t, we would have  L ≤ L′ by Lemma 6.8, contradicting the fact that

image

Proof of Theorem 6.7. The upper bound follows directly from the previous lemma. For the lower bound, consider an (X, Y)-tree with cost  τ(H) that is H-satisfiable. We can ensure that all leaves have the same depth d (by adding nodes of the form (x, y, y)). Now the adversary chooses a leaf uniformly at random. If the sequence of nodes from the root to the leaf are

image

then the adversary presents the inputs  x1, x2, . . . xdin that order to the learner. Since the loss function satisfies the triangle inequality, the expected loss of any learner is at least  τ(H)/2 so weare done. ■

Remark. Algorithm 7 assumes that the set  {ǫ; Aǫ,t ̸= ∅}has a minimum, which is always the case if the hypothesis class H is finite. For infinite H, this minimum might not exist. In such a case, choose  ǫ = inf{ǫ; Aǫ,t ̸= ∅}and choose  yt ∈ Aǫ+gt,t for gt = 1/2t. Theorem 6.7 can then be easily adapted to provide a bound of  τ(H) + �t gt ≤ τ(H) + 1.

6.3 Separating binary and full feedback

With binary feedback we can no longer obtain loss bounds that depend only on tree dimension. To see this, consider the following example:

Let H be the set of all functions  f : [n] → {0, 1/n, .....(n − 1)/n}. There are  nn such functions. Now for each, slightly perturb the outputs (i.e.  f(i) = j/n + ǫ) so that for every  f1, f2 and i,f1(i) ̸= f2(i). Let ℓ(y1, y2) = |y1 − y2|. The tree dimension of this class is O(1). However, clearly any algorithm must incur Ω(n) loss in expectation with binary feedback.

[1] Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.

[2] Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 622–630, 2014.

[3] Peter L Bartlett, Philip M Long, and Robert C Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434–452, 1996.

[4] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Working paper, Stanford University, 2016.

[5] Dimitris Bertsimas and Santosh Vempala. Solving convex programs by random walks. J. ACM, 51(4):540–556, 2004.

[6] S´ebastien Bubeck, R´emi Munos, Gilles Stoltz, and Csaba Szepesv´ari. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.

[7] Nicol`o Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and S´ebastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. arXiv preprint arXiv:1702.08211, 2017.

[8] Maxime C. Cohen, Ilan Lobel, and Renato Paes Leme. Feature-based dynamic pricing. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, Maastricht, The Netherlands, July 24-28, 2016, page 817, 2016.

[9] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.

[10] Pierre Gaillard and S´ebastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In Conference on Learning Theory, pages 764–796, 2015.

[11] Adel Javanmard and Hamid Nazerzadeh. Dynamic pricing in high-dimensions. Working paper, University of Southern California, 2016.

[12] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[13] Ravi Kannan, L´aszl´o Lov´asz, and Mikl´os Simonovits. Random walks and an o*(n5) volume algorithm for convex bodies. Random Struct. Algorithms, 11(1):1–50, 1997.

[14] Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on, pages 594–605. IEEE, 2003.

[15] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces. Journal of the ACM (JACM), 66(4):1–77, 2019.

[16] Akshay Krishnamurthy, Thodoris Lykouris, and Chara Podimata. Corrupted multidimensional binary search: Learning in the presence of irrational agents. arXiv preprint arXiv:2002.11650, 2020.

[17] Renato Paes Leme and Jon Schneider. Contextual search via intrinsic volumes. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pages 268–282, 2018.

[18] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2(4):285–318, 1988.

[19] Ilan Lobel, Renato Paes Leme, and Adrian Vladu. Multidimensional binary search for contextual decision-making. Operations Research, 2017.

[20] L´aszl´o Lov´asz. Hit-and-run mixes fast. Mathematical Programming, 86(3):443–461, 1999.

[21] Jieming Mao, Renato Leme, and Jon Schneider. Contextual pricing for lipschitz buyers. In Advances in Neural Information Processing Systems, pages 5643–5651, 2018.

[22] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2012.

[23] Sheng Qiang and Mohsen Bayati. Dynamic pricing with demand covariates. Available at SSRN 2765257, 2016.

[24] Aleksandrs Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568, 2014.

[25] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. Proceedings of Machine Learning Research, 75, 2018.


Designed for Accessibility and to further Open Science