In the contextual search problem a learner tries to learn a hidden linear function some unknown
. In every round, the learner is presented with an adversarially chosen vector
, and is asked to provide a guess
for the dot-product
, subsequently learning whether
and incurring a loss
). The goal of the learner is to minimize the total loss (the regret), which is given by
A special case of this problem is contextual pricing [2, 8, 11, 14, 16, 17, 19, 23]. In this setup, the vectors
are features representing differentiated products and the learner is a seller whose decision at round t is how to price item
. Given a price
a buyer with valuation
buys the product if
and doesn’t buy otherwise. The seller only observes the purchase or no-purchase decision. The loss in each round is the difference between the revenue made by the seller
and the revenue the seller could have made if v was known. Formally, the pricing loss is given by:
A second important case is called symmetric contextual search where the loss is the difference between the guess and the actual dot product
. This loss function arises in personalized medicine [4] where the learner chooses the dosage of a medicine and observes whether the patient was over-dosed or under-dosed. Another application is one-bit compressed sensing [9,22] where the learner only observes the sign of a measurement. In either case, we will consider the following loss:
Optimal Regret Bounds for Contextual Search In one dimension, both contextual pricing and symmetric contextual search reduce to non-contextual problems and are well understood. For the symmetric loss the optimal regret in the one-dimensional case is Θ(1) using binary search. For pricing, the optimal regret in the one-dimensional case is Θ(log log T) using the algorithm of Kleinberg and Leighton [14]. These results immediately imply Ω() lower bounds for the general contextual case (see Section 2.4 for details on optimality).
In this paper, we design polynomial-time algorithms with regret O(d log log T + d log d) for contextual pricing and O(d log d) for symmetric contextual search, matching these lower bounds up to a log d factor. This improves over the previously known bounds in [17], which are for pricing and
) for symmetric contextual search.
Steiner polynomial The main technique driving these results is a new potential function based on the Steiner polynomial. This is an object from integral geometry that is closely connected with the notion of intrinsic volumes which were used in [17] to derive the previously known bounds for this problem.
Given a convex set , Steiner showed that the volume of the Minkowski sum Vol(S + zB) is a polynomial of degree d in t (where B is the unit ball). The intrinsic volumes
correspond to the coefficients of this polynomial after normalization by volume
-dimensional ball:
Both our algorithm and the the algorithm in [17] keep track of the set of vectors consistent with observations seen so far. The intrinsic volumes approach keeps track of
) and shows that the loss incurred in round t is proportional to the decrease of one of the suitably normalized intrinsic volumes (i.e.
for some index
). In this paper, instead of keeping track of each coefficient individually, we control the value of the Steiner polynomial itself at different values of z. Specifically, we show that for some set of
it is possible to always choose an i (based on the current width of our set) so that
) decreases by a constant fraction. This leads to nearly optimal bounds on regret (via a much simpler proof than that in [17]).
Framework for learning with binary feedback While the Steiner polynomial technique largely resolves the classical problem of contextual search, there is a wide class of learning problems with binary feedback that either do not fit within the framework of learning a linear function, or which impose additional constraints on the linear function that one would hope to leverage.
One example is sparse contextual search, where the hidden vector is guaranteed to be s-sparse, i.e.,
. This captures settings where we expect few features to matter to the buyer. Another interesting problem is when we are asked to guess max
instead of
corresponds to learning the valuation of an unit-demand buyer. This problem is challenging since the set of vectors
consistent with observations seen so far is not necessarily convex.
Both of these examples are special cases of a general framework for online learning problems under binary feedback. In our general setup, the learner is trying to learn a function f in a hypothesis class H containing functions mapping from a context space X to an outcome space Y. In each step a context is chosen adversarially and the learner is asked to submit a guess
for the value of
) and incurs a loss
)). The goal of the learner is to minimize the total loss
)). The original setup corresponds to the case where X is some subset of
the unit ball
is the class of all linear functions
Our main result in this space is an algorithm with regret is the covering dimension of the hypothesis class H (see Definition 3.3). This result immediately improves the regret of symmetric contextual search from ˜
. Similarly, this result immediately implies an
) regret algorithm for the unit-demand buyer problem. We accomplish this by generalizing the Steiner polynomial idea for linear contextual search to a general “Steiner potential” defined for any hypothesis class (see the Techniques subsection below).
We contrast these results in Section 6 with the full feedback case in which the algorithm learns ) each round. In this full-feedback setting, we give matching upper and lower bounds (up to constant factors) on the achievable regret. Our results here are based on a notion we introduce of tree-dimension of a hypothesis class, which is a continuous analogue of Littlestone dimension.
Techniques in the general case The Steiner polynomial is defined for convex sets living in the Euclidean space. Intriguingly, it is possible to generalize (in some sense) this geometric technique to arbitrary classes H of hypotheses. Instead of keeping track of all functions in the hypothesis class that are consistent with the feedback so far, we keep track of an expanded set of functions that don’t violate the feedback up to a certain margin (in the linear case, this is exactly has the effect of regularizing the set of consistent hypotheses and allows for faster progress. Instead of volume, in the general case we control the size of an
-net of the set of these approximately valid hypotheses.
A second technique we use is adaptive scaling, which involves keeping track of multiple levels of discretization. For the linear case, this boils down to controlling the value of the Steiner polynomial at different values of . More generally, at each step, we can estimate the maximum possible loss achievable in this round given the previous feedback. Based on this value, we will choose a scale, which will dictate the granularity of the
-net and the margin with which we prune inconsistent hypotheses. After picking the scale we show that it is possible to pick a (random) cut that will either: (i) reduce the number of valid hypotheses in the chosen granularity by half; or (ii) eliminate one valid hypothesis at a much coarser granularity. This will require a careful coupling between the discretizations at two different levels. This coupling between two levels is what allows us to overcome the fact that in the general case we can’t rely on techniques from convex geometry. See Section 4.2 for details.
One important feature of all our algorithms (not shared by previous algorithms) will be our use of randomness, in particular perturbed guesses. Every round, we compute the median the set
ranges over the set of approximately valid hypotheses. However, instead of guessing the median
directly we guess one of the two values (chosen uniformly at random) in
, where the size of perturbation
depends on our current scale
. Our guarantee is that the potential function will decrease significantly for one of the two choices (and thus in expectation).
Noisy Contextual Search The final direction in which we extend the original contextual search problem is by considering noisy binary feedback, i.e., the feedback of the algorithm is flipped with probability p < 1/2. In this setting we move from keeping track of a set of approximately valid hypotheses to a pseudo-Bayesian approach, where we maintain a distribution w over approximately valid hypotheses and update it as we receive feedback. By carefully bounding the weight of hypotheses within a ball of radius 1/T, this results in an algorithm with regret O(d log T).
Ideally, it would be possible to combine this algorithm with the adaptive scaling technique of the noiseless setting, resulting in an O(poly(d)) regret algorithm for general hypothesis classes. One such approach is to replace the notion of width with a fuzzier notion, based on how tightly concentrated the distribution is along the current context vector (e.g. the width of the smallest strip in this direction which contains 1 of the mass of the distribution). We can then choose a scale based on this distributional width, and choose the size of the perturbation based on this scale (as in the deterministic case).
This type of approach works, conditional on being able to show that when the distribution concentrates along a thin strip, the true hypothesis is close to this thin strip with high probability. Unfortunately, doing this for general hypothesis classes seems hard – fortunately, it is possible to do this for the specific case of symmetric contextual search by leveraging the Euclidean geometry of the ambient space (see Section 5.2 for more details). This leads to an algorithm for noisy linear contextual search which gets O(poly(d)) regret, and is the first algorithm we are aware of for contextual search in the noisy setting which gets any regret independent of T for d > 1.
Summary of main results To summarize, our results include:
• Algorithms with regret O(d log d) for symmetric contextual search (Section 2.1) and O(d log log T+ d log d) for contextual pricing (Section 2.2). Both algorithms are optimal (up to log d) and have only O(d) overhead with respect to the non-contextual case. Both algorithms can be implemented efficiently in poly(d, T) time.
• General algorithm for learning a function from a hypothesis class H under binary feedback with regret is the covering dimension of the hypothesis class (Section 4.3).
• An algorithm for symmetric contextual search with noisy binary feedback with O(poly(d)) regret (Section 5.2).
Related work Core to our results is the idea of coupling together potentials at many different scales. Similar ideas of “adaptive discretization”, “zooming”, and “chaining” exist throughout the online learning literature [6,15,24] and statistical learning theory literature [7,10]. Algorithms in these works also often construct several layers of discretizations and have learning rates parameterized by the covering dimension of the ambient space. However, these algorithms are usually designed for settings where (1) one cannot hope for better than ) regret (let alone regret independent of T), (2) feedback is not binary but rather zeroth-order ( [21] study a pricing setting where feedback is binary, but where the hypothesis class is large enough that one must incur O(poly(T)) regret). In particular, we believe our technique of coupling together the potentials for different scales in the analysis of Theorem 4.3 is novel.
Our results in the full feedback case (Section 6) – parameterizing the optimal regret in terms of the tree dimension – can be seen as a generalization of similar results for Littlestone dimension [18] (indeed, in the case where Y = {0, 1}, our notion of tree dimension reduces to Littlestone dimension). While there do exist measures which capture the learnability of functions taking values over a metric space (for example, the fat-shattering dimension [3] for real-valued functions), as far as we are aware the notion of tree dimension we introduce does not currently exist in the literature. It is an interesting open direction to connect the notion of tree dimension we present with previously studied measures.
We start by describing the contextual search setup and establishing some useful notation. The hidden object is a vector belonging to the unit ball
. In each round
the learner is provided an (adversarially chosen) vector
and asked to provide a guess
. Upon guessing, the learner incurs loss
) and receives feedback
corresponding to whether
, then the feedback is arbitrary. In other words:
This allows the learner to keep track of the set of vectors consistent with observations seen so far:
It is clear from the above setup that for both pricing and symmetric loss, it suffices to consider when = 1 for all rounds.
Throughout the execution of the algorithm we will keep track of the Steiner potential where
) is the standard volume in
and the sum is the Minkowski sum:
We will evaluate the potential at different points depending on the width of in the direction
. We define the width as:
2.1 Symmetric loss with O(d log d) regret
We start with the symmetric loss function , where we will show it is possible to obtain O(d log d) regret. The main idea of this algorithm is to choose a value
based on the width of
in the direction of the current context and then choose a guess
that splits the set
in two parts of equal volume. By doing this, we will show that
) (the “Steiner potential”) decreases by a constant multiplicative fraction. Since
) is bounded below by
), we can only do this some number of times (roughly
) times), from which our regret bound will follows.
We describe the algorithm below (ignore for now issues of computational efficiency; we will address these in Section 2.3):
Proof. Assume the feedback is = +1 (the other case is analogous). Then
. In Figure 1 we depict the set
. The part of
is exactly
which has volume
To bound the remaining part, let C be the largest volume of a section of in the direction
(see the right part of Figure 1). The total volume of
can be bounded below by comparing it to the two cones formed by taking the convex hull of the section of largest volume inside the band and the extreme points
, which are at least 2
apart in the
direction. The volume of the two cones is at least:
Finally, note that the region of has cross-section with volume at most
direction so its volume is at most
), thus completing the proof.
Theorem 2.2. The regret of the Multiscale Steiner Potential algorithm is at most O(d log d).
Proof. Every time we choose index i, the loss at is most 2and the volume of
) decreases by a constant factor. The set
is never empty since
, therefore
). For this reason we can’t pick index i by more than
)) times, so the total regret is at most:
Figure 1: Illustration of the proof of Lemma 2.1
2.1.1 Comparison with other approaches
Is the Steiner potential necessary? One natural algorithm for this problem is to query ) (i.e. guess the median without inflating the set). The best upper bound from [17] shows only that this has regret at most 2
and a lower bound given in the example in Section 8 of [19] shows that this algorithm has regret at least Ω(
). Inflating the set by taking the Minkowski sum with a ball seems to be the appropriate regularization that allows us to overcome the
lower bound.
Another natural algorithm is to guess ). This algorithm was shown to have regret at least 2
2.2 Pricing loss with O(d log log T + d log d) regret
We now study the pricing loss . Unlike the previous case the loss function is discontinuous. While the loss when under-estimating
is small, the loss when over-estimating is very large. In the one-dimensional setting, [14] obtains a O(log log T)-regret algorithm by a conservative variant of binary search that avoids over-estimating the actual value as much as possible.
As before, our algorithm will keep track of ) for different values
. This time, however, we will guess more conservatively so that in the case of a no-purchase event, the potential will decrease by a large amount. We will do this in a way that each
can lead to a no-purchase event approximately O(d) times.
Proof. Note that in this case, the set is disjoint from
definition of
immediately gives the desired result.
Proof. First we upper bound the volume V of the strip Let C be the largest volume of a section of
in the direction
(see right part of Figure 1). Then
. On the other hand since
is a convex set and has width at least 2
Thus
Since the entire set is disjoint from
, this completes the proof of the lemma.
Theorem 2.5. The regret of the Multiscale Steiner Potential for Pricing algorithm is at most O(d log log T + d log d).
Proof. First, note that the regret contributed from all rounds where Next, we consider rounds with
. Note that for all of these rounds
If we choose index
(leading to a no-purchase), the volume of
a factor of 2
) always, this can happen at most
times. The loss from each such query is at most 1. If we choose index volume of
is cut by a factor of
. This can happen at most
times. The loss from each such query is at most 2. Thus, our regret is at most
2.3 Polynomial time implementation
We note that can be approximated using binary search as long as we can compute the volume of (
for any half-space H. It is enough to notice that we only need a constant approximation of the volume in the previous proof and in order to approximate the volume we only need access to a separation oracle [5,13,20]. Since
is a ball intersected with at most t halfspaces, it is trivial to obtain a separation oracle for it.
To obtain a separation oracle for it is enough to solve the problem of computing the distance from a query point to the convex set
which is itself a convex problem. Technically, this requires
to not be too small since the guarantee of cutting plane methods (like the ellipsoid algorithm) tells us that (given an initial ellipsoid
), it is possible to compute an
solution in time
We can take E to be the unit ball; then this is algorithm is efficient as long as never too small (anything at least exp(
)) is fine). Here we present a simple modification of Algorithms 1 and 2 that makes sure that the volume of
stays large enough throughout by preserving a small ball around
Initialize to ensure that
) denotes the ball with center c and radius r). Now change the guess
is sampled from the uniform distribution over [0
The total additional loss from this perturbation is O(1). Since the perturbation is smaller than
we can use the same argument in Lemmas 2.1 and 2.4 with 2
instead of
to bound the volume of the band making sure there is constant progress in the Steiner potential.
The advantage of this perturbation is that the probability that the cut passes through the ball of radius 1is at most 1
per period. So with probability 1
all periods t. It follows that
), and with this, the convex minimization problem can be solved in poly(d, T) time.
2.4 Optimality
Here we briefly discuss the optimality of our results, namely Theorem 2.2 and Theorem 2.5. Note that we set up the problem by assuming that the hidden vector and all of the adversarially chosen contexts
are drawn from the
unit ball. We may alternatively set up the problem by assuming that the hidden vector v is drawn from the cube [
and the contexts
are chosen from the
. With this setup, it follows from a direct reduction to the one-dimensional case that we cannot do better than O(d) for symmetric loss and O(d log log T) for pricing loss.
Now we show how our proofs can be modified to achieve the same results, O(d log d) for symmetric loss and O(d log log T +d log d) for pricing loss, which are optimal in this modified setting up to logarithmic factors. We will run exactly the same algorithms. To see that the analysis remains the same, it suffices to note that our maximum loss in any round is still bounded by O(1) and the volumes are scaled by a factor of poly(which becomes an additive d log d factor after taking the logarithm.
We will now consider a general model of learning with binary feedback that has contextual search as a special case. We will derive regret bounds based on the covering dimension of the hypothesis class. The driving technique will still be the Steiner potential together with two new ideas: (i) a new adaptive scaling that will either make a lot of progress in a finer granularity or slower progress in coarser granularity; (ii) randomized cuts that will reduce the potential with constant probability.
Consider a hypothesis class H consisting of functions mapping X to Y. We refer to X as the context space and Y as the output space. We assume that the output space Y is a totally ordered set, i.e, for each we have either
The learning protocol is as follows: an adversary chooses some and in each round they choose some
. The learner makes a prediction
and incurs loss
)) for some loss function
1]. Upon making a prediction, the learner receives feedback on whether
) (the feedback is arbitrary in case of equality). It will be convenient to represent the feedback as a variable
otherwise.
We make the following assumptions about the loss function throughout the paper:
• ) = 0 for all
•
•
•
• ) then there are
If Y = R, then for any continuous increasing function 1] and parameter
loss
satisfies the desired properties. Note that while the symmetric loss function is easily cast in this framework the pricing loss is not captured. In Section 4.4 we give an impossibility result showing it is impossible to obtain covering-dimension bounds for the pricing loss.
3.1 Covering Dimension
Our loss bounds will be in terms of the covering dimension of the hypothesis class H. We start by defining a metric induced by the loss function:
Definition 3.1. For two hypotheses
We can now introduce the notions of -net and covering dimension.
, we say that a set
if for every
of minimum cardinality.
Definition 3.3 (Covering Dimension). Define the covering dimension H as
Note that this definition of Cdim differs from Hausdorff dimension in that we care not just about the limit 0, but the largest value for any
2]); importantly, this guarantees us that, for any
Note that we specify to avoid issues when
is close to 1 - any other fixed constant upper bound p only changes this dimension by a constant factor of at most 1/ log(1/p).
We give a few quick examples to give intuition about covering dimension.
Example 3.4. The space of functions with loss function
covering dimension d.
Example 3.5 (Contextual Search). Let B be the unit-ball in each
be defined by the dot product
. The linear contextual search problem is defined as the learning problem for class
with loss function
. This class has covering dimension O(d).
show that for any 0 , there is an
-net of the sphere of size (1
. To do this, we can greedily place points in the unit ball such that no two are within
of each other. If we draw an
-radius ball around each point, these balls must be disjoint and contained in a ball centered at the origin of radius 1 +
. Thus, the maximum number of points we will place is
giving us an -net of the same size.
Example 3.6 (Sparse Contextual Search). The sparse version of the contextual search problem is given by class . The loss function is still
. The covering dimension of this class is O(s log d) (where we treat s as a constant and d as tending to
). To see that the covering dimension is O(s log d), note that for any
we can use the result in the previous example to obtain an
-net of size
Example 3.7 (Unit demand). In the unit demand version of contextual search the set and the hypothesis class consists of functions
. The covering dimension of this class is O(d). This example corresponds to the puzzle in the introduction and corresponds to the economic situation where a seller wants to price a bundle of goods (represented by the context) but the buyer has an unit-demand valuation, i.e., only cares about the highestvalued item in the bundle.
To see that the covering dimension is O(d), note that the set forms an
For simplicity, in the following theorems we will assume our hypotheses map and that our loss is given by
. We will then remark on how to generalize our proof to other loss functions.
4.1 O(d log(T)) regret via Single-scale Steiner Potential
Bounds that depend on log(T) are typically easy for learning with binary feedback and can be obtained using different algorithms. The interesting question in this setting is how to obtain bounds that are constant in T. Nevertheless, it is instructive to start with a simpler algorithm with regret O(d log T) for d = Cdim(H). It will illustrate how the Steiner potential can be generalized to an abstract setting. Instead of keeping track of the hypotheses that are consistent with the feedback so far, we will keep an inflated version of that set.
This algorithm starts with a -net of the hypothesis class and keeps a set of candidate hypotheses that are approximately consistent with observations seen so far. For each
, it queries a random point around the median, halving the set of hypotheses with at least half probability.
The analysis is based on the following lemma:
(the other case is analogous), then with probability half, the algorithm guesses
and gets the feedback that
, which eliminates all the hypotheses
. These hypotheses constitute at least half of
Corollary 4.2. The Single-scale Steiner Potential algorithm obtains regret O(d log T) in expectation.
Proof. The regret from periods where is at most O(1). For the remaining periods, the size of
is halved with at least
probability. Note that for such a t,
Next, since there is some element in
which is never eliminated. However,
. Thus for any integer c, the probability that there are c periods with loss greater than
. Thus, the expected number of periods
with loss larger than 2/T is at most
4.2 Strategy for achieving constant (in T) regret
In this subsection we provide some intuition on how to improve the regret of our algorithm from a loss larger than 1
half in size, but whenever it halves in size the only bound we can get for the loss is 1. To improve this bound, we need to guarantee that a loss of 1 can’t occur very often. Our strategy for doing that involves keeping multiple levels of discretization. Given
and the feedback
will keep for each z > 0:
In other words, we keep an z-discretization of hypotheses along with all the hypotheses that are consistent with the feedback so far with an z-margin. The z-margin is important to guarantee that any hypothesis that is will never be eliminated. We will also refer to
set of hypotheses consistent with observations so far without any discretization or margin.
Our strategy will be to choose in each round some discretization level z based on the maximum possible loss achievable in this round. We will divide the space of losses in exponentially sized buckets and define i to be the index of the bucket where the maximum loss falls:
Now we choose , compute the median
and guess either
with half probability each. Now one of two things can happen:
• If the loss is larger than 2will decrease by a factor of 2 with half probability. This should happen at most log
times in expectation, generating loss 10
• If the loss is smaller than 2we will show that the set
will decrease by at least 1 element in expectation (Lemma 4.5), so we get loss
This leads to a regret of:
4.3 Analysis of the O(d2) algorithm
Theorem 4.3. Let d = Cdim(H). The Multi-scale Steiner Potential algorithm incurs expected regret ) in the binary feedback model.
Note if there does not exist an index then we must actually have max
). In this case we know the value of
) for sure so we simply query this value and incur 0 loss.
Lemma 4.4. If i is the index chosen in the t-th step and with probability at least 1/2.
Proof. The same as the proof of Lemma 4.1 replacing 1
The new ingredient is a “potential” argument when the loss is small:
Lemma 4.5. If i is the index chosen in the t-th step and , then with probability at least 1
Proof. By the choice of the index , so there must exist
such that
. Let’s assume that
(the other case is analogous). The algorithm will query
with half probability and learn that
assumption that
Such a query must eliminate some hypothesis since there must be some
, so this hypothesis must satisfy
and hence will be ruled out by the information from querying
We can now proceed to prove Theorem 4.3.
be the number of times that index i is chosen by the algorithm and
be the number of times that index i is chosen and
Combining the previous two claims (Lemmas 4.4 and 4.5), we have that
where For each query with index i, the loss is at most 10
Also for queries with
, the loss is at most 2
. Thus the total loss is at most
remains to note that
Corollary 4.6. In the Contextual Search with Symmetric Loss, if the hidden vector guaranteed to be s-sparse then there is an algorithm with total regret
Proof. By combining Theorem 4.3 and Example 3.6.
Remark. To adjust our proof to deal with any loss functions satisfying the assumptions outlined in Section 3 we make the following adjustments. We replace max)). Also, we replace
and similar for
(note this y exists by the continuity of the loss function).
4.4 Impossibility Results for Pricing Loss
The results from the previous section apply for loss functions that are somewhat well-behaved i.e. satisfying the conditions outlined at the beginning of Section 3. Clearly, the pricing loss function does not satisfy these assumptions. While one may hope to guarantee poly(d) log log T total loss (where d is the covering dimension of the hypothesis class), here we show that for the pricing loss function, it is actually impossible to guarantee regret in this case that is polynomial in the covering dimension.
Claim 4.7. Let B be the unit ball in . Consider the domain X = B and the hypothesis class
(note this is the same as the hypothesis class in Example 3.6 with s = 1). Any learner must incur at least Ω
regret over
with the pricing loss function.
Remark. Note that the covering dimension of H is O(log d) so the above claim implies that there is an exponential separation between the regret (in the pricing loss setting) and covering dimension.
Proof. Choose v uniformly at random from the d points (1, 0, . . . , 0), (0, 1, . . . , 0), . . . (0, . . . , 0, 1) and let the true function be
Let be obtained by cyclically permuting the coordinates of
. Now the adversary randomly permutes
to obtain a sequence
and presents the points in that order to the learner. Note the learner gains no information when it guesses a value
. If all of the
learner’s guesses through the first d rounds are at most then the learner incurs loss at least
over the first d rounds and has gained no information. The adversary can then repeat this process.
Now it remains to consider when the learner guesses a value above at some point within the
first d rounds. Say the first time this occurs is at round probability, the learner incurs
loss this round and is able to eliminate one of the d possible hypotheses. The problem
then effectively reduces to 1 dimensions. Repeating this argument inductively, we see that the adversary can guarantee regret
We now consider the binary feedback model with noise, where each round the feedback is (independently) flipped with probability p < 1/2. We will no longer be able to eliminate a hypothesis based on the feedback since it is always possible that the feedback was flipped, instead we will keep a weight function expressing the likelihood of each hypothesis given observations.
We start by giving a O(d log T) algorithm for general hypothesis classes. The algorithmic techniques will be standard and inspired by both Bayesian Inference and by algorithms in the Multiplicative Weight Updates family (such a Hedge or Weighted Majority). The analysis, however, will deviate from the usual analysis of multiplicative weights given the type of feedback. We don’t have access to the actual loss of the arm we pulled nor an unbiased estimator thereof. This will require both a new potential function as well as a modification of the multiplicative weights framework: instead of sampling from the distribution induced by the weights, we will get the (weighted) median advice on what is the right guess for this context.
Our main innovation is in Section 5.2 where we obtain an algorithm with O(poly(d)) regret (independent of T) for the linear contextual search case. Instead of one weight function, we keep a family of weight functions. Each weight function will correspond to different levels of uncertainty about the inner product of the hidden point with the current context. By using geometric techniques to analyze the stochastic evolution of the weights, we show that they must concentrate near the true hypothesis.
5.1 O(d log T) algorithm for a general hypothesis class
The usual approach in Bayesian inference is to start with a uniform prior over the set of hypotheses and given each observation, compute the posterior. It is important to emphasize that the true hypothesis in our model is still chosen adversarially. The Bayesian inference only serves to provide the intuition.
The algorithm will be as follows: as before we will start with a discretized version of the hypothesis class ) which we will call N for short in this section. We will keep a weight function
which roughly expresses the likelihood that a hypothesis is close to the true hypothesis. Our guess will be a perturbed version of the weighted median
of the set
. Formally, the weighted median
is a number that satisfies:
After receiving the feedback, we will update the weights in the following way (we will choose random such that
) occurs with zero probability):
for some parameter and re-normalizing afterwards:
In standard Bayesian inference, we would normally use . For this algorithm, we will choose any parameter
2. The actual choice of parameter will only affect the constants. Also note that unlike Bayesian inference we don’t choose the guess with largest likelihood but a perturbed version of the median.
Theorem 5.1. In the noisy feedback model, the above algorithm incurs expected regret O(d log T).
We will denote the true hypothesis by as usual. Since
might not belong to the discretized set N, we will control the weight that is placed on the closest hypothesis. Let
be a hypothesis in
are on the same side of
is the total weight mass on the other side of
, then we have the following equality:
Here the expectation is taken over the randomness in the feedback, and c is a constant satisfying 0 < c < 1 given by
Proof. Assume wlog that ) be the weight on hypotheses in the opposite direction and
the remaining weight. Since
from a continuous distribution, the event that some
occurs with zero probability. With probability 1
with the remaining probability p we have:
Averaging them we have:
Lemma 5.3. In expectation over both the randomness in the choice of and the randomness in the feedback, we have:
and the magnitude of the perturbation is 1/T. In this case, by the previous lemma, we have: ). With the remaining probability
) are on opposite sides of
and we can use the trivial bound in the equation below.
Combining, we get the desired inequality.
then for the constant c in Lemma 5.2, then in expectation over both the randomness in the choice of
and the randomness in the feedback, we have:
Proof. Assume without loss of generality that Since the magnitude of the perturbation is 1
) will be on the same side of our guess as
) and hence we can apply Lemma 5.2. With probability at least 1
remaining probability we use the trivial bound
0. Combining those we get the bound in the statement.
The previous lemmas imply that ) grows by a constant factor (in expectation) whenever the median is far from the true point. We conclude the proof by showing that this can’t happen too often since weights are bounded.
Proof of Theorem 5.1. The regret bound follows directly from the fact that the probability of having for more than Ω(d log T) periods is at most O(1/T). Our strategy for proving this is to define a random process
that is a super-martingale, i.e.
and argue that if
happens too often, then
will be much larger than
. This happens with small probability by Markov’s inequality.
Now define the following stochastic process:
It is simple to see that = 1. The previous lemmas imply that
is a super-martingale, i.e.
Now, in the case that
for more than Ω(d log T) periods, we have
but since 1, this can happen with at most O(1/T) probability by Markov’s inequality.
Remark. We’ve assumed here that p is a constant bounded away from 1/2. How does the regret of this algorithm depend on p as p approaches 1, then we can set
2. This leads to
) – adapting the proof of Theorem 5.1 then shows we can have at most
inaccurate rounds, for a total of at most
Comparison with other approaches It is worth comparing our algorithm with other learning techniques in the multiplicative weights family. The ‘experts’ in our problem form a continuous set with a linear structure, which resembles the settings of Kalai and Vempala [12] and Abernathy et al [1]. In their setting, however, the optimal achievable regret is ) while in our case we achieve O(log T). Another feature of our model is the stochasticity of the losses. With stochastic losses, Wei and Luo [25] recently showed that multiplicative weight update algorithms achieve O(log T) regret when the learning rate is tuned properly, but their guarantees depend on the inverse of the gap between the two best arms. An important difference, however, is that in our setting we don’t have access to the loss. We only learn whether our guess was too large or too small, which doesn’t allow us to apply any of those algorithms.
5.2 O(poly(d)) algorithm for Noisy Contextual Search
In this section we study contextual search in the noisy feedback model. We show that here we can achieve total loss O(poly(d)) independent of T by exploiting the geometry of the Euclidean space. Throughout this section we will use to denote the true point, i.e.
Our approach (Algorithm 6) builds off the Bayesian inference approach in the previous section (Algorithm 5) by combining it with the multi-scale discretization ideas in Section 4.3. At a high (and slightly inaccurate) level, Algorithm 6 works as follows. Throughout the algorithm, we maintain a distribution w over the unit ball B(0, 1) where w(q) represents the likelihood that q is our true point . Each round t, we are provided a direction
by the adversary. We begin by measuring the “width” of our distribution in the direction
– i.e., the length of the smallest interval in this direction which contains almost all of the mass of our distribution w. Then (similarly as in Algorithm 5), we will guess a perturbed version of the median of w in the direction
where the size of the perturbation depends on the width. Finally, we multiplicatively update the distribution w, penalizing points on the wrong side of our guess (again, similarly as in Algorithm 5).
In the actual algorithm, we maintain a separate distribution for each possible scale
the width (in particular, we are in scale
if almost all of the mass of
is concentrated in a small strip in direction
). This aids analysis in letting us guarantee we operate in each scale for at most a bounded number of rounds, which lets us bound the total loss of this algorithm.
Theorem 5.5. Algorithm 6 incurs O(poly(d)) expected total loss for the problem of noisy linear contextual search.
The proof of Theorem 5.5 is structured roughly as follows. Let be the (expected) loss sustained at scale i. We wish to show that
). To bound
, we’ll start by roughly following the analysis in Theorem 5.1. Specifically, we’ll look at the total weight (according to
of a tiny ball surrounding the true point
. Let the weight of this ball at time
. We’ll again show that 1
when suitably scaled is a super-martingale: it decreases in expectation by a large amount whenever our guess is far from accurate and cannot increase very much in expectation even if our guess is close to accurate. Since 1
cannot decrease below 1, this lets us upper bound the number of rounds where we are far from accurate.
Now, even when we are far from accurate, we know that since we are in scale i, almost all of the mass of is concentrated on some thin strip in direction
. If the true point
is located in or near this strip, this lets us bound the loss each round when we are far from accurate (since the median will lie in this strip). So it suffices to show that if a weight function
concentrates on some thin strip, then with high probability, the true point
lies close to this strip.
To prove this, we again look at the weight of a small ball ) with radius
(see left side of Figure 2). If we know that the weight on some strip is at least some threshold
1, we know that the ball and strip intersect, and therefore
is at most distance
away from this strip. It thus suffices to show that
with high probability.
Now, we will choose large enough so that this inequality is satisfied at time t = 0. We therefore only need to show that this is still true with high probability for all times t. Intuitively, this should be true – the amount of weight on a ball centered at the true point
should only increase as time goes on and we get more feedback (the feedback is noisy, so we might occasionally decrease the weight of this ball, but overall the increases should drown out the decreases). Proving this formally, however, is technically challenging and where we need to use the Euclidean geometry specific to linear contextual search.
To show this, we use the following lemma. Choose two points on a line through
so that
lies between
. We claim that with high probability (for all times
) for some constant
. To show this, observe that there is no half space which contains both
. This means that the only way
) can increase relative to
is if a guess separates
and if the feedback on this guess is noisy (right side of Figure 2). This occurs with probability p < 1/2 and is unlikelier than the alternative (which increases the weight of
relative to
). We can thus bound the ratio of
) from below with high probability over all rounds.
If we could union bound over all points in we would be done (this inequality allows us to relate the weight of all the points outside
to the weight of points inside
). Unfortunately there are infinitely many points inside
so we cannot apply a naive union bound. Luckily, we can show that nearby points are very likely to have similar weights: the only way the relative weight of two nearby points
changes is if we guess a hyperplane separating
– and since we add a perturbation to our guess every round, we can bound the probability of this happening. This allows us to repeat the previous geometric argument with
-nets instead of single points, which completes the proof.
Notation. Below, we will use to denote the hidden point. We let B(0, 1) denote the unit ball and in general B(q, r) to denote the ball of radius r centered at q. We will use
to denote the
weight function . For a set
1), we use the notation
Let . Define the set
-net consisting of all points in the unit ball whose coordinates are integer multiples of
. Note that
be the ball of radius
centered at
. For simplicity, throughout this proof we will assume that the feedback noise is fixed at p = 1/3 (it is straightforward to adapt this proof for any other p < 1/2; doing so only affects the constant factor of the loss bound).
As in the analysis of Algorithm 5, we begin by understanding how the reciprocal of our weight function 1) evolves over time. This will allow us to construct various helpful super-martingales (for example, allowing us to bound the number of rounds we spend in each scale).
The following claim relates how 1) changes when
are on the same side of the hyperplane
(i.e. is more likely to be consistent with feedback).
Claim 5.6. Consider a round t. Say the adversary picks direction and the algorithm queries ˆy. Let
be a point such that
are on the same side of the hyperplane
Proof. Recall that points that violate feedback have their weight multiplied by (1 ) (and then the distribution is renormalized). With probability 1
3 (when the feedback is not flipped), we thus have that
Likewise, with probability p = 1/3 (when the feedback is flipped), we have that
Taking expectations over the feedback, we therefore have that
Points very close to are likely to be on the same side of the hyperplane as
, allowing us to apply Claim 5.6.
. Then, in expectation both over the randomness in the feedback and the algorithm,
, and since ˆy is chosen by adding a uniform
random variable to y, the probability that
are on opposite sides of the plane
is at most
. Combining this with Claim 5.6 gives us the desired result.
We now use Claims 5.7 and 5.6 to understand how 1) changes over time (generalizing from single points to small balls). This first claim bounds the decrease in 1
our guess is close to accurate.
. Then, in expectation both over the randomness in feedback and in our algorithm,
Proof. Recall that y is the median of in direction
as computed by our algorithm. Now, define the two quantities
These quantities represent the mass of above and below the strip of width 2
the median. Note that by the maximality of
2; if not, then there exists a strip of width 2
containing at least 1
. Without loss of generality, assume
Now, recall that ˆy is chosen uniformly in the interval []. We will divide the expectation in the theorem statement into three cases, based on where ˆy lies.
This case occurs with probability . Note that since
, in this case we also have that ˆ
. Therefore in this case we know that the ball Γ
lies entirely to the left of the hyperplane
. By applying Claim 5.6, we know that, conditioned on being in this case,
This case covers the ˆy where the hyperplane intersects the ball Γ
case, we pessimistically bound the change in weight via
• Case 3: remainder of the interval [
Since case 1 and case 2 together cover at least 1/4 of the interval, this case occurs with probability at most 3/4. In this case the ball Γdoes not intersect the hyperplane (since all such ˆy are covered by case 2). We can therefore apply (the weaker variant of) Claim 5.6 to show that, conditioned on being in this case,
Combining these three cases, we have that
When our guess is far from accurate, we can instead (more strongly) bound the decrease in 1
. Then, in expectation both over the randomness in feedback and in our algorithm,
Proof. We essentially repeat the logic from the proof of Claim 5.8, with the change that we can more strongly lower bound . Without loss of generality, assume that
Note that since y is the weighted median of in the direction
Now, we again have three cases. To begin, with probability lies in the interval [
, the hyperplane
does not intersect Γ
therefore we can apply Claim 5.6 to show that
Likewise, the probability that the hyperplane intersects Γ
is at most
case we can pessimistically bound the decrease in weight as in the proof of Claim 5.8), and with the remaining 3/4 probability the hyperplane
does not intersect Γ
, and we can apply the weaker variant of Claim 5.6. Combining these observations, we get that
In this step we will bound the total number of rounds in each scale i. Specifically, our algorithm ensures that we move onto the next scale once either the weight concentrates on a strip or once grows large enough - we will show with high probability that this is always due to the weight concentrating on a small strip.
Let be the number of rounds
(i.e. the number of rounds where we are “accurate”). Let
be the number of rounds
(i.e. the number of rounds where we are “inaccurate”). Note that
Also, recall that our algorithm ensures that
) + 1 for all rounds t.
We first show that with high probability, will be no larger than O(poly(d)i).
Claim 5.10. For any constant c > 0, with probability at least 1 we have throughout all rounds that
Proof. We will construct a sequence is a super-martingale. Consider the sequence
defined as follows.
•
• If
• If
• If
(Here we have used the fact that 1)) must contain a ball of radius
is a non-negative super-martingale, by Doob’s martingale inequality it holds that for any constant
However note that if there exists a round where ), then for t sufficiently large
(Here in the last inequality we have used the fact that 2all t, this implies that
, which immediately implies the desired claim.
Recall that in our algorithm, we check the following two conditions for determining the scale i we use for the current query:
• There exists
•
We now show that with high probability, only the first condition is ever relevant.
Proof. Again, we will construct a sequence is a super-martingale. Consider the sequence
defined as follows.
•
• If
• If
• If
Consider the ratio . Similarly as in the proof of Claim 5.10,
1. Note that Claim 5.7 and Claim 5.8 imply that
so is a super-martingale. Now assume that
. By the constraints of our algorithm, we are guaranteed that
Also by Claim 5.10, with probability at least 1 over all rounds t. This implies that eventually
Thus, for sufficiently large t, we have that
However note is a supermartingale. Also,
1 for all rounds t. Thus, by Doob’s martingale inequality, the probability we ever have
(the first term is from the probability that at some point
We now aim to show that with high probability, if the weight function is concentrated on a thin strip, this strip must be close to the true point
(this is necessary to bound the total regret we incur each round in scale i). To do this, we will argue that we can “round” points to the
net
without significantly affecting their weight. We will then rely on the geometric observation mentioned earlier: that for points
are nearly collinear, we can relate the weights
). We begin by relating the weights of collinear points.
Claim 5.12. Fix an index are collinear in that order, then with probability at least 1
we have that for all rounds t
Proof. Consider a time step 1. We say a point is on the “good” side of the hyperplane
if it is on the same side as
. Otherwise we say the point is on the “bad” side. Note for
satisfying the conditions of the claim, one of the following statements must be true:
• are on the same side of the hyperplane
• is on the good side of the hyperplane and
is on the bad side of the hyperplane.
We will now consider the quantity . Note that in Case 1, then
relative weights remain unchanged if both
are on the same side of the hyperplane). In Case 2,
By Doob’s martingale inequality, the probability that this ever happens is at most as desired.
We next relate the weights of nearby points are close together, then it is unlikely they are ever separated by a hyperplane, and their weights should be similar. The following claim captures this intuition.
Claim 5.13. Fix an index , then with probability at least 1
we have that for all rounds t
Proof. We will bound the number of rounds 1 and the hyperplane
intersects the segment connecting
. We will show that with high probability, this quantity is at most
. Note that since
) is unchanged when
on the same side of the hyperplane, and decreases by at most a factor of (1
) when they lie on different sides, this will show that with high probability
indices 1. The probability that at least
of the hyperplanes
intersect the segment connecting
is at most
which implies our desired result.
Finally, we apply Claims 5.12 and 5.13 to bound the relative weights for all approximately collinear pairs of points in
Claim 5.14. Fix an index i. With probability at least 1 the following claim holds: for all
Proof. Fix a pair of points satisfying the conditions in the statement. Let
foot of the perpendicular from
to the segment connecting
. Note that
Therefore, by the conditions of Claim 5.13, with probability at least 1 , for all rounds t,
on this line. By Claim 5.12, this means that with probability at least 1
rounds t,
This is for a specific pair of points in . Union bounding over all
pairs of points, we have that the theorem statement fails with probability at most
Finally, we prove that if concentrates on a strip,
of this strip. To show this, it suffices to show that the weight of
) is large enough that it must intersect a sufficiently concentrated strip. We do this by using Claim 5.14 to relate the weight of points of
inside and outside
Claim 5.15. Fix an index i. With probability at least 1 , the following statement holds for all t:
Proof. First for each point , consider an axis-parallel box centered at that point with side length
. Now consider all rounds
1 and all planes of the form
for these rounds. We show that with high probability, all boxes intersect at most
of these planes. Using essentially the same argument as in Claim 5.13, we find that this probability is at least
In both of these inequalities we are using the fact that if at most planes intersect any box, then the weights of any two points in the same box are within a factor of (1
Next, consider the ball . Consider the following two transformations:
), which sends a point q to
and the transformation ) is the point obtained by rounding the coordinates of q to the nearest integer multiple of
). If we consider the map
)) given by the above, the number of points
that map to a fixed point
is at most
To see this, note that ) is an axis-parallel box with side-length
contains all the points in
contained within an axis aligned box with side-length
contains at least (2
Now, note that )) satisfy the conditions of Claim 5.14. Thus, by Claim 5.14, with probability at least 1
we have that
Combining the above with equations (1) and (2), we conclude that with probability at least
This implies that the ball ) must intersect the strip
. If this happens then the desired condition is clearly satisfied.
Finally, we can proceed to prove the main theorem.
Proof of Theorem 5.5. First, for each be the total loss at scale i. We will bound
By Claim 5.11, with probability at least 1 1)) for all rounds. If this is true, then the only time we query at level i, there must be some strip given by
of width at most 10
that contains 1
of the total weight of
Claim 5.15, with at least 1
probability, all queries at level i incur loss at most
12. Now, by using Claim 5.10, we can bound the expected total loss at level i as
It follows that
Remark. Naively, one can implement Algorithm 6 with time complexity , via the observation that T hyperplanes divide B(0, 1) into at most
) pieces, so we can simply compute this division and the weight of each distribution
(we care about at most T scales) on each component of this division.
It is an interesting open question if it is possible to implement Algorithm 6 (or otherwise achieve O(poly(d)) regret) with time complexity poly(d, T). To do so, it would suffice to be able to efficiently sample from the distributions
We also study the problem where the learner has full feedback, i.e, after the prediction back is the actual value of
). We show that the optimal regret can be completely characterized (up to constant factors) by a continuous analogue of the Littlestone dimension.
For this section we don’t require the assumption that Y is ordered, only that the loss function forms a valid metric (i.e. is symmetric and satisfies the triangle inequality).
6.1 Tree Dimension
Definition 6.1. A (X, Y)-tree of cost c is a labeled binary tree with the following properties
• There is a root node and each interior node has two children
• Each interior node is labeled with a triple (
• For each leaf, the sum of ) over all nodes on the path from the root to the leaf is at least c.
Definition 6.2. We say a (X, Y)-tree T is H-satisfiable if we can label each leaf with some such that for each node (
, all leaves of the left subtree satisfy
and all leaves of the right subtree satisfy
Definition 6.3 (Tree dimension)), the tree dimension of H, to be the maximum cost of a (X, Y)-tree that is H-satisfiable.
Remark. Note we can naturally extend the above definition to any subset
It is worth noting that covering dimension is “more restrictive” than tree dimension in the sense that bounded covering dimension implies bounded tree dimension.
Theorem 6.4. Let H be a hypothesis class consisting of functions mapping a loss function that defines a metric on Y. If Cdim(H) is finite then
Before proving the above we prove a few preliminary lemmas.
Lemma 6.5. Let H be a hypothesis class and be a loss function that defines a metric on Y. Let T be an H-satisfiable (X, Y)-tree where all leaves have depth d and such that for each internal node (
Proof. Since the tree is H-satisfiable, we can label the leaves with functions of these functions
must satisfy
since there must be some internal node (
. Therefore, there are 2
functions in H such that any two have
distance bigger than 2
. This implies that
Now by the definition of covering dimension, we conclude
Given a rooted binary tree T, we say a rooted binary tree is contained in T if all of the nodes of
are nodes of T and the nodes of
form a binary tree where each interior node has two children under the topology given by T.
Lemma 6.6. Consider a rooted binary tree T (where all interior nodes have exactly two children) and say its nodes are colored with colors 1, 2, . . . , c. We say the colored tree satisfies property (if for each
, it does not contain a monochromatic complete binary tree of color i and depth
. If the coloring of T satisfies property (
, there exists a leaf such that on the path from the root to the leaf, there are at most
nodes of color
Proof. We prove the lemma by induction on . The base cases are obvious. Now say the root of T is colored with color i. Clearly we must have
0. Then either the left or right subtree of the root must satisfy property (
). Using the inductive hypothesis, we get the desired.
Proof of Theorem 6.4. Assume for the sake of contradiction that
). Consider a (X, Y)-tree T that is H-satisfiable and has cost larger than 6
). Note we can assume that there are no nodes in
) = 0 since otherwise, we can delete that node and keep only its left subtree. Let c be an integer such that for all nodes (
Now color the internal nodes of T with c colors {1, 2, . . . c} where a node (
Note by Lemma 6.5, T does not contain any monochromatic, complete binary trees of color i with depth at least (). By Lemma 6.6, this implies the total cost of T is at most
which completes the proof.
However, we cannot hope for any sort of converse to Theorem 6.4 as evidenced by the following example. Let X = Y = [0, 1] and let be the set of all indicator functions of points in [0, 1]. H has infinite covering dimension but its tree dimension is just 1.
6.2 Regret Bounds from Tree Dimension
Theorem 6.7. In the full feedback model there exists an algorithm with regret )). Furthermore, no algorithm can guarantee less than
First we prove that the algorithm below achieves the upper bound.
Proof. Assume for the sake of contradiction that this is false. Then we can construct a -satisfiable (
)-tree with (
) as its root node and cost bigger than
is the smallest value such that
is non-empty, then for every
have
. It follows that if
we are done.
Consider now the case where For this case, we want to argue that
, since after we get the feedback, we will update
. Therefore
implies that:
were the case that , we would have
by Lemma 6.8, contradicting the fact that
Proof of Theorem 6.7. The upper bound follows directly from the previous lemma. For the lower bound, consider an (X, Y)-tree with cost -satisfiable. We can ensure that all leaves have the same depth d (by adding nodes of the form (x, y, y)). Now the adversary chooses a leaf uniformly at random. If the sequence of nodes from the root to the leaf are
then the adversary presents the inputs in that order to the learner. Since the loss function satisfies the triangle inequality, the expected loss of any learner is at least
are done.
Remark. Algorithm 7 assumes that the set has a minimum, which is always the case if the hypothesis class H is finite. For infinite H, this minimum might not exist. In such a case, choose
and choose
. Theorem 6.7 can then be easily adapted to provide a bound of
6.3 Separating binary and full feedback
With binary feedback we can no longer obtain loss bounds that depend only on tree dimension. To see this, consider the following example:
Let H be the set of all functions . There are
such functions. Now for each, slightly perturb the outputs (i.e.
) so that for every
. The tree dimension of this class is O(1). However, clearly any algorithm must incur Ω(n) loss in expectation with binary feedback.
[1] Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.
[2] Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 622–630, 2014.
[3] Peter L Bartlett, Philip M Long, and Robert C Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434–452, 1996.
[4] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Working paper, Stanford University, 2016.
[5] Dimitris Bertsimas and Santosh Vempala. Solving convex programs by random walks. J. ACM, 51(4):540–556, 2004.
[6] S´ebastien Bubeck, R´emi Munos, Gilles Stoltz, and Csaba Szepesv´ari. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.
[7] Nicol`o Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and S´ebastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. arXiv preprint arXiv:1702.08211, 2017.
[8] Maxime C. Cohen, Ilan Lobel, and Renato Paes Leme. Feature-based dynamic pricing. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, Maastricht, The Netherlands, July 24-28, 2016, page 817, 2016.
[9] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.
[10] Pierre Gaillard and S´ebastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In Conference on Learning Theory, pages 764–796, 2015.
[11] Adel Javanmard and Hamid Nazerzadeh. Dynamic pricing in high-dimensions. Working paper, University of Southern California, 2016.
[12] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
[13] Ravi Kannan, L´aszl´o Lov´asz, and Mikl´os Simonovits. Random walks and an o*(n5) volume algorithm for convex bodies. Random Struct. Algorithms, 11(1):1–50, 1997.
[14] Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on, pages 594–605. IEEE, 2003.
[15] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces. Journal of the ACM (JACM), 66(4):1–77, 2019.
[16] Akshay Krishnamurthy, Thodoris Lykouris, and Chara Podimata. Corrupted multidimensional binary search: Learning in the presence of irrational agents. arXiv preprint arXiv:2002.11650, 2020.
[17] Renato Paes Leme and Jon Schneider. Contextual search via intrinsic volumes. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pages 268–282, 2018.
[18] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2(4):285–318, 1988.
[19] Ilan Lobel, Renato Paes Leme, and Adrian Vladu. Multidimensional binary search for contextual decision-making. Operations Research, 2017.
[20] L´aszl´o Lov´asz. Hit-and-run mixes fast. Mathematical Programming, 86(3):443–461, 1999.
[21] Jieming Mao, Renato Leme, and Jon Schneider. Contextual pricing for lipschitz buyers. In Advances in Neural Information Processing Systems, pages 5643–5651, 2018.
[22] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2012.
[23] Sheng Qiang and Mohsen Bayati. Dynamic pricing with demand covariates. Available at SSRN 2765257, 2016.
[24] Aleksandrs Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568, 2014.
[25] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. Proceedings of Machine Learning Research, 75, 2018.