Optimal Contextual Pricing and Extensions

2020·Arxiv

Abstract

Abstract

In the contextual pricing problem a seller repeatedly obtains products described by an adversarially chosen feature vector in and only observes the purchasing decisions of a buyer with a fixed but unknown linear valuation over the products. The regret measures the difference between the revenue the seller could have obtained knowing the buyer valuation and what can be obtained by the learning algorithm.

We give a poly-time algorithm for contextual pricing with O(d log log T +d log d) regret which matches the Ω(d log log T ) lower bound up to the d log d additive factor. If we replace pricing loss by the symmetric loss, we obtain an algorithm with nearly optimal regret of O(d log d) matching the Ω(d) lower bound up to log d. These algorithms are based on a novel technique of bounding the value of the Steiner polynomial of a convex region at various scales. The Steiner polynomial is a degree d polynomial with intrinsic volumes as the coefficients.

We also study a generalized version of contextual search where the hidden linear function over the Euclidean space is replaced by a hidden function in a certain hypothesis class H. We provide a generic algorithm with ) regret where d is the covering dimension of this class. This leads in particular to a ˜) regret algorithm for linear contextual search if the linear function is guaranteed to be s-sparse. Finally we also extend our results to the noisy feedback model, where each round our feedback is flipped with a fixed probability p < 1/2.

1 Introduction

In the contextual search problem a learner tries to learn a hidden linear function some unknown . In every round, the learner is presented with an adversarially chosen vector , and is asked to provide a guess for the dot-product , subsequently learning whether and incurring a loss ). The goal of the learner is to minimize the total loss (the regret), which is given by A special case of this problem is contextual pricing [2, 8, 11, 14, 16, 17, 19, 23]. In this setup, the vectors are features representing differentiated products and the learner is a seller whose decision at round t is how to price item . Given a price a buyer with valuation buys the product if and doesn’t buy otherwise. The seller only observes the purchase or no-purchase decision. The loss in each round is the difference between the revenue made by the seller and the revenue the seller could have made if v was known. Formally, the pricing loss is given by:

A second important case is called symmetric contextual search where the loss is the difference between the guess and the actual dot product . This loss function arises in personalized medicine [4] where the learner chooses the dosage of a medicine and observes whether the patient was over-dosed or under-dosed. Another application is one-bit compressed sensing [9,22] where the learner only observes the sign of a measurement. In either case, we will consider the following loss:

Optimal Regret Bounds for Contextual Search In one dimension, both contextual pricing and symmetric contextual search reduce to non-contextual problems and are well understood. For the symmetric loss the optimal regret in the one-dimensional case is Θ(1) using binary search. For pricing, the optimal regret in the one-dimensional case is Θ(log log T) using the algorithm of Kleinberg and Leighton [14]. These results immediately imply Ω() lower bounds for the general contextual case (see Section 2.4 for details on optimality).

In this paper, we design polynomial-time algorithms with regret O(d log log T + d log d) for contextual pricing and O(d log d) for symmetric contextual search, matching these lower bounds up to a log d factor. This improves over the previously known bounds in [17], which are for pricing and ) for symmetric contextual search.

Steiner polynomial The main technique driving these results is a new potential function based on the Steiner polynomial. This is an object from integral geometry that is closely connected with the notion of intrinsic volumes which were used in [17] to derive the previously known bounds for this problem.

Given a convex set , Steiner showed that the volume of the Minkowski sum Vol(S + zB) is a polynomial of degree d in t (where B is the unit ball). The intrinsic volumes correspond to the coefficients of this polynomial after normalization by volume -dimensional ball:

Both our algorithm and the the algorithm in [17] keep track of the set of vectors consistent with observations seen so far. The intrinsic volumes approach keeps track of ) and shows that the loss incurred in round t is proportional to the decrease of one of the suitably normalized intrinsic volumes (i.e. for some index ). In this paper, instead of keeping track of each coefficient individually, we control the value of the Steiner polynomial itself at different values of z. Specifically, we show that for some set of it is possible to always choose an i (based on the current width of our set) so that ) decreases by a constant fraction. This leads to nearly optimal bounds on regret (via a much simpler proof than that in [17]).

Framework for learning with binary feedback While the Steiner polynomial technique largely resolves the classical problem of contextual search, there is a wide class of learning problems with binary feedback that either do not fit within the framework of learning a linear function, or which impose additional constraints on the linear function that one would hope to leverage.

One example is sparse contextual search, where the hidden vector is guaranteed to be s-sparse, i.e., . This captures settings where we expect few features to matter to the buyer. Another interesting problem is when we are asked to guess maxinstead of corresponds to learning the valuation of an unit-demand buyer. This problem is challenging since the set of vectors consistent with observations seen so far is not necessarily convex.

Both of these examples are special cases of a general framework for online learning problems under binary feedback. In our general setup, the learner is trying to learn a function f in a hypothesis class H containing functions mapping from a context space X to an outcome space Y. In each step a context is chosen adversarially and the learner is asked to submit a guess for the value of ) and incurs a loss )). The goal of the learner is to minimize the total loss )). The original setup corresponds to the case where X is some subset of the unit ball is the class of all linear functions

Our main result in this space is an algorithm with regret is the covering dimension of the hypothesis class H (see Definition 3.3). This result immediately improves the regret of symmetric contextual search from ˜. Similarly, this result immediately implies an ) regret algorithm for the unit-demand buyer problem. We accomplish this by generalizing the Steiner polynomial idea for linear contextual search to a general “Steiner potential” defined for any hypothesis class (see the Techniques subsection below).

We contrast these results in Section 6 with the full feedback case in which the algorithm learns ) each round. In this full-feedback setting, we give matching upper and lower bounds (up to constant factors) on the achievable regret. Our results here are based on a notion we introduce of tree-dimension of a hypothesis class, which is a continuous analogue of Littlestone dimension.

Techniques in the general case The Steiner polynomial is defined for convex sets living in the Euclidean space. Intriguingly, it is possible to generalize (in some sense) this geometric technique to arbitrary classes H of hypotheses. Instead of keeping track of all functions in the hypothesis class that are consistent with the feedback so far, we keep track of an expanded set of functions that don’t violate the feedback up to a certain margin (in the linear case, this is exactly has the effect of regularizing the set of consistent hypotheses and allows for faster progress. Instead of volume, in the general case we control the size of an -net of the set of these approximately valid hypotheses.

A second technique we use is adaptive scaling, which involves keeping track of multiple levels of discretization. For the linear case, this boils down to controlling the value of the Steiner polynomial at different values of . More generally, at each step, we can estimate the maximum possible loss achievable in this round given the previous feedback. Based on this value, we will choose a scale, which will dictate the granularity of the -net and the margin with which we prune inconsistent hypotheses. After picking the scale we show that it is possible to pick a (random) cut that will either: (i) reduce the number of valid hypotheses in the chosen granularity by half; or (ii) eliminate one valid hypothesis at a much coarser granularity. This will require a careful coupling between the discretizations at two different levels. This coupling between two levels is what allows us to overcome the fact that in the general case we can’t rely on techniques from convex geometry. See Section 4.2 for details.

One important feature of all our algorithms (not shared by previous algorithms) will be our use of randomness, in particular perturbed guesses. Every round, we compute the median the set ranges over the set of approximately valid hypotheses. However, instead of guessing the median directly we guess one of the two values (chosen uniformly at random) in , where the size of perturbation depends on our current scale . Our guarantee is that the potential function will decrease significantly for one of the two choices (and thus in expectation).

Noisy Contextual Search The final direction in which we extend the original contextual search problem is by considering noisy binary feedback, i.e., the feedback of the algorithm is flipped with probability p < 1/2. In this setting we move from keeping track of a set of approximately valid hypotheses to a pseudo-Bayesian approach, where we maintain a distribution w over approximately valid hypotheses and update it as we receive feedback. By carefully bounding the weight of hypotheses within a ball of radius 1/T, this results in an algorithm with regret O(d log T).

Ideally, it would be possible to combine this algorithm with the adaptive scaling technique of the noiseless setting, resulting in an O(poly(d)) regret algorithm for general hypothesis classes. One such approach is to replace the notion of width with a fuzzier notion, based on how tightly concentrated the distribution is along the current context vector (e.g. the width of the smallest strip in this direction which contains 1 of the mass of the distribution). We can then choose a scale based on this distributional width, and choose the size of the perturbation based on this scale (as in the deterministic case).

This type of approach works, conditional on being able to show that when the distribution concentrates along a thin strip, the true hypothesis is close to this thin strip with high probability. Unfortunately, doing this for general hypothesis classes seems hard – fortunately, it is possible to do this for the specific case of symmetric contextual search by leveraging the Euclidean geometry of the ambient space (see Section 5.2 for more details). This leads to an algorithm for noisy linear contextual search which gets O(poly(d)) regret, and is the first algorithm we are aware of for contextual search in the noisy setting which gets any regret independent of T for d > 1.

Summary of main results To summarize, our results include:

• Algorithms with regret O(d log d) for symmetric contextual search (Section 2.1) and O(d log log T+ d log d) for contextual pricing (Section 2.2). Both algorithms are optimal (up to log d) and have only O(d) overhead with respect to the non-contextual case. Both algorithms can be implemented efficiently in poly(d, T) time.

• General algorithm for learning a function from a hypothesis class H under binary feedback with regret is the covering dimension of the hypothesis class (Section 4.3).

• An algorithm for symmetric contextual search with noisy binary feedback with O(poly(d)) regret (Section 5.2).

Related work Core to our results is the idea of coupling together potentials at many different scales. Similar ideas of “adaptive discretization”, “zooming”, and “chaining” exist throughout the online learning literature [6,15,24] and statistical learning theory literature [7,10]. Algorithms in these works also often construct several layers of discretizations and have learning rates parameterized by the covering dimension of the ambient space. However, these algorithms are usually designed for settings where (1) one cannot hope for better than ) regret (let alone regret independent of T), (2) feedback is not binary but rather zeroth-order ( [21] study a pricing setting where feedback is binary, but where the hypothesis class is large enough that one must incur O(poly(T)) regret). In particular, we believe our technique of coupling together the potentials for different scales in the analysis of Theorem 4.3 is novel.

Our results in the full feedback case (Section 6) – parameterizing the optimal regret in terms of the tree dimension – can be seen as a generalization of similar results for Littlestone dimension [18] (indeed, in the case where Y = {0, 1}, our notion of tree dimension reduces to Littlestone dimension). While there do exist measures which capture the learnability of functions taking values over a metric space (for example, the fat-shattering dimension [3] for real-valued functions), as far as we are aware the notion of tree dimension we introduce does not currently exist in the literature. It is an interesting open direction to connect the notion of tree dimension we present with previously studied measures.

2 Optimal Contextual Search

We start by describing the contextual search setup and establishing some useful notation. The hidden object is a vector belonging to the unit ball . In each round the learner is provided an (adversarially chosen) vector and asked to provide a guess . Upon guessing, the learner incurs loss ) and receives feedback corresponding to whether , then the feedback is arbitrary. In other words:

This allows the learner to keep track of the set of vectors consistent with observations seen so far:

It is clear from the above setup that for both pricing and symmetric loss, it suffices to consider when = 1 for all rounds.

Throughout the execution of the algorithm we will keep track of the Steiner potential where ) is the standard volume in and the sum is the Minkowski sum:

We will evaluate the potential at different points depending on the width of in the direction . We define the width as:

2.1 Symmetric loss with O(d log d) regret

We start with the symmetric loss function , where we will show it is possible to obtain O(d log d) regret. The main idea of this algorithm is to choose a value based on the width of in the direction of the current context and then choose a guess that splits the set in two parts of equal volume. By doing this, we will show that ) (the “Steiner potential”) decreases by a constant multiplicative fraction. Since ) is bounded below by ), we can only do this some number of times (roughly ) times), from which our regret bound will follows.

We describe the algorithm below (ignore for now issues of computational efficiency; we will address these in Section 2.3):

Proof. Assume the feedback is = +1 (the other case is analogous). Then . In Figure 1 we depict the set . The part of is exactly which has volume

To bound the remaining part, let C be the largest volume of a section of in the direction (see the right part of Figure 1). The total volume of can be bounded below by comparing it to the two cones formed by taking the convex hull of the section of largest volume inside the band and the extreme points , which are at least 2apart in the direction. The volume of the two cones is at least:

Finally, note that the region of has cross-section with volume at most direction so its volume is at most ), thus completing the proof. Theorem 2.2. The regret of the Multiscale Steiner Potential algorithm is at most O(d log d).

Proof. Every time we choose index i, the loss at is most 2and the volume of ) decreases by a constant factor. The set is never empty since , therefore ). For this reason we can’t pick index i by more than )) times, so the total regret is at most:

Figure 1: Illustration of the proof of Lemma 2.1

2.1.1 Comparison with other approaches

Is the Steiner potential necessary? One natural algorithm for this problem is to query ) (i.e. guess the median without inflating the set). The best upper bound from [17] shows only that this has regret at most 2and a lower bound given in the example in Section 8 of [19] shows that this algorithm has regret at least Ω(). Inflating the set by taking the Minkowski sum with a ball seems to be the appropriate regularization that allows us to overcome the lower bound.

Another natural algorithm is to guess ). This algorithm was shown to have regret at least 2

2.2 Pricing loss with O(d log log T + d log d) regret

We now study the pricing loss . Unlike the previous case the loss function is discontinuous. While the loss when under-estimating is small, the loss when over-estimating is very large. In the one-dimensional setting, [14] obtains a O(log log T)-regret algorithm by a conservative variant of binary search that avoids over-estimating the actual value as much as possible.

As before, our algorithm will keep track of ) for different values . This time, however, we will guess more conservatively so that in the case of a no-purchase event, the potential will decrease by a large amount. We will do this in a way that each can lead to a no-purchase event approximately O(d) times.

Proof. Note that in this case, the set is disjoint from definition of immediately gives the desired result.

Proof. First we upper bound the volume V of the strip Let C be the largest volume of a section of in the direction (see right part of Figure 1). Then . On the other hand since is a convex set and has width at least 2

Thus

Since the entire set is disjoint from , this completes the proof of the lemma.

Theorem 2.5. The regret of the Multiscale Steiner Potential for Pricing algorithm is at most O(d log log T + d log d).

Proof. First, note that the regret contributed from all rounds where Next, we consider rounds with . Note that for all of these rounds If we choose index (leading to a no-purchase), the volume of a factor of 2) always, this can happen at most

times. The loss from each such query is at most 1. If we choose index volume of is cut by a factor of. This can happen at most

times. The loss from each such query is at most 2. Thus, our regret is at most

2.3 Polynomial time implementation

We note that can be approximated using binary search as long as we can compute the volume of (for any half-space H. It is enough to notice that we only need a constant approximation of the volume in the previous proof and in order to approximate the volume we only need access to a separation oracle [5,13,20]. Since is a ball intersected with at most t halfspaces, it is trivial to obtain a separation oracle for it.

To obtain a separation oracle for it is enough to solve the problem of computing the distance from a query point to the convex set which is itself a convex problem. Technically, this requires to not be too small since the guarantee of cutting plane methods (like the ellipsoid algorithm) tells us that (given an initial ellipsoid ), it is possible to compute an solution in time

We can take E to be the unit ball; then this is algorithm is efficient as long as never too small (anything at least exp()) is fine). Here we present a simple modification of Algorithms 1 and 2 that makes sure that the volume of stays large enough throughout by preserving a small ball around

Initialize to ensure that ) denotes the ball with center c and radius r). Now change the guess is sampled from the uniform distribution over [0The total additional loss from this perturbation is O(1). Since the perturbation is smaller than we can use the same argument in Lemmas 2.1 and 2.4 with 2instead of to bound the volume of the band making sure there is constant progress in the Steiner potential.

The advantage of this perturbation is that the probability that the cut passes through the ball of radius 1is at most 1per period. So with probability 1all periods t. It follows that ), and with this, the convex minimization problem can be solved in poly(d, T) time.

2.4 Optimality

Here we briefly discuss the optimality of our results, namely Theorem 2.2 and Theorem 2.5. Note that we set up the problem by assuming that the hidden vector and all of the adversarially chosen contexts are drawn from the unit ball. We may alternatively set up the problem by assuming that the hidden vector v is drawn from the cube [and the contexts are chosen from the . With this setup, it follows from a direct reduction to the one-dimensional case that we cannot do better than O(d) for symmetric loss and O(d log log T) for pricing loss.

Now we show how our proofs can be modified to achieve the same results, O(d log d) for symmetric loss and O(d log log T +d log d) for pricing loss, which are optimal in this modified setting up to logarithmic factors. We will run exactly the same algorithms. To see that the analysis remains the same, it suffices to note that our maximum loss in any round is still bounded by O(1) and the volumes are scaled by a factor of poly(which becomes an additive d log d factor after taking the logarithm.

3 Framework for Online Learning with Binary Feedback

We will now consider a general model of learning with binary feedback that has contextual search as a special case. We will derive regret bounds based on the covering dimension of the hypothesis class. The driving technique will still be the Steiner potential together with two new ideas: (i) a new adaptive scaling that will either make a lot of progress in a finer granularity or slower progress in coarser granularity; (ii) randomized cuts that will reduce the potential with constant probability.

Consider a hypothesis class H consisting of functions mapping X to Y. We refer to X as the context space and Y as the output space. We assume that the output space Y is a totally ordered set, i.e, for each we have either

The learning protocol is as follows: an adversary chooses some and in each round they choose some . The learner makes a prediction and incurs loss )) for some loss function 1]. Upon making a prediction, the learner receives feedback on whether ) (the feedback is arbitrary in case of equality). It will be convenient to represent the feedback as a variable otherwise.

We make the following assumptions about the loss function throughout the paper:

• ) = 0 for all

•

•

•

• ) then there are

If Y = R, then for any continuous increasing function 1] and parameter loss satisfies the desired properties. Note that while the symmetric loss function is easily cast in this framework the pricing loss is not captured. In Section 4.4 we give an impossibility result showing it is impossible to obtain covering-dimension bounds for the pricing loss.

3.1 Covering Dimension

Our loss bounds will be in terms of the covering dimension of the hypothesis class H. We start by defining a metric induced by the loss function:

Definition 3.1. For two hypotheses

We can now introduce the notions of -net and covering dimension.

, we say that a set if for every of minimum cardinality.

Definition 3.3 (Covering Dimension). Define the covering dimension H as

Note that this definition of Cdim differs from Hausdorff dimension in that we care not just about the limit 0, but the largest value for any 2]); importantly, this guarantees us that, for any

Note that we specify to avoid issues when is close to 1 - any other fixed constant upper bound p only changes this dimension by a constant factor of at most 1/ log(1/p).

We give a few quick examples to give intuition about covering dimension.

Example 3.4. The space of functions with loss function covering dimension d.

Example 3.5 (Contextual Search). Let B be the unit-ball in each be defined by the dot product . The linear contextual search problem is defined as the learning problem for class with loss function . This class has covering dimension O(d).

show that for any 0 , there is an -net of the sphere of size (1. To do this, we can greedily place points in the unit ball such that no two are within of each other. If we draw an

-radius ball around each point, these balls must be disjoint and contained in a ball centered at the origin of radius 1 + . Thus, the maximum number of points we will place is

giving us an -net of the same size.

Example 3.6 (Sparse Contextual Search). The sparse version of the contextual search problem is given by class . The loss function is still . The covering dimension of this class is O(s log d) (where we treat s as a constant and d as tending to ). To see that the covering dimension is O(s log d), note that for any we can use the result in the previous example to obtain an -net of size

Example 3.7 (Unit demand). In the unit demand version of contextual search the set and the hypothesis class consists of functions . The covering dimension of this class is O(d). This example corresponds to the puzzle in the introduction and corresponds to the economic situation where a seller wants to price a bundle of goods (represented by the context) but the buyer has an unit-demand valuation, i.e., only cares about the highestvalued item in the bundle.

To see that the covering dimension is O(d), note that the set forms an

4 Loss Bounds based on Covering Dimension

For simplicity, in the following theorems we will assume our hypotheses map and that our loss is given by . We will then remark on how to generalize our proof to other loss functions.

4.1 O(d log(T)) regret via Single-scale Steiner Potential

Bounds that depend on log(T) are typically easy for learning with binary feedback and can be obtained using different algorithms. The interesting question in this setting is how to obtain bounds that are constant in T. Nevertheless, it is instructive to start with a simpler algorithm with regret O(d log T) for d = Cdim(H). It will illustrate how the Steiner potential can be generalized to an abstract setting. Instead of keeping track of the hypotheses that are consistent with the feedback so far, we will keep an inflated version of that set.

This algorithm starts with a -net of the hypothesis class and keeps a set of candidate hypotheses that are approximately consistent with observations seen so far. For each , it queries a random point around the median, halving the set of hypotheses with at least half probability.

The analysis is based on the following lemma:

(the other case is analogous), then with probability half, the algorithm guesses and gets the feedback that , which eliminates all the hypotheses . These hypotheses constitute at least half of

Corollary 4.2. The Single-scale Steiner Potential algorithm obtains regret O(d log T) in expectation.

Proof. The regret from periods where is at most O(1). For the remaining periods, the size of is halved with at least probability. Note that for such a t,

Next, since there is some element in which is never eliminated. However, . Thus for any integer c, the probability that there are c periods with loss greater than . Thus, the expected number of periods

with loss larger than 2/T is at most

4.2 Strategy for achieving constant (in T) regret

In this subsection we provide some intuition on how to improve the regret of our algorithm from a loss larger than 1half in size, but whenever it halves in size the only bound we can get for the loss is 1. To improve this bound, we need to guarantee that a loss of 1 can’t occur very often. Our strategy for doing that involves keeping multiple levels of discretization. Given and the feedback will keep for each z > 0:

In other words, we keep an z-discretization of hypotheses along with all the hypotheses that are consistent with the feedback so far with an z-margin. The z-margin is important to guarantee that any hypothesis that is will never be eliminated. We will also refer to set of hypotheses consistent with observations so far without any discretization or margin.

Our strategy will be to choose in each round some discretization level z based on the maximum possible loss achievable in this round. We will divide the space of losses in exponentially sized buckets and define i to be the index of the bucket where the maximum loss falls: