b

DiscoverSearch
About
The Externalities of Exploration and How Data Diversity Helps Exploitation
2018·arXiv
Abstract
Abstract

Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users in order to gain information that will lead to better decisions in the future. Recently, concerns have been raised about whether the process of exploration could be viewed as unfair, placing too much burden on certain individuals or groups. Motivated by these concerns, we initiate the study of the externalities of exploration—the undesirable side effects that the presence of one party may impose on another—under the linear contextual bandits model. We introduce the notion of a group externality, measuring the extent to which the presence of one population of users (the majority) impacts the rewards of another (the minority). We show that this impact can, in some cases, be negative, and that, in a certain sense, no algorithm can avoid it. We then move on to study externalities at the individual level, interpreting the act of exploration as an externality imposed on the current user of a system by future users. This drives us to ask under what conditions inherent diversity in the data makes explicit exploration unnecessary. We build on a recent line of work on the smoothed analysis of the greedy algorithm that always chooses the action that currently looks optimal. We improve on prior results to show that a greedy approach almost matches the best possible Bayesian regret rate of any other algorithm on the same problem instance whenever the diversity conditions hold, and that this regret is at most ˜O(T 1/3). Returning to group-level effects, we show that under the same conditions, negative group externalities essentially vanish if one runs the greedy algorithm. Together, our results uncover a sharp contrast between the high externalities that exist in the worst case, and the ability to remove all externalities if the data is sufficiently diverse.

Online learning algorithms are a key tool in web search and content optimization, adaptively learning what users want to see. In a typical application, each time a user arrives, the algorithm chooses among various content presentation options (e.g., news articles to display), the chosen content is presented to the user, and an outcome (e.g., a click) is observed. Such algorithms must balance exploration (making potentially suboptimal decisions now for the sake of acquiring information that will improve decisions in the future) and exploitation (using information collected in the past to make better decisions now). Exploration could degrade the experience of a current user, but improves user experience in the long run. This exploration-exploitation tradeoff is commonly studied in the online learning framework of multi-armed bandits (Bubeck and Cesa-Bianchi, 2012).

Concerns have been raised about whether exploration in such scenarios could be unfair, in the sense that some individuals or groups may experience too much of the downside of exploration without sufficient upside (Bird et al., 2016). We formally study these concerns in the linear contextual bandits model (Li et al., 2010; Chu et al., 2011), a standard variant of multi-armed bandits appropriate for content personalization scenarios. We focus on externalities arising due to exploration, that is, undesirable side effects that the presence of one party may impose on another.

We first examine the effects of exploration at a group level. We introduce the notion of a group externality in an online learning system, quantifying how much the presence of one population (which we dub the majority) impacts the rewards of another (the minority). We show that this impact can be negative, and that, in a particular precise sense, no algorithm can avoid it. This cannot be explained by the absence of suitably good policies since our adoption of the linear contextual bandits framework implies the existence of a feasible policy that is simultaneously optimal for everyone. Instead, the problem is inherent to the process of exploration. We come to a surprising conclusion that more data can sometimes lead to worse outcomes for the users of an explore-exploit-based system.

We next turn to the effect of exploration at an individual level. We interpret exploration as a potential externality imposed on the current user by future users of the system. Indeed, it is only for the sake of the future users that the algorithm would forego the action that currently looks optimal. To avoid this externality, one may use the greedy algorithm that always chooses the action that appears optimal according to current estimates of the problem parameters. While this greedy algorithm performs poorly in the worst case, it tends to work well in many applications and experiments.1

In a new line of work, Bastani et al. (2017) and Kannan et al. (2018) analyzed conditions under which inherent diversity in the data makes explicit exploration unnecessary. Kannan et al. (2018) proved that the greedy algorithm achieves a regret rate of ˜O(√T)in expectation over small perturbations of the context vectors (which ensure sufficient data diversity). This is the best rate that can be achieved in the worst case (i.e., for all problem instances, without data diversity assumptions), but it leaves open the possibilities that (i) another algorithm may perform much better than the greedy algorithm on some problem instances, or (ii) the greedy algorithm may perform much better than worst case under the diversity conditions. We expand on this line of work. We prove that under the same diversity conditions, the greedy algorithm almost matches the best possible Bayesian regret rate of any algorithm on the same problem instance. This could be as low as polylog(T) for some instances, and, as we prove, at most ˜O(T 1/3)whenever the diversity conditions hold.

Returning to group-level effects, we show that under the same diversity conditions, the negative group externalities imposed by the majority essentially vanish if one runs the greedy algorithm. Together, our results illustrate a sharp contrast between the high individual and group externalities that exist in the worst case, and the ability to remove all externalities if the data is sufficiently diverse.

Additional motivation. Whether and when explicit exploration is necessary is an important concern in the study of the exploration-exploitation tradeoff. Fairness considerations aside, explicit exploration is expensive. It is wasteful and risky in the short term, it adds a layer of complexity to algorithm design (Langford and Zhang, 2007; Agarwal et al., 2014), and its adoption at scale tends to require substantial systems support and buy-in from management (Agarwal et al., 2016, 2017). A system based on the greedy algorithm would typically be cheaper to design and deploy.

Further, explicit exploration can run into incentive issues in applications such as recommender systems. Essentially, when it is up to the users which products or experiences to choose and the algorithm can only issue recommendations and ratings, an explore-exploit algorithm needs to provide incentives to explore for the sake of the future users (Kremer et al., 2014; Frazier et al., 2014; Che and H¨orner, 2015; Mansour et al., 2015; Bimpikis et al., 2017). Such incentive guarantees tend to come at the cost of decreased performance, and rely on assumptions about human behavior. The greedy algorithm avoids this problem as it is inherently consistent with the users’ incentives.

Additional related work. Our research draws inspiration from the growing body of work on fairness in machine learning (e.g., Dwork et al., 2012; Hardt et al., 2016; Kleinberg et al., 2017; Chouldechova, 2017). Several other authors have studied fairness in the context of the contextual bandits framework. Our work differs from the line of research on meritocratic fairness in online learning (Kearns et al., 2017; Liu et al., 2017; Joseph et al., 2016), which considers the allocation of limited resources such as bank loans and requires that nobody should be passed over in favor of a less qualified applicant. We study a fundamentally different scenario in which there are no allocation constraints and we would like to serve each user the best content possible. Our work also differs from that of Celis and Vishnoi (2017), who studied an alternative notion of fairness in the context of news recommendations. According to this notion, all users should have approximately the same probability of seeing a particular type of content (e.g., Republican-leaning articles), regardless of their individual preferences, in order to mitigate the possibility of discriminatory personalization.

The data diversity conditions in Kannan et al. (2018) and this paper are inspired by the smoothed analysis framework of Spielman and Teng (2004), who proved that the expected running time of the simplex algorithm is polynomial for perturbations of any initial problem instance (whereas the worst-case running time has long been known to be exponential). Such disparity implies that very bad problem instances are brittle. We find a similar disparity for the greedy algorithm in our setting.

Our results on group externalities. A typical goal in online learning is to minimize regret, the (expected) difference between the cumulative reward that would have been obtained had the optimal policy been followed at every round and the cumulative reward obtained by the algorithm. We define a corresponding notion of minority regret, the portion of the regret experienced by the minority. Since online learning algorithms update their behavior based on the history of their observations, minority regret is influenced by the entire population on which an algorithm is run. If the minority regret is much higher when a particular algorithm is run on the full population than it is when the same algorithm is run on the minority alone, we can view the majority as imposing a negative externality on the minority; the minority population would achieve a higher cumulative reward if the majority were not present. Asking whether this can ever happen amounts to asking whether access to more data points can ever lead an explore-exploit algorithm to make inferior decisions. One might think that more data should always lead to better decisions and therefore better outcomes for the users. Surprisingly, we show that this is not the case, even with a standard algorithm.

Consider LinUCB (Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011), a standard algorithm for linear contextual bandits that is based on the principle of “optimism under uncertainty.” We provide a specific problem instance on which, after observing T users, LinUCB would have a minority regret of  Ω(√T)if run on the full population, but only constant minority regret if run on the minority alone. While stylized, this example is motivated by the problem of providing driving directions to different populations of users, about which fairness concerns have been raised (Bird et al., 2016). Further, the situation is reversed on a slight variation of this example: LinUCB obtains constant minority regret when run on the full population and  Ω(√T)on the minority alone. That is, group externalities can be large and positive in some cases, and large and negative in others.

Although these regret rates are specific to LinUCB, we show that this phenomenon is, in some sense, unavoidable. Consider the minority regret of LinUCB when run on the full population and the minority regret that LinUCB would incur if run on the minority alone. We know that one may be much smaller or larger than the other. We ask whether any algorithm could achieve the minimum of the two on every problem instance. Using a variation of the same problem instance, we prove that this is impossible; in fact, no algorithm could simultaneously approximate both up to any  o(√T)factor. In other words, an externality-free algorithm would sometimes “leave money on the table.”

In terms of techniques, we rely on the special structure of our example, which can be viewed as an instance of the sleeping bandits problem (Kleinberg et al., 2010). This simplifies the behavior and analysis of LinUCB, allowing us to obtain the O(1) upper bounds. The lower bounds are obtained using KL-divergence techniques to show that the two variants of our example are essentially indistinguishable, and an algorithm that performs well on one must obtain  Ω(√T)regret on the other.

Our results on the greedy algorithm. We consider a version of linear contextual bandits in which the latent weight vector  θis drawn from a known prior. In each round, an algorithm is presented several actions to choose from, each represented by a context vector. The expected reward of an action is a linear product of  θand the corresponding context vector. The tuple of context vectors is drawn independently from a fixed distribution. In the spirit of smoothed analysis, we assume that this distribution has a small amount of jitter. Formally, the tuple of context vectors is drawn from some fixed distribution, and then a small perturbation—small-variance Gaussian noise—is added independently to each coordinate of each context vector. This ensures arriving contexts are diverse. We are interested in Bayesian regret, i.e., regret in expectation over the Bayesian prior. Following the literature, we are primarily interested in the dependence on the time horizon T.

We focus on a batched version of the greedy algorithm, in which new data arrives to the algorithm’s optimization routine in small batches, rather than every round. This is well-motivated from a practical perspective—in high-volume applications data usually arrives to the “learner” only after a substantial delay (Agarwal et al., 2016, 2017)—and is essential for our analysis.

Our main result is that the greedy algorithm matches the Bayesian regret of any algorithm up to polylogarithmic factors, for each problem instance, fixing the Bayesian prior and the context distribution. We also prove that LinUCB achieves regret ˜O(T 1/3)for each realization of  θ. This implies a worst-case Bayesian regret of ˜O(T 1/3)for the greedy algorithm under the perturbation assumption.

Our results hold for both natural versions of the batched greedy algorithm, Bayesian and frequentist, henceforth called BatchBayesGreedy and BatchFreqGreedy. In BatchBayesGreedy, the chosen action maximizes expected reward according to the Bayesian posterior. BatchFreqGreedy estimates  θusing ordinary least squares regression and chooses the best action according to this estimate. The results for BatchFreqGreedy come with additive polylogarithmic factors, but are stronger in that the algorithm does not need to know the prior. Further, the ˜O(T 1/3)regret bound for BatchFreqGreedy is approximately prior-independent, in the sense that it applies even to very concentrated priors such as independent Gaussians with standard deviation on the order of  T −2/3.

The key insight in our analysis of BatchBayesGreedy is that any (perturbed) data can be used to simulate any other data, with some discount factor. The analysis of BatchFreqGreedy requires an additional layer of complexity. We consider a hypothetical algorithm that receives the same data as BatchFreqGreedy, but chooses actions based on the Bayesian-greedy selection rule. We analyze this hypothetical algorithm using the same technique as BatchBayesGreedy, and then upper bound the difference in Bayesian regret between the hypothetical algorithm and BatchFreqGreedy.

Our analyses extend to group externalities and (Bayesian) minority regret. In particular, we circumvent the impossibility result mentioned above. We prove that both BatchBayesGreedy and BatchFreqGreedy match the Bayesian minority regret of any algorithm run on either the full population or the minority alone, up to polylogarithmic factors

Detailed comparison with prior work. We substantially improve over the ˜O(√T)worst-case regret bound from Kannan et al. (2018), at the cost of some additional assumptions. First, we consider Bayesian regret, whereas their regret bound is for each realization of  θ.2 Second, they allow the context vectors to be chosen by an adversary before the perturbation is applied. Third, they extend their analysis to a somewhat more general model, in which there is a separate latent weight vector for every action (which amounts to a more restrictive model of perturbations). However, this extension relies on the greedy algorithm being initialized with a substantial amount of data. The results of Kannan et al. (2018) do not appear to have implications on group externalities.

Bastani et al. (2017) show that the greedy algorithm achieves logarithmic regret in an alternative linear contextual bandits setting that is incomparable to ours in several important ways. They consider two-action instances where the actions share a common context vector in each round, but are parameterized by different latent vectors. They ensure data diversity via a strong assumption on the context distribution. This assumption does not follow from our perturbation conditions; among other things, it implies that each action is the best action in a constant fraction of rounds. Further, they assume a version of Tsybakov’s margin condition, which is known to substantially reduce regret rates in bandit problems (e.g., see Rigollet and Zeevi, 2010).

We consider the model of linear contextual bandits (Li et al., 2010; Chu et al., 2011). Formally, there is a learner who serves a sequence of users over T rounds, where T is the (known) time horizon. For the user who arrives in round t there are (at most) K actions available, with each action  a ∈ {1, . . . , K}associated with a context vector  xa,t ∈ Rd. Each context vector may contain a mix of features of the action, features of the user, and features of both. We assume that the tuple of context vectors for each round t is drawn independently from a fixed distribution. The learner observes the set of contexts and selects an action  atfor the user. The user then experiences a reward rtwhich is visible to the learner. We assume that the expected reward is linear in the chosen context vector. More precisely, we let  ra,tbe the reward of action a if this action is chosen in round t (so that rt = rat,t), and assume that there exists an unknown vector  θ ∈ Rd such that E [ra,t | xa,t] = θ⊤xa,tfor any round t and action a. Throughout most of the paper, the realized rewards are either in {0, 1} or are the expectation plus independent Gaussian noise of variance at most 1. We sometimes consider a Bayesian version, in which the latent vector  θis initially drawn from some known prior P.

A standard goal for the learner is to maximize the expected total reward over T rounds, or �Tt=1 θ⊤xa,t. This is equivalent to minimizing the learner’s regret, defined as

image

where  x∗t = arg maxx∈{x1,t,··· ,xK,t} θ⊤xdenotes the context vector associated with the best action at round t. We are mainly interested in expected regret, where the expectation is taken over the context vectors, the rewards, and the algorithm’s random seed, and Bayesian regret, where the expectation is taken over all of the above and the prior over  θ.

We introduce some notation in order to describe and analyze algorithms in this model. We write xt for xat,t, the context vector chosen at time  t. Let Xt ∈ Rt×d be thecontext matrix at time t, a matrix whose rows are vectors  x1 , . . . , xt ∈ Rd. A d × d matrix Zt := �tτ=1 xτx⊤τ = X⊤t Xt,called the empirical covariance matrix, is an important concept in some of the prior work on linear contextual bandits (e.g., Abbasi-Yadkori et al., 2011; Kannan et al., 2018), as well as in this paper.

Optimism under uncertainty. Optimism under uncertainty is a common paradigm in problems with an explore-exploit tradeoff (Bubeck and Cesa-Bianchi, 2012). The idea is to evaluate each action “optimistically”—assuming the best-case scenario for this action—and then choose an action with the best optimistic evaluation. When applied to the basic multi-armed bandit setting, it leads to a well-known algorithm called UCB1 (Auer et al., 2002), which chooses the action with the highest upper confidence bound (henceforth, UCB) on its mean reward. The UCB is computed as the sample average of the reward for this action plus a term which captures the amount of uncertainty.

Optimism under uncertainty has been extended to linear contextual bandits in the LinUCB algorithm (Chu et al., 2011; Abbasi-Yadkori et al., 2011). The high-level idea is to compute a confidence region  Θt ⊂ Rd in each round  t such that θ ∈ Θtwith high probability, and choose an action a which maximizes the optimistic reward estimate  supθ∈Θt x⊤a,tθ. Concretely, one uses regression to form an empirical estimate ˆθt for θ. Concentration techniques lead to high-probability bounds of the form  |x⊤(θ − ˆθt)| ≤ f(t)�x⊤Z−1t x, where the interval width function f(t) may depend on hyperparameters and features of the instance. LinUCB simply chooses an action

image

Among other results, Abbasi-Yadkori et al. (2011) use

image

where L and S are known upper bounds on  ∥xa,t∥2 and ∥θ∥2, respectively, and  c0is a parameter. For any  c0 ≥ 1, they obtain regret ˜O(dS√c0 K T), with only a polylog dependence on TL/d.

In this section, we study the externalities of exploration at a group level, quantifying how much the presence of one population impacts the rewards of another in an online learning system. We consider linear contextual bandits in a setting in which there are two underlying user populations, called the majority and the minority. The user who arrives at round t is assumed to come from the majority population with some fixed probability and the minority population otherwise, and the population from which the user comes is known to the learner. The tuple of context vectors at time t is then drawn independently from a fixed group-specific distribution.

We assume there is a single hidden vector  θ, and that the distribution of rewards conditioned on the chosen context vector is the same for both groups. Only the distribution over tuples of available context vectors differs between groups. This implies that externalities cannot be explained by the absence of a good policy, since there always exists a policy that is simultaneously optimal for everyone. This allows us to focus only on externalities inherent to the process of exploration.

We define the minority regret to be the regret experienced by the minority. The group externality imposed on the minority by the majority is then the difference between the minority regret of an algorithm run on the minority alone and the minority regret of the same algorithm run on the full population. A negative group externality implies that the minority is worse off due to the presence of the majority. It is generally more meaningful to bound the multiplicative difference between the minority regret obtained with and without the majority present. Several of our results have this form.

We first ask whether large group externalities can exist. We show that on a simple toy example, a large negative group externality arises under LinUCB, while a slight variant of this example leads to a large positive externality. Put another way, more available data can lead to either better or worse outcomes for the users of a system. We show that this general phenomenon is unavoidable. That is, no algorithm can simultaneously approximate the minority regret of LinUCB run on the full population and LinUCB run on the minority alone, up to any  o(√T)multiplicative factor.

3.1. Two-Bridge Instance

We consider a toy example, motivated by a scenario in which a learner is choosing driving routes for two groups of users. Each user starts at point A, B, or C, and wants to get to the same destination, point D, which requires taking one of two bridges, as shown in Figure 1. The travel costs for each of the two bridges are unknown. For simplicity, assume all other edges are known to have 0 cost.

Suppose that 95% of users are in the majority group. All of these users start at point A and have access only to the top bridge. The other 5% are in the minority. Of these users, 95% start at point C, from which they have access only to the bottom bridge. The remaining 5% of the minority users start at point B, and have access to both bridges.

image

Figure 1: Visual illustration of the two-bridge instance.

Consider the behavior of an algorithm that follows the principle of optimism under uncertainty. If run on the full user population, it will quickly collect many observations of the commute time for the top bridge since all users in the majority group must travel over the top bridge. It will collect relatively fewer observations of the commute time over the bottom bridge. Therefore, when the algorithm is faced with a member of the minority population who starts at point B, the algorithm will likely send this user over the bottom bridge in order to collect more data and improve its estimate.

If the same algorithm is instead run on the minority alone, it will quickly collect many more observations of the commute time for the bottom bridge relative to the top. Now when the algorithm is faced with a user who starts at point B, it will likely send her over the top bridge.

Which is better depends on which bridge has the longer commute time. If the top bridge is the better option, then the presence of the majority imposes a negative externality on the minority. If not, then the presence of the majority helps. These two scenarios may be difficult to distinguish.

This toy example can be formalized in the linear contextual bandits framework. There are two underlying actions (the two bridges), but these actions are not always available. To capture this, we define a parameter vector  θ in [0, 1]2, with the two coordinates  θ1 and θ2representing the expected rewards for taking the top and bottom bridge respectively. (Though we motivated the example in terms of costs, it can be expressed equivalently in terms of rewards.) There are two possible context vectors:  [1 0]⊤ and [0 1]⊤. A user has available an action with context vector  [1 0]⊤ if and only if she has access to the top bridge. Similarly, she has available an action with context vector  [0 1]⊤ ifand only if she has access to the bottom bridge. The instance can then be formalized as follows.

Definition 1 (Two-Bridge Instance) The two-bridge instance is an instance of linear contextual bandits. On each round t, the user who arrives is from the majority population with probability 0.95, in which case  x1,t = x2,t = [1 0]⊤. Otherwise, the user is in the minority. In this case, with probability 0.95,  x1,t = x2,t = [0 1]⊤ (based on Figure 1, we call these C rounds), while with probability 0.05,  x1,t = [1 0]⊤ and x2,t = [0 1]⊤ (Brounds). We consider two values for the hidden parameter vector  θ, θ(0) = [1/2 1/2 − ε]⊤ and θ(1) = [1/2 − ε 1/2]⊤ where ε = 1/√T.

3.2. Performance of LinUCB

We start by analyzing the performance of LinUCB on the two-bridge instance. Our main result formalizes the intuition above, showing that when  θ = θ(0) (that is, the top bridge is better) the majority imposes a large negative group externality on the minority, while the majority imposes a large positive externality when  θ = θ(1). We assume rewards are 1-subgaussian.3

Theorem 2 Consider LinUCB with any interval width function f satisfying  f(t) ≥ 2�log(T).4On the two-bridge instance, assuming 1-subgaussian noise on the rewards, when  θ = θ(0), LinUCBachieves expected minority regret O(1) when run on the minority alone, but  Ω(√T)when run on the full population. In contrast, when  θ = θ(1), LinUCB achieves expected minority regret O(1) when run on the full population, but  Ω(√T)when run on the minority alone.

We omit the proofs of the  Ω(√T)lower bounds, which both follow a similar structure to the one used in the proof of the general impossibility result in Section 3.3; in fact, both of these lower

3. A random variable  X is called σ-subgaussian if  E[eσX2] < ∞. A special case is Gaussians with variance  σ2.4. For instance, the interval width function in Equation (3) satisfies this condition whenever  dc0 ≥ 4, so one can either set  c0 ≥ 2, or add two more dimensions to the problem instance (and set  θ3 = θ4 = 0).

bounds could be stated as an immediate corollary of Theorem 4. Essentially, an argument based on KL-divergence shows that it is difficult to distinguish between the case in which  θ = θ(0) and thecase in which  θ = θ(1), and therefore LinUCB must choose similar actions in these two cases.

To prove the O(1) upper bounds, we make heavy use of the special structure of the two-bridge instance, which significantly simplifies the analysis of LinUCB. We exploit the fact that the only context vectors available to the learner are the basis vectors  [1 0]⊤ and [0 1]⊤, which essentially makes this an instance of sleeping bandits (Kleinberg et al., 2010). In this special case, the covariance matrix  Ztis always diagonal, which simplifies Equation (2) and leads to LinUCB choosing the ith basis, where  i maximizes (ˆθt)i + f(t)/�(Zt)ii and (Zt)iiis simply the number of times that this basis vector was already chosen. Additionally, in this setting  (ˆθt)iis just the average reward observed for the ith basis vector, allowing us to bound the difference between each  (ˆθt)i andθiusing standard concentration techniques. Using this, we show that with high probability, after a logarithmic number of rounds—during which the learner can amass at most O(1) regret since the worst-case regret on any round is  ε = 1/√T—the probability that LinUCB chooses the wrong action on a B round is small (O(1/√T)). This leads to constant regret on expectation.

The proof makes use of the following concentration bound:

Lemma 3 Let  Ctbe the number of C rounds observed in the first t minority rounds in the two-bridge instance. For any  δ ∈ (0, 1), with probability at least  1 − δ, Ct ≥ 0.9t for allt ≥ 760 log(T/δ).

Proof We apply the following form of the Chernoff bound:

image

Setting  γ = 1/19, we get

image

for  t ≥ 760 log(T/δ). Applying a union bound over all T rounds, we have  Ct ≥ 0.9t for allt ≥ 760 log(T/δ)with probability at least  1 − δ.

Proof of Theorem 2 Now, consider LinUCB run on the minority population alone on the two-bridge instance with  θ = θ(0). Since we are considering running LinUCB on the minority only, majority rounds are irrelevant, so throughout this proof we abuse notation and use  t ∈ {1, · · · , T0} for someT0 ≤ Tto index minority rounds. T is still the total number of (minority plus majority) rounds.

This proof heavily exploits the special structure of the two-bridge instance to simplify the analysis of LinUCB. In particular, we exploit the fact that the only contexts ever available are the basis vectors  [1 0]⊤ and [0 1]⊤. This implies that the covariance matrix  Ztis always diagonal, which greatly simplifies the expression for the chosen action in Equation (2). The optimistic estimate of the reward for choosing the ith basis vector is simply

image

Additionally, in this special case,  (Zt)iiis simply the number of times that the ith basis vector was chosen over the first t minority rounds, and  (ˆθt)iis the average reward observed over the  (Zt)iirounds on which it was chosen.

image

and  θi. Since rewards were assumed to be 1-subgaussian, Lemma 37 and a union bound give us that for any  δ1, for any t, with probability at least  1 − 4δ1, for all i ∈ {1, 2},

image

Let  Bt and Ctbe the number of B and C rounds respectively before round t. By Lemma 3, for any  δ2, with probability  1 − δ2, Ct ≥ 9Bt when t ≥ 760 log(T/δ2). Suppose this is the case. Since it is only possible to choose [1 0] on B rounds, we have  (Zt)11 ≤ BT. Similarly, since the algorithm can only choose  [0 1] on every C round, (Zt)22 ≥ CT ≥ 9BT . Fixing δ1 = 1/√Tand using the assumption that  f(t) ≥ 2�log(T), Equations (4) and (5) then imply that for any t ≥ 760 log(T/δ2), with probability at least  1 − 2δ1 = 1 − 2√T,

image

and similarly,

image

This shows that with probability at least  1 − δ2, after the first  760 log(T/δ2)rounds, LinUCB picks  [1 0]⊤ on each Bround with probability at least  1 − 2δ1, leading to zero regret on that round. To turn this into a bound on expected regret, first note that with at most  δ2probability, the argument above fails to hold, in which case the minority regret is still bounded by  εBT ≤ εT. Whenthe argument above holds, LinUCB may suffer up to  εregret on each of the first  760 log(T/δ2)minority rounds. On each additional round, there is a failure probability of  2δ1, and in this case LinUCB again suffers regret of at most  ε. Putting this together and setting  δ2 = 1/√T, we get that the expected regret is bounded by  δ2εT + 760 log(T/δ2)ε + 4δ1εT = O(1).

3.3. An Impossibility Result

It is natural to ask whether it is possible to design an algorithm that can distinguish between the two scenarios analyzed above, obtaining minority regret that is close to the best of LinUCB run on the minority alone and LinUCB run on the full population on any problem instance. In this section, we show that the answer is no. In particular, we prove that on the two-bridge instance, if Pr[θ = θ(0)] = Pr[θ = θ(1)] = 1/2, then any algorithm must suffer  Ω(√T)regret on expectation (and therefore  Ω(√T)minority regret, since all regret is incurred by minority users).

To prove this result, we begin by formalizing the idea that it is hard to distinguish between the case in which  θ = θ(0) and the case in which  θ = θ(1). To do so, we bound the KL-divergence between the joint distributions over the sequences of context vectors, actions taken by the given algorithm, and the given algorithm’s rewards that are induced by the two choices of  θ. By applying the high-probability Pinsker lemma (Tsybakov, 2009), we show that a low KL-divergence between these distributions implies that the algorithm must be likely either to choose the top bridge on B rounds more than half the time when the bottom bridge is better or to choose the bottom bridge on B rounds more than half the time when the top bridge is better, either of which would lead to high (Ω(√T)) regret as long as the number of B rounds is sufficiently large. To finish the proof, we use a simple Chernoff bound to show that the number of B rounds is large with high probability.

To derive the KL-divergence bound, we make use of the assumption that the realized rewards  rtat each round are either 0 or 1. This assumption is not strictly necessary. An analogous argument could be made, for instance, for real-valued rewards with Gaussian noise.

Theorem 4 On the two-bridge instance with realized rewards  rt ∈ {0, 1}, any algorithm must incur  Ω(√T)minority regret in expectation when  Pr[θ = θ(0)] = Pr[θ = θ(1)] = 12.

Note that “any algorithm” here includes algorithms run on the minority alone, essentially ignoring data from the majority. Theorems 2 and 4 immediately imply the following corollary.

Corollary 5 No algorithm can simultaneously approximate the minority regret of both LinUCB run on the minority and LinUCB run on the full population up to any  o(√T)multiplicative factor.

Proof of Theorem 4 Fix any algorithm A. We will first derive an  Ω(√T)lower bound on the expected regret of A conditioned on the number of  B rounds, BT, being large. To complete the proof, we then show that  BTis large with high probability.

Let  ht = {(x1,τ, x2,τ, aτ, rτ)}tτ=1 be a history of all context vectors, chosen actions, and re- wards up to round  t, with h0 = ∅. Running Aon the two-bridge instance with  θ = θ(0) inducesa distribution over histories  hT . Let Pdenote the conditional distribution of these histories, conditioned on the event that  BT ≥ T/800. That is, we define

image

We first show that KL(P(hT ) || Q(hT ))is upper bounded a constant that does not depend on T. By the chain rule for KL divergences, since  rtis independent of any previous contexts, actions, or rewards conditioned on  xt,

image

Since the choice of context vectors available at time t is independent of the value of the parameter  θand A may only base its choices on the observed history and current choice of contexts, it is always the case that  P((x1,t, x2,t, at) | ht−1) = Q((x1,t, x2,t, at) | ht−1), so the first sum in this expression is equal to 0.

To bound the second sum, we make use of the assumption that  rt ∈ {0, 1} for all t.5 Lemma 39then tells us that for any round  t, KL(P(rt | x1,t, x2,t, at) || Q(rt | x1,t, x2,t, at)) ≤ 7ε2/2 sincethe probability of getting reward 1 conditioned on a chosen context is always either  1/2 or 1/2 − ε.Putting this together, we get that

image

Now, let E be the event that the algorithm A chooses arm 2 on at least half of the B rounds, conditioned on  BT ≥ T/800. If Eoccurs when  θ = θ(0), the regret of A is at least  BT ε/2, whichis on the order of√T when BT ≥ T/800. If Edoes not occur (i.e.,E occurs) when  θ = θ(1), Aagain has regret at least  BT ε/2. We will use the bound on KL divergence to show that one of these bad cases happens with high probability.

By Lemma 38,

image

Let R be the regret of A. We then have that

image

It remains to bound the probability that  BT ≥ T/800. By a Chernoff bound,

image

Thus, for any  δ ∈ (0, 1), if T ≥ 3200 log(1/δ), then with probability at least  1 − δ, BT ≥ T/800.In particular, let  δ = 1/2. Then if T ≥ 3200 log 2, we have

image

This completes the proof that the regret of  A is Ω(√T)on this problem instance.

We now turn our attention to externalities at an individual level. We interpret exploration as a potential externality imposed on the current user of a system by future users, since the current user would prefer the learner to take the action that appears optimal. One could avoid such externalities by running the greedy algorithm, which does just that, but it is well known that the greedy algorithm performs poorly in the worst case. In this section, we build on a recent line of work analyzing the conditions under which inherent data diversity leads the greedy algorithm to perform well.

We analyze the expected performance of the greedy algorithm under small random perturbations of the context vectors. We focus on greedy algorithms that consume new data in batches, rather than every round. We consider both Bayesian and frequentist versions, BatchBayesGreedy and BatchFreqGreedy. Our main result is that for any specific problem instance, both algorithms match the Bayesian regret of any algorithm on that particular instance up to polylogarithmic factors. We also prove that under the same perturbation assumptions, LinUCB achieves regret ˜O(T 1/3) for eachrealization of  θ, which implies a worst-case Bayesian regret of ˜O(T 1/3)for the greedy algorithms. Finally, we repurpose our analysis to derive a positive result in the group setting, implying that the impossibility result of Section 3.3 breaks down when the data is sufficiently diverse.

Setting and notation. We consider a Bayesian version of linear contextual bandits, with  θ drawnfrom a known multivariate Gaussian prior  P = N(θ, Σ), withθ ∈ Rd and invertible  Σ ∈ Rd×d.To capture the idea of data diversity, we assume the context vectors on each round t are generated using the following perturbed context generation process: First, a tuple  (µ1,t , . . . , µK,t) ofmean context vectors is drawn independently from some fixed distribution  Dµ over (R ∪ {⊥})K,where  µa,t =⊥means action a is not available. For each available action a, the context vector is then  xa,t = µa,t + εa,t, where εa,tis a vector of random noise. Each component of  εa,tis drawn independently from a zero-mean Gaussian with standard deviation  ρ. We refer to  ρ as theperturbation size. In general, our regret bounds deteriorate if  ρis very small. Together we refer to a distribution Dµ, prior P, and perturbation size  ρ as aproblem instance. We make several technical assumptions. First, the distribution  Dµis such that each context vector has bounded 2-norm, i.e.,  ∥µa,t∥2 ≤ 1. It can be arbitrary otherwise. Second, the perturbation size needs to be sufficiently small,  ρ ≤ 1/√d. Third, the realized reward  ra,tfor each action a and round  t is ra,t = x⊤a,tθ + ηa,t, the mean reward  x⊤a,tθplus standard Gaussian noise  ηa,t ∼ N(0, 1).6The history up to round t is denoted by the tuple  ht = ((x1, r1) , . . . , (xt, rt)).

The greedy algorithms. For the batch version of the greedy algorithm, time is divided in batches of Y consecutive rounds each. When forming its estimate of the optimal action at round t, the algorithm may only use the history up to the last round of the previous batch, denoted  t0.

BatchBayesGreedy forms a posterior over  θusing prior P and history  ht0. In round tit chooses the action that maximizes reward in expectation over this posterior. This is equivalent to choosing

image

BatchFreqGreedy does not rely on any knowledge of the prior. It chooses the best action according to the least squares estimate of  θ, denoted θfret , computed with respect to history  ht0:

image

4.1. Main Results

We first state our main results before describing the intuition behind them. We state each theorem in terms of the main relevant parameters  T, K, d, Y , and ρ. First, we prove that in expectation over the random perturbations, both greedy algorithms favorably compare to any other algorithm.

Theorem 6 With perturbed context generation, there is some  Y0 = polylog(d, T)/ρ2 such thatwith batch duration  Y ≥ Y0, the following holds. Fix any bandit algorithm, and let  R0(T) be its

Bayesian regret on a particular problem instance. Then on that same instance, (a) BatchBayesGreedy has Bayesian regret at most  Y · R0(T/Y ) + ˜O(1/T),(b) BatchFreqGreedy has Bayesian regret at most  Y · R0(T/Y ) + ˜O(√d/ρ2).

Our next result asserts that the Bayesian regret for LinUCB and both greedy algorithms is on the order of (at most)  T 1/3. This result requires additional technical assumptions.

Theorem 7 Assume that the maximal eigenvalue of the covariance matrix  Σof the prior P is at most  1,7 and the mean vector satisfies  ∥θ∥2 ≥ 1 + √3 log T. With perturbed context generation,(a) With appropriate parameter settings, LinUCB has Bayesian regret ˜O(d2 K2/3 T 1/3/ρ2).(b) If  Y ≥ Y0as in Theorem 6, then both BatchBayesGreedy and BatchFreqGreedy have Bayesian regret at most ˜O(d2 K2/3 T 1/3/ρ2).

The assumption  ∥θ∥2 ≥ 1 + √3 log Tin Theorem 7 can be replaced with  d ≥ log T/ log log T.We use Theorem 7(b) to derive an “approximately prior-independent” result for BatchFreqGreedy. (For clarity, we state it for independent priors.) The bound in Theorem 7(b) deteriorates if P gets very sharp, but it suffices if P has standard deviation on the order of (at least)  T −2/3.

Corollary 8 Assume that the prior P is independent over the components of  θ, with variance  κ2 ≤1 in each component. Suppose the mean vector satisfies  ∥θ∥2 ≥ 1 + √3 log T. With perturbed context generation, if  Y ≥ Y0as in Theorem 6, then BatchFreqGreedy has Bayesian regret at most ˜O(d2 K2/3 T 1/3/ρ2)as long as  κ ≥ T −2/3.

Finally, we derive a positive result on group externalities. We find that with perturbed context generation, the minority Bayesian regret of the greedy algorithms (i.e., the Bayesian regret incurred on minority rounds) is small compared to the minority Bayesian regret of any algorithm, whether run on the full population or on the minority alone. This sidesteps the impossibility result of Section 3.3.

Theorem 9 Assume  Y ≥ Y0as in Theorem 6 and perturbed context generation. Fix any bandit algorithm and instance, and let  Rmin(T)be the minimum of its minority Bayesian regrets when it is only run over minority rounds or when it is run over the full population. Both greedy algorithms run on the full population achieve minority Bayesian regret at most  Y · Rmin(T) + ˜O(√d/ρ2).

4.2. Key Techniques

The key idea behind our approach is to show that, with perturbed context generation, BatchBayesGreedy collects data that is informative enough to “simulate” the history of contexts and rewards from the run of any other algorithm ALG over fewer rounds. This implies that it remains competitive with ALG since it has at least as much information and makes myopically optimal decisions.

We use the same technique to prove a similar simulation result for BatchFreqGreedy. To treat both algorithms at once, we define a template that unifies them. A bandit algorithm is called batch-greedy-style if it divides the timeline in batches of Y consecutive rounds each, in each round t chooses some estimate  θt of θ, based only on the data from the previous batches, and then chooses the best action according to this estimate, so that  at = arg maxa θ⊤t xa,t. For a batch that starts at round  t0 + 1, thebatch history is the tuple  ((xt0+τ, rt0+τ) : τ ∈ [Y ]), and thebatch context matrix is the matrix X whose rows are vectors  (xt0+τ : τ ∈ [Y ]); here [Y ] = {1, · · · , Y }. Similarly to the “empirical covariance matrix”, we define the batch covariance matrix as  X⊤X.

Let us formulate what we mean by “simulation”. We want to use the data collected from a single batch in order to simulate the reward for any one context x. More formally, we are interested in the randomized function that takes a context x and outputs an independent random sample from N(θ⊤x, 1). We denote it  Rewθ(·); this is the realized reward for an action with context vector x.

Definition 10 Consider batch B in the execution of a batch-greedy-style algorithm. Batch history hBcan simulate  Rewθ()up to radius R > 0 if there exists a function g : {context vectors} ×{batch histories  hB} → R such that g(x, hB)is identically distributed to  Rewθ(x)conditional on the batch context matrix, for all  θand all context vectors  x ∈ Rd with ∥x∥2 ≤ R.

Let us comment on how it may be possible to simulate  Rewθ(x). For intuition, suppose that x = 12 x1+ 12 x2. Then (12 r1+ 12 r2+ξ)is distributed as  N(θ⊤x, 1) if ξis drawn independently from N(0, 12). Thus, we can define  g(x, h) = 12 r1+ 12 r2+ξin Definition 10. We generalize this idea and show that a batch history can simulate  Rewθas long as the batch covariance matrix has a sufficiently large minimum eigenvalue, which holds with high probability when the batch size is large.

Lemma 11 With perturbed context generation, there is some  Y0 = polylog(d, T)/ρ2 and R =O(ρ�d log(TKd))such that with probability at least  1 − T −2 any batch history from a batch-greedy-style algorithm can simulate  Rewθ()up to radius R, as long as  Y ≥ Y0.

If the batch history of an algorithm can simulate  Rewθ, the algorithm has enough information to simulate the outcome of a fresh round of any other algorithm ALG. Eventually, this allows us to use a coupling argument in which we couple a run of BatchBayesGreedy with a slowed-down run of ALG, and prove that the former accumulates at least as much information as the latter, and therefore the Bayesian-greedy action choice is, in expectation, at least as good as that of ALG. This leads to Theorem 6(a). We extend this argument to a scenario in which both the greedy algorithm and ALG measure regret over a randomly chosen subset of the rounds, which leads to Theorem 9.

To extend these results to BatchFreqGreedy, we consider a hypothetical algorithm that receives the same data as BatchFreqGreedy, but chooses actions based on the (batched) Bayesian-greedy selection rule. We analyze this hypothetical algorithm using the same technique as above, and then argue that its Bayesian regret cannot be much smaller than that of BatchFreqGreedy. Intuitively, this is because the two algorithms form almost identical estimates of  θ, differing only in the fact that the hypothetical algorithm uses the P as well as the data. We show that this difference amounts to effects on the order of 1/t, which add up to a maximal difference of O(log T) in Bayesian regret.

In this section, we prove Theorem 7(a), a Bayesian regret bound for the LinUCB algorithm under perturbed context generation. We focus on a version of LinUCB from Abbasi-Yadkori et al. (2011), as defined in (3) on page 7.

Recall that the interval width function in (3) is parameterized by numbers  L, S, c0. We use

image

Recall that  ρdenotes perturbation size, and

Remark 12 Ideally we would like to set L, S according to (8) with equalities. We consider a more permissive version with inequalities so as to not require the exact knowledge of  ρ and ∥θ∥2.

While the original result in Abbasi-Yadkori et al. (2011) requires  ∥xa,t∥2 ≤ L and ∥θ∥2 ≤ S,in our setting this only happens with high probability.

We prove the following theorem (which implies Theorem 7(a)):

Theorem 13 Assume perturbed context generation. Further, suppose that the maximal eigenvalue of the covariance matrix  Σof the prior P is at most 1, and the mean vector satisfies  ∥θ∥2 ≥1 + √3 log T. The version of LinUCB with interval width function (3) and parameters given by (8) has Bayesian regret at most

image

Remark 14 The theorem also holds if the assumption on  ∥θ∥2is replaced with  d ≥ log Tlog log T . Theonly change in the analysis is that in the concluding steps (Section 5.2), we use Lemma 17(b) instead of Lemma 17(a).

On a high level, our analysis proceeds as follows. We massage algorithm’s regret so as to elucidate the dependence on the number of rounds with small “gap” between the best and second-best action, call it N. This step does not rely on perturbed context generation, and makes use of the analysis from Abbasi-Yadkori et al. (2011). The crux is that we derive a much stronger upper-bound on E [N] under perturbed context generation. The analysis relies on some non-trivial technicalities on bounding the deviations from the “high-probability” behavior, which are gathered in Section 5.1.

We reuse the analysis in Abbasi-Yadkori et al. (2011) via the following lemma.8 To state this lemma, define the instantaneous regret at time  t as Rt = θ⊤x∗t − θ⊤xat,t, and let

image

Lemma 15 (Abbasi-Yadkori et al. (2011)) Consider a problem instance with reward noise N(0, 1) and a specific realization of latent vector  θand contexts  xa,t. Consider LinUCB with parameters  L, S, c0that satisfy  ∥xa,t∥2 ≤ L, ∥θ∥2 ≤ S, and c0 = 1. Then

image

The following lemma captures the essence of the proof of Theorem 13. From here on, we assume perturbed context generation. In particular, reward noise is N(0, 1).

Lemma 16 Suppose parameter L is set as in (8). Consider a problem instance with a specific realization of  θ such that ∥θ∥2 ≤ S. Then for any  γ > 0,

image

Proof We will prove that for any  γ > 0,

image

where  Tγ = {t : Rt ∈ (0, γ)}.We use Lemma 15 to upper-bound the second summand in (11). To this end, we condition on the

event that every component of every perturbation  εa,thas absolute value at most�2 log 2T 3Kd;denote this event by U. This implies  ∥xa,t∥2 ≤ Lfor all actions a and all rounds t. By Lemma 33, U is a high-probability event:  Pr[U] ≥ 1 − 1T 2. Now we are ready to apply Lemma 15:

image

To plug this into (11), we need to account for the low-probability event ¯U. We need to be careful because  Rtcould, with low probability, be arbitrarily large. By Lemma 18 with  ℓ = 0,

image

Putting this together and using (12), we obtain:

image

difference in expected rewards between the best and second-best actions at time t (where “best” and “second-best” is according to expected rewards). Here, we’re taking expectations after the perturbations are applied, so the only randomness comes from the noisy rewards. Consider the set of rounds with small gap,  Gγ := {t : ∆t < γ}. Notice that  rt ∈ (0, γ) implies ∆t < γ, so|Tγ| ≤ |Gγ|.

In what follows we prove an upper bound on  E [|Gγ|]. This is the step where perturbed context generation is truly used. For any two arms  a1 and a2, the gap between their expected rewards is

image

Therefore, the probability that the gap between those arms is smaller than  γ is

image

Since  θ⊤εa1,t and θ⊤εa2,tare both distributed as  N(0, ρ2∥θ∥22), their difference is  N(0, 2ρ2∥θ∥22).The maximum value that the Gaussian measure takes is 12ρ∥θ∥2√π, and the measure in any interval of width  2γis therefore at most γρ∥θ∥2√π. This gives us the bound

image

Union-bounding over all�K2�pairs of actions, we have

image

Plugging this into (13) (recalling that  |Tγ| ≤ |Gγ|) completes the proof.

5.1. Bounding the Deviations

We make use of two results that bound deviations from the “high-probability” behavior, one on  ∥θ∥2and another on instantaneous regret. First, we prove high-probability upper and lower bounds on ∥θ∥2under the conditions in Theorem 13. Essentially, these bounds allow us to use Lemma 16.

Lemma 17 Assume the latent vector  θcomes from a multivariate Gaussian,  θ ∼ N(θ, Σ), herethe covariate matrix  Σ satisfies λmax(Σ) ≤ 1.

image

Proof We consider two cases, based on whether  d ≥ log T/ log log T. We need both cases to prove part (a), and we obtain part (b) as an interesting by-product. We repeatedly use Lemma 36, a concentration inequality for  χ2 random variables, to show concentration on the Gaussian norm.

image

Similarly, we can show

image

By the triangle inequality,

image

Thus, in this case, 12 log T ≤ ∥θ∥2 ≤ ∥θ∥2 + √d log Twith probability at least  1 − 2T −1.

image

By Lemma 36,

image

Because  ∥θ∥2 ≥ 1 + √3 log T, 1 ≤ ∥θ∥2 ≤ ∥θ∥2 + √3 log Twith probability at least  1 − T −1.

Next, we show how to upper-bound expected instantaneous regret in the worst case.9

Lemma 18 Fix round t and parameter  ℓ > 0. For any θ, conditioned on any history  ht−1 and theevent that  ∥εa,t∥∞ ≥ ℓfor each arm a, the expected instantaneous regret of any algorithm at round t is at most

image

Proof The expected regret at round t is upper-bounded by the reward difference between the best arm  x∗t and the worst arm  x†t, which is

image

Since  εa,thas symmetry about the origin conditioned on the event that at least one component of one of the perturbations has absolute value at least  ℓ, i.e. v and −vhave equal likelihood, maxa θ⊤εa,t and − mina θ⊤εa,tare identically distributed. Let  Eℓ,tbe the event that at least one of the components of one of the perturbations has absolute value at least  ℓ. This means for any choice

image

where the expectation is taken over the perturbations at time t. Without loss of generality, let  (εa′,t)jbe the component such that  |(εa′,t)j| ≥ ℓ. Then, all other components have distribution  N(0, ρ2). Then,

image

Let  (˜εa,t)i = 0 if a = a′ and i = j, and (εa,t)iotherwise. In other words, we simply zero out the component  (εa′,t)j. Then, this is

image

because by Lemma 34,

image

Next, note that by symmetry and since  θj ≤ ∥θ∥2,

image

Putting this all together, the expected instantaneous regret is bounded by

image

proving the lemma.

5.2. Finishing the Proof of Theorem 13.

We focus on the “nice event” that (14) holds, denote it E for brevity. In particular, note that it implies  ∥θ∥2 ≤ S. Lemma 16guarantees that expected regret under this event, E [Regret(T) | E], is upper-bounded by the expression (9) in the theorem statement.

In what follows we use Lemma 17(a) and Lemma 18 guarantee that if E fails, then the corresponding contribution to expected regret is small. Indeed, Lemma 18 with  ℓ = 0implies that

image

where  B = 1 + ρ(2 + √2 log K)is the “blow-up factor”. Since (14) fails with probability at most

image

The antecedent inequality follows by Lemma 31 with  α = 12 log T , using the assumption that λmax(Σ) ≤ 1. The theorem follows.

We present the proofs for our results on greedy algorithms in Section 4.10 This section is structured as follows. In Section 6.1, we quantify the diversity of data collected by batch-greedy-style algorithms, assuming perturbed context generation. In Section 6.2, we show that a sufficiently “diverse” batch history suffices to simulate the reward for any given context vector, in the sense of Defini-tion 10. Jointly, these two subsections imply that batch history generated by a batch-greedy-style algorithm can simulate rewards with high probability, as long as the batch size is sufficiently large. Section 6.3 builds on this foundation to derive regret bounds for BatchBayesGreedy. The crux is that the history collected by BatchBayesGreedy suffices to simulate a “slowed-down” run of any other algorithm. This analysis extends to a version of BatchFreqGreedy equipped with a Bayesian-greedy prediction rule (and tracks the performance of the prediction rule). Finally, Section 6.4 derives the regret bounds for BatchFreqGreedy, by comparing the prediction-rule version of BatchFreqGreedy with BatchFreqGreedy itself. To derive the results on group externalities, we present all our analysis in Sections 6.3 and 6.4 in a more general framework in which only the minority rounds are counted for regret.

Preliminaries. We assume perturbed context generation in this section, without further mention. Throughout, we will use the following parameters as a shorthand:

image

Recall that  ρdenotes perturbation size, and d is the dimension. The meaning of ˆR and R is thatthey are high-probability upper bounds on the perturbations and the contexts, respectively. More formally, by Lemma 34 we have:

image

Let us recap some of the key definitions from Section 4.2. We consider batch-greedy-style algorithms, a template that unifies BatchBayesGreedy and BatchFreqGreedy. A bandit algorithm is called batch-greedy-style if it divides the timeline in batches of Y consecutive rounds each, in each round t chooses some estimate  θt of θ, based only on the data from the previous batches, and then chooses the best action according to this estimate, so that  at = arg maxa θ⊤t xa,t.

For a batch B that starts at round  t0 + 1, thebatch history  hBis the tuple  ((xt0+τ, rt0+τ) : τ ∈[Y ]), and the batch context matrix  XBis the matrix whose rows are vectors  (xt0+τ : τ ∈ [Y ]).Here and elsewhere,  [Y ] = {1, · · · , Y }. Thebatch covariance matrix is defined as

image

6.1. Data Diversity under Perturbations

We are interested in the diversity of data collected by batch-greedy-style algorithms, assuming perturbed context generation. Informally, the observed contexts  x1, x2, . . .should cover all directions in order to enable good estimation of the latent vector  θ. Following Kannan et al. (2018), we quantify data diversity via the minimal eigenvalue of the empirical covariance matrix  Zt. More precisely, we are interested in proving that  λmin(Zt)is sufficiently large. We adapt some tools from Kannan et al. (2018), and then derive some improvements for batch-greedy-style algorithms.

image

Kannan et al. (2018) prove that  λmin(Zt)grows linearly in time t, assuming t is sufficiently large.

Lemma 19 (Kannan et al. (2018)) Fix any batch-greedy-style algorithm. Consider round  t ≥ τ0,where  τ0 = 160R2ρ2 log 2dδ · log T. Then for any realization of  θ, with probability  1 − δ

image

Proof The claimed conclusion follows from an argument inside the proof of Lemma B.1 from Kannan et al. (2018), plugging in  λ0 = ρ22 log T . This argument applies for any  t ≥ τ ′0, where

image

Recall that  Ztis the sum  Zt := �tτ=1 xτx⊤τ . A key step in the proof of Lemma 19 zeroes in onthe expected contribution of a single round. We use this tool separately in the proof of Lemma 22.

Lemma 20 (Kannan et al. (2018)) Fix any batch-greedy-style algorithm, and the latent vector  θ.Assume  T ≥ 4K. Condition on the event that all perturbations  εa,tare upper-bounded by ˆR, denoteit with E. Then with probability at least 14,

image

Proof The proof is easily assembled from several pieces in the analysis in Kannan et al. (2018). Let ˆθtbe the algorithm’s estimate for  θ at time t. As inKannan et al. (2018), define

image

where  ˆca,tdepends on all perturbations other than the perturbation for  xa,t. Let us say that  ˆca,t is“good” for arm a if

image

First we argue that

image

Indeed, in the proof of their Lemma 3.4, Kannan et al. (2018) show that for any round, conditioned on E, if the probability that arm a was chosen over the randomness of the perturbation is at least 2/T, then the round is good for a with probability at least 12. Let Btbe the set of arms at round t with probability at most 2/T of being chosen over the randomness of the perturbation. Then,

image

This is where we use conditioning on the event  {εa,t ≤ ˆR}. We plug in  r = ρ√2 log T andλ0 = ρ22 log T into Lemma 3.2 of Kannan et al. (2018). This lemma applies because with these parameters, the perturbed distribution of context arrivals satisfies the  (ρ√2 log T, ρ2/(2 log T))-diversity condition from Kannan et al. (2018). The latter is by Lemma 3.6 of Kannan et al. (2018). This completes the proof of (19). The lemma follows from (18) and (19).

Let  θfretbe the BatchFreqGreedy estimate for  θ at time t, as defined in (7). We are interested in quantifying how the quality of this estimate improves over time. Kannan et al. (2018) prove, essentially, that the distance between  θfret and θ scales as√t/λmin(Zt).

Lemma 21 (Kannan et al. (2018)) Consider any round t in the execution of BatchFreqGreedy. Let t0be the last round of the previous batch. For any  θ and any δ > 0, with probability  1 − δ,

image

We focus on batch covariance matrix  ZBof a given batch in a batch-greedy-style algorithm. We would like to prove that  λmin(ZB)is sufficiently large with high probability, as long as the batch size Y is large enough. The analysis from Kannan et al. (2018) (a version of Lemma 19) would apply, but only as long as the batch size is least as large as the  τ0from the statement of Lemma 19. We derive a more efficient version, essentially shaving off a factor of  8.11

Lemma 22 Fix a batch-greedy-style algorithm and any batch B in the execution of this algorithm. Fix  δ > 0and assume that the batch size Y is at least

image

Condition on the event that all perturbations in this batch are upper-bounded by ˆR, more formally:

image

Further, condition on the latent vector  θand the history h before batch B. Then

image

The probability in (21) is over the randomness in context arrivals and rewards in batch B.

The improvement over Lemma 19 comes from two sources: we use a tail bound on the sum of geometric random variables instead of a Chernoff bound on a binomial random variable, and we derive a tighter application of the eigenvalue concentration inequality of Tropp (2012).

Proof Let  t0be the last round before batch B. Recalling (17), let

image

be a similar sum over the expected per-round covariance matrices. Assume  Y ≥ Y0The proof proceeds in two steps: first we lower-bound  λmin(ZB), and then we show that it implies (21). Denoting  m = R2 ee−1 (1 + log 2dδ ), we claim that

image

To prove this, observe that  WB’s minimum eigenvalue increases by at least  λ0 = ρ2/(2 log T)with probability at least 1/4 each round by Lemma 20, where the randomness is over the history, i.e., the sequence of (context, reward) pairs. If we want it to go up to m, this should take  4m/λ0rounds in expectation. However, we need it to go to m with high probability. Notice that this is dominated by the sum of  m/λ0geometric random variables with parameter 14. We’ll use the following bound from Janson (2018): for  X = �ni=1 Xi where Xi ∼ Geom(p) and any c ≥ 1,

image

Because we want the minimum eigenvalue of  WB to be m, we need n = m/λ0, so E [X] = 4m/λ0.Choose  c =�1 + λ0m log 2δ� ee−1. By Corollary 44,

image

This completes the proof of (22).

To derive (21) from (22), we proceed as follows. Consider the event

image

Letting  α = 1−R2/mand rewriting  R2 as (1−α)m, we use a concentration inequality from Tropp (2012) to guarantee that

image

Then, using the fact that  xx ≥ e−1/e for all x > 0, we have

image

since  m ≥ ee−1R2 �1 + log 2dδ�. Finally, observe that, omitting the conditioning on  EB, h, we have:

image

6.2. Reward Simulation with a Diverse Batch History

We consider reward simulation with a batch history, in the sense of Definition 10. We show that a sufficiently “diverse” batch history suffices to simulate the reward for any given context vector. Coupled with the results of Section 6.1, it follows that batch history generated by a batch-greedy-style algorithm can simulate rewards as long as the batch size is sufficiently large.

Let us recap the definition of reward simulation (Definition 10). Let  Rewθ(·)be a randomized function that takes a context x and outputs an independent random sample from  N(θ⊤x, 1). Inother words, this is the realized reward for an action with context vector x.

Definition 23 Consider batch B in the execution of a batch-greedy-style algorithm. Batch history hBcan simulate  Rewθ()up to radius R > 0 if there exists a function g : {context vectors} ×{batch histories  hB} → R such that g(x, hB)is identically distributed to  Rewθ(x)conditional on the batch context matrix, for all  θand all context vectors  x ∈ Rd with ∥x∥2 ≤ R.

Note that we do not require the function g to be efficiently computable. We do not require algorithms to compute g; a mere existence of such function suffices for our analysis.

The result in this subsection does not rely on the “greedy” property. Instead, it applies to all “batch-style” algorithms, defined as follows: time is divided in batches of Y consecutive rounds each, and the action at each round t only depends on the history up to the previous batch. The data diversity condition is formalized as  {λmin(ZB) ≥ R2}; recall that it is a high-probability event, in a precise sense defined in Lemma 22. The result is stated as follows:

Lemma 24 Fix a batch-style algorithm and any batch B in the execution of this algorithm. Assume the batch covariance matrix  ZB satisfies λmin(ZB) ≥ R2. Then batch history  hBcan simulate Rewθup to radius R.

Proof Let us construct a suitable function g for Definition 23. Fix a context vector  x ∈ Rd with∥x∥2 ≤ R. Let rBbe the vector of realized rewards in batch  B, i.e., rB = (rt : rounds t in B) ∈

image

in which the variance of the reward noise is  σ2. Then the noise variance in (23) should be  σ2 (1 −∥wB∥22), with essentially no modifications throughout the rest of the proof.)

image

the proof we show that g is as needed for Definition 23.

image

above satisfy  X⊤B wB = x and ∥wB∥2 ≤ 1. Then, we’ll show that if each  rτ ∼ N(θ⊤xτ, 1), thenr⊤BwB + N(0, 1 − ∥wB∥22) ∼ N(θ⊤x, 1).

image

as desired. We must now show that  ∥wB∥22 ≤ 1. Note that

image

where  ∥v∥2M simply denotes  v⊤Mv. Thus, it is sufficient to show that  ∥x∥2Z−1B ≤ 1. Since ∥x∥2 ≤ Rand  λmin (ZB) ≥ R2, we have by Lemma 40

image

Let  z = Z−1/2B x, so I ⪰ zz⊤. Again by Lemma 40,  λmax(zz⊤) = z⊤z. This means that

image

as desired. Finally, observe that

image

where  η ∼ N(0, I)is the noise vector. Notice that  η⊤wB ∼ N(0, ∥wB∥2), and therefore,  η⊤wB +N(0, 1 − ∥wB∥22) ∼ N(0, 1). Putting this all together, we have

image

and therefore D can simulate E for any x up to radius R.

6.3. Regret Bounds for BatchBayesGreedy

We apply the tools from Sections 6.1 and 6.2 to derive regret bounds for BatchBayesGreedy. On a high level, we prove that the history collected by BatchBayesGreedy suffices to simulate a “slowed-down” run of any other algorithm  ALG0. Therefore, when it comes to choosing the next action, BatchBayesGreedy has at least as much information as  ALG0, so its Bayesian-greedy choice cannot be worse than the choice made by  ALG0.

Our analysis extends to a more general scenario which is useful for the analysis of BatchFreqGreedy. We formulate and prove our results for this scenario directly. We consider an extended bandit model which separates data collection and reward collection. Each round t proceeds as follows: the algorithm observes available actions and the context vectors for these actions, then it chooses two actions,  at and a′t, and observes the reward for the former but not the latter. We re- fer to  a′t as the “prediction” at round t. We will refer to an algorithm in this model as a bandit algorithm (which chooses actions  at) with “prediction rule” that chooses the predictions  a′t. Morespecifically, we will be interested in an arbitrary batch-greedy-style algorithm with prediction rule given by BatchBayesGreedy, as per (6) on 14. We assume this prediction rule henceforth. We are interested in prediction regret: a version of regret (1) if actions  atare replaced with predictions  a′t:

image

where  x∗t is the context vector of the best action at round t, as in (1). More precisely, we are interested in Bayesian prediction regret, the expectation of (24) over everything: the context vectors, the rewards, the algorithm’s random seed, and and the prior over  θ.

We use essentially the same analysis to derive implications on group externalities. For this purpose, we consider a further generalization in which regret is restricted to rounds that correspond to a particular population. Formally, let  T ⊆ Nbe a randomly chosen subset of the rounds where Pr[t ∈ T ]is a constant and rounds are chosen to be in T independently of one another. We allow for the possibility that the underlying context distribution differs for rounds in T compared to rounds in [T]\T . More precisely, we allow the event  {t ∈ T }be correlated with the context tuple at round t. Similar to the definition of minority regret, we define T -restricted regret (resp., prediction regret) in T rounds to be the portion of regret (resp., prediction regret) that corresponds to T -rounds:

image

T -restricted Bayesian (prediction) regret is defined as an expectation over everything. Thus, the main theorem of this subsection is formulated as follows:

Theorem 25 Consider perturbed context generation. Let ALG be an arbitrary batch-greedy-style algorithm whose batch size is at least  Y0 from (20). Fix any bandit algorithm  ALG0, and let RT0 (T)be the T -restricted regret of this algorithm on a particular problem instance I. Then on the same instance, ALG has T -restricted Bayesian prediction regret

image

Proof sketch. We use a t-round history of ALG to simulate a (t/Y )-round history of  ALG0. Morespecifically, we use each batch in the history of ALG to simulate one round of  ALG0. We prove that the simulated history of  ALG0has exactly the same distribution as the actual history, for any  θ. SinceALG predicts the Bayesian-optimal action given the history (up to the previous batch), this action is at least as good (in expectation over the prior) as the one chosen by  ALG0 after t/Yrounds. The detailed proof is deferred to Section 6.3.1.

Implications. As a corollary of this theorem, we obtain regret bounds for BatchBayesGreedy in Theorem 6 and Theorem 7. We take T to be the set of all rounds, i.e.,  Pr[t ∈ T ] = 1, and ALG tobe BatchBayesGreedy. For Theorem 7(b), we take  ALG0to be LinUCB. Thus:

Corollary 26 In the setting of Theorem 25, BatchBayesGreedy has Bayesian regret at most  Y ·E [R0(T/Y )] + ˜O(1/T)on problem instance I. Further, under the assumptions of Theorem 7, BatchBayesGreedy has Bayesian regret at most ˜O(d2 K2/3 T 1/3/ρ2)on all instances.

We also obtain a similar regret bound on the Bayesian prediction regret of BatchFreqGreedy, which is essential for Section 6.4.

Corollary 27 In the setting of Theorem 25, BatchFreqGreedy has Bayesian prediction regret (27).

To derive Theorem 9 for BatchBayesGreedy, we take T to be the set of all minority rounds, and apply Theorem 25 twice: first when  ALG0is run over the minority rounds only (and can behave arbitrarily on the rest), and then when  ALG0is run over full population.

6.3.1. PROOF OF THEOREM 25

We condition on the event that all perturbations are bounded by ˆR, more precisely, on the event

image

Recall that  E1is a high-probability event, by (15). We also condition on the event

image

where  ZBis the batch covariance matrix, as usual. Conditioned on  E1, this too is a high-probability event by Lemma 22 plugging in  δ/Tand taking a union bound over all batches. We will prove that ALG satisfies

image

where the expectation is taken over everything: the context vectors, the rewards, the algorithm’s random seed, and the prior over  θ. Then we take care of the “failure event”E1 ∩ E2.

History simulation. Before we prove (29), let us argue about using the history of ALG to simulate a (shorter) run of  ALG0. Fix round t. We use a t-round history of ALG to simulate a  ⌊t/Y ⌋-round runof  ALG0, where Yis the batch size in ALG. Stating this formally requires some notation. Let  At bethe set of actions available in round  t, and let cont = (xa,t : a ∈ At)be the corresponding tuple of contexts. Let CON be the set of all possible context tuples, more precisely, the set of all finite subsets of  Rd. Let ht and h0t denote, resp., the t-round history of  ALG and ALG0. Let Ht denote the set of all possible t-round histories. Note that  ht and h0t are random variables which take values on  Ht. Wewant to use history  htto simulate history  h0⌊t/Y ⌋. Thus, the simulation result is stated as follows:

Lemma 28 Fix round  t and let σ = (con1 , . . . , con⌊t/Y ⌋)be the sequence of context arrivals up to and including round  ⌊t/Y ⌋. Then there exists a “simulation function”

image

such that the simulated history  sim(ht, σ)is distributed identically to  h0⌊t/Y ⌋, conditional on se- quence  σ, latent vector  θ, and events  E1, E2.

Proof Throughout this proof, condition on events  E1 and E2. Generically,  sim(ht, σ) outputs asequence of pairs  {(xτ, rτ)}⌊t/Y ⌋τ=1 , where xτis a context vector and  rτis a simulated reward for this context vector. We define  sim(ht, σ)by induction on  τwith base case  τ = 0. Throughout, we maintain a run of algorithm  ALG0. For each step  τ ≥ 1, suppose ALG0is simulated up to round τ − 1, and the corresponding history is recorded as  ((x1, r1) , . . . , (xτ−1, rτ−1)). Simulate the next round in the execution of  ALG0by presenting it with the action set  Aτand the corresponding context tuple  conτ. Let xτbe the context vector chosen by  ALG0. The corresponding reward  rτ isconstructed using the  τ-th batch in  ht, denote it with B. By Lemmas 22 and 24, the batch history hBcan simulate a single reward, in the sense of Definition 23. In particular, there exists a function g(x, hB)with the required properties (recall that it is explicitly defined in (23)). Thus, we define rτ = g(xτ, hB), and return  rτas a reward to  ALG0. This completes the construction of  sim(ht, σ).The distribution property of  sim(ht, σ)is immediate from the construction.

Proof of Equation (29) We argue for each batch separately, and then aggregate over all batches in the very end. Fix batch  B, and let t0 = t0(B)be the last round in this batch. Let  τ = 1 + t0/Y ,and consider the context vector  x0τ chosen by ALG0 in round τ. This context vector is a randomized function f of the current context tuple  conτand the history  h0τ−1:

image

By Lemma 28, letting  σ = (con1 , . . . , con⌊t/Y ⌋), it holds that

image

Let t be some round in the next batch after  B, and let x′t = xa′t,t, be the context vector predicted by ALG in round t. Recall that  x′t is a Bayesian-greedy choice from the context tuple  cont, basedon history  ht0. Observe that the Bayesian-greedy action choice from a given context tuple based on history  ht0cannot be worse, in terms of the Bayesian-expected reward, than any other choice from the same context tuple and based on the same history. Using (30), we obtain:

image

for any given context tuple  con ∈ CONthat has a non-zero arrival probability given  E1 ∩ E2.Given  cont = con, the event  t ∈ Tis independent of everything else. Likewise, given  conτ =con, the event  τ ∈ Tis independent of everything else. It follows that

image

for any given context tuple  con ∈ CONthat has a non-zero arrival probability given  E1 ∩ E2.

Observe that  cont and conτhave the same distribution, even conditioned on event  E1∩E2. (Thisis because the definitions of  E1 and E2treat all rounds in the same batch in exactly the same way.) Therefore, we can integrate (32) over the context tuples con:

image

Now, let us sum up (33) over all rounds t in the next batch after B, denote it next(B).

image

Note that the right-hand side of (33) stays the same for all t, hence the factor of Y on the right-hand side of (34). This completes our analysis of a single batch B.

We obtain (29) by over all batches B. Here it is essential that the expectation  E�1{t∈T } θ⊤x∗t�does not depend on round t, and therefore the “regret benchmark”  θ⊤x∗t cancels out from (29). In particular, it is essential that the context tuples  contare identically distributed across rounds.

Proof of Theorem 25 given Equation (29) We must take care of the low-probability failure events

E1 and

image

For ease of exposition, we focus on the special case  Pr [t ∈ T ] = 1; the general case is treated similarly. We know that  Pr[E1∪E2] ≤ δ+δR. Lemma 18 with ℓ = ˆRgives us that the instantaneous regret of every round is at most

image

by Lemma 35. Letting  δ = δR = 1T 2, we verify that our definition of Y means that Lemma 22 indeed holds with probability at least  1 − T −2. Using (29), the Bayesian prediction regret of ALG is

image

This completes the proof of Theorem 25.

6.4. Regret Bounds for BatchFreqGreedy

To analyze BatchFreqGreedy, we show that its Bayesian regret is not too different from its Bayesian prediction regret, and use Corollary 27 to bound the latter. As in the previous subsection, we state this result in more generality for the sake of group externality implications: we consider T -restricted (prediction) regret, exactly as before.

Theorem 29 Assuming perturbed context generation, BatchFreqGreedy satisfies

image

where  Σis the covariance matrix of the prior and  ρis the perturbation size.

Taking T to be the set of all contexts, and using Corollary 27, we obtain Bayesian regret bounds for BatchFreqGreedy in Theorem 6 and Theorem 7. To derive Theorem 9 for BatchFreqGreedy, we take T to be the set of all minority rounds.

The remainder of this section is dedicated to proving Theorem 29. On a high level, the idea is as follows. As in the proof of Theorem 25, we condition on the high-probability event (28) that perturbations are bounded. Specifically, we prove that

image

To prove this statement, we fix round t and compare the action  attaken by BatchFreqGreedy and the predicted action  a′t. We observe that the difference in rewards between these two actions can be upper-bounded in terms of  θbayt − θfret , the difference in the  θestimates with and without knowledge of the prior. (Recall (6) and (7) for definitions.) Specifically, we show that

image

The crux of the proof is to show that the difference  ∥θbayt − θfret ∥2is small, namely

image

ignoring other parameters. Thus, summing over all rounds, we get

image

Proof of Eq. (35). Let  Rt and PRegt be, resp., instantaneous regret and instantaneous prediction regret at time t. Then

image

Thus, it suffices to bound the differences in instantaneous regret. Recall that at time t, the chosen action for BatchFreqGreedy and the predicted action are, resp.,

image

Letting  t0 − 1 = ⌊t/Y ⌋be the last round in the previous batch, we can formulate  θfret and θbayt as

image

Therefore, we have

image

since the mean of the posterior distribution is exactly  θbayt , and θbaytis deterministic given  ht−1.Taking expectation over  ht−1, we have

image

Therefore,

image

Eq. (36) follows. The crux is to prove (37): to bound the expected distance between the Frequentist and Bayesian estimates for  θ. By expanding their definitions, we have

image

Next, note that

image

By Lemma 42,  λmin (Zt0−1 + Σ) ≥ λmin (Zt0−1). Therefore,

image

giving us

image

the following bounds hold, each with probability at least  1 − δ:

image

Therefore, fixing  t0 ≥ 1 + tmin(δ/2), with probability at least  1 − δ we have

image

Note that the high-probability events we need are deterministic given  ht0−1, and therefore are independent of the perturbations at time t. This means that Lemma 18 applies, with  ℓ = 0: conditioned on any  ht0−1, the expected regret for round t is upper-bounded by  2∥θ∥2(1 + ρ(1 + √2 log K)).In particular, this holds for any  ht0−1not satisfying the high probability events from Lemmas 19 and 21. Therefore, for all  t ≥ tmin(δ),

image

Because  θ ∼ N(θ, Σ), we have Σ−1/2(θ − θ) ∼ N(0, I). By Lemma 35,

image

This means

image

Since  t0 = Ω(t), for sufficiently small  δ, this proves (37). We need to do a careful computation to complete the proof of Eq. (35). We know from (36) that

image

Choosing  δ = T −2, we find that

image

so this term vanishes. Furthermore,

image

since  t0 ≥ t − Y , and �Tt=1 1/t = O(log T). Using the fact that  R = ˜O(1)(since by assumption ρ ≤ d−1/2), this is simply

image

Finally, we note that on the first  tmin(T −2) = ˜O(1/ρ2)rounds, the regret bound from Lemma 18 with  ℓ = 0applies, so the total regret difference is at most

image

which implies Eq. (35).

Completing the proof of Theorem 29 given (35). By Theorem 29, this holds whenever all perturbations are bounded by ˆR, which happens with probability at least  1− δR. When the bound fail, the total regret is at most

image

by Lemma 18 (with  ℓ = ˆR) and Lemma 35. Since  δR = T −2, the contribution of regret when the high-probability bound fails is ˜O(1/T) ≤ ˜O(1).

We thank Solon Barocas, Dylan Foster, Jon Kleinberg, Aaron Roth, and Hanna Wallach for helpful discussions about these topics.

Yasin Abbasi-Yadkori, D´avid P´al, and Csaba Szepesv´ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (ICML), 2014.

Alekh Agarwal, Sarah Bird, Markus Cozowicz, Miro Dudik, John Langford, Lihong Li, Luong Hoang, Dan Melamed, Siddhartha Sen, Robert Schapire, and Alex Slivkins. Multiworld testing: A system for experimentation, learning, and decision-making. A white paper, available at https://github.com/Microsoft/mwt-ds/raw/master/images/MWT-WhitePaper.pdf, 2016.

Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, Siddhartha Sen, and Alex Slivkins. Making contextual decisions with low technical debt. CoRR arXiv:1606.03966, 2017.

Peter Auer, Nicol`o Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Exploiting the natural exploration in contextual bandits. CoRR arXiv:1704.09011, 2017.

Alberto Bietti, Alekh Agarwal, and John Langford. Practical evaluation and optimization of con- textual bandit algorithms. CoRR arXiv:1802.04064, 2018.

Kostas Bimpikis, Yiangos Papanastasiou, and Nicos Savva. Crowdsourcing exploration. Management Science, 2017. Forthcoming.

Sarah Bird, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. Exploring or ex- ploiting? Social and ethical implications of autonomous experimentation in AI. Available at SSRN: https://ssrn.com/abstract=2846909, also appeared at the Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

S´ebastien Bubeck and Nicolo Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi- armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1), 2012.

L. Elisa Celis and Nisheeth K Vishnoi. Fair personalization. CoRR arXiv:1707.02260, also appeared at the Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2017.

Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex ge- ometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

Yeon-Koo Che and Johannes H¨orner. Optimal design for social learning. Preprint, 2015.

Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, Special Issue on Social and Technical Trade-Offs, 2017.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

John D Cook. Upper and lower bounds for the normal distribution function, 2009.

Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Linden- strauss. Random Structures & Algorithms, 22(1):60–65, 2003.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), 2012.

Peter Frazier, David Kempe, Jon M. Kleinberg, and Robert Kleinberg. Incentivizing exploration. In ACM Conference on Economics and Computation (ACM EC), 2014.

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics Probability Letters, 135:1–6, 2018.

Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in learning: Clas- sic and contextual bandits. In Advances in Neural Information Processing Systems (NIPS), 2016.

Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. CoRR arXiv:1801.03423, 2018.

Michael Kearns, Aaron Roth, and Zhiwei Steven Wu. Meritocratic fairness for cross-population selection. In International Conference on Machine Learning (ICML), 2017.

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair deter- mination of risk scores. In Innovations in Theoretical Computer Science (ITCS), 2017.

Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.

Ilan Kremer, Yishay Mansour, and Motty Perry. Implementing the wisdom of the crowd. Journal of Political Economy, 122:988–1012, 2014.

John Langford and Tong Zhang. The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits. In Advances in Neural Information Processing Systems (NIPS), 2007.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to per- sonalized news article recommendation. In International World Wide Web Conference (WWW), 2010.

Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C. Parkes. Cali- brated fairness in bandits. CoRR arXiv:1707.01875, also appeared at the Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2017.

Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible bandit exploration. In ACM Conference on Economics and Computation (ACM EC), 2015.

Yishay Mansour, Aleksandrs Slivkins, and Zhiwei Steven Wu. Competing bandits: Learning under competition. In Innovations in Theoretical Computer Science (ITCS), 2018.

Philippe Rigollet and Assaf Zeevi. Nonparametric Bandits with Covariates. In Conference on Learning Theory (COLT), 2010.

Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, 2004.

Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

Throughout the paper, we use a number of tools that are either known or easily follow from something that is known. We move these tools to a separate appendix so as not to interrupt the flow. We provide the proofs for the sake of completeness.

A.1. (Sub)gaussians and Concentration

We rely on several known facts about Gaussian and subgaussian random variables. A random variable  X is called σ-subgaussian, for some  σ > 0, if E[eσX2] < ∞. This includes variance-σ2Gaussian random variables as a special case.

Lemma 30 If  X ∼ N(0, σ2), then for any  t ≥ 0,

image

Proof We begin with

image

X can be represented as  X = σY , where Yis a standard normal random variable. Using a tail bound for the latter (from Cook (2009)),

image

The numerator in (41) is

image

Combining, we have

image

For  t ≤ σ, E [X | X ≥ t] ≤ E [X | X ≥ σ] ≤ 2σby the above bound.

Lemma 31 Suppose  X ∼ N(0, Σ)is a Gaussian random vector with covariance matrix  Σ. Then

image

Proof Assume without loss of generality that  Σis diagonal, since the norm is rotationally invariant. Observe that  ∥X∥2 | ∀i Xi > αstochastically dominates  ∥X∥2 | ∥X∥2 > α. (Geometrically, the latter conditioning shifts the probability mass away from the origin.) Therefore,

image

by Lemma 30, where  λi(Σ) ≤ λmax(Σ) is the ith eigenvalue of  Σ.

Fact 32 If  X is a σ-subgaussian random variable, then

image

Lemma 33 If  X1, . . . , Xnare independent  σ-subgaussian random variables, then

image

Proof For any  Xi, we know from Fact 32 that

image

A union bound completes the proof.

Lemma 34 If  X1, . . . , XKare independent zero-mean  σ-subgaussian random variables, then

image

Proof Let  X = max Xi. Since each  Xi is σ-subgaussian, it follows that

image

Using Jensen’s inequality, we have

image

Rearranging, we have

image

Lemma 35 If  θ ∼ N(θ, Σ) where

Proof From Chandrasekaran et al. (2012), the expected norm of a standard normal d-dimensional Gaussian is at most√d. Using the fact that  Σ−1/2(θ −

image

Lemma 36 (Lemma 2.2 in Dasgupta and Gupta (2003)) If  X ∼ χ2(d), i.e., X = �di=1 X2i ,where  X1 , . . . , Xdare independent standard Normal random variables, then

image

Lemma 37 (Hoeffding bound) If ¯X = 1n�ni=1 Xi, where the  Xi’s are independent  σ-subgaussian random variables with zero mean, then

image

A.2. KL-divergence

We use some basic facts about KL-divergence. Let us recap the definition: given two distributions P, Q on the same finite outcome space  Ω, KL-divergence from P to Q is

image

Lemma 38 (High-probability Pinsker Inequality (Tsybakov, 2009)) For any probability distributions P and Q over the same sample space and any arbitrary event E,

image

Lemma 39 Let P and Q be Bernoulli distributions with means  p ∈ [1/2 − ε, 1/2 + ε] and q ∈[1/2 − ε, 1/2 + ε]respectively, with  ε ≤ 1/4. Then KL(P || Q) ≤ 73 ε2.

Proof For any  ε ≤ 1/4,

image

A.3. Linear Algebra

We use several facts from linear algebra. In what follows, recall that  λmin(M) and λmax(M) denotethe minimal and the maximal eigenvalues of matrix M, resp. For two matrices A, B, let us write B ⪰ Ato mean that  B − Ais positive semidefinite.

Lemma 40  λmax(vv⊤) = ∥v∥22 for any v ∈ Rd.

Proof  vv⊤ has rank one, so it has one eigenvector with nonzero eigenvalue. v is an eigenvector since  (vv⊤)v = (v⊤v)v, and it has eigenvalue  v⊤v = ∥v∥22. This is the only nonzero eigenvalue, so  λmax(vv⊤) = ∥v∥22.

Lemma 41 For symmetric matrices A, B with B invertible,

image

Proof

image

Lemma 42 If  A ⪰ 0 and B ⪰ 0, then λmin(A + B) ≥ λmin(A).

Proof

image

A.4. Logarithms

We use several variants of standard inequalities about logarithms.

Lemma 43  x ≥ log(ex) for all x > 0.

Proof This is true if and only if  x − log(ex) ≥ 0 for x > 0. To show this, observe that

1. At x = 1, this holds with equality.

2. At x = 1, the derivative is

image

This proves the lemma.

Corollary 44  x − log x ≥ e−1e x.

Proof Using Lemma 43 and letting z = x/e,

image

Lemma 45  log� 11−x�≤ 7x6 for any x ∈ [0, 1/4].

Proof First, we note that

image

Integrating both sides, we have

image

for some constant C that does not depend on x. Taking x = 0 yields C = 0. Therefore,

image


Designed for Accessibility and to further Open Science