The Externalities of Exploration and How Data Diversity Helps Exploitation

2018·arXiv

Abstract

1. Introduction

Online learning algorithms are a key tool in web search and content optimization, adaptively learning what users want to see. In a typical application, each time a user arrives, the algorithm chooses among various content presentation options (e.g., news articles to display), the chosen content is presented to the user, and an outcome (e.g., a click) is observed. Such algorithms must balance exploration (making potentially suboptimal decisions now for the sake of acquiring information that will improve decisions in the future) and exploitation (using information collected in the past to make better decisions now). Exploration could degrade the experience of a current user, but improves user experience in the long run. This exploration-exploitation tradeoff is commonly studied in the online learning framework of multi-armed bandits (Bubeck and Cesa-Bianchi, 2012).

Concerns have been raised about whether exploration in such scenarios could be unfair, in the sense that some individuals or groups may experience too much of the downside of exploration without sufficient upside (Bird et al., 2016). We formally study these concerns in the linear contextual bandits model (Li et al., 2010; Chu et al., 2011), a standard variant of multi-armed bandits appropriate for content personalization scenarios. We focus on externalities arising due to exploration, that is, undesirable side effects that the presence of one party may impose on another.

We first examine the effects of exploration at a group level. We introduce the notion of a group externality in an online learning system, quantifying how much the presence of one population (which we dub the majority) impacts the rewards of another (the minority). We show that this impact can be negative, and that, in a particular precise sense, no algorithm can avoid it. This cannot be explained by the absence of suitably good policies since our adoption of the linear contextual bandits framework implies the existence of a feasible policy that is simultaneously optimal for everyone. Instead, the problem is inherent to the process of exploration. We come to a surprising conclusion that more data can sometimes lead to worse outcomes for the users of an explore-exploit-based system.

We next turn to the effect of exploration at an individual level. We interpret exploration as a potential externality imposed on the current user by future users of the system. Indeed, it is only for the sake of the future users that the algorithm would forego the action that currently looks optimal. To avoid this externality, one may use the greedy algorithm that always chooses the action that appears optimal according to current estimates of the problem parameters. While this greedy algorithm performs poorly in the worst case, it tends to work well in many applications and experiments.1

In a new line of work, Bastani et al. (2017) and Kannan et al. (2018) analyzed conditions under which inherent diversity in the data makes explicit exploration unnecessary. Kannan et al. (2018) proved that the greedy algorithm achieves a regret rate of in expectation over small perturbations of the context vectors (which ensure sufficient data diversity). This is the best rate that can be achieved in the worst case (i.e., for all problem instances, without data diversity assumptions), but it leaves open the possibilities that (i) another algorithm may perform much better than the greedy algorithm on some problem instances, or (ii) the greedy algorithm may perform much better than worst case under the diversity conditions. We expand on this line of work. We prove that under the same diversity conditions, the greedy algorithm almost matches the best possible Bayesian regret rate of any algorithm on the same problem instance. This could be as low as polylog(T) for some instances, and, as we prove, at most whenever the diversity conditions hold.

Returning to group-level effects, we show that under the same diversity conditions, the negative group externalities imposed by the majority essentially vanish if one runs the greedy algorithm. Together, our results illustrate a sharp contrast between the high individual and group externalities that exist in the worst case, and the ability to remove all externalities if the data is sufficiently diverse.

Additional motivation. Whether and when explicit exploration is necessary is an important concern in the study of the exploration-exploitation tradeoff. Fairness considerations aside, explicit exploration is expensive. It is wasteful and risky in the short term, it adds a layer of complexity to algorithm design (Langford and Zhang, 2007; Agarwal et al., 2014), and its adoption at scale tends to require substantial systems support and buy-in from management (Agarwal et al., 2016, 2017). A system based on the greedy algorithm would typically be cheaper to design and deploy.

Further, explicit exploration can run into incentive issues in applications such as recommender systems. Essentially, when it is up to the users which products or experiences to choose and the algorithm can only issue recommendations and ratings, an explore-exploit algorithm needs to provide incentives to explore for the sake of the future users (Kremer et al., 2014; Frazier et al., 2014; Che and H¨orner, 2015; Mansour et al., 2015; Bimpikis et al., 2017). Such incentive guarantees tend to come at the cost of decreased performance, and rely on assumptions about human behavior. The greedy algorithm avoids this problem as it is inherently consistent with the users’ incentives.

Additional related work. Our research draws inspiration from the growing body of work on fairness in machine learning (e.g., Dwork et al., 2012; Hardt et al., 2016; Kleinberg et al., 2017; Chouldechova, 2017). Several other authors have studied fairness in the context of the contextual bandits framework. Our work differs from the line of research on meritocratic fairness in online learning (Kearns et al., 2017; Liu et al., 2017; Joseph et al., 2016), which considers the allocation of limited resources such as bank loans and requires that nobody should be passed over in favor of a less qualified applicant. We study a fundamentally different scenario in which there are no allocation constraints and we would like to serve each user the best content possible. Our work also differs from that of Celis and Vishnoi (2017), who studied an alternative notion of fairness in the context of news recommendations. According to this notion, all users should have approximately the same probability of seeing a particular type of content (e.g., Republican-leaning articles), regardless of their individual preferences, in order to mitigate the possibility of discriminatory personalization.

The data diversity conditions in Kannan et al. (2018) and this paper are inspired by the smoothed analysis framework of Spielman and Teng (2004), who proved that the expected running time of the simplex algorithm is polynomial for perturbations of any initial problem instance (whereas the worst-case running time has long been known to be exponential). Such disparity implies that very bad problem instances are brittle. We find a similar disparity for the greedy algorithm in our setting.

Our results on group externalities. A typical goal in online learning is to minimize regret, the (expected) difference between the cumulative reward that would have been obtained had the optimal policy been followed at every round and the cumulative reward obtained by the algorithm. We define a corresponding notion of minority regret, the portion of the regret experienced by the minority. Since online learning algorithms update their behavior based on the history of their observations, minority regret is influenced by the entire population on which an algorithm is run. If the minority regret is much higher when a particular algorithm is run on the full population than it is when the same algorithm is run on the minority alone, we can view the majority as imposing a negative externality on the minority; the minority population would achieve a higher cumulative reward if the majority were not present. Asking whether this can ever happen amounts to asking whether access to more data points can ever lead an explore-exploit algorithm to make inferior decisions. One might think that more data should always lead to better decisions and therefore better outcomes for the users. Surprisingly, we show that this is not the case, even with a standard algorithm.

Consider LinUCB (Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011), a standard algorithm for linear contextual bandits that is based on the principle of “optimism under uncertainty.” We provide a specific problem instance on which, after observing T users, LinUCB would have a minority regret of if run on the full population, but only constant minority regret if run on the minority alone. While stylized, this example is motivated by the problem of providing driving directions to different populations of users, about which fairness concerns have been raised (Bird et al., 2016). Further, the situation is reversed on a slight variation of this example: LinUCB obtains constant minority regret when run on the full population and on the minority alone. That is, group externalities can be large and positive in some cases, and large and negative in others.

Although these regret rates are specific to LinUCB, we show that this phenomenon is, in some sense, unavoidable. Consider the minority regret of LinUCB when run on the full population and the minority regret that LinUCB would incur if run on the minority alone. We know that one may be much smaller or larger than the other. We ask whether any algorithm could achieve the minimum of the two on every problem instance. Using a variation of the same problem instance, we prove that this is impossible; in fact, no algorithm could simultaneously approximate both up to any factor. In other words, an externality-free algorithm would sometimes “leave money on the table.”

In terms of techniques, we rely on the special structure of our example, which can be viewed as an instance of the sleeping bandits problem (Kleinberg et al., 2010). This simplifies the behavior and analysis of LinUCB, allowing us to obtain the O(1) upper bounds. The lower bounds are obtained using KL-divergence techniques to show that the two variants of our example are essentially indistinguishable, and an algorithm that performs well on one must obtain regret on the other.

Our results on the greedy algorithm. We consider a version of linear contextual bandits in which the latent weight vector is drawn from a known prior. In each round, an algorithm is presented several actions to choose from, each represented by a context vector. The expected reward of an action is a linear product of and the corresponding context vector. The tuple of context vectors is drawn independently from a fixed distribution. In the spirit of smoothed analysis, we assume that this distribution has a small amount of jitter. Formally, the tuple of context vectors is drawn from some fixed distribution, and then a small perturbation—small-variance Gaussian noise—is added independently to each coordinate of each context vector. This ensures arriving contexts are diverse. We are interested in Bayesian regret, i.e., regret in expectation over the Bayesian prior. Following the literature, we are primarily interested in the dependence on the time horizon T.

We focus on a batched version of the greedy algorithm, in which new data arrives to the algorithm’s optimization routine in small batches, rather than every round. This is well-motivated from a practical perspective—in high-volume applications data usually arrives to the “learner” only after a substantial delay (Agarwal et al., 2016, 2017)—and is essential for our analysis.

Our main result is that the greedy algorithm matches the Bayesian regret of any algorithm up to polylogarithmic factors, for each problem instance, fixing the Bayesian prior and the context distribution. We also prove that LinUCB achieves regret for each realization of . This implies a worst-case Bayesian regret of for the greedy algorithm under the perturbation assumption.

Our results hold for both natural versions of the batched greedy algorithm, Bayesian and frequentist, henceforth called BatchBayesGreedy and BatchFreqGreedy. In BatchBayesGreedy, the chosen action maximizes expected reward according to the Bayesian posterior. BatchFreqGreedy estimates using ordinary least squares regression and chooses the best action according to this estimate. The results for BatchFreqGreedy come with additive polylogarithmic factors, but are stronger in that the algorithm does not need to know the prior. Further, the regret bound for BatchFreqGreedy is approximately prior-independent, in the sense that it applies even to very concentrated priors such as independent Gaussians with standard deviation on the order of

The key insight in our analysis of BatchBayesGreedy is that any (perturbed) data can be used to simulate any other data, with some discount factor. The analysis of BatchFreqGreedy requires an additional layer of complexity. We consider a hypothetical algorithm that receives the same data as BatchFreqGreedy, but chooses actions based on the Bayesian-greedy selection rule. We analyze this hypothetical algorithm using the same technique as BatchBayesGreedy, and then upper bound the difference in Bayesian regret between the hypothetical algorithm and BatchFreqGreedy.

Our analyses extend to group externalities and (Bayesian) minority regret. In particular, we circumvent the impossibility result mentioned above. We prove that both BatchBayesGreedy and BatchFreqGreedy match the Bayesian minority regret of any algorithm run on either the full population or the minority alone, up to polylogarithmic factors

Detailed comparison with prior work. We substantially improve over the worst-case regret bound from Kannan et al. (2018), at the cost of some additional assumptions. First, we consider Bayesian regret, whereas their regret bound is for each realization of Second, they allow the context vectors to be chosen by an adversary before the perturbation is applied. Third, they extend their analysis to a somewhat more general model, in which there is a separate latent weight vector for every action (which amounts to a more restrictive model of perturbations). However, this extension relies on the greedy algorithm being initialized with a substantial amount of data. The results of Kannan et al. (2018) do not appear to have implications on group externalities.

Bastani et al. (2017) show that the greedy algorithm achieves logarithmic regret in an alternative linear contextual bandits setting that is incomparable to ours in several important ways. They consider two-action instances where the actions share a common context vector in each round, but are parameterized by different latent vectors. They ensure data diversity via a strong assumption on the context distribution. This assumption does not follow from our perturbation conditions; among other things, it implies that each action is the best action in a constant fraction of rounds. Further, they assume a version of Tsybakov’s margin condition, which is known to substantially reduce regret rates in bandit problems (e.g., see Rigollet and Zeevi, 2010).

2. Preliminaries

We consider the model of linear contextual bandits (Li et al., 2010; Chu et al., 2011). Formally, there is a learner who serves a sequence of users over T rounds, where T is the (known) time horizon. For the user who arrives in round t there are (at most) K actions available, with each action associated with a context vector . Each context vector may contain a mix of features of the action, features of the user, and features of both. We assume that the tuple of context vectors for each round t is drawn independently from a fixed distribution. The learner observes the set of contexts and selects an action for the user. The user then experiences a reward which is visible to the learner. We assume that the expected reward is linear in the chosen context vector. More precisely, we let be the reward of action a if this action is chosen in round t (so that ), and assume that there exists an unknown vector for any round t and action a. Throughout most of the paper, the realized rewards are either in {0, 1} or are the expectation plus independent Gaussian noise of variance at most 1. We sometimes consider a Bayesian version, in which the latent vector is initially drawn from some known prior P.

A standard goal for the learner is to maximize the expected total reward over T rounds, or . This is equivalent to minimizing the learner’s regret, defined as

where denotes the context vector associated with the best action at round t. We are mainly interested in expected regret, where the expectation is taken over the context vectors, the rewards, and the algorithm’s random seed, and Bayesian regret, where the expectation is taken over all of the above and the prior over

We introduce some notation in order to describe and analyze algorithms in this model. We write , the context vector chosen at time context matrix at time t, a matrix whose rows are vectors called the empirical covariance matrix, is an important concept in some of the prior work on linear contextual bandits (e.g., Abbasi-Yadkori et al., 2011; Kannan et al., 2018), as well as in this paper.

Optimism under uncertainty. Optimism under uncertainty is a common paradigm in problems with an explore-exploit tradeoff (Bubeck and Cesa-Bianchi, 2012). The idea is to evaluate each action “optimistically”—assuming the best-case scenario for this action—and then choose an action with the best optimistic evaluation. When applied to the basic multi-armed bandit setting, it leads to a well-known algorithm called UCB1 (Auer et al., 2002), which chooses the action with the highest upper confidence bound (henceforth, UCB) on its mean reward. The UCB is computed as the sample average of the reward for this action plus a term which captures the amount of uncertainty.

Optimism under uncertainty has been extended to linear contextual bandits in the LinUCB algorithm (Chu et al., 2011; Abbasi-Yadkori et al., 2011). The high-level idea is to compute a confidence region in each round with high probability, and choose an action a which maximizes the optimistic reward estimate . Concretely, one uses regression to form an empirical estimate . Concentration techniques lead to high-probability bounds of the form , where the interval width function f(t) may depend on hyperparameters and features of the instance. LinUCB simply chooses an action

Among other results, Abbasi-Yadkori et al. (2011) use

where L and S are known upper bounds on , respectively, and is a parameter. For any , they obtain regret , with only a polylog dependence on TL/d.

3. Group Externality of Exploration

In this section, we study the externalities of exploration at a group level, quantifying how much the presence of one population impacts the rewards of another in an online learning system. We consider linear contextual bandits in a setting in which there are two underlying user populations, called the majority and the minority. The user who arrives at round t is assumed to come from the majority population with some fixed probability and the minority population otherwise, and the population from which the user comes is known to the learner. The tuple of context vectors at time t is then drawn independently from a fixed group-specific distribution.

We assume there is a single hidden vector , and that the distribution of rewards conditioned on the chosen context vector is the same for both groups. Only the distribution over tuples of available context vectors differs between groups. This implies that externalities cannot be explained by the absence of a good policy, since there always exists a policy that is simultaneously optimal for everyone. This allows us to focus only on externalities inherent to the process of exploration.

We define the minority regret to be the regret experienced by the minority. The group externality imposed on the minority by the majority is then the difference between the minority regret of an algorithm run on the minority alone and the minority regret of the same algorithm run on the full population. A negative group externality implies that the minority is worse off due to the presence of the majority. It is generally more meaningful to bound the multiplicative difference between the minority regret obtained with and without the majority present. Several of our results have this form.

We first ask whether large group externalities can exist. We show that on a simple toy example, a large negative group externality arises under LinUCB, while a slight variant of this example leads to a large positive externality. Put another way, more available data can lead to either better or worse outcomes for the users of a system. We show that this general phenomenon is unavoidable. That is, no algorithm can simultaneously approximate the minority regret of LinUCB run on the full population and LinUCB run on the minority alone, up to any multiplicative factor.

3.1. Two-Bridge Instance

We consider a toy example, motivated by a scenario in which a learner is choosing driving routes for two groups of users. Each user starts at point A, B, or C, and wants to get to the same destination, point D, which requires taking one of two bridges, as shown in Figure 1. The travel costs for each of the two bridges are unknown. For simplicity, assume all other edges are known to have 0 cost.

Suppose that 95% of users are in the majority group. All of these users start at point A and have access only to the top bridge. The other 5% are in the minority. Of these users, 95% start at point C, from which they have access only to the bottom bridge. The remaining 5% of the minority users start at point B, and have access to both bridges.

Figure 1: Visual illustration of the two-bridge instance.

Consider the behavior of an algorithm that follows the principle of optimism under uncertainty. If run on the full user population, it will quickly collect many observations of the commute time for the top bridge since all users in the majority group must travel over the top bridge. It will collect relatively fewer observations of the commute time over the bottom bridge. Therefore, when the algorithm is faced with a member of the minority population who starts at point B, the algorithm will likely send this user over the bottom bridge in order to collect more data and improve its estimate.

If the same algorithm is instead run on the minority alone, it will quickly collect many more observations of the commute time for the bottom bridge relative to the top. Now when the algorithm is faced with a user who starts at point B, it will likely send her over the top bridge.

Which is better depends on which bridge has the longer commute time. If the top bridge is the better option, then the presence of the majority imposes a negative externality on the minority. If not, then the presence of the majority helps. These two scenarios may be difficult to distinguish.

This toy example can be formalized in the linear contextual bandits framework. There are two underlying actions (the two bridges), but these actions are not always available. To capture this, we define a parameter vector , with the two coordinates representing the expected rewards for taking the top and bottom bridge respectively. (Though we motivated the example in terms of costs, it can be expressed equivalently in terms of rewards.) There are two possible context vectors: . A user has available an action with context vector if and only if she has access to the top bridge. Similarly, she has available an action with context vector and only if she has access to the bottom bridge. The instance can then be formalized as follows.

Definition 1 (Two-Bridge Instance) The two-bridge instance is an instance of linear contextual bandits. On each round t, the user who arrives is from the majority population with probability 0.95, in which case . Otherwise, the user is in the minority. In this case, with probability 0.95, (based on Figure 1, we call these C rounds), while with probability 0.05, rounds). We consider two values for the hidden parameter vector

3.2. Performance of LinUCB

We start by analyzing the performance of LinUCB on the two-bridge instance. Our main result formalizes the intuition above, showing that when (that is, the top bridge is better) the majority imposes a large negative group externality on the minority, while the majority imposes a large positive externality when . We assume rewards are 1-subgaussian.3

Theorem 2 Consider LinUCB with any interval width function f satisfying On the two-bridge instance, assuming 1-subgaussian noise on the rewards, when achieves expected minority regret O(1) when run on the minority alone, but when run on the full population. In contrast, when , LinUCB achieves expected minority regret O(1) when run on the full population, but when run on the minority alone.

We omit the proofs of the lower bounds, which both follow a similar structure to the one used in the proof of the general impossibility result in Section 3.3; in fact, both of these lower

3. A random variable -subgaussian if . A special case is Gaussians with variance 4. For instance, the interval width function in Equation (3) satisfies this condition whenever , so one can either set , or add two more dimensions to the problem instance (and set

bounds could be stated as an immediate corollary of Theorem 4. Essentially, an argument based on KL-divergence shows that it is difficult to distinguish between the case in which case in which , and therefore LinUCB must choose similar actions in these two cases.

To prove the O(1) upper bounds, we make heavy use of the special structure of the two-bridge instance, which significantly simplifies the analysis of LinUCB. We exploit the fact that the only context vectors available to the learner are the basis vectors , which essentially makes this an instance of sleeping bandits (Kleinberg et al., 2010). In this special case, the covariance matrix is always diagonal, which simplifies Equation (2) and leads to LinUCB choosing the ith basis, where is simply the number of times that this basis vector was already chosen. Additionally, in this setting is just the average reward observed for the ith basis vector, allowing us to bound the difference between each using standard concentration techniques. Using this, we show that with high probability, after a logarithmic number of rounds—during which the learner can amass at most O(1) regret since the worst-case regret on any round is —the probability that LinUCB chooses the wrong action on a B round is small (). This leads to constant regret on expectation.

The proof makes use of the following concentration bound:

Lemma 3 Let be the number of C rounds observed in the first t minority rounds in the two-bridge instance. For any , with probability at least

Proof We apply the following form of the Chernoff bound:

Setting

for . Applying a union bound over all T rounds, we have with probability at least

Proof of Theorem 2 Now, consider LinUCB run on the minority population alone on the two-bridge instance with . Since we are considering running LinUCB on the minority only, majority rounds are irrelevant, so throughout this proof we abuse notation and use to index minority rounds. T is still the total number of (minority plus majority) rounds.

This proof heavily exploits the special structure of the two-bridge instance to simplify the analysis of LinUCB. In particular, we exploit the fact that the only contexts ever available are the basis vectors . This implies that the covariance matrix is always diagonal, which greatly simplifies the expression for the chosen action in Equation (2). The optimistic estimate of the reward for choosing the ith basis vector is simply

Additionally, in this special case, is simply the number of times that the ith basis vector was chosen over the first t minority rounds, and is the average reward observed over the rounds on which it was chosen.

and . Since rewards were assumed to be 1-subgaussian, Lemma 37 and a union bound give us that for any , with probability at least

Let be the number of B and C rounds respectively before round t. By Lemma 3, for any , with probability . Suppose this is the case. Since it is only possible to choose [1 0] on B rounds, we have . Similarly, since the algorithm can only choose and using the assumption that , Equations (4) and (5) then imply that for any , with probability at least

and similarly,

This shows that with probability at least , after the first rounds, LinUCB picks round with probability at least , leading to zero regret on that round. To turn this into a bound on expected regret, first note that with at most probability, the argument above fails to hold, in which case the minority regret is still bounded by the argument above holds, LinUCB may suffer up to regret on each of the first minority rounds. On each additional round, there is a failure probability of , and in this case LinUCB again suffers regret of at most . Putting this together and setting , we get that the expected regret is bounded by

3.3. An Impossibility Result

It is natural to ask whether it is possible to design an algorithm that can distinguish between the two scenarios analyzed above, obtaining minority regret that is close to the best of LinUCB run on the minority alone and LinUCB run on the full population on any problem instance. In this section, we show that the answer is no. In particular, we prove that on the two-bridge instance, if , then any algorithm must suffer regret on expectation (and therefore minority regret, since all regret is incurred by minority users).

To prove this result, we begin by formalizing the idea that it is hard to distinguish between the case in which and the case in which . To do so, we bound the KL-divergence between the joint distributions over the sequences of context vectors, actions taken by the given algorithm, and the given algorithm’s rewards that are induced by the two choices of . By applying the high-probability Pinsker lemma (Tsybakov, 2009), we show that a low KL-divergence between these distributions implies that the algorithm must be likely either to choose the top bridge on B rounds more than half the time when the bottom bridge is better or to choose the bottom bridge on B rounds more than half the time when the top bridge is better, either of which would lead to high () regret as long as the number of B rounds is sufficiently large. To finish the proof, we use a simple Chernoff bound to show that the number of B rounds is large with high probability.

To derive the KL-divergence bound, we make use of the assumption that the realized rewards at each round are either 0 or 1. This assumption is not strictly necessary. An analogous argument could be made, for instance, for real-valued rewards with Gaussian noise.

Theorem 4 On the two-bridge instance with realized rewards , any algorithm must incur minority regret in expectation when

Note that “any algorithm” here includes algorithms run on the minority alone, essentially ignoring data from the majority. Theorems 2 and 4 immediately imply the following corollary.

Corollary 5 No algorithm can simultaneously approximate the minority regret of both LinUCB run on the minority and LinUCB run on the full population up to any multiplicative factor.

Proof of Theorem 4 Fix any algorithm A. We will first derive an lower bound on the expected regret of A conditioned on the number of , being large. To complete the proof, we then show that is large with high probability.

Let be a history of all context vectors, chosen actions, and re- wards up to round on the two-bridge instance with a distribution over histories denote the conditional distribution of these histories, conditioned on the event that . That is, we define

We first show that KLis upper bounded a constant that does not depend on T. By the chain rule for KL divergences, since is independent of any previous contexts, actions, or rewards conditioned on

Since the choice of context vectors available at time t is independent of the value of the parameter and A may only base its choices on the observed history and current choice of contexts, it is always the case that , so the first sum in this expression is equal to 0.

To bound the second sum, we make use of the assumption that then tells us that for any round the probability of getting reward 1 conditioned on a chosen context is always either Putting this together, we get that

Now, let E be the event that the algorithm A chooses arm 2 on at least half of the B rounds, conditioned on occurs when , the regret of A is at least is on the order ofdoes not occur (i.e.,E occurs) when again has regret at least . We will use the bound on KL divergence to show that one of these bad cases happens with high probability.

By Lemma 38,

Let R be the regret of A. We then have that

It remains to bound the probability that . By a Chernoff bound,

Thus, for any , then with probability at least In particular, let

This completes the proof that the regret of on this problem instance.

4. Greedy Algorithms and LinUCB with Perturbed Contexts

We now turn our attention to externalities at an individual level. We interpret exploration as a potential externality imposed on the current user of a system by future users, since the current user would prefer the learner to take the action that appears optimal. One could avoid such externalities by running the greedy algorithm, which does just that, but it is well known that the greedy algorithm performs poorly in the worst case. In this section, we build on a recent line of work analyzing the conditions under which inherent data diversity leads the greedy algorithm to perform well.

We analyze the expected performance of the greedy algorithm under small random perturbations of the context vectors. We focus on greedy algorithms that consume new data in batches, rather than every round. We consider both Bayesian and frequentist versions, BatchBayesGreedy and BatchFreqGreedy. Our main result is that for any specific problem instance, both algorithms match the Bayesian regret of any algorithm on that particular instance up to polylogarithmic factors. We also prove that under the same perturbation assumptions, LinUCB achieves regret realization of , which implies a worst-case Bayesian regret of for the greedy algorithms. Finally, we repurpose our analysis to derive a positive result in the group setting, implying that the impossibility result of Section 3.3 breaks down when the data is sufficiently diverse.

Setting and notation. We consider a Bayesian version of linear contextual bandits, with from a known multivariate Gaussian prior and invertible To capture the idea of data diversity, we assume the context vectors on each round t are generated using the following perturbed context generation process: First, a tuple mean context vectors is drawn independently from some fixed distribution where means action a is not available. For each available action a, the context vector is then is a vector of random noise. Each component of is drawn independently from a zero-mean Gaussian with standard deviation . We refer to perturbation size. In general, our regret bounds deteriorate if is very small. Together we refer to a distribution , and perturbation size problem instance. We make several technical assumptions. First, the distribution is such that each context vector has bounded 2-norm, i.e., . It can be arbitrary otherwise. Second, the perturbation size needs to be sufficiently small, . Third, the realized reward for each action a and round , the mean reward plus standard Gaussian noise The history up to round t is denoted by the tuple

The greedy algorithms. For the batch version of the greedy algorithm, time is divided in batches of Y consecutive rounds each. When forming its estimate of the optimal action at round t, the algorithm may only use the history up to the last round of the previous batch, denoted

BatchBayesGreedy forms a posterior over using prior P and history it chooses the action that maximizes reward in expectation over this posterior. This is equivalent to choosing

BatchFreqGreedy does not rely on any knowledge of the prior. It chooses the best action according to the least squares estimate of , computed with respect to history

4.1. Main Results

We first state our main results before describing the intuition behind them. We state each theorem in terms of the main relevant parameters . First, we prove that in expectation over the random perturbations, both greedy algorithms favorably compare to any other algorithm.

Theorem 6 With perturbed context generation, there is some with batch duration , the following holds. Fix any bandit algorithm, and let

Bayesian regret on a particular problem instance. Then on that same instance, (a) BatchBayesGreedy has Bayesian regret at most (b) BatchFreqGreedy has Bayesian regret at most

Our next result asserts that the Bayesian regret for LinUCB and both greedy algorithms is on the order of (at most) . This result requires additional technical assumptions.

Theorem 7 Assume that the maximal eigenvalue of the covariance matrix of the prior P is at most and the mean vector satisfies . With perturbed context generation,(a) With appropriate parameter settings, LinUCB has Bayesian regret (b) If as in Theorem 6, then both BatchBayesGreedy and BatchFreqGreedy have Bayesian regret at most

The assumption in Theorem 7 can be replaced with We use Theorem 7(b) to derive an “approximately prior-independent” result for BatchFreqGreedy. (For clarity, we state it for independent priors.) The bound in Theorem 7(b) deteriorates if P gets very sharp, but it suffices if P has standard deviation on the order of (at least)

Corollary 8 Assume that the prior P is independent over the components of , with variance 1 in each component. Suppose the mean vector satisfies . With perturbed context generation, if as in Theorem 6, then BatchFreqGreedy has Bayesian regret at most as long as

Finally, we derive a positive result on group externalities. We find that with perturbed context generation, the minority Bayesian regret of the greedy algorithms (i.e., the Bayesian regret incurred on minority rounds) is small compared to the minority Bayesian regret of any algorithm, whether run on the full population or on the minority alone. This sidesteps the impossibility result of Section 3.3.

Theorem 9 Assume as in Theorem 6 and perturbed context generation. Fix any bandit algorithm and instance, and let be the minimum of its minority Bayesian regrets when it is only run over minority rounds or when it is run over the full population. Both greedy algorithms run on the full population achieve minority Bayesian regret at most

4.2. Key Techniques

The key idea behind our approach is to show that, with perturbed context generation, BatchBayesGreedy collects data that is informative enough to “simulate” the history of contexts and rewards from the run of any other algorithm ALG over fewer rounds. This implies that it remains competitive with ALG since it has at least as much information and makes myopically optimal decisions.

We use the same technique to prove a similar simulation result for BatchFreqGreedy. To treat both algorithms at once, we define a template that unifies them. A bandit algorithm is called batch-greedy-style if it divides the timeline in batches of Y consecutive rounds each, in each round t chooses some estimate , based only on the data from the previous batches, and then chooses the best action according to this estimate, so that . For a batch that starts at round batch history is the tuple batch context matrix is the matrix X whose rows are vectors . Similarly to the “empirical covariance matrix”, we define the batch covariance matrix as

Let us formulate what we mean by “simulation”. We want to use the data collected from a single batch in order to simulate the reward for any one context x. More formally, we are interested in the randomized function that takes a context x and outputs an independent random sample from . We denote it ; this is the realized reward for an action with context vector x.

Definition 10 Consider batch B in the execution of a batch-greedy-style algorithm. Batch history can simulate up to radius R > 0 if there exists a function g : {context vectors{batch histories is identically distributed to conditional on the batch context matrix, for all and all context vectors

Let us comment on how it may be possible to simulate . For intuition, suppose that is distributed as is drawn independently from . Thus, we can define in Definition 10. We generalize this idea and show that a batch history can simulate as long as the batch covariance matrix has a sufficiently large minimum eigenvalue, which holds with high probability when the batch size is large.

Lemma 11 With perturbed context generation, there is some such that with probability at least any batch history from a batch-greedy-style algorithm can simulate up to radius R, as long as

If the batch history of an algorithm can simulate , the algorithm has enough information to simulate the outcome of a fresh round of any other algorithm ALG. Eventually, this allows us to use a coupling argument in which we couple a run of BatchBayesGreedy with a slowed-down run of ALG, and prove that the former accumulates at least as much information as the latter, and therefore the Bayesian-greedy action choice is, in expectation, at least as good as that of ALG. This leads to Theorem 6(a). We extend this argument to a scenario in which both the greedy algorithm and ALG measure regret over a randomly chosen subset of the rounds, which leads to Theorem 9.

To extend these results to BatchFreqGreedy, we consider a hypothetical algorithm that receives the same data as BatchFreqGreedy, but chooses actions based on the (batched) Bayesian-greedy selection rule. We analyze this hypothetical algorithm using the same technique as above, and then argue that its Bayesian regret cannot be much smaller than that of BatchFreqGreedy. Intuitively, this is because the two algorithms form almost identical estimates of , differing only in the fact that the hypothetical algorithm uses the P as well as the data. We show that this difference amounts to effects on the order of 1/t, which add up to a maximal difference of O(log T) in Bayesian regret.

5. Analysis: LinUCB with Perturbed Contexts

In this section, we prove Theorem 7(a), a Bayesian regret bound for the LinUCB algorithm under perturbed context generation. We focus on a version of LinUCB from Abbasi-Yadkori et al. (2011), as defined in (3) on page 7.

Recall that the interval width function in (3) is parameterized by numbers

Recall that denotes perturbation size, and

Remark 12 Ideally we would like to set L, S according to (8) with equalities. We consider a more permissive version with inequalities so as to not require the exact knowledge of

While the original result in Abbasi-Yadkori et al. (2011) requires in our setting this only happens with high probability.

We prove the following theorem (which implies Theorem 7(a)):

Theorem 13 Assume perturbed context generation. Further, suppose that the maximal eigenvalue of the covariance matrix of the prior P is at most 1, and the mean vector satisfies . The version of LinUCB with interval width function (3) and parameters given by (8) has Bayesian regret at most

Remark 14 The theorem also holds if the assumption on is replaced with only change in the analysis is that in the concluding steps (Section 5.2), we use Lemma 17(b) instead of Lemma 17(a).

On a high level, our analysis proceeds as follows. We massage algorithm’s regret so as to elucidate the dependence on the number of rounds with small “gap” between the best and second-best action, call it N. This step does not rely on perturbed context generation, and makes use of the analysis from Abbasi-Yadkori et al. (2011). The crux is that we derive a much stronger upper-bound on E [N] under perturbed context generation. The analysis relies on some non-trivial technicalities on bounding the deviations from the “high-probability” behavior, which are gathered in Section 5.1.

We reuse the analysis in Abbasi-Yadkori et al. (2011) via the following lemma.8 To state this lemma, define the instantaneous regret at time

Lemma 15 (Abbasi-Yadkori et al. (2011)) Consider a problem instance with reward noise N(0, 1) and a specific realization of latent vector and contexts . Consider LinUCB with parameters that satisfy

The following lemma captures the essence of the proof of Theorem 13. From here on, we assume perturbed context generation. In particular, reward noise is N(0, 1).

Lemma 16 Suppose parameter L is set as in (8). Consider a problem instance with a specific realization of . Then for any

Proof We will prove that for any

where We use Lemma 15 to upper-bound the second summand in (11). To this end, we condition on the

event that every component of every perturbation has absolute value at mostdenote this event by U. This implies for all actions a and all rounds t. By Lemma 33, U is a high-probability event: . Now we are ready to apply Lemma 15:

To plug this into (11), we need to account for the low-probability event . We need to be careful because could, with low probability, be arbitrarily large. By Lemma 18 with

Putting this together and using (12), we obtain:

difference in expected rewards between the best and second-best actions at time t (where “best” and “second-best” is according to expected rewards). Here, we’re taking expectations after the perturbations are applied, so the only randomness comes from the noisy rewards. Consider the set of rounds with small gap, . Notice that

In what follows we prove an upper bound on . This is the step where perturbed context generation is truly used. For any two arms , the gap between their expected rewards is

Therefore, the probability that the gap between those arms is smaller than

Since are both distributed as , their difference is The maximum value that the Gaussian measure takes is , and the measure in any interval of width is therefore at most . This gives us the bound

Union-bounding over allpairs of actions, we have

Plugging this into (13) (recalling that ) completes the proof.

5.1. Bounding the Deviations

We make use of two results that bound deviations from the “high-probability” behavior, one on and another on instantaneous regret. First, we prove high-probability upper and lower bounds on under the conditions in Theorem 13. Essentially, these bounds allow us to use Lemma 16.

Lemma 17 Assume the latent vector comes from a multivariate Gaussian, the covariate matrix

Proof We consider two cases, based on whether . We need both cases to prove part (a), and we obtain part (b) as an interesting by-product. We repeatedly use Lemma 36, a concentration inequality for random variables, to show concentration on the Gaussian norm.

Similarly, we can show

By the triangle inequality,

Thus, in this case, with probability at least

By Lemma 36,

Because with probability at least

Next, we show how to upper-bound expected instantaneous regret in the worst case.9

Lemma 18 Fix round t and parameter , conditioned on any history event that for each arm a, the expected instantaneous regret of any algorithm at round t is at most

Proof The expected regret at round t is upper-bounded by the reward difference between the best arm and the worst arm

Since has symmetry about the origin conditioned on the event that at least one component of one of the perturbations has absolute value at least have equal likelihood, are identically distributed. Let be the event that at least one of the components of one of the perturbations has absolute value at least . This means for any choice

where the expectation is taken over the perturbations at time t. Without loss of generality, let be the component such that . Then, all other components have distribution

Let otherwise. In other words, we simply zero out the component . Then, this is

because by Lemma 34,

Next, note that by symmetry and since

Putting this all together, the expected instantaneous regret is bounded by

proving the lemma.

5.2. Finishing the Proof of Theorem 13.

We focus on the “nice event” that (14) holds, denote it E for brevity. In particular, note that it implies guarantees that expected regret under this event, E [Regret(T) | E], is upper-bounded by the expression (9) in the theorem statement.

In what follows we use Lemma 17(a) and Lemma 18 guarantee that if E fails, then the corresponding contribution to expected regret is small. Indeed, Lemma 18 with implies that

where is the “blow-up factor”. Since (14) fails with probability at most

The antecedent inequality follows by Lemma 31 with , using the assumption that . The theorem follows.

6. Analysis: Greedy Algorithms with Perturbed Contexts

We present the proofs for our results on greedy algorithms in Section 4.10 This section is structured as follows. In Section 6.1, we quantify the diversity of data collected by batch-greedy-style algorithms, assuming perturbed context generation. In Section 6.2, we show that a sufficiently “diverse” batch history suffices to simulate the reward for any given context vector, in the sense of Defini-tion 10. Jointly, these two subsections imply that batch history generated by a batch-greedy-style algorithm can simulate rewards with high probability, as long as the batch size is sufficiently large. Section 6.3 builds on this foundation to derive regret bounds for BatchBayesGreedy. The crux is that the history collected by BatchBayesGreedy suffices to simulate a “slowed-down” run of any other algorithm. This analysis extends to a version of BatchFreqGreedy equipped with a Bayesian-greedy prediction rule (and tracks the performance of the prediction rule). Finally, Section 6.4 derives the regret bounds for BatchFreqGreedy, by comparing the prediction-rule version of BatchFreqGreedy with BatchFreqGreedy itself. To derive the results on group externalities, we present all our analysis in Sections 6.3 and 6.4 in a more general framework in which only the minority rounds are counted for regret.

Preliminaries. We assume perturbed context generation in this section, without further mention. Throughout, we will use the following parameters as a shorthand:

Recall that denotes perturbation size, and d is the dimension. The meaning of they are high-probability upper bounds on the perturbations and the contexts, respectively. More formally, by Lemma 34 we have:

Let us recap some of the key definitions from Section 4.2. We consider batch-greedy-style algorithms, a template that unifies BatchBayesGreedy and BatchFreqGreedy. A bandit algorithm is called batch-greedy-style if it divides the timeline in batches of Y consecutive rounds each, in each round t chooses some estimate , based only on the data from the previous batches, and then chooses the best action according to this estimate, so that

For a batch B that starts at round batch history is the tuple [Y ]), and the batch context matrix is the matrix whose rows are vectors Here and elsewhere, batch covariance matrix is defined as

6.1. Data Diversity under Perturbations

We are interested in the diversity of data collected by batch-greedy-style algorithms, assuming perturbed context generation. Informally, the observed contexts should cover all directions in order to enable good estimation of the latent vector . Following Kannan et al. (2018), we quantify data diversity via the minimal eigenvalue of the empirical covariance matrix . More precisely, we are interested in proving that is sufficiently large. We adapt some tools from Kannan et al. (2018), and then derive some improvements for batch-greedy-style algorithms.

Kannan et al. (2018) prove that grows linearly in time t, assuming t is sufficiently large.

Lemma 19 (Kannan et al. (2018)) Fix any batch-greedy-style algorithm. Consider round where . Then for any realization of , with probability

Proof The claimed conclusion follows from an argument inside the proof of Lemma B.1 from Kannan et al. (2018), plugging in . This argument applies for any

Recall that is the sum . A key step in the proof of Lemma 19 zeroes in onthe expected contribution of a single round. We use this tool separately in the proof of Lemma 22.

Lemma 20 (Kannan et al. (2018)) Fix any batch-greedy-style algorithm, and the latent vector Assume . Condition on the event that all perturbations are upper-bounded by it with E. Then with probability at least

Proof The proof is easily assembled from several pieces in the analysis in Kannan et al. (2018). Let be the algorithm’s estimate for Kannan et al. (2018), define

where depends on all perturbations other than the perturbation for . Let us say that “good” for arm a if

First we argue that

Indeed, in the proof of their Lemma 3.4, Kannan et al. (2018) show that for any round, conditioned on E, if the probability that arm a was chosen over the randomness of the perturbation is at least 2/T, then the round is good for a with probability at least be the set of arms at round t with probability at most 2/T of being chosen over the randomness of the perturbation. Then,

This is where we use conditioning on the event . We plug in into Lemma 3.2 of Kannan et al. (2018). This lemma applies because with these parameters, the perturbed distribution of context arrivals satisfies the diversity condition from Kannan et al. (2018). The latter is by Lemma 3.6 of Kannan et al. (2018). This completes the proof of (19). The lemma follows from (18) and (19).

Let be the BatchFreqGreedy estimate for , as defined in (7). We are interested in quantifying how the quality of this estimate improves over time. Kannan et al. (2018) prove, essentially, that the distance between

Lemma 21 (Kannan et al. (2018)) Consider any round t in the execution of BatchFreqGreedy. Let be the last round of the previous batch. For any , with probability

We focus on batch covariance matrix of a given batch in a batch-greedy-style algorithm. We would like to prove that is sufficiently large with high probability, as long as the batch size Y is large enough. The analysis from Kannan et al. (2018) (a version of Lemma 19) would apply, but only as long as the batch size is least as large as the from the statement of Lemma 19. We derive a more efficient version, essentially shaving off a factor of

Lemma 22 Fix a batch-greedy-style algorithm and any batch B in the execution of this algorithm. Fix and assume that the batch size Y is at least

Condition on the event that all perturbations in this batch are upper-bounded by , more formally:

Further, condition on the latent vector and the history h before batch B. Then

The probability in (21) is over the randomness in context arrivals and rewards in batch B.

The improvement over Lemma 19 comes from two sources: we use a tail bound on the sum of geometric random variables instead of a Chernoff bound on a binomial random variable, and we derive a tighter application of the eigenvalue concentration inequality of Tropp (2012).

Proof Let be the last round before batch B. Recalling (17), let

be a similar sum over the expected per-round covariance matrices. Assume The proof proceeds in two steps: first we lower-bound , and then we show that it implies (21). Denoting , we claim that

To prove this, observe that ’s minimum eigenvalue increases by at least with probability at least 1/4 each round by Lemma 20, where the randomness is over the history, i.e., the sequence of (context, reward) pairs. If we want it to go up to m, this should take rounds in expectation. However, we need it to go to m with high probability. Notice that this is dominated by the sum of geometric random variables with parameter . We’ll use the following bound from Janson (2018): for

Because we want the minimum eigenvalue of Choose . By Corollary 44,

This completes the proof of (22).

To derive (21) from (22), we proceed as follows. Consider the event

Letting and rewriting , we use a concentration inequality from Tropp (2012) to guarantee that

Then, using the fact that

since . Finally, observe that, omitting the conditioning on

6.2. Reward Simulation with a Diverse Batch History

We consider reward simulation with a batch history, in the sense of Definition 10. We show that a sufficiently “diverse” batch history suffices to simulate the reward for any given context vector. Coupled with the results of Section 6.1, it follows that batch history generated by a batch-greedy-style algorithm can simulate rewards as long as the batch size is sufficiently large.

Let us recap the definition of reward simulation (Definition 10). Let be a randomized function that takes a context x and outputs an independent random sample from other words, this is the realized reward for an action with context vector x.

Definition 23 Consider batch B in the execution of a batch-greedy-style algorithm. Batch history can simulate up to radius R > 0 if there exists a function g : {context vectors{batch histories is identically distributed to conditional on the batch context matrix, for all and all context vectors

Note that we do not require the function g to be efficiently computable. We do not require algorithms to compute g; a mere existence of such function suffices for our analysis.

The result in this subsection does not rely on the “greedy” property. Instead, it applies to all “batch-style” algorithms, defined as follows: time is divided in batches of Y consecutive rounds each, and the action at each round t only depends on the history up to the previous batch. The data diversity condition is formalized as ; recall that it is a high-probability event, in a precise sense defined in Lemma 22. The result is stated as follows:

Lemma 24 Fix a batch-style algorithm and any batch B in the execution of this algorithm. Assume the batch covariance matrix . Then batch history can simulate up to radius R.

Proof Let us construct a suitable function g for Definition 23. Fix a context vector be the vector of realized rewards in batch

in which the variance of the reward noise is . Then the noise variance in (23) should be , with essentially no modifications throughout the rest of the proof.)

the proof we show that g is as needed for Definition 23.

above satisfy . Then, we’ll show that if each

as desired. We must now show that . Note that

where simply denotes . Thus, it is sufficient to show that and , we have by Lemma 40

Let . Again by Lemma 40, . This means that

as desired. Finally, observe that

where is the noise vector. Notice that , and therefore, . Putting this all together, we have

and therefore D can simulate E for any x up to radius R.

6.3. Regret Bounds for BatchBayesGreedy

We apply the tools from Sections 6.1 and 6.2 to derive regret bounds for BatchBayesGreedy. On a high level, we prove that the history collected by BatchBayesGreedy suffices to simulate a “slowed-down” run of any other algorithm . Therefore, when it comes to choosing the next action, BatchBayesGreedy has at least as much information as , so its Bayesian-greedy choice cannot be worse than the choice made by

Our analysis extends to a more general scenario which is useful for the analysis of BatchFreqGreedy. We formulate and prove our results for this scenario directly. We consider an extended bandit model which separates data collection and reward collection. Each round t proceeds as follows: the algorithm observes available actions and the context vectors for these actions, then it chooses two actions, , and observes the reward for the former but not the latter. We re- fer to as the “prediction” at round t. We will refer to an algorithm in this model as a bandit algorithm (which chooses actions ) with “prediction rule” that chooses the predictions specifically, we will be interested in an arbitrary batch-greedy-style algorithm with prediction rule given by BatchBayesGreedy, as per (6) on 14. We assume this prediction rule henceforth. We are interested in prediction regret: a version of regret (1) if actions are replaced with predictions

where is the context vector of the best action at round t, as in (1). More precisely, we are interested in Bayesian prediction regret, the expectation of (24) over everything: the context vectors, the rewards, the algorithm’s random seed, and and the prior over

We use essentially the same analysis to derive implications on group externalities. For this purpose, we consider a further generalization in which regret is restricted to rounds that correspond to a particular population. Formally, let be a randomly chosen subset of the rounds where is a constant and rounds are chosen to be in T independently of one another. We allow for the possibility that the underlying context distribution differs for rounds in T compared to rounds in [T]\T . More precisely, we allow the event be correlated with the context tuple at round t. Similar to the definition of minority regret, we define T -restricted regret (resp., prediction regret) in T rounds to be the portion of regret (resp., prediction regret) that corresponds to T -rounds:

T -restricted Bayesian (prediction) regret is defined as an expectation over everything. Thus, the main theorem of this subsection is formulated as follows:

Theorem 25 Consider perturbed context generation. Let ALG be an arbitrary batch-greedy-style algorithm whose batch size is at least . Fix any bandit algorithm be the T -restricted regret of this algorithm on a particular problem instance I. Then on the same instance, ALG has T -restricted Bayesian prediction regret

Proof sketch. We use a t-round history of ALG to simulate a (t/Y )-round history of specifically, we use each batch in the history of ALG to simulate one round of . We prove that the simulated history of has exactly the same distribution as the actual history, for any ALG predicts the Bayesian-optimal action given the history (up to the previous batch), this action is at least as good (in expectation over the prior) as the one chosen by rounds. The detailed proof is deferred to Section 6.3.1.

Implications. As a corollary of this theorem, we obtain regret bounds for BatchBayesGreedy in Theorem 6 and Theorem 7. We take T to be the set of all rounds, i.e., be BatchBayesGreedy. For Theorem 7(b), we take to be LinUCB. Thus:

Corollary 26 In the setting of Theorem 25, BatchBayesGreedy has Bayesian regret at most on problem instance I. Further, under the assumptions of Theorem 7, BatchBayesGreedy has Bayesian regret at most on all instances.

We also obtain a similar regret bound on the Bayesian prediction regret of BatchFreqGreedy, which is essential for Section 6.4.

Corollary 27 In the setting of Theorem 25, BatchFreqGreedy has Bayesian prediction regret (27).

To derive Theorem 9 for BatchBayesGreedy, we take T to be the set of all minority rounds, and apply Theorem 25 twice: first when is run over the minority rounds only (and can behave arbitrarily on the rest), and then when is run over full population.

6.3.1. PROOF OF THEOREM 25

We condition on the event that all perturbations are bounded by , more precisely, on the event

Recall that is a high-probability event, by (15). We also condition on the event

where is the batch covariance matrix, as usual. Conditioned on , this too is a high-probability event by Lemma 22 plugging in and taking a union bound over all batches. We will prove that ALG satisfies

where the expectation is taken over everything: the context vectors, the rewards, the algorithm’s random seed, and the prior over . Then we take care of the “failure event”

History simulation. Before we prove (29), let us argue about using the history of ALG to simulate a (shorter) run of . Fix round t. We use a t-round history of ALG to simulate a of is the batch size in ALG. Stating this formally requires some notation. Let the set of actions available in round be the corresponding tuple of contexts. Let CON be the set of all possible context tuples, more precisely, the set of all finite subsets of denote, resp., the t-round history of denote the set of all possible t-round histories. Note that are random variables which take values on want to use history to simulate history . Thus, the simulation result is stated as follows:

Lemma 28 Fix round be the sequence of context arrivals up to and including round . Then there exists a “simulation function”

such that the simulated history is distributed identically to , conditional on se- quence , latent vector , and events

Proof Throughout this proof, condition on events . Generically, sequence of pairs is a context vector and is a simulated reward for this context vector. We define by induction on with base case . Throughout, we maintain a run of algorithm . For each step is simulated up to round , and the corresponding history is recorded as . Simulate the next round in the execution of by presenting it with the action set and the corresponding context tuple be the context vector chosen by . The corresponding reward constructed using the -th batch in , denote it with B. By Lemmas 22 and 24, the batch history can simulate a single reward, in the sense of Definition 23. In particular, there exists a function with the required properties (recall that it is explicitly defined in (23)). Thus, we define , and return as a reward to . This completes the construction of The distribution property of is immediate from the construction.

Proof of Equation (29) We argue for each batch separately, and then aggregate over all batches in the very end. Fix batch be the last round in this batch. Let and consider the context vector . This context vector is a randomized function f of the current context tuple and the history

By Lemma 28, letting , it holds that

Let t be some round in the next batch after , be the context vector predicted by ALG in round t. Recall that is a Bayesian-greedy choice from the context tuple on history . Observe that the Bayesian-greedy action choice from a given context tuple based on history cannot be worse, in terms of the Bayesian-expected reward, than any other choice from the same context tuple and based on the same history. Using (30), we obtain:

for any given context tuple that has a non-zero arrival probability given Given , the event is independent of everything else. Likewise, given con, the event is independent of everything else. It follows that

for any given context tuple that has a non-zero arrival probability given

Observe that have the same distribution, even conditioned on event is because the definitions of treat all rounds in the same batch in exactly the same way.) Therefore, we can integrate (32) over the context tuples con:

Now, let us sum up (33) over all rounds t in the next batch after B, denote it next(B).

Note that the right-hand side of (33) stays the same for all t, hence the factor of Y on the right-hand side of (34). This completes our analysis of a single batch B.

We obtain (29) by over all batches B. Here it is essential that the expectation does not depend on round t, and therefore the “regret benchmark” cancels out from (29). In particular, it is essential that the context tuples are identically distributed across rounds.

Proof of Theorem 25 given Equation (29) We must take care of the low-probability failure events

For ease of exposition, we focus on the special case ; the general case is treated similarly. We know that gives us that the instantaneous regret of every round is at most

by Lemma 35. Letting , we verify that our definition of Y means that Lemma 22 indeed holds with probability at least ), the Bayesian prediction regret of ALG is

This completes the proof of Theorem 25.

6.4. Regret Bounds for BatchFreqGreedy

To analyze BatchFreqGreedy, we show that its Bayesian regret is not too different from its Bayesian prediction regret, and use Corollary 27 to bound the latter. As in the previous subsection, we state this result in more generality for the sake of group externality implications: we consider T -restricted (prediction) regret, exactly as before.

Theorem 29 Assuming perturbed context generation, BatchFreqGreedy satisfies

where is the covariance matrix of the prior and is the perturbation size.

Taking T to be the set of all contexts, and using Corollary 27, we obtain Bayesian regret bounds for BatchFreqGreedy in Theorem 6 and Theorem 7. To derive Theorem 9 for BatchFreqGreedy, we take T to be the set of all minority rounds.

The remainder of this section is dedicated to proving Theorem 29. On a high level, the idea is as follows. As in the proof of Theorem 25, we condition on the high-probability event (28) that perturbations are bounded. Specifically, we prove that

To prove this statement, we fix round t and compare the action taken by BatchFreqGreedy and the predicted action . We observe that the difference in rewards between these two actions can be upper-bounded in terms of , the difference in the estimates with and without knowledge of the prior. (Recall (6) and (7) for definitions.) Specifically, we show that

The crux of the proof is to show that the difference is small, namely

ignoring other parameters. Thus, summing over all rounds, we get

Proof of Eq. (35). Let be, resp., instantaneous regret and instantaneous prediction regret at time t. Then

Thus, it suffices to bound the differences in instantaneous regret. Recall that at time t, the chosen action for BatchFreqGreedy and the predicted action are, resp.,

Letting be the last round in the previous batch, we can formulate

Therefore, we have

since the mean of the posterior distribution is exactly is deterministic given Taking expectation over

Therefore,

Eq. (36) follows. The crux is to prove (37): to bound the expected distance between the Frequentist and Bayesian estimates for . By expanding their definitions, we have

Next, note that

By Lemma 42, . Therefore,

giving us

the following bounds hold, each with probability at least

Therefore, fixing , with probability at least

Note that the high-probability events we need are deterministic given , and therefore are independent of the perturbations at time t. This means that Lemma 18 applies, with : conditioned on any , the expected regret for round t is upper-bounded by In particular, this holds for any not satisfying the high probability events from Lemmas 19 and 21. Therefore, for all

Because

This means

Since , for sufficiently small , this proves (37). We need to do a careful computation to complete the proof of Eq. (35). We know from (36) that

Choosing , we find that

so this term vanishes. Furthermore,

since . Using the fact that (since by assumption ), this is simply

Finally, we note that on the first rounds, the regret bound from Lemma 18 with applies, so the total regret difference is at most

which implies Eq. (35).

Completing the proof of Theorem 29 given (35). By Theorem 29, this holds whenever all perturbations are bounded by , which happens with probability at least . When the bound fail, the total regret is at most

by Lemma 18 (with ) and Lemma 35. Since , the contribution of regret when the high-probability bound fails is

Acknowledgments

We thank Solon Barocas, Dylan Foster, Jon Kleinberg, Aaron Roth, and Hanna Wallach for helpful discussions about these topics.

References

Yasin Abbasi-Yadkori, D´avid P´al, and Csaba Szepesv´ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (ICML), 2014.

Alekh Agarwal, Sarah Bird, Markus Cozowicz, Miro Dudik, John Langford, Lihong Li, Luong Hoang, Dan Melamed, Siddhartha Sen, Robert Schapire, and Alex Slivkins. Multiworld testing: A system for experimentation, learning, and decision-making. A white paper, available at https://github.com/Microsoft/mwt-ds/raw/master/images/MWT-WhitePaper.pdf, 2016.

Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, Siddhartha Sen, and Alex Slivkins. Making contextual decisions with low technical debt. CoRR arXiv:1606.03966, 2017.

Peter Auer, Nicol`o Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Exploiting the natural exploration in contextual bandits. CoRR arXiv:1704.09011, 2017.

Alberto Bietti, Alekh Agarwal, and John Langford. Practical evaluation and optimization of con- textual bandit algorithms. CoRR arXiv:1802.04064, 2018.

Kostas Bimpikis, Yiangos Papanastasiou, and Nicos Savva. Crowdsourcing exploration. Management Science, 2017. Forthcoming.

Sarah Bird, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. Exploring or ex- ploiting? Social and ethical implications of autonomous experimentation in AI. Available at SSRN: https://ssrn.com/abstract=2846909, also appeared at the Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

S´ebastien Bubeck and Nicolo Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi- armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1), 2012.

L. Elisa Celis and Nisheeth K Vishnoi. Fair personalization. CoRR arXiv:1707.02260, also appeared at the Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2017.

Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex ge- ometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

Yeon-Koo Che and Johannes H¨orner. Optimal design for social learning. Preprint, 2015.

Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, Special Issue on Social and Technical Trade-Offs, 2017.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

John D Cook. Upper and lower bounds for the normal distribution function, 2009.

Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Linden- strauss. Random Structures & Algorithms, 22(1):60–65, 2003.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), 2012.

Peter Frazier, David Kempe, Jon M. Kleinberg, and Robert Kleinberg. Incentivizing exploration. In ACM Conference on Economics and Computation (ACM EC), 2014.

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics Probability Letters, 135:1–6, 2018.

Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in learning: Clas- sic and contextual bandits. In Advances in Neural Information Processing Systems (NIPS), 2016.

Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. CoRR arXiv:1801.03423, 2018.

Michael Kearns, Aaron Roth, and Zhiwei Steven Wu. Meritocratic fairness for cross-population selection. In International Conference on Machine Learning (ICML), 2017.

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair deter- mination of risk scores. In Innovations in Theoretical Computer Science (ITCS), 2017.

Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.

Ilan Kremer, Yishay Mansour, and Motty Perry. Implementing the wisdom of the crowd. Journal of Political Economy, 122:988–1012, 2014.

John Langford and Tong Zhang. The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits. In Advances in Neural Information Processing Systems (NIPS), 2007.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to per- sonalized news article recommendation. In International World Wide Web Conference (WWW), 2010.

Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C. Parkes. Cali- brated fairness in bandits. CoRR arXiv:1707.01875, also appeared at the Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2017.

Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible bandit exploration. In ACM Conference on Economics and Computation (ACM EC), 2015.

Yishay Mansour, Aleksandrs Slivkins, and Zhiwei Steven Wu. Competing bandits: Learning under competition. In Innovations in Theoretical Computer Science (ITCS), 2018.

Philippe Rigollet and Assaf Zeevi. Nonparametric Bandits with Covariates. In Conference on Learning Theory (COLT), 2010.

Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, 2004.

Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

Appendix A. Auxiliary Lemmas

Throughout the paper, we use a number of tools that are either known or easily follow from something that is known. We move these tools to a separate appendix so as not to interrupt the flow. We provide the proofs for the sake of completeness.

A.1. (Sub)gaussians and Concentration

We rely on several known facts about Gaussian and subgaussian random variables. A random variable -subgaussian, for some . This includes variance-Gaussian random variables as a special case.

Lemma 30 If , then for any

Proof We begin with

X can be represented as is a standard normal random variable. Using a tail bound for the latter (from Cook (2009)),

The numerator in (41) is

Combining, we have

For by the above bound.

Lemma 31 Suppose is a Gaussian random vector with covariance matrix

Proof Assume without loss of generality that is diagonal, since the norm is rotationally invariant. Observe that stochastically dominates . (Geometrically, the latter conditioning shifts the probability mass away from the origin.) Therefore,

by Lemma 30, where th eigenvalue of

Fact 32 If -subgaussian random variable, then

Lemma 33 If are independent -subgaussian random variables, then

Proof For any , we know from Fact 32 that

A union bound completes the proof.

Lemma 34 If are independent zero-mean -subgaussian random variables, then

Proof Let . Since each -subgaussian, it follows that

Using Jensen’s inequality, we have

Rearranging, we have

Lemma 35 If

Proof From Chandrasekaran et al. (2012), the expected norm of a standard normal d-dimensional Gaussian is at most. Using the fact that

Lemma 36 (Lemma 2.2 in Dasgupta and Gupta (2003)) If where are independent standard Normal random variables, then

Lemma 37 (Hoeffding bound) If , where the ’s are independent subgaussian random variables with zero mean, then

A.2. KL-divergence

We use some basic facts about KL-divergence. Let us recap the definition: given two distributions P, Q on the same finite outcome space , KL-divergence from P to Q is

Lemma 38 (High-probability Pinsker Inequality (Tsybakov, 2009)) For any probability distributions P and Q over the same sample space and any arbitrary event E,

Lemma 39 Let P and Q be Bernoulli distributions with means respectively, with

Proof For any

A.3. Linear Algebra

We use several facts from linear algebra. In what follows, recall that the minimal and the maximal eigenvalues of matrix M, resp. For two matrices A, B, let us write to mean that is positive semidefinite.

Lemma 40

Proof has rank one, so it has one eigenvector with nonzero eigenvalue. v is an eigenvector since , and it has eigenvalue . This is the only nonzero eigenvalue, so

Lemma 41 For symmetric matrices A, B with B invertible,

Proof

Lemma 42 If

Proof

A.4. Logarithms

We use several variants of standard inequalities about logarithms.

Lemma 43

Proof This is true if and only if . To show this, observe that

1. At x = 1, this holds with equality.

2. At x = 1, the derivative is

This proves the lemma.

Corollary 44

Proof Using Lemma 43 and letting z = x/e,

Lemma 45

Proof First, we note that

Integrating both sides, we have

for some constant C that does not depend on x. Taking x = 0 yields C = 0. Therefore,

Designed for Accessibility and to further Open Science