Fairness in Learning: Classic and Contextual Bandits

We introduce the study of fairness in multi-armed bandit problems. Our fairness definition can be interpreted as demanding that given a pool of applicants (say, for college admission or mortgages), a worse applicant is never favored over a better one, despite a learning algorithm’s uncertainty over the true payoffs. We prove results of two types:

First, in the important special case of the classic stochastic bandits problem (i.e. in which there are no contexts), we provide a provably fair algorithm based on chained confidence intervals, and prove a cumulative regret bound with a cubic dependence on the number of arms. We further show that any fair algorithm must have such a dependence. When combined with regret bounds for standard non-fair algorithms such as UCB, this proves a strong separation between fair and unfair learning, which extends to the general contextual case.

In the general contextual case, we prove a tight connection between fairness and the KWIK (Knows What It Knows) learning model: a KWIK algorithm for a class of functions can be transformed into a provably fair contextual bandit algorithm, and conversely any fair contextual bandit algorithm can be transformed into a KWIK learning algorithm. This tight connection allows us to provide a provably fair algorithm for the linear contextual bandit problem with a polynomial dependence on the dimension, and to show (for a different class of functions) a worst-case exponential gap in regret between fair and non-fair learning algorithms.

Automated techniques from statistics and machine learning are increasingly being used to make decisions that have important consequences on people’s lives, including hiring [Miller, 2015], lending [Byrnes, 2016], policing [Rudin, 2013], and even criminal sentencing [Barry-Jester et al., 2015]. These high stakes uses of machine learning have led to increasing concern in law and policy circles about the potential for (often opaque) machine learning techniques to be discriminatory or unfair [Coglianese and Lehr, 2016, Barocas and Selbst, 2016]. Moreover, these concerns are not merely hypothetical: Sweeney [2013] observed that contextual ads for public record services shown in response to Google searches for stereotypically African American names were more likely to contain text referring to arrest records, compared to comparable ads shown in response to searches for stereotypically Caucasian names, which showed more neutral text. She confirmed that this was not because of stated preferences of the advertisers, but rather the automated outcome of Google’s targeting algorithms. Despite the recognized importance of this problem, very little is known about technical solutions to the problem of “unfairness”, or the extent to which “fairness” is in conflict with the goals of learning.1

In this paper, we consider the extent to which a natural fairness notion is compatible with learning in a general setting (the contextual bandit setting), which can be used to model many of the applications mentioned above in which machine learning is currently employed. In this model, the learner is a sequential decision maker, which must choose at each time step t which decision to make, out of a finite set of k choices (for example, which of k loan applicants – potentially from different populations or racial groups – to give a loan to). Before the learner makes its decision at round t, it observes some  context xtj for each choice of arm  j (xtj could, for example, represent the contents of the loan application of an individual from population j at round t). When the learner chooses arm j at time t, it obtains a stochastic  reward rtj whose expectation is determined by some unknown function of the context:  E�rtj�= fj(xtj). The goal of the learning algorithm is to maximize its expected reward – i.e. to approximate the optimal policy, which at each round, chooses arm j to maximize  E�rtj�. The difficulty in this task stems from the unknown functions fjwhich map contexts to rewards; these functions must be learned. Despite this, there are many known algorithms for learning the optimal policy (in the absence of any fairness constraint).

1.1 Fairness and Learning

Our notion of individual fairness is very simple: it states that it is unfair to preferentially choose one individual (e.g. for a loan, a job, admission to college, etc.) over another if he or she is not as qualified as the other individual. This definition of fairness is apt for our setting, since in contextual learning, the quality of an arm is clear: its expected reward. We view different arms j as representing different populations (e.g. different ethnic groups, cultures, or other divisions within society), and view the context  xtj at round tas representing information about a particular individual from that population. Each population has its own underlying function  fjwhich maps contexts to expected payoff2. At each time step t, the algorithm is asked to choose between specific members of each population, represented by the contexts  xtj. The quality of an individual is thus exactly  E�rtj�= fj(xtj). Our fairness condition translates thus: for any pair of arms  j, j′ at time

t, if fj(xtj) ≥ fj′(xtj′), then an algorithm is said to be discriminatory if it preferentially chooses the lower quality arm  j′. Said another way, an algorithm is fair if it guarantees the following: with high probability, over all rounds t, and for all pairs of arms  j, j′, whenever fj(xtj) ≥ fj′(xtj′), thealgorithm chooses arm j with probability at least that with which it chooses arm  j′3.

It is worth noting that this definition of fairness (formalized in the preliminaries) is entirely consistent with the optimal policy, which can simply choose at each round to play uniformly at random from the arms arg maxj�E�rtj��which maximize the expected reward. This is because – it seems – the goal of fairness as enunciated above is entirely consistent with the goal of maximizing expected reward. Indeed, the fairness constraint exactly states that the algorithm cannot favor low reward arms!

Our main conceptual result is that this intuition is incorrect in the face of unknown reward functions. Even though the constraint of fairness is consistent with implementing the optimal policy, it is not necessarily consistent with learning the optimal policy. We show that fairness always has a cost, in terms of the achievable learning rate of the algorithm. For some problems, the cost is mild, but for others, the cost is large.

1.2 Our Results

We divide our results into two parts. First, we study the classic stochastic multi-armed bandit problem [Lai and Robbins, 1985, Katehakis and Robbins, 1995]. In this case, there are no contexts, and each arm i has a fixed but unknown average reward  µi. Note that this is a special case of the contextual bandit problem in which the contexts are the same every day. In this setting, our fairness constraint specializes to require that with probability 1  − δ, for any pair of arms i, j for which  µi ≥ µj, at no round t does the algorithm play arm j with probability higher than that with which it plays arm i. Note that even this special case models interesting scenarios from the point of view of fairness in learning. It models, for example, the case in which choices are made by a loan officer after applicants have been categorized into k internally indistinguishable equivalence classes based on their applications.

Without a fairness constraint, it is known that it is possible to guarantee non-trivial regret to the optimal policy after only T = O(k) many rounds [Auer et al., 2002]. In Section 3, we give an algorithm that satisfies our fairness constraint and is able to guarantee non-trivial regret after  T = O(k3) rounds. We then show in Section 4 that it is not possible to do better – any fair learning algorithm can be forced to endure constant per-round regret for  T = Ω(k3) rounds.Thus, we tightly characterize the optimal regret attainable by fair algorithms in this setting, and formally separate it from the regret attainable by algorithms absent a fairness constraint. Note that this already shows a separation between the best possible learning rates for contextual bandit learning with and without the fairness constraint – the stochastic multi-armed bandit problem is a special case of every contextual bandit problem, and for general contextual bandit problems, it is also known how to get non-trivial regret after only T = O(k) many rounds [Agarwal et al., 2014, Beygelzimer et al., 2011, Chu et al., 2011].

We then move on to the general contextual bandit setting and prove a broad characterization result, relating fair contextual bandit learning to KWIK learning [Li et al., 2011]. The KWIK model, which stands for Knows What it Knows and has a close relationship with reinforcement learning, is a model of sequential supervised classification in which the learning algorithm must be confident in its predictions. Informally, a KWIK learning algorithm receives a sequence of unlabeled examples, whose true labels are defined by some unknown function in a class C. For each example, the algorithm may either predict a label, or announce “I Don’t Know”. The KWIK requirement is that with high probability, for each example, if the algorithm predicts a label, then its prediction must be very close to the true label. The quality of a KWIK learning algorithm is characterized by its “KWIK bound”, which provides an upper bound on the maximum number of times the algorithm can be forced to announce “I Don’t Know”. For any contextual bandit problem (defined by the set of functions C from which the payoff functions  fjmay be selected), we show that the optimal learning rate of any fair algorithm is determined by the best KWIK bound for the class C. We prove this constructively – we give a reduction showing how to convert a KWIK learning algorithm into a fair contextual bandit algorithm in Section 5, and vice versa in Section 6. Both reductions show that the KWIK bound of the KWIK algorithm is polynomially related to the regret of the fair algorithm.

This general connection has immediate implications, because it allows us to import known results for KWIK learning [Li et al., 2011]. For example, it implies that some fair contextual bandit problems are easy, in that there are fair algorithms which can obtain non-trivial regret guarantees after polynomially many rounds. This is the case, for example, for the important linear special case in which the contexts  xtj ∈ Rdare real valued vectors and the unknown functions  fjare linear: fj(xtj) = ⟨θj, xtj⟩4. In this case, the KWIK-learnability of noisy linear regression problems [Strehl and Littman, 2008, Li et al., 2011] implies that we can construct a fair contextual bandit algorithm whose per-round regret is polynomial in d. Conversely, it also implies that some contextual bandit problems which are easy without the fairness constraint become hard once we impose the fairness constraint, in that any fair algorithm must suffer constant per-round regret for exponentially many rounds. This is the case, for example, when the context consists of boolean vectors  xtj ∈ {0, 1}d,and the unknown functions  fj : {0, 1}d → {0, 1} are conjunctions– the “and”s of some unknown set of features5. The impossibility of non-trivial KWIK-learning of conjunctions [Li, 2009, Li et al., 2011] implies that no fair learner in the contextual bandit setting can achieve non-trivial regret before exponentially many (in d) rounds.

1.3 Other Related Work

Several papers study the problem of fairness in machine learning. One line of work aims to give algorithms for batch classification which achieve group fairness otherwise known as equality of outcomes, statistical parity – or algorithms that avoid disparate impact (see e.g. Calders and Verwer [2010], Luong et al. [2011], Kamishima et al. [2011], Feldman et al. [2015], Fish et al. [2016] and Adler et al. [2016] for a study of auditing existing algorithms for disparate impact). While statistical parity is sometimes a desirable goal – indeed, it is sometimes required by law – as observed by Dwork et al. [2012] and others, it suffers from two problems. First, if different populations indeed have different statistical properties, then it can be at odds with accurate classification. Second, even in cases when statistical parity is attainable with an optimal classifier, it does not prevent discrimination at an individual level – see Dwork et al. [2012] for a catalog of ways in which statistical parity can be insufficient from the perspective of fairness. In contrast, we study a notion aimed at guaranteeing fairness at the individual level.

Our definition of fairness is most closely related to that of Dwork et al. [2012], who proposed and explored the basic properties of a technical definition of individual fairness formalizing the idea that “similar individuals should be treated similarly”. Specifically, their work presupposes the existence of a task-specific metric on individuals, and proposes that fair algorithms should satisfy a Lipschitz condition with respect to this metric. Our definition of fairness is similar, in that the expected reward of each arm is a natural metric through which we define fairness. The main conceptual distinction between our work and Dwork et al. [2012] is that their work operates under the assumption that the metric is known to the algorithm designer, and hence in their setting, the fairness constraint binds only insofar as it is in conflict with the desired outcome of the algorithm designer. The most challenging aspect of this approach (as they acknowledge) is that it requires that some third party design a “fair” metric on individuals, which in a sense encodes much of the relevant challenge. The question of how to design such a metric was considered by Zemel et al. [2013], who study methods to learn representations that encode the data, while obscuring protected attributes. Our fairness constraint, conversely, is entirely aligned with the goal of the algorithm designer in that it is satisfied by the optimal policy; nevertheless, it affects the space of feasible learning algorithms, because it interferes with learning an optimal policy, which depends on the unknown reward functions.

At a technical level, our work is related to Amin et al. [2012] and Amin et al. [2013], which also relate KWIK learning to bandit learning in a different context, unrelated to fairness (when the arm space is very large).

We study the contextual bandit setting, which is defined by a domain X, a set of “arms” [k] := {1, . . . , k} and a class C of functions of the form  f : X → [0,1]. For each arm j there is some function fj ∈ C, unknown to the learner. In rounds t = 1, . . . , T, an adversary reveals to the algorithm a context xtj for each arm6. An algorithm A then chooses an arm  it, and observes stochastic reward rtit for the arm it chose. We assume  rtj ∼ Dtj, E�rtj�= fj(xtj), for some distribution  Dtj over [0, 1].

Let Π be the set of policies mapping contexts to distributions over arms  Xk → ∆k, and π∗ theoptimal policy which selects a distribution over arms as a function of contexts to maximize the expected reward of those arms. The pseudo-regret of an algorithm A on contexts  x1, . . . , xT is

defined as follows, where  πt represents A’s distribution on arms at round t:


We hereafter refer to this as the regret of A. The optimal policy  π∗ pulls arms with highest expectation at each round, so:


We say that A satisfies regret bound  R(T) if maxx1,...,xT Regret(x1, . . . , xt) ≤ R(T).

Let the history  ht ∈�X k × [k] × [0, 1]�t−1be a record of  t −1 rounds experienced by  A, t − 13-tuples which encode for each t the realization of the contexts, arm chosen, and reward observed. We write  πtj|htto denote the probability that A chooses arm j after observing contexts  xt, givenht. For notational simplicity, we will often drop the superscript t on the history when referring to the distribution over arms:  πtj|h := πtj|ht.

We now define what it means for a contextual bandit algorithm to be  δ-fair with respect to its arms. Informally, this will mean that A will play arm i with higher probability than arm j in round t only if i has higher mean than  j in round t, for all i, j ∈ [k], and in all rounds t.

Definition 1 (δ-fair). A is δ-fairif, for all sequences of contexts  x1, . . . , xt and all payoff distributions  Dt1, . . . , Dtk, with probability at least 1  − δover the realization of the history h, for all rounds t ∈ [T] and all pairs of arms  j, j′ ∈ [k],


Remark 1. Definition 1 prohibits favoring lower payoff arms over higher payoff arms. One relaxed definition only requires that  πtj|h = πtj′|h when fj(xtj) = fj′(xtj′) – requiring only identical individuals (concerning expected payoff) be treated identically. This relaxation is a special case of Dwork et al. [2012]’s proposed family of definitions, which require that “similar individuals be treated similarly”. We use Definition 1 as it is better motivated in its implications for fair treatment of individuals, but all of our results – including our lower bounds – apply also to this relaxation.

KWIK learning Let B be an algorithm which takes as input a sequence of examples  x1, . . . , xT ,and when given some  xt ∈ X, outputs either a prediction ˆyt ∈ [0,1] or else outputs ˆyt = ⊥,representing “I don’t know”. When ˆyt = ⊥, Breceives feedback  yt such that E�yt�= f(xt). B isan (ϵ, δ)-KWIK learning algorithm for  C : X → [0,1], with KWIK bound  m(ϵ, δ) if for any sequence of examples  x1, x2, . . .and any target  f ∈ C, with probability at least 1  − δ, both:

1. Its numerical predictions are accurate: for all  t, ˆyt ∈ {⊥} ∪ [f(xt) − ϵ, f(xt) + ϵ], and

2. B rarely outputs “I Don’t Know”: �∞t=1 I�ˆyt = ⊥�≤ m(ϵ, δ).

2.1 Specializing to Classic Stochastic Bandits

In Sections 3 and 4, we study the classic stochastic bandit problem, an important special case of the contextual bandit setting described above. Here we specialize our notation to this setting, in which there are no contexts. For each arm  j ∈ [k], there is an unknown distribution  Dj over [0, 1]with unknown mean  µj. A learning algorithm A chooses an arm  it in round t, and observes the reward  rtit ∼ Dit for the arm that it chose. Let  i∗ ∈ [k] be the arm with highest expected reward: i∗ ∈ arg maxi∈[k] µi. The pseudo-regret of an algorithm  A on D1, . . . , Dkis now just:


Let  ht ∈ ([k] × [0, 1])t−1 denote a record of the  t −1 rounds experienced by the algorithm so far, represented by  t −1 2-tuples encoding the previous arms chosen and rewards observed. We write πtj|htto denote the probability that A chooses arm j given history  ht. Again, we will often drop the superscript t on the history when referring to the distribution over arms:  πtj|h := πtj|ht.

δ-fairness in the classic bandit setting specializes as follows:

Definition 2 (δ-fairness in the classic bandits setting). A is δ-fairif, for all distributions  D1, . . . , Dk,with probability at least 1  − δover the history  h, for all t ∈ [T] and all j, j′ ∈ [k]:


In this section, we describe a simple and intuitive modification of the standard UCB algorithm [Auer et al., 2002], called FairBandits, prove that it is fair, and analyze its regret bound. The algorithm and its analysis highlight a key idea that is important to the design of fair algorithms in this setting: that of chaining confidence intervals. Intuitively, as a  δ-fair algorithm explores different arms it must play two arms  j1 and j2with equal probability until it has sufficient data to deduce, with confidence 1  − δ, either that  µj1 > µj2or vice versa. FairBandits does this by maintaining empirical estimates of the means of both arms, together with confidence intervals around those means. To be safe, the algorithm must play the arms with equal probability while their confidence intervals overlap. The same reasoning applies simultaneously to every pair of arms. Thus, if the confidence intervals of each pair of arms  ji and ji+1overlap for each  i ∈ [k], the algorithm is forced to play all arms j with equal probability. This is the case even if the confidence intervals around arm  jk and arm j1are far from overlapping – i.e. when the algorithm can be confident that µj1 > µjk.

This approach initially seems naive: in an attempt to achieve fairness, it seems overly conservative when ruling out arms, and can be forced to play arms uniformly at random for long periods of time. This is reflected in its regret bound, which is only non-trivial after  T ≫ k3, whereas the UCB algorithm [Auer et al., 2002] achieves non-trivial regret after T = O(k) rounds. However, our lower bound in Section 4 shows that any fair algorithm must suffer constant per-round regret for T ≫ k3 rounds on some instances.

We now give an overview of the behavior of FairBandits. At every round t, FairBandits identifies the arm  it∗ = arg maxi uti that has the largest upper confidence interval amongst the active arms. At each round  t, we say i is linked to j if [ℓti, uti] ∩ [ℓtj, utj] ̸= ∅, and i is chained to j if i andj are in the same component of the transitive closure of the linked relation. FairBandits plays uniformly at random among all active arms chained to arm  it∗.

Initially, the active set contains all arms. The active set of arms at each subsequent round is defined to be the set of arms that are chained to the arm with highest upper confidence bound at the previous round. The algorithm can be confident that arms that have become unchained to the arm with the highest upper confidence bound at any round have means that are lower than the means of any chained arms, and hence such arms can be safely removed from the active set, never to be played again. This has the useful property that the active set of arms can only shrink: at any round  t, St ⊆ St−1; see Figure 1 for an example of active set evolution over time.


We first observe that with probability 1  − δ, all of the confidence intervals maintained by FairBandits (δ) contain the true means of their respective arms over all rounds. We prove this claim, along with all other claims in this section without proofs, in Appendix A.

Lemma 1. With probability at least 1  − δ, for every arm  i and round t ℓti ≤ µi ≤ uti.

The fairness of FairBandits follows almost immediately from this guarantee.


Proof. By Lemma 1, with probability at least 1−δall confidence intervals contain their true means across all rounds. Thus, with probability 1−δ, at every round t, for every  i ∈ St, j /∈ St, it must be that  µj < µi– the arms not in the active set have strictly smaller means than those in the active set; if not,  utj ≥ µj ≥ µi ≥ ℓti implies jwould be chained to  it∗ if iis. Finally, all arms in  St areplayed uniformly at random – but since all such arms are played with the same probability, this does not cause the fairness constraint to bind for any pair  i, i′ ∈ St, for any realization of  µi, µ′iwhich lie within their confidence intervals.

Next, we upper bound the regret of FairBandits.

Theorem 2. If δ < 1/√T, then FairBanditshas regret



Figure 1: Confidence intervals over time for the lower bound instance outlined in Section 4 for k = 10. Lines correspond to upper and lower confidence bounds for each arm and cut off at the round in which the arm leaves the active set.


Remark 2. Before proving Theorem 2, we highlight two points. First, this bound becomes non-trivial (i.e. the average per-round regret is  ≪ 1) for T = Ω(k3). As we show in the next section, it is not possible to improve on this. Second, the bound may appear to have suboptimal dependence on T when compared to unconstrained regret bounds (where the dependence on T is often described as logarithmic). However, it is known that Ω�√kT�regret is necessaryeven in the unrestricted setting (without fairness) if one does not make data-specific assumptions on an instance [Bubeck and Cesa-Bianchi, 2012] (e.g. that there is a lower bound on the gap between the best and second best arm). It would be possible to state a logarithmic dependence on T in our setting as well while making assumptions on the gaps between arms, but since our fairness constraint manifests itself as a cost that depends on k, we choose for clarity to avoid such assumptions. Without such assumptions, our dependence on T is also optimal.

We now prove Theorem 2. Lemma 2 upper bounds the probability any arm i active in round t has been pulled substantially fewer times than its expectation, i.e.  nti ≪ tk. Lemma 3 upper bounds the width of any confidence interval used by  FairBandits in round t by η(t), conditioned on i being pulled the number of times guaranteed by Lemma 2. Finally, we stitch this together to prove Theorem 2 by upper bounding the total regret incurred for T rounds by noticing that the regret of any arm active in round t is at most  kη(t).

We begin by lower bounding the probability that any arm active in round t has been pulled substantially fewer times than its expectation.

Lemma 2. With probability at least 1  − δ2t2 ,


for all  i ∈ St (for all active arms in round t).

We now use this lower bound on the number of pulls of active arm i in round t to upper-bound η(t), an upper bound on the confidence interval width FairBandits uses for any active arm i in round t.


Lemma 3. Consider any round t and any arm  i ∈ St. Condition on  nti ≥ tk −


Finally, we prove the bound on the total regret of the algorithm, using the bound on the width of any active arm’s confidence interval in round t provided by Lemma 3.

Proof of Theorem 2. We condition on  µi ∈ [ℓti, uti] for all i, t. This occurs with probability at least 1  − δ, by Lemma 1. We claim that this implies that arm  i∗with highest expected reward is always in the active set. This follows from the fact that  µi∗ ∈ [ℓti∗, uti∗] and µj ∈ [ℓtj, utj] for all j, t; thus, ifµi∗ > µj, it must be that  uti∗ ≥ ℓtj. Thus, this holds for  it∗, the arm with highest upper confidence bound in round  t, so i∗must be chained to  it∗ in round t for all t.

We further condition on the event that for all j, t,


which holds with probability at least 1  − πδ2 by Lemma 2 and a union bound over all times t. This implies that, for all rounds t, for every active arm  j ∈ St, Lemma 3 applies, and therefore


Finally, we upper-bound the per-round regret of pulling any active arm  i ∈ St at round t. Since i∗is active, any  i ∈ St is chained to arm  i∗. Since all active arms have confidence interval width at most  η(t) and imust be chained using at most k arms’ confidence intervals, we have that


Since  µi ≥ ℓti and uti∗ ≥ µi∗, it follows that  |µi − µi∗| ≤ k · η(t) for any i ∈ St. Finally, summing up over all rounds  t ∈ T, we know that


where this bound is derived in Appendix A.1.

We now show that the regret bound for FairBandits has an optimal dependence on k: no fair algorithm has diminishing regret before  T = Ω(k3) rounds. All missing proofs are in Appendix B. The main result of this section is the following.

Theorem 3. There is a distribution P over k-arm instances of the stochastic multi-armed bandit problem such that any fair algorithm run on P experiences constant per-round regret for at least



Despite the fact that regret is defined in a prior-free way, the proof of Theorem 3 proceeds via Bayesian reasoning. We construct a family of lower bound instances such that arms have payoffs drawn from Bernoulli distributions, denoted  B(µ) for mean µ. So, to specify a problem instance, it suffices to specify a mean for each of  k arms: µ1, . . . , µk. The proof formalizes the following outline.

1. We define an instance distribution  P = P1×. . .×Pkover means  µi(Definition 3). P will have two important properties. First, we will draw means from P such that for any  i ∈ [k − 1],µi = µi+1with probability at least 1/4. Second, for any realization of means drawn from P, if an algorithm plays uniformly at random over [k], it will suffer constant per-round regret.

2. We treat  Pi as a priordistribution over mean  µi, and analyze the posterior distribution Pi(r1i , , . . . , rti) over means that results after applying Bayes’ rule to the payoff observations r1i , . . . , rti made by the algorithm. Bayes’ rule implies (Lemma 4) the joint distribution over rewards and means drawn from P is identical to the distribution which first draws means according to P, then draws rewards conditioned on those means, and finally resamples the means from the posterior distribution on means. Thus, we can reason about fairness (a frequentist quantity) by analyzing the Bayesian posterior distribution on means conditioned on the observed rewards.

3. A  δ-fair algorithm, for any set of means realized from the instance (prior) distribution, must not play arm i + 1 with lower probability than arm  i if µi = µi+1, except with probability δ. By the above change of perspective, therefore, any  δ-fair algorithm must play arms i and i+1 with equal probability until the posterior distribution on means given observed rewards, satisfies  P [µi = µi+1|h] < δ(Lemmas 5 and 6).

4. We finally lower bound the number of reward observations necessary before the posterior distribution on means given payoffs is such that  P [µi = µi+1|h] < δfor any pair of adjacent arms i, i + 1. We show that this is Ω(k2) (Lemma 7). Since fair algorithms must play from among the k arms uniformly at random until this point, with high probability, no arm accumulates sufficiently many reward observations until  T = Ω(k3) rounds of play.

We begin by describing our distribution over instances. Each arm i’s payoff distribution will be Bernoulli with mean  µi ∼ Piindependently of each other arm.

Definition 3 (Prior Distribution over  µi).For each arm  i, µiis distributed according to the distribution with the following probability mass function:


We treat P as a prior distribution over instances, and analyze the posterior distribution on instances given the realized rewards. Lemma 4 justifies this reasoning.

Lemma 4. Consider the following two experiments: In the first, let  µi ∼ Pi and r1i , . . . , rti ∼B(µi), and Wdenote the joint distribution on (µi, r1i , . . . , rti). In the second, let  µi ∼ Pi, andr1i , . . . , rti ∼ B(µi), and then re-draw the mean  µ′i ∼ Pi(r1i , . . . , rti)from its posterior distribution given the rewards. Let (µ′i, r1i , . . . , rti) ∼ W ′. Then, W and W ′are identical distributions.

Next, we lower-bound the number of reward observations necessary such that for some  i ∈ [k]:P�µi = µi+1|ht� < δwith respect to the posterior. It will be useful to refer to an algorithm’s histories as distinguishing the mean of an arm given that history with high probability.

Definition 4 (δ-distinguishing). We will say  ht δ-distinguishes arm i for Aif, for some  α ∈ [0, 1],


The next lemma shows that if no arm is√δ-distinguished by a history, all pairs of arms i, i + 1 have posterior probability strictly greater than  δof having equal means.

Lemma 5. Suppose A has history  ht, and that ht does not√δ-distinguish any arm i. Then, for all arms i, i + 1,


Now, we prove that for any fair algorithm, with probability  ≥ 12 over the draw of histories  ht, htmust√2δ-distinguish some arm, or the algorithm must play uniformly across all k arms conditioned on  ht.

Lemma 6. Suppose an algorithm  A is δ-fair. Then:


We now lower-bound the number of observations from arm i which are required to  δ-distinguish it.

Lemma 7. Fix any δ < 18. Let µi ∼ Pias in Definition 3. Then, arm  i is√2δ-distinguishable by ht only if Ti = Ω(k2 ln 1δ), where Ti = |{t′ : ht′2 = i, t′ ≤ t}|is the number of times arm i is played.

Proof. Write p, p + 13k to represent the two possible realizations that  µimight take, when drawn from the distribution over instances given in Definition 3. Let A represent the event that  µi = pand B the event that  µi = p + 13k. Let δ′ =√2δthroughout.

Fix a history  ht, and let m = Tirepresent the number of observations of arm i’s reward. We will abuse notation and use  hti to refer to the payoff sequence of arm i observed in history  ht. hti istherefore a binary sequence of length  m; let ||hti||0 = sdenote the number of 1s in the sequence. We will calculate conditions under which  hti, m, swill imply that either  1−δ′δ′ ≤P[B|hti]P[A|hti] orP[B|hti]P[A|hti] ≤ δ′1−δ′

holds, implying that one of A or B has posterior probability at least 1  − δ′, conditioned on the observed rewards. If neither of these is the case,  i is not δ′-distinguished by  ht.

We begin by rearranging our definition of this ratio  X = P[B|hti]P[A|hti] = P[hti|B]P[hti|A],which follows from Bayes’ rule and the fact that P [A] = P [B]. We wish to upper and lower bound X in terms of  hti’svalue. By definition of the Bernoulli distribution, we have that


We now calculate under what conditions either (a)  X ≤ δ′1−δ′ , or (b) X ≥ 1−δ′δ′. One of these must hold if  i is δ′-distinguished. Before we do so, we mention that a Chernoff bound implies that with probability 1  − δ′, for events A and B, Equations 1 and 2, respectively:


since the mean of m Bernoulli trials with mean  p ( or p + 13k) is mp (or mp + m3k).We begin by analyzing case (a), where  δ′ =√2δ < 1/2 implies


Taking logarithms on both sides, we have that


where the inequality follows from ln(1 +  x) ≥ xx+1 for x ∈ [−1, ∞]. Then, this implies that

(3kp + 1)(3k(1 − p) − 1) ln(2δ′) > s(3k(1 − p) − 1) − (m − s)(3kp + 1) = 3ks − 3kpm − m.

Multiplying both sides by  −1, this implies that


Since  p, 1 − p ∈ [1/3, 2/3] and δ′ =√2δ, solving for m implies that  m = Ω(k2 ln 1δ′ ).In case (b), we have


where we used the fact that 1 +  x ≤ ex for all x. Taking logarithms, this will imply that


whose last inequality comes the range of p. Combining this inequality with Equation 2, this implies


We now have the tools in hand to prove Theorem 3.

Proof of Theorem 3. Assume A is some δ-fair algorithm where  δ < 1/8. Fix T; we claim that with probability at least 12, for any t = o(k3 ln 1δ), t ≤ T, πtj|ht = 1k for all j. Since the payoff for uniformly random play is  ≤ 12 + 1k, while the best arm has payoff  ≥ 23, in any round  t where πti|ht = πti′|ht forall  i, i′ ∈ [k], the algorithm suffers Ω(1) regret in that round.

Lemma 6 implies that, with probability at least 12 over the distribution over histories  ht, either(a)  πt′i|ht′ = πt′i′|ht′ for all i, i′ ∈ [k], t′ ≤ t or (b) ht must√2δ-distinguish some arm i. Case (a) implies

our claim. In case (b), Lemma 7 states than an arm  i is√2δ-distinguishable only if  Ti = Ω(k2 ln 1δ).We now argue that unless  t = Ω(k3 ln 1δ), Ti = o(k2 ln 1δ), which will imply our claim for case (b).

Fix some i, t. We lower-bound t for which, with probability at least 1  − δ′k over histories  ht, itwill be the case that  nti ≥ c · k2 ln 1δ when πt′i|ht′ = πt′i′|ht′ for all i, i′ ∈ [k], t′ ≤ t. Let X1, . . . , Xt beindicator variables of arm i being played in round  t′ ≤ t. Note that for all  t′ ≤ t, E[Xt′] = 1k, sincein all rounds prior to t, we have all arms are played with equal probability. For any  ϵ ∈ [0, 1], asnt′i are nondecreasing in  t′, an additive Chernoff bound implies


probability 1  − δ′ for all i unless t ≥ min�c2 · k3 ln 1δ,

In this section, we show if a class of functions is KWIK learnable, then there is a fair algorithm for learning the same class of functions in the contextual bandit setting, with a regret bound polynomially related to the function class’ KWIK bound. Intuitively, KWIK-learnability of a class of functions guarantees we can learn the function’s behavior to a high degree of accuracy with a high degree of confidence. As fairness constrains an algorithm most before the algorithm has determined the payoff functions’ behavior accurately, this guarantee enables us to learn fairly without incurring much additional regret. Formally, we prove the following polynomial relationship.

Theorem 4. For an instance of the contextual multi-armed bandit problem where  fj ∈ C forall  j ∈ [k], if C is (ϵ, δ)-KWIK learnable with bound  m(ϵ, δ), KWIKToFair (δ, T) is δ-fair andachieves regret bound:


First, we construct an algorithm  KWIKToFair(δ, T) that uses the KWIK learning algorithm as a subroutine, and prove that it is  δ-fair. A call to  KWIKToFair(δ, T) will initialize a KWIK learner for each arm, and in each of the T rounds will implicitly construct a confidence interval around the prediction of each learner. If a learner makes a numeric valued prediction, we will interpret this as a confidence interval centered at the prediction with width  ϵ∗. If a learner outputs ⊥, we interpret this as a trivial confidence interval (covering all of [0, 1]). We use the same chaining technique that we use in the classic stochastic setting. In every round  t, KWIKToFair (δ, T)identifies the arm  it∗ = arg maxi uti that has the largest upper confidence bound. At each round t, we will say  i is linked to j if [ℓti, uti] ∩ [ℓtj, utj] ̸= ∅, and i is chained to jif they are in the same component of the transitive closure of the linked relation. Then, it plays uniformly at random amongst all arms chained to arm  it∗.Whenever all learners output predictions, they need no feedback. When a learner for  j outputs ⊥, if jis selected then we have feedback  rtj to give it; on the other hand, if j isn’t selected, we “roll back” the learning algorithm for j to before this round by not updating the algorithm’s state.


10: Pull j∗ ← (x ∈R [k]), receive reward  rtj∗ ▷Pick arm at random from all arms


15: hj∗ ← hj∗ :: (xtj∗, rtj∗) ▷Update the history for  Lj∗

We begin by bounding the probability of certain failures of KWIKToFair in Lemma 8, proven in Appendix C. This in turn lets us prove the fairness of KWIKToFair in Theorem 5.

Lemma 8. With probability at least 1  − min(δ, 1T ), for all rounds t and all arms  i, (a) if sti ∈ Rthen  |sti − fi(xti)| ≤ ϵ∗ and (b) �t I�sti = ⊥ and i is pulled�≤ m(ϵ∗, δ∗).


Proof of Theorem 5. We condition on both (a) and (b) holding for all arms i and rounds t from Lemma 8, which occur with probability 1  − δfor all arms and all times t. Therefore, we proceed by conditioning on the event that for all arms i and all rounds  t, if Li = sti for sti ̸= ⊥ then|sti − fi(xti)| ≤ ϵ∗. Having done so, there are two possibilities for each round t.In case 1, for each i we have that  Li(xti) = sti ̸= ⊥. By the condition above, for any arms i and j, fi(xti) ≥ fj(xtj) implies that  sti + ϵ∗ ≥ stj − ϵ∗. Since in this case no learner outputs  ⊥, arm jchains to the top arm only if arm i does. Therefore  πti|h ≥ πtj|h. In case 2, there exists some i such that  Li(xti) = ⊥. Then we choose uniformly at random across all arms, so  πti|h = πtj|h for all i andj.

Thus, with probability at least 1−δ, for each round  t, fi(xti) ≥ fj(xtj) implies that  πti|h ≥ πtj|h.

We now use the KWIK bounds of the KWIK learners to upper-bound the regret of KWIKTo- Fair(δ, T).

Lemma 9. KWIKToFair(δ, T)achieves regret  O(max(k2 · m(ϵ∗, δ∗), k3 ln Tkδ )).

Proof. We first condition on the event that both (a) and (b) from Lemma 8 hold for all t, i, which holds with probability 1  − min(δ, 1T ), and bound the regret when they both hold. Choose an arbitrary round t in the execution of  KWIKToFair(δ, T). As above, there are two cases. In the first case,  Li(xti) = sti ̸= ⊥ for all iand we choose uniformly at random from the arms chained by ϵ∗-intervals to the arm with the highest prediction. Since we have conditioned on the event that all KWIK learners are correct,  i∗ ∈ St. Furthermore, for any  i, j ∈ St, we have that  |sti − stj| ≤ 2kϵ∗,and in particular that  |sti − sti∗| ≤ 2kϵ∗. Thus, the regret is at most 2kϵ∗in such a round. In the second case some arm outputs  ⊥, so we choose randomly from all k arms, and the worst-case regret is 1. Thus, the total regret will be at most 2kϵ∗T + n + δT where nis the number of rounds in which some  Li outputs ⊥.

We now upper bound  ni, the number of rounds in which arm  i outputs ⊥. Fix some arm i which outputs  ⊥ in nirounds. Arm i is played and therefore receives feedback every time it outputs  ⊥with probability at least 1/k. Thus, using a Chernoff bound, with probability 1  − δ′, arm i receivesfeedback for  nioutputs of  ⊥in at least nik −�2ni ln 2δ′ rounds. Lihas the guarantee that there can be at most  m(ϵ∗, δ∗) many such rounds (in which it outputs  ⊥ andreceives feedback). Thus,


If  ni ≥ ck · m(ϵ∗, δ∗), this implies


We now analyze cases in which (1)  k ln 2δ′ ≤ m(ϵ∗, δ∗) and (2) k ln 2δ′ > m(ϵ∗, δ∗).Case (1) this implies


For c > 4, this leads to contradiction. Thus, in this case, if we set  δ′ = δk, we know that with probability 1  − δ, ni ≤ 4k · m(ϵ∗, δ∗) which summing up over all  i implies �i ni ≤ 4k2 · m(ϵ∗, δ∗),as desired.

In case (2), we have that


which solving for  niimplies that  ni = O(k2 ln 1δ′ ), so �i ni = O(k3 ln kδ ) by setting  δ′ = δk and takinga union bound. Thus, there are at most  n = O(max(k2 · m(ϵ, δ∗), k3 ln kδ )) rounds in expectation during the execution of  KWIKToFair(δ, T) in which some arm outputs  ⊥.

Combining both cases, the total regret incurred by  KWIKToFair(δ, T) across all T rounds is


Our presentation of  KWIKToFair(δ, T) has a known time horizon T. Its guarantees extend to the case in which T is unknown via the standard “doubling trick” to prove Theorem 4 in Appendix C.

An important instance of the contextual bandit problem is the linear case, where C consists of the set of all linear functions of bounded norm in d dimensions. This captures the natural setting in which the rewards of each arm are governed by an underlying linear regression model on a d-dimensional real valued feature space. The linear case is well studied, and there are known KWIK algorithms [Strehl and Littman, 2008] for the set of linear functions C, which allows us via our reduction to give a fair contextual bandit algorithm for this setting with a polynomial regret bound.

Lemma 10 ([Strehl and Littman, 2008]). Let C = {fθ|fθ(x) = ⟨θ, x⟩, θ ∈ Rd, ||θ|| ≤ 1} andX = {x ∈ Rd : ||x|| ≤ 1}. C is KWIK learnable with KWIK bound m(ϵ, δ) = ˜O(d3/ϵ4).

Then, an application of Theorem 4 implies that KWIKToFair has a polynomial regret guarantee for the class of linear functions. This proof can be found in Appendix C.

Corollary 1. Let C and X be as in Lemma 10, and  fj ∈ C for each j ∈ [k]. Then, KWIKTo-Fair(T, δ)using the learner from [Strehl and Littman, 2008] has regret:


In this section, we show how to use a fair, no-regret contextual bandit algorithm to construct a KWIK learning algorithm whose KWIK bound has logarithmic dependence on the number of rounds T. Intuitively, any fair algorithm which achieves low regret must both be able to find and exploit an optimal arm (since the algorithm is no-regret) and can only exploit that arm once it has a tight understanding of the qualities of all arms (since the algorithm is fair). Thus, any fair no-regret algorithm will ultimately have tight (1− δ)-confidence about each arm’s reward function.

Theorem 6. Suppose A is a δ-fair algorithm for the contextual bandit problem over the class of functions C, with regret bound  R(T, δ). Suppose also there exists  f ∈ C, x(ℓ) ∈ Xsuch that for every  ℓ ∈ [⌈ 1ϵ⌉], f(x(ℓ)) = ℓ · ϵ. Then, FairToKWIK is an (ϵ, δ)-KWIK algorithm for C with KWIK bound  m(ϵ, δ), with m(ϵ, δ)the solution to m(ϵ,δ)ϵ4 = R(m(ϵ, δ), ϵδ2T ).

Remark 3. The condition that C should contain a function that can take on values that are multiples of  ϵis for technical convenience; C can always be augmented by adding a single such function.

Our aim is to construct a KWIK algorithm B to predict labels for a sequence of examples labeled with some unknown function  f∗ ∈ C. To do this, we will run our fair contextual bandit algorithm A on an instance that we construct online as examples  xt arrive for B. The idea is to simulate a two arm instance, in which one arm’s rewards are governed by  f∗ (the function to be KWIK learned), and the other arm’s rewards are governed by a function f that we can set to take any value in  {0, ϵ, 2ϵ, . . . , 1}. For each input  xt, we perform a thought experiment and consider  A’sprobability distribution over arms when facing a context which forces arm 2’s payoff to take each of the values 0, ϵ∗, 2ϵ∗, . . . , 1. Since A is fair, Awill play arm 1 with weakly higher probability than arm 2 for those  ℓ : ℓϵ∗ ≤ f(xt); analogously, A will play arm 1 with weakly lower probability than arm 2 for those  ℓ : ℓϵ∗ ≥ f(xt). If there are at least 2 values of  ℓfor which arm 1 and arm 2 are played with equal probability, one of those contexts will force  A to suffer ϵ∗ regret, so we continue the simulation of A on one of those instances selected at random, forcing at least  ϵ∗/2 regret in expectation, and at the same time have  B return ⊥. B receives f∗(xt) on such a round, which is used to construct feedback for A. Otherwise, A must transition from playing arm 1 with strictly higher probability to playing 2 with strictly higher probability as  ℓincreases: the point at which that occurs will “sandwich” the value of  f(xt), since A’s fairness implies this transition must occur when the expected payoff of arm 2 exceeds that of arm 1. B uses this value to output a numeric prediction.

An important fact we exploit is that we can query A’s behavior on (xt, x(ℓ)), for any  xt andℓ ∈�⌈ 1ϵ∗ ⌉�without providing it feedback (and instead “roll back” its history to  ht not including the query (xt, x(ℓ))). We update A’s history by providing it feedback only in rounds where B outputs ⊥.


10: Select at ∼ A(ht, (xt, x(ˆℓ))) ▷ Run Ato get a predicted arm


13: rt1 ← yt and h ← ht :: ((xt, x(ˆℓ)), 1, rt1) ▷Use KWIK feedback


16: h ← ht :: ((xt, x(ˆℓ)), 2, rt2) ▷Construct feedback for arm 2

17: else ▷ A’s history is not updated


Proof. For a fixed run of A, we calculate the probability that for all times  t and ℓ ∈�⌈ 1ϵ∗ ⌉�, itis the case that  pt,ℓ1 > pt,ℓ2 only if f∗(xt) > ℓ · ϵ∗ and also pt,ℓ1 < pt,ℓ2 only if f∗(xt) < ℓ · ϵ∗. Inthis run, A is queried on Tϵ∗histories and contexts: prefixes of h along with (xt, x(ℓ)) for each t ∈ [T], ℓ ∈�⌈ 1ϵ∗ ⌉�. The fairness of A implies for any fixed  ht and fixed (xt, x(ℓ)), with probability 1  − δ∗, pt,ℓ1 > pt,ℓ2 only if f∗(xt) > ℓϵ∗ and pt,ℓ1 < pt,ℓ2 only if f∗(xt) < ℓϵ∗. Then, by a union bound over  t ∈ [T], ℓ ∈�⌈ 1ϵ∗ ⌉�, with probability at least 1  − δ∗ Tϵ∗ = 1 − δ, A(ht, xt, x(ℓ)) will satisfy this property for all  t ∈ [T], ℓ ∈�⌈ 1ϵ∗ ⌉�. We condition on this holding in the remainder of the proof.

We now argue that the numeric predictions of B are correct within an additive  ϵ. Let:


When  B(xt) = yt ∈ [0,1], note that  |Et| ≤ 1, else Bwould have output  ⊥.

If  pt,ℓ1 ≤ pt,ℓ2 for all ℓ, since |Et| ≤ 1, either pt,01 < pt,02 or pt,11 < pt,12 , which we have conditioned on implying that either  f∗(xt) < f(x(0)) = 0 or f∗(xt) < f(x(1)) = ϵ∗. Since f∗(xt) ≥ 0, thisimplies  f∗(xt) ∈ [0, ϵ∗) = [ˆℓϵ∗, ϵ∗) = [ˆyt, ˆyt + ϵ∗).

Otherwise, we have that  pt,ˆℓ1 > pt,ˆℓ2 , and pt,ℓ1 ≤ pt,ℓ2 for all ℓ > ˆℓ. If (a) ˆℓ = ⌈ 1ϵ∗ ⌉, then f∗(xt) > 1,a contradiction, so ˆℓ < ⌈ 1ϵ∗ ⌉. If (b) ˆℓ = ⌈ 1ϵ∗ ⌉ − 1, then f∗(xt) > f(x(ˆℓ)) = (⌈ 1ϵ∗ ⌉ − 1)ϵ∗ and sof∗(xt) ∈ ((⌈ 1ϵ∗ ⌉ − 1)ϵ∗, 1] = ( ˆyt, ˆyt + ϵ∗], so ˆyt is ϵ∗-accurate. If neither (a) nor (b), then (c) it must be ˆℓ < ⌈ 1ϵ∗ ⌉ − 1. Since |Et| ≤1, for some  ℓ ∈ {ˆℓ + 1, ˆℓ + 2}, we know that  pt,ℓ1 < pt,ℓ2 ; thus,f∗(xt) < f(x(ℓ)) ≤ (ˆℓ + 2)ϵ∗ and therefore  f∗(xt) ∈ (ˆℓϵ∗, (ˆℓ + 2)ϵ∗) = (ˆyt, ˆyt + 2ϵ∗).

Finally, we upper-bound  m(ϵ, δ), the number of rounds  t : B(xt) = ⊥. For each such t, A runs on a random draw of one of two contexts, one of whose arms’ payoffs differ by at least  ϵ∗. Thus, for one of those contexts, either  f∗(xt) ≥ f(x)−ϵ∗ or f∗(xt) ≤ f(x)−ϵ∗. In either case, since  pt,ℓ1 = pt,ℓ2 = 12for  x(ℓ) = x, Asuffers expected regret at least ϵ∗2 for that context, and at least  ϵ∗4 when faced with one chosen at random. Thus,  m(ϵ, δ) · ϵ∗4 = m(ϵ, δ) · ϵ8 < R(m(ϵ, δ), δ∗) = R(m(ϵ, δ), ϵδ2T ), sinceA’s regret is upper bounded by this quantity over  m(ϵ, δ) rounds (which is an upper bound on the number of rounds for which A is actually run and updated).

6.1 An Exponential Separation Between Fair and Unfair Learning

In this section, we exploit the other direction of the equivalence we have proven between fair contextual bandit algorithms and KWIK learning algorithms to give a simple contextual bandit problem for which fairness imposes an exponential cost in its regret bound. This is in contrast to the case in which the underlying class of functions is linear, for which we gave fair contextual bandit algorithms with regret bounds within a polynomial factor of their unconstrained counterparts. In this problem, the context domain is the d-dimensional boolean hypercube:  X = {0, 1}d – i.e. thecontext each round for each individual consists of d boolean attributes. Our class of functions C is the class of boolean conjunctions:


We first give a simple but unfair algorithm, ConjunctionBandit, for this problem which obtains a regret bound which is linear in d. It maintains a set of candidate variables  C∗j for eachconjunction  fj; this set shrinks across rounds, while always containing the true set of variables over which  fjis defined. We denote the boolean value of variable m in the context for arm j in round t by  xtj,m.

The formal claim and proof that ConjunctionBandit achieves regret  R(T) = O(k2d), as wellas ConjunctionBandit’s formal description, can be found in Appendix C. ConjunctionBandit violates the fairness in every round t in which it predicts 0 for arm i but 1 for arm j even though fi(xt) = fj(xt) = 1, as πti = 0 < 1k < πtj.

We now show that fair algorithms cannot guarantee subexponential regret in d. This relies upon a known lower bound for KWIK learning conjunctions [Li, 2009]:

Lemma 11. There exists a sequence of examples (x1, . . . , x2d−1)such that for  ϵ, δ ≤ 1/2, every(ϵ, δ)-KWIK learning algorithm B for the class C of conjunctions on d variables must output  ⊥ forxt for each t ∈ [2d − 1]. Thus, Bhas a KWIK bound of at least  m(ϵ, δ) = Ω(2d).

We then use the equivalence between fair algorithms and KWIK learning to translate this lower bound on  m(ϵ, δ) into a minimum worst case regret bound for fair algorithms on conjunctions. We modify Theorem 6 to yield the following lemma, proven in Appendix C.

Lemma 12. Suppose A is a δ-fair algorithm for the contextual bandit problem over the class C of conjunctions on d variables. If A has regret bound  R(T, δ) then for δ′ = 2Tδ, FairToKWIK is an(0, δ′)-KWIK algorithm for C with KWIK bound  m(0, δ′) = 4R(m(0, δ′), δ).

Lemma 11 then lets us lower-bound the worst case regret of fair learning algorithms on conjunctions.

Corollary 2. For δ < 12T , any δ-fair algorithm for the contextual bandit problem over the class C of conjunctions on d boolean variables has a worst case regret bound of  R(T) = Ω(2d).

Proof. Let T ≤ 2d−1. We know then that if  δ′ <1, Lemma 11 guarantees the existence of a sequence of contexts  x1, . . . , xT for which any (0, δ′)-KWIK algorithm has KWIK bound  m(T, 0, δ′) = T.

Lemma 12 implies 4R(m(T, 0, δ′), δ) gives a KWIK bound of  m(T, 0, δ′) when δ′ = 2Tδ. Thus,if  δ < 12T , then δ′ < 1 and so R(m(T, 0, δ′), δ) = m(T,0,δ′)4 = T4 .

Together with the analysis of ConjunctionBandit, this demonstrates a strong separation between fair and unfair contextual bandit algorithms: when the underlying functions mapping contexts to payoffs are conjunctions on d variables, there exist a sequence of contexts on which fair algorithms must incur regret exponential in d while unfair algorithms can achieve regret linear in d.

Philip Adler, Casey Falk, Sorelle A. Friedler, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. Auditing black-box models by obscuring features. CoRR, abs/1602.07043, 2016. URL http://arxiv.org/abs/1602.07043.

Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1638–1646, 2014.

Kareem Amin, Michael Kearns, and Umar Syed. Graphical models for bandit problems. arXiv preprint arXiv:1202.3782, 2012.

Kareem Amin, Michael Kearns, Moez Draief, and Jacob D Abernethy. Large-scale bandit problems and kwik learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 588–596, 2013.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.

Solon Barocas and Andrew D. Selbst. Big data’s disparate impact. California Law Review, 104, 2016. Available at SSRN: http://ssrn.com/abstract=2477899.

Anna Maria Barry-Jester, Ben Casselman, and Dana Goldstein. The new science of sentencing. The Marshall Project, August 8 2015. URL https://www.themarshallproject.org/2015/08/04/the-new-science-of-sentencing. Retrieved 4/28/2016.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 19–26, 2011.

S´ebastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Machine Learning, 5(1):1–122, 2012.

Nanette Byrnes. Artificial intolerance. MIT Technology Review, March 28 2016. URL https://www. technologyreview.com/s/600996/artificial-intolerance/. Retrieved 4/28/2016.

Toon Calders and Sicco Verwer. Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 208–214, 2011.

Cary Coglianese and David Lehr. Regulating by robot: Administrative decision-making in the machine-learning era. Georgetown Law Journal, 2016. Forthcoming.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.

Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 259–268, 2015.

Benjamin Fish, Jeremy Kun, and ´Ad´am D Lelkes. A confidence-based approach for balancing fairness and accuracy. SIAM International Symposium on Data Mining, 2016.

FTC Commisioner Julie Brill. Navigating the “trackless ocean”: Fairness in big data research and decision making. Keynote Address at the Columbia University Data Science Institute, April 2015.

Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 643–650. IEEE, 2011.

Michael N Katehakis and Herbert Robbins. Sequential choice from several populations. PROCEEDINGS-NATIONAL ACADEMY OF SCIENCES USA, 92:8584–8584, 1995.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.

Lihong Li. A unifying framework for computational reinforcement learning theory. PhD thesis, Rutgers, The State University of New Jersey, 2009.

Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it knows: a framework for self-aware learning. Machine learning, 82(3):399–443, 2011.

Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-nn as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 502–510. ACM, 2011.

Clair C Miller. Can an algorithm hire better than a human? The New York Times, June 25 2015. URL http://www. nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html. Retrieved 4/28/2016.

Cecilia Munoz, Megan Smith, and DJ Patil. Big data: A report on algorithmic systems, opportunity, and civil rights. Technical report, Executive Office of the President, The White House, 2016.

John Podesta, Penny Pritzker, Ernest J. Moniz, John Holdern, and Jeffrey Zients. Big data: Seizing opportunities, protecting values. Technical report, Executive Office of the President, The White House, 2014.

Cynthia Rudin. Predictive policing using machine learning to detect patterns of crime. Wired Magazine, August 2013. URL http://www.wired.com/insights/2013/08/ predictive-policing-using-machine-learning-to-detect-\patterns-of-crime/. Retrieved 4/28/2016.

Alexander L Strehl and Michael L Littman. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems, pages 1417–1424, 2008.

Latanya Sweeney. Discrimination in online ad delivery. Commununications of the ACM, 56(5):44–54, 2013.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 325–333, 2013.

We begin by proving Lemma 1, used in Section 3 to prove the fairness of the FairBandits algorithm.

Proof of Lemma 1. Choose an arbitrary arm i and round t and define indicator variables  X1, . . . , Xni(t)where  Xntakes on the reward of pull n of arm i. By a Chernoff bound, for any  a ≥ 0,


By a union bound over all rounds t, the probability of any true mean ever falling outside of its confidence interval is at most  δ( 6π2�∞t=1 1t2 ) = δ.

Next, we prove Lemma 2, which we used in Section 3 to bound the regret of FairBandits in Theorem 2.

Proof of Lemma 2. Let X1, ..., Xtbe indicator variables of whether i was pulled at each time  t′ ∈ [t].Let  Mt = �t′≤t Xt′, with E [Mt] = pt. For any ϵ ∈ [0,1], a standard additive Chernoff bound states that


Since  i ∈ St, it must be that  i ∈ St′ for all t′ ≤ t and all i ∈ St, by the definition of FairBandits. Thus,  P [Xi = 1] ≥ 1k for any i ∈ St, and therefore  pt ≥ tk. so this also implies that


Setting  ϵt =


as desired. Then, taking a union bound over all active arms of which there are at most k, the claim follows.

Proof of Lemma 3. This follows from the definition of  ℓti, uti and the lower bound on  nti providedby the assumption of the lemma.

A.1 Missing Derivation of R(T) for Theorem 2


where the final step follows from  δ ≤ 1√T .

All lemmas in this section are used in Section 4 to prove the fair lower bound in Theorem 3. The first, Lemma 4, lets us analyze distributions over payoffs.

Proof of Lemma 4. Let Rirepresent the joint distribution on rewards for either experiment: in both cases, the joint distribution on rewards is identical, since the process which generates them is the same.

We will use the notation  m, d1, . . . , dt to represent some fixed realization of the random variables µi, r1i , . . . , rti and µ′i, r1i , . . . , rti. In particular, it suffices to show that

P(µi,r1i ,...,rti)∼W�(µi, r1i , . . . , rti) = (m, d1, . . . , dt)�= P(µ′i,r1i ,...,rti)∼W ′�(µ′i, r1i , . . . , rti) = (m, d1, . . . , dt)�.

The first experiment which generates (µi, r1i , . . . , rti) according to W has probability mass on this particular value of its random variables:

P(µi,r1i ,...,rti)∼W�(µi, r1i , . . . , rti) = (m, d1, . . . , dt)�= Pµi∼Pi [µi = m] · Pr1i ,...,rti∼B(µi)�(r1i , . . . , rti) = (d1, . . . , dt)�

The second experiment has joint probability:


where equality follows from Bayes’ Rule.

Next, we prove Lemma 5, used to reason about distinguishing between arms.

Proof of Lemma 5. Since neither  i nor i + 1 is√δ-distinguished by  ht, for any αi ∈ {13 + i3k, 13 +


which completes the proof.

Finally, we prove Lemma 6, which lets us reason about how fair algorithm choices depend on histories.

Proof of Lemma 6. We will define a set of histories which cause A to play some pair of arms i and i + 1 with different probabilities when  µi = µi+1.Define the set unfair(A, µ) such that ht ∈ unfair(A, µ) if there exist  i ∈ [k − 1], t′ ∈ [t] such that  µi = µi+1 but πt′i|ht′ ̸= πt′i+1|ht′.

Consider some  ht which has not√2δ-distinguished any arm, such that there exists some  i, t′ forwhich  πt′i|ht′ ̸= 1k. Then, in particular, there exists some  i ∈ [k −1] such that  πt′i|ht′ ̸= πt′i+1|ht′. ByLemma 5, for all i and in particular this i, it is the case that 2δ < Pµ′∼P|ht�µ′i = µ′i+1�= X and so


where the first equality comes from the fact that  ht is a history for which  πt′i|ht ̸= πt′i+1|ht and thesecond equality from the definition of the set unfair.

We will show that Equation 4 cannot hold with probability more than 12 over the draw of  µ, htfrom the underlying distribution, or else A would not satisfy  δ-fairness. Since  A is δ-fair, for any fixed  µ


and therefore for any distribution  P over µ that


Lemma 4 implies also  δ ≥ Pµ∼P,ht∼A|µ,µ′∼P|ht�ht ∈ unfair(A, µ′)�, so by Markov’s inequality



However, Equation 4 shows this does not hold for any  ht which does not√2δ-distinguish any arm but for which  πt′i|ht′ ̸= 1k for some i ∈ [k], t′ ≤ t. Thus, for at least  12 of all probability mass over histories, either  πt′i|ht′ = 1k for all i, t′ ≤ t, or ht must√2δ-distinguish some arm.

We begin by proving two results related to KWIKToFair. The first, Lemma 8, was used in Section 5 to prove that  KWIKToFair is δ-fair in Theorem 5.

Proof of Lemma 8. We will refer to a violation of either (a) or (b) as a failure of learner  Li.For each  Li, the set of queries asked of it are pairs (hi, xti), histories along with new contexts. There are at most T contexts queried, and at most T histories on which  Liis queried for a fixed run of our algorithm (namely, prefixes of  Li’s final history). Thus, there are at most  T 2queries for  Li.Thus, by a union bound over these  T 2 queries for learner  Li, by the KWIK guarantee,  P [Lifails in some round]  ≤ T 2δ∗ = min(δ, 1T )/k, and by a union bound over k arms, P [A learner fails in a round]  ≤ min(δ, 1T ).

We proceed to Theorem 4, used in Section 5 to construct a  δ-fair algorithm with quantified regret from KWIK learners.

Proof of Theorem 4. We use repeated calls to  KWIKToFair (δ, T) to run for an indefinite number of rounds. Specifically, we will make calls  E = 1, 2, . . . to KWIKToFair (6δ/π(log(T)2, 2E). Wewill refer to each such call to KWIKToFair by its epoch E. By Lemma 8, each epoch E is 6δ/πE2k-fair, i.e. has a 6δ/πE2kprobability of violating fairness. Therefore by a union bound across epochs, the probability of ever violating fairness through repeated calls to KWIKToFair is bounded above by �∞E=1 6δ(πE)2 = 6δπ2�∞E=1 1E2 = δ, so the overall algorithm is  δ-fair.

Next, by Lemma 9 each epoch E contributes at most regret 3  · 2Ekϵ∗E where ϵ∗E denotes the value of  ϵ∗ used in epoch  E, i.e. ϵ∗E satisfying ϵ∗E = k ·m(ϵ∗E, 6δ/πE2, 2E). Then since each epoch E covers 2E rounds, through round T the algorithm has used fewer than log(T) epochs, and by the doubling trick achieves regret  R(T) < �log(T)E=1 3 · 2Ekϵ∗E = O(Tkϵ∗) = O(k2 · m(ϵ∗, δ∗)).

Next, we address the special subcase of KWIKToFair for linear functions outlined in Corollary 1.

Proof of Corollary 1. By Lemma 10, for each arm j the associated learner  Ljhas mistake bound m(ϵ, δ) = ˜O(d3/ϵ4). Since ϵ∗ satisfies ϵ∗ = k · m(ϵ∗, δ)/T we get ϵ∗ =�kd3T �1/5Substituting this

into Theorem 4, the overall regret guarantee satisfies regret  R(T) = O(k2 · m(ϵ∗, δ∗)) = O(Tkϵ∗) =O(T 4/5k6/5d3/5).

This brings us to the formal algorithm description of ConjunctionBandit and its corresponding regret bound, used in Section 6.1 as an example of an unfair learning algorithm for conjunctions.



4: St ← ∅ ▷Initialize active set of arms 5: for j = 1, 2, . . . , k do


13: Pull arm j∗ ← (x ∈R St) ▷Pull arm from active set at random

We can now upper bound the regret achieved by ConjunctionBandit.


Proof of Lemma 13. First, we claim that for every j, for the duration of the algorithm, that  Cj ⊆C∗j , where Cjis the true set of variables corresponding to  fj. This holds at initialization:  Cj ⊆ [d] =C∗j . Suppose the claim holds prior to round  t: if C∗j is updated in this round, then  fj(xtj) = 1 ⇒∀m ∈ [d] : xtj,m = 0, m /∈ Cj. Thus, C∗j = C∗j \{m : xtj,m = 0} = C∗j \{m : xtj,m = 0∩m /∈ Cj} ⊃ Cj.

Therefore, the algorithm never makes false positive mistakes: in any round  t, j ∈ St ⇒ fj(xtj) =1. Therefore ConjunctionBandit only accumulates regret in rounds where it makes false negative mistakes by predicting that all arms have reward 0 when some arm has reward 1.

Then, we have Regret(x1, . . . , xT ) = �t maxj�fj(xtj)�− E��t fit(xtit)�. We then rewrite the first term as �t maxj�fj(xtj)�= �t I{fj(xtj) = 1 for some j ∈ [k]}and the second term as


where the last inequality follows from  P�j∗ = j | St = ∅ ∧ fj(xtj) = 1�= 1k and the fact that if St = ∅ and fj∗(xtj∗) = 1 then Cj∗loses at least one of d variables, and this loss can therefore occur at most d times for each arm j. Substituting this into the original regret expression then yields


Finally, we prove Lemma 12, which we used in Section 6.1 to translate between fair and KWIK learning on conjunctions.

Proof of Lemma 12. We mimic the structure of the proof of Theorem 6, once again using FairToKWIK to construct a KWIK learner B by running the given fair algorithm A on a constructed bandit instance for each context  xt.

There are two primary modifications for the specific case of conjunctions: as conjunctions output either 0 or 1 we set  ϵ = 0, ϵ∗ = 1, and δ∗ = δ2T . Atherefore runs on 2T histories and contexts, either of form (xt, x(0) = 0) or (xt, x(1) = 1). Since we initialize A to be δ∗-fair, if we fix history  htalong with context and arm assignment (xt, x(ℓ)) then, with probability at least 1  − δ∗, pt,ℓ1 > pt,ℓ2implies  f∗(xt) > ℓ/2 and similarly  pt,ℓ2 > pt,ℓ1 implies f∗(xt) < ℓ/2. Union bounding over all such t and ℓyields that A satisfies this fairness over all  t and ℓwith probability at least 1  − δ, and wecondition on this event for the rest of the proof.

We proceed to prove that the resulting KWIK learner  B is ϵ-accurate. Here, as  ϵ = 0, thisrequires showing that all of B’s numerical predictions are correct. Assume instead that B outputs an incorrect prediction on (xt, x(ℓ)). By the construction of FairToKWIK, a prediction from B implies that at least one of  pt,01 , pt,11 , pt,02 and pt,12is distinct from the others. We condition on this distinctness to get two cases. In the first case,  pt,ℓ1 ≤ pt,ℓ2 for both ℓ = 0 and 1. By distinctness, this means that either pt,01 < pt,02or pt,11 < pt,12 . By the fairness assumption, this respectively implies that  f∗(xt) < f(x(0)) = 0 orf∗(xt) < f(x(1)) = 1. In either event, f∗x(t) = 0 = ˆyt. In the second case,  pt,ℓ1 > pt,ℓ2for at least one of  ℓ = 0 or 1. pt,11 > pt,12violates the fairness assumption on A as f(x(1)) = 1, so it must be that  pt,01 > pt,02 . Fairness then implies that  f∗(xt) = 1 = ˆyt. Therefore  B is ϵ-accurate.

It remains to upper bound  m(ϵ, δ). Any round where  B outputs ⊥means a choice between two contexts, one of which has a difference of 1 between arms. It follows that choosing randomly between both arms and contexts incurs expected regret 1/4. Therefore m(ϵ,δ)4 < R(m(ϵ, δ), δ∗, d) =R(m(ϵ, δ), δ2T , d).

Designed for Accessibility and to further Open Science