Fairness in Learning: Classic and Contextual Bandits

2016·arXiv

Abstract

1 Introduction

Automated techniques from statistics and machine learning are increasingly being used to make decisions that have important consequences on people’s lives, including hiring [Miller, 2015], lending [Byrnes, 2016], policing [Rudin, 2013], and even criminal sentencing [Barry-Jester et al., 2015]. These high stakes uses of machine learning have led to increasing concern in law and policy circles about the potential for (often opaque) machine learning techniques to be discriminatory or unfair [Coglianese and Lehr, 2016, Barocas and Selbst, 2016]. Moreover, these concerns are not merely hypothetical: Sweeney [2013] observed that contextual ads for public record services shown in response to Google searches for stereotypically African American names were more likely to contain text referring to arrest records, compared to comparable ads shown in response to searches for stereotypically Caucasian names, which showed more neutral text. She confirmed that this was not because of stated preferences of the advertisers, but rather the automated outcome of Google’s targeting algorithms. Despite the recognized importance of this problem, very little is known about technical solutions to the problem of “unfairness”, or the extent to which “fairness” is in conflict with the goals of learning.

In this paper, we consider the extent to which a natural fairness notion is compatible with learning in a general setting (the contextual bandit setting), which can be used to model many of the applications mentioned above in which machine learning is currently employed. In this model, the learner is a sequential decision maker, which must choose at each time step t which decision to make, out of a finite set of k choices (for example, which of k loan applicants – potentially from different populations or racial groups – to give a loan to). Before the learner makes its decision at round t, it observes some for each choice of arm could, for example, represent the contents of the loan application of an individual from population j at round t). When the learner chooses arm j at time t, it obtains a stochastic whose expectation is determined by some unknown function of the context: ). The goal of the learning algorithm is to maximize its expected reward – i.e. to approximate the optimal policy, which at each round, chooses arm j to maximize . The difficulty in this task stems from the unknown functions which map contexts to rewards; these functions must be learned. Despite this, there are many known algorithms for learning the optimal policy (in the absence of any fairness constraint).

1.1 Fairness and Learning

Our notion of individual fairness is very simple: it states that it is unfair to preferentially choose one individual (e.g. for a loan, a job, admission to college, etc.) over another if he or she is not as qualified as the other individual. This definition of fairness is apt for our setting, since in contextual learning, the quality of an arm is clear: its expected reward. We view different arms j as representing different populations (e.g. different ethnic groups, cultures, or other divisions within society), and view the context as representing information about a particular individual from that population. Each population has its own underlying function which maps contexts to expected payoff. At each time step t, the algorithm is asked to choose between specific members of each population, represented by the contexts . The quality of an individual is thus exactly ). Our fairness condition translates thus: for any pair of arms

), then an algorithm is said to be discriminatory if it preferentially chooses the lower quality arm . Said another way, an algorithm is fair if it guarantees the following: with high probability, over all rounds t, and for all pairs of arms algorithm chooses arm j with probability at least that with which it chooses arm

It is worth noting that this definition of fairness (formalized in the preliminaries) is entirely consistent with the optimal policy, which can simply choose at each round to play uniformly at random from the arms arg maxwhich maximize the expected reward. This is because – it seems – the goal of fairness as enunciated above is entirely consistent with the goal of maximizing expected reward. Indeed, the fairness constraint exactly states that the algorithm cannot favor low reward arms!

Our main conceptual result is that this intuition is incorrect in the face of unknown reward functions. Even though the constraint of fairness is consistent with implementing the optimal policy, it is not necessarily consistent with learning the optimal policy. We show that fairness always has a cost, in terms of the achievable learning rate of the algorithm. For some problems, the cost is mild, but for others, the cost is large.

1.2 Our Results

We divide our results into two parts. First, we study the classic stochastic multi-armed bandit problem [Lai and Robbins, 1985, Katehakis and Robbins, 1995]. In this case, there are no contexts, and each arm i has a fixed but unknown average reward . Note that this is a special case of the contextual bandit problem in which the contexts are the same every day. In this setting, our fairness constraint specializes to require that with probability 1 , for any pair of arms i, j for which , at no round t does the algorithm play arm j with probability higher than that with which it plays arm i. Note that even this special case models interesting scenarios from the point of view of fairness in learning. It models, for example, the case in which choices are made by a loan officer after applicants have been categorized into k internally indistinguishable equivalence classes based on their applications.

Without a fairness constraint, it is known that it is possible to guarantee non-trivial regret to the optimal policy after only T = O(k) many rounds [Auer et al., 2002]. In Section 3, we give an algorithm that satisfies our fairness constraint and is able to guarantee non-trivial regret after ) rounds. We then show in Section 4 that it is not possible to do better – any fair learning algorithm can be forced to endure constant per-round regret for Thus, we tightly characterize the optimal regret attainable by fair algorithms in this setting, and formally separate it from the regret attainable by algorithms absent a fairness constraint. Note that this already shows a separation between the best possible learning rates for contextual bandit learning with and without the fairness constraint – the stochastic multi-armed bandit problem is a special case of every contextual bandit problem, and for general contextual bandit problems, it is also known how to get non-trivial regret after only T = O(k) many rounds [Agarwal et al., 2014, Beygelzimer et al., 2011, Chu et al., 2011].

We then move on to the general contextual bandit setting and prove a broad characterization result, relating fair contextual bandit learning to KWIK learning [Li et al., 2011]. The KWIK model, which stands for Knows What it Knows and has a close relationship with reinforcement learning, is a model of sequential supervised classification in which the learning algorithm must be confident in its predictions. Informally, a KWIK learning algorithm receives a sequence of unlabeled examples, whose true labels are defined by some unknown function in a class C. For each example, the algorithm may either predict a label, or announce “I Don’t Know”. The KWIK requirement is that with high probability, for each example, if the algorithm predicts a label, then its prediction must be very close to the true label. The quality of a KWIK learning algorithm is characterized by its “KWIK bound”, which provides an upper bound on the maximum number of times the algorithm can be forced to announce “I Don’t Know”. For any contextual bandit problem (defined by the set of functions C from which the payoff functions may be selected), we show that the optimal learning rate of any fair algorithm is determined by the best KWIK bound for the class C. We prove this constructively – we give a reduction showing how to convert a KWIK learning algorithm into a fair contextual bandit algorithm in Section 5, and vice versa in Section 6. Both reductions show that the KWIK bound of the KWIK algorithm is polynomially related to the regret of the fair algorithm.

This general connection has immediate implications, because it allows us to import known results for KWIK learning [Li et al., 2011]. For example, it implies that some fair contextual bandit problems are easy, in that there are fair algorithms which can obtain non-trivial regret guarantees after polynomially many rounds. This is the case, for example, for the important linear special case in which the contexts are real valued vectors and the unknown functions are linear: . In this case, the KWIK-learnability of noisy linear regression problems [Strehl and Littman, 2008, Li et al., 2011] implies that we can construct a fair contextual bandit algorithm whose per-round regret is polynomial in d. Conversely, it also implies that some contextual bandit problems which are easy without the fairness constraint become hard once we impose the fairness constraint, in that any fair algorithm must suffer constant per-round regret for exponentially many rounds. This is the case, for example, when the context consists of boolean vectors and the unknown functions – the “and”s of some unknown set of features. The impossibility of non-trivial KWIK-learning of conjunctions [Li, 2009, Li et al., 2011] implies that no fair learner in the contextual bandit setting can achieve non-trivial regret before exponentially many (in d) rounds.

1.3 Other Related Work

Several papers study the problem of fairness in machine learning. One line of work aims to give algorithms for batch classification which achieve group fairness otherwise known as equality of outcomes, statistical parity – or algorithms that avoid disparate impact (see e.g. Calders and Verwer [2010], Luong et al. [2011], Kamishima et al. [2011], Feldman et al. [2015], Fish et al. [2016] and Adler et al. [2016] for a study of auditing existing algorithms for disparate impact). While statistical parity is sometimes a desirable goal – indeed, it is sometimes required by law – as observed by Dwork et al. [2012] and others, it suffers from two problems. First, if different populations indeed have different statistical properties, then it can be at odds with accurate classification. Second, even in cases when statistical parity is attainable with an optimal classifier, it does not prevent discrimination at an individual level – see Dwork et al. [2012] for a catalog of ways in which statistical parity can be insufficient from the perspective of fairness. In contrast, we study a notion aimed at guaranteeing fairness at the individual level.

Our definition of fairness is most closely related to that of Dwork et al. [2012], who proposed and explored the basic properties of a technical definition of individual fairness formalizing the idea that “similar individuals should be treated similarly”. Specifically, their work presupposes the existence of a task-specific metric on individuals, and proposes that fair algorithms should satisfy a Lipschitz condition with respect to this metric. Our definition of fairness is similar, in that the expected reward of each arm is a natural metric through which we define fairness. The main conceptual distinction between our work and Dwork et al. [2012] is that their work operates under the assumption that the metric is known to the algorithm designer, and hence in their setting, the fairness constraint binds only insofar as it is in conflict with the desired outcome of the algorithm designer. The most challenging aspect of this approach (as they acknowledge) is that it requires that some third party design a “fair” metric on individuals, which in a sense encodes much of the relevant challenge. The question of how to design such a metric was considered by Zemel et al. [2013], who study methods to learn representations that encode the data, while obscuring protected attributes. Our fairness constraint, conversely, is entirely aligned with the goal of the algorithm designer in that it is satisfied by the optimal policy; nevertheless, it affects the space of feasible learning algorithms, because it interferes with learning an optimal policy, which depends on the unknown reward functions.

At a technical level, our work is related to Amin et al. [2012] and Amin et al. [2013], which also relate KWIK learning to bandit learning in a different context, unrelated to fairness (when the arm space is very large).

2 Preliminaries

We study the contextual bandit setting, which is defined by a domain X, a set of “arms” [k] := {1, . . . , k} and a class C of functions of the form 1]. For each arm j there is some function , unknown to the learner. In rounds t = 1, . . . , T, an adversary reveals to the algorithm a for each arm. An algorithm A then chooses an arm , and observes stochastic reward for the arm it chose. We assume ), for some distribution

Let Π be the set of policies mapping contexts to distributions over arms optimal policy which selects a distribution over arms as a function of contexts to maximize the expected reward of those arms. The pseudo-regret of an algorithm A on contexts

defined as follows, where ’s distribution on arms at round t:

We hereafter refer to this as the regret of A. The optimal policy pulls arms with highest expectation at each round, so:

We say that A satisfies regret bound

Let the history be a record of 1 rounds experienced by 3-tuples which encode for each t the realization of the contexts, arm chosen, and reward observed. We write to denote the probability that A chooses arm j after observing contexts . For notational simplicity, we will often drop the superscript t on the history when referring to the distribution over arms:

We now define what it means for a contextual bandit algorithm to be -fair with respect to its arms. Informally, this will mean that A will play arm i with higher probability than arm j in round t only if i has higher mean than ], and in all rounds t.

if, for all sequences of contexts and all payoff distributions , with probability at least 1 over the realization of the history h, for all rounds ] and all pairs of arms

Remark 1. Definition 1 prohibits favoring lower payoff arms over higher payoff arms. One relaxed definition only requires that ) – requiring only identical individuals (concerning expected payoff) be treated identically. This relaxation is a special case of Dwork et al. [2012]’s proposed family of definitions, which require that “similar individuals be treated similarly”. We use Definition 1 as it is better motivated in its implications for fair treatment of individuals, but all of our results – including our lower bounds – apply also to this relaxation.

KWIK learning Let B be an algorithm which takes as input a sequence of examples and when given some , outputs either a prediction ˆ1] or else outputs ˆrepresenting “I don’t know”. When ˆreceives feedback an ()-KWIK learning algorithm for 1], with KWIK bound ) if for any sequence of examples and any target , with probability at least 1

1. Its numerical predictions are accurate: for all

2. B rarely outputs “I Don’t Know”:

2.1 Specializing to Classic Stochastic Bandits

In Sections 3 and 4, we study the classic stochastic bandit problem, an important special case of the contextual bandit setting described above. Here we specialize our notation to this setting, in which there are no contexts. For each arm ], there is an unknown distribution with unknown mean . A learning algorithm A chooses an arm , and observes the reward for the arm that it chose. Let ] be the arm with highest expected reward: . The pseudo-regret of an algorithm is now just:

Let denote a record of the 1 rounds experienced by the algorithm so far, represented by 1 2-tuples encoding the previous arms chosen and rewards observed. We write to denote the probability that A chooses arm j given history . Again, we will often drop the superscript t on the history when referring to the distribution over arms:

-fairness in the classic bandit setting specializes as follows:

-fairness in the classic bandits setting)if, for all distributions with probability at least 1 over the history

3 Fair Classic Stochastic Bandits: An Algorithm

In this section, we describe a simple and intuitive modification of the standard UCB algorithm [Auer et al., 2002], called FairBandits, prove that it is fair, and analyze its regret bound. The algorithm and its analysis highlight a key idea that is important to the design of fair algorithms in this setting: that of chaining confidence intervals. Intuitively, as a -fair algorithm explores different arms it must play two arms with equal probability until it has sufficient data to deduce, with confidence 1 , either that or vice versa. FairBandits does this by maintaining empirical estimates of the means of both arms, together with confidence intervals around those means. To be safe, the algorithm must play the arms with equal probability while their confidence intervals overlap. The same reasoning applies simultaneously to every pair of arms. Thus, if the confidence intervals of each pair of arms overlap for each ], the algorithm is forced to play all arms j with equal probability. This is the case even if the confidence intervals around arm are far from overlapping – i.e. when the algorithm can be confident that

This approach initially seems naive: in an attempt to achieve fairness, it seems overly conservative when ruling out arms, and can be forced to play arms uniformly at random for long periods of time. This is reflected in its regret bound, which is only non-trivial after , whereas the UCB algorithm [Auer et al., 2002] achieves non-trivial regret after T = O(k) rounds. However, our lower bound in Section 4 shows that any fair algorithm must suffer constant per-round regret for rounds on some instances.

We now give an overview of the behavior of FairBandits. At every round t, FairBandits identifies the arm that has the largest upper confidence interval amongst the active arms. At each round j are in the same component of the transitive closure of the linked relation. FairBandits plays uniformly at random among all active arms chained to arm

Initially, the active set contains all arms. The active set of arms at each subsequent round is defined to be the set of arms that are chained to the arm with highest upper confidence bound at the previous round. The algorithm can be confident that arms that have become unchained to the arm with the highest upper confidence bound at any round have means that are lower than the means of any chained arms, and hence such arms can be safely removed from the active set, never to be played again. This has the useful property that the active set of arms can only shrink: at any round ; see Figure 1 for an example of active set evolution over time.

We first observe that with probability 1 , all of the confidence intervals maintained by ) contain the true means of their respective arms over all rounds. We prove this claim, along with all other claims in this section without proofs, in Appendix A.

Lemma 1. With probability at least 1 , for every arm

The fairness of FairBandits follows almost immediately from this guarantee.

Proof. By Lemma 1, with probability at least 1all confidence intervals contain their true means across all rounds. Thus, with probability 1, at every round t, for every , it must be that – the arms not in the active set have strictly smaller means than those in the active set; if not, would be chained to is. Finally, all arms in played uniformly at random – but since all such arms are played with the same probability, this does not cause the fairness constraint to bind for any pair , for any realization of which lie within their confidence intervals.

Next, we upper bound the regret of FairBandits.

has regret

Figure 1: Confidence intervals over time for the lower bound instance outlined in Section 4 for k = 10. Lines correspond to upper and lower confidence bounds for each arm and cut off at the round in which the arm leaves the active set.

Remark 2. Before proving Theorem 2, we highlight two points. First, this bound becomes non-trivial (i.e. the average per-round regret is ). As we show in the next section, it is not possible to improve on this. Second, the bound may appear to have suboptimal dependence on T when compared to unconstrained regret bounds (where the dependence on T is often described as logarithmic). However, it is known that Ωeven in the unrestricted setting (without fairness) if one does not make data-specific assumptions on an instance [Bubeck and Cesa-Bianchi, 2012] (e.g. that there is a lower bound on the gap between the best and second best arm). It would be possible to state a logarithmic dependence on T in our setting as well while making assumptions on the gaps between arms, but since our fairness constraint manifests itself as a cost that depends on k, we choose for clarity to avoid such assumptions. Without such assumptions, our dependence on T is also optimal.

We now prove Theorem 2. Lemma 2 upper bounds the probability any arm i active in round t has been pulled substantially fewer times than its expectation, i.e. . Lemma 3 upper bounds the width of any confidence interval used by ), conditioned on i being pulled the number of times guaranteed by Lemma 2. Finally, we stitch this together to prove Theorem 2 by upper bounding the total regret incurred for T rounds by noticing that the regret of any arm active in round t is at most

We begin by lower bounding the probability that any arm active in round t has been pulled substantially fewer times than its expectation.

Lemma 2. With probability at least 1

for all (for all active arms in round t).

We now use this lower bound on the number of pulls of active arm i in round t to upper-bound ), an upper bound on the confidence interval width FairBandits uses for any active arm i in round t.

Lemma 3. Consider any round t and any arm . Condition on

Finally, we prove the bound on the total regret of the algorithm, using the bound on the width of any active arm’s confidence interval in round t provided by Lemma 3.

Proof of Theorem 2. We condition on . This occurs with probability at least 1 , by Lemma 1. We claim that this implies that arm with highest expected reward is always in the active set. This follows from the fact that , it must be that . Thus, this holds for , the arm with highest upper confidence bound in round must be chained to

We further condition on the event that for all j, t,

which holds with probability at least 1 by Lemma 2 and a union bound over all times t. This implies that, for all rounds t, for every active arm , Lemma 3 applies, and therefore

Finally, we upper-bound the per-round regret of pulling any active arm is active, any is chained to arm . Since all active arms have confidence interval width at most must be chained using at most k arms’ confidence intervals, we have that

Since , it follows that . Finally, summing up over all rounds , we know that

where this bound is derived in Appendix A.1.

4 Fair Classic Stochastic Bandits: A Lower Bound

We now show that the regret bound for FairBandits has an optimal dependence on k: no fair algorithm has diminishing regret before ) rounds. All missing proofs are in Appendix B. The main result of this section is the following.

Theorem 3. There is a distribution P over k-arm instances of the stochastic multi-armed bandit problem such that any fair algorithm run on P experiences constant per-round regret for at least

rounds.

Despite the fact that regret is defined in a prior-free way, the proof of Theorem 3 proceeds via Bayesian reasoning. We construct a family of lower bound instances such that arms have payoffs drawn from Bernoulli distributions, denoted . So, to specify a problem instance, it suffices to specify a mean for each of . The proof formalizes the following outline.

1. We define an instance distribution over means (Definition 3). P will have two important properties. First, we will draw means from P such that for any with probability at least 1/4. Second, for any realization of means drawn from P, if an algorithm plays uniformly at random over [k], it will suffer constant per-round regret.

2. We treat distribution over mean , and analyze the posterior distribution ) over means that results after applying Bayes’ rule to the payoff observations made by the algorithm. Bayes’ rule implies (Lemma 4) the joint distribution over rewards and means drawn from P is identical to the distribution which first draws means according to P, then draws rewards conditioned on those means, and finally resamples the means from the posterior distribution on means. Thus, we can reason about fairness (a frequentist quantity) by analyzing the Bayesian posterior distribution on means conditioned on the observed rewards.

3. A -fair algorithm, for any set of means realized from the instance (prior) distribution, must not play arm i + 1 with lower probability than arm , except with probability . By the above change of perspective, therefore, any -fair algorithm must play arms i and i+1 with equal probability until the posterior distribution on means given observed rewards, satisfies (Lemmas 5 and 6).

4. We finally lower bound the number of reward observations necessary before the posterior distribution on means given payoffs is such that for any pair of adjacent arms i, i + 1. We show that this is Ω() (Lemma 7). Since fair algorithms must play from among the k arms uniformly at random until this point, with high probability, no arm accumulates sufficiently many reward observations until ) rounds of play.

We begin by describing our distribution over instances. Each arm i’s payoff distribution will be Bernoulli with mean independently of each other arm.

Definition 3 (Prior Distribution over For each arm is distributed according to the distribution with the following probability mass function:

We treat P as a prior distribution over instances, and analyze the posterior distribution on instances given the realized rewards. Lemma 4 justifies this reasoning.

Lemma 4. Consider the following two experiments: In the first, let denote the joint distribution on (. In the second, let , and then re-draw the mean from its posterior distribution given the rewards. Let (are identical distributions.

Next, we lower-bound the number of reward observations necessary such that for some with respect to the posterior. It will be useful to refer to an algorithm’s histories as distinguishing the mean of an arm given that history with high probability.

-distinguishing). We will say if, for some

The next lemma shows that if no arm is-distinguished by a history, all pairs of arms i, i + 1 have posterior probability strictly greater than of having equal means.

Lemma 5. Suppose A has history -distinguish any arm i. Then, for all arms i, i + 1,

Now, we prove that for any fair algorithm, with probability over the draw of histories must-distinguish some arm, or the algorithm must play uniformly across all k arms conditioned on

Lemma 6. Suppose an algorithm -fair. Then:

We now lower-bound the number of observations from arm i which are required to -distinguish it.

as in Definition 3. Then, arm -distinguishable by is the number of times arm i is played.

to represent the two possible realizations that might take, when drawn from the distribution over instances given in Definition 3. Let A represent the event that and B the event that throughout.

Fix a history represent the number of observations of arm i’s reward. We will abuse notation and use to refer to the payoff sequence of arm i observed in history therefore a binary sequence of length denote the number of 1s in the sequence. We will calculate conditions under which will imply that either

holds, implying that one of A or B has posterior probability at least 1 , conditioned on the observed rewards. If neither of these is the case, -distinguished by

We begin by rearranging our definition of this ratio ],which follows from Bayes’ rule and the fact that P [A] = P [B]. We wish to upper and lower bound X in terms of value. By definition of the Bernoulli distribution, we have that

We now calculate under what conditions either (a) . One of these must hold if -distinguished. Before we do so, we mention that a Chernoff bound implies that with probability 1 , for events A and B, Equations 1 and 2, respectively:

since the mean of m Bernoulli trials with mean We begin by analyzing case (a), where

Taking logarithms on both sides, we have that

where the inequality follows from ln(1 + ]. Then, this implies that

(3

Multiplying both sides by 1, this implies that

Since , solving for m implies that In case (b), we have

where we used the fact that 1 + . Taking logarithms, this will imply that

whose last inequality comes the range of p. Combining this inequality with Equation 2, this implies

We now have the tools in hand to prove Theorem 3.

-fair algorithm where ; we claim that with probability at least . Since the payoff for uniformly random play is , while the best arm has payoff , in any round all ], the algorithm suffers Ω(1) regret in that round.

Lemma 6 implies that, with probability at least over the distribution over histories (a) -distinguish some arm i. Case (a) implies

our claim. In case (b), Lemma 7 states than an arm -distinguishable only if We now argue that unless ), which will imply our claim for case (b).

Fix some i, t. We lower-bound t for which, with probability at least 1 over histories will be the case that indicator variables of arm i being played in round . Note that for all in all rounds prior to t, we have all arms are played with equal probability. For any are nondecreasing in , an additive Chernoff bound implies

probability 1

5 KWIK Learnability Implies Fair Bandit Learnability

In this section, we show if a class of functions is KWIK learnable, then there is a fair algorithm for learning the same class of functions in the contextual bandit setting, with a regret bound polynomially related to the function class’ KWIK bound. Intuitively, KWIK-learnability of a class of functions guarantees we can learn the function’s behavior to a high degree of accuracy with a high degree of confidence. As fairness constrains an algorithm most before the algorithm has determined the payoff functions’ behavior accurately, this guarantee enables us to learn fairly without incurring much additional regret. Formally, we prove the following polynomial relationship.

Theorem 4. For an instance of the contextual multi-armed bandit problem where all -KWIK learnable with bound achieves regret bound:

First, we construct an algorithm ) that uses the KWIK learning algorithm as a subroutine, and prove that it is -fair. A call to ) will initialize a KWIK learner for each arm, and in each of the T rounds will implicitly construct a confidence interval around the prediction of each learner. If a learner makes a numeric valued prediction, we will interpret this as a confidence interval centered at the prediction with width . If a learner outputs , we interpret this as a trivial confidence interval (covering all of [0, 1]). We use the same chaining technique that we use in the classic stochastic setting. In every round identifies the arm that has the largest upper confidence bound. At each round t, we will say if they are in the same component of the transitive closure of the linked relation. Then, it plays uniformly at random amongst all arms chained to arm Whenever all learners output predictions, they need no feedback. When a learner for is selected then we have feedback to give it; on the other hand, if j isn’t selected, we “roll back” the learning algorithm for j to before this round by not updating the algorithm’s state.

]), receive reward Pick arm at random from all arms

Update the history for

We begin by bounding the probability of certain failures of KWIKToFair in Lemma 8, proven in Appendix C. This in turn lets us prove the fairness of KWIKToFair in Theorem 5.

Lemma 8. With probability at least 1 , for all rounds t and all arms then

Proof of Theorem 5. We condition on both (a) and (b) holding for all arms i and rounds t from Lemma 8, which occur with probability 1 for all arms and all times t. Therefore, we proceed by conditioning on the event that for all arms i and all rounds . Having done so, there are two possibilities for each round t.In case 1, for each i we have that . By the condition above, for any arms i and ) implies that . Since in this case no learner outputs chains to the top arm only if arm i does. Therefore . In case 2, there exists some i such that . Then we choose uniformly at random across all arms, so j.

Thus, with probability at least 1, for each round ) implies that

We now use the KWIK bounds of the KWIK learners to upper-bound the regret of KWIKTo-

achieves regret

Proof. We first condition on the event that both (a) and (b) from Lemma 8 hold for all t, i, which holds with probability 1 ), and bound the regret when they both hold. Choose an arbitrary round t in the execution of ). As above, there are two cases. In the first case, and we choose uniformly at random from the arms chained by -intervals to the arm with the highest prediction. Since we have conditioned on the event that all KWIK learners are correct, . Furthermore, for any , we have that and in particular that . Thus, the regret is at most 2in such a round. In the second case some arm outputs , so we choose randomly from all k arms, and the worst-case regret is 1. Thus, the total regret will be at most 2is the number of rounds in which some

We now upper bound , the number of rounds in which arm . Fix some arm i which outputs rounds. Arm i is played and therefore receives feedback every time it outputs with probability at least 1/k. Thus, using a Chernoff bound, with probability 1 feedback for outputs of in at least has the guarantee that there can be at most ) many such rounds (in which it outputs receives feedback). Thus,

If ), this implies

We now analyze cases in which (1) Case (1) this implies

For c > 4, this leads to contradiction. Thus, in this case, if we set , we know that with probability 1 ) which summing up over all as desired.

In case (2), we have that

which solving for implies that ) by setting a union bound. Thus, there are at most )) rounds in expectation during the execution of ) in which some arm outputs

Combining both cases, the total regret incurred by ) across all T rounds is

Our presentation of ) has a known time horizon T. Its guarantees extend to the case in which T is unknown via the standard “doubling trick” to prove Theorem 4 in Appendix C.

An important instance of the contextual bandit problem is the linear case, where C consists of the set of all linear functions of bounded norm in d dimensions. This captures the natural setting in which the rewards of each arm are governed by an underlying linear regression model on a d-dimensional real valued feature space. The linear case is well studied, and there are known KWIK algorithms [Strehl and Littman, 2008] for the set of linear functions C, which allows us via our reduction to give a fair contextual bandit algorithm for this setting with a polynomial regret bound.

Lemma 10 ([Strehl and Littman, 2008])

Then, an application of Theorem 4 implies that KWIKToFair has a polynomial regret guarantee for the class of linear functions. This proof can be found in Appendix C.

Corollary 1. Let C and X be as in Lemma 10, and using the learner from [Strehl and Littman, 2008] has regret:

6 Fair Bandit Learnability Implies KWIK Learnability

In this section, we show how to use a fair, no-regret contextual bandit algorithm to construct a KWIK learning algorithm whose KWIK bound has logarithmic dependence on the number of rounds T. Intuitively, any fair algorithm which achieves low regret must both be able to find and exploit an optimal arm (since the algorithm is no-regret) and can only exploit that arm once it has a tight understanding of the qualities of all arms (since the algorithm is fair). Thus, any fair no-regret algorithm will ultimately have tight (1)-confidence about each arm’s reward function.

-fair algorithm for the contextual bandit problem over the class of functions C, with regret bound . Suppose also there exists such that for every -KWIK algorithm for C with KWIK bound the solution to

Remark 3. The condition that C should contain a function that can take on values that are multiples of is for technical convenience; C can always be augmented by adding a single such function.

Our aim is to construct a KWIK algorithm B to predict labels for a sequence of examples labeled with some unknown function . To do this, we will run our fair contextual bandit algorithm A on an instance that we construct online as examples . The idea is to simulate a two arm instance, in which one arm’s rewards are governed by (the function to be KWIK learned), and the other arm’s rewards are governed by a function f that we can set to take any value in . For each input , we perform a thought experiment and consider probability distribution over arms when facing a context which forces arm 2’s payoff to take each of the values 0will play arm 1 with weakly higher probability than arm 2 for those ); analogously, A will play arm 1 with weakly lower probability than arm 2 for those ). If there are at least 2 values of for which arm 1 and arm 2 are played with equal probability, one of those contexts will force regret, so we continue the simulation of A on one of those instances selected at random, forcing at least 2 regret in expectation, and at the same time have ) on such a round, which is used to construct feedback for A. Otherwise, A must transition from playing arm 1 with strictly higher probability to playing 2 with strictly higher probability as increases: the point at which that occurs will “sandwich” the value of ’s fairness implies this transition must occur when the expected payoff of arm 2 exceeds that of arm 1. B uses this value to output a numeric prediction.

An important fact we exploit is that we can query A’s behavior on ()), for any without providing it feedback (and instead “roll back” its history to not including the query ())). We update A’s history by providing it feedback only in rounds where B outputs

to get a predicted arm

Use KWIK feedback

Construct feedback for arm 2

’s history is not updated

Proof. For a fixed run of A, we calculate the probability that for all times is the case that this run, A is queried on histories and contexts: prefixes of h along with ()) for each . The fairness of A implies for any fixed )), with probability 1 . Then, by a union bound over , with probability at least 1 )) will satisfy this property for all . We condition on this holding in the remainder of the proof.

We now argue that the numeric predictions of B are correct within an additive

When 1], note that would have output

If , which we have conditioned on implying that either implies

Otherwise, we have that a contradiction, so ˆ-accurate. If neither (a) nor (b), then (c) it must be ˆ1, for some , we know that and therefore

Finally, we upper-bound ), the number of rounds . For each such t, A runs on a random draw of one of two contexts, one of whose arms’ payoffs differ by at least . Thus, for one of those contexts, either . In either case, since for suffers expected regret at least for that context, and at least when faced with one chosen at random. Thus, A’s regret is upper bounded by this quantity over ) rounds (which is an upper bound on the number of rounds for which A is actually run and updated).

6.1 An Exponential Separation Between Fair and Unfair Learning

In this section, we exploit the other direction of the equivalence we have proven between fair contextual bandit algorithms and KWIK learning algorithms to give a simple contextual bandit problem for which fairness imposes an exponential cost in its regret bound. This is in contrast to the case in which the underlying class of functions is linear, for which we gave fair contextual bandit algorithms with regret bounds within a polynomial factor of their unconstrained counterparts. In this problem, the context domain is the d-dimensional boolean hypercube: context each round for each individual consists of d boolean attributes. Our class of functions C is the class of boolean conjunctions:

We first give a simple but unfair algorithm, ConjunctionBandit, for this problem which obtains a regret bound which is linear in d. It maintains a set of candidate variables conjunction ; this set shrinks across rounds, while always containing the true set of variables over which is defined. We denote the boolean value of variable m in the context for arm j in round t by

The formal claim and proof that ConjunctionBandit achieves regret as ConjunctionBandit’s formal description, can be found in Appendix C. ConjunctionBandit violates the fairness in every round t in which it predicts 0 for arm i but 1 for arm j even though

We now show that fair algorithms cannot guarantee subexponential regret in d. This relies upon a known lower bound for KWIK learning conjunctions [Li, 2009]:

Lemma 11. There exists a sequence of examples (such that for (-KWIK learning algorithm B for the class C of conjunctions on d variables must output has a KWIK bound of at least

We then use the equivalence between fair algorithms and KWIK learning to translate this lower bound on ) into a minimum worst case regret bound for fair algorithms on conjunctions. We modify Theorem 6 to yield the following lemma, proven in Appendix C.

-fair algorithm for the contextual bandit problem over the class C of conjunctions on d variables. If A has regret bound (0-KWIK algorithm for C with KWIK bound

Lemma 11 then lets us lower-bound the worst case regret of fair learning algorithms on conjunctions.

-fair algorithm for the contextual bandit problem over the class C of conjunctions on d boolean variables has a worst case regret bound of

. We know then that if 1, Lemma 11 guarantees the existence of a sequence of contexts for which any (0)-KWIK algorithm has KWIK bound

Lemma 12 implies 4) gives a KWIK bound of if

Together with the analysis of ConjunctionBandit, this demonstrates a strong separation between fair and unfair contextual bandit algorithms: when the underlying functions mapping contexts to payoffs are conjunctions on d variables, there exist a sequence of contexts on which fair algorithms must incur regret exponential in d while unfair algorithms can achieve regret linear in d.

References

Philip Adler, Casey Falk, Sorelle A. Friedler, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. Auditing black-box models by obscuring features. CoRR, abs/1602.07043, 2016. URL http://arxiv.org/abs/1602.07043.

Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1638–1646, 2014.

Kareem Amin, Michael Kearns, and Umar Syed. Graphical models for bandit problems. arXiv preprint arXiv:1202.3782, 2012.

Kareem Amin, Michael Kearns, Moez Draief, and Jacob D Abernethy. Large-scale bandit problems and kwik learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 588–596, 2013.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.

Solon Barocas and Andrew D. Selbst. Big data’s disparate impact. California Law Review, 104, 2016. Available at SSRN: http://ssrn.com/abstract=2477899.

Anna Maria Barry-Jester, Ben Casselman, and Dana Goldstein. The new science of sentencing. The Marshall Project, August 8 2015. URL https://www.themarshallproject.org/2015/08/04/the-new-science-of-sentencing. Retrieved 4/28/2016.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 19–26, 2011.

S´ebastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Machine Learning, 5(1):1–122, 2012.

Nanette Byrnes. Artificial intolerance. MIT Technology Review, March 28 2016. URL https://www. technologyreview.com/s/600996/artificial-intolerance/. Retrieved 4/28/2016.

Toon Calders and Sicco Verwer. Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 208–214, 2011.

Cary Coglianese and David Lehr. Regulating by robot: Administrative decision-making in the machine-learning era. Georgetown Law Journal, 2016. Forthcoming.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.

Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 259–268, 2015.

Benjamin Fish, Jeremy Kun, and ´Ad´am D Lelkes. A confidence-based approach for balancing fairness and accuracy. SIAM International Symposium on Data Mining, 2016.

FTC Commisioner Julie Brill. Navigating the “trackless ocean”: Fairness in big data research and decision making. Keynote Address at the Columbia University Data Science Institute, April 2015.

Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 643–650. IEEE, 2011.

Michael N Katehakis and Herbert Robbins. Sequential choice from several populations. PROCEEDINGS-NATIONAL ACADEMY OF SCIENCES USA, 92:8584–8584, 1995.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.

Lihong Li. A unifying framework for computational reinforcement learning theory. PhD thesis, Rutgers, The State University of New Jersey, 2009.

Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it knows: a framework for self-aware learning. Machine learning, 82(3):399–443, 2011.

Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-nn as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 502–510. ACM, 2011.

Clair C Miller. Can an algorithm hire better than a human? The New York Times, June 25 2015. URL http://www. nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html. Retrieved 4/28/2016.

Cecilia Munoz, Megan Smith, and DJ Patil. Big data: A report on algorithmic systems, opportunity, and civil rights. Technical report, Executive Office of the President, The White House, 2016.

John Podesta, Penny Pritzker, Ernest J. Moniz, John Holdern, and Jeffrey Zients. Big data: Seizing opportunities, protecting values. Technical report, Executive Office of the President, The White House, 2014.

Cynthia Rudin. Predictive policing using machine learning to detect patterns of crime. Wired Magazine, August 2013. URL http://www.wired.com/insights/2013/08/ predictive-policing-using-machine-learning-to-detect-\patterns-of-crime/. Retrieved 4/28/2016.

Alexander L Strehl and Michael L Littman. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems, pages 1417–1424, 2008.

Latanya Sweeney. Discrimination in online ad delivery. Commununications of the ACM, 56(5):44–54, 2013.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 325–333, 2013.

A Missing Proofs for the Classic Stochastic Bandits Upper Bound

We begin by proving Lemma 1, used in Section 3 to prove the fairness of the FairBandits algorithm.

Proof of Lemma 1. Choose an arbitrary arm i and round t and define indicator variables where takes on the reward of pull n of arm i. By a Chernoff bound, for any

By a union bound over all rounds t, the probability of any true mean ever falling outside of its confidence interval is at most

Next, we prove Lemma 2, which we used in Section 3 to bound the regret of FairBandits in Theorem 2.

be indicator variables of whether i was pulled at each time Let 1], a standard additive Chernoff bound states that

Since , it must be that , by the definition of FairBandits. Thus, , and therefore . so this also implies that

Setting

as desired. Then, taking a union bound over all active arms of which there are at most k, the claim follows.

Proof of Lemma 3. This follows from the definition of and the lower bound on by the assumption of the lemma.

A.1 Missing Derivation of R(T) for Theorem 2

where the final step follows from

B Missing Proofs for the Classic Stochastic Bandits Lower Bound

All lemmas in this section are used in Section 4 to prove the fair lower bound in Theorem 3. The first, Lemma 4, lets us analyze distributions over payoffs.

represent the joint distribution on rewards for either experiment: in both cases, the joint distribution on rewards is identical, since the process which generates them is the same.

We will use the notation to represent some fixed realization of the random variables . In particular, it suffices to show that

, . . . , r, . . . , d, . . . , r, . . . , d

The first experiment which generates () according to W has probability mass on this particular value of its random variables:

, . . . , r, . . . , d, . . . , r, . . . , d

The second experiment has joint probability:

where equality follows from Bayes’ Rule.

Next, we prove Lemma 5, used to reason about distinguishing between arms.

Proof of Lemma 5. Since neither -distinguished by

which completes the proof.

Finally, we prove Lemma 6, which lets us reason about how fair algorithm choices depend on histories.

Proof of Lemma 6. We will define a set of histories which cause A to play some pair of arms i and i + 1 with different probabilities when Define the set unfair() such that ) if there exist ] such that

Consider some which has not-distinguished any arm, such that there exists some which . Then, in particular, there exists some 1] such that Lemma 5, for all i and in particular this i, it is the case that 2

where the first equality comes from the fact that is a history for which second equality from the definition of the set unfair.

We will show that Equation 4 cannot hold with probability more than over the draw of from the underlying distribution, or else A would not satisfy -fairness. Since -fair, for any fixed

and therefore for any distribution

Lemma 4 implies also , so by Markov’s inequality

However, Equation 4 shows this does not hold for any which does not-distinguish any arm but for which . Thus, for at least of all probability mass over histories, either -distinguish some arm.

C Missing Proofs for the Contextual Bandit Setting

We begin by proving two results related to KWIKToFair. The first, Lemma 8, was used in Section 5 to prove that -fair in Theorem 5.

Proof of Lemma 8. We will refer to a violation of either (a) or (b) as a failure of learner For each , the set of queries asked of it are pairs (), histories along with new contexts. There are at most T contexts queried, and at most T histories on which is queried for a fixed run of our algorithm (namely, prefixes of ’s final history). Thus, there are at most queries for Thus, by a union bound over these queries for learner , by the KWIK guarantee, fails in some round] , and by a union bound over k arms, P [A learner fails in a round]

We proceed to Theorem 4, used in Section 5 to construct a -fair algorithm with quantified regret from KWIK learners.

Proof of Theorem 4. We use repeated calls to ) to run for an indefinite number of rounds. Specifically, we will make calls will refer to each such call to KWIKToFair by its epoch E. By Lemma 8, each epoch E is 6-fair, i.e. has a 6probability of violating fairness. Therefore by a union bound across epochs, the probability of ever violating fairness through repeated calls to KWIKToFair is bounded above by , so the overall algorithm is

Next, by Lemma 9 each epoch E contributes at most regret 3 denotes the value of used in epoch ). Then since each epoch E covers 2rounds, through round T the algorithm has used fewer than log(T) epochs, and by the doubling trick achieves regret

Next, we address the special subcase of KWIKToFair for linear functions outlined in Corollary 1.

Proof of Corollary 1. By Lemma 10, for each arm j the associated learner has mistake bound Substituting this

into Theorem 4, the overall regret guarantee satisfies regret

This brings us to the formal algorithm description of ConjunctionBandit and its corresponding regret bound, used in Section 6.1 as an example of an unfair learning algorithm for conjunctions.

Initialize active set of arms

Pull arm from active set at random

We can now upper bound the regret achieved by ConjunctionBandit.

Proof of Lemma 13. First, we claim that for every j, for the duration of the algorithm, that is the true set of variables corresponding to . This holds at initialization: . Suppose the claim holds prior to round is updated in this round, then

Therefore, the algorithm never makes false positive mistakes: in any round 1. Therefore ConjunctionBandit only accumulates regret in rounds where it makes false negative mistakes by predicting that all arms have reward 0 when some arm has reward 1.

Then, we have Regret(. We then rewrite the first term as and the second term as

where the last inequality follows from and the fact that if loses at least one of d variables, and this loss can therefore occur at most d times for each arm j. Substituting this into the original regret expression then yields

Finally, we prove Lemma 12, which we used in Section 6.1 to translate between fair and KWIK learning on conjunctions.

Proof of Lemma 12. We mimic the structure of the proof of Theorem 6, once again using FairToKWIK to construct a KWIK learner B by running the given fair algorithm A on a constructed bandit instance for each context

There are two primary modifications for the specific case of conjunctions: as conjunctions output either 0 or 1 we set therefore runs on 2T histories and contexts, either of form (-fair, if we fix history along with context and arm assignment ()) then, with probability at least 1 implies 2 and similarly 2. Union bounding over all such yields that A satisfies this fairness over all with probability at least 1 condition on this event for the rest of the proof.

We proceed to prove that the resulting KWIK learner -accurate. Here, as requires showing that all of B’s numerical predictions are correct. Assume instead that B outputs an incorrect prediction on ()). By the construction of FairToKWIK, a prediction from B implies that at least one of is distinct from the others. We condition on this distinctness to get two cases. In the first case, . By the fairness assumption, this respectively implies that . In the second case, for at least one of violates the fairness assumption on A as f(x(1)) = 1, so it must be that . Fairness then implies that . Therefore

It remains to upper bound ). Any round where means a choice between two contexts, one of which has a difference of 1 between arms. It follows that choosing randomly between both arms and contexts incurs expected regret 1/4. Therefore

Designed for Accessibility and to further Open Science