b

DiscoverSearch
About
Algorithms for Fairness in Sequential Decision Making
2019·arXiv
2 Fairness Constraints for MDPs
2 Fairness Constraints for MDPs

Preliminaries. A Markov decision process (MDP) is a tuple  M = (S, A, D, P, R, γ), where S = [n] = {1, ..., n} are the states, A = [m] are the actions, D ∈ R|S|is the initial state distribution (i.e.,  Dsis the probability of starting in state  s), P ∈ R|S|×|A|×|S|are the transitions (i.e.,  Ps,a,s′is the probability of transitioning from s to  s′taking action  a), R ∈ R|S|×|A|are the rewards (i.e.,  Rs,ais the reward obtained taking action a in state s), and  γ ∈ Ris the discount factor. Let  π ∈ R|S|×|A|be a stochastic policy (i.e., πs,ais the probability of taking action a in state s). The induced transtions are  P (π) ∈ R|S|×|S|, where P (π)s,s′ = �a∈A πs,aPs,a,s′. The time-discounted state distribution is

image

where

image

and the time-discounted state-action distribution is Λ  ∈ R|S|×|A|, where Λ(π)s,a = D(π)s πs,a. Note that �a πs,a = 1 and �s D(π)s = 1, so �s,aΛs,a = 1. Thecumulative expected reward is

image

where  ⟨X, Y ⟩ = �s∈S�a∈A Xs,aYs,a; we include a normalizing constant of 1  − γto simplify notation, which does not affect the reinforcement learning problem since  R(π)is scaled equally for different policies. Given policy class Π, the optimal policy is π∗ = arg maxπ∈Π R(π).

Fairness. Consider a population of individuals (e.g., loan applicants) interacting with a decision-maker (e.g., a bank). States S encode an individual’s features (e.g., probability of repaying), actions A are interventions (e.g., loan offer), and transitions P encode state changes (e.g., changes in ability to repay). The decision-maker rewards are not always aligned with individual rewards, so we use rewards R to indicate quality of outcomes for the decision-maker (e.g., the bank’s profit), and  individual rewards ρ ∈ R|S|×|A|to indicate quality of outcomes for an individual (e.g., whether a loan is offered). The cumulative expected individual rewards is  ρ(π) = E(s,a)∼Λ(π)[ρs,a].

Our goal is to learn the optimal policy for the decision-maker under a fairness constraint on the individual rewards. In particular, we want to ensure that  πdoes not favor the majority subpopulation over the minority subpopulation. The specific fairness constraint that should be used depends on the problem domain. We show how two constraints from the supervised learning setting can be extended to the MDP setting; as we discuss below, our results are more general.

First, we have the following extension of demographic parity to the MDP setting:

Definition 2.1. Let  ϵ ∈ R+, Mbe an MDP with states  S = Z טS, where Z = {maj, min}, and  ρ ∈R|S|×|A|be the individual rewards. For  z ∈ Z, let

image

be the time-discounted state-action distribution conditioned on starting from an initial state  s0in subpopulation z—i.e.,  s0has the form  s0 = (z, ˜s0) for some ˜s0 ∈˜S. More precisely,

(π)z)s,a = (D(π)z )sπs,a (∀s ∈ S, a ∈ A)

image

where  νis a normalizing constant. Furthermore, let ρ(π)conditioned on starting in subpopulation z is ρ(π)z = E(s,a)∼Λ(π)z[ρs,a]. Then, we say a policy  πsatis- fies  ϵ demographic parityif  |ρ(π)maj − ρ(π)min| ≤ ϵ.

For an individual (˜s, z) ∈ S, zencodes whether they are from the majority (z = maj) or minority (z = min) subpopulation and ˜s encodes their non-sensitive characteristics (e.g., probability of repaying a loan); demographic parity says the cumulative expected individual rewards are equal for the majority and minority subpopulations. Next, we have the following analog of equal opportunity (Hardt et al., 2016):

Definition 2.2. Let  ϵ ∈ R+, let M be an MDP with states  S = Z ×Y טS, where Z = {maj, min} and Y = {qual, unqual}, and let  ρ ∈ R|S|×|A|be the individual rewards. For each  z ∈ Z, let

image

A policy  πis  ϵequal opportunity if  |ρ(π)maj − ρ(π)min| ≤ ϵ.

This property is similar to demographic parity, but where Λ(π)zis restricted to the qualified subpopulation (i.e., y = qual). In other words, this property says that cumulative expected individual rewards are equal on average for qualified members of the majority and minority subpopulations.

Remark 2.3. In general, our algorithms apply to any fairness constraint that two subpopulations should have equal expected outcomes—i.e., for any Smaj, Smin ⊆ S, letting  ρ(π)z = E(s,a)∼Λ(π)z[ρs,a] and Λ(π)z = Λ(π) | I[s0 ∈ Sz], the constraint  |ρ(π)maj − ρ(π)min| ≤ϵ. They also extend to one-sided inequalities and to multiple majority and minority subpopulations. They also extend to batch decisions; see Appendix A.

We focus on demographic parity when describing our algorithms, but our results are general. Letting ΠDP,ϵbe the class of policies satisfying demographic parity, our goal is to compute the optimal policy

image

We primarily focus is on settings where the MDP is known, which includes settings where the decision-maker learns about individuals via their interactions (see example below), but not ones where they learn across individuals. We describe a basic extension to unknown MDPs in Section 5.

Example. We describe an MDP  Mloanthat models individuals applying for loans. We assume each individual has a true probability p of repaying their loan. On step t, the bank has an estimate of the distribution of p (e.g., a credit score); we assume this distribution is a Beta distribution—i.e.,  pt ∼Beta(αt, βt). Thus, the states of our MDP (αt, βt). 2The actions are to offer (a = 1) or deny (a = 0) a loan. If the bank offers a loan, the transitions are

image

(αt+1, βt+1) =

image

If the bank denies the loan, the transitions are (αt+1, βt+1) = (αt, βt). However, since we are interested in detrimental effects of the bank’s decisions, we assume this decision reduces the applicant’s ability to pay for future loans—i.e., (αt+1, βt+1) = (αt, βt + τ), where  τ ∈ R+is a hyperparameter. We assume the initial state distribution is  z ∼Bernoulli(pZ) and (α, β) ∼ p0(α, β | z) for some  pZ ∈[0, 1] and some distribution  p0—i.e., the initial distribution over the parameters  α, βdepends on the whether the applicant is from the majority or minority subpopulation. Note that  p0can additionally be conditioned individual covariates if available. Now, the bank’s rewards are

image

where P is the principal (without loss of generality, we let P = 1), I is interest,  δindicates whether the loan is repaid, and  λ ∈ R+. The first term is expected profit and the second term is to risk aversion. We assume the goal of the bank is to maximize (2).

The individual rewards are I[a = 1], where I is the indicator function—i.e., the reward is 1 if the loan is offered and 0 if it is denied. Then, demographic parity says that loans should be given to majority and minority members with equal frequency (within an  ϵtolerance), and equal opportunity says that loans should be given to qualified majority and minority members at equal rates (we assume an applicant is qualified if their true probability of repaying satisfies  p ≥ p0for some  p0 ∈[0, 1]).

Separable MDPs. We focus primarily on MDPs where the fairness attribute is constant.

Definition 2.4. An MDP with states  S = Z טS is separable if the transitions satisfy  P(z,˜s),a,(z′,˜s′) =δz,z′˜P˜s,a,˜s′, where  δz,z′ = I[z = z′] and ˜P ∈R| ˜S|×|A|×| ˜S|is a transition matrix.

That is, the transitions do not affect z, so the sensitive attribute  z ∈ Zdoes not change over time. This property is satisfied by many sensitive attributes (e.g., race and gender). Fairnes properties may not make sense when the sensitive attribute can change.

Existence and determinism. Unconstrained MDPs always have a deterministic optimal policy (Sutton and Barto, 2018); however, with a fairness constraint, this result may not hold:

Theorem 2.5. There exists  ϵ >0 and an MDP M such that ΠDP,ϵ = ∅. There exists  ϵ >0 and an MDP M such that  π∗in (1) is not deterministic.

We give a proof in Appendix B. For the following special case, we can prove existence of fair policies:

Definition 2.6. We say  ρis state-independent if for some ˜ρ ∈ R|A|, we have  ρs,a = ˜ρafor all  s ∈ S.

Intuitively, this property captures settings where the decision-maker uses the state to choose actions (e.g., ability to repay), but the outcomes for the individuals only depend on whether the preferred action is taken (e.g., a loan offer). Our example  Mloanhas state-independent individual rewards.

Theorem 2.7. If the individual rewards are state-independent, then (1) has a solution.

Proof. Any policy  πsuch that  πs,a = ˜πafor all  s ∈ Sand some ˜π ∈ R|A|, satisfies  π ∈ ΠDP.

Comparison to supervised learning. Our fairness definitions are natural generalizations of their counterparts for supervised learning. For example, in the supervised learning setting, demographic parity says that majority and minority members should, on average, be given positive outcomes at equal rates. Our extension to MDPs says that this property should hold on average across time—more precisely, averaged over t ∼Geometric(γ), where  γis the discount factor.

Conversely, our constraint reduces to the supervised learning constraint setting when the state distribution is constant over time—i.e.,  D(π,t) = Dis independent of t and  π. To see this claim, note that a constant state distribution implies that  D(π) = D, so the state-action distribution is simply Λ(π)s,a = Dsπs,a, and our MDP demographic parity constraint reduces to

image

In other words, the policy  πshould equalize the expected individual rewards for the majority and minority subpopulations on the initial (constant) state distribution. Finally, assuming the individual rewards are  ρs,a = 1 for a positive outcome and ρs,a = 0 oth-erwise, then our constraint is equivalently

image

where ˆy = ρs,ais the outcome, which is demographic parity for supervised learning (Hardt et al., 2016).

Additionally, we introduce individual rewards  ρ, which may differ from the decision maker rewards R. This distinction also appears in the supervised learning setting if the loss function for the decision maker (used in the learning objective) differs from the loss function of the individual (used in the fairness constraint). For example, R may differ from  ρif the decision maker is risk-averse; then, the decision maker may offer too few loans to minorities if there is less historical information available for minorities. We believe this distinction is particularly important to explicitly model in the MDP setting, since dynamical effects can magnify the negative consequences of unfair decision making.

Importance of dynamics. Dynamics are important when current decisions do not immediately cause unfairness, but can affect the state distribution in a way that leads to unfair outcomes in the future. In our loan applicant example, there are two effects of decisions on the state distribution. First, there is a direct effect— e.g., denying loans can cause adverse outcomes on an applicant’s financial situation. In  Mloan, this effect is captured by the update  βt+1 = βt + τwhen a = 0— i.e., the applicant’s probability of repaying future loans decreases when they are denied a loan.

The second effect is indirect, and is related to the selective labels problem in sequential decision making (Lakkaraju et al., 2017; Bechavod et al., 2019). In particular, the bank only observes outcomes if they offer the applicant a loan. A key concern is that less historical information is available for minorities, leading to higher variance estimates of their ability to repay a loan. Thus, a risk-averse decision maker might conservatively deny loans to minorities, even if their expected rate of repaying loans is equal to that of majority members. The equal opportunity constraint forces the decision maker to give exploratory loans to avoid unfairly denying loans to an applicant for whom little data is available.

image

We describe an algorithm for solving (1), which has strong theoretical guarantees (i.e., it solves (1) exactly in polynomial time). On the other hand, it makes strong assumptions—i.e., that M has finite state and action spaces. In Section 4, we describe a model-free algorithm that applies very generally (e.g., to continuous state and action spaces, or even non-separable MDPs), but lacks performance guarantees.

Our approach is based on the dual of the standard LP formulation of value iteration (Altman, 1999; Sut- ton and Barto, 2018). In particular, the objective and first set of constraints of the LP in Algorithm 1 form the dual. The last set of constraints in the LP in Algorithm 1 encodes demographic parity. These constraints exploit the separable structure of the underlying MDP. In particular, the component z of an initial state  s = (z, ˜s) does not change over time, so the value of z for s equals the value of z for the initial state s0 ∼ D. Thus, randomly sampling a state  s ∼ D(π)zis equivalent to randomly sampling

image

Expanding the conditional probability, the probability of sampling  s ∼ D(π)zis

image

It follows that

image

The last set of constraints in the LP in Algorithm 1 uses (3) to encode demographic parity.

image

Theorem 3.1. Algorithm 1 returns a solution  π∗to (1) if and only if (1) is satisfiable.

We give a proof in Appendix C. Note that Algorithm 1 runs in polynomial time.

Remark 3.2. We briefly compare our approach to algorithms for solving constrained MDPs. Existing approaches are also based on the dual of the LP for solving MDPs (Altman, 1999). Indeed, in the LP we use in Algorithm 1, the objective and the first constraint are taken from the dual. The second constraint, which encodes the fairness constraint, is novel—our key insight is that for separable MDPs, the fairness constraint can be expressed as a linear inequality over  λ.

Next, we propose a general algorithm for solving (1). However, in general, the planning problem may be non-convex, so unlike Algorithm 1, this algorithm may converge to a local optimum.

Our algorithm relies on the cross-entropy (CE) method (Mannor et al., 2003; Hu et al., 2012), a heuristic for solving optimization problems. Suppose our policies  πθ ∈Π are parameterized by  θ ∈Θ, and let a family F of probability distributions over Θ parameterized by  V ⊆ Rd. We use  θand  πθinterchangeably, e.g.,  R(θ) = R(πθ). In the unconstrained setting, CE aims to solve the following optimization problem:

image

where  Ev = Eθ∼fv. In other words, it aims to compute a distribution  fv∗that places high probability mass on θwith high cumulative expected reward  R(θ). Then, it returns a sample  θ ∼ fv∗. To solve (4), CE starts with initial parameters  v0 ∈ V. Then, on each iteration, it

image

updates the current parameters  vkto move “closer” to v∗. More precisely, the update is

image

where  γksatisfies Prvk[R(θ) ≥ γi] = µ. Here,  α, µ ∈(0, 1) are hyperparameters. Intuitively, the first term of  giupweights  θ′with large values of  R(θ′)compared to  fvk, both by directly weighting the probability of  θ′by  R(θ′), and furthermore by placing zero probability mass on the bottom 1  − µfraction of the  θ′. The second term of  gkis a “smoothing” term that makes the update incremental.

To enable efficient optimization of (5), we assume that F is a (natural) exponential family.

Definition 4.1. A family F of distributions over Θ  ⊆Rdis an exponential family if, for a continuous Γ : Θ  →Rd, fv(θ) = ev⊤Γ(θ)/Z(θ), where  Z(θ) =�ev⊤Γ(θ)dθ.

We use the standard choice that F is the space of Gaussians. If F is an exponential family, then

image

where  m(v) = Ev[Γ(θ)] is the moment map (Hu et al., 2012). The CE algorithm approximates (6) by sampling rollouts  ζ = ((s0, a0), ..., (sT −1, aT −1)) according to  πθ. Then, it computes the estimate  R(θ) ≈ˆR(θ) = 1m�mi=1ˆR(ζ(i)), where  ζ(1), ..., ζ(m)are m sam- pled rollouts and ˆR(ζ) = �T −1t=0 γtRst,at.

To estimate  ηk+1, it takes n samples  θ(1), ..., θ(n) ∼ fv, and computes ˆR(θ(i))for each i. Then, it ranks  θ(i)in decreasing order of ˆR(θ(i)), and discards all but the top n′ = ⌈nµ⌉. It estimates the numerator in  ηk+1as

image

The denominator in  ηk+1is estimated similarly.

Algorithm 2 computes this estimate of the update (6) assuming the condition on Line 16 is satisfied (as we discuss below, the check is needed to enforce the constraint that  π ∈ ΠDP,ϵ. Line 6 of Algorithm 2 computes the estimates ˆR(θ(i))for samples  θ(i) ∼ fvkfor i ∈ [n], and Line 14 estimates  ηk+1. On Line 6 & 7, the notation ∼m,T←−−−−means to estimate a quantity using m sampled rollouts  ζ(1), ..., ζ(m)each of length T.

Finally, we adapt constrained cross-entropy (CCE), which extends CE to handle constraints (Wen and Topcu, 2018), to handle fairness constraints. Intuitively, CCE prioritizes policies where the constraint that  π ∈ ΠDP,ϵis closer to holding, unless the constraint holds, in which case CCE prioritizes policies with higher cumulative expected reward. In particular, Algorithm 2 imposes this constraint by checking if ˆθsatisfies the constraint ˆϵ(ˆθ) ≤ ϵin Line 16, where ˆϵ(ˆθ)is estimated from samples. Note that ˜ϵis used in place of  ϵto enforce the constraint even though ˆϵ(ˆθ)is inexact. The reason is that CCE relies on estimates ˆϵ(ˆθ)of  ϵ(ˆθ). These estimates are inexact since (i) they are estimated from samples, and (ii) they are estimated based on a finite time horizon (whereas  ϵ(ˆθ)is defined for an infinite horizon). To account for this error, we use (1  − σ)ϵ(where  σ ∈(0, 1)) in place of  ϵwhen checking the constraint on Line 16 of Algorithm 2.

We provide the following for Algorithm 2 (see Appendix D for a proof).

Theorem 4.2. Assume that  ρmaxis an upper bound on  ρ(i.e.,  ∥ρ∥∞ = ρmaxfor all  z ∈ Z). Let  δ ∈ R+and  σ ∈(0, 1/2] be given, and suppose that

image

Then, with probability at least 1  − δ, we have  πˆθ ∈ΠDP,ϵ, where  πˆθis returned by Algorithm 2.

image

Figure 1: Demographic parity (a) objective value, (b) constraint value, and equal opportunity (c) objective value, (d) constraint value, for race-blind (RB), demographic parity (DP) or equal opportunity (EO), optimistic (Opt), and conservative (Cons).

We discuss extensions to the setting where the MDP is initially unknown, and the goal is to ensure fairness while learning these quantities. We propose an approach to fairness when the transitions P are unknown but the initial state distribution D is known; reducing to the case of unknown D is standard (i.e., add a deterministic initial state  s0and transition to an initial state according to D). Our goal is to ensure that with high probability, fairness holds for all time including during learning. We consider the episodic case where the system is reset after a fixed number of steps T, and take  γ = 1. That is, a finite sequence ofinteractions is performed repeatedly—e.g., each new loan applicant is a new episode. We assume there are a fixed total number of episodes N, and the goal is to perform well on average; the doubling trick can be used to generalize to unknown or unbounded N (see p. 99 of Lattimore and Szepesv´ari).

A key challenge is how to design a fair policy we can use when the dynamics are unknown. Thus, we focus on the setting of state-independent individual rewards  ρ, where we can ensure such a policy exists. In particular, we take  π0to choose actions uniformly randomly—i.e.,  π0(s, a) = 1/|A|for all  s ∈ Sand a ∈ A. Then, we are guaranteed that  π0is fair. Furthermore, we are guaranteed that  π0explores all states (assuming without loss of generality that we prune unreachable states)—i.e., letting  D(π) = 1T�T −1t=0 D(π,t)and Λ(π)s,a = D(π)s πs,a, where  D(π,t)is defined as before, then there exists  λ0 ∈ R+such that

image

We use explore-then-commit (Lattimore and Szepesv´ari). First, we explore using the conservative policy  π0for  N0episodes. Then, we estimate P using the observed state-action-state tuples (s, a, s′) (i.e., transition to  s′upon taking action a in state s):

ˆPs,a,s′ =# observed tuples (s, a, s′)# observed tuples (s, a, s′′) for some  s′′ ∈ S .

Finally, for the remaining  N − N0, it uses the optimal policy ˆπcomputed as if ˆP is the true transition matrix.

We prove a bound on the regret

image

where the expectation is taken over the randomness of the observed tuples (s, a, s′), π∗is the optimal policy for known P that satisfies  π∗ ∈ ΠDP,ϵ/4, and

image

is the policy our algorithm uses on episode n. We show that ˆπis fair, and that given  δ ∈ R+, πn ∈ ΠDP,ϵfor every  n ∈ [N] with probability at least 1  − δ.

Theorem 5.1. Let  ϵ, δ ∈ R+be given. Assume that Rmaxis an upper bound on R (i.e.,  ∥R∥∞ = Rmax) and on  ρ. Let ˆM = (S, A, D, ˆP, R, T), and ˆπbe the optimal policy for ˆM in ˆΠDP,ϵ/2(i.e., the set of policies satisfying demographic parity for ˆM). Let M = (S, A, D, P, R, T), and  π∗be optimal for M in ΠDP,ϵ/4. Then, ˆπ ∈ ΠDP,ϵ, and  R(N) = O((N 2/3+ 1/ϵ2) log(1/δ)) with probability at least 1  − δ.

We give a proof in Appendix E. Note that there is a gap between the fairness constraint of  π∗(which is in ΠDP,ϵ/4) and that of ˆπ(which is only in ΠDP,ϵ)— i.e., we can only guarantee performance compared to a policy that satisfies a stricter level of fairness.

We run simulations using our loan example from Section 2. We estimated parameters based on FICO score data (Hardt et al., 2016). We consider Whites to be majorities, and Blacks, Hispanics, and Asians to be minorities. For the initial distribution  p0, we first fit parameters the parameters of the prior Beta(αz, βz) based on the data. Then, we take a fixed number of steps  Tzusing action a = 1 (i.e., offer loan) to force exploration. We choose  Tmaj > Tminto capture the idea that less data is available for minorities. We also estimate the probability  pZof being a minority from the data. Similar to (Hardt et al., 2016), we choose I so the bank makes a profit on the average applicant. We manually choose  λ, τ, Tmaj, and  Tminbased on intuition; see Appendix G for the values we chose. We focus on evaluation of Algorithm 2, and give additional experimental results in Appendix G.

Baselines that ignore dynamics. To demonstrate the importance of accounting for dynamics, we compare to two baselines that ignore dynamics when constraining fairness. The first optimistically pretends that actions do not affect the state distribution—i.e., D(π,t)does not change over time. In this case, for all t > 0, we have  D(π,t) = D, so  D(π) = Dfor any  π. Thus, we can let

image

where  Dz = D | ∃˜s ∈˜S . s0 = (z, ˜s). We can solve (7) using a straightforward modification of Algorithm 2. This captures the supervised learning setting. Compared to our algorithm, this algorithm may learn a policy that is unfair but achieves higher reward.

The second conservatively assumes  D(π,t)can change arbitrarily on each step. This baseline learns a fair policy, but it may achieve much lower reward. In this case, we restrict to policies  πthat satisfy

image

where  D′z = D′ | ∃˜s ∈ ˜S . s = (z, ˜s), and ∆nis the standard n-simplex. Note that  D′zis conditioned on s = (z, ˜s) (i.e., the current state has sensitive attribute z) instead of  s0 = (z, ˜s) (i.e., the initial state has sensitive attribute z); if M is separable, these two conditions are equivalent. Finally, note that  D′zis undefined if the conditional has zero probability according to  D′; we implicitly omit such  D′from (8).

The difficulty with (8) is the universal quantification over  D′ ∈ ∆|S|. For state-independent individual rewards, the conservative assumption is in fact equivalent to optimizing over state-independent policies— i.e., those of the form  πs,a = ˜πa, where ˜π ∈ R|A|. Thus, we can apply a modified version of Algorithm 2 where we only learn state-independent policies.

Results for Algorithm 2. We ran Algorithm 2 to learn fair policies for both the demographic parity and equal opportunity constraints, using  ϵ = 0.1. For each constraint, we also use our optimistic and conservative baselines. We also consider a race-blind algorithm that is unconstrained but where  πignores the sensitive attribute  z ∈ Z. The optimal policy is race-blind—the state is a sufficient statistic, so it captures all information needed to determine whether to offer a loan.

For demographic parity, Figure 1 (a) shows the reward achieved for the bank, and (b) shows the value of the fairness constraint—i.e., the smallest value of  ϵfor which  π ∈ ΠDP,ϵ. As expected, race-blind achieves the highest reward (10.43), followed by the optimistic algorithm (10.41), and then Algorithm 2 (10.40). Finally, the conservative algorithm performs substantially worse than the others (10.00). However, race-blind achieves a very poor constraint value (0.42), as does the optimistic algorithm (0.14), which per- forms performs 43% worse than Algorithm 2 (0.10). The conservative algorithm achieves constraint value 0. For equal opportunity, Figure 1 (c) shows the bank reward, and (d) shows the value of the constraint. The bank’s rewards are essentially the same for the race-blind algorithm, optimistic algorithm, and Algorithm 2 (10.43), but is substantially worse for the conservative algorithm (10.00). As with demographic parity, the constraint value for race-blind (0.37) is substantially worse than the others, but in this case optimistic (0.11) is fairly close to Algorithm 2 (0.10). The conservative algorithm achieves constraint value 0.

Discussion. Our results show that imposing demographic parity slightly reduces the bank’s reward, but substantially increases fairness compared to the race-blind and optimistic algorithms. The latter models supervised learning—thus, our results show the importance of accounting for dynamics when ensuring fairness. We find similar (but weaker) trends for equal opportunity. Like prior work (Hardt et al., 2016), we find that demographic parity reduces the bank’s rewards more than equal opportunity.

Unlike the static case (Hardt et al., 2016), our model has dynamic parameters. Time series data would be needed to estimate them; instead, we choose them manually. Also, (Hardt et al., 2016) uses the empirical CDF of the distribution over repayment probabilities p0, whereas we assumed  p0is a Beta distribution. Our goal is to understand the consequences of ignoring dynamics, not to study a real-world scenario.

We have proposed algorithms to learn fair policies that account for the dynamical effects, and have demonstrated the importance of accounting for these effects. There is much room for future work. One important direction is extending our results for the case where the initial MDP is unknown beyond explore-then-commit to obtain better regret guarantees. Another direction is theoretically analyzing the cost of fairness— e.g., what is the cost to the bank for imposing a fairness constraint, and how they can mitigate this cost by improving predictive power. Finally, reinforcement learning problems in practice are often offline—i.e., the goal is to learn from historical data and the algorithm does not have the opportunity to explore. Studying fairness in this context is an important problem.

We thank the anonymous reviewers for their insightful comments. This research was partially supported by NSF awards 1910769 and 1652113.

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In ICML, 2017.

Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.

Pranjal Awasthi, Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Beyond individual and group fairness. arXiv preprint arXiv:2008.09490, 2020.

Yahav Bechavod, Katrina Ligett, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. Equal opportunity in online classification with partial feedback. arXiv preprint arXiv:1902.02242, 2019.

Toon Calders, Faisal Kamiran, and Mykola Pech- enizkiy. Building classifiers with independency constraints. In Data mining workshops, 2009. ICDMW’09. IEEE international conference on, pages 13–18. IEEE, 2009.

Elliot Creager, David Madras, Toniann Pitassi, and Richard Zemel. Causal modeling for fairness in dynamical systems. arXiv preprint arXiv:1909.09141, 2019.

Alexander D’Amour, Hansa Srinivasan, James At- wood, Pallavi Baljekar, D Sculley, and Yoni Halpern. Fairness is not static: deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 525–534, 2020.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214– 226. ACM, 2012.

Hadi Elzayn, Shahin Jabbari, Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, and Zachary Schutzman. Fair algorithms for learning in allocation problems. 2019.

Moritz Hardt, Eric Price, Nati Srebro, et al. Equal- ity of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016.

Tatsunori B Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In ICML, 2018.

Jiaqiao Hu, Ping Hu, and Hyeong Soo Chang. A stochastic approximation framework for a class of randomized optimization algorithms. IEEE Transactions on Automatic Control, 57(1):165–178, 2012.

Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1617–1626. JMLR. org, 2017.

Matthew Joseph, Michael Kearns, Jamie H Morgen- stern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016.

Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Sch¨olkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pages 656–666, 2017.

Niki Kilbertus, Manuel Gomez-Rodriguez, Bernhard Sch¨olkopf, Krikamol Muandet, and Isabel Valera. Fair decisions despite imperfect predictions. AISTATS, 2019.

Matt J Kusner, Joshua Loftus, Chris Russell, and Ri- cardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.

Himabindu Lakkaraju, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 275–284. ACM, 2017.

Tor Lattimore and Csaba Szepesv´ari. Bandit algorithms.

Lydia T Liu, Sarah Dean, Esther Rolf, Max Sim- chowitz, and Moritz Hardt. Delayed impact of fair machine learning. In ICML, 2018.

Shie Mannor, Reuven Y Rubinstein, and Yohai Gat. The cross entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 512–519, 2003.

Razieh Nabi and Ilya Shpitser. Fair inference on out- comes. In Proceedings of the... AAAI Conference on Artificial Intelligence. AAAI Conference on Ar-tificial Intelligence, volume 2018, page 1931. NIH Public Access, 2018.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.

Min Wen and Ufuk Topcu. Constrained cross-entropy method for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 7461–7471, 2018.

A Batch Decisions

We describe how our algorithm can be extended to the setting where the decision-maker makes decisions about batches of individuals jointly rather than one individual at a time. For instance, a bank might decide on a portfolio of loans to target at once rather than decide independently for each individual. The challenge is ensuring decisions are fair not just across different batches of individuals, but also across individuals within a batch, since these decisions may be correlated.

In this setting, the state space becomes  S′ = Sk =(Z טS)k, where there are k individuals and S is the state space of individual i. The action space is A′ = Ak = {0, 1}k—i.e., a binary decision for each individual. Then, the state-action distribution is  λ ∈R|S′|×|A′|, with a component  λs,a = λ(s1,...,sk),(a1,...,ak)for each state  s = (s1, ..., sk) ∈ S′and action a = (a1, ..., ak) ∈ A′. The policy  πcan simultaneously make decisions for all k individuals. We let the individual rewards for individual i be  ρi ∈ R|S′|×|A′|. Then, the natural generalization of our fairness constraint is that decisions should be fair on average across both the initial state distribution and across individuals in a single batch. For instance, demographic parity says that

image

where

image

This constraint can be encoded in our linear program in Algorithm 1 by replacing the second constraint with the following (the objective and first constraint remain the same, except with S replaced by  S′and A replaced by  A′):

image

where

Xz =

image

and where  pi,zis a normalizing constant similar to  pzin the original constraint. Intuitively, this constraint is the same as the original one except that we marginalize over all other individuals (i.e., the sums over  sjfor j ̸= i), and then we average over individuals i as in our fairness constraint. This constraint can similarly be incorporated into Algorithm 2. Finally, while this MDP has number of states exponential in the number of individuals k, this blowup is inevitable since the policy is allowed to make complex decisions based on the states of all individuals.

B Proof of Theorem 2.5

For the first claim, consider the MDP M. The states are  s0, s1, s2, s3, s4 ∈˜S × Z, where:

image

The actions are A = {0, 1}. The transitions are

image

for all  a ∈ A. The initial distribution is

image

The discount factor is  γ = 12. The individual rewards are

image

for all  a ∈ A. Let  π : S → Abe a deterministic policy. It is clear that the only value of  πthat matters is  π(s2). Conditioned on  z = maj, regardless of π, the expected cumulative individual reward is

image

Conditioned on  z = min, if π(s2) = 0, then

image

Thus, for  ϵ < 12, it is impossible for the demographic parity constraint to be satisfied.

However, consider the stochastic policy

image

Then,

image

so this policy satisfies the demographic parity constraint.

For the second claim, consider the same MDP, except where

image

for all  a ∈ A. Then, it is clear that

image

regardless of  π. Thus, for  ϵ < 12, the demographic parity constraint cannot be satisfied—i.e., ΠDP,ϵ = ∅.

image

C Proof of Theorem 3.1

Our proof proceeds in three steps. First, we show that any feasible point of the LP in Algorithm 1 is the state-action distribution Λ(π)for some policy  π ∈ ΠDP. Second, we show that conversely, for any fair policy π ∈ ΠDP, the state-action distribution Λ(π)is a feasible point of the LP. Finally, we combine these two results to prove the theorem.

Step 1. Let  π ∈ ΠDPbe any policy satisfying demographic parity. Then, we claim that the state-action distribution Λ(π)is a feasible point of the LP in Algorithm 1.

image

for each  s′ ∈ S.

To this end, note that by induction,

image

so

image

Multiplying each side of (9) by  I − γP (π)(where I is the  |S| × |S|identity matrix), we have

image

Note that these algebraic manipulations are valid since the eigenvalues of  γP (π)are bounded in norm by  γ <1, so all sums converge absolutely. Rearranging this equality gives

image

for each  s′ ∈ S, where we have used the equalities

image

and

image

that follow from the definition of Λ(π). Therefore, Λ(π)satisfies the first constraint.

Next, we show that Λ(π)satisfies the second constraint, which says that

image

In particular, note that

image

since the value of z for s equals the value of z for the initial state  s0 ∼ D. Furthermore, the probability of sampling  s ∼ D(π) | ∃˜s ∈˜S . s = (z, ˜s) is

image

Together with the definition of Λ(π)z, we have

image

Therefore, we have

image

By assumption,  πsatisfies the demographic parity constraint, which says exactly that (12) satisfies (11). Thus, Λ(π)satisfies the second constraint.

Therefore, Λ(π)is a feasible point of the LP, as claimed.

Step 2. Let  λ ∈ R|S|×|A|be a feasible point of the LP in Algorithm 1, and let

image

be the corresponding policy returned by Algorithm 1. Then, we claim that  λ = Λ(π), that  π ∈ ΠDP, and that the value of the objective for  λequals  R(π).

To see the first claim, let  d ∈ R|S|be defined by

image

We show that  D(π) = d. To this end, note that because λsatisfies the first constraint in the LP, we have

image

Together with the equality

image

we have

image

Thus,

image

We note that  I −γP (π)is invertible—in particular, the eigenvalues of  γP (π)have norms bounded by  γ, so the eigenvalues of  I − γP (π)have norms bounded below by 1  − γ; therefore, the eigenvalues of  I − γP (π)are nonzero, so it is invertible. As a consequence, we can solve for d in (13) to get

image

Finally, from (10) in Step 1 of this proof, we established that  D(π)similarly satisfies

image

As before, since  I − γP (π)is invertible, we have

image

Thus,

image

so the first claim follows.

To see the second claim, note that since  λis feasible, it must satisfy the second constraint of the LP. As shown in the first step of this proof, (11) is equivalent to the demographic parity constraint. Thus,  π ∈ ΠDP, as claimed.

To see the third claim, note that

image

In other words, the value of the objective of the LP for the point  λis equal to  R(π), as claimed.

Step 3. Finally, we use the results from the previous two steps to prove the theorem statement. First, let  π∗be the solution to (1). By the claim shown in the first step, Λ(π∗)is a feasible point of the LP in Algorithm 1. Furthermore, by the claim shown in the second step, the value of the objective for  λ = Λ(π∗)is  R(π∗).

Next, let  λ0be the solution to the LP in Algorithm 1. By the claim shown in the second step, (i)  λ0 = Λ(π0), where  π0is the policy returned by Algorithm 1, (ii) π0 ∈ ΠDP, and (iii) the value of the objective for  λ0is R(π0).

It follows that  R(π∗) ≤ R(π0), since  π0maximizes the objective of the LP over feasible points (and Λ(π∗)is feasible). Since  π0 ∈ ΠDP, it follows that  π0is also a solution to (1). Thus, we have proven the theorem statement.

D Proof of Theorem 4.2

Our proof proceeds in three steps. First, we bound the error  |˜ρ(π)−ρ(π)|due to truncation. Second, we bound the estimation error  |ˆρ(π) − ˜ρ(π)|. Third, we combine steps 1 and 2 to prove Theorem 4.2.

Step 1. Note that for any policy  πand any  z ∈ Z, we have

image

Step 2. For each  z ∈ Z, let ˆρ(π)zbe an estimate of ˜ρ(π)zusing m sampled rollouts  ζ(1), ..., ζ(m). First, note that

image

is bounded, so we can apply Hoeffding’s inequality (see Lemma F.1) to get

image

Since Z = {maj, min}, by a union bound,

image

with probability at least 1  − δ.

Step 3. Now, we can prove Theorem 4.2. First, note that with probability 1  − δ,

image

for all  z ∈ Z. Thus,

image

which implies that  π ∈ ΠDP,ϵ. Thus, the theorem follows.

E Proof of Theorem 5.1

We prove the following lemma; Theorem 5.1 follows by choosing  ϵ′ = N −1/3.

Lemma E.1. Let  ϵ, ϵ′, δ ∈ R+be given. Assume that Rmaxbe an upper bound on R (i.e.,  ∥R∥∞ = Rmax) and on  ρ. Let ˜ϵ = min{ϵ, ϵ′}, and let

image

Let ˆM = (S, A, D, ˆP, R, T), and ˆπbe the optimal policy for ˆM in ˆΠDP,ϵ/2(i.e., the set of policies satisfying demographic parity for ˆM). Let M = (S, A, D, P, R, T), and  π∗be optimal for M in ΠDP,ϵ/4. Then, ˆπ ∈ ΠDP,ϵ, and  R(π∗) − R(ˆπ) ≤ ϵ′, where  R(π)is defined for M.

Our proof proceeds in three steps. First, we prove that for any  ϵ0, δ0, we can choose  N0sufficiently large so that

image

with probability at least 1−δ0. Second, we prove that assuming  ∥P −ˆP∥∞ ≤ ϵ0, then for any policy  π, we have

image

where  R(π)(resp., ˆR(π)) is the expected cumulative distribution assuming the transitions are P (resp., ˆP), and similarly for the agent rewards  ρ. Third, we use the first two steps to prove the lemma statement.

Step 1. Given  ϵ0, δ0 ∈ R+, we claim that for

image

then our estimate ˆP satisfies

image

with probability at least 1  − δ0.

Let  Is,abe the random variable indicating whether our algorithm observes a tuple (s, a, s′) (for some  s′ ∈ S) on a single episode, and let  Is,a,ibe samples of  Is,afor each of the  N0exploratory episodes taken by our algorithm. Let

image

Then, by Hoeffding’s inequality (see Lemma F.1), we have

image

By assumption, we have

image

so using  ϵ = λ0/2 in (14), we have

image

with probability at least

image

Taking a union bound over  s ∈ Sand  a ∈ A, we have (15) holds for every  s ∈ Sand  a ∈ Awith probability at least

image

In this event, we have at least N0λ02observations (s, a, s′) (for some  s′ ∈ S) for every  s ∈ Sand  a ∈ A.

Now, for an observation (s, a, s′′), let  Js,a,s′be the random variable indication whether  s′ = s′′. Without loss of generality, we assume that we have exactly N1 = N0λ02samples  Js,a,s′,jof  Js,a,s′for each  s ∈ Sand  a ∈ A. Let

image

Then, by Hoeffding’s inequality (see Lemma F.1), we have

image

Note that by definition,  µ(J)s,a,s′ = Ps,a,s′and ˆµ(J)s,a,s′ =ˆPs,a,s′. Thus, taking  ϵ = ϵ0in (17), we have

image

with probability at least

image

Taking a union bound over all  s, s′ ∈ Sand  a ∈ A, we have (18) for all  s, s′ ∈ Sand  a ∈ Awith probability at least

image

In other words, in this event, we have  ∥P −ˆP∥∞ ≤ ϵ0.

Taking a union bound over (16) and (19), we have

image

with probability at least

image

as claimed.

Step 2. We claim that assuming

image

then for any policy  π, we have

image

where  R(π)is the expected cumulative reward for πin the MDP M = (S, A, D, P, R, T) and ˆR(π)is the expected cumulative reward for  πin the MDP ˆM = (S, A, D, ˆP, R, T). Note that we have replaced the discount factor  γwith the time horizon T. In addition, for all  z ∈ Z, we have

image

where

image

is the expected cumulative agent reward for the MDP M, and ˆρ(π)zis the expected cumulative agent reward for the MDP ˆM. We only prove the claim for  |R(π) −ˆR(π)|; the claim for  |ρ(π)z − ˆρ(π)z |follows using the same argument.

Let  W ∈ R|S|be

image

Then, we have

image

Now, note that

image

so we have

image

Thus,

image

Similarly,

image

It follows that

image

Thus,

image

where the first line follows from H¨older’s inequality, and the second line follows from properties of the matrix norm, from the fact that

image

and using the fact that the summand is zero for t = 0 since (P (π))0 = ( ˆP (π))0 = I. Note that

image

Furthermore,

image

where the first inequality follows from H¨older’s inequality and the second inequality follows since  πs,·is a discrete probability distribution. Therefore,

image

Plugging (21) and (22) into (20) gives

image

Step 3. Now, we prove the theorem. Let ˆπbe the optimal policy for ˆM (i.e., transitions ˆP) satisfying ˆπ ∈ˆΠDP,ϵ/2. Similarly, let  π∗be the optimal policy for M (i.e., transitions P) satisfying  π∗ ∈ ΠDP,ϵ/4. We apply the second step with

image

Then, by the first step, for all  z, z′ ∈ Z, we have

image

≤ (ρ(ˆπ)z − ˆρ(ˆπ)z) + (ˆρ(ˆπ)z − ˆρ(ˆπ)z′) + +(ρ(ˆπ)z′ − ˆρ(ˆπ)z′) ≤ T 2 · |S| · Rmax · ϵ0+ ϵ2 +  T 2 · |S| · Rmax · ϵ0

image

where the inequality on the third line follows because ˆπ ∈ˆΠDP,ϵ/2, and the inequality on the last line follows since ˜ϵ ≤ ϵ. Thus, we guarantee that ˆπ ∈ ΠDP,ϵ.

Next, note that similarly, for all  z, z′ ∈ Z, we have

image

so  π∗ ∈ˆΠDP,ϵ/2. As a consequence, we have

image

 T 2· |S| ·  Rmax · ϵ0+ 0 +  T 2· |S| ·  Rmax · ϵ0 ϵ′,

where the inequality on the third line follows because ˆπmaximizes ˆR(π)over  π ∈ˆΠDP,ϵ/2and  π∗ ∈ˆΠDP,ϵ/2, and the inequality on the last line follows since ˜ϵ ≤ ϵ′.

Thus, the lemma statement follows.

F Technical Lemmas

Lemma F.1. (Hoeffding’s inequality) Let  X ∼ pXbe a random variable with domain [a, b] ⊆ Rand mean µX, and let ˆµX = n−1 �ni=1 Xibe an estimate of  µXa using n i.i.d. samples  Xi ∼ pX. Then, we have

image

where the probability is taken over the randomness in the i.i.d. samples  X1, ..., Xn ∼ pX.

Proof. See Wainwright (2019) for a proof.

image

(a) (b) (c)

Figure 2: Demographic parity (a) objective value, (b) constraint value for our algorithm (black) and the optimistic baseline (red). (c) Regret of our reinforcement learning algorithm.

G Experimental Details & Additional Results

Parameters. We use the following parameters for our loan MDP:

image

ally study how Algorithm 2 varies with the fairness constraint threshold  ϵ. In Figure 2 (a,b), we show the objective value achieved and the fairness constraint value achieved by our algorithm and the optimistic algorithm for the demographic parity constraint. While the objective values achieved are very similar, the optimistic algorithm does not always satisfy the fairness constraint. In particular, for  ϵ = 0.025, its constraint value is 0.052 (exceeds  ϵby 108%), and for  ϵ = 0.05, it is 0.079 (exceeds  ϵby 59%). Intuitively, there are multiple policies that achieve the same objective value, but the optimistic algorithm sometimes fails to find the ones that are fair. In contrast, our algorithm always satisfies the fairness constraint.

rithm 1 on a modified version of our loan MDP where αand  βare discretized and thresholded to make the state space finite. For this MDP, we have compared Algorithm 1 to solving an unconstrained MDP—i.e., without the demographic parity fairness constraint. We use  ϵ = 0.01. Our results are as follows: (i) for Algorithm 1, the cumulative expected reward is 0.68 and the fairness constraint value is 0.01, and (ii) for the unconstrained algorithm, the cumulative expected reward is 0.69 and the fairness constraint value is 0.26. In other words, for a small reduction in reward, our algorithm substantially improves fairness. The remaining baselines cannot be implemented using the approach in Algorithm 1.

our reinforcement learning algorithm in conjunction with the We run the algorithm for N = 1000 episodes total. In particular, we explore for 100 episodes using a conservative policy  π0that ignores the state; then, we use the estimated transitions to learn the optimal policy ˆπand use ˆπfor the remaining 900 episodes. Since our model is parameterized by  αand  β, we estimate these quantities instead of directly estimating the transitions. We show the regret compared to the optimal policy in Figure 2 (c), averaged over 5 iterations. As can be seen, the regret quickly increases while using  π0, and then becomes almost flat when using ˆπ. We note that our algorithm satisfies the fairness constraint across all episodes and iterations.


Designed for Accessibility and to further Open Science