Algorithms for Fairness in Sequential Decision Making

2019·arXiv

2 Fairness Constraints for MDPs

3 Algorithm for Finite-State MDPs

We describe an algorithm for solving (1), which has strong theoretical guarantees (i.e., it solves (1) exactly in polynomial time). On the other hand, it makes strong assumptions—i.e., that M has finite state and action spaces. In Section 4, we describe a model-free algorithm that applies very generally (e.g., to continuous state and action spaces, or even non-separable MDPs), but lacks performance guarantees.

Our approach is based on the dual of the standard LP formulation of value iteration (Altman, 1999; Sut- ton and Barto, 2018). In particular, the objective and first set of constraints of the LP in Algorithm 1 form the dual. The last set of constraints in the LP in Algorithm 1 encodes demographic parity. These constraints exploit the separable structure of the underlying MDP. In particular, the component z of an initial state ) does not change over time, so the value of z for s equals the value of z for the initial state . Thus, randomly sampling a state is equivalent to randomly sampling

Expanding the conditional probability, the probability of sampling is

It follows that

The last set of constraints in the LP in Algorithm 1 uses (3) to encode demographic parity.

Theorem 3.1. Algorithm 1 returns a solution to (1) if and only if (1) is satisfiable.

We give a proof in Appendix C. Note that Algorithm 1 runs in polynomial time.

Remark 3.2. We briefly compare our approach to algorithms for solving constrained MDPs. Existing approaches are also based on the dual of the LP for solving MDPs (Altman, 1999). Indeed, in the LP we use in Algorithm 1, the objective and the first constraint are taken from the dual. The second constraint, which encodes the fairness constraint, is novel—our key insight is that for separable MDPs, the fairness constraint can be expressed as a linear inequality over .

4 Algorithm for General MDPs

Next, we propose a general algorithm for solving (1). However, in general, the planning problem may be non-convex, so unlike Algorithm 1, this algorithm may converge to a local optimum.

Our algorithm relies on the cross-entropy (CE) method (Mannor et al., 2003; Hu et al., 2012), a heuristic for solving optimization problems. Suppose our policies Π are parameterized by Θ, and let a family F of probability distributions over Θ parameterized by . We use and interchangeably, e.g., . In the unconstrained setting, CE aims to solve the following optimization problem:

where . In other words, it aims to compute a distribution that places high probability mass on with high cumulative expected reward . Then, it returns a sample . To solve (4), CE starts with initial parameters . Then, on each iteration, it

updates the current parameters to move “closer” to . More precisely, the update is

where satisfies Pr. Here, (0, 1) are hyperparameters. Intuitively, the first term of upweights with large values of compared to , both by directly weighting the probability of by , and furthermore by placing zero probability mass on the bottom 1 fraction of the . The second term of is a “smoothing” term that makes the update incremental.

To enable efficient optimization of (5), we assume that F is a (natural) exponential family.

Definition 4.1. A family F of distributions over Θ is an exponential family if, for a continuous Γ : Θ ), where .

We use the standard choice that F is the space of Gaussians. If F is an exponential family, then

where [Γ()] is the moment map (Hu et al., 2012). The CE algorithm approximates (6) by sampling rollouts )) according to . Then, it computes the estimate ˆˆ), where are m sam- pled rollouts and ˆ.

To estimate , it takes n samples , and computes ˆfor each i. Then, it ranks in decreasing order of ˆ, and discards all but the top . It estimates the numerator in as

The denominator in is estimated similarly.

Algorithm 2 computes this estimate of the update (6) assuming the condition on Line 16 is satisfied (as we discuss below, the check is needed to enforce the constraint that . Line 6 of Algorithm 2 computes the estimates ˆfor samples for ], and Line 14 estimates . On Line 6 & 7, the notation means to estimate a quantity using m sampled rollouts each of length T.

Finally, we adapt constrained cross-entropy (CCE), which extends CE to handle constraints (Wen and Topcu, 2018), to handle fairness constraints. Intuitively, CCE prioritizes policies where the constraint that is closer to holding, unless the constraint holds, in which case CCE prioritizes policies with higher cumulative expected reward. In particular, Algorithm 2 imposes this constraint by checking if ˆsatisfies the constraint ˆin Line 16, where ˆis estimated from samples. Note that ˜is used in place of to enforce the constraint even though ˆis inexact. The reason is that CCE relies on estimates ˆof . These estimates are inexact since (i) they are estimated from samples, and (ii) they are estimated based on a finite time horizon (whereas is defined for an infinite horizon). To account for this error, we use (1 (where (0, 1)) in place of when checking the constraint on Line 16 of Algorithm 2.

We provide the following for Algorithm 2 (see Appendix D for a proof).

Theorem 4.2. Assume that is an upper bound on (i.e., for all ). Let and (0, 1/2] be given, and suppose that

Then, with probability at least 1 , we have Π, where is returned by Algorithm 2.

Figure 1: Demographic parity (a) objective value, (b) constraint value, and equal opportunity (c) objective value, (d) constraint value, for race-blind (RB), demographic parity (DP) or equal opportunity (EO), optimistic (Opt), and conservative (Cons).

5 Reinforcement Learning

We discuss extensions to the setting where the MDP is initially unknown, and the goal is to ensure fairness while learning these quantities. We propose an approach to fairness when the transitions P are unknown but the initial state distribution D is known; reducing to the case of unknown D is standard (i.e., add a deterministic initial state and transition to an initial state according to D). Our goal is to ensure that with high probability, fairness holds for all time including during learning. We consider the episodic case where the system is reset after a fixed number of steps T, and take interactions is performed repeatedly—e.g., each new loan applicant is a new episode. We assume there are a fixed total number of episodes N, and the goal is to perform well on average; the doubling trick can be used to generalize to unknown or unbounded N (see p. 99 of Lattimore and Szepesv´ari).

A key challenge is how to design a fair policy we can use when the dynamics are unknown. Thus, we focus on the setting of state-independent individual rewards , where we can ensure such a policy exists. In particular, we take to choose actions uniformly randomly—i.e., for all and . Then, we are guaranteed that is fair. Furthermore, we are guaranteed that explores all states (assuming without loss of generality that we prune unreachable states)—i.e., letting and Λ, where is defined as before, then there exists such that

We use explore-then-commit (Lattimore and Szepesv´ari). First, we explore using the conservative policy for episodes. Then, we estimate P using the observed state-action-state tuples () (i.e., transition to upon taking action a in state s):

ˆ# observed tuples ()# observed tuples () for some

Finally, for the remaining , it uses the optimal policy ˆcomputed as if ˆP is the true transition matrix.

We prove a bound on the regret

where the expectation is taken over the randomness of the observed tuples (is the optimal policy for known P that satisfies , and

is the policy our algorithm uses on episode n. We show that ˆis fair, and that given for every ] with probability at least 1 .

Theorem 5.1. Let be given. Assume that is an upper bound on R (i.e., ) and on . Let ˆM = (S, A, D, ˆP, R, T), and ˆbe the optimal policy for ˆM in ˆΠ(i.e., the set of policies satisfying demographic parity for ˆM). Let M = (S, A, D, P, R, T), and be optimal for M in Π. Then, ˆ, and + 1) log(1)) with probability at least 1 .

We give a proof in Appendix E. Note that there is a gap between the fairness constraint of (which is in Π) and that of ˆ(which is only in Π)— i.e., we can only guarantee performance compared to a policy that satisfies a stricter level of fairness.

6 Experiments

We run simulations using our loan example from Section 2. We estimated parameters based on FICO score data (Hardt et al., 2016). We consider Whites to be majorities, and Blacks, Hispanics, and Asians to be minorities. For the initial distribution , we first fit parameters the parameters of the prior Beta() based on the data. Then, we take a fixed number of steps using action a = 1 (i.e., offer loan) to force exploration. We choose to capture the idea that less data is available for minorities. We also estimate the probability of being a minority from the data. Similar to (Hardt et al., 2016), we choose I so the bank makes a profit on the average applicant. We manually choose , and based on intuition; see Appendix G for the values we chose. We focus on evaluation of Algorithm 2, and give additional experimental results in Appendix G.

Baselines that ignore dynamics. To demonstrate the importance of accounting for dynamics, we compare to two baselines that ignore dynamics when constraining fairness. The first optimistically pretends that actions do not affect the state distribution—i.e., does not change over time. In this case, for all t > 0, we have , so for any . Thus, we can let

where ˜). We can solve (7) using a straightforward modification of Algorithm 2. This captures the supervised learning setting. Compared to our algorithm, this algorithm may learn a policy that is unfair but achieves higher reward.

The second conservatively assumes can change arbitrarily on each step. This baseline learns a fair policy, but it may achieve much lower reward. In this case, we restrict to policies that satisfy

where ), and ∆is the standard n-simplex. Note that is conditioned on ) (i.e., the current state has sensitive attribute z) instead of ) (i.e., the initial state has sensitive attribute z); if M is separable, these two conditions are equivalent. Finally, note that is undefined if the conditional has zero probability according to ; we implicitly omit such from (8).

The difficulty with (8) is the universal quantification over . For state-independent individual rewards, the conservative assumption is in fact equivalent to optimizing over state-independent policies— i.e., those of the form , where ˜. Thus, we can apply a modified version of Algorithm 2 where we only learn state-independent policies.

Results for Algorithm 2. We ran Algorithm 2 to learn fair policies for both the demographic parity and equal opportunity constraints, using 1. For each constraint, we also use our optimistic and conservative baselines. We also consider a race-blind algorithm that is unconstrained but where ignores the sensitive attribute . The optimal policy is race-blind—the state is a sufficient statistic, so it captures all information needed to determine whether to offer a loan.

For demographic parity, Figure 1 (a) shows the reward achieved for the bank, and (b) shows the value of the fairness constraint—i.e., the smallest value of for which . As expected, race-blind achieves the highest reward (10.43), followed by the optimistic algorithm (10.41), and then Algorithm 2 (10.40). Finally, the conservative algorithm performs substantially worse than the others (10.00). However, race-blind achieves a very poor constraint value (0.42), as does the optimistic algorithm (0.14), which per- forms performs 43% worse than Algorithm 2 (0.10). The conservative algorithm achieves constraint value 0. For equal opportunity, Figure 1 (c) shows the bank reward, and (d) shows the value of the constraint. The bank’s rewards are essentially the same for the race-blind algorithm, optimistic algorithm, and Algorithm 2 (10.43), but is substantially worse for the conservative algorithm (10.00). As with demographic parity, the constraint value for race-blind (0.37) is substantially worse than the others, but in this case optimistic (0.11) is fairly close to Algorithm 2 (0.10). The conservative algorithm achieves constraint value 0.

Discussion. Our results show that imposing demographic parity slightly reduces the bank’s reward, but substantially increases fairness compared to the race-blind and optimistic algorithms. The latter models supervised learning—thus, our results show the importance of accounting for dynamics when ensuring fairness. We find similar (but weaker) trends for equal opportunity. Like prior work (Hardt et al., 2016), we find that demographic parity reduces the bank’s rewards more than equal opportunity.

Unlike the static case (Hardt et al., 2016), our model has dynamic parameters. Time series data would be needed to estimate them; instead, we choose them manually. Also, (Hardt et al., 2016) uses the empirical CDF of the distribution over repayment probabilities , whereas we assumed is a Beta distribution. Our goal is to understand the consequences of ignoring dynamics, not to study a real-world scenario.

7 Conclusion

We have proposed algorithms to learn fair policies that account for the dynamical effects, and have demonstrated the importance of accounting for these effects. There is much room for future work. One important direction is extending our results for the case where the initial MDP is unknown beyond explore-then-commit to obtain better regret guarantees. Another direction is theoretically analyzing the cost of fairness— e.g., what is the cost to the bank for imposing a fairness constraint, and how they can mitigate this cost by improving predictive power. Finally, reinforcement learning problems in practice are often offline—i.e., the goal is to learn from historical data and the algorithm does not have the opportunity to explore. Studying fairness in this context is an important problem.

Acknowledgements

We thank the anonymous reviewers for their insightful comments. This research was partially supported by NSF awards 1910769 and 1652113.

References

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In ICML, 2017.

Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.

Pranjal Awasthi, Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Beyond individual and group fairness. arXiv preprint arXiv:2008.09490, 2020.

Yahav Bechavod, Katrina Ligett, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. Equal opportunity in online classification with partial feedback. arXiv preprint arXiv:1902.02242, 2019.

Toon Calders, Faisal Kamiran, and Mykola Pech- enizkiy. Building classifiers with independency constraints. In Data mining workshops, 2009. , pages 13–18. IEEE, 2009.

Elliot Creager, David Madras, Toniann Pitassi, and Richard Zemel. Causal modeling for fairness in dynamical systems. arXiv preprint arXiv:1909.09141, 2019.

Alexander D’Amour, Hansa Srinivasan, James At- wood, Pallavi Baljekar, D Sculley, and Yoni Halpern. Fairness is not static: deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 525–534, 2020.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214– 226. ACM, 2012.

Hadi Elzayn, Shahin Jabbari, Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, and Zachary Schutzman. Fair algorithms for learning in allocation problems. 2019.

Moritz Hardt, Eric Price, Nati Srebro, et al. Equal- ity of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016.

Tatsunori B Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In ICML, 2018.

Jiaqiao Hu, Ping Hu, and Hyeong Soo Chang. A stochastic approximation framework for a class of randomized optimization algorithms. IEEE Transactions on Automatic Control, 57(1):165–178, 2012.

Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1617–1626. JMLR. org, 2017.

Matthew Joseph, Michael Kearns, Jamie H Morgen- stern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016.

Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Sch¨olkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pages 656–666, 2017.

Niki Kilbertus, Manuel Gomez-Rodriguez, Bernhard Sch¨olkopf, Krikamol Muandet, and Isabel Valera. Fair decisions despite imperfect predictions. AISTATS, 2019.

Matt J Kusner, Joshua Loftus, Chris Russell, and Ri- cardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.

Himabindu Lakkaraju, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 275–284. ACM, 2017.

Tor Lattimore and Csaba Szepesv´ari. Bandit algorithms.

Lydia T Liu, Sarah Dean, Esther Rolf, Max Sim- chowitz, and Moritz Hardt. Delayed impact of fair machine learning. In ICML, 2018.

Shie Mannor, Reuven Y Rubinstein, and Yohai Gat. The cross entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 512–519, 2003.

Razieh Nabi and Ilya Shpitser. Fair inference on out- comes. In Proceedings of the... AAAI Conference on Artificial Intelligence. AAAI Conference on Ar-tificial Intelligence, volume 2018, page 1931. NIH Public Access, 2018.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.

Min Wen and Ufuk Topcu. Constrained cross-entropy method for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 7461–7471, 2018.

A Batch Decisions

We describe how our algorithm can be extended to the setting where the decision-maker makes decisions about batches of individuals jointly rather than one individual at a time. For instance, a bank might decide on a portfolio of loans to target at once rather than decide independently for each individual. The challenge is ensuring decisions are fair not just across different batches of individuals, but also across individuals within a batch, since these decisions may be correlated.

In this setting, the state space becomes (˜, where there are k individuals and S is the state space of individual i. The action space is —i.e., a binary decision for each individual. Then, the state-action distribution is , with a component for each state and action a = (. The policy can simultaneously make decisions for all k individuals. We let the individual rewards for individual i be . Then, the natural generalization of our fairness constraint is that decisions should be fair on average across both the initial state distribution and across individuals in a single batch. For instance, demographic parity says that

where

This constraint can be encoded in our linear program in Algorithm 1 by replacing the second constraint with the following (the objective and first constraint remain the same, except with S replaced by and A replaced by ):

where

and where is a normalizing constant similar to in the original constraint. Intuitively, this constraint is the same as the original one except that we marginalize over all other individuals (i.e., the sums over for ), and then we average over individuals i as in our fairness constraint. This constraint can similarly be incorporated into Algorithm 2. Finally, while this MDP has number of states exponential in the number of individuals k, this blowup is inevitable since the policy is allowed to make complex decisions based on the states of all individuals.

B Proof of Theorem 2.5

For the first claim, consider the MDP M. The states are ˜, where:

The actions are A = {0, 1}. The transitions are

for all . The initial distribution is

The discount factor is . The individual rewards are

for all . Let be a deterministic policy. It is clear that the only value of that matters is ). Conditioned on , the expected cumulative individual reward is

Conditioned on

Thus, for , it is impossible for the demographic parity constraint to be satisfied.

However, consider the stochastic policy

Then,

so this policy satisfies the demographic parity constraint.

For the second claim, consider the same MDP, except where

for all . Then, it is clear that

regardless of . Thus, for , the demographic parity constraint cannot be satisfied—i.e., Π.

C Proof of Theorem 3.1

Our proof proceeds in three steps. First, we show that any feasible point of the LP in Algorithm 1 is the state-action distribution Λfor some policy . Second, we show that conversely, for any fair policy , the state-action distribution Λis a feasible point of the LP. Finally, we combine these two results to prove the theorem.

Step 1. Let be any policy satisfying demographic parity. Then, we claim that the state-action distribution Λis a feasible point of the LP in Algorithm 1.

for each .

To this end, note that by induction,

so

Multiplying each side of (9) by (where I is the identity matrix), we have

Note that these algebraic manipulations are valid since the eigenvalues of are bounded in norm by 1, so all sums converge absolutely. Rearranging this equality gives

for each , where we have used the equalities

and

that follow from the definition of Λ. Therefore, Λsatisfies the first constraint.

Next, we show that Λsatisfies the second constraint, which says that

In particular, note that

since the value of z for s equals the value of z for the initial state . Furthermore, the probability of sampling ˜) is

Together with the definition of Λ, we have

Therefore, we have

By assumption, satisfies the demographic parity constraint, which says exactly that (12) satisfies (11). Thus, Λsatisfies the second constraint.

Therefore, Λis a feasible point of the LP, as claimed.

Step 2. Let be a feasible point of the LP in Algorithm 1, and let

be the corresponding policy returned by Algorithm 1. Then, we claim that , that , and that the value of the objective for equals .

To see the first claim, let be defined by

We show that . To this end, note that because satisfies the first constraint in the LP, we have

Together with the equality

we have

Thus,

We note that is invertible—in particular, the eigenvalues of have norms bounded by , so the eigenvalues of have norms bounded below by 1 ; therefore, the eigenvalues of are nonzero, so it is invertible. As a consequence, we can solve for d in (13) to get

Finally, from (10) in Step 1 of this proof, we established that similarly satisfies

As before, since is invertible, we have

Thus,

so the first claim follows.

To see the second claim, note that since is feasible, it must satisfy the second constraint of the LP. As shown in the first step of this proof, (11) is equivalent to the demographic parity constraint. Thus, , as claimed.

To see the third claim, note that

In other words, the value of the objective of the LP for the point is equal to , as claimed.

Step 3. Finally, we use the results from the previous two steps to prove the theorem statement. First, let be the solution to (1). By the claim shown in the first step, Λis a feasible point of the LP in Algorithm 1. Furthermore, by the claim shown in the second step, the value of the objective for is .

Next, let be the solution to the LP in Algorithm 1. By the claim shown in the second step, (i) , where is the policy returned by Algorithm 1, (ii) , and (iii) the value of the objective for is .

It follows that , since maximizes the objective of the LP over feasible points (and Λis feasible). Since , it follows that is also a solution to (1). Thus, we have proven the theorem statement.

D Proof of Theorem 4.2

Our proof proceeds in three steps. First, we bound the error due to truncation. Second, we bound the estimation error . Third, we combine steps 1 and 2 to prove Theorem 4.2.

Step 1. Note that for any policy and any , we have

Step 2. For each , let ˆbe an estimate of ˜using m sampled rollouts . First, note that

is bounded, so we can apply Hoeffding’s inequality (see Lemma F.1) to get

Since Z = {maj, min}, by a union bound,

with probability at least 1 .

Step 3. Now, we can prove Theorem 4.2. First, note that with probability 1 ,

for all . Thus,

which implies that . Thus, the theorem follows.

E Proof of Theorem 5.1

We prove the following lemma; Theorem 5.1 follows by choosing .

Lemma E.1. Let be given. Assume that be an upper bound on R (i.e., ) and on . Let ˜, and let

Let ˆM = (S, A, D, ˆP, R, T), and ˆbe the optimal policy for ˆM in ˆΠ(i.e., the set of policies satisfying demographic parity for ˆM). Let M = (S, A, D, P, R, T), and be optimal for M in Π. Then, ˆ, and , where is defined for M.

Our proof proceeds in three steps. First, we prove that for any , we can choose sufficiently large so that

with probability at least 1. Second, we prove that assuming ˆ, then for any policy , we have

where (resp., ˆ) is the expected cumulative distribution assuming the transitions are P (resp., ˆP), and similarly for the agent rewards . Third, we use the first two steps to prove the lemma statement.

Step 1. Given , we claim that for

then our estimate ˆP satisfies

with probability at least 1 .

Let be the random variable indicating whether our algorithm observes a tuple () (for some ) on a single episode, and let be samples of for each of the exploratory episodes taken by our algorithm. Let

Then, by Hoeffding’s inequality (see Lemma F.1), we have

By assumption, we have

so using 2 in (14), we have

with probability at least

Taking a union bound over and , we have (15) holds for every and with probability at least

In this event, we have at least observations () (for some ) for every and .

Now, for an observation (), let be the random variable indication whether . Without loss of generality, we assume that we have exactly samples of for each and . Let

Then, by Hoeffding’s inequality (see Lemma F.1), we have

Note that by definition, and ˆˆ. Thus, taking in (17), we have

with probability at least

Taking a union bound over all and , we have (18) for all and with probability at least

In other words, in this event, we have ˆ.

Taking a union bound over (16) and (19), we have

with probability at least

as claimed.

Step 2. We claim that assuming

then for any policy , we have

where is the expected cumulative reward for in the MDP M = (S, A, D, P, R, T) and ˆis the expected cumulative reward for in the MDP ˆM = (S, A, D, ˆP, R, T). Note that we have replaced the discount factor with the time horizon T. In addition, for all , we have

where

is the expected cumulative agent reward for the MDP M, and ˆis the expected cumulative agent reward for the MDP ˆM. We only prove the claim for ˆ; the claim for follows using the same argument.

Let be

Then, we have

Now, note that

so we have

Thus,

Similarly,

It follows that

Thus,

where the first line follows from H¨older’s inequality, and the second line follows from properties of the matrix norm, from the fact that

and using the fact that the summand is zero for t = 0 since (. Note that

Furthermore,

where the first inequality follows from H¨older’s inequality and the second inequality follows since is a discrete probability distribution. Therefore,

Plugging (21) and (22) into (20) gives

Step 3. Now, we prove the theorem. Let ˆbe the optimal policy for ˆM (i.e., transitions ˆP) satisfying ˆˆΠ. Similarly, let be the optimal policy for M (i.e., transitions P) satisfying . We apply the second step with

Then, by the first step, for all , we have

) + (ˆ) + +() + 2 +

where the inequality on the third line follows because ˆˆΠ, and the inequality on the last line follows since ˜. Thus, we guarantee that ˆ.

Next, note that similarly, for all , we have

so ˆΠ. As a consequence, we have

≤ · |S| · + 0 + · |S| · ≤

where the inequality on the third line follows because ˆmaximizes ˆover ˆΠand ˆΠ, and the inequality on the last line follows since ˜.

Thus, the lemma statement follows.

F Technical Lemmas

Lemma F.1. (Hoeffding’s inequality) Let be a random variable with domain [and mean , and let ˆbe an estimate of a using n i.i.d. samples . Then, we have

where the probability is taken over the randomness in the i.i.d. samples .

Proof. See Wainwright (2019) for a proof.

(a) (b) (c)

Figure 2: Demographic parity (a) objective value, (b) constraint value for our algorithm (black) and the optimistic baseline (red). (c) Regret of our reinforcement learning algorithm.

G Experimental Details & Additional Results

Parameters. We use the following parameters for our loan MDP:

Additional results for Algorithm 2. We addition-

ally study how Algorithm 2 varies with the fairness constraint threshold . In Figure 2 (a,b), we show the objective value achieved and the fairness constraint value achieved by our algorithm and the optimistic algorithm for the demographic parity constraint. While the objective values achieved are very similar, the optimistic algorithm does not always satisfy the fairness constraint. In particular, for 025, its constraint value is 0.052 (exceeds by 108%), and for 05, it is 0.079 (exceeds by 59%). Intuitively, there are multiple policies that achieve the same objective value, but the optimistic algorithm sometimes fails to find the ones that are fair. In contrast, our algorithm always satisfies the fairness constraint.

Results for Algorithm 1. We have evaluated Algo-

rithm 1 on a modified version of our loan MDP where and are discretized and thresholded to make the state space finite. For this MDP, we have compared Algorithm 1 to solving an unconstrained MDP—i.e., without the demographic parity fairness constraint. We use 01. Our results are as follows: (i) for Algorithm 1, the cumulative expected reward is 0.68 and the fairness constraint value is 0.01, and (ii) for the unconstrained algorithm, the cumulative expected reward is 0.69 and the fairness constraint value is 0.26. In other words, for a small reduction in reward, our algorithm substantially improves fairness. The remaining baselines cannot be implemented using the approach in Algorithm 1.

Results for reinforcement learning. We have run

our reinforcement learning algorithm in conjunction with the We run the algorithm for N = 1000 episodes total. In particular, we explore for 100 episodes using a conservative policy that ignores the state; then, we use the estimated transitions to learn the optimal policy ˆand use ˆfor the remaining 900 episodes. Since our model is parameterized by and , we estimate these quantities instead of directly estimating the transitions. We show the regret compared to the optimal policy in Figure 2 (c), averaged over 5 iterations. As can be seen, the regret quickly increases while using , and then becomes almost flat when using ˆ. We note that our algorithm satisfies the fairness constraint across all episodes and iterations.

Designed for Accessibility and to further Open Science