On the Fairness of Randomized Trials for Recommendation with Heterogeneous Demographics and Beyond

2020·Arxiv

Abstract

Abstract

Observed events in recommendation are consequence of the decisions made by a policy, thus they are usually selectively labeled, namely the data are Missing Not At Random (MNAR), which often causes large bias to the estimate of true outcomes risk. A general approach to correct MNAR bias is performing small Randomized Controlled Trials (RCTs), where an additional uniform policy is employed to randomly assign items to each user. In this work, we concentrate on the fairness of RCTs under both homogeneous and heterogeneous demographics, especially analyzing the bias for the least favorable group on the latter setting. Considering RCTs’ limitations, we propose a novel Counterfactual Robust Risk Minimization (CRRM) framework, which is totally free of expensive RCTs, and derive its theoretical generalization error bound. At last, empirical experiments are performed on synthetic tasks and real-world data sets, substantiating our method’s superiority both in fairness and generalization.

1. Introduction

Conventional machine learning model, though can reach high accuracy in average, often falls short in causing high loss on minority groups at the same time. Fairness issues have been studied in various tasks, like medical treatment, criminal justice (Kearns et al., 2019; Nabi et al., 2018; Mary et al., 2019), credit loan (Liu et al., 2018), machine translation (Hashimoto et al., 2018), recommender system (?Beutel et al., 2019; Singh & Joachims, 2019), etc. The reasons for improving model’s fairness are not limited with respect to ensuring morality or legality , but also involving economical considerations. In (Hashimoto et al., 2018), it proves the amplification of demographic disparity in sequential ERM process. That is, the minority group which suffers

Figure 1. Missing ratings in a recommender system.

higher loss will become more likely to quit using our service, thereby providing no feedback data any longer. This negative feedback loop can cause shrinking size of users, hence reducing the income or welfare provided by our machine learning systems over time.

Specifically for recommender systems, there have been line of work showing that, the real-world logging policy often collects missing not at random (MNAR) (Yuan et al., 2019) data (or selective labels). For example, users tend to only reveal ratings for items they like, thus the observed users’ feedback (usually described by click-through-rate, (CTR)) can be substantially better than those not observed yet. In Fig.1, estimated average CTR from the observed outcome is . However, if all unobserved outcomes are 0, the true CTR is . This gap between the factual and counterfactual ratings, in MNAR situation, has been empirically validated by (Steck, 2013). In this work, we show that MNAR also undermines fairness of recommendation over time, by exacerbating minority group’s utility.

To get unbiased estimate of true CTR without revealing all feedback, running randomized controlled trials (RCTs) (Haynes et al., 2012) has been so far the first choice, which satisfies missing-at-random (MAR) (Seaman et al., 2013) condition:

Definition 1 (Condition of MAR). The missing data are missing-at-random (MAR) if the conditional probality of observed pattern, represented by O here, satisfies

However, the cost of RCTs is unacceptable in many scenarios, e.g. we cannot randomly assign drugs to the volunteers, or randomly display ads for users. Vast majority of existing works (Shalit et al., 2017; Rosenfeld et al., 2017; Bonner & Vasile, 2018) tend to leverage small RCTs and large biased logged data to correct the MNAR bias, through multi-task learning or domain adaptation techniques. They claim that small RCTs, though not capable of revealing each individual’s preference (Rosenfeld et al., 2017), are sufficient for reaching low-variance estimate of the average of all, from which we can obtain model that generalizes well to those counterfactual events. However, the conception about small is not well-defined yet, hence no guideline for proper number of RCTs to trade gain against cost. Besides, although RCTs are intuitively regarded as fair for each user, we argue that learning from the average from RCTs causes unfair policy, on heterogeneous demographics, i.e. users vary over populations defined by preferences.

In this work, we aim to mitigate utility disparity among users, as well as improving global utility of all in recommendation. Our object of interest is a RCTs-free method that can achieve both two goals above. We begin from introducing Gini index (Gini, 1997) as a disparity measure in Section 4. In Section 5, we consider two heterogeneous situations, for analyzing the RCTs’s fairness, accounting for the least favorable group’s risk. Then, we propose our RCTs-free Counterfactual Robust Risk Minimization (CRRM) framework in Section 6, and test its performance in Section 7. The proof of all propositions and theorems can be found in appendices. And codes for experiments are available in the supplementary material.

Contributions. Our main contributions are three-fold:

1. We demonstrate the unfairness of conventionally used ERM-based recommendation under MNAR data, and connect protecting the least favorable group to ensuring fairness regarding to Gini index;

2. We are the first to theoretically analyze the fairness of RCTs-based methods, e.g. IPS, by considering the least favorable group’s risk over multiple demographic conditions;

3. We develop a RCTs-free CRRM framework, and derive a constructive theoretical error bound. Furthermore, our method is empirically proved improving both fairness and accuracy, on both synthetic tasks and real-world data.

2. Related work

Fair machine learning. Fairness in machine learning is an emerging topic in the literature recently. The most direct principle of algorithmic fairness is to protect sensitive tags, features or labels, such as race, gender, age, etc. This formulation of fairness has been investigated in (Hardt et al., 2016; Mary et al., 2019) where the conditional independence, between the decision made by the algorithm and the variables to protect, is estimated or guaranteed. In other respects, the Rawlsian fairness (Rawls, 2009) is introduced to substantiate this equal opportunity principle in a bandit learning scheme (Joseph et al., 2016; Jabbari et al., 2017), where an action is preferred only if its true quality is better. However, as far as we know, few work has ever mentioned fairness issue in MNAR scenario.

For some scenarios when the demographics are unknown, the previous works have been so far concentrating on protecting subgroups with latent demographic structure (Kearns et al., 2017), or learning a distributionally robust model for uniform performance without knowing demographics (Duchi & Namkoong, 2018; Hashimoto et al., 2018). Although these approaches are proved to be able to protect the minority group, no evident improvement of global utility for all users is witnessed in practice.

Correcting MNAR bias. An importance sampling approach called inverse propensity score (IPS) is proposed initially to correct the sampling bias (Rosenbaum & Rubin, 1983), then followed by line of work including doubly robust (DR) (Dud´ık et al., 2011; Jiang & Li, 2015), weighted doubly robust (Thomas & Brunskill, 2016), joint learning doubly robust (Wang et al., 2019), self-normalized IPS (Swami- nathan & Joachims, 2015; Schnabel et al., 2016), etc., for recommendation on MNAR data, by utilizing additional small RCTs. Besides, methods (Johansson et al., 2016; Bonner & Vasile, 2018) adapted from domain adaptation and representation learning techniques are used for learning from both and . However, they hardly mention their methods’ fairness, or the effectiveness over heterogeneous demographics.

3. Problem Setup

Denote user , then an even The outcome of an event is

There are in total possible events considering different combination of users and items, where here means the cardinality of a set, we set |U| = M and |V| = N. The true risk of predicted outcome matrix be easily written out as

where is assumed drawn from a true outcome matrix is usually set as surrogate mean-squared-error, or logistic loss, for simplicity we denote it as

When some of outcomes are not observed, calculating Eq.(3) becomes infeasible. Such that, previous works attempt to

Figure 2. A demonstration of MNAR problem.

only minimize the factual empirical risk, namely ERM as

where random variable means whether a pair (u, v) is observed. The observation of outcome is hence constrained by the missing pattern , shown in Fig.2. Such that, the is a biased estimate of true risk

3.1. Correcting sampling bias by propensity scores

To deal with the bias shown in Eq.(5), an unbiased inverse propensity score (IPS) estimator (Rosenbaum & Rubin, 1983) can be built as

where is the used propensity score that , hence it is easy to derive

The propensity score here, can be approximated via a simplified naive Bayes (NB) estimator (Marlin & Zemel, 2009) as following:

with assuming that P(O | Y, X) = P(O | Y ). Though the P(Y | O) and P(O) can be counted by MNAR data, the P(Y = y) has to be estimated via MAR data, which is often achieved by taking average of n RCTs’ outcomes

Using x = (u, v), the bias of IPS estimator with inaccurate propensity is given by (Schnabel et al., 2016) that

For notation simplicity, we omit of the from now on. In this work, we delve deeper into this bias, accounting for the bias of different groups of users. We assume that each individual u belongs to a group , and the group mean preference is represented by

In this viewpoint, the is an efficient empirical estimate of global preference for all users. However, we question that, when demographics are heterogeneous, i.e. are different, is the by RCTs still a fair and economical estimate? Since for ensuring fairness, we care about loss suffered by the least favorable group, rather than only the average over all. To deal with this fairness issue, we first focus on how to measure utility disparity among users in next section.

4. Measuring Utility Disparity by Gini Index

The conventional approach to update a policy for recommender system, often proceeds in repeatedly learning from the MNAR logged data, thus minimizing the empirical risk. In this section, we identify that the policy obtained by this process fails to offer fair services, when taking user’s click-through-ratio (CTR) as their utility. We then propose Gini index for measuring this disparity, and reveal the connection between improving Gini fairness and protecting the least favorable group’s utility.

Figure 3. Analysis of simulation results of the motivating example.

4.1. Motivating example

We build a tiny system with 5 users and 5 items for simulation, and compare the user-wise and item-wise numbers of clicks achieved by and , shown in Fig.3. Unfair treatment towards users by can be witnessed in the Fig.3a. Compared with , though the total clicks achieved by slightly higher (2916 v.s. 2484), the click rates of user #1 and #2 are much lower than their counterparts by . That means, the achieves good performance in average, at the expense of deteriorating experience of the minorities. Besides, the Fig.3b illustrates devastating diversity reduction by , which is also unacceptable in recommendation. Details about the implementation of simulation can be found in Appendix F.1.

4.2. Gini fairness and the minority’s utility

Example above impresses the necessity to measure users’ utility disparity. To this end, we introduce Gini index drawn from Economics to measure fairness in recommendation. The Gini index , with 0 indicating perfect equality and 1 being perfect inequality, is a measure of the distribution of income across percentiles among populations.

Figure 4. Computing the Gini index via the Lorenz curve.

To compute GI, the Lorenz curve (Mankiw, 2006) plots population percentile by utility on the x-axis, and cumulative utility on the y-axis. Denote the cumulative l-th group’s utility as . From Fig.4, we can compute the area under the Lorenz curve by summation of all

therefore the GI can be obtained by

We give the following Proposition 1 which identifies the connection between controlling Gini index and protecting the least favorable group.

Proposition 1 (Least favorable group’s utility and Gini index). Assume that , controlling the worst-case to protect the minority group can improve fairness, by virtue of controlling the supremum of the GI as

if and only if the equality holds.

Based on above analysis, picking the least favorable group’s average utility can be a surrogate metric for fairness, since the GI’s upper bound is restricted by . By the Lorenz curve of the motivating example in Fig.5, the has much less Gini index than , and the minimum utility earned by is larger than

Figure 5. The Lorenz curve of cumulative utility obtained by policy , e.g. in (a), the point (40%, 0.14) means the tail two users obtain 14% of total utility; and in (b), the (40%, 0.36) means 36% of total utility. The value of Gini index is defined as the area of , therefore the is much more fair than

Another concern is that the global utility gU, namely the utility enjoyed by all users, is also important for evaluating a policy’s effectiveness:

because it decides the total income of the system. Then comes two principles for developing a recommender system:

1. Reduce Gini index GI (Eq. (12)) for fairness

2. Improve global utility gU (Eq. (14)) for effectiveness

We will use the above two metrics for comparison in the experiment section.

5. Fairness of RCTs Considering Heterogeneous Users Demographics

Recall that our goal is to control the utility disparity of users, which can be realized by controlling the least favorable group’s utility , as above mentioned in Eq.(13). We proceed the analysis to fairness of RCTs-based methods in two steps: first, we show how the empirical mean influ-ences the minority’s risk by IPS estimator; then, we analyze this risk regarding to several demographics assumptions, i.e. homogeneous and heterogeneous. As a result, we reveal the limitations of these RCTs-based methods, even if the RCTs is commonly regarded as trustworthy knowledge for debiasing MNAR data in the literature.

5.1. Risk of the least favorable group

Let , then the least favorable group is and the event set related to . Bias of IPS estimator on , similar to Eq.(9), can be derived as

Proposition 2 (Bias of the least favorable group’s risk by IPS). The bias of IPS risk suffered by the least favorable group , has a lower bound as

Roughly speaking, the above inequality demonstrates that the ’s property matters. In particular, even under uniform policy, the bias for each group would be dependent on the preference demographics, as shown next.

5.2. Homogeneity: same mean and same variance

Intuitively, when randomly assigning items for users, under MAR condition, the is a fair and efficient estimate for debiasing. However, we argue that this only holds under homogeneous and weak heterogeneous demographics.

Assumption 1 (Homogeneous demographics). Users’ preference are same mean and same variance, namely the

With above assumption, and by law of large numbers, it is easy to get the consistency of , i.e. . Recall the inequality in Eq.(16), the right term approaches to zero at this situation because , which holds for other groups as well. That is the reason why in our motivating example, uniform reaches good fairness. However, these homogeneity hardly happen in practice. A real-world recommender system often serves hundred million of users, such that it is reasonable to assume that the demographics of users are highly heterogeneous. We discuss it by two cases next, i.e. weak and strong heterogeneity.

5.3. Weak heterogeneity: same mean but different variance

We begin our analysis from a simple case without loss of generality, in which users are from two distributions: and , with probability from the first and from the second one. Formally, this is a common case in Robust Statistics (Maronna et al., 2019) called contaminated distribution, as

For simplicity, we assume that both are Gaussian distribution (we will extend it to more general sub-Gaussian family later). Based on Eq.(8) and Eq.(17), it is easy to derive that

We first consider the setting that groups’ preferences are same mean but different variance (SMDV), namely , and . We argue that only if variance is controllable, the RCTs under SMDV demographics lead to asymptotically normal and fair

Theorem 1 (Asymptotic normality under SMDV). Let , under SMDV assumption, if

then the empirical mean satisfies asymptotic normality as n

Similar as in homogeneous demographics, with weak heterogeneity and mild condition on variance, the also leads to fair result.

5.4. Strong heterogeneity: different mean and different variance

Although the above analysis proves that RCTs do lead to fair estimator in theory, the fundamental challenge faced is that, no matter for homogeneity or weak heterogeneity, the same mean assumption seems to be too strong. This argument urges us to explore more general setting, that is, when the demographics are different mean and different variance (DMDV), how well the performs, in terms of its bias and fairness. Accordingly, we continue to L DMDV groups with sub-Gaussian assumption next.

Theorem 2 (Tail bound under strong heterogeneity). Given the

with probability . Specifically when the sub-Gaussian parameters are all same, i.e.

Moreover, For the bias of IPS estimator on the least favorable group, we have

Above bound identifies that inaccurate propensities obtained by NB estimator, with from RCTs as the outcome prior, contributes to unfair worst-case for different groups. Though the is unbiased, its part on is worse than other , since the matters for the lower bound of

Moreover, for other multi-task learning or domain adaptation methods, we conjecture that this challenge appears as well, because small RCTs are only competent for average estimation, rather than representative for all groups. This challenge encourages us to develop a machinery, which can correct MNAR bias without RCTs, as well as achieving better fairness for recommendation, described next.

6. Beyond Randomized Trials: Counterfactual Robust Risk Minimization

Given a predictor , we let , and similar for the . For ensuring fairness on known demographics, the disparity metrics are used by (?) to measure the outcome disparity conditioned by predefined group:

However, demographics of user’s preference in recommendation are usually unknown, thus defining groups and optimizing the Eq.(24) directly is difficult.

To deal with unknown demographics, distributionally robust optimization (DRO) is employed for controlling the least favorable group’s risk by (Hashimoto et al., 2018), while some concerns are (1) the importance sampling scheme introduces additional variance in optimization; (2) the result is sensitive to the distance metric for defining the uncertainty set and (3) the DRO often degenerates to ERM (Hu et al., 2018) in many scenarios. In this section, we propose a more easy-to-use and effective counterfactual robust risk minimization framework, aiming at reducing utility disparity without knowing demographics. We first present the defini-tion of objective function and algorithm of the framework, then give its theoretical generalization error bound.

6.1. Counterfactual robust risk minimization

We define the conception of individual fairness, that is, for similar users, the predictor is ought to give similar outputs. Hence we introduce the so-called robust risk that

where the predictor f is parametrized by sarial input is restricted in a ball, with radius around the original input x. As aforementioned, high utility disparity is often caused by ERM, because the predictor tends to overestimate the factuals, and generalize badly on counterfactuals.

Algorithm 1 Mini-batch SGD learning for CRRM

Intuitively, one would expect the parity between event x and any similar , no matter it is a factual or counterfactual, could be controlled by optimizing . Similar regime named virtual adversarial training proposed by (Miyato et al., 2018), is originally for improving local distributional smoothness for semi-supervised learning from unlabeled samples. In this work, we involve counterfactuals in optimization by the defined robust risk, and find it is also a beneficial constraint for improving fairness.

The objective function for our CRRM framework is defined as

For observed events, we try to minimize both the robust risk and empirical risk ; for the unobserved, as is intractable, only is involved. The empirical risk can be defined as any surrogate loss for supervised learning, e.g. mean-squared-error. For the robust risk, it is tractable by a power iteration (Miyato et al., 2018) approach.

The bias of , like previously done for the , can be easily derived by

Unlike the ERM and IPS estimator, the identifies that it may be beneficial to ”penalize” small empirical risk individuals with overestimating robust risk, and vice versa, for uniform performance among users.

Algorithm 1 details our CRRM framework, where trainable embeddings for users and items are and . The input hyper parameters include: regularization term , conterfactual risk weight , robust radius perturbation radius , learning rate , power iteration step J and total learning epoch T, though we find few of them need being finely tuned in practice. Their setups are discussed in experiments section.

6.2. Bounding the risk over counterfactuals

Definition 2 (CRRM for recommendation). Given a hypothesis space H of predictions as , and CRRM’s objective function , the algorithm selects that optimizes:

To illustrate the validity and principle of our CRRM approach, we state the following bound, where only finite hypothesis set H is considered for the sake of conciseness. Theorem 3 (Generalization error bound of CRRM). For any finite hypothesis space of predictions H = , and empirical loss is bounded , the true risk of our CRR minimizer from H, is bounded with probability

where and we assume

This generalization error bound is constructive, as it motivates a trade-off between fairness and effectiveness which does not occur in ERM. In particular, overestimating for small individuals could facilitate fairness as well as the first bias term above. However, larger variance would appear in the second term at the same time. Fortunately, in most scenarios, our propensity-agnostic CRRM method empirically benefits significantly, both in fairness and generalization, from the additional applied fairness constraint. We will demonstrate it in next section.

7. Experiments

In this section, we perform experiments to explore the performance of proposed CRRM, in terms of both global

Figure 6. Estimate of under two different setting of demographics, left: homogeneous, right: heterogeneous. Green dot lines are the maximum and minimum of among all groups, and red one is the average of all.

Table 1. mU and gU vary under data sparsity.

utility and Gini fairness. Furthermore, additional experiments on real-world data sets (Yahoo2 and Coat) are conducted to compare CRRM with other methods, according to the commonly used measurements, e.g. MSE, AUC. We here consider logistic regression (LR), matrix factorization (MF-naive), inverse propensity score based MF (MF-IPS) (Schnabel et al., 2016), self-normalized IPS (MF-SNIPS) (Swaminathan & Joachims, 2015) and MF incorporated in our CRRM framework (MF-CRRM).

7.1. Synthetic task

The experimental setup is similar as the motivating example done in Section 4.1, but we supplement the heterogeneous demographics setting, i.e. each user’s average preference is different, and extend the scale to M = 10, N = 1000. For experiments on the non-uniform , each time decides the top 1 item for each user, and collects the users’ feedback. Then it learns from the logged data, and continues to give next round of top 1 recommendation.

How does demographic heterogeneity influence property by RCTs and ERM-based policy? We want to compare the performance of with and , with varying heterogeneity, in the synthetic task. The results are shown in Fig.6, where the x-axis is the observed events ratio. With homogeneous demographics in Fig.6a, the by ERM policy is much higher than the global average, and converges to global average with high confidence. However, with strongly heterogeneous demographics defined in section 5.4, the Fig.6b displays high variance of the .

Table 2. Test MSE and AUC on real-world data sets.

Unlike in homogeneous setting, the blue band converges much slower with sampling ratio increasing. Besides, the orange line () also has deteriorating performance compared with before. The result demonstrates that heterogeneity of demographics can cause failure of RCTs-based IPS because of high variance incurred by , and also do harm to ERM-based in practice.

Does CRRM improve fairness and generalization? Results on strong demographic heterogeneity are reported in Table 1, where mU is the minimum of user’s enjoyed utility. We test our CRRM method against LR, naive MF and IPS-based MF, and significant superiority of our method is witnessed. The MF-CRRM substantially perform better on both fairness and effectiveness in general, as it gains the best mU and gU almost under each sparsity setting. By contrast, MF-IPS with propensity scores learned from MAR data with 10% of number of MANR data, does not win over naive MF, due to high variance caused by heterogeneous demographics.

7.2. Real-world data sets

Further experiments conducted on two real-world data sets, which both collected a small test set where users are provided randomly displayed items for labeling. Following previous works’ setting, we train all models on MNAR training set , and test them on MAR test set. More details about experimental setup are in appendix F.2.

Yahoo ! R3 Data Set. This is a user-song rating data set (Marlin & Zemel, 2009), where MNAR training set has over 300K ratings for songs that were selected by users. The test set contains ratings by 5400 users who were asked to rate 10 randomly selected songs.

Coat Shopping Data Set. This data set was collected and used in (Schnabel et al., 2016), which contains 290 users and 300 items. Each user self-selected 24 items, and additional 16 items were randomly drawn for them to rate.

Results. Results are reported by Table 2 and Table 3, where our CRRM substantially wins over other baselines on both data sets, either in effectiveness metric gU, MSE, AUC or in fairness metric GI. Unlike other IPS-based methods, CRRM has much lower cost since it does not need additional RCTs. It contributes to less variance via controlling both robust risk and factual risk, thus facilitating generalization without RCTs, as presented in Theorem 3. It should be noted that on YAHOO data set, for computation efficiency, our CRRM only sampling for optimizing on the in each epoch, which sheds light on our CRRM’s promising prospects in real-world recommender system.

8. Discussion & Conclusion

In this work, we focus on mitigating MNAR bias and unfair policy, which are ubiquitous in recommendation. Previous works tend to leverage empirical outcome mean from small RCTs for correcting MNAR bias, e.g. IPS family. However, our study on the bias of these RCTs-based IPS estimator, with introducing more general different mean demographics, identifies the risk estimator with the least favorable group for ensuring average performance.

We therefore propose an easy-to-use and RCTs-free counterfactual robust risk minimization framework, in order to circumvent the cost of RCTs, and mitigate the unfairness in previous works. The key insight of our CRRM is to balance between robust and factual risk, thus both reducing bias and facilitating fairness, explained by the generalization bound in our Theorem 3. Moreover, by random sampling when alternative events set is large, our algorithm only costs a little additional computation, thus being scalable to large recommender systems.

We conjecture that the cutting edge sampling approaches (Wang et al., 2020), rather than randomly sampling from all events, can be plugged in our framework for further improvement. Besides, other existing models for recommendation, e.g. deep factorization machine (Guo et al., 2017), graph neural networks (Fan et al., 2019), can be retrofit with our CRRM for even better results.

Acknowledgements

This research was funded by the Natural Science Foundation of China 61807021, Shenzhen Science and Technology Research and Development Funds (JCYJ20170818-

094022586), and Innovation and entrepreneurship project for overseas high-level talents of Shenzhen (KQJSCX20180327144037831). The authors would like to thank advice from Professor Yuhan Dong and Xiangxiang

Xu for the writing and theory of this work.

References

Beutel, A., Chen, J., Doshi, T., Qian, H., Wei, L., Wu, Y., Heldt, L., Zhao, Z., Hong, L., Chi, E. H., et al. Fairness in recommendation ranking through pairwise comparisons. arXiv preprint arXiv:1903.00780, 2019.

Bonner, S. and Vasile, F. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 104–112. ACM, 2018.

Duchi, J. and Namkoong, H. Learning models with uniform perfor- mance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Dud´ık, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

Fan, S., Zhu, J., Han, X., Shi, C., Hu, L., Ma, B., and Li, Y. Metapath-guided heterogeneous graph neural network for intent recommendation. 2019.

Feller, W. An introduction to probability theory and its applications, volume 2. John Wiley & Sons, 2008.

Gini, C. Concentration and dependency ratios. Rivista di politica economica, 87:769–792, 1997.

Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.

Hardt, M., Price, E., Srebro, N., et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323, 2016.

Hashimoto, T. B., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. arXiv preprint arXiv:1806.08010, 2018.

Haynes, L., Goldacre, B., Torgerson, D., et al. Test, learn, adapt: Developing public policy with randomised controlled trials— cabinet office. 2012.

Hoeffding, W. Probability inequalities for sums of bounded ran- dom variables. In The Collected Works of Wassily Hoeffding, pp. 409–426. Springer, 1994.

Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 2029–2037. PMLR, 2018.

Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1617–1626. JMLR. org, 2017.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.

Johansson, F., Shalit, U., and Sontag, D. Learning representations for counterfactual inference. In International conference on machine learning, pp. 3020–3029, 2016.

Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Rawlsian fairness for machine learning. arXiv preprint arXiv:1610.09559, 1(2), 2016.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2017.

Kearns, M., Roth, A., and Sharifi-Malvajerdi, S. Average individ- ual fairness: Algorithms, generalization and experiments. arXiv preprint arXiv:1905.10607, 2019.

Koren, Y., Bell, R., and Volinsky, C. Matrix factorization tech- niques for recommender systems. Computer, (8):30–37, 2009.

Lawrence, C., Sokolov, A., and Riezler, S. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. arXiv preprint arXiv:1707.09118, 2017.

Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.

Ma, W. and Chen, G. H. Missing not at random in matrix completion: The effectiveness of estimating missingness probabilities under a low nuclear norm assumption. In Advances in Neural Information Processing Systems, pp. 14871–14880, 2019.

Mankiw, N. G. Principles of macroeconomics. Cengage Learning, 2006.

Marlin, B. M. and Zemel, R. S. Collaborative prediction and rank- ing with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, pp. 5–12. ACM, 2009.

Maronna, R. A., Martin, R. D., Yohai, V. J., and Salibi´an-Barrera, M. Robust statistics: theory and methods (with R). John Wiley & Sons, 2019.

Mary, J., Calauzenes, C., and El Karoui, N. Fairness-aware learn- ing for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382–4391, 2019.

Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual ad- versarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.

Nabi, R., Malinsky, D., and Shpitser, I. Learning optimal fair policies. arXiv preprint arXiv:1809.02244, 2018.

Rawls, J. A theory of justice. Harvard university press, 2009.

Richardson, M., Dominowska, E., and Ragno, R. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pp. 521–530. ACM, 2007.

Rosenbaum, P. R. and Rubin, D. B. The central role of the propen- sity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Rosenfeld, N., Mansour, Y., and Yom-Tov, E. Predicting counter- factuals from large historical data and small randomized trials. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 602–609. International World Wide Web Conferences Steering Committee, 2017.

Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., and Joachims, T. Recommendations as treatments: Debiasing learning and evaluation. arXiv preprint arXiv:1602.05352, 2016.

Seaman, S., Galati, J., Jackson, D., and Carlin, J. What is meant by” missing at random”? Statistical Science, pp. 257–268, 2013.

Shalit, U., Johansson, F. D., and Sontag, D. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3076–3085. JMLR. org, 2017.

Singh, A. and Joachims, T. Policy learning for fairness in ranking. arXiv preprint arXiv:1902.04056, 2019.

Steck, H. Evaluation of recommendations: rating-prediction and ranking. In Proceedings of the 7th ACM conference on Recommender systems, pp. 213–220. ACM, 2013.

Swaminathan, A. and Joachims, T. The self-normalized estimator for counterfactual learning. In advances in neural information processing systems, pp. 3231–3239, 2015.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy eval- uation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Wang, X., Zhang, R., Sun, Y., and Qi, J. Doubly robust joint learning for recommendation on data missing not at random. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 6638–6647. PMLR, 2019.

Wang, Z., Zhu, H., Dong, Z., He, X., and Huang, S.-L. Less is better: Unweighted data subsampling via influence function. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.

Yuan, B., Hsia, J.-Y., Yang, M.-Y., Zhu, H., Chang, C.-Y., Dong, Z., and Lin, C.-J. Improving ad click prediction by considering nondisplayed events. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 329–338. ACM, 2019.

Zabell, S. L. Alan turing and the central limit theorem. The American Mathematical Monthly, 102(6):483–494, 1995.

A. Proof of Proposition 1

Proposition 1 (Least favorable group’s utility and Gini index). Assume that , controlling the worst-case to protect the minority group can improve fairness, by virtue of controlling the supremum of the GI as

if and only if the equality holds.

Proof.

where the last inequality holds because

B. Proof of Proposition 2

Proposition 2 (Bias of the least favorable group’s risk). The bias of IPS risk suffered by the least favorable group lower bound as

Proof. It is easy to derive that

Such that, we concentrate on in the Eq.(15):

The second inequality Eq.(B.5) comes from the fact that the least favorable group ’s true average preference must be less than the empirical average. Combining the result above and Eq.(15) yields the result.

C. Proof of Theorem 1

Definition C.1 (Triangular Array (TA)). A triangular array of random variables is like

For each row n, the random variables are assumed independent, with , and . We denote this triangular array as henceforth.

Definition C.2 (Lindeberg’s condition (Zabell, 1995)). For random variables the variance . We define a triangular array as , which satisfies Lindeberg’s condition if for

Lemma C.1 (Lindeberg’s condition under SMDV). Given , and , if the

variance is controllable, namely

then the satisfies Lindeberg’s condition.

Proof. Assume without loss of generality that a random variable distributed like every

in which the second inequality holds because it is easy to derive that , the last two terms converge to zero, because of the condition set in Eq.(C.4).

Theorem 1 (Asymptotic normality under SMDV). Let , under SMDV assumption, if

then the empirical mean satisfies asymptotic normality as

Proof. It is easy to derive that the summation over the . Based on the condition set by Eq.(C.9), applying Lemma C.1 yields the result that satisfies Lindeberg’s condition, namely

From the Lindeberg-Fellar Central Limit Theorem (CLT) (Feller, 2008), the summation over the each row of converges to a zero-mean unit variance Gaussian distribution, namely

D. Proof of Theorem 2

First, we introduce Hoeffding’s lemma on bounded random variables (Hoeffding, 1994) without proof.

Lemma D.1 (Hoeffding’s lemma (Hoeffding, 1994)). if with probability 1, and E(X) = 0, then X is sub-Gaussian with parameter

Theorem 2 (Tail bound under strong heterogeneity). Given the

with probability . Specifically when the sub-Gaussian parameters are all same, i.e.

Moreover, For the bias of IPS estimator on the least favorable group, we have

Proof. From the definition of , it is easy to derive

And the interval of

Applying Hoeffding’s inequality for

therefore

E. Proof of Theorem 3

Proof.

because . Taking the out of the brace leads to the final result.

Theorem 3 (Generalization error bound of CRRM). For any finite hypothesis space of predictions , and empirical loss is bounded , the true risk of our CRR minimizer from H, is bounded with probability

where and we assume

Proof. For notation simplicity, we omit the subscript CRR next.

which follows from Lemma E.1. Considering Hoeffding inequality and union bounds, we are now going to bound the following

solving for yields the bound that

As we know that , so it is easy to derive the final bound as

Substitute ) yields the final result.

F. Experimental setup

F.1. Simulation Tasks

For motivating example, we build a matrix as the users’ preference over each item, where each element is generated uniformly within [0, 1], so they are homogeneous. For simulating user’s behavior, we introduce an additional latent variable uniformly drawn from [0, 1], such that for each time an event x appears, the click happens when the latent variable is smaller than y. We adopt a bandit feedback setting here, that is, the policy firstly ranks the Top-1 predicted reward item for each user, collects the feedbacks as click or not click; then we do policy learning via ERM from these logged feedbacks. For controlled experiment, we perform the same setting for an uniform policy

After 100 epochs of logging feedbacks and learning policy, we can calculate the empirical outcome distribution from the observed logged data, e.g. from and from . The Fig.F.1 demonstrates the , generated by the non-uniform policy , compared with the true P(Y | X). Significant bias of the estimated is depicted in Fig.F.2b, which is caused by inaccurately modeling on the observed events, with non-uniform policy and assuming that . From Fig.3b we observe that the causes devastating reduction of diversity to the system, all items except for item #0 are underrepresented, because they are relatively less interested by the users in average. The insufficient exploration over items causes large bias of estimated marginal distribution over items, shown in Fig.F.2a, i.e. bias of is near zero, while others’ are large.

F.2. Real-world data

The original data set contains ratings from 1 to 5, we binarize it by setting 3 as threshold. For fair comparison, we tune hyper parameters for all MF-based methods: regularization term , learning rate , embedding size d = 4, and report the results on the best parameters. Speicifically for our CRRM, we simply maintain

Figure F.1. Comparison between the generated preference and empirically estimated preference by

Figure F.2. Analysis of simulation results of the motivating example.

all the time. For the IPS methods, propensity scores are learned by naive Bayes estimator from 5% of MAR test set.