b

DiscoverSearch
About
My stuff
On the Fairness of Randomized Trials for Recommendation with Heterogeneous Demographics and Beyond
2020·arXiv
Abstract
Abstract

Observed events in recommendation are consequence of the decisions made by a policy, thus they are usually selectively labeled, namely the data are Missing Not At Random (MNAR), which often causes large bias to the estimate of true outcomes risk. A general approach to correct MNAR bias is performing small Randomized Controlled Trials (RCTs), where an additional uniform policy is employed to randomly assign items to each user. In this work, we concentrate on the fairness of RCTs under both homogeneous and heterogeneous demographics, especially analyzing the bias for the least favorable group on the latter setting. Considering RCTs’ limitations, we propose a novel Counterfactual Robust Risk Minimization (CRRM) framework, which is totally free of expensive RCTs, and derive its theoretical generalization error bound. At last, empirical experiments are performed on synthetic tasks and real-world data sets, substantiating our method’s superiority both in fairness and generalization.

Conventional machine learning model, though can reach high accuracy in average, often falls short in causing high loss on minority groups at the same time. Fairness issues have been studied in various tasks, like medical treatment, criminal justice (Kearns et al., 2019; Nabi et al., 2018; Mary et al., 2019), credit loan (Liu et al., 2018), machine translation (Hashimoto et al., 2018), recommender system (?Beutel et al., 2019; Singh & Joachims, 2019), etc. The reasons for improving model’s fairness are not limited with respect to ensuring morality or legality , but also involving economical considerations. In (Hashimoto et al., 2018), it proves the amplification of demographic disparity in sequential ERM process. That is, the minority group which suffers

image

Figure 1. Missing ratings in a recommender system.

higher loss will become more likely to quit using our service, thereby providing no feedback data any longer. This negative feedback loop can cause shrinking size of users, hence reducing the income or welfare provided by our machine learning systems over time.

Specifically for recommender systems, there have been line of work showing that, the real-world logging policy often collects missing not at random (MNAR) (Yuan et al., 2019) data (or selective labels). For example, users tend to only reveal ratings for items they like, thus the observed users’ feedback (usually described by click-through-rate, (CTR)) can be substantially better than those not observed yet. In Fig.1, estimated average CTR from the observed outcome is 2/3 ≈ 0.67. However, if all unobserved outcomes are 0, the true CTR is  1/3 ≈ 0.33. This gap between the factual and counterfactual ratings, in MNAR situation, has been empirically validated by (Steck, 2013). In this work, we show that MNAR also undermines fairness of recommendation over time, by exacerbating minority group’s utility.

To get unbiased estimate of true CTR without revealing all feedback, running randomized controlled trials (RCTs) (Haynes et al., 2012) has been so far the first choice, which satisfies missing-at-random (MAR) (Seaman et al., 2013) condition:

Definition 1 (Condition of MAR). The missing data are missing-at-random (MAR) if the conditional probality of observed pattern, represented by O here, satisfies

image

However, the cost of RCTs is unacceptable in many scenarios, e.g. we cannot randomly assign drugs to the volunteers, or randomly display ads for users. Vast majority of existing works (Shalit et al., 2017; Rosenfeld et al., 2017; Bonner & Vasile, 2018) tend to leverage small RCTs and large biased logged data to correct the MNAR bias, through multi-task learning or domain adaptation techniques. They claim that small RCTs, though not capable of revealing each individual’s preference (Rosenfeld et al., 2017), are sufficient for reaching low-variance estimate of the average of all, from which we can obtain model that generalizes well to those counterfactual events. However, the conception about small is not well-defined yet, hence no guideline for proper number of RCTs to trade gain against cost. Besides, although RCTs are intuitively regarded as fair for each user, we argue that learning from the average from RCTs causes unfair policy, on heterogeneous demographics, i.e. users vary over populations defined by preferences.

In this work, we aim to mitigate utility disparity among users, as well as improving global utility of all in recommendation. Our object of interest is a RCTs-free method that can achieve both two goals above. We begin from introducing Gini index (Gini, 1997) as a disparity measure in Section 4. In Section 5, we consider two heterogeneous situations, for analyzing the RCTs’s fairness, accounting for the least favorable group’s risk. Then, we propose our RCTs-free Counterfactual Robust Risk Minimization (CRRM) framework in Section 6, and test its performance in Section 7. The proof of all propositions and theorems can be found in appendices. And codes for experiments are available in the supplementary material.

Contributions. Our main contributions are three-fold:

1. We demonstrate the unfairness of conventionally used ERM-based recommendation under MNAR data, and connect protecting the least favorable group to ensuring fairness regarding to Gini index;

2. We are the first to theoretically analyze the fairness of RCTs-based methods, e.g. IPS, by considering the least favorable group’s risk over multiple demographic conditions;

3. We develop a RCTs-free CRRM framework, and derive a constructive theoretical error bound. Furthermore, our method is empirically proved improving both fairness and accuracy, on both synthetic tasks and real-world data.

Fair machine learning. Fairness in machine learning is an emerging topic in the literature recently. The most direct principle of algorithmic fairness is to protect sensitive tags, features or labels, such as race, gender, age, etc. This formulation of fairness has been investigated in (Hardt et al., 2016; Mary et al., 2019) where the conditional independence, between the decision made by the algorithm and the variables to protect, is estimated or guaranteed. In other respects, the Rawlsian fairness (Rawls, 2009) is introduced to substantiate this equal opportunity principle in a bandit learning scheme (Joseph et al., 2016; Jabbari et al., 2017), where an action is preferred only if its true quality is better. However, as far as we know, few work has ever mentioned fairness issue in MNAR scenario.

For some scenarios when the demographics are unknown, the previous works have been so far concentrating on protecting subgroups with latent demographic structure (Kearns et al., 2017), or learning a distributionally robust model for uniform performance without knowing demographics (Duchi & Namkoong, 2018; Hashimoto et al., 2018). Although these approaches are proved to be able to protect the minority group, no evident improvement of global utility for all users is witnessed in practice.

Correcting MNAR bias. An importance sampling approach called inverse propensity score (IPS) is proposed initially to correct the sampling bias (Rosenbaum & Rubin, 1983), then followed by line of work including doubly robust (DR) (Dud´ık et al., 2011; Jiang & Li, 2015), weighted doubly robust (Thomas & Brunskill, 2016), joint learning doubly robust (Wang et al., 2019), self-normalized IPS (Swami- nathan & Joachims, 2015; Schnabel et al., 2016), etc., for recommendation on MNAR data, by utilizing additional small RCTs. Besides, methods (Johansson et al., 2016; Bonner & Vasile, 2018) adapted from domain adaptation and representation learning techniques are used for learning from both  Dtand  Dc. However, they hardly mention their methods’ fairness, or the effectiveness over heterogeneous demographics.

Denote user  u ∈ U, and item v ∈ V, then an even  x ∈ U×V.The outcome of an event is  y ∈ Y = {0, 1} where

image

There are in total  |U| × |V|possible events considering different combination of users and items, where  | · |here means the cardinality of a set, we set |U| = M and |V| = N. The true risk of predicted outcome matrix ˆY ∈ RM×N canbe easily written out as

image

where  yuvis assumed drawn from a true outcome matrix Y ∈ {0, 1}M×N, and δ(·)is usually set as surrogate mean-squared-error, or logistic loss, for simplicity we denote it as δuv or δx.

When some of outcomes are not observed, calculating Eq.(3) becomes infeasible. Such that, previous works attempt to

image

Figure 2. A demonstration of MNAR problem.

only minimize the factual empirical risk, namely ERM as

image

where random variable  Ouvmeans whether a pair (u, v) is observed. The observation of outcome  yuvis hence constrained by the missing pattern  Ouv, shown in Fig.2. Such that, the ˆRF ( ˆY )is a biased estimate of true risk  R( ˆY ):

image

3.1. Correcting sampling bias by propensity scores

To deal with the bias shown in Eq.(5), an unbiased inverse propensity score (IPS) estimator (Rosenbaum & Rubin, 1983) can be built as

image

where  Zuvis the used propensity score that  EO(Ouv) =Zuv, hence it is easy to derive  EO( ˆRIP S( ˆY )) = R( ˆY ).

The propensity score  Zuvhere, can be approximated via a simplified naive Bayes (NB) estimator (Marlin & Zemel, 2009) ˆZuvas following:

image

with assuming that P(O | Y, X) = P(O | Y ). Though the P(Y | O) and P(O) can be counted by MNAR data, the P(Y = y) has to be estimated via MAR data, which is often achieved by taking average of n RCTs’ outcomes

image

Using x = (u, v), the bias of IPS estimator with inaccurate propensity ˆZxis given by (Schnabel et al., 2016) that

image

For notation simplicity, we omit ˆZof the ˆRIP S( ˆY | ˆZ)from now on. In this work, we delve deeper into this bias, accounting for the bias of different groups of users. We assume that each individual u belongs to a group  Ul, and the group mean preference is represented by

image

In this viewpoint, the ¯Ynis an efficient empirical estimate of global preference for all users. However, we question that, when demographics are heterogeneous, i.e.  µlare different, is the ¯Ynby RCTs still a fair and economical estimate? Since for ensuring fairness, we care about loss suffered by the least favorable group, rather than only the average over all. To deal with this fairness issue, we first focus on how to measure utility disparity among users in next section.

The conventional approach to update a policy for recommender system, often proceeds in repeatedly learning from the MNAR logged data, thus minimizing the empirical risk. In this section, we identify that the policy  φcobtained by this process fails to offer fair services, when taking user’s click-through-ratio (CTR) as their utility. We then propose Gini index for measuring this disparity, and reveal the connection between improving Gini fairness and protecting the least favorable group’s utility.

image

Figure 3. Analysis of simulation results of the motivating example.

4.1. Motivating example

We build a tiny system with 5 users and 5 items for simulation, and compare the user-wise and item-wise numbers of clicks achieved by  φtand  φc, shown in Fig.3. Unfair treatment towards users by  φccan be witnessed in the Fig.3a. Compared with  φt, though the total clicks achieved by  φc isslightly higher (2916 v.s. 2484), the click rates of user #1 and #2 are much lower than their counterparts by  φt. That means, the  φcachieves good performance in average, at the expense of deteriorating experience of the minorities. Besides, the Fig.3b illustrates devastating diversity reduction by  φc, which is also unacceptable in recommendation. Details about the implementation of simulation can be found in Appendix F.1.

4.2. Gini fairness and the minority’s utility

Example above impresses the necessity to measure users’ utility disparity. To this end, we introduce Gini index drawn from Economics to measure fairness in recommendation. The Gini index  GI ∈ [0, 1], with 0 indicating perfect equality and 1 being perfect inequality, is a measure of the distribution of income across percentiles among populations.

image

Figure 4. Computing the Gini index via the Lorenz curve.

To compute GI, the Lorenz curve (Mankiw, 2006) plots population percentile by utility on the x-axis, and cumulative utility on the y-axis. Denote the cumulative l-th group’s utility as  Cuml = �ll′=1 ˆµl′. From Fig.4, we can compute the area under the Lorenz curve  SBby summation of all  Sl

image

therefore the GI can be obtained by

image

We give the following Proposition 1 which identifies the connection between controlling Gini index and protecting the least favorable group.

Proposition 1 (Least favorable group’s utility and Gini index). Assume that  ˆµ1 ≤ ˆµ2 · · · ≤ ˆµL, controlling the worst-case  ˆµ1to protect the minority group can improve fairness, by virtue of controlling the supremum of the GI as

image

if and only if  ˆµ1 = ˆµ2 = · · · = ˆµLthe equality holds.

Based on above analysis, picking the least favorable group’s average utility can be a surrogate metric for fairness, since the GI’s upper bound is restricted by  ˆµ1. By the Lorenz curve of the motivating example in Fig.5, the  φthas much less Gini index than  φc, and the minimum utility earned by φtis larger than  φc as well.

image

Figure 5. The Lorenz curve of cumulative utility  ˆµobtained by policy  φc and φt, e.g. in (a), the point (40%, 0.14) means the tail two users obtain 14% of total utility; and in (b), the (40%, 0.36) means 36% of total utility. The value of Gini index is defined as the area of  SA, therefore the  φtis much more fair than  φc.

Another concern is that the global utility gU, namely the utility enjoyed by all users, is also important for evaluating a policy’s effectiveness:

image

because it decides the total income of the system. Then comes two principles for developing a recommender system:

1. Reduce Gini index GI (Eq. (12)) for fairness

2. Improve global utility gU (Eq. (14)) for effectiveness

We will use the above two metrics for comparison in the experiment section.

Recall that our goal is to control the utility disparity of users, which can be realized by controlling the least favorable group’s utility  ˆµ1, as above mentioned in Eq.(13). We proceed the analysis to fairness of RCTs-based methods in two steps: first, we show how the empirical mean ¯Yninflu-ences the minority’s risk by IPS estimator; then, we analyze this risk regarding to several demographics assumptions, i.e. homogeneous and heterogeneous. As a result, we reveal the limitations of these RCTs-based methods, even if the ¯Yn byRCTs is commonly regarded as trustworthy knowledge for debiasing MNAR data in the literature.

5.1. Risk of the least favorable group

Let  µ1 ≤ µ2 · · · ≤ µL, then the least favorable group is  U1,and the event set related to  U1 is X1 = U1 × V. Bias of IPS estimator on  X1, similar to Eq.(9), can be derived as

image

Proposition 2 (Bias of the least favorable group’s risk by IPS). The bias of IPS risk suffered by the least favorable group  U1, has a lower bound as

image

Roughly speaking, the above inequality demonstrates that the ¯Yn’s property matters. In particular, even under uniform policy, the bias for each group would be dependent on the preference demographics, as shown next.

5.2. Homogeneity: same mean and same variance

Intuitively, when randomly assigning items for users, under MAR condition, the ¯Ynis a fair and efficient estimate for debiasing. However, we argue that this only holds under homogeneous and weak heterogeneous demographics.

Assumption 1 (Homogeneous demographics). Users’ preference are same mean and same variance, namely the Yi ∼ F(µ, σ2) ∀i.

With above assumption, and by law of large numbers, it is easy to get the consistency of ¯Yn, i.e. ¯Yn − µ p→ 0. Recall the inequality in Eq.(16), the right term approaches to zero at this situation because  µ1 = µ, which holds for other groups as well. That is the reason why in our motivating example, uniform  φtreaches good fairness. However, these homogeneity hardly happen in practice. A real-world recommender system often serves hundred million of users, such that it is reasonable to assume that the demographics of users are highly heterogeneous. We discuss it by two cases next, i.e. weak and strong heterogeneity.

5.3. Weak heterogeneity: same mean but different variance

We begin our analysis from a simple case without loss of generality, in which users are from two distributions: F(µ0, σ20)and  F(µ1, σ21), with probability  p ∈ [0, 1]from the first and  1 − pfrom the second one. Formally, this is a common case in Robust Statistics (Maronna et al., 2019) called contaminated distribution, as

image

For simplicity, we assume that both are Gaussian distribution (we will extend it to more general sub-Gaussian family later). Based on Eq.(8) and Eq.(17), it is easy to derive that

image

We first consider the setting that groups’ preferences are same mean but different variance (SMDV), namely  µ0 =µ1 = µ, and  σ0 < σ1. We argue that only if variance is controllable, the RCTs under SMDV demographics lead to asymptotically normal and fair ¯Yn:

Theorem 1 (Asymptotic normality under SMDV). Let s2n = �ni=1(pσ20 + (1 − p)σ21), under SMDV assumption, if

image

then the empirical mean ¯Ynsatisfies asymptotic normality as n

image

Similar as in homogeneous demographics, with weak heterogeneity and mild condition on variance, the ¯Ynalso leads to fair result.

5.4. Strong heterogeneity: different mean and different variance

Although the above analysis proves that RCTs do lead to fair estimator in theory, the fundamental challenge faced is that, no matter for homogeneity or weak heterogeneity, the same mean assumption seems to be too strong. This argument urges us to explore more general setting, that is, when the demographics are different mean and different variance (DMDV), how well the ¯Ynperforms, in terms of its bias and fairness. Accordingly, we continue to L DMDV groups with sub-Gaussian assumption next.

Theorem 2 (Tail bound under strong heterogeneity). Given the  Yi ∼ �Ll=1 plF(µl), where F(µl) ∈ [al, bl], then

image

with probability  1 − η. Specifically when the sub-Gaussian parameters are all same, i.e.  bl − al = 1, ∀l, we have

image

Moreover, For the bias of IPS estimator on the least favorable group, we have

image

Above bound identifies that inaccurate propensities obtained by NB estimator, with ¯Ynfrom RCTs as the outcome prior, contributes to unfair worst-case for different groups. Though the ˆRIP S( ˆY )is unbiased, its part on  X1is worse than other  Xl, since the  µlmatters for the lower bound of Bias( ˆRIP S( ˆY | Xl)).

Moreover, for other multi-task learning or domain adaptation methods, we conjecture that this challenge appears as well, because small RCTs are only competent for average estimation, rather than representative for all groups. This challenge encourages us to develop a machinery, which can correct MNAR bias without RCTs, as well as achieving better fairness for recommendation, described next.

Given a predictor  f : X → R, we let  P ˆY |X∈X1 :=E(f(X) | X ∈ X1), and similar for the  P ˆY |X /∈X1. For ensuring fairness on known demographics, the disparity metrics are used by (?) to measure the outcome disparity conditioned by predefined group:

image

However, demographics of user’s preference in recommendation are usually unknown, thus defining groups and optimizing the Eq.(24) directly is difficult.

To deal with unknown demographics, distributionally robust optimization (DRO) is employed for controlling the least favorable group’s risk by (Hashimoto et al., 2018), while some concerns are (1) the importance sampling scheme introduces additional variance in optimization; (2) the result is sensitive to the distance metric for defining the uncertainty set and (3) the DRO often degenerates to ERM (Hu et al., 2018) in many scenarios. In this section, we propose a more easy-to-use and effective counterfactual robust risk minimization framework, aiming at reducing utility disparity without knowing demographics. We first present the defini-tion of objective function and algorithm of the framework, then give its theoretical generalization error bound.

6.1. Counterfactual robust risk minimization

We define the conception of individual fairness, that is, for similar users, the predictor is ought to give similar outputs. Hence we introduce the so-called robust risk that

image

where the predictor f is parametrized by  Θ1, and the adver-sarial input  x′ is restricted in a ball, with radius  ξaround the original input x. As aforementioned, high utility disparity is often caused by ERM, because the predictor tends to overestimate the factuals, and generalize badly on counterfactuals.

Algorithm 1 Mini-batch SGD learning for CRRM

image

Intuitively, one would expect the parity between event x and any similar  x′, no matter it is a factual or counterfactual, could be controlled by optimizing ˜δx. Similar regime named virtual adversarial training proposed by (Miyato et al., 2018), is originally for improving local distributional smoothness for semi-supervised learning from unlabeled samples. In this work, we involve counterfactuals in optimization by the defined robust risk, and find it is also a beneficial constraint for improving fairness.

The objective function for our CRRM framework is defined as

image

For observed events, we try to minimize both the robust risk ˜δxand empirical risk  δx; for the unobserved, as  δxis intractable, only ˜δxis involved. The empirical risk can be defined as any surrogate loss for supervised learning, e.g. mean-squared-error. For the robust risk, it is tractable by a power iteration (Miyato et al., 2018) approach.

The bias of ˆRCRR( ˆY ), like previously done for the ˆRIP S( ˆY ), can be easily derived by

image

Unlike the ERM and IPS estimator, the ˜δx/δxidentifies that it may be beneficial to ”penalize” small empirical risk individuals with overestimating robust risk, and vice versa, for uniform performance among users.

Algorithm 1 details our CRRM framework, where trainable embeddings for users and items are  ΘU ∈ RM×dand ΘV ∈ RN×d. The input hyper parameters include: regularization term  λ, conterfactual risk weight  α, robust radius  ξ,perturbation radius  ϵ, learning rate  ω, power iteration step J and total learning epoch T, though we find few of them need being finely tuned in practice. Their setups are discussed in experiments section.

6.2. Bounding the risk over counterfactuals

Definition 2 (CRRM for recommendation). Given a hypothesis space H of predictions as ˆY, and CRRM’s objective function ˆR( ˆY ) from Eq.(26), the algorithm selects ˆY ∈ Hthat optimizes:

image

To illustrate the validity and principle of our CRRM approach, we state the following bound, where only finite hypothesis set H is considered for the sake of conciseness. Theorem 3 (Generalization error bound of CRRM). For any finite hypothesis space of predictions H = { ˆY1, . . . , ˆY|H|}, and empirical loss  δxis bounded  0 ≤ δx ≤∆, ∀x, the true risk  R( ˆY )of our CRR minimizer ˆYfrom H, is bounded with probability  1 − η by:

image

where  βx := δx/˜δxand we assume  βx > 0 ∀x.

This generalization error bound is constructive, as it motivates a trade-off between fairness and effectiveness which does not occur in ERM. In particular, overestimating ˜δxfor small  δxindividuals could facilitate fairness as well as the first bias term above. However, larger variance would appear in the second term at the same time. Fortunately, in most scenarios, our propensity-agnostic CRRM method empirically benefits significantly, both in fairness and generalization, from the additional applied fairness constraint. We will demonstrate it in next section.

In this section, we perform experiments to explore the performance of proposed CRRM, in terms of both global

image

Figure 6. Estimate of ¯Ynunder two different setting of demographics, left: homogeneous, right: heterogeneous. Green dot lines are the maximum and minimum of  µlamong all groups, and red one is the average of all.

Table 1. mU and gU vary under data sparsity.

image

utility and Gini fairness. Furthermore, additional experiments on real-world data sets (Yahoo2 and Coat) are conducted to compare CRRM with other methods, according to the commonly used measurements, e.g. MSE, AUC. We here consider logistic regression (LR), matrix factorization (MF-naive), inverse propensity score based MF (MF-IPS) (Schnabel et al., 2016), self-normalized IPS (MF-SNIPS) (Swaminathan & Joachims, 2015) and MF incorporated in our CRRM framework (MF-CRRM).

7.1. Synthetic task

The experimental setup is similar as the motivating example done in Section 4.1, but we supplement the heterogeneous demographics setting, i.e. each user’s average preference is different, and extend the scale to M = 10, N = 1000. For experiments on the non-uniform  φc, each time  φcdecides the top 1 item for each user, and collects the users’ feedback. Then it learns from the logged data, and continues to give next round of top 1 recommendation.

How does demographic heterogeneity influence ¯Yn’sproperty by RCTs and ERM-based policy? We want to compare the performance of ¯Ynwith  φtand  φc, with varying heterogeneity, in the synthetic task. The results are shown in Fig.6, where the x-axis is the observed events ratio. With homogeneous demographics in Fig.6a, the ¯Yn obtainedby ERM policy is much higher than the global average, and φtconverges to global average with high confidence. However, with strongly heterogeneous demographics defined in section 5.4, the Fig.6b displays high variance of the  φt.

Table 2. Test MSE and AUC on real-world data sets.

image

Unlike in homogeneous setting, the blue band converges much slower with sampling ratio increasing. Besides, the orange line (φc) also has deteriorating performance compared with before. The result demonstrates that heterogeneity of demographics can cause failure of RCTs-based IPS because of high variance incurred by ¯Yn, and also do harm to ERM-based  φcin practice.

Does CRRM improve fairness and generalization? Results on strong demographic heterogeneity are reported in Table 1, where mU is the minimum of user’s enjoyed utility. We test our CRRM method against LR, naive MF and IPS-based MF, and significant superiority of our method is witnessed. The MF-CRRM substantially perform better on both fairness and effectiveness in general, as it gains the best mU and gU almost under each sparsity setting. By contrast, MF-IPS with propensity scores learned from MAR data with 10% of number of MANR data, does not win over naive MF, due to high variance caused by ¯Yn underheterogeneous demographics.

7.2. Real-world data sets

Further experiments conducted on two real-world data sets, which both collected a small test set where users are provided randomly displayed items for labeling. Following previous works’ setting, we train all models on MNAR training set  Dc, and test them on MAR test set. More details about experimental setup are in appendix F.2.

Yahoo ! R3 Data Set. This is a user-song rating data set (Marlin & Zemel, 2009), where MNAR training set has over 300K ratings for songs that were selected by users. The test set contains ratings by 5400 users who were asked to rate 10 randomly selected songs.

Coat Shopping Data Set. This data set was collected and used in (Schnabel et al., 2016), which contains 290 users and 300 items. Each user self-selected 24 items, and additional 16 items were randomly drawn for them to rate.

Results. Results are reported by Table 2 and Table 3, where our CRRM substantially wins over other baselines on both data sets, either in effectiveness metric gU, MSE, AUC or in fairness metric GI. Unlike other IPS-based methods, CRRM has much lower cost since it does not need additional RCTs. It contributes to less variance via controlling both robust risk and factual risk, thus facilitating generalization without RCTs, as presented in Theorem 3. It should be noted that on YAHOO data set, for computation efficiency, our CRRM only sampling  0.1% from DF ∪DCFfor optimizing on the ˆR0in each epoch, which sheds light on our CRRM’s promising prospects in real-world recommender system.

In this work, we focus on mitigating MNAR bias and unfair policy, which are ubiquitous in recommendation. Previous works tend to leverage empirical outcome mean ¯Ynfrom small RCTs for correcting MNAR bias, e.g. IPS family. However, our study on the bias of these RCTs-based IPS estimator, with introducing more general different mean demographics, identifies the risk estimator with ¯Yn sacrificesthe least favorable group for ensuring average performance.

We therefore propose an easy-to-use and RCTs-free counterfactual robust risk minimization framework, in order to circumvent the cost of RCTs, and mitigate the unfairness in previous works. The key insight of our CRRM is to balance between robust and factual risk, thus both reducing bias and facilitating fairness, explained by the generalization bound in our Theorem 3. Moreover, by random sampling when alternative events set is large, our algorithm only costs a little additional computation, thus being scalable to large recommender systems.

We conjecture that the cutting edge sampling approaches (Wang et al., 2020), rather than randomly sampling from all events, can be plugged in our framework for further improvement. Besides, other existing models for recommendation, e.g. deep factorization machine (Guo et al., 2017), graph neural networks (Fan et al., 2019), can be retrofit with our CRRM for even better results.

This research was funded by the Natural Science Foundation of China 61807021, Shenzhen Science and Technology Research and Development Funds (JCYJ20170818-

094022586), and Innovation and entrepreneurship project for overseas high-level talents of Shenzhen (KQJSCX20180327144037831). The authors would like to thank advice from Professor Yuhan Dong and Xiangxiang

Xu for the writing and theory of this work.

Beutel, A., Chen, J., Doshi, T., Qian, H., Wei, L., Wu, Y., Heldt, L., Zhao, Z., Hong, L., Chi, E. H., et al. Fairness in recommendation ranking through pairwise comparisons. arXiv preprint arXiv:1903.00780, 2019.

Bonner, S. and Vasile, F. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 104–112. ACM, 2018.

Duchi, J. and Namkoong, H. Learning models with uniform perfor- mance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Dud´ık, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

Fan, S., Zhu, J., Han, X., Shi, C., Hu, L., Ma, B., and Li, Y. Metapath-guided heterogeneous graph neural network for intent recommendation. 2019.

Feller, W. An introduction to probability theory and its applications, volume 2. John Wiley & Sons, 2008.

Gini, C. Concentration and dependency ratios. Rivista di politica economica, 87:769–792, 1997.

Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.

Hardt, M., Price, E., Srebro, N., et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323, 2016.

Hashimoto, T. B., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. arXiv preprint arXiv:1806.08010, 2018.

Haynes, L., Goldacre, B., Torgerson, D., et al. Test, learn, adapt: Developing public policy with randomised controlled trials— cabinet office. 2012.

Hoeffding, W. Probability inequalities for sums of bounded ran- dom variables. In The Collected Works of Wassily Hoeffding, pp. 409–426. Springer, 1994.

Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 2029–2037. PMLR, 2018.

Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1617–1626. JMLR. org, 2017.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.

Johansson, F., Shalit, U., and Sontag, D. Learning representations for counterfactual inference. In International conference on machine learning, pp. 3020–3029, 2016.

Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Rawlsian fairness for machine learning. arXiv preprint arXiv:1610.09559, 1(2), 2016.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2017.

Kearns, M., Roth, A., and Sharifi-Malvajerdi, S. Average individ- ual fairness: Algorithms, generalization and experiments. arXiv preprint arXiv:1905.10607, 2019.

Koren, Y., Bell, R., and Volinsky, C. Matrix factorization tech- niques for recommender systems. Computer, (8):30–37, 2009.

Lawrence, C., Sokolov, A., and Riezler, S. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. arXiv preprint arXiv:1707.09118, 2017.

Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.

Ma, W. and Chen, G. H. Missing not at random in matrix completion: The effectiveness of estimating missingness probabilities under a low nuclear norm assumption. In Advances in Neural Information Processing Systems, pp. 14871–14880, 2019.

Mankiw, N. G. Principles of macroeconomics. Cengage Learning, 2006.

Marlin, B. M. and Zemel, R. S. Collaborative prediction and rank- ing with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, pp. 5–12. ACM, 2009.

Maronna, R. A., Martin, R. D., Yohai, V. J., and Salibi´an-Barrera, M. Robust statistics: theory and methods (with R). John Wiley & Sons, 2019.

Mary, J., Calauzenes, C., and El Karoui, N. Fairness-aware learn- ing for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382–4391, 2019.

Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual ad- versarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.

Nabi, R., Malinsky, D., and Shpitser, I. Learning optimal fair policies. arXiv preprint arXiv:1809.02244, 2018.

Rawls, J. A theory of justice. Harvard university press, 2009.

Richardson, M., Dominowska, E., and Ragno, R. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pp. 521–530. ACM, 2007.

Rosenbaum, P. R. and Rubin, D. B. The central role of the propen- sity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Rosenfeld, N., Mansour, Y., and Yom-Tov, E. Predicting counter- factuals from large historical data and small randomized trials. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 602–609. International World Wide Web Conferences Steering Committee, 2017.

Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., and Joachims, T. Recommendations as treatments: Debiasing learning and evaluation. arXiv preprint arXiv:1602.05352, 2016.

Seaman, S., Galati, J., Jackson, D., and Carlin, J. What is meant by” missing at random”? Statistical Science, pp. 257–268, 2013.

Shalit, U., Johansson, F. D., and Sontag, D. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3076–3085. JMLR. org, 2017.

Singh, A. and Joachims, T. Policy learning for fairness in ranking. arXiv preprint arXiv:1902.04056, 2019.

Steck, H. Evaluation of recommendations: rating-prediction and ranking. In Proceedings of the 7th ACM conference on Recommender systems, pp. 213–220. ACM, 2013.

Swaminathan, A. and Joachims, T. The self-normalized estimator for counterfactual learning. In advances in neural information processing systems, pp. 3231–3239, 2015.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy eval- uation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Wang, X., Zhang, R., Sun, Y., and Qi, J. Doubly robust joint learning for recommendation on data missing not at random. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 6638–6647. PMLR, 2019.

Wang, Z., Zhu, H., Dong, Z., He, X., and Huang, S.-L. Less is better: Unweighted data subsampling via influence function. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.

Yuan, B., Hsia, J.-Y., Yang, M.-Y., Zhu, H., Chang, C.-Y., Dong, Z., and Lin, C.-J. Improving ad click prediction by considering nondisplayed events. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 329–338. ACM, 2019.

Zabell, S. L. Alan turing and the central limit theorem. The American Mathematical Monthly, 102(6):483–494, 1995.

image

Proposition 1 (Least favorable group’s utility and Gini index). Assume that  ˆµ1 ≤ ˆµ2 · · · ≤ ˆµL, controlling the worst-case ˆµ1to protect the minority group can improve fairness, by virtue of controlling the supremum of the GI as

image

if and only if  ˆµ1 = ˆµ2 = · · · = ˆµLthe equality holds.

Proof.

image

where the last inequality holds because

image

Proposition 2 (Bias of the least favorable group’s risk). The bias of IPS risk suffered by the least favorable group  U1, has alower bound as

image

Proof. It is easy to derive that

image

Such that, we concentrate on  Ox/ ˆZxin the Eq.(15):

The second inequality Eq.(B.5) comes from the fact that the least favorable group  U1’s true average preference must be less than the empirical average. Combining the result above and Eq.(15) yields the result.

Definition C.1 (Triangular Array (TA)). A triangular array of random variables is like

image

For each row n, the random variables  Xn1 . . . Xnnare assumed independent, with  E(Xnk) = µnk, and  Var(Xnk) < ∞. We denote this triangular array as  TA(Xn,k)henceforth.

Definition C.2 (Lindeberg’s condition (Zabell, 1995)). For random variables  Y1, Y2, . . . Yn, we have s2n = �ni=1 σ2i wherethe variance  σ2i of Yi is finite ∀i. We define a triangular array as  TA( 1sn (Yi − µ)), which satisfies Lindeberg’s condition if for

image

Lemma C.1 (Lindeberg’s condition under SMDV). Given  Yi ∼ pF(µ, σ20) + (1 − p)F(µ, σ21), and  σ1 > σ0 ≥ 1, if the

variance is controllable, namely

image

then the  TA( 1sn (Yi − µ))satisfies Lindeberg’s condition.

Proof. Assume without loss of generality that  µ = 0, and let Ya random variable distributed like every  Yi, then for ϵ > 0,

image

in which the second inequality holds because it is easy to derive that  s2n ≥ n. When n → ∞, the last two terms converge to zero, because of the condition set in Eq.(C.4).

Theorem 1 (Asymptotic normality under SMDV). Let  s2n = �ni=1(pσ20 + (1 − p)σ21), under SMDV assumption, if

image

then the empirical mean ¯Ynsatisfies asymptotic normality as

Proof. It is easy to derive that the summation over the  n-th row of TA( 1sn (Yi − µ)) is nsn ( ¯Yn − µ). Based on the condition set by Eq.(C.9), applying Lemma C.1 yields the result that  TA( 1sn (Yi − µ))satisfies Lindeberg’s condition, namely

image

From the Lindeberg-Fellar Central Limit Theorem (CLT) (Feller, 2008), the summation over the each row of  TA( 1sn (Yi−µ))converges to a zero-mean unit variance Gaussian distribution, namely

image

First, we introduce Hoeffding’s lemma on bounded random variables (Hoeffding, 1994) without proof.

Lemma D.1 (Hoeffding’s lemma (Hoeffding, 1994)). if  a ≤ X ≤ bwith probability 1, and E(X) = 0, then X is sub-Gaussian with parameter (b−a)24 .

Theorem 2 (Tail bound under strong heterogeneity). Given the  Yi ∼ �Ll=1 plF(µl), where F(µl) ∈ [al, bl], then

image

with probability  1 − η. Specifically when the sub-Gaussian parameters are all same, i.e.  bl − al = 1, ∀l, we have

image

Moreover, For the bias of IPS estimator on the least favorable group, we have

image

Proof. From the definition of  Yi and ¯Yn, it is easy to derive

image

And the interval of  Yi is

image

Applying Hoeffding’s inequality for  Yn yields

image

therefore

image

image

Proof.

because  Ox ∈ [0, 1]. Taking the  δxout of the brace leads to the final result.

Theorem 3 (Generalization error bound of CRRM). For any finite hypothesis space of predictions  H = { ˆY1, . . . , ˆY|H|}, and empirical loss  δxis bounded  δx ≤ ∆, ∀x, the true risk  R( ˆY )of our CRR minimizer ˆYfrom H, is bounded with probability  1 − η by:

image

where  βx := δx/˜δxand we assume  βx > 0 ∀x.

Proof. For notation simplicity, we omit the subscript CRR next.

image

which follows from Lemma E.1. Considering Hoeffding inequality and union bounds, we are now going to bound the following

image

solving for  ϵyields the bound that

As we know that  δx ≤ ∆, so it is easy to derive the final bound as

image

Substitute  EO ˆR( ˆY ) in Eq.(E.10) yields the final result.

F.1. Simulation Tasks

For motivating example, we build a  5 × 5matrix as the users’ preference over each item, where each element is generated uniformly within [0, 1], so they are homogeneous. For simulating user’s behavior, we introduce an additional latent variable uniformly drawn from [0, 1], such that for each time an event x appears, the click happens when the latent variable is smaller than y. We adopt a bandit feedback setting here, that is, the policy  φcfirstly ranks the Top-1 predicted reward item for each user, collects the feedbacks as click or not click; then we do policy learning via ERM from these logged feedbacks. For controlled experiment, we perform the same setting for an uniform policy  φt as well.

After 100 epochs of logging feedbacks and learning policy, we can calculate the empirical outcome distribution from the observed logged data, e.g. ˆPc1(Y | X)from  Dcand  ˆPt1(Y | X)from  Dt. The Fig.F.1 demonstrates the  ˆPc1(Y | X), generated by the non-uniform policy  φc, compared with the true P(Y | X). Significant bias of the estimated ˆPc1(Y | X)is depicted in Fig.F.2b, which is caused by inaccurately modeling on the observed events, with non-uniform policy and assuming that ˆPc1(Y | X) = P(Y | X). From Fig.3b we observe that the  φc causes devastating reduction of diversity to the system, all items except for item #0 are underrepresented, because they are relatively less interested by the users in average. The insufficient exploration over items causes large bias of estimated marginal distribution over items, shown in Fig.F.2a, i.e. bias of ˆPc1(Y | V = 0)is near zero, while others’ are large.

F.2. Real-world data

The original data set contains ratings from 1 to 5, we binarize it by setting 3 as threshold. For fair comparison, we tune hyper parameters for all MF-based methods: regularization term  λ ∈ {10−3, 0.01, 0.1, 1}, learning rate  ω ∈ {0.01, 0.05, 0.1}, embedding size d = 4, and report the results on the best parameters. Speicifically for our CRRM, we simply maintain

image

Figure F.1. Comparison between the generated preference and empirically estimated preference by  φc.

image

Figure F.2. Analysis of simulation results of the motivating example.

ϵ = 0.1, ξ = 1, α = 1, and J = 1all the time. For the IPS methods, propensity scores are learned by naive Bayes estimator from 5% of MAR test set.


Designed for Accessibility and to further Open Science