Regularized Learning for Domain Adaptation under Label Shifts

2019·Arxiv

ABSTRACT

ABSTRACT

We propose Regularized Learning under Label shifts (RLLS), a principled and a practical domain-adaptation algorithm to correct for shifts in the label distribution between a source and a target domain. We first estimate importance weights using labeled source data and unlabeled target data, and then train a classifier on the weighted source samples. We derive a generalization bound for the classifier on the target domain which is independent of the (ambient) data dimensions, and instead only depends on the complexity of the function class. To the best of our knowledge, this is the first generalization bound for the label-shift problem where the labels in the target domain are not available. Based on this bound, we propose a regularized estimator for the small-sample regime which accounts for the uncertainty in the estimated weights. Experiments on the CIFAR-10 and MNIST datasets show that RLLS improves classification accuracy, especially in the low sample and large-shift regimes, compared to previous methods.

1 INTRODUCTION

When machine learning models are employed “in the wild”, the distribution of the data of interest(target distribution) can be significantly shifted compared to the distribution of the data on which the model was trained (source distribution). In many cases, the publicly available large-scale datasets with which the models are trained do not represent and reflect the statistics of a particular dataset of interest. This is for example relevant in managed services on cloud providers used by clients in different domains and regions, or medical diagnostic tools trained on data collected in a small number of hospitals and deployed on previously unobserved populations and time frames.

There are various ways to approach distribution shifts between a source data distribution P and a target data distribution Q. If we denote input variables as x and output variables as y, we consider the two following settings: (i) Covariate shift, which assumes that the conditional output distribution is invariant: p(y|x) = q(y|x) between source and target distributions, but the source distribution p(x) changes. (ii) Label shift, where the conditional input distribution is invariant: p(x|y) = q(x|y) and p(y) changes from source to target. In the following, we assume that both input and output variables are observed in the source distribution whereas only input variables are available from the target distribution.

While covariate shift has been the focus of the literature on distribution shifts to date, label-shift scenarios appear in a variety of practical machine learning problems and warrant a separate discussion as well. In one setting, suppliers of machine-learning models such as cloud providers have large resources of diverse data sets (source set) to train the models, while during deployment, they have no control over the proportion of label categories.

In another setting of e.g. medical diagnostics, the disease distribution changes over locations and time. Consider the task of diagnosing a disease in a country with bad infrastructure and little data, based on reported symptoms. Can we use data from a different location with data abundance to diagnose the disease in the new target location in an efficient way? How many labeled source and unlabeled target data samples do we need to obtain good performance on the target data?

Apart from being relevant in practice, label shift is a computationally more tractable scenario than covariate shift which can be mitigated. The reason is that the outputs y typically have a much lower dimension than the inputs x. Labels are usually either categorical variables with a finite number of categories or have simple well-defined structures. Despite being an intuitively natural scenario in many real-world application, even this simplified model has only been scarcely studied in the literature. Zhang et al. (2013) proposed a kernel mean matching method for label shift which is not computationally feasible for large-scale data. The approach in Lipton et al. (2018) is based on importance weights that are estimated using the confusion matrix (also used in the procedures of Saerens et al. (2002); McLachlan (2004)) and demonstrate promising performance on large-scale data. Using a black-box classifier which can be biased, uncalibrated and inaccurate, they first estimate importance weights q(y)/p(y) for the source samples and train a classifier on the weighted data. In the following we refer to the procedure as black box shift learning (BBSL) which the authors proved to be effective for large enough sample sizes.

However, there are three relevant questions which remain unanswered by their work: How to estimate the importance weights in low sample setting, What are the generalization guarantees for the final predictor which uses the weighted samples? How do we deal with the uncertainty of the weight estimation when only few samples are available? This paper aims to fill the gap in terms of both theoretical understanding and practical methods for the label shift setting and thereby move a step closer towards having a more complete understanding on the general topic of supervised learning for distributionally shifted data. In particular, our goal is to find an efficient method which is applicable to large-scale data and to establish generalization guarantees.

Our contribution in this work is trifold. Firstly, we propose an efficient weight estimator for which we can obtain good statistical guarantees without a requirement on the problem-dependentminimum sample complexity as necessary for BBSL. In the BBSL case, the estimation error can become arbitrarily large for small sample sizes. Secondly, we propose a novel regularization method to compensate for the high estimation error of the importance weights in low target sample settings. It explicitly controls the influence of our weight estimates when the target sample size is low (in the following referred to as the low sample regime). Finally, we derive a dimension-independent generalization bound for the final Regularized Learning under Label Shift (RLLS) classifier based on our weight estimator. In particular, our method improves the weight estimation error and excess risk of the classifier on reweighted samples by a factor of k log(k), where k is the number of classes, i.e. the cardinality of Y.

In order to demonstrate the benefit of the proposed method for practical situations, we empirically study the performance of RLLS and show weight estimation as well as prediction accuracy comparison for a variety of shifts, sample sizes and regularization parameters on the CIFAR-10 and MNIST datasets. For large target sample sizes and large shifts, when applying the regularized weights fully, we achieve an order of magnitude smaller weight estimation error than baseline methods and enjoy at most 20% higher accuracy and F-1 score in corresponding predictive tasks. For low target sample sizes, applying regularized weights partially also yields an accuracy improvement of at least 10% over fully weighted and unweighted methods.

2 REGULARIZED LEARNING OF LABEL SHIFTS (RLLS)

Formally let us the short hand for the marginal probability mass functions of Y on finite Y with respect to P, Q as with p(i) = P(Y = i), and q(i) = Q(Y = i) for all , representable by vectors in which sum to one. In the label shift setting, we define the importance weight vector between these two domains as . We quantify the shift using the exponent of the infinite and second order Renyi divergence as follows

Given a hypothesis class H and a loss function , our aim is to find the hypothesis which minimizes

In the usual finite sample setting however, L unknown and we observe samples from P instead. If we are given the vector of importance weights w we could then minimize the empirical loss with importance weighted samples defined as

where n is the number of available observations drawn from P used to learn the classifier h. As w is unknown in practice, we have to find the minimizer of the empirical loss with estimated importance weights

where are estimates of w. Given a set of samples from the source distribution P, we first divide it into two sets where we use samples in set to compute the estimate and the remaining in the set to find the classifier which minimizes the loss (1), i.e. . In the following, we describe how to estimate the weights and provide guarantees for the resulting estimator .

Plug-in weight estimation The following simple correlation between the label distributions p, q was noted in Lipton et al. (2018): for a fixed hypothesis h, if for all it holds that , we have

for all . This can equivalently be written in matrix vector notation as

where is the confusion matrix with and is the vector which represents the probability mass function of h(X) under distribution Q. The requirement is a reasonable condition since without any prior knowledge, there is no way to properly reason about a class in the target domain that is not represented in the source domain.

In reality, both and can only be estimated by the corresponding finite sample averages . Lipton et al. (2018) simply compute the inverse of the estimated confusion matrix in order to estimate the importance weight, i.e. . While is a statistically efficient estimator, with estimated can be arbitrarily bad since can be arbitrary close to a singular matrix especially for small sample sizes and small minimum singular value of the confusion matrix. Intuitively, when there are very few samples, the weight estimation will have high variance in which case it might be better to avoid importance weighting altogether. Furthermore, even when the sample complexity in Lipton et al. (2018), unknown in practice, is met, the resulting error of this estimator is linear in k which is problematic for large k.

We therefore aim to address these shortcomings by proposing the following two-step procedure to compute importance weights. In the case of no shift we have w = 1 so that we define the amount of weight shift as 1. Given a “decent” black box estimator which we denote by , we make the final classifier less sensitive to the estimation performance of C (i.e. regularize the weight estimate) by

1. calculating the measurement error adjusted (described in Section 2.1 for ) and

2. computing the regularized weight where depends on the sample size .

By "decent" we refer to a classifier which yields a full rank confusion matrix . A trivial example for a non-”decent” classifier is one that always outputs a fixed class. As it does not capture any characteristics of the data, there is no hope to gain any statistical information without any prior information.

Both the confusion matrix and the label distribution on the target for the black box hypothesis are unknown and we are instead only given access to finite sample estimates . In what follows all empirical and population confusion matrices, as well as label distributions, are defined with respect to the hypothesis . For notation simplicity, we thus drop the subscript in what follows. The reparameterized linear model (2) with respect to then reads

with corresponding finite sample quantity 1. When is near singular, the estimation of becomes unstable. Furthermore, large values in the true shift result in large variances. We address this problem by adding a regularizing penalty term to the usual loss and thus push the amount of shift towards 0, a method that has been proposed in (Pires & Szepesvári, 2012). In particular, we compute

Here, is a parameter which will eventually be high probability upper bounds for . Let also denote the high probability upper bounds for .

Lemma 1 For as defined in equation (3), we have with probability at least that1

The proof of this lemma can be found in Appendix B.1. A couple of remarks are in order at this point. First of all, notice that the weight estimation procedure (3) does not require a minimum sample complexity which is in the order of to obtain the guarantees for BBSL. This is due to the fact that errors in the covariates are accounted for. In order to directly see the improvements in the upper bound of Lemma 1 compared to Theorem 3 in Lipton et al. (2018), first observe that in order to obtain their upper bound with a probability of at least , it is necessary that . As a consequence, the upper bound in Theorem 3 of Lipton et al. (2018) is bigger than

by a factor of k.

Furthermore, as in Lipton et al. (2018), this result holds for any black box estimator which enters the bound via . We can directly see how a good choice of helps to decrease the upper bound in Lemma 1. In particular, if is an ideal estimator, and the source set is balanced, C is the unit matrix with . In contrast, when the model is uncertain, the singular value is close to zero.

Moreover, for least square problems with Gaussian measurement errors in both input and target variables, it is standard to use regularized total least squares approaches which requires a singular value decomposition. Finally, our choice for the alternative estimator in Eq. 3 with norm instead of norm squared regularization is motivated by the cases with large shifts , where using the squared norm may shrink the estimate too much and away from the true .

When a few samples from the target set are available or the label shift is mild, the estimated weights might be too uncertain to be applied. We therefore propose a regularized estimator defined as follows

Note that implicitly depends on , and . By rewriting , we see that intuitively closer to 1 the more reason there is to believe that 1 is in fact the true weight.

Define the set and its Rademacher complexity measure

with as the Rademacher random variables (see e.g. Bartlett & Mendelson (2002)). We can now state a generalization bound for the classifier in a general hypothesis class H, which is trained on source data with the estimated weights defined in equation (4).

Theorem 1 (Generalization bound for ) Given samples from the source data set and samples from the target set, a hypothesis class H and loss function , the following generalization bound holds with probability at least

where

The proof can be found in Appendix B.4. Additionally, we derive the analysis also for finite hypothesis classes in Appendix B.6 to provide more insight into the proof of general hypothesis classes. The size of is determined by the structure of the function class H and the loss . For example for the 0/1 loss, the VC dimension of H can be deployed to upper bound the Rademacher complexity.

The bound (5) in Theorem 1 holds for all choices of . In order to exploit the possibility of choosing and to have an improved accuracy depending on the sample sizes, we first let the user define a set of shifts against which we want to be robust against, i.e. all shifts with . For these shifts, we obtain the following upper bound

The bound in equation (6) suggests using Algorithm 1 as our ultimate label shift correction procedure. where for step 2 of the algorithm, we choose whenever (hereby neglecting the log factors and thus dependencies on k) and 0 else. When using this rule, we obtain

which is smaller than the unregu- larized bound for small . Notice that in practice, we do not know in advance so that in Algorithm 1 we need to use an estimate of , which could e.g. be the minimum eigenvalue of the empirical confusion matrix with an additional computational complexity of at most .

Figure 1: Given a and switches from 0 to 1 at a particular and k are fixed.

Figure 1 shows how the oracle thresholds vary with and when is kept fix. When the parameters are above the curves for fixed should be chosen as 1 otherwise the samples should be unweighted, i.e. . This figure illustrates that when the confusion matrix has small singular values, the estimated weights should only be trusted for rather high and high believed shifts . Although the overall statistical rate of the excess risk of the classifier does not change as a function of the sample sizes, could be significantly smaller than when is very small and thus the accuracy in this regime could improve. Indeed we observe this to be the case empirically in Section 3.3.

In the case of slight deviation from the label shift setting, we expect the Alg. 1 to perform reasonably. For as the deviation form label shift constraint, i.e., zero under label shift assumption, we have;

Theorem 2 (Drift in Label shift assumption) In the presence of deviation from label shift assumption, the true importance weights , the RLLS generalizes as;

with high probability. Proof in Appendix B.7.

3 EXPERIMENTS

In this section we illustrate the theoretical analysis by running RLLS on a variety of artificially generated shifts on the MNIST (LeCun & Cortes, 2010) and CIFAR10 (Krizhevsky & Hinton, 2009) datasets. We first randomly separate the entire dataset into two sets (source and target pool) of the same size. Then we sample, unless specified otherwise, the same number of data points from each pool to form the source and target set respectively. We chose to have equal sample sizes to allow for fair comparisons across shifts.

There are various kinds of shifts which we consider in our experiments. In general we assume one of the source or target datasets to have uniform distribution over the labels. Within the non-uniform set, we consider three types of sampling strategies in the main text: the Tweak-One shift refers to the case where we set a class to have probability p > 0.1, while the distribution over the rest of the classes is uniform. The Minority-Class Shift is a more general version of Tweak-One shift, where a fixed number of classes m to have probability p < 0.1, while the distribution over the rest of the classes is uniform. For the Dirichlet shift, we draw a probability vector p from the Dirichlet distribution with concentration parameter set to for all classes, before including sample points which correspond to the multinomial label variable according to p. Results for the tweak-one shift strategy as in Lipton et al. (2018) can be found in Section A.0.1.

After artificially shifting the label distribution in one of the source and target sets, we then follow algorithm 1, where we choose the black box predictor to be a two-layer fully connected neural network trained on (shifted) source dataset. Note that any black box predictor could be employed here, though the higher the accuracy, the more likely weight estimation will be precise. Therefore, we use different shifted source data to get (corrupted) black box predictor across experiments. If not noted, is trained using uniform data.

In order to compute in Eq. (3), we call a built-in solver to directly solve the low dimensional problem where we empirically observer that 0.01 times of the true yields in a better estimator on various levels of label shift pre-computed beforehand. It is worth noting that 0.001 makes the theoretical bound in Lemma. 1 O(1/0.01) times bigger. We thus treat it as a hyperparameter that can be chosen using standard cross validation methods. Finally, we train a classifier on the source samples weighted by , where we use a two-layer fully connected neural network for MNIST and a ResNet-18 (He et al., 2016) for CIFAR10.

We sample 20 datasets with the label distributions for each shift parameter. to evaluate the empirical mean square estimation error (MSE) and variance of the estimated weights and the predictive accuracy on the target set. We use these measures to compare our procedure with the black box shift learning method (BBSL) in Lipton et al. (2018). Notice that although KMM methods (Zhang et al., 2013) would be another standard baseline to compare with, it is not scalable to large sample size regimes for above n = 8000 as mentioned by Lipton et al. (2018).

In this set of experiments on the CIFAR10 dataset, we illustrate our weight estimation and prediction performance for Tweak-One source shifts and compare it with BBSL. For this set of experiments, we set the number of data points in both source and target set to 10000 and sample from the two pools without replacement.

Figure 2 illustrates the weight estimation alongside final classification performance for MinorityClass source shift of CIFAR10. We created shifts with . We use a fixed black-box classifier that is trained on biased source data, with tweak-one . Observe that the MSE in weight estimation is relatively large and RLLS outperforms BBSL as the number of minority classes increases. As the shift increases the performance for all methods deteriorates. Furthermore, Figure 2 (b) illustrates how the advantage of RLLS over the unweighted classifier increases as the shift increases. Across all shifts, the RLLS based classifier yields higher accuracy than the one based on BBSL. Results for MNIST can be found in Section A.1.

Figure 2: (a) Mean squared error in estimated weights and (b) accuracy on CIFAR10 for tweak-one shifted source and uniform target with trained using tweak-one shifted source data.

In this section, we compare the predictive performances between a classifier trained on unweighted source data and the classifiers trained on weighted loss obtained by the RLLS and BBSL procedure on CIFAR10. The target set is shifted using the Dirichlet shift with parameters . The number of data points in both source and target set is 10000.

In the case of target shifts, larger shifts actually make the predictive task easier, such that even a constant majority class vote would give high accuracy. However it would have zero accuracy on all but one class. Therefore, in order to allow for a more comprehensive performance between the methods, we also compute the macro-averaged F-1 score by averaging the per-class quantity 2(precision recall)/(precision + recall) over all classes. For a class i, precision is the percentage of correct predictions among all samples predicted to have label i, while recall is the proportion of correctly predicted labels over the number of samples with true label i. This measure gives higher weight to the accuracies of minority classes which have no effect on the total accuracy.

Figure 3 depicts the MSE of the weight estimation (a), the corresponding performance comparison on accuracy (b) and F-1 score (c). Recall that the accuracy performance for low shifts is not comparable with standard CIFAR10 benchmark results because of the overall lower sample size chosen for the comparability between shifts. We can see that in the large target shift case for , the F-1

Figure 3: (a) Mean squared error in estimated weights, (b) accuracy and (c) F-1 score on CIFAR10 for uniform source and Dirichlet shifted target. Smaller corresponds to bigger shift.

score for BBSL and the unweighted classifier is rather low compared to RLLS while the accuracy is high. As mentioned before, the reason for this observation and why in Figure 3 (b) the accuracy is higher when the shift is larger, is that the predictive task actually becomes easier with higher shift.

In the following, we present the average accuracy of RLLS in Figure 4 as a function of the number of target samples for different values of for small . Here we fix the sample size in the source set to and investigate a Minority-Class source shift with fixed p = 0.01 and five minority classes.

A motivation to use intermediate is discussed in Section 2.2, as in equation (4) may be chosen according to . In practice, since is just an upper bound on the true amount of shift

, in some cases should in fact ideally be 0 when . Thus for target sample sizes that are a little bit above the threshold (depending on the certainty of the belief how close to the norm of the shift is believed to be), it could be sensible to use an intermediate value .

Figure 4: Performance on MNIST for Minority-Class shifted source and uniform target with various target sample size and using (a) better predictor trained on tweak-one shifted source with , (b) neutral predictor with and (c) corrupted predictor with .

Figure 4 suggests that unweighted samples (red) yield the best classifier for very few samples , while for an intermediate (purple) has the highest accuracy and for , the weight estimation is certain enough for the fully weighted classifier (yellow) to have the best performance (see also the corresponding data points in Figure 2). The unweighted BBSL classifier is also shown for completeness. We can conclude that regularizing the influence of the estimated weights allows us to adjust to the uncertainty on importance weights and generalize well for a wide range of target sample sizes.

Furthermore, the different plots in Figure 4 correspond to black-box predictors for weight estimation which are trained on more or less corrupted data, i.e. have a better or worse conditioned confusion matrix. The fully weighted methods with achieve the best performance faster with a better trained black-box classifier (a), while it takes longer for it to improve with a corrupted one (c). Furthermore, this reflects the relation between eigenvalue of confusion matrix and target sample size in Theorem 1. In other words, we need more samples from the target data to compensate a bad predictor in weight estimation. So the generalization error decreases faster with an increasing number of samples for good predictors.

In summary, our RLLS method outperforms BBSL in all settings for the common image datasets MNIST and CIFAR10 to varying degrees. In general, significant improvements compared to BBSL can be observed for large shifts and the low sample regime. A note of caution is in order: comparison between the two methods alone might not always be meaningful. In particular, there are cases when the estimator trained on unweighted samples outperforms both RLLS and BBSL. Our extensive experiments for many different shifts, black box classifiers and sample sizes do not allow for a final conclusive statement about how weighting samples using our estimator affects predictive results for real-world data in general, as it usually does not fulfill the label-shift assumptions.

4 RELATED WORK

The covariate and label shift assumptions follow naturally when viewing the data generating process as a causal or anti-causal model (Schölkopf et al., 2012): With label shift, the label Y causes the input X (that is, X is not a causal parent of Y , hence "anti-causal") and the causal mechanism that generates X from Y is independent of the distribution of Y . A long line of work has addressed the reverse causal setting where X causes Y and the conditional distribution of Y given X is assumed to be constant. This assumption is sensible when there is reason to believe that there is a true optimal mapping from X to Y which does not change if the distribution of X changes. Mathematically this scenario corresponds to the covariate shift assumption.

Among the various methods to correct for covariate shift, the majority uses the concept of importance weights q(x)/p(x) (Zadrozny, 2004; Cortes et al., 2010; Cortes & Mohri, 2014; Shimodaira, 2000), which are unknown but can be estimated for example via kernel embeddings (Huang et al., 2007; Gretton et al., 2009; 2012; Zhang et al., 2013; Zaremba et al., 2013) or by learning a binary discriminative classifier between source and target (Lopez-Paz & Oquab, 2016; Liu et al., 2017). A minimax approach that aims to be robust to the worst-case shared conditional label distribution between source and target has also been investigated (Liu & Ziebart, 2014; Chen et al., 2016). Sanderson & Scott (2014); Ramaswamy et al. (2016) formulate the label shift problem as a mixture of the class conditional covariate distributions with unknown mixture weights. Under the pairwise mutual irreducibility (Scott et al., 2013) assumption on the class conditional covariate distributions, they deploy the Neyman-Pearson criterion (Blanchard et al., 2010) to estimate the class distribution q(y) which also investigated in the maximum mean discrepancy framework (Iyer et al., 2014).

Common issues shared by these methods is that they either result in a massive computational burden for large sample size problems or cannot be deployed for neural networks. Furthermore, importance weighting methods such as (Shimodaira, 2000) estimate the density (ratio) beforehand, which is a difficult task on its own when the data is high-dimensional. The resulting generalization bounds based on importance weighting methods require the second order moments of the density ratio to be bounded, which means the bounds are extremely loose in most cases (Cortes et al., 2010).

Despite the wide applicability of label shift, approaches with global guarantees in high dimensional data regimes remain under-explored. The correction of label shift mainly requires to estimate the importance weights q(y)/p(y) over the labels which typically live in a very low-dimensional space. Bayesian and probabilistic approaches are studied when a prior over the marginal label distribution is assumed (Storkey, 2009; Chan & Ng, 2005). These methods often need to explicitly compute the posterior distribution of y and suffer from the curse of dimensionality. Recent advances as in Lipton et al. (2018) have proposed solutions applicable large scale data. This approach is related to Buck et al. (1966); Forman (2008); Saerens et al. (2002) in the low dimensional setting but lacks guarantees for the excess risk.

Existing generalization bounds have historically been mainly developed for the case when P = Q (see e.g. Vapnik (1999); Bartlett & Mendelson (2002); Kakade et al. (2009); Wainwright (2019)). Ben-David et al. (2010) provides theoretical analysis and generalization guarantees for distribution shifts when the H-divergence between joint distributions is considered, whereas Crammer et al. (2008) proves generalization bounds for learning from multiple sources. For the covariate shift setting, Cortes et al. (2010) provides a generalization bound when q(x)/p(x) is known which however does not apply in practice. To the best of our knowledge our work is the first to give generalization bounds for the label shift scenario.

5 DISCUSSION

In this work, we establish the first generalization guarantee for the label shift setting and propose an importance weighting procedure for which no prior knowledge of q(y)/p(y) is required. Although RLLS is inspired by BBSL, it leads to a more robust importance weight estimator as well as generalization guarantees in particular for the small sample regime, which BBSL does not allow for. RLLS is also equipped with a sample-size-dependent regularization technique and further improves the classifier in both regimes.

We consider this work a necessary step in the direction of solving shifts of this type, although the label shift assumption itself might be too simplified in the real world. In future work, we plan to also study the setting when it is slightly violated. For instance, x in practice cannot be solely explained by the wanted label y, but may also depend on attributes z which might not be observable. In the disease prediction task for example, the symptoms might not only depend on the disease but also on the city and living conditions of its population. In such a case, the label shift assumption only holds in a slightly modified sense, i.e. P(X|Y = y, Z = z) = Q(X|Y = y, Z = z). If the attributes Z are observed, then our framework can readily be used to perform importance weighting.

Furthermore, it is not clear whether the final predictor is in fact “better” or more robust to shifts just because it achieves a better target accuracy than a vanilla unweighted estimator. In fact, there is a reason to believe that under certain shift scenarios, the predictor might learn to use spurious correlations to boost accuracy. Finding a procedure which can both learn a robust model and achieve high accuracies on new target sets remains to be an ongoing challenge. Moreover, the current choice of regularization depends on the number of samples rather than data-driven regularization which is more desirable.

An important direction towards active learning for the same disease-symptoms scenario is when we also have an expert for diagnosing a limited number of patients in the target location. Now the question is which patients would be most "useful" to diagnose to obtain a high accuracy on the entire target set? Furthermore, in the case of high risk, we might be able to choose some of the patients for further medical diagnosis or treatment, up to some varying cost. We plan to extend the current framework to the active learning setting where we actively query the label of certain x’s (Beygelzimer et al., 2009) as well as the cost-sensitive setting where we also consider the cost of querying labels (Krishnamurthy et al., 2017).

Consider a realizable and over-parameterized setting, where there exists a deterministic mapping from x to y, and also suppose a perfect interpolation of the source data with a minimum proper norm is desired. In this case, weighting the samples in the empirical loss might not alter the trained classifier (Belkin et al., 2018). Therefore, our results might not directly help the design of better classifiers in this particular regime. However, for the general overparameterized settings, it remains an open problem of how the importance weighting can improve the generalization. We leave this study for future work.

6 ACKNOWLEDGEMENT

K. Azizzadenesheli is supported in part by NSF Career Award CCF-1254106 and Air Force FA9550-15-1-0221. This research has been conducted when the first author was a visiting researcher at Caltech. Anqi Liu is supported in part by DOLCIT Postdoctoral Fellowship at Caltech and Caltech’s Center for Autonomous Systems and Technologies. Fan Yang is supported by the Institute for Theoretical Studies ETH Zurich and the Dr. Max Rössler and the Walter Haefner Foundation. A. Anandkumar is supported in part by Microsoft Faculty Fellowship, Google faculty award, Adobe grant, NSF Career Award CCF- 1254106, and AFOSR YIP FA9550-15-1-0221.

REFERENCES

Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden markov models. In Conference on Learning Theory, pp. 33–1, 2012.

Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learn- ing of pomdps using spectral methods. arXiv preprint arXiv:1602.07764, 2016.

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wort- man Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.

Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th annual international conference on machine learning, pp. 49–56. ACM, 2009.

Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.

AA Buck, JJ Gart, et al. Comparison of a screening test and a reference test in epidemiologic studies. ii. a probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 83(3):593–602, 1966.

Yee Seng Chan and Hwee Tou Ng. Word sense disambiguation with distribution estimation. In IJCAI, volume 5, pp. 1010–5, 2005.

Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pp. 1270–1279, 2016.

Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in neural information processing systems, pp. 442–450, 2010.

Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.

George Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, 2008.

David A Freedman. On tail probabilities for martingales. the Annals of Probability, pp. 100–118, 1975.

Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. 2009.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog- nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.

Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pp. 601–608, 2007.

Arun Iyer, Saketha Nath, and Sunita Sarawagi. Maximum mean discrepancy for class ratio es- timation: Convergence bounds and kernel selection. In International Conference on Machine Learning, pp. 530–538, 2014.

Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793–800, 2009.

Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daume III, and John Langford. Ac- tive learning for cost-sensitive classification. arXiv preprint arXiv:1703.01014, 2017.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech- nical report, Citeseer, 2009.

Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.

Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in neural information processing systems, pp. 37–45, 2014.

Song Liu, Akiko Takeda, Taiji Suzuki, and Kenji Fukumizu. Trimmed density ratio estimation. In Advances in Neural Information Processing Systems, pp. 4518–4528, 2017.

David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.

Geoffrey McLachlan. Discriminant analysis and statistical pattern recognition, volume 544. John Wiley & Sons, 2004.

Bernardo Avila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. arXiv preprint arXiv:1206.6444, 2012.

Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pp. 2052–2060, 2016.

Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.

Tyler Sanderson and Clayton Scott. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, pp. 850–858, 2014.

Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.

Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In Conference On Learning Theory, pp. 489–511, 2013.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log- likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.

Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, pp. 3–28, 2009.

Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.

Vladimir Naumovich Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.

Yiming Ying. Mcdiarmid’s inequalities of bernstein and bennett forms. City University of Hong Kong, 2004.

Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, pp. 114. ACM, 2004.

Wojciech Zaremba, Arthur Gretton, and Matthew Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in neural information processing systems, pp. 755–763, 2013.

Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819–827, 2013.

A MORE EXPERIMENTAL RESULTS

This section contains more experiments that provide more insights about in which settings the ad-

vantage of using RLLS vs. BBSL are more or less pronounced.

A.0.1 CIFAR10 EXPERIMENTS UNDER TWEAK-ONE SHIFT AND DIRICHLET SHIFT

Here we compare weight estimation performance between RLLS and BBSL for different types of

shifts including the Tweak-one Shift, for which we randomly choose one class, e.g. i and set

while all other classes are distributed evenly. Figure 5 depicts the the weight estimation performance

of RLLS compared to BBSL for a variety of values of and . Note that larger shifts correspond to

smaller and larger . In general, one observes that our RLLS estimator has smaller MSE and that

as the shift increases, the error of both methods increases. For tweak-one shift we can additionally

see that as the shift increases, RLLS outperforms BBSL more and more as both in terms of bias and

variance.

Figure 5: Comparing MSE of estimated weights using BBSL and RLLS on CIFAR10 with (a) tweak- one shift on source and uniform target, and (b) Dirichlet shift on source and uniform target. is trained using the same source shifted data respectively.

A.1 MNIST EXPERIMENTS UNDER MINORITY-CLASS SOURCE SHIFTS FOR DIFFERENT VALUES OF

In order to show weight estimation and classification performance under different level of label

shifts, we include several additional sets of experiments here in the appendix. Figure 6 shows the

weight estimation error and accuracy comparison under a minority-class shift with p = 0.001. The

training and testing sample size is 10000 examples in this case. We can see that whenever the weight

estimation of RLLS is better, the accuracy is also better, except in the four classes case when both

methods are bad in weight estimation.

Figure 6: (a) Mean squared error in estimated weights and (b) accuracy on MNIST for minority-class shifted source and uniform target with p = 0.001.

Figure 7 demonstrates another case in minority-class shift when p = 0.01. The black-box classifier

is the same two-layers neural network trained on a biased source data set with tweak-one .

We observe that when the number of minority class is small like 1 or 2, the weight estimation

is similar between two methods, as well as in the classification accuracy. But when the shift get

larger, the weights are worse and the performance in accuracy decreases, getting even worse than

the unweighted classifier.

Figure 7: (a) Mean squared error in estimated weights and (b) accuracy on MNIST for minority- trained on tweak-one shifted source data.

Figure 8 illustrates the weight estimation alongside final classification performance for Minority-

Class source shift of MNIST. We use 1000 training and testing data. We created large shifts of three

or more minority classes with p = 0.005. We use a fixed black-box classifier that is trained on biased

source data, with tweak-one . Observe that the MSE in weight estimation is relatively large

and RLLS outperforms BBSL as the number of minority classes increases. As the shift increases the

performance for all methods deteriorates. Furthermore, Figure 8 (b) illustrates how the advantage

of RLLS over the unweighted classifier increases as the shift increases. Across all shifts, the RLLS

based classifier yields higher accuracy than the one based on BBSL.

Figure 8: (a) Mean squared error in estimated weights and (b) accuracy on MNIST for minority- trained on tweak-one shifted source data.

A.2 CIFAR10 EXPERIMENT UNDER DIRICHLET SOURCE SHIFTS

Figure 9 illustrates the weight estimation alongside final classification performance for Dirichlet

source shift of CIFAR10 dataset. We use 10000 training and testing data in this experiment, follow-

ing the way we generate shift on source data. We train with tweak-one shifted source data with

. The results show that importance weighting in general is not helping the classification in

this relatively large shift case, because the weighted methods, including true weights and estimated

weights, are similar in accuracy with the unweighted method.

A.3 MNIST EXPERIMENT UNDER DIRICHLET SHIFT WITH LOW TARGET SAMPLE SIZE

We show the performance of classifier with different regularization under a Dirichlet shift with

in Figure 10. The training has 5000 examples in this case. We can see that in this low target

sample case, only take over after several hundreds example, while some value between 0

and 1 outperforms it at the beginning. Similar as in the paper, we use different black-box classifier

that is corrupted in different levels to show the relation between the quality of black-box predictor

and the necessary target sample size. We use biased source data with tweak-one

Figure 9: (a) Mean squared error in estimated weights and (b) accuracy on CIFAR10 for Dirichlet shifted source and uniform target, with trained on tweak-one shifted source data.

to train the black-box classifier. We see that we need more target samples for the fully weighted

version to take over for a more corrupted black-box classifier.

Figure 10: Performance on MNIST for Dirichlet shifted source and uniform target with various target sample size and using (a) better predictor, (b) neutral predictor and (c) corrupted predictor.

B PROOFS

B.1 PROOF OF LEMMA 1

From Thm. 3.4 in (Pires & Szepesvári, 2012) we know that for as defined in equation (3), if with

probability at least and hold simultaneously, then

where we use the shorthand .

We can get an upper bound on the right hand side of (7) is the infimum by simply choosing a feasible

. We then have and hence

as a consequence,

Since by definition of the minimum singular value, we thus

have

Let us first notice that

The mathematical definition of the finite sample estimates (in matrix and vector representa-

tion) with respect to some hypothesis h are as follows

where and I is the indicator function. can equivalently be expressed with the

population over P for and over Q for respectively. We now use the following concentration

Lemmas to bound the estimation errors of where we drop the subscript h for ease of notation.

Lemma 2 (Concentration of measurement matrix ) For finite sample estimate we have

with probability at least .

Lemma 3 (Concentration of label measurements) For the finite sample estimate with respect to

any hypothesis h it holds that

with probability at least .

By Lemma. 2 for concentration of C and Lemma. 3 for concentration of b we now have with proba-

bility at least

which, considering that dominates , yields the statement of the Lemma 1.

B.2 PROOF OF LEMMA 2

We prove this lemma using the theorem 1.4[Matrix Bernstein] and Dilations technique from Tropp

(2012). We can rewrite where e(i) is the one-hot-encoding of index i.

Consider a finite sequence of independent random matrices with dimension k. By dilations,

lets construct another sequence of self-adjoint random matrices of of dimension 2k, such

that for all i

therefore,

which results in . The dilation technique translates the initial sequence of ran-

dom matrices to the sequence of random self-adjoint matrices where we can apply the Matrix Bern-

stein theorem which states that, for a finite sequence of i.i.d. self-adjoint matrices , such that,

almost surely and , then for all ,

with probability at least where which is also

due to Eq. 8. Therefore, thanks to the dilation trick and theorem 1.4[Matrix Bernstein] in Tropp

with probability at least .

Now, by plugging in , we have . Together with

as well as and setting t = n, we have

B.3 PROOF OF LEMMA 3

The proof of this lemma is mainly based on a special case of and appreared at proposition 6

in Azizzadenesheli et al. (2016), Lemma F.1 in Anandkumar et al. (2012) and Proposition 19 of

Analogous to the previous section we can rewrite where e(i)

is the one-hot-encoding of index i. Note that (dropping the subscript h) we have

We now bound both estimates of probability vectors separately.

Consider a fixed multinomial distribution characterized with probability vector ofwhere

is a dimensional simplex. Further, consider t realization of this multinomial distribution

where is the one-hot-encoding of the i’th sample. Consider the empirical estimate

mean of this distribution through empirical average of the samples; , then

with probability at least .

By plugging inwith and finally and equivalently

for we obtain;

with probability at least , therefore;

resulting in the statement in the Lemma 3.

B.4 PROOF OF THEOREM 1

We want to ultimately bound . By addition and subtraction we have

where and we used optimality of . Here (a) is the weight estimation error and (b) is the

finite sample error.

Uniform law for bounding (b) We bound (b) using standard results for uniform laws for

uniformly bounded functions which holds since and . Since

, by deploying the McDiarmid’s inequality we then

obtain that

where and the Rademacher complexity is defined

as .

of the hypothesis class H (see for example Percy Liang notes on Statistical Learning Theory and

chapter 4 in Wainwright (2019))

Bounding term (a) Remember that k = |Y| is the cardinality of the finite domain of Y , or the

number of classes. Let us define with . Notice that by

definition and from which it follows by Hoelder’s inequality that .

Furthermore, we slightly abuse notation and use w to denote the k-dimensional vector with

w(i). Therefore, for all h we have via the Cauchy Schwarz inequality that

It then follows by Lemma 1 that

Lemma 4 (McDiarmid-Doob-Freedman-Rademacher) For a given A hypothesis class H, a set

, under n data points and loss function we have

with probability at least

Plugging both bounds back into equation (9) concludes the proof of the theorem.

B.5 PROOF OF LEMMA 4

With a bit abuse of notation let’s restate the empirical loss with known importance weights instead

on the random variables

We further define a ghost data set and the corresponding ghost loss;

Let’s define a random variable . This random variable is the key to

derive the tight generalization bound in Lemma 4.

This random variable has the following properties;

Which we can rewrite as

and swapping the sup with the expectation

We can remove the condition with law of iterated conditional expectation and have expectation on

both of the data sets;

we further open the expression up;

In the following we use the usual symmetrizing technique through Rademacher variables .

Each is a uniform random variable either 1 or . Therefore since

is a symmetric random variable we have

where the expectation is also over the Rademacher variables. After propagation sup

By propagating the expectation and again symmetry in the Rademacher variable we have

where the right hand side is two times the Rademacher complexity of class . Consider a

sequence of Doob Martingale and filtration defined on some probability space ;

and the corresponding Martingale difference;

In the following we show that each is bounded above.

Let’s define as the solution to the maximization and the solution to the mini-

mization, therefore,

�

The same way we can bound . Therefore the absolute value each is bounded by .

In the following we bound the conditional second moment, ;

Let’s construct an event the event that is bigger than , and also its compliment. Therefore,

for the we have

E

+ E

For the firs term in Eq. 11 after again introducing ghost variables we have the following

upper bound

So far we have that the first term in Eq. 11 is bounded by . Now for the second term we have

the following upper bound;

E

E

Therefore, since

For the first inequality, we used the fact that the loss is within [0, 1] and the second one is from

Eq. 12. Since the first term in RHS is , therefore, second moment of each is bounded

by 2d(q||p)n .

Therefore for the Doob Martingale sequence of we have as well as

. Using the Freedman’s inequality Freedman (1975), we have

Moreover, if we multiply each loss with , it results in hypothesis class of which has the same

Rademacher complexity as H, due the symmetric Rademacher random variable. Let denote the

same quantity as but on . We use this slack variable in order to bound the absolute value of

. Therefore

and the same bound for . By solving it for and we have

Note: A few days prior to the camera ready submission, we realized that a quite similar analysis

and statement to Theorem 4 is also studied in Ying (2004).

B.6 GENERALIZATION FOR FINITE HYPOTHESIS CLASSES

For finite hypothesis classes, one may bound (b) in (9) using Bernstein’s inequality.

We bound (b) by first noting that satisfies the Bernstein conditions so that Bern-

stein’s inequality holds

by definition. Because we assume , we directly have

Since we have a bound on the second moment of weighted loss while its first moment is L(h) we

can apply Bernstein’s inequality to obtain for any fixed h that

For the uniform law for finite hypothesis classes make the union bound on all the hypotheses;

The second moment of the importance weighted loss , given a can

be bounded for general , potentially leading to smaller values than d(q||p):

ℓY, h

where the first inequality follows from Hölder’s inequality, the second one follows from Jensen’s

inequality and the fact that the loss is in [0, 1] as well as the fact that the exponentiation function is

convex in this region. Moreover, since and upper bound for the loss square, ,

then;

which gives bound on the second moment of weighted loss.

B.7 SLIGHT DRIFT FROM THE LABEL SHIFT

Drift in label shift assumption: If the label shift approximation is slightly violated, we expect the

generalizing bound to deviate from the statement in the Theorem. 1. Define

as the deviation form label shift constraint which is zero in label shift setting.

Consider the case where the Label shift assumption is slightly violated, i.e., for each covariate and

label, we have , resulting importance weight for each covariate

and label. Similar to decomposing in Eq. 9, we have

where the desired excess risk is defined with respect to . The differences between Eq. 15 and Eq. 9

are in a new term (c) as well as term (a). The term (b) remains untouched.

Bound on term (c) For any h, the two contributing components in (c), i.e., and L(h) are

as follows;

))] , and ))]

For their deviation we have

[()) ))]

Where in the last inequality we deploy Cauchy Schwarz inequality as well as loss is in [0, 1] and

hold for . It is worth noting that the expectation in is with respect to Q and does not

blow up if the supports of P and Q do not match

Bound on term (a) For any , similar to the derivation in Eq. 10 we have

The previous weight estimation analysis does not directly hold for this case where the label shift

is slightly violated, but with a few modification we provide an upper-bound on the error. Given a

classifier

where , resulting;

where we drop the in both and C. Both the confusion matrix C and the label distribution

on the target for the black box hypothesis are unknown and we are instead only given access to

finite sample estimates . Similar to previous analysis we have

with corresponding finite sample quantity 1. Similarly to the analysis when there was

no violation in label shift assumption, we have and the solution to Eq. 3

satisfies;

We can simplify the upper bound by setting . We then have

Υ(2∆+ 2∆+ 22∆+ 2∆+ 2)

resulting in

designed for accessibility and to further open science