A Unified View of Label Shift Estimation

2020·Arxiv

Abstract

Abstract

Under label shift, the label distribution p(y) might change but the class-conditional distributions p(x|y) do not. There are two dominant approaches for estimating the label marginal. BBSE, a moment-matching approach based on confusion matrices, is provably consistent and provides interpretable error bounds. However, a maximum likelihood estimation approach, which we call MLLS, dominates empirically. In this paper, we present a unified view of the two methods and the first theoretical characterization of MLLS. Our contributions include (i) consistency conditions for MLLS, which include calibration of the classifier and a confusion matrix invertibility condition that BBSE also requires; (ii) a unified framework, casting BBSE as roughly equivalent to MLLS for a particular choice of calibration method; and (iii) a decomposition of MLLS’s finite-sample error into terms reflecting miscalibration and estimation error. Our analysis attributes BBSE’s statistical inefficiency to a loss of information due to coarse calibration. Experiments on synthetic data, MNIST, and CIFAR10 support our findings.

1 Introduction

Supervised algorithms are typically developed and evaluated assuming independent and identically distributed (iid) data. However, the real world seldom abides, presenting domain adaptation problems in which the source distribution , from which we sample labeled training examples, differs from the target distribution , from which we only observe unlabeled data. Absent assumptions on the nature of shift, the problem is underspecified. Multiple assumptions may be compatible with the same observations while implying different courses of action. Fortunately, some assumptions can render shift detection, estimation, and on-the-fly updates to our classifiers possible.

This paper focuses on label shift [22, 18, 16], which aligns with the anticausal setting in which the labels y cause the features x [19]. Label shift arises in diagnostic problems because diseases cause symptoms. In this setting, an intervention on p(y) induces the shift, but the process generating x given y is fixed (). Under label shift, the optimal predictor may change, e.g., the probability that a patient suffers from a disease given their symptoms can increase under a pandemic. Contrast label shift with the better-known covariate shift assumption, which aligns with the assumption that x causes y, yielding the reverse implication that

Under label shift, our first task is to estimate the ratios for all labels y. Two dominant approaches leverage off-the-shelf classifiers to estimate w: (i) Black Box Shift Estimation (BBSE) [16] and a variant called Regularized Learning under Label Shift (RLLS) [2]: moment-matching based estimators that leverage (possibly biased, uncalibrated, or inaccurate) predictions to estimate the shift; and (ii) Maximum Likelihood Label Shift (MLLS) [18]: an Expectation Maximization (EM) algorithm that assumes access to a classifier that outputs the true source distribution conditional probabilities

Given a predictor with an invertible confusion matrix, BBSE and RLLS have known consistency results and finite-sample guarantees [16, 2]. However, MLLS, in combination with a calibration heuristic called Bias-Corrected Temperature Scaling (BCTS), outperforms them empirically [1].

In this paper, we theoretically characterize MLLS, establishing conditions for consistency and bounding its finite-sample error. To start, we observe that given the true label conditional , MLLS is simply a concave Maximum Likelihood Estimation (MLE) problem and standard results apply. However, because we never know exactly, MLLS is always applied with an estimated model and thus the procedure consists of MLE under model misspecification.

First, we prove that (i) canonical calibration (Definition 1) and (ii) an invertible confusion matrix (as required by BBSE) are sufficient conditions to ensure MLLS’s consistency (Proposition 1, Theorems 1 and 2). We also show that calibration can sometimes be necessary for consistency (Example 1 in Section 4.3). Recall that neural network classifiers tend to be uncalibrated absent post-hoc adjustments [10]. Second, we observe that confusion matrices can be instruments for calibrating a classifier. Applying MLLS with this technique, BBSE and MLLS are distinguished only by their objective functions. Through extensive experiments, we show that they perform similarly, concluding that MLLS’s superior performance (when applied with more granular calibration techniques) is not due to its objective but rather to the information lost by BBSE via confusion matrix calibration. Third, we analyze the finite-sample error of the MLLS estimator by decomposing its error into terms reflecting the miscalibration error and finite-sample error (Theorem 3). Depending on the calibration method, the miscalibration error can further be divided into two terms: finite sample error due to re-calibration on a validation set and the minimum achievable calibration error with that technique.

We validate our results on synthetic data, MNIST, and CIFAR-10. Empirical results show that MLLS can have lower Mean Squared estimation Error (MSE) depending on the magnitude of the shift. Our experiments relate MLLS’s MSE to the granularity of the calibration.

In summary, we contribute the following: (i) Sufficient conditions for MLLS’s consistency; (ii) Unification of MLLS and BBSE methods under a common framework, with BBSE corresponding to a particular choice of calibration method; (iii) Finite-sample error bounds for MLLS; (iv) Experiments on synthetic and image recognition datasets that support our theoretical arguments.

2 Problem Setup

Let X be the input space and Y = {1, 2, . . . , k} the output space. Let be the source and target distributions and let denote the corresponding probability density (or mass) functions. We use to denote expectations over the source and target distributions. In unsupervised domain adaptation, we possess labeled source data and unlabeled target data . We also assume access to a black-box predictor , e.g., a model trained to approximate the true probability function , where . Here and in the rest of the paper, we use to denote the standard k-dimensional probability simplex. For a vector to access the element at index y.

Absent assumptions relating the source and target distributions, domain adaptation is underspecified [3]. We work with the label shift assumption, i.e., , focusing on multiclass classification. Moreover, we assume non-zero support for all labels in the source distribution: for all ]. Under label shift, three common goals are (i) detection—determining whether distribution shift has occurred; (ii) quantification—estimating the target label distribution; and (iii) correction—producing a predictor that minimizes error on the target distribution [16].

This paper focuses on goal (ii), estimating importance weights for all . Given w, we can update our classifiers on the fly, either by retraining in an importance-weighted ERM framework [20, 9, 16, 2]—a practice that may be problematic for overparameterized neural networks [5], or by applying an analytic correction [1, 18]. Within the ERM framework, the generalization result from Azizzadenesheli et al. [2] (Theorem 1) depends only on the error of the estimated weights, and hence any method that improves weight estimates tightens this bound.

There are multiple definitions of calibration in the multiclass setting. Guo et al. [10] study the calibration of the arg-max prediction, while Kumar et al. [14] study a notion of per-label calibration. We use canonical calibration [24] and the expected canonical calibration error on the source data defined as follows:

Definition 1 (Canonical calibration). A prediction model is canonically calibrated on the source domain if for all

Definition 2 (Expected canonical calibration error). For a predictor f, the expected squared canonical calibration error on the source domain is

Calibration methods typically work either by calibrating the model during training or by calibrating a trained classifier on held-out data, post-hoc. We refer the interested reader to Kumar et al. [14] and Guo et al. [10] for detailed studies on calibration. We focus on the latter category of methods. Our experiments follow Alexandari et al. [1], who leverage BCTS 1 to calibrate their models. BCTS extends temperature scaling [10] by incorporating per-class bias terms.

3 Prior Work

Two families of solutions have been explored that leverage a blackbox predictor: BBSE [16], a moment matching method, uses the predictor to compute a confusion matrix on the source data. Depending on how is defined, there are two types of confusion matrix for a predictor : (i) the hard confusion matrix ; and (ii) the soft confusion matrix, where is defined as a random prediction that follows the discrete distribution over Y. Both soft and hard confusion matrix can be estimated from labeled source data samples. The estimate computed as is the estimate of confusion matrix and is an estimate of computed by applying the predictor to the target data. In a related vein, RLLS [2] incorporates an additional regularization term of the form and solves a constrained optimization problem to estimate the shift ratios w.

MLLS estimates w as if performing maximum likelihood estimation, but substitutes the predictor outputs for the true probabilities . Saerens et al. [18], who introduce this procedure, describe it as an application of EM. However, as observed in [8, 1], the likelihood objective is concave, and thus a variety of optimization algorithms may be applied to recover the MLLS estimate. Alexandari et al. [1] also showed that MLLS underperforms BBSE when applied naively, a phenomenon that we shed more light on in this paper.

4 A Uniﬁed View of Label Shift Estimation with Black Box Predictors

We now present a unified view that subsumes MLLS and BBSE and demonstrate how each is instantiated under this framework. We also establish identifiability and consistency conditions for MLLS, deferring a treatment of finite-sample issues to Section 5. For convenience, throughout Sections 3 and 4, we use the term calibration exclusively to refer to canonical calibration (Definition 1) on the source data. We relegate all technical proofs to Appendix D.

4.1 A Unified Distribution Matching View

To start, we introduce a generalized distribution matching approach for estimating w. Under label shift, for any (possibly randomized) mapping from X to Z, we have that since, Throughout the paper, we use the notation p(z|y) to rep- resent either (which are identical). We now define a family of distributions over Z parameterized by

where . When , we have that are known because are known. So one potential strategy to estimate is to find a weight vector w such that

At least one such weight vector w must exist as satisfies (2). We now characterize conditions under which the weight vector w satisfying (2) is unique:

Lemma 1 (Identifiability). If the set of distributions {p(z|y) : y = 1, ..., k} are linearly independent, then for any w that satisfies (2), we must have . This condition is also necessary in general: if the linear independence does not hold then there exists a problem instance where we have satisfying (2) while

Lemma 1 follows from the fact that (2) is a linear system with at least one solution . This solution is unique when is of rank k. The linear independence condition in Lemma 1, in general, is sufficient for identifiability of discrete Z. However, for continuous Z, the linear dependence condition has the undesirable property of being sensitive to changes on sets of measure zero. By changing a collection of linearly dependent distributions on a set of measure zero, we can make them linearly independent. As a consequence, we impose a stronger notion of identifiability i.e., the set of distributions {p(z|y) : y = 1, ..., k} are such that there does not exist for which We refer this condition as strict linear independence.

In generalized distribution matching, one can set p(z|x) to be the Dirac delta function at such that Z is the same space as X, which leads to solving (2) with z replaced by x. In practice where X is high-dimensional and/or continuous, approximating the solution to (2) from finite samples can be hard when choosing z = x. Our motivation for generalizing distribution matching from X to Z is that the solution to (2) can be better approximated using finite samples when Z is chosen carefully. Under this framework, the design of a label shift estimation algorithm can be decomposed into two parts: (i) the choice of p(z|x) and (ii) how to approximate the solution to (2). Later on, we consider how these design choices may affect label shift estimation procedures in practice.

4.2 The Confusion Matrix Approach

If Z is a discrete space, one can first estimate , and then subsequently attempt to solve (2). Confusion matrix approaches use Z = Y, and construct p(z|x) using a black box predictor . There are two common choices to construct the confusion matrix: (i) The soft confusion matrix approach: We set . We then define a random variable for each x. Then we construct and . (ii) The hard confusion matrix approach: Here we set . We then define a random variable for each x. Then again we have

Since is a square matrix, the identifiability condition becomes the invertibility of the confusion matrix. Given an estimated confusion matrix, one can find w by inverting the confusion matrix (BBSE) or minimizing some distance between the vectors on the two sides of (2).

4.3 Maximum Likelihood Label Shift Estimation

When Z is a continuous space, the set of equations in (2) indexed by Z is intractable. In this case, one possibility is to find a weight vector by minimizing the KL-divergence , for defined in (1). This is equivalent to maximizing the population log-likelihood: One can further show that Therefore we can equivalently define:

This yields a straightforward convex optimization problem whose objective is bounded from below [1, 8]. Assuming access to labeled source data and unlabeled target data, one can maximize the empirical counterpart of the objective in (3), using either EM or an alternative iterative optimization scheme. Saerens et al. [18] derived an EM algorithm to maximize the objective (3) when z = x, assuming access to . Absent knowledge of the ground truth , we can plug in any approximate predictor f and optimize the following objective:

In practice, f is fit from a finite sample drawn from and standard machine learning methods often produce uncalibrated predictors. While BBSE and RLLS are provably consistent whenever the predictor f yields an invertible confusion matrix, to our knowledge, no prior works have established sufficient conditions to guarantee MLLS’ consistency when f differs from

It is intuitive that for some values of , MLLS will yield inconsistent estimates. Supplying empirical evidence, Alexandari et al. [1] show that MLLS performs poorly when f is a vanilla neural network predictor learned from data. However, Alexandari et al. [1] also show that in combination with a particular post-hoc calibration technique, MLLS achieves low error, significantly outperforming BBSE and RLLS. As the calibration error is not a distance metric between f and (zero calibration error does not indicate ), a calibrated predictor f may still be substantially different from . Some natural questions then arise:

1. Why does calibration improve MLLS so dramatically? 2. Is calibration necessary or sufficient to ensure the consistency of MLLS? 3. What accounts for the comparative efficiency of MLLS over BBSE? (Addressed in Section 5)

To address the first two questions, we make the following observations. Suppose we define z (for each x) with distribution , for some calibrated predictor f. Then, because f is calibrated, it holds that . Note that in general, the MLLS objective (4) can differ from (3). However, when , the two objectives are identical. We can formalize this as follows:

Lemma 2. If f is calibrated, then the two objectives (3) and (4) are identical when Z is chosen as is defined to be

Lemma 2 follows from changing the variable of expectation in (4) from x to f(x) and applying (definition of calibration). It shows that MLLS with a calibrated predictor on the input space X is in fact equivalent to performing distribution matching in the space Z. Building on this observation, we now state our population-level consistency theorem for MLLS:

Theorem 1 (Population consistency of MLLS). If a predictor is calibrated and the distributions {p(f(x)|y) : y = 1, . . . , k} are strictly linearly independent, then is the unique maximizer of the MLLS objective (4).

We now turn our attention to establishing consistency of the sample-based estimator. Let . The finite sample objective for MLLS can be written as

Theorem 2 (Consistency of MLLS). If f satisfies the conditions in Theorem 1, then in (5) converges to almost surely.

The main idea of the proof of Theorem 2 is to derive a metric entropy bound on the class of functions to prove Hellinger consistency (Theorem 4.6 [25]). The consistency of MLLS relies on the linear independence of the collection of distributions {p(f(x)|y) : y = 1, . . . , k}. The following result develops several alternative equivalent characterizations of this linear independence condition.

Proposition 1. For a calibrated predictor f, the following statements are equivalent:

Proposition 1 shows that with a calibrated predictor, the invertibility condition as required by BBSE (or RLLS) is exactly the same as the linear independence condition required for MLLS’s consistency.

Having provided sufficient conditions, we consider a binary classification example to provide intuition for why we need calibration for consistency. In this example, we relate the estimation error to the miscalibration error, showing that calibration is not only sufficient but also necessary to achieve zero estimation error for a certain class of predictors.

Example 1. Consider a mixture of two Gaussians with and . We suppose that the source mixing coefficients are both , while the target mixing coefficients are . Assume a class of probabilistic threshold classifiers: for , otherwise . Then the population error of MLLS is given by

which is zero only if for a non-degenerate classifier.

The expression for estimation error arising from our example yields two key insights: (i) an uncalibrated thresholded classifier has an estimation error proportional to the true shift in label distribution i.e. ; (ii) the error is also proportional to the canonical calibration error which is . While earlier in this section, we concluded that canonical calibration is sufficient for consistency, the above example provides some intuition for why it might also be necessary. In Appendix C, we show that marginal calibration [14, 10, 24], a less restricted definition is insufficient to achieve consistency.

4.4 MLLS with Confusion Matrix

So far, we have shown that MLLS with any calibrated predictor can be viewed as distribution matching in a latent space. Now we discuss a method to construct a predictor f to perform MLLS given any p(z|x), e.g., those induced by confusion matrix approaches. Recall, we already have the maximum log-likelihood objective. It just remains to construct a calibrated predictor f from the confusion matrix.

This is straightforward when p(z|x) is deterministic, i.e., for some function g: setting makes the objectives (3) and (4) to be the same. Recall that for the hard confusion matrix, the induced latent space is . So the corresponding predictor in MLLS is , where . Then we obtain the MLLS objective for the hard confusion matrix:

The confusion matrix and predictor directly give us . Given an input x, one can first get from , then normalize the -th row of as . We denote MLLS with hard confusion matrix calibration (6) by MLLS-CM.

When is stochastic, we need to extend (4) to allow f to be a random predictor: for . To incorporate the randomness of f, one only needs to change the expectation in (4) to be over both x and f(x), then (4) becomes a rewrite of (3).

Proposition 2 indicates that constructing the confusion matrix is a calibration procedure. Thus, the predictor constructed with constructed using confusion matrix is calibrated and suitable for application with MLLS.

Proposition 2 (Vaicenavicius et al. [24]). For any function is a calibrated predictor.

We can now summarize the relationship between BBSE and MLLS: A label shift estimator involves two design choices: (i) designing the latent space p(z|x) (which is equivalent to designing a calibrated predictor); and (ii) performing distribution matching in the new space Z. In BBSE, we design a calibrated predictor via the confusion matrix and then perform distribution matching by directly solving linear equations. In general, MLLS does not specify how to obtain a calibrated predictor, but specifies KL minimization as the distribution matching procedure. One can apply the confusion matrix approach to obtain a calibrated predictor and then plug it into MLLS, which is the BBSE analog under MLLS, and is a special case of MLLS.

5 Theoretical Analysis of MLLS

We now analyze the performance of MLLS estimator. Even when is the unique optimizer of (4) for some calibrated predictor f, assuming convex optimization can be done perfectly, there are still two sources of error preventing us from exactly computing in practice. First, we are optimizing a sample-based approximation (5) to the objective in expectation (4). We call this source of error finite-sample error. Second, the predictor f we use may not be perfectly calibrated on the source distribution as we only have access to samples from source data distribution . We call this source of error miscalibration error. We will first analyze how these two sources of errors affect the estimate of separately and then give a general error bound that incorporates both. All proofs are relegated to Appendix E.

Before presenting our analysis, we introduce some notation and regularity assumptions. For any predictor , we define and as in (4) and (5). If f satisfies the conditions in Theorem 2 (calibration and linear independence) then we have that . Our goal is to bound for a given (possibly miscalibrated) predictor f. We now introduce a regularity condition: Condition 1 (Regularity condition for a predictor f). For any x within the support of , i.e. , we have both for some universal constant

Condition 1 is mild if f is calibrated since in this case is the maximizer of and the condition is satisfied if the expectation is finite. Since and are upperbounded (they are the inner products of two vectors which sum to 1), they also must be lower-bounded away from 0 with arbitrarily high probability without any assumptions. For miscalibrated f, a similar justification holds for assumption that is lower bounded. Turning our attention to the assumption that is lower bounded, we note that it is sufficient if f is close (pointwise) to some calibrated predictor. This in turn is a reasonable assumption on the actual predictor we use for MLLS in practice as it is post-hoc calibrated on source data samples.

Define to be the minimum eigenvalue of the Hessian . To state our results compactly we use standard stochastic order notation (see, for instance, [26]). We first bound the estimation error introduced by only having finite samples from the target distribution in Lemma 3. Next, we bound the estimation error introduced by having a miscalibrated f in Lemma 4.

Lemma 3. For any predictor f that satisfies Condition 1, we have

Lemma 4. For any predictor f and any calibrated predictor that satisfies Condition 1, we have for some constant C.

If we set , which is a calibrated predictor (Proposition 2), we can bound the error in terms of the calibration error of f on the source data 4:

Note that since , we can upper-bound the error in Lemma 4 with calibration error on the source data. We combine the two sources of error to bound the estimation error Theorem 3. For any predictor f that satisfies Condition 1, we have

The estimation error of MLLS can be decomposed into (i) finite-sample error, which decays at a rate of ; and (ii) the calibration error of the predictor that we use. The proof is a direct combination of Lemma 3 and Lemma 4 applied to the same f with the following error decomposition:

Theorem 3 shows that the estimation error depends inversely on the minimum eigenvalue of the Hessian at two different points . One can unify these two eigenvalues as a single quantity , the minimum eigenvalue . We formalize this observation in Appendix E.

If we use the post-hoc calibration procedure (as discussed in Section 2 and A) to calibrate a blackbox predictor , we can obtain a bound on the calibration error of f. In more detail, suppose that the class G used for calibration satisfies standard regularity conditions (injectivity, Lipschitz-continuity, twice differentiability, non-singular Hessian). We have the following lemma:

Lemma 5. Let be the predictor after post-hoc calibration with squared loss l and g belongs to a function class G that satisfies the standard regularity conditions, we have

Figure 1: (top) MSE vs the degree of shift; For GMM, we control the shift in the label marginal for class 1 with a fixed target sample size of 1000. For multiclass problems—-MNIST and CIFAR-10, we control the Dirichlet shift parameter with a fixed sample size of 5000. (bottom) MSE (in log scale) vs target sample size; For GMM, we fix the label marginal for class 1 at 0.01 whereas for multiclass problems, MNIST and CIFAR-10, we fix the Dirichlet parameter to 0.1. In all plots, MLLS dominates other methods. All confusion matrix approaches perform similarly, indicating that the advantage of MLLS comes from the choice of calibration but not the way of performing distribution matching.

This result is similar to Theorem 4.1 [14]. For a model class G that is rich enough to contain a function that achieves zero calibration error, i.e., , then we obtain an estimation error bound for MLLS of . This bound is similar to rate of RLLS and BBSE, where instead of they have minimum eigenvalue of the confusion matrix.

The estimation error bound explains the efficiency of MLLS. Informally, the error of MLLS depends inversely on the minimum eigenvalue of the Hessian of the likelihood . When we apply coarse calibration via the confusion matrix (in MLLS-CM), we only decrease the value of . Coarse calibration throws away information [13] and thus results in greater estimation error for MLLS. In Section 6, we empricially show that MLLS-CM’s performance is similar to that of BBSE. Moreover, on a synthetic Gaussian mixture model, we show that the minimum eigenvalue of the Hessian obtained using confusion matrix calibration is smaller than the minimum eigenvalue obtained with more granular calibration. Our analysis and observations together suggest MLLS’s superior performance than BBSE (or RLLS) is due to the granular calibration but not due to the difference in the optimization objective.

Finally, we want to highlight one minor point regarding applicability of our result. If f is calibrated, Theorem 3, together with Proposition 3 (in Appendix E), implies that MLLS is consistent if is invertible. Compared to the consistency condition in Theorem 1 that is invertible (together with Proposition 1), these two conditions are the same if the likelihood ratio is lower-bounded. This is true if all entries in are non-zero. Even if contains non-zero entries, the two conditions are still the same if there exists some such that p(f(x)|y) covers the full support of . In general however, the invertibility of is a stronger requirement than the invertibility of . We leave further investigation of this gap for future work.

6 Experiments

We experimentally illustrate the performance of MLLS on synthetic data, MNIST [15], and CIFAR10 [12]. Following Lipton et al. [16], we experiment with Dirichlet shift simulations. On each run, we sample a target label distribution from a Dirichlet with concentration parameter . We then generate each target example by first sampling a label and then sampling (with replacement) an example conditioned on that label . Note that smaller values of alpha correspond to more severe shift. In our experiments, the source label distribution is uniform.

First, we consider a mixture of two Gaussians (as in Example in Section 4.3) with . With CIFAR10 and MNIST, we split the full training set into two subsets: train and valid, and use the provided test set as is. Then according to the label distribution, we randomly sample with replacement train, valid, and test set from each of their respective pool to form the source and target set. To learn the black box predictor on real datasets, we use the same architecture as Lipton et al. [16] for MNIST, and for CIFAR10 we use ResNet-18 [11] as in Azizzadenesheli et al. [2]5. For simulated data, we use the true as our predictor function. For each experiment, we sample 100 datasets for each shift parameter and evaluate the empirical MSE and variance of the estimated weights.

We consider three sets of experiments: (1) MSE vs degree of target shift; (2) MSE vs target sample sizes; and (3) MSE vs calibrated predictors on the source distribution. We refer to MLLS-CM as MLLS with hard confusion matrix calibration as in (6). In our experiments, we compare MLLS estimator with BBSE, RLLS, and MLLS-CM. For RLLS and BBSE, we use the publicly available code 6. To post-hoc calibration, we use BCTS [1] on the held-out validation set. Using the same validation set, we calculate the confusion matrix for BBSE, RLLS, and MLLS-CM.

We examine the performance of various estimators across all three datasets for various target dataset sizes and shift magnitudes (Figure 1). Across all shifts, MLLS (with BCTS-calibrated classifiers) uniformly dominates BBSE, RLLS, and MLLS-CM in terms of MSE (Figure 1). Observe for severe shifts, MLLS is comparatively dominant. As the available target data increased, all methods improve rapidly, with MLLS outperforming all other methods by a significant margin. Moreover, MLLS’s advantages grow more pronounced under extreme shifts. Notice MLLS-CM is roughly equivalent to BBSE across all settings of dataset, target size, and shift magnitude. This concludes MLLS’s superior performance is not because of differences in loss function used for distribution matching but due to differences in the granularity of the predictions, caused by crude confusion matrix aggregation.

Note that given a predictor , we can partition our input space and produce another predictor that, for any datapoint gives the expected output of on points belonging to that partition. If is calibrated, then will also be calibrated [24]. On synthetic data, we vary the granularity of calibration (for MLLS) by aggregating over a variable number of equal-sized bins. With more bins, less information is lost due to calibration. Consequently, the minimum eigenvalue of the Hessian increases and the MSE decreases, supporting our theoretical bounds (Figure 2). We also verify that the confusion matrix calibration performs poorly (Figure 2). For MLLS-CM, the minimum eigenvalue of the Hessian is 0.195, significantly smaller than for the binned predictor for #bin . Thus, the poor performance of MLLS-CM is predicted by its looser upper bound per our analysis. Note that these experiments presume access to the true predictor and thus the

MSE strictly improves with the number of bins. In practice, with a fixed source dataset size, increasing

the number of bins could lead to overfitting, worsening our calibration.

7 Conclusion

This paper provides a unified framework relating techniques that use off-the-shelf predictors for label shift estimation. We argue that these methods all employ calibration, either explicitly or implicitly, differing only in the choice of calibration method and their optimization objective. Moreover, with our analysis we show that the choice of calibration method (and not the optimization objective for distribution matching) accounts for the advantage of MLLS with BCTS calibration over BBSE. In future work, we hope to operationalize these insights to provide guidance for a calibration scheme to improve label shift estimation.

Broader Impact

This paper investigates the (statistical) consistency and efficiency of two existing methods for estimating target domain label distributions. While this could potentially guide practitioners to improve detection, estimation, and classification in applications where the label shift assumption holds, we do not believe that it will fundamentally impact how machine learning is used in a way that could conceivably be socially salient. While we take the potential impact of machine learning on society seriously, we believe that this work, which addresses a foundational theoretical problem, does not present a significant societal concern.

Acknowledgments

We thank Zico Kolter and David Childers for their helpful feedback. This material is based on research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Laboratory, DARPA or the U.S. Government. SB acknowledges funding from the NSF grants DMS-1713003 and CIF-1763734. ZL acknowledges Amazon AI, Salesforce Research, Facebook, UPMC, Abridge, and the Center for Machine Learning and Health for their generous support of ACMI Lab’s research on machine learning under distribution shift.

References

[1] A. Alexandari, A. Kundaje, and A. Shrikumar. Adapting to label shift with bias-corrected calibration. In arXiv preprint arXiv:1901.06852, 2019.

[2] K. Azizzadenesheli, A. Liu, F. Yang, and A. Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations (ICLR), 2019.

[3] S. Ben-David, T. Lu, T. Luu, and D. Pál. Impossibility Theorems for Domain Adaptation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[4] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.

[5] J. Byrd and Z. C. Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning (ICML), 2019.

[6] C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014.

[7] C. Cortes, Y. Mansour, and M. Mohri. Learning Bounds for Importance Weighting. In Advances in Neural Information Processing Systems (NIPS), 2010.

[8] M. C. Du Plessis and M. Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Networks, 50:110–119, 2014.

[9] A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schölkopf. Covariate Shift by Kernel Mean Matching. Journal of Machine Learning Research (JMLR), 2009.

[10] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), 2017.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

[12] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, Citeseer, 2009.

[13] V. Kuleshov and P. S. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

[14] A. Kumar, P. S. Liang, and T. Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86, 1998.

[16] Z. C. Lipton, Y.-X. Wang, and A. Smola. Detecting and Correcting for Label Shift with Black Box Predictors. In International Conference on Machine Learning (ICML), 2018.

[17] H. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pages 2052–2060, 2016.

[18] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 2002.

[19] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On Causal and Anticausal Learning. In International Conference on Machine Learning (ICML), 2012.

[20] H. Shimodaira. Improving Predictive Inference Under Covariate Shift by Weighting the LogLikelihood Function. Journal of Statistical Planning and Inference, 2000.

[21] C. M. Stein. Estimation of the mean of a multivariate normal distribution. The annals of Statistics, pages 1135–1151, 1981.

[22] A. Storkey. When Training and Test Sets Are Different: Characterizing Learning Transfer. Dataset Shift in Machine Learning, 2009.

[23] J. A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 2015.

[24] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. B. Schön. Evaluating model calibration in classification. In International Conference on Machine Learning (ICML), 2019.

[25] S. van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.

[26] A. W. van der Vaart and J. A. Wellner. Weak convergence. In Weak convergence and empirical processes. Springer, 1996.

[27] B. Zadrozny. Learning and Evaluating Classifiers Under Sample Selection Bias. In International Conference on Machine Learning (ICML), 2004.

[28] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain Adaptation Under Target and Conditional Shift. In International Conference on Machine Learning (ICML), 2013.

A MLLS Algorithm

Step 1. description. Let the model class used for post-hoc calibration be represented by G. Given a validation dataset sampled from the source distribution we compute, , applying our classifier to the data. Using this we estimate a function,

where the loss function can be the negative log-likelihood or squared error. Experimentally we observe same performance with both the loss functions. Subsequently, we can apply the calibrated predictor

Our experiments follow Alexandari et al. [1], who leverage BCTS 7 to calibrate their models. BCTS extends temperature scaling [10] by incorporating per-class bias terms. Formally, a function in the BCTS class G, is given by

where parameters to be learned.

B Prior Work on Label Shift Estimation

Dataset shifts are predominantly studied under two scenarios: covariate shift and label shift [22]. Schölkopf et al. [19] articulates connections between label shift and covariate shift with anti-causal and causal models respectively. Covariate shift is well explored in past [28, 27, 7, 6, 9].

Approaches for estimating label shift (or prior shift) can be categorized into three classes:

1. Methods that leverage Mixture Proportion Estimation (MPE) [4, 17] techniques to estimate the target label distribution. MPE estimate in general (e.g. Blanchard et al. [4]) needs explicit calculations of which is infeasible for high dimensional data. More recent methods for MPE estimation, i.e. Ramaswamy et al. [17], uses Kernel embeddings, which like many kernel methods, require the inversion of an Gram matrix. The complexity makes them infeasible for large datasets, practically used in deep learning these days;

2. Methods that directly operate in RKHS for distribution matching [28, 8]. Zhang et al. [28] extend the kernel mean matching approach due to Gretton et al. [9] to the label shift problem. Instead of minimizing maximum mean discrepancy, Du Plessis and Sugiyama [8] explored minimizing PE divergence between the kernel embeddings to estimate the target label distribution. Again, both the methods involve inversion of an kernel matrix, rendering them infeasible for large datasets; and

3. Methods that work in low dimensional setting [16, 2, 18] by directly estimating to avoid the curse of dimensionality. These methods leverage an off-the-shelf predictor to estimate the label shift ratio.

In this paper, we primarily focus on unifying methods that fall into the third category.

C Marginal calibration is insufﬁcient to achieve consistency

In this section, we will illustrate insufficiency of marginal calibration to achieve consistency. For completeness, we first define margin calibration:

Definition 3 (Marginal calibration). A prediction model is marginally calibrated on the source domain if for all

Intuitively, this definition captures per-label calibration of the classifier which is strictly less restrictive than requiring canonical calibration. In the example, we construct a classifier on discrete X which is marginally calibrated, but not canonically calibrated. With the constructed example, we show that the population objective (4) yields inconsistent estimates.

Example. Assume . Suppose the predictor f(x) and are given as,

Clearly, the prediction f(x) is marginally calibrated. We have one more degree to freedom to choose, which is the source marginal distribution on X. For simplicity let’s assume for all i = {1, . . . , 6}. Thus, we have . Note, with our assumption of the source marginal on x, we get . This follows as is an one-to-one mapping.

Now, assume a shift i.e. prior on Y for the target distribution of the form label shift assumption, we get

Assume the importance weight vector as w. Clearly, we have . Re-writing the population MLLS objective (4), we get the maximisation problem as

Differentiating (10) with respect to and , we get two high order equations, solving which give us the MLLS estimate . To show inconsistency, it is enough to consider one instantiation of and such that . Assuming and and solving (10) using numerical methods, we get . As concluding the proof.

D Proofs from Section 4

Lemma 1 (Identifiability). If the set of distributions {p(z|y) : y = 1, ..., k} are linearly independent, then for any w that satisfies (2), we must have . This condition is also necessary in general: if the linear independence does not hold then there exists a problem instance where we have satisfying (2) while

Proof. First we prove sufficiency. If there exists such that (2) holds, then we have is not the zero vector, are linearly dependent. Since and for all y (by assumption), we also have that {p(z|y), y = 1, ..., k} are linearly dependent. By contradiction, we show that the linear independence is necessary.

To show necessity, assume . We know that by definition. If linear independence does not hold, then there exists a vector such that and . Since the we construct is not on the boundary of W, we can scale v such that where and . Therefore, setting gives another solution for (2), which concludes the proof.

Lemma 2. If f is calibrated, then the two objectives (3) and (4) are identical when Z is chosen as is defined to be

Proof. The proof follows a sequence of straightforward manipulations. In more detail,

where the final step uses the fact that f is calibrated.

Theorem 1 (Population consistency of MLLS). If a predictor is calibrated and the distributions {p(f(x)|y) : y = 1, . . . , k} are strictly linearly independent, then is the unique maximizer of the MLLS objective (4).

Proof. According to Lemma 2 we know that maximizing (4) is the same as maximizing (3) with , thus also the same as minimizing the KL divergence between we know that is a minimizer of the KL divergence such that the KL divergence is 0. We also have that if and only if , so all maximizers of (4) should satisfy (2). According to Lemma 1, if the strict linear independence holds, then unique solution of (2). Thus is the unique maximizer of (4).

Proposition 1. For a calibrated predictor f, the following statements are equivalent:

Proof. We first show the equivalence of (1) and (2). If f is calibrated, we have . Then for any vector

On the other hand, we can have

If {p(f(x)|y) : y = 1, . . . , k} are linearly dependent, then there exist such that (11) is zero for any x. Consequently, there exists a non-zero vector u with such that for any x satisfying , which means and thus is not invertible. On the other hand, if is non-invertible, then there exist some such that . Further as . As a result, the vector v with satisfies that (11) is zero for any x, which means {p(f(x)|y) : y = 1, . . . , k} are not strictly linearly independent.

Let C be the soft confusion matrix of f, then

Therefore, we have , which means (2) and (3) are equivalent.

We introduce some notation before proving consistency. Let be the class of densities8 for a given calibrated predictor f. Suppose are densities corresponding to MLE estimate and true weights, respectively. We use to denote the Hellinger distance and TVto denote the total variation distance between two densities denotes -entropy for class P with respect to metric . Similarly, denotes the corresponding bracketing entropy. Moreover, denotes the empirical random distribution that puts uniform mass on observed samples . Before proving consistency we need to re-state two results:

Lemma 6 (Lemma 2.1 [25]). If P is a probability measure, for all

Lemma 7 (Corollary 2.7.10 [26]). Let F be the class of convex functions a compact, convex set for every x,y. Then

for a constant K that depends on the dimension d and C.

We can now present our proof of consistency, which is based on Theorem 4.6 from van de Geer [25]:

then almost surely. Theorem 2 (Consistency of MLLS). If f satisfies the conditions in Theorem 1, then in (5) converges to almost surely.

Proof. Assume the maximizer of (5) is . Define class prove consistency, we first bound the bracketing entropy for class G using Lemma 6 and Lemma 7. Clearly P is linear in parameters and hence, convex. Gradient of function is given by which in turn is bounded by . Under assumptions of Condition 1, the functions in G are Lipschitz with constant . We can bound the bracketing entropy using Lemma 7 and Lemma 6 as

for some constant that depends on k.

On the other hand, for cases where can be arbitrarily close to zero, i.e., Condition 1 doesn’t hold true, we define

Assume such that (14) is satisfied. Using (15), we have

Thus, for the cases where can be arbitrarily close to zero, instead of bounding , we we bound . For any , there is a compact subset , such that . Using arguments similar to above, function is Lipschitz with constant . Again using Lemma 7 and Lemma 6, we conclude

for some constant that depends on k. Finally, we use Lemma 8 to conclude . Further, as TVimplies TV

where is the minimum eigenvalue of covariance matrix. Note using Proposition 1, we have . Thus, we conclude

Example 1. Consider a mixture of two Gaussians with and . We suppose that the source mixing coefficients are both , while the target mixing coefficients are . Assume a class of probabilistic threshold classifiers: for , otherwise

Then the population error of MLLS is given by

which is zero only if for a non-degenerate classifier.

Proof. The intuition behind the construction is, for such an Example, we can get a closed form solution for the population MLLS and hence allows a careful analysis of the estimation error. The classifier f(x) predicts class 0 with probability c and class 1 with probability for , and vice-versa for x < 0. Using such a classifier, the weight estimator is given by:

where equality (i) follows from and the predictor function and (ii) follows from the fact that within each integral, the term inside the log is independent of x. Differentiating w.r.t. to have:

which gives . Thus for the population MLLS estimate, the estimation error is given by

E Proofs from Section 5

The gradient of the MLLS objective can be written as

and the Hessian is

We use to denote the minimum eigenvalue of the matrix X.

Lemma 9 (Theorem 5.1.1 [23]). Let be a finite sequence of identically distributed independent, random, symmetric matrices with common dimension k. Assume and . With probability at least

Lemma 3. For any predictor f that satisfies Condition 1, we have

Proof. We present our proof in two steps. Step-1 is the non-probabilistic part, i.e., bounding the error in terms of the gradient difference . This step uses Taylor’s expansion upto second order terms for empirical log-likelihood around the true involves deriving a concentration on the gradient difference using the Lipschitz property implied by Condition 1. Combining these two steps along with Lemma 22 concludes the proof. Now we detail each of these steps.

Step-1. We represent the empirical Negative Log-Likelihood (NLL) function with by absorbing the negative sign to simplify notation. Using a Taylor expansion, we have

where . With the assumption , we have . Using this we get,

where term-I is less than zero as is the minimizer of empirical NLL . Ignoring term-I and re-arranging a few terms we get:

Using Holder’s inequality on the LHS we have,

Let be the minimum eigenvalue of . Using the fact that

Step-2. The empirical gradient is where

the gradient terms as

As the gradient terms decompose and are independent, using Hoeffding’s inequality we have with probability at least

Let be the minimum eigenvalue of . Using lemma 9, with probability at least

Plugging (21) and (22) in (20), and applying a union bound, we conclude that with probability at least

Neglecting the order m term and letting

Lemma 4. For any predictor f and any calibrated predictor that satisfies Condition 1, we have for some constant C.

If we set , which is a calibrated predictor (Proposition 2), we can bound the error in terms of the calibration error of f on the source data 9:

Proof. We present our proof in two steps. Note, all calculations are non-probabilistic. Step-1 involves bounding the error in terms of the gradient difference . This step uses Taylor’s expansion on upto the second orderth term for population log-likelihood around the true . Step-2 involves deriving a bound on the gradient difference in terms of the difference using the Lipschitz property implied by Condition 1. Further, for a crude calibration choice of , the gradient difference can be bounded by miscalibration error. We now detail both of these steps.

Step-1. Similar to Lemma 3, we represent with L by absorbing the negative sign to simplify notation. Using the Taylor expansion, we have

where . With the assumption , we have . Using this we get,

where term-I is less than zero as is the minimizer of NLL L(w, f). Ignoring that term and re-arranging a few terms we get

As before, let be the minimum eigenvalue of . Using the fact that

Using Holder’s inequality on the LHS and re-arranging terms gives

Step-2. By lower bound assumptions

where the first inequality is implied by Jensen’s inequality and the second is implied by the Lipschitz property of the gradient. Further, we have

Combining equations (23), (24), and (25), we have

Further, if we set , which is a calibrated predictor according to Proposition 2, we can bound the error on the RHS in terms of the calibration error of f. Moreover, in the label shift estimation problem, we have the assumption that for all y. Hence, we have . Using Jensen’s inequality, we get

Plugging (27) back in (26),we get the required upper bound.

Proposition 3. For any , we have where is the minimum eigenvalue of and . Furthermore, if f satisfies Condition 1, we have

Proof. For any

where

and

if f satisfies Condition 1 and . Therefore, we have

for w

Lemma 5. Let be the predictor after post-hoc calibration with squared loss l and g belongs to a function class G that satisfies the standard regularity conditions, we have

Proof. Assume regularity conditions on the model class (injectivity, Lipschitz-continuity, twice differentiability, non-singular Hessian, and consistency) as in Theorem 5.23 of Stein [21] hold true. Using the injectivity property of the model class as in Kumar et al. [14], we have for all

Let be models parameterized by and , respectively. Using the strong convexity of the empirical mean squared error we have,

where is the parameter constant for strong convexity. Re-arranging a few terms, we have

where term-I is less than zero because is the empirical minimizer of the mean-squared error. Ignoring term-I, we get:

As the assumed model class is Lipschitz w.r.t. , the gradient is bounded by Lipschitz constant as is the population minimizer. Using Hoeffding’s bound for bounded functions, we have with probability at least

Using the smoothness of the MSE(g), we have

where is the operator norm of the . Combining (28), (29), and (30), we have for some universal constant with probability at least

Moreover, with Lemma 4, depending on the degree of the miscalibration and the method involved to calibrate, we can bound the E(f). For example, if using vector scaling on a held out training data for calibration, we can use Lemma 5 to bound the calibration error E(f), i.e., with probability at least

Plugging (27) and (31) into (26), we have with probability at least

designed for accessibility and to further open science