b

DiscoverSearch
About
My stuff
A Unified View of Label Shift Estimation
2020·arXiv
Abstract
Abstract

Under label shift, the label distribution p(y) might change but the class-conditional distributions p(x|y) do not. There are two dominant approaches for estimating the label marginal. BBSE, a moment-matching approach based on confusion matrices, is provably consistent and provides interpretable error bounds. However, a maximum likelihood estimation approach, which we call MLLS, dominates empirically. In this paper, we present a unified view of the two methods and the first theoretical characterization of MLLS. Our contributions include (i) consistency conditions for MLLS, which include calibration of the classifier and a confusion matrix invertibility condition that BBSE also requires; (ii) a unified framework, casting BBSE as roughly equivalent to MLLS for a particular choice of calibration method; and (iii) a decomposition of MLLS’s finite-sample error into terms reflecting miscalibration and estimation error. Our analysis attributes BBSE’s statistical inefficiency to a loss of information due to coarse calibration. Experiments on synthetic data, MNIST, and CIFAR10 support our findings.

Supervised algorithms are typically developed and evaluated assuming independent and identically distributed (iid) data. However, the real world seldom abides, presenting domain adaptation problems in which the source distribution  Ps, from which we sample labeled training examples, differs from the target distribution  Pt, from which we only observe unlabeled data. Absent assumptions on the nature of shift, the problem is underspecified. Multiple assumptions may be compatible with the same observations while implying different courses of action. Fortunately, some assumptions can render shift detection, estimation, and on-the-fly updates to our classifiers possible.

This paper focuses on label shift [22, 18, 16], which aligns with the anticausal setting in which the labels y cause the features x [19]. Label shift arises in diagnostic problems because diseases cause symptoms. In this setting, an intervention on p(y) induces the shift, but the process generating x given y is fixed (ps(x|y) = pt(x|y)). Under label shift, the optimal predictor may change, e.g., the probability that a patient suffers from a disease given their symptoms can increase under a pandemic. Contrast label shift with the better-known covariate shift assumption, which aligns with the assumption that x causes y, yielding the reverse implication that  ps(y|x) = pt(y|x).

Under label shift, our first task is to estimate the ratios  w(y) = pt(y)/ps(y)for all labels y. Two dominant approaches leverage off-the-shelf classifiers to estimate w: (i) Black Box Shift Estimation (BBSE) [16] and a variant called Regularized Learning under Label Shift (RLLS) [2]: moment-matching based estimators that leverage (possibly biased, uncalibrated, or inaccurate) predictions to estimate the shift; and (ii) Maximum Likelihood Label Shift (MLLS) [18]: an Expectation Maximization (EM) algorithm that assumes access to a classifier that outputs the true source distribution conditional probabilities  ps(y|x).

image

Given a predictor �fwith an invertible confusion matrix, BBSE and RLLS have known consistency results and finite-sample guarantees [16, 2]. However, MLLS, in combination with a calibration heuristic called Bias-Corrected Temperature Scaling (BCTS), outperforms them empirically [1].

In this paper, we theoretically characterize MLLS, establishing conditions for consistency and bounding its finite-sample error. To start, we observe that given the true label conditional  ps(y|x), MLLS is simply a concave Maximum Likelihood Estimation (MLE) problem and standard results apply. However, because we never know  ps(y|x)exactly, MLLS is always applied with an estimated model �fand thus the procedure consists of MLE under model misspecification.

First, we prove that (i) canonical calibration (Definition 1) and (ii) an invertible confusion matrix (as required by BBSE) are sufficient conditions to ensure MLLS’s consistency (Proposition 1, Theorems 1 and 2). We also show that calibration can sometimes be necessary for consistency (Example 1 in Section 4.3). Recall that neural network classifiers tend to be uncalibrated absent post-hoc adjustments [10]. Second, we observe that confusion matrices can be instruments for calibrating a classifier. Applying MLLS with this technique, BBSE and MLLS are distinguished only by their objective functions. Through extensive experiments, we show that they perform similarly, concluding that MLLS’s superior performance (when applied with more granular calibration techniques) is not due to its objective but rather to the information lost by BBSE via confusion matrix calibration. Third, we analyze the finite-sample error of the MLLS estimator by decomposing its error into terms reflecting the miscalibration error and finite-sample error (Theorem 3). Depending on the calibration method, the miscalibration error can further be divided into two terms: finite sample error due to re-calibration on a validation set and the minimum achievable calibration error with that technique.

We validate our results on synthetic data, MNIST, and CIFAR-10. Empirical results show that MLLS can have  2–10×lower Mean Squared estimation Error (MSE) depending on the magnitude of the shift. Our experiments relate MLLS’s MSE to the granularity of the calibration.

In summary, we contribute the following: (i) Sufficient conditions for MLLS’s consistency; (ii) Unification of MLLS and BBSE methods under a common framework, with BBSE corresponding to a particular choice of calibration method; (iii) Finite-sample error bounds for MLLS; (iv) Experiments on synthetic and image recognition datasets that support our theoretical arguments.

Let X be the input space and Y = {1, 2, . . . , k} the output space. Let  Ps, Pt : X × Y → [0, 1]be the source and target distributions and let  ps and ptdenote the corresponding probability density (or mass) functions. We use  Es and Etto denote expectations over the source and target distributions. In unsupervised domain adaptation, we possess labeled source data  {(x1, y1), (x2, y2), . . . , (xn, yn)}and unlabeled target data  {xn+1, xn+2, . . . , xn+m}. We also assume access to a black-box predictor �f : X �→ ∆k−1, e.g., a model trained to approximate the true probability function  f ∗, where f ∗(x) := ps(·|x). Here and in the rest of the paper, we use  ∆k−1 to denote the standard k-dimensional probability simplex. For a vector  v, we use vyto access the element at index y.

Absent assumptions relating the source and target distributions, domain adaptation is underspecified [3]. We work with the label shift assumption, i.e.,  ps(x|y) = pt(x|y), focusing on multiclass classification. Moreover, we assume non-zero support for all labels in the source distribution: for all y ∈ Y, ps(y) ≥ c > 0 [16, 2]. Under label shift, three common goals are (i) detection—determining whether distribution shift has occurred; (ii) quantification—estimating the target label distribution; and (iii) correction—producing a predictor that minimizes error on the target distribution [16].

This paper focuses on goal (ii), estimating importance weights  w(y) = pt(y)/ps(y)for all  y ∈ Y. Given w, we can update our classifiers on the fly, either by retraining in an importance-weighted ERM framework [20, 9, 16, 2]—a practice that may be problematic for overparameterized neural networks [5], or by applying an analytic correction [1, 18]. Within the ERM framework, the generalization result from Azizzadenesheli et al. [2] (Theorem 1) depends only on the error of the estimated weights, and hence any method that improves weight estimates tightens this bound.

There are multiple definitions of calibration in the multiclass setting. Guo et al. [10] study the calibration of the arg-max prediction, while Kumar et al. [14] study a notion of per-label calibration. We use canonical calibration [24] and the expected canonical calibration error on the source data defined as follows:

Definition 1 (Canonical calibration). A prediction model  f : X �→ ∆k−1 is canonically calibrated on the source domain if for all  x ∈ X and j ∈ Y, Ps(y = j|f(x)) = fj(x) .

Definition 2 (Expected canonical calibration error). For a predictor f, the expected squared canonical calibration error on the source domain is  E2(f) = Es ∥f − fc∥2, where fc = Ps(y = ·|f(x)).

Calibration methods typically work either by calibrating the model during training or by calibrating a trained classifier on held-out data, post-hoc. We refer the interested reader to Kumar et al. [14] and Guo et al. [10] for detailed studies on calibration. We focus on the latter category of methods. Our experiments follow Alexandari et al. [1], who leverage BCTS 1 to calibrate their models. BCTS extends temperature scaling [10] by incorporating per-class bias terms.

Two families of solutions have been explored that leverage a blackbox predictor: BBSE [16], a moment matching method, uses the predictor �fto compute a confusion matrix  C �f := ps(�y, y) ∈ Rk×kon the source data. Depending on how  �yis defined, there are two types of confusion matrix for a predictor �f: (i) the hard confusion matrix  �y = arg max �f(x); and (ii) the soft confusion matrix, where  �yis defined as a random prediction that follows the discrete distribution �f(x)over Y. Both soft and hard confusion matrix can be estimated from labeled source data samples. The estimate  �w iscomputed as  �w := �C−1�f �µ, where �C �f is the estimate of confusion matrix and  �µis an estimate of  pt(�y),computed by applying the predictor �fto the target data. In a related vein, RLLS [2] incorporates an additional regularization term of the form  ∥w − 1∥and solves a constrained optimization problem to estimate the shift ratios w.

MLLS estimates w as if performing maximum likelihood estimation, but substitutes the predictor outputs for the true probabilities  ps(y|x). Saerens et al. [18], who introduce this procedure, describe it as an application of EM. However, as observed in [8, 1], the likelihood objective is concave, and thus a variety of optimization algorithms may be applied to recover the MLLS estimate. Alexandari et al. [1] also showed that MLLS underperforms BBSE when applied naively, a phenomenon that we shed more light on in this paper.

We now present a unified view that subsumes MLLS and BBSE and demonstrate how each is instantiated under this framework. We also establish identifiability and consistency conditions for MLLS, deferring a treatment of finite-sample issues to Section 5. For convenience, throughout Sections 3 and 4, we use the term calibration exclusively to refer to canonical calibration (Definition 1) on the source data. We relegate all technical proofs to Appendix D.

4.1 A Unified Distribution Matching View

To start, we introduce a generalized distribution matching approach for estimating w. Under label shift, for any (possibly randomized) mapping from X to Z, we have that  ps(z|y) = pt(z|y)since, ps(z|y) = pt(z|y) =�X p(z|x)p(x|y)dx.Throughout the paper, we use the notation p(z|y) to rep- resent either  ps(z|y) or pt(z|y)(which are identical). We now define a family of distributions over Z parameterized by  w ∈ W as

image

where  W = {w | ∀y , wy ≥ 0 and �ky=1 wyps(y) = 1}. When  w = w∗, we have that  pw(z) =pt(z). For fixed p(z|x), pt(z) and ps(z, y)are known because  pt(x) and ps(x, y)are known. So one potential strategy to estimate  w∗ is to find a weight vector w such that

image

At least one such weight vector w must exist as  w∗satisfies (2). We now characterize conditions under which the weight vector w satisfying (2) is unique:

Lemma 1 (Identifiability). If the set of distributions {p(z|y) : y = 1, ..., k} are linearly independent, then for any w that satisfies (2), we must have  w = w∗. This condition is also necessary in general: if the linear independence does not hold then there exists a problem instance where we have  w, w∗ ∈ Wsatisfying (2) while  w ̸= w∗.

Lemma 1 follows from the fact that (2) is a linear system with at least one solution  w∗. This solution is unique when  ps(z, y)is of rank k. The linear independence condition in Lemma 1, in general, is sufficient for identifiability of discrete Z. However, for continuous Z, the linear dependence condition has the undesirable property of being sensitive to changes on sets of measure zero. By changing a collection of linearly dependent distributions on a set of measure zero, we can make them linearly independent. As a consequence, we impose a stronger notion of identifiability i.e., the set of distributions {p(z|y) : y = 1, ..., k} are such that there does not exist  v ̸= 0for which �Z|�y p(z|y)vy|dz = 0.We refer this condition as strict linear independence.

In generalized distribution matching, one can set p(z|x) to be the Dirac delta function at  δx2such that Z is the same space as X, which leads to solving (2) with z replaced by x. In practice where X is high-dimensional and/or continuous, approximating the solution to (2) from finite samples can be hard when choosing z = x. Our motivation for generalizing distribution matching from X to Z is that the solution to (2) can be better approximated using finite samples when Z is chosen carefully. Under this framework, the design of a label shift estimation algorithm can be decomposed into two parts: (i) the choice of p(z|x) and (ii) how to approximate the solution to (2). Later on, we consider how these design choices may affect label shift estimation procedures in practice.

4.2 The Confusion Matrix Approach

If Z is a discrete space, one can first estimate  ps(z, y) ∈ R|Z|×k and pt(z) ∈ R, and then subsequently attempt to solve (2). Confusion matrix approaches use Z = Y, and construct p(z|x) using a black box predictor �f. There are two common choices to construct the confusion matrix: (i) The soft confusion matrix approach: We set  p(z|x) := �f(x) ∈ ∆k−1. We then define a random variable  �y ∼ �f(x)for each x. Then we construct  ps(z, y) = ps(�y, y)and  pt(z) = pt(�y). (ii) The hard confusion matrix approach: Here we set  p(z|x) = δarg max �f(x). We then define a random variable  �y = arg max �f(x)for each x. Then again we have  ps(z, y) = ps(�y, y) and pt(z) = pt(�y).

Since  ps(z, y)is a square matrix, the identifiability condition becomes the invertibility of the confusion matrix. Given an estimated confusion matrix, one can find w by inverting the confusion matrix (BBSE) or minimizing some distance between the vectors on the two sides of (2).

4.3 Maximum Likelihood Label Shift Estimation

When Z is a continuous space, the set of equations in (2) indexed by Z is intractable. In this case, one possibility is to find a weight vector  �wby minimizing the KL-divergence KL(pt(z), pw(z)) = Et [log pt(z)/pw(z)], for  pwdefined in (1). This is equivalent to maximizing the population log-likelihood: �w := arg maxw∈W Et [log pw(z)] .One can further show that  Et [log pw(z)] = Et[log �ky=1 ps(z, y)wy] = Et[log �ky=1 ps(y|z)ps(z)wy] =Et[log �ky=1 ps(y|z)wy] + Et [log ps(z)] .Therefore we can equivalently define:

image

This yields a straightforward convex optimization problem whose objective is bounded from below [1, 8]. Assuming access to labeled source data and unlabeled target data, one can maximize the empirical counterpart of the objective in (3), using either EM or an alternative iterative optimization scheme. Saerens et al. [18] derived an EM algorithm to maximize the objective (3) when z = x, assuming access to  ps(y|x). Absent knowledge of the ground truth  ps(y|x), we can plug in any approximate predictor f and optimize the following objective:

image

In practice, f is fit from a finite sample drawn from  ps(x, y)and standard machine learning methods often produce uncalibrated predictors. While BBSE and RLLS are provably consistent whenever the predictor f yields an invertible confusion matrix, to our knowledge, no prior works have established sufficient conditions to guarantee MLLS’ consistency when f differs from  ps(y|x).

It is intuitive that for some values of  f ̸= ps(y|x), MLLS will yield inconsistent estimates. Supplying empirical evidence, Alexandari et al. [1] show that MLLS performs poorly when f is a vanilla neural network predictor learned from data. However, Alexandari et al. [1] also show that in combination with a particular post-hoc calibration technique, MLLS achieves low error, significantly outperforming BBSE and RLLS. As the calibration error is not a distance metric between f and ps(y|x)(zero calibration error does not indicate  f = ps(y|x)), a calibrated predictor f may still be substantially different from  ps(y|x). Some natural questions then arise:

1. Why does calibration improve MLLS so dramatically? 2. Is calibration necessary or sufficient to ensure the consistency of MLLS? 3. What accounts for the comparative efficiency of MLLS over BBSE? (Addressed in Section 5)

To address the first two questions, we make the following observations. Suppose we define z (for each x) with distribution  p(z|x) := δf(x), for some calibrated predictor f. Then, because f is calibrated, it holds that  ps(y|z) = f(x). Note that in general, the MLLS objective (4) can differ from (3). However, when  p(z|x) := δf(x), the two objectives are identical. We can formalize this as follows:

Lemma 2. If f is calibrated, then the two objectives (3) and (4) are identical when Z is chosen as ∆k−1 and p(z|x)is defined to be  δf(x).

Lemma 2 follows from changing the variable of expectation in (4) from x to f(x) and applying f(x) = ps(y|f(x))(definition of calibration). It shows that MLLS with a calibrated predictor on the input space X is in fact equivalent to performing distribution matching in the space Z. Building on this observation, we now state our population-level consistency theorem for MLLS:

Theorem 1 (Population consistency of MLLS). If a predictor  f : X �→ ∆k−1 is calibrated and the distributions {p(f(x)|y) : y = 1, . . . , k} are strictly linearly independent, then  w∗is the unique maximizer of the MLLS objective (4).

We now turn our attention to establishing consistency of the sample-based estimator. Let x1, x2, . . . , xmiid∼ pt(x). The finite sample objective for MLLS can be written as

image

Theorem 2 (Consistency of MLLS). If f satisfies the conditions in Theorem 1, then  �wfin (5) converges to  w∗ almost surely.

The main idea of the proof of Theorem 2 is to derive a metric entropy bound on the class of functions G =�(f T w)/(f T w + f T w∗)|w ∈ W�to prove Hellinger consistency (Theorem 4.6 [25]). The consistency of MLLS relies on the linear independence of the collection of distributions {p(f(x)|y) : y = 1, . . . , k}. The following result develops several alternative equivalent characterizations of this linear independence condition.

Proposition 1. For a calibrated predictor f, the following statements are equivalent:

image

Proposition 1 shows that with a calibrated predictor, the invertibility condition as required by BBSE (or RLLS) is exactly the same as the linear independence condition required for MLLS’s consistency.

Having provided sufficient conditions, we consider a binary classification example to provide intuition for why we need calibration for consistency. In this example, we relate the estimation error to the miscalibration error, showing that calibration is not only sufficient but also necessary to achieve zero estimation error for a certain class of predictors.

Example 1. Consider a mixture of two Gaussians with  ps(x|y = 0) := N(µ, 1)and  ps(x|y =1) := N(−µ, 1). We suppose that the source mixing coefficients are both 12, while the target mixing coefficients are  α(̸= 12), 1−α. Assume a class of probabilistic threshold classifiers:  f(x) = [1−c, c]for  x ≥ 0, otherwise  f(x) = [c, 1 − c] with c ∈ [0, 1]. Then the population error of MLLS is given by

image

which is zero only if  c = ps(x ≥ 0|y = 0)for a non-degenerate classifier.

The expression for estimation error arising from our example yields two key insights: (i) an uncalibrated thresholded classifier has an estimation error proportional to the true shift in label distribution i.e.  1 − 2α; (ii) the error is also proportional to the canonical calibration error which is  ps(x ≥ 0|y = 0) − c. While earlier in this section, we concluded that canonical calibration is sufficient for consistency, the above example provides some intuition for why it might also be necessary. In Appendix C, we show that marginal calibration [14, 10, 24], a less restricted definition is insufficient to achieve consistency.

4.4 MLLS with Confusion Matrix

So far, we have shown that MLLS with any calibrated predictor can be viewed as distribution matching in a latent space. Now we discuss a method to construct a predictor f to perform MLLS given any p(z|x), e.g., those induced by confusion matrix approaches. Recall, we already have the maximum log-likelihood objective. It just remains to construct a calibrated predictor f from the confusion matrix.

This is straightforward when p(z|x) is deterministic, i.e.,  p(z|x) = δg(x)for some function g: setting f(x) = ps(y|g(x))makes the objectives (3) and (4) to be the same. Recall that for the hard confusion matrix, the induced latent space is  p(z|x) = δarg max �f(x). So the corresponding predictor in MLLS is  f(x) = ps(y|�yx), where  �yx = arg max �f(x). Then we obtain the MLLS objective for the hard confusion matrix:

image

The confusion matrix  C �f and predictor  �fdirectly give us  ps(y|�yx). Given an input x, one can first get �yxfrom �f, then normalize the  �yx-th row of  C �fas  ps(y|�yx). We denote MLLS with hard confusion matrix calibration (6) by MLLS-CM.

When  ps(z|x)is stochastic, we need to extend (4) to allow f to be a random predictor:  f(x) = ps(y|z)for  z ∼ p(z|x)3. To incorporate the randomness of f, one only needs to change the expectation in (4) to be over both x and f(x), then (4) becomes a rewrite of (3).

Proposition 2 indicates that constructing the confusion matrix is a calibration procedure. Thus, the predictor constructed with constructed using confusion matrix is calibrated and suitable for application with MLLS.

Proposition 2 (Vaicenavicius et al. [24]). For any function  g, f(x) = ps(y|g(x))is a calibrated predictor.

We can now summarize the relationship between BBSE and MLLS: A label shift estimator involves two design choices: (i) designing the latent space p(z|x) (which is equivalent to designing a calibrated predictor); and (ii) performing distribution matching in the new space Z. In BBSE, we design a calibrated predictor via the confusion matrix and then perform distribution matching by directly solving linear equations. In general, MLLS does not specify how to obtain a calibrated predictor, but specifies KL minimization as the distribution matching procedure. One can apply the confusion matrix approach to obtain a calibrated predictor and then plug it into MLLS, which is the BBSE analog under MLLS, and is a special case of MLLS.

We now analyze the performance of MLLS estimator. Even when  w∗is the unique optimizer of (4) for some calibrated predictor f, assuming convex optimization can be done perfectly, there are still two sources of error preventing us from exactly computing  w∗in practice. First, we are optimizing a sample-based approximation (5) to the objective in expectation (4). We call this source of error finite-sample error. Second, the predictor f we use may not be perfectly calibrated on the source distribution as we only have access to samples from source data distribution  ps(x, y). We call this source of error miscalibration error. We will first analyze how these two sources of errors affect the estimate of  w∗ separately and then give a general error bound that incorporates both. All proofs are relegated to Appendix E.

Before presenting our analysis, we introduce some notation and regularity assumptions. For any predictor  f : X �→ ∆k−1, we define  wfand  �wfas in (4) and (5). If f satisfies the conditions in Theorem 2 (calibration and linear independence) then we have that  wf = w∗. Our goal is to bound ∥ �wf − w∗∥for a given (possibly miscalibrated) predictor f. We now introduce a regularity condition: Condition 1 (Regularity condition for a predictor f). For any x within the support of  pt(x), i.e. pt(x) > 0, we have both  f(x)T wf ≥ τ, f(x)T w∗ ≥ τfor some universal constant  τ > 0.

Condition 1 is mild if f is calibrated since in this case  wf = w∗ is the maximizer of  Et�log f(x)T w�,and the condition is satisfied if the expectation is finite. Since  f(x)T w∗and  f(x)T wfare upperbounded (they are the inner products of two vectors which sum to 1), they also must be lower-bounded away from 0 with arbitrarily high probability without any assumptions. For miscalibrated f, a similar justification holds for assumption that  f(x)T wfis lower bounded. Turning our attention to the assumption that  f(x)T w∗is lower bounded, we note that it is sufficient if f is close (pointwise) to some calibrated predictor. This in turn is a reasonable assumption on the actual predictor we use for MLLS in practice as it is post-hoc calibrated on source data samples.

Define  σf,wto be the minimum eigenvalue of the Hessian  −∇2wL(w, f). To state our results compactly we use standard stochastic order notation (see, for instance, [26]). We first bound the estimation error introduced by only having finite samples from the target distribution in Lemma 3. Next, we bound the estimation error introduced by having a miscalibrated f in Lemma 4.

Lemma 3. For any predictor f that satisfies Condition 1, we have  ∥wf − �wf∥ ≤ σ−1f,wf Op�m−1/2�.

Lemma 4. For any predictor f and any calibrated predictor  fcthat satisfies Condition 1, we have ∥wf − w∗∥ ≤ σ−1f,w∗ · C · Et [∥f − fc∥] ,for some constant C.

If we set  fc(x) = ps(y|f(x)), which is a calibrated predictor (Proposition 2), we can bound the error in terms of the calibration error of f on the source data 4:  ∥wf − w∗∥ ≤ σ−1f,w∗ · C · E(f) .

Note that since  ps(y) > 0 for all y, we can upper-bound the error in Lemma 4 with calibration error on the source data. We combine the two sources of error to bound the estimation error  ∥ �wf − w∗∥:Theorem 3. For any predictor f that satisfies Condition 1, we have

image

The estimation error of MLLS can be decomposed into (i) finite-sample error, which decays at a rate of  m−1/2; and (ii) the calibration error of the predictor that we use. The proof is a direct combination of Lemma 3 and Lemma 4 applied to the same f with the following error decomposition:

image

Theorem 3 shows that the estimation error depends inversely on the minimum eigenvalue of the Hessian at two different points  wf and w∗. One can unify these two eigenvalues as a single quantity σf, the minimum eigenvalue  Et�f(x)f(x)T �. We formalize this observation in Appendix E.

If we use the post-hoc calibration procedure (as discussed in Section 2 and A) to calibrate a blackbox predictor �f, we can obtain a bound on the calibration error of f. In more detail, suppose that the class G used for calibration satisfies standard regularity conditions (injectivity, Lipschitz-continuity, twice differentiability, non-singular Hessian). We have the following lemma:

Lemma 5. Let  f = g◦ �fbe the predictor after post-hoc calibration with squared loss l and g belongs to a function class G that satisfies the standard regularity conditions, we have

image

image

Figure 1: (top) MSE vs the degree of shift; For GMM, we control the shift in the label marginal for class 1 with a fixed target sample size of 1000. For multiclass problems—-MNIST and CIFAR-10, we control the Dirichlet shift parameter with a fixed sample size of 5000. (bottom) MSE (in log scale) vs target sample size; For GMM, we fix the label marginal for class 1 at 0.01 whereas for multiclass problems, MNIST and CIFAR-10, we fix the Dirichlet parameter to 0.1. In all plots, MLLS dominates other methods. All confusion matrix approaches perform similarly, indicating that the advantage of MLLS comes from the choice of calibration but not the way of performing distribution matching.

This result is similar to Theorem 4.1 [14]. For a model class G that is rich enough to contain a function  g ∈ Gthat achieves zero calibration error, i.e.,  ming∈G E(g ◦ �f) = 0, then we obtain an estimation error bound for MLLS of  σ−1f · Op�m−1/2 + n−1/2�. This bound is similar to rate of RLLS and BBSE, where instead of  σfthey have minimum eigenvalue of the confusion matrix.

The estimation error bound explains the efficiency of MLLS. Informally, the error of MLLS depends inversely on the minimum eigenvalue of the Hessian of the likelihood  σf. When we apply coarse calibration via the confusion matrix (in MLLS-CM), we only decrease the value of  σf. Coarse calibration throws away information [13] and thus results in greater estimation error for MLLS. In Section 6, we empricially show that MLLS-CM’s performance is similar to that of BBSE. Moreover, on a synthetic Gaussian mixture model, we show that the minimum eigenvalue of the Hessian obtained using confusion matrix calibration is smaller than the minimum eigenvalue obtained with more granular calibration. Our analysis and observations together suggest MLLS’s superior performance than BBSE (or RLLS) is due to the granular calibration but not due to the difference in the optimization objective.

Finally, we want to highlight one minor point regarding applicability of our result. If f is calibrated, Theorem 3, together with Proposition 3 (in Appendix E), implies that MLLS is consistent if  Et�f(x)f(x)T �is invertible. Compared to the consistency condition in Theorem 1 that Es�f(x)f(x)T �is invertible (together with Proposition 1), these two conditions are the same if the likelihood ratio  pt(f(x))/ps(f(x))is lower-bounded. This is true if all entries in  w∗are non-zero. Even if  w∗ contains non-zero entries, the two conditions are still the same if there exists some  w∗y > 0such that p(f(x)|y) covers the full support of  ps(f(x)). In general however, the invertibility of Et�f(x)f(x)T �is a stronger requirement than the invertibility of  Es�f(x)f(x)T �. We leave further investigation of this gap for future work.

We experimentally illustrate the performance of MLLS on synthetic data, MNIST [15], and CIFAR10 [12]. Following Lipton et al. [16], we experiment with Dirichlet shift simulations. On each run, we sample a target label distribution  pt(y)from a Dirichlet with concentration parameter  α. We then generate each target example by first sampling a label  y ∼ pt(y)and then sampling (with replacement) an example conditioned on that label . Note that smaller values of alpha correspond to more severe shift. In our experiments, the source label distribution is uniform.

First, we consider a mixture of two Gaussians (as in Example in Section 4.3) with  µ = 1. With CIFAR10 and MNIST, we split the full training set into two subsets: train and valid, and use the provided test set as is. Then according to the label distribution, we randomly sample with replacement train, valid, and test set from each of their respective pool to form the source and target set. To learn the black box predictor on real datasets, we use the same architecture as Lipton et al. [16] for MNIST, and for CIFAR10 we use ResNet-18 [11] as in Azizzadenesheli et al. [2]5. For simulated data, we use the true  ps(y|x)as our predictor function. For each experiment, we sample 100 datasets for each shift parameter and evaluate the empirical MSE and variance of the estimated weights.

We consider three sets of experiments: (1) MSE vs degree of target shift; (2) MSE vs target sample sizes; and (3) MSE vs calibrated predictors on the source distribution. We refer to MLLS-CM as MLLS with hard confusion matrix calibration as in (6). In our experiments, we compare MLLS estimator with BBSE, RLLS, and MLLS-CM. For RLLS and BBSE, we use the publicly available code 6. To post-hoc calibration, we use BCTS [1] on the held-out validation set. Using the same validation set, we calculate the confusion matrix for BBSE, RLLS, and MLLS-CM.

We examine the performance of various estimators across all three datasets for various target dataset sizes and shift magnitudes (Figure 1). Across all shifts, MLLS (with BCTS-calibrated classifiers) uniformly dominates BBSE, RLLS, and MLLS-CM in terms of MSE (Figure 1). Observe for severe shifts, MLLS is comparatively dominant. As the available target data increased, all methods improve rapidly, with MLLS outperforming all other methods by a significant margin. Moreover, MLLS’s advantages grow more pronounced under extreme shifts. Notice MLLS-CM is roughly equivalent to BBSE across all settings of dataset, target size, and shift magnitude. This concludes MLLS’s superior performance is not because of differences in loss function used for distribution matching but due to differences in the granularity of the predictions, caused by crude confusion matrix aggregation.

Note that given a predictor  f1, we can partition our input space and produce another predictor  f2that, for any datapoint gives the expected output of  f1on points belonging to that partition. If  f1is calibrated, then  f2will also be calibrated [24]. On synthetic data, we vary the granularity of calibration (for MLLS) by aggregating  ps(y|x)over a variable number of equal-sized bins. With more bins, less information is lost due to calibration. Consequently, the minimum eigenvalue of the Hessian increases and the MSE decreases, supporting our theoretical bounds (Figure 2). We also verify that the confusion matrix calibration performs poorly (Figure 2). For MLLS-CM, the minimum eigenvalue of the Hessian is 0.195, significantly smaller than for the binned predictor for #bin  ≥ 4. Thus, the poor performance of MLLS-CM is predicted by its looser upper bound per our analysis. Note that these experiments presume access to the true predictor  ps(y|x)and thus the

MSE strictly improves with the number of bins. In practice, with a fixed source dataset size, increasing

the number of bins could lead to overfitting, worsening our calibration.

This paper provides a unified framework relating techniques that use off-the-shelf predictors for label shift estimation. We argue that these methods all employ calibration, either explicitly or implicitly, differing only in the choice of calibration method and their optimization objective. Moreover, with our analysis we show that the choice of calibration method (and not the optimization objective for distribution matching) accounts for the advantage of MLLS with BCTS calibration over BBSE. In future work, we hope to operationalize these insights to provide guidance for a calibration scheme to improve label shift estimation.

This paper investigates the (statistical) consistency and efficiency of two existing methods for estimating target domain label distributions. While this could potentially guide practitioners to improve detection, estimation, and classification in applications where the label shift assumption holds, we do not believe that it will fundamentally impact how machine learning is used in a way that could conceivably be socially salient. While we take the potential impact of machine learning on society seriously, we believe that this work, which addresses a foundational theoretical problem, does not present a significant societal concern.

We thank Zico Kolter and David Childers for their helpful feedback. This material is based on research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Laboratory, DARPA or the U.S. Government. SB acknowledges funding from the NSF grants DMS-1713003 and CIF-1763734. ZL acknowledges Amazon AI, Salesforce Research, Facebook, UPMC, Abridge, and the Center for Machine Learning and Health for their generous support of ACMI Lab’s research on machine learning under distribution shift.

[1] A. Alexandari, A. Kundaje, and A. Shrikumar. Adapting to label shift with bias-corrected calibration. In arXiv preprint arXiv:1901.06852, 2019.

[2] K. Azizzadenesheli, A. Liu, F. Yang, and A. Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations (ICLR), 2019.

[3] S. Ben-David, T. Lu, T. Luu, and D. Pál. Impossibility Theorems for Domain Adaptation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[4] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.

[5] J. Byrd and Z. C. Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning (ICML), 2019.

[6] C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014.

[7] C. Cortes, Y. Mansour, and M. Mohri. Learning Bounds for Importance Weighting. In Advances in Neural Information Processing Systems (NIPS), 2010.

[8] M. C. Du Plessis and M. Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Networks, 50:110–119, 2014.

[9] A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schölkopf. Covariate Shift by Kernel Mean Matching. Journal of Machine Learning Research (JMLR), 2009.

[10] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), 2017.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

[12] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, Citeseer, 2009.

[13] V. Kuleshov and P. S. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

[14] A. Kumar, P. S. Liang, and T. Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86, 1998.

[16] Z. C. Lipton, Y.-X. Wang, and A. Smola. Detecting and Correcting for Label Shift with Black Box Predictors. In International Conference on Machine Learning (ICML), 2018.

[17] H. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pages 2052–2060, 2016.

[18] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 2002.

[19] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On Causal and Anticausal Learning. In International Conference on Machine Learning (ICML), 2012.

[20] H. Shimodaira. Improving Predictive Inference Under Covariate Shift by Weighting the LogLikelihood Function. Journal of Statistical Planning and Inference, 2000.

[21] C. M. Stein. Estimation of the mean of a multivariate normal distribution. The annals of Statistics, pages 1135–1151, 1981.

[22] A. Storkey. When Training and Test Sets Are Different: Characterizing Learning Transfer. Dataset Shift in Machine Learning, 2009.

[23] J. A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 2015.

[24] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. B. Schön. Evaluating model calibration in classification. In International Conference on Machine Learning (ICML), 2019.

[25] S. van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.

[26] A. W. van der Vaart and J. A. Wellner. Weak convergence. In Weak convergence and empirical processes. Springer, 1996.

[27] B. Zadrozny. Learning and Evaluating Classifiers Under Sample Selection Bias. In International Conference on Machine Learning (ICML), 2004.

[28] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain Adaptation Under Target and Conditional Shift. In International Conference on Machine Learning (ICML), 2013.

image

Step 1. description. Let the model class used for post-hoc calibration be represented by G. Given a validation dataset  {(xv1, yv1), . . . , (xvn, yvn)}sampled from the source distribution  Pswe compute, {( �f(xv1), yv1), ( �f(xv2), yv2), . . . , ( �f(xvn), yvn)}, applying our classifier �fto the data. Using this we estimate a function,

image

where the loss function  ℓcan be the negative log-likelihood or squared error. Experimentally we observe same performance with both the loss functions. Subsequently, we can apply the calibrated predictor  �g ◦ �f.

Our experiments follow Alexandari et al. [1], who leverage BCTS 7 to calibrate their models. BCTS extends temperature scaling [10] by incorporating per-class bias terms. Formally, a function g : ∆k−1 �→ ∆k−1 in the BCTS class G, is given by

image

where  {T, b1, . . . , b|Y|} are the |Y| + 1parameters to be learned.

Dataset shifts are predominantly studied under two scenarios: covariate shift and label shift [22]. Schölkopf et al. [19] articulates connections between label shift and covariate shift with anti-causal and causal models respectively. Covariate shift is well explored in past [28, 27, 7, 6, 9].

Approaches for estimating label shift (or prior shift) can be categorized into three classes:

1. Methods that leverage Mixture Proportion Estimation (MPE) [4, 17] techniques to estimate the target label distribution. MPE estimate in general (e.g. Blanchard et al. [4]) needs explicit calculations of  ps(x|y)(= pt(x|y))which is infeasible for high dimensional data. More recent methods for MPE estimation, i.e. Ramaswamy et al. [17], uses Kernel embeddings, which like many kernel methods, require the inversion of an  n × nGram matrix. The  O(n3)complexity makes them infeasible for large datasets, practically used in deep learning these days;

2. Methods that directly operate in RKHS for distribution matching [28, 8]. Zhang et al. [28] extend the kernel mean matching approach due to Gretton et al. [9] to the label shift problem. Instead of minimizing maximum mean discrepancy, Du Plessis and Sugiyama [8] explored minimizing PE divergence between the kernel embeddings to estimate the target label distribution. Again, both the methods involve inversion of an  n × nkernel matrix, rendering them infeasible for large datasets; and

3. Methods that work in low dimensional setting [16, 2, 18] by directly estimating  pt(y)/ps(y)to avoid the curse of dimensionality. These methods leverage an off-the-shelf predictor to estimate the label shift ratio.

In this paper, we primarily focus on unifying methods that fall into the third category.

In this section, we will illustrate insufficiency of marginal calibration to achieve consistency. For completeness, we first define margin calibration:

Definition 3 (Marginal calibration). A prediction model  f : X �→ ∆k−1 is marginally calibrated on the source domain if for all  x ∈ X and j ∈ Y,

image

Intuitively, this definition captures per-label calibration of the classifier which is strictly less restrictive than requiring canonical calibration. In the example, we construct a classifier on discrete X which is marginally calibrated, but not canonically calibrated. With the constructed example, we show that the population objective (4) yields inconsistent estimates.

Example. Assume  X = {x1, x2, x3, x4, x5, x6} and Y = {1, 2, 3}. Suppose the predictor f(x) and Ps(y|f(x))are given as,

image

Clearly, the prediction f(x) is marginally calibrated. We have one more degree to freedom to choose, which is the source marginal distribution on X. For simplicity let’s assume  ps(xi) = 1/6for all i = {1, . . . , 6}. Thus, we have  ps(y = j) = 1/3 for all j = {1, 2, 3}. Note, with our assumption of the source marginal on x, we get  Pt(xi|y = j) = Ps(xi|y = j) = Ps(y = j|f(xi))/2. This follows as  x �→ f(x)is an one-to-one mapping.

Now, assume a shift i.e. prior on Y for the target distribution of the form  [α, β, 1 − α − β]. With thelabel shift assumption, we get

image

Assume the importance weight vector as w. Clearly, we have  w1 + w2 + w3 = 3. Re-writing the population MLLS objective (4), we get the maximisation problem as

image

Differentiating (10) with respect to  w1and  w2, we get two high order equations, solving which give us the MLLS estimate  wf. To show inconsistency, it is enough to consider one instantiation of  αand  βsuch that  |3α − w1| + |3β − w2| + |w1 + w2 − 3α − 3β| ̸= 0. Assuming  α = 0.8and β = 0.1and solving (10) using numerical methods, we get  wf = [2.505893, 0.240644, 0.253463]. As  w = [2.4, 0.3, 0.3], we have wf ̸= wconcluding the proof.

Lemma 1 (Identifiability). If the set of distributions {p(z|y) : y = 1, ..., k} are linearly independent, then for any w that satisfies (2), we must have  w = w∗. This condition is also necessary in general: if the linear independence does not hold then there exists a problem instance where we have  w, w∗ ∈ Wsatisfying (2) while  w ̸= w∗.

Proof. First we prove sufficiency. If there exists  w ̸= w∗such that (2) holds, then we have �ky=1 ps(z, y)(wy−w∗y) = 0 for all z ∈ Z. As w−w∗is not the zero vector,  {ps(z, y), y = 1, ..., k}are linearly dependent. Since  ps(z, y) = ps(y)p(z|y)and  ps(y) > 0for all y (by assumption), we also have that {p(z|y), y = 1, ..., k} are linearly dependent. By contradiction, we show that the linear independence is necessary.

To show necessity, assume  w∗y = 1kps(y) for y = 1, ..., k. We know that  w∗ satisfies (2)by definition. If linear independence does not hold, then there exists a vector  v ∈ Rksuch that  v ̸= 0and �ky=1 ps(z, y)vy = 0 for all z ∈ Z. Since the  w∗we construct is not on the boundary of W, we can scale v such that  w∗ + αv ∈ Wwhere  α ≥ 0and  v ̸= 0. Therefore, setting  w = w∗ + αvgives another solution for (2), which concludes the proof.

Lemma 2. If f is calibrated, then the two objectives (3) and (4) are identical when Z is chosen as ∆k−1 and p(z|x)is defined to be  δf(x).

Proof. The proof follows a sequence of straightforward manipulations. In more detail,

image

where the final step uses the fact that f is calibrated.

image

Theorem 1 (Population consistency of MLLS). If a predictor  f : X �→ ∆k−1 is calibrated and the distributions {p(f(x)|y) : y = 1, . . . , k} are strictly linearly independent, then  w∗is the unique maximizer of the MLLS objective (4).

Proof. According to Lemma 2 we know that maximizing (4) is the same as maximizing (3) with p(z|x) = δf(x), thus also the same as minimizing the KL divergence between  pt(z) and pw(z). Sincept(z) ≡ pw∗(z)we know that  w∗is a minimizer of the KL divergence such that the KL divergence is 0. We also have that  KL(pt(z), pw(z)) = 0if and only if  pt(z) ≡ pw(z), so all maximizers of (4) should satisfy (2). According to Lemma 1, if the strict linear independence holds, then  w∗ is theunique solution of (2). Thus  w∗ is the unique maximizer of (4).

image

Proposition 1. For a calibrated predictor f, the following statements are equivalent:

image

Proof. We first show the equivalence of (1) and (2). If f is calibrated, we have  ps(f(x))fy(x) =ps(y)p(f(x)|y) for any x, y. Then for any vector  v ∈ Rk we have

image

On the other hand, we can have

image

If {p(f(x)|y) : y = 1, . . . , k} are linearly dependent, then there exist  v ̸= 0such that (11) is zero for any x. Consequently, there exists a non-zero vector u with  uy = vy/ps(y)such that uT f(x) = 0for any x satisfying  ps(f(x)) > 0, which means  uT Es�f(x)f(x)T �u = 0and thus Es�f(x)f(x)T �is not invertible. On the other hand, if  Es�f(x)f(x)T �is non-invertible, then there exist some  u ̸= 0such that  uT Es�f(x)f(x)T �u = 0. Further as  uT Es�f(x)f(x)T �u =�uT f(x)f(x)T u ps(x)dx =� ��f(x)T u�� ps(x)dx. As a result, the vector v with  vy = ps(y)uysatisfies that (11) is zero for any x, which means {p(f(x)|y) : y = 1, . . . , k} are not strictly linearly independent.

Let C be the soft confusion matrix of f, then

image

Therefore, we have  C = Es�f(x)f(x)T �, which means (2) and (3) are equivalent.

image

We introduce some notation before proving consistency. Let  P = {⟨f, w⟩|w ∈ W}be the class of densities8 for a given calibrated predictor f. Suppose  �pn, p0 ∈ Pare densities corresponding to MLE estimate and true weights, respectively. We use  h(p1, p2)to denote the Hellinger distance and TV(p1, p2)to denote the total variation distance between two densities  p1, p2. Hr(δ, P, P)denotes  δ-entropy for class P with respect to metric  Lr(P). Similarly,  Hr,B(δ, P, P)denotes the corresponding bracketing entropy. Moreover,  Pndenotes the empirical random distribution that puts uniform mass on observed samples  x1, x2, . . . xn. Before proving consistency we need to re-state two results:

Lemma 6 (Lemma 2.1 [25]). If P is a probability measure, for all  1 ≤ r < ∞, we have

image

Lemma 7 (Corollary 2.7.10 [26]). Let F be the class of convex functions  f : C �→ [0, 1] defined ona compact, convex set  C ⊂ Rd such that |f(x) − f(y)| ≤ L ∥x − y∥for every x,y. Then

image

for a constant K that depends on the dimension d and C.

We can now present our proof of consistency, which is based on Theorem 4.6 from van de Geer [25]:

image

then  h(�pn, p0) → 0almost surely. Theorem 2 (Consistency of MLLS). If f satisfies the conditions in Theorem 1, then  �wfin (5) converges to  w∗ almost surely.

Proof. Assume the maximizer of (5) is  �wf and p0 = ⟨f, w∗⟩. Define class  G =� 2pp+p0 |p ∈ P�. Toprove consistency, we first bound the bracketing entropy for class G using Lemma 6 and Lemma 7. Clearly P is linear in parameters and hence, convex. Gradient of function  g ∈ Gis given by 2p0(p+p0)2which in turn is bounded by 2p0 . Under assumptions of Condition 1, the functions in G are Lipschitz with constant  2/τ. We can bound the bracketing entropy  H2,B(δ, G , P)using Lemma 7 and Lemma 6 as

image

for some constant  K1that depends on k.

On the other hand, for cases where  p0can be arbitrarily close to zero, i.e., Condition 1 doesn’t hold true, we define  τ(δ) and Gτ as

image

Assume  τ(δ)such that (14) is satisfied. Using (15), we have

image

Thus, for the cases where  p0can be arbitrarily close to zero, instead of bounding  H2,B(δ, G , P), we we bound  HB(δ, Gτ(δ), P). For any  δ > 0, there is a compact subset  Kδ ∈ X, such that ps(X \ Kδ) < δ. Using arguments similar to above, function  g ∈ Gτ(δ)is Lipschitz with constant 2/τ(δ) > 0. Again using Lemma 7 and Lemma 6, we conclude

image

for some constant  K2that depends on k. Finally, we use Lemma 8 to conclude  h(�pn, p0) →a.s. 0. Further, as TV(�pn, p0) ≤ h(�pn, p0), we have h(�pn, p0) →a.s. 0implies TV(�pn, p0) →a.s. 0. Further

image

where  λminis the minimum eigenvalue of covariance matrix��f(x)f(x)T ps(x)dx�. Note using Proposition 1, we have  λmin > 0. Thus, we conclude  ∥ �wf − w∗∥ →a.s. 0.

Example 1. Consider a mixture of two Gaussians with  ps(x|y = 0) := N(µ, 1)and  ps(x|y =1) := N(−µ, 1). We suppose that the source mixing coefficients are both 12, while the target mixing coefficients are  α(̸= 12), 1−α. Assume a class of probabilistic threshold classifiers:  f(x) = [1−c, c]for  x ≥ 0, otherwise  f(x) = [c, 1 − c] with c ∈ [0, 1].

Then the population error of MLLS is given by

image

which is zero only if  c = ps(x ≥ 0|y = 0)for a non-degenerate classifier.

Proof. The intuition behind the construction is, for such an Example, we can get a closed form solution for the population MLLS and hence allows a careful analysis of the estimation error. The classifier f(x) predicts class 0 with probability c and class 1 with probability  1 − cfor  x ≥ 0, and vice-versa for x < 0. Using such a classifier, the weight estimator is given by:

image

where equality (i) follows from  w1 = 2 − w0and the predictor function and (ii) follows from the fact that within each integral, the term inside the log is independent of x. Differentiating w.r.t. to  w0, wehave:

image

which gives  w0 = 2pt(x≤0)−2c1−2c. Thus for the population MLLS estimate, the estimation error is given by

image

The gradient of the MLLS objective can be written as

image

and the Hessian is

image

We use  λmin(X)to denote the minimum eigenvalue of the matrix X.

Lemma 9 (Theorem 5.1.1 [23]). Let  X1, X2, . . . , Xnbe a finite sequence of identically distributed independent, random, symmetric matrices with common dimension k. Assume  0 ⪯ X ⪯ R · Iand µminI ⪯ E [X] ⪯ µmaxI. With probability at least  1 − δ,

image

Lemma 3. For any predictor f that satisfies Condition 1, we have  ∥wf − �wf∥ ≤ σ−1f,wf Op�m−1/2�.

Proof. We present our proof in two steps. Step-1 is the non-probabilistic part, i.e., bounding the error ∥ �wf − wf∥in terms of the gradient difference  ∥∇wL(wf, f) − ∇wLm(wf, f)∥. This step uses Taylor’s expansion upto second order terms for empirical log-likelihood around the true  w∗. Step-2involves deriving a concentration on the gradient difference using the Lipschitz property implied by Condition 1. Combining these two steps along with Lemma 22 concludes the proof. Now we detail each of these steps.

Step-1. We represent the empirical Negative Log-Likelihood (NLL) function with  Lmby absorbing the negative sign to simplify notation. Using a Taylor expansion, we have

Lm( �wf, f) = Lm(wf, f) + ⟨∇wLm(wf, f), �wf − wf⟩ + 12( �wf − wf)T ∇2wLm( �w, fc)( �wf − wf) ,where �w ∈ [ �wf, wf]. With the assumption  f T wf ≥ τ, we have  ∇2wLm( �w, f) ≥τ 2min ps(y)2 ∇2wLm(wf, f). Let κ = τ 2min ps(y)2. Using this we get,

Lm( �wf, f) ≥ Lm(wf, f) + ⟨∇wLm(wf, f), �wf − wf⟩ + κ2 ( �wf − wf)T ∇2wLm(wf, f)( �wf − wf)

image

where term-I is less than zero as  �wfis the minimizer of empirical NLL  Lm( �wf, f). Ignoring term-I and re-arranging a few terms we get:

image

Using Holder’s inequality on the LHS we have,

image

Let  �σf,wfbe the minimum eigenvalue of  ∇2wLm(w∗, fc). Using the fact that  ( �wf −wf)T ∇2wLm(wf, f)( �wf − wf) ≥ �σmin ∥ �wf − wf∥2, we get,

image

Step-2. The empirical gradient is  ∇wLm(wf, f) = �mi=1∇wL1(xi,wf ,f)mwhere  ∇L1(xi, wf, f) =

image

the gradient terms as

image

As the gradient terms decompose and are independent, using Hoeffding’s inequality we have with probability at least  1 − δ2,

image

Let  σf,wfbe the minimum eigenvalue of  ∇2wL(wf, f). Using lemma 9, with probability at least

image

Plugging (21) and (22) in (20), and applying a union bound, we conclude that with probability at least  1 − δ,

image

Neglecting the order m term and letting  c = 1κτ , we have

image

Lemma 4. For any predictor f and any calibrated predictor  fcthat satisfies Condition 1, we have ∥wf − w∗∥ ≤ σ−1f,w∗ · C · Et [∥f − fc∥] ,for some constant C.

If we set  fc(x) = ps(y|f(x)), which is a calibrated predictor (Proposition 2), we can bound the error in terms of the calibration error of f on the source data 9:  ∥wf − w∗∥ ≤ σ−1f,w∗ · C · E(f) .

Proof. We present our proof in two steps. Note, all calculations are non-probabilistic. Step-1 involves bounding the error  ∥wf − w∗∥in terms of the gradient difference  ∥∇wL(w∗, fc) − ∇wL(w∗, f)∥. This step uses Taylor’s expansion on  L(wf, f)upto the second orderth term for population log-likelihood around the true  w∗. Step-2 involves deriving a bound on the gradient difference in terms of the difference  ∥f − fc∥using the Lipschitz property implied by Condition 1. Further, for a crude calibration choice of  fc(x) = ps(·|x), the gradient difference can be bounded by miscalibration error. We now detail both of these steps.

Step-1. Similar to Lemma 3, we represent with L by absorbing the negative sign to simplify notation. Using the Taylor expansion, we have

image

where �w ∈ [wf, w∗]. With the assumption  f T w∗ ≥ τ, we have  ∇2wL( �w, f) ≥τ 2min ps(y)2 ∇2wL(w∗, f) . Let κ = τ 2min ps(y)2. Using this we get,

image

where term-I is less than zero as  wfis the minimizer of NLL L(w, f). Ignoring that term and re-arranging a few terms we get

image

As before, let  σf,wbe the minimum eigenvalue of  ∇2wL(w∗, f). Using the fact that  (wf −

image

Using Holder’s inequality on the LHS and re-arranging terms gives

image

Step-2. By lower bound assumptions  f Tc w∗ ≥ τ and f T w∗ ≥ τ, we have

image

where the first inequality is implied by Jensen’s inequality and the second is implied by the Lipschitz property of the gradient. Further, we have

image

image

Combining equations (23), (24), and (25), we have

image

Further, if we set  fc(x) = ps(·|f(x)), which is a calibrated predictor according to Proposition 2, we can bound the error on the RHS in terms of the calibration error of f. Moreover, in the label shift estimation problem, we have the assumption that  ps(y) ≥ c > 0for all y. Hence, we have maxy pt(y)/ps(y) ≤ 1/c. Using Jensen’s inequality, we get

image

Plugging (27) back in (26),we get the required upper bound.

Proposition 3. For any  w ∈ W, we have  σf,w ≥ ps,minσfwhere  σfis the minimum eigenvalue of  Et�f(x)f(x)T �and  ps,min = miny∈Y ps(y). Furthermore, if f satisfies Condition 1, we have p2s,min · σf ≤ σf,w ≤ τ −2 · σf , for w ∈ {wf, w∗}.

Proof. For any  v ∈ Rk, we have

image

where

image

and

image

if f satisfies Condition 1 and  w ∈ {wf, w∗}. Therefore, we have

image

for w  ∈ {wf, w∗}.

image

Lemma 5. Let  f = g◦ �fbe the predictor after post-hoc calibration with squared loss l and g belongs to a function class G that satisfies the standard regularity conditions, we have

image

Proof. Assume regularity conditions on the model class  Gθ(injectivity, Lipschitz-continuity, twice differentiability, non-singular Hessian, and consistency) as in Theorem 5.23 of Stein [21] hold true. Using the injectivity property of the model class as in Kumar et al. [14], we have for all  g1, g2 ∈ G,

image

Let  �g, g∗ ∈ Gbe models parameterized by �θand  θ∗, respectively. Using the strong convexity of the empirical mean squared error we have,

image

where  µis the parameter constant for strong convexity. Re-arranging a few terms, we have

image

where term-I is less than zero because  �gis the empirical minimizer of the mean-squared error. Ignoring term-I, we get:

image

As the assumed model class is Lipschitz w.r.t.  θ, the gradient is bounded by Lipschitz constant L = c1. E [∇θMSEn(g∗)] = 0as  g∗is the population minimizer. Using Hoeffding’s bound for bounded functions, we have with probability at least  1 − δ,

image

Using the smoothness of the MSE(g), we have

image

where  c2is the operator norm of the  ∇2MSE(g∗). Combining (28), (29), and (30), we have for some universal constant  c = c1c2µ2with probability at least  1 − δ,

image

Moreover, with Lemma 4, depending on the degree of the miscalibration and the method involved to calibrate, we can bound the E(f). For example, if using vector scaling on a held out training data for calibration, we can use Lemma 5 to bound the calibration error E(f), i.e., with probability at least

image

Plugging (27) and (31) into (26), we have with probability at least  1 − δ that

image


Designed for Accessibility and to further Open Science