Robust Covariate Shift Prediction with General Losses and Feature Views

Covariate shift relaxes the widely-employed independent and identically distributed (IID) assumption by allowing different training and testing input distributions. Unfortunately, common methods for addressing covariate shift by trying to remove the bias between training and testing distributions using importance weighting often provide poor performance guarantees in theory and unreliable predictions with high variance in practice. Recently developed methods that construct a predictor that is inherently robust to the difficulties of learning under covariate shift are restricted to minimizing logloss and can be too conservative when faced with high-dimensional learning tasks. We address these limitations in two ways: by robustly minimizing various loss functions, including non-convex ones, under the testing distribution; and by separately shaping the influence of covariate shift according to different feature-based views of the relationship between input variables and example labels. These generalizations make robust covariate shift prediction applicable to more task scenarios. We demonstrate the benefits on classification under covariate shift tasks.

The independent and identically distributed (IID) assumption employed widely across machine learning methods requires the testing data distribution to be the same as the training data distribution. This is quite restrictive in the sense that shift can occur between the training distribution and testing distribution in many settings, which makes models built on the IID assumption inappropriate. Specifically, the predictor minimizing (regularized) loss defined on the training samples provides no performance guarantees when applied to the testing distribution (Zadrozny, 2004; Fan et al., 2005). Though nothing can be learned when the shift between training and testing data is arbitrary, certain assumptions about how training and testing distributions differ allow reasonable adaptive learning methods to be derived (Blitzer et al., 2008). One of the common assumptions is that the bias just comes from the input variables. In this setting, also known as covariate shift, only the distribution of inputs,  Ptrain(x)and  Ptest(x), differ, while the conditional label distribution, P(y|x), is the same under both the training and the testing distributions. This assumption is much weaker than the IID assumption and covers a broad range of real application scenarios.

The most prevalent methods for addressing covariate shift attempt to debias the training data by reweighting it using a density ratio,  Ptest(x)/Ptrain(x). This approach tends to work well when the training and the testing distributions are fairly similar and large amounts of training samples are available. It also enjoys asymptotical consistency guarantee provided an infinite amount of training data. However, when these conditions are violated, i.e., there is only a limited amount of training data and/or significant differences between the training and testing distributions, some of the density ratios for training examples can be extremely large. This leads to high-variance estimates that extrapolate heavily from scant amounts of training data and a lack of generalization guarantees for the resulting predictor (Cortes et al., 2010, 2008).

Recently developed robust covariate shift methods take a worst-case approach, constructing a predictor that (approximately) matches training data statistics, but is otherwise the most uncertain on the testing distribution (Liu and Ziebart, 2014; Chen et al., 2016). These methods were built by minimizing the worst case expected target logloss and obtain a parametric form of the predicted output labels’ probability distributions. Unfortunately, log loss may not be of interest for many applications and robust accuracy maximization is instead desired, for example. In addition, these previous methods can be too robust, providing overly conservative predictions that are nearly uniform on portions of the testing input space that lacks training support, particularly when the dimensionality of the input space is large—a situations that is problematic for importance weighted loss minimization as well. In this paper, we introduce two generalization of robust covariate shift clas-sification. First, we robustly minimize other loss functions, like the 0-1 loss, under covariate shift. Second, we enable a better balance between making generalization assumptions about data and providing robustness to uncertainty by introducing feature view-based generalization assumptions to the robust covariate shift approach. We demonstrate the effectiveness of our method using both synthetic examples and real datasets.

2.1 Covariate Shift

In many learning settings, the distribution available for training a predictor differs from the distribution of data on which it will be employed and evaluated. Under covariate shift, the training distribution and testing distribution share the same conditional label distribution, P(y|x), but have differing distributions over inputs:


Unlike the more common independent and identically distributed (IID) learning setting (Mohri et al., 2012), which further imposes that  Ptrain(x) = Ptest(x), constructing a predictor with limited complexity that performs well on the training data does not guarantee good performance on the testing distribution (with high probability). Also known as sample selection bias (Heckman, 1977), covariate shift has ties to more general domain adaptation (Jiang, 2008) and transfer learning settings (Pan and Yang, 2010), where the assumption on how training and testing data distributions differ is not fully specified.

2.2 Debiasing Via Importance Weighting

The most prevalent approach for addressing covariate shift attempts to remove the bias between the training and testing distributions (Shimodaira, 2000; Huang et al., 2006; Sugiyama et al., 2008). Under this perspective, minimizing the importance-weighted loss of (n) training examples,


where ˆfis estimated predictor and  ˜pis the empirical distribution of data, asymptotically minimizes the testing distribution loss, so long as  Ptest(x) > 0 =⇒ Ptrain(x) > 0.

Despite this asymptotic guarantee, predictive performance can be poor when training from finite amounts of samples in both theory and practice. Conceptually, the density ratios of a small number of training examples can become disproportionately large, making the resulting predictor overly sensitive to a small number of training data points—or even one single datapoint. This will lead to predictive results with high variance. Indeed, finite generalization bounds for importance-weighted methods require finite second moments:  EPtrain(x)[(Ptest(X)/Ptrain(X))2] < ∞(Cortes et al., 2010), which is often not satisfied in practice.

In order to overcome these difficulties, there is a significant literature studying how to reasonably estimate the weight Ptest(x)Ptrain(x)from training and testing sample data (Gret- ton et al., 2009). For example, methods based on minimizing certain types of divergences or loss functions on density (ratios) (Sugiyama et al., 2008; Kanamori et al., 2009) have been investigated. Other methods (Cortes and Mohri, 2014) make implicit assumptions on the feature space that there exist weights w(x) that makes the “distance” between training and testing features small enough so that reweighted training features can be used to learn classifiers on testing distribution. Other work focuses on utilizing two stages of regularization to reduce the variance of the resulting predictions (Reddi et al., 2014).

2.3 Robust Bias-Aware Prediction

Robust covariate shift classification (Liu and Ziebart, 2014) is motivated from minimax robust estimation (Topsøe, 1979; Gr¨unwald and Dawid, 2004). The expected testing loss minimization is formulated as a two player game, where the estimator player seeks to minimize the loss function, while an adversarial player tries to maximize it under constraints based on training samples.


The adversary must choose a distribution ˇPthat is similar to certain measured properties (features), e.g., EX∼ ˜P , ˇY |X∼ ˇP�φ(X, ˇY )� = E(X,Y )∼ ˜P [φ(X, Y )], of the training data. These are denoted by the convex set ˜Ξtrain, with  φas the feature function. The advantage of this formulation resides in its robustness to the worst possible case of covariate shift and avoidance of huge losses caused by over optimistic extrapolations. Moreover, the expected testing loss under this formulation is upper bounded by the testing entropy (Liu et al., 2015).

Recent advances in adversarial loss minimization in the IID setting extend beyond the logloss to non-smooth loss functions, such as the 0-1 loss (Fathony et al., 2016; Farnia and Tse, 2016) and ordinal regression loss (Fathony et al., 2017). In this paper, we establish the general form of the method that minimizes different loss functions under covariate shift.

We now provide a general form of the adversarial game for covariate shift that incorporates a wider set of loss functions and transfer assumptions between training and testing distributions. We then introduce variants and special cases that are applicable for different scenarios.

3.1 General Loss and Feature Shift Formulation

Our goal is to construct a predictor that is robust to the worst case testing distribution implied by available training data. To allow flexibility in what the training data can imply, we assume there exists a generalization distribution, Pgenv(x), where features of training data are assumed to generalize. We consider a set V of view-based feature vectors,  φv(xv, y), for  v ∈ {1, . . . , |V|}, defined over a subset of the input variables  xv. We then apply the robust method for covariate shift to the generalization distribution instead of the original training distribution  Ptrain(x), as shown in the following definition:

Definition 1. The generalized robust covariate shift clas-sifier results from the adversarial loss optimization game:


∀v ∈ {1, . . . ,  |V|}, EX∼ ˜Pgenv , ˇY |X∼ ˇP�φv(Xv, ˇY )�=


with a loss function that we want to minimize and a distribution,  Pgenv(X), to which training data feature statistics are assumed to generalize.

Note that the statistics in the constraints are reweighted training sample statistics. This provides us with flexibil-ity to impose different assumptions—in terms of  Pgenv(x)densities—for how training data should generalize.

Strong Lagrangian duality holds when Loss(·, ·)is a concave-convex function of ˇPand ˆP. This enables us to re-write the game in terms of Lagrangian multipliers  θ:


where ˜φv ≜ E(X,Y )∼ ˜Ptrain

� Pgenv (X)Ptrain(X) φv(Xv, Y )�is the feature function evaluated on empirical training data, and we allow  ϵslack for matching the primal constraints, leading to regularization in the dual. The optimization of this objective function is then composed of two steps: first, solve the inner minimax game with respect to ˆPand ˇP; second, optimize for  θin the outer minimization to satisfy imposed constraints. We focus our attention on  0 − 1loss and log loss, but many other loss functions can also be incorporated.

3.2 Classification Losses

Letting Loss( ˇPX, ˆPX) = ˆP T C ˇP—a bilinear and therefore concave-convex function of ˇPand ˆP—allows many classification losses to be represented in the cost matrix C. We can reformulate the inner minimax game as  min ˆP max ˇP EX[ ˆP T C′ ˇP], where  C′ = C +�vPgenv (X)Ptest(X) θvφv(Xv, ˇY ). The inner minimax game, which is a two player zero sum game, can be solved by linear programming. Another way to find the equilibrium of the inner minimax game for the special case of 0-1 loss is by seeking an analytical form of the game value as in Fathony et al. (2016), which brings more computational ef-ficiency. For the outer minimization, we take the subgradient with respect to  θ, which we approximate by reweighting training samples to the generalization distribution,


and perform subgradient descent.

We are also able to bound the approximation error in both subgradient and objective, given the probability densities are accurate. We refer to more details in Section 3.8 and just show two illustrative examples in Figure 1, where training distribution (solid line) and testing distribution(dashed line) is overlapping in different ways. The prediction color map shows a similar uncertain prediction with logloss-based classifier where there is not enough training data support, like the top right corner in the first figure. Moreover, the 0-1 loss provides more certain prediction in the overlapped region while logloss-based classifier’s prediction changes gradually in certainty from the most supported region to the least.


Figure 1: Prediction colormap with robust classifier using 0-1 loss when  Pgenv(x) = Ptrain(x). The colormap shows the P(‘+ ’|x). Training data with 5% noise is also shown.

3.3 The Log Loss Setting

When Loss( ˇPX, ˆPX) = E ˇY |X∼ ˇP [− log ˆP( ˇY |X)], this is the logloss case which generalizes to multiple feature views. In the logloss case, the inner minimax game has the solution ˆP = ˇP, so we follow the same procedure in (Liu and Ziebart, 2014) and establish the parametric form of the conditional label distribution in Theorem 1.

Theorem 1. The generalized robust covariate shift classi-fier based on logloss has the following parametric form:


We need to specify  Pgenv(x)carefully in practice (§3.4), but first consider two special cases under logloss.

Reduction to Robust Bias-Aware Prediction: When Pgenv(x) = Ptrain(x), this gives us a model that is representationally equivalent to the robust bias aware prediction method proposed in Liu and Ziebart (2014). The solution has a parametric form with the density ratio appearing as:

P(y|x) ∝e

tainty of the predictor to be larger for inputs that are relatively less likely in the training data. Thus, the density ratio

Ptest(x)in this parametric form controls the uncertainty of predictions. Unfortunately, when the input space is high-dimensional this robustness guarantee can be too conservative and restrictive, leading to predictions that are not informative (i.e., uniform distributions). This method is the most conservative model this framework generates, which limits the adversarial player to match training sample statistics strictly.

Reduction to Importance Weighting: If all the features are assumed to generalized fully to the testing distribution, i.e.,  Pgenv(x) = Ptest(x), the generalized robust covariate shift classifier is equivalent to the importance weighting method (Eq. 2). It produces the most aggressive model when the ˇPcould match reweighted features using Ptest(x)Ptrain(x). Note that this is first proposed and proven as Theorem 3 in (Liu and Ziebart, 2014). Here we derive using the generalized form to produce the same result. So when the assumption does not hold, the classification performance suffers. That motivates to use  Pgenvinstead as in (7), which sits in between the most conservative and aggressive model and provides flexibility to find the right generalization distribution for training data. The good generalization distribution should help avoid errors brought by over optimistic estimates but also achieve better performance than no generalization.

3.4 The Role of the Generalization Distribution

Feature generalization makes it possible to utilize information shared by both training and testing distributions and is essential to improve performance in predicting testing data. In order to illustrate the effect of the generalization distribution, in Figure 2 we consider a synthetic example with data sampled from two Gaussian distributions in 2-dimensional input space, with the training distribution totally contained in the testing distribution (two magenta ellipses). We compare the performance of logloss-based clas-sifier on them. 100 data points are sampled and only the training datapoints are shown in Figure 2, with 5% of noise in both training and testing data. We assume there exists a generalization distribution that training features generalize to (white ellipse). After training with larger and larger generalization distribution, predictive performance is evaluated on testing data and logloss is shown under the figures.

We can see from the figures that the generalization from training features get broader and broader with lager and larger generalization distribution. In the first case,  Pgenv(x)equals  Ptrain(x), the method is equivalent with robust covariate shift method and the prediction is limited only to the space around where there is enough training support. In the second and third case, the certain portion in the whole space increases with logloss on testing data getting better. Finally, in the last case,  Pgenv(x)equals to  Ptest(x), the method is equivalent to importance weighting. We can see the prediction is quite certain across the whole space in this setting. The logloss, however, gets worse in this case due to the noise in the data. So the takeaway from this example is that it is important to get a balance between feature generalization and robustness.

3.5 View-based Feature Generalization

The choice of the generalization distribution contributes heavily to the resulting prediction on testing data. We now propose a possible set of assumptions for the generalized formulation in the case when there are multiple views of features. We then explicitly apply assumptions about how each individual view of feature will generalize to the testing distribution. We denote the variables outside of view v as  x−v. If we assume that certain view-based features partially generalize from the training distribution to the testing distribution by assuming only the input variables outside of view v generalize to testing distribution, this corresponds with the following relationships between inputs:


Applying the above assumptions in the generalized formulation, the right hand side of the constraints for those generalized views take the form of an importance weighting of view v’s feature vector based on the non-view input variables,  x−v:


We use these partially reweighted features for those generalize features to formulate a new predictor for classification under covariate shift in  §3.6. This view-based robust clas-sifier leverages partial generalization of features, which is possible in many applications when there exists noise or covariate shift in only certain feature views. This provides


Figure 2: Comparison of incorporating different generalization distribution (white ellipses) in robust covariate shift classifier. Logloss evaluated on testing data points (not shown) is shown below each figure. Colormap represents the predicted probability of P(‘+ ’|x).

a solution that is robust to shift in a subset of feature views but also utilizes the information from the ones that are not shifted.

3.6 Robust Multiview Reformulation

Leveraging the view-based feature generalizations of Equation (9), we re-formulate the adversarial game in Defini-tion 2 with set  Vgof generalized views and set  Voof non-generalized views of features.

Definition 2. The robust multiview covariate shift clas-sifier is the solution to the adversarial loss optimization game:


This definition implies that there are two different sets of constraints: one set for features that we believe could be generalized, and the one set for features that we believe could not. Solving this constrained game formulation based on minimax duality and the method of Lagrangian multipliers for solving convex optimization problems, we have the parametric form of Theorem 2 for conditional label probability distribution when Loss( ˇPX, ˆPX) =E ˇY |X∼ ˇP [− log ˆP( ˇY |X)].

Theorem 2. The robust multiview covariate shift classifier when minimizing expected logloss has the following para-

metric form:


where view-specific density ratios,  Ptrain(xv)/Ptest(xv)are applied on the generalized views  Vgand joint density ratios Ptrain(x)/Ptest(x)are applied on non-generalized views  Vo.

We show in the next theorem that in the logloss case, the parameter estimation for  θis equivalent with maximizing the conditional likelihood of testing data with ˆP(Y |X)de-fined as (11). Therefore, the parameter can be estimated by using a gradient descent algorithm outlined in Theorem 3 by using reweighted training samples.

Theorem 3. The parameters of the robust multiview covariate shift classifier are obtained through implicitly maximizing the conditional likelihood of testing data by taking gradient steps as:


Note that even though we discuss the logloss case here in more detail, the same generalization assumption could also be applied to other loss functions. For those losses, there may not exist analytic forms for ˆP, but the subgradient should follow the same form as in (12). Therefore, as long as we are able to find an equilibrium of the inner minimax game in (5), we can solve the optimization by subgradient descent.


Figure 3: Comparison of Logistic Regression (a), Importanct Weighting Logistic Regression (b), Robust Bias-Aware Prediction (c) and View-based Robust Bias-Aware Prediction (d). Logloss evaluated on testing data points (not shown) is shown below each figure. Colormap represents the predicted probability of P(‘+’ |x).

3.7 Understanding the Multiview Classifier

We consider an illustrative synthetic example with data sampled from two overlapping Gaussian distributions (X) and identical true decision boundary (Y ) in Figure 3. In 50 training and 100 testing data points, 10% of the example are chosen uniformly at random to be noise (label flipped). We train four methods, all of which are logloss-based, using training data points (shown in the figures, roughly within the smaller ellipses) and evaluate them on testing data (not shown in the figures, roughly within larger ellipse). The colormap represents the testing conditional label distribution in the whole space. Logloss evaluated on the testing data is listed below each figure.

We see from the figures that the true decision boundary (the tilted line) could not be recovered by any of the methods using the limited data points. In fact, this is why covariate shift problems are so challenging, even though the assumption  Ptrain(y|x) = Ptest(y|x)holds. LR makes very certain predictions based on the training data, but produces an incorrect decision boundary and a worse logloss. IW, with reweighted training data, provides a less abrupt decision boundary but remains very certain towards the corners of the input space. The robust method, on the other hand, restricts the certain prediction regions only to areas with enough training data to support the prediction. The rest of the testing distribution space is covered with uniform predictions. It achieves better testing logloss by being more conservative. However, the question remains: could we leverage more information from the training data? We get our answer from the last model: our robust view-based model, which leverages the fact that the view-based training and testing feature distribution is much closer in the vertical dimension (x2) than in the horizontal dimension (x1). Thus, the assumption that the training vertical feature dimension can generalize to the testing distribution in our generalized robust covariate shift classifier. This corresponds with a parametric form of ˆPθ(y|x) ∝


ization robust method makes it possible to produce a solution that leverages the benefits of both the conservative robust method and the IW method. It maintains uncertainty in areas that have too little data to make predictions with any certainty (the top and bottom area in the input space), but gives more meaningful predictions in areas where the method expects the data could provide reasonable extrapolations.

3.8 Bounding Expected Worst Case Test Loss

One significant difference between the robust covariate shift methods in this paper and empirical risk minimization based methods is that we directly minimize the worst case expected loss under the target distribution. The reason why this works is that the (sub-)gradient in our formulation only depends on the training distribution, so we are able to use training data to approximate it. On the contrary, the ERM based methods directly approximate the expected loss function using limited data as in Eq.(2). Despite this difference, we can easily control the error in our (sub-)gradients and therefore bound the error in the optimized worst case expected target loss.

We first define the notation WCLoss(θ)as the (regularized) worst case loss under the testing distribution, which is equivalent with the Lagrangian form of the optimization game of robust covariate shift classifier in (5). Note that WCLoss(θ)differs in meaning from the Loss( ˆP, ˇP)we used to optimize in the original framework in Definition (4). For example, in logloss case, the worst case testing loss is obtained from the solved parametric form of ˆPand ˇP, which is the worst case predictor  Pθ( ˆY |X)as in (7), to the Loss( ˆP, ˇP): EPtest(X, ˆY )[− log Pθ( ˆY |X)].

Theorem 4. Assuming we have m training samples and n dimensional features, the Lagrangian form of the robust covariate shift classifier (5) is strong convex in terms of  θwith strong convexity constant M, all density estimation is accurate, and the inner minimax game in (5) is solved exactly, the expected loss on testing distribution of the robust covariate shift classifier is bounded, with probability  1−δ:


This bound indicates the distance between the expected target loss induced by our learned model from m training data and the optimal target loss is decreasing with speed  O( 1m). Note that the strong convexity condition is easy to satisfy even with non-smooth loss functions with  L2regularization. We refer to proof in the supplemental materials.

We conduct experiments on real datasets and investigate the performance of two classifiers from the robust covariate shift general framework. One corresponds with using 0-1 loss in the framework but just considers all features as one view and assumes  Pgenv(x) = Ptrain(x). The other uses logloss in (10) and applies generalization assumptions (8) on multiview features. We chose four datasets from the UCI repository (Newman et al., 1998; Siebert, 1987) for both sets of experiments. We show the detailed information about each dataset in supplement A.5. In order to create covariate shift, we synthetically generate 30 separate experiments in each dataset by drawing 100 training samples and 100 testing data samples from it, following similar sampling procedure in Huang et al. (2006), which is described in supplement A.6. Note that we normalize the data to [0, 1] beforehand. For each method in both sets of experiments, the regularization weights are chosen by 5-fold cross validation or importance weighted cross validation(IWCV). We refer to the details in supplement A.9.

4.1 Logistic Regression as Density Estimator

We use a discriminative classifier—logistic regression— as the density (ratio) estimator. To estimate ratios like


P (X|test) = P (train|X)P (test|X) P (test)P (train). To estimate density ratios as Ptest(X−v|Xv)Ptrain(X−v|Xv), we analyze this ratio according to Bayes’ rule and obtain: Ptest(X−v|Xv)Ptrain(X−v|Xv) = P (test|X)P (train|X) · P (train|Xv)P (test|Xv). Therefore, it is easy to construct two logistic regression models to estimate each ratio for X and  Xvrespectively. We carefully control the magnitude of the density ratio by adjusting the regularization using finite sample properties of the training data. According to (Dud´ık and Schapire, 2006), the amount of regularization corresponds to the slack of feature expectation matching in constraints in the primal form of maximum entropy type of problems, like logistic regression. If we are using  L2regularization and


l2diameter of the feature space, the loss of the estimated conditional label distribution will be bounded as  L(ˆθ) ≤L(θ∗) + ||θ∗||2D2[√2 + 2(1 +√2)�ln(1/σ)]/√m, with probability at least  1 − σ. Therefore, we use the maximum range among all the features to estimate  D2in the regularization for density estimation in both robust and importance weighted methods. Note that since we normalize features, D2 ≤ 1and  σis set to 0.05. We use first order features for density estimation in our experiment, which is enough in most cases.

4.2 Robust 0-l Loss Classifier

We evaluate four methods:

Robust bias aware 0-1 classifier (Robust 0-1) utilizes the general robust covariate shift classification framework (4) with Loss( ˇPX, ˆPX) = ˆP T C ˇPwith C as the 0-1 loss matrix and  Pgenv(x) = Ptrain(x).

Adversarial 0-1 classifier (Adv 0-1) minimizes expected 0-1 loss on the training distribution and has an optimization objective of: minθ min ˆP max ˇP EX∼Ptrain[ ˆP T C ˇP +�v θvφv(Xv, ˇY )] − �v θv ˜φv + ϵ||θ||2, where ˜φ = E(X,Y )∼ ˜Ptrain [φv(Xv, Y )]here.

Multiclass SVM (SVM) follows Crammer and Singer (2001) by minimizing hinge loss on training data.

Importance Weighted SVM (IW-SVM) reweights the training data with Ptest(X)Ptrain(X)and minimizes reweighted hinge loss on training data.

We show the comparison of accuracy in Table 1 and highlight methods that are either the best under paired t-test or not statistically distinguishable with significance level 0.1 in bold. We can see that Robust 0-1 performs better than other methods except in Seed, where it is statistically no worse than others. And Robust 0-1 can improve from Adv 0-1 at most times. That means minimizing worst case test loss using the adversarial game formulation (4) under covariate shfit is better than minimizing training loss and ignoring the bias using the same formulation.

Table 1: Average Accuracy Comparison for UCI datasets


4.3 Robust Multiview Covariate Shift Classifier

For a second set of experiments, we regard each feature dimension as a specific view to simplify our experimental setup for UCI datasets. We use KL-divergence as the criterion to determine the features that are generalizable or not. Details for the criterion is in supplement A.8. Besides the UCI datasets, we also evaluate our method on the multi-view dataset Language (Amini et al., 2009), which consist of text features of documents in five different languages (English-EN, French-FR, Germany-GR, Italian-IT, and Spanish-SP). This dataset is generated by translating documents originally in one language to the other four languages using machine translation. We regard different language features as different views for this task. In our experiment, we use the document originally in English. We use two languages in training and testing, with one view the same and the other view different between training and testing.

There are six categories as labels, more than ten thousand features for each language and around twenty thousand samples for English documents. To better estimate the densities we use PCA to reduce the dimension of features to 100 for each view and randomly sample 500 datapoints as training and 500 datapoints as testing. Therefore, we construct different settings for this dataset. For example, we can train using English and French views and test on Germany and French views (EN FR - GR FR). We evaluate the multiview robust covariate shift approach and three other methods:

Multiview robust bias aware classifier (Robust -View) utilize the general robust covariate shift clas-sification framework applying multiview feature generalization assumptions as in Definition 2.

Robust bias aware classifier (Robust) adversarially minimizes the testing distribution logloss Eq.3, using the parametric form as Eq.7.

Logistic regression (LR) maximizes the conditional log likelihood on training data, maxθ EPtrain(x)P (y|x) [log Pθ(Y |X)] − λ∥θ∥2, where ˆPθ(y|x) = exp(θ·Φ(x,y))�y′∈Y exp(θ·Φ(x,y′))and  λis the regular- ization constant. This approach ignores the covariate shift of the problem setting entirely.

Importance weighting method (IW) maximizes the conditional testing data likelihood as estimated using importance weighting with the density ratio,  maxθ EPtrain(x)P (y|x)�Ptest(x)Ptrain(x) (log Pθ(Y |X))�−


Table 2: Average Logloss Comparison for UCI datasets


Table 3: Logloss Comparison for Language datasets


We compare logloss of each method in Table 2 and Table 3. We denote the significantly best result under paired t-test with significance level 0.05 in bold numbers for UCI datasets. We can see from the Table 2 that Robust - View outperforms all other methods in most datasets for UCI experiments by having the lowest logloss. Moreover, it always improve from Robust, except being comparable with Robust in Seed for logloss and in Vehicle for accuracy. On the other hand, the performance of the other methods are mixed, with Robust achieves slightly better logloss and comparable accuracy with LR and IW.

For Language datasets, LR and IW are actually even worse than the Random baseline in terms of logloss due to the possibly large shift between different languages. In contrast, Robust -View and Robust are usually better than the baseline due to their robustness property. And Robust -View can improve from Robust in both logloss by utilizing the generalization property of certain features, especially when Robust is even worse than IR in accuracy because it is overly uncertain with logloss close to random. The reason why Robust is so close to uniform is that it regards all features as a whole and differentiate training and testing data. It disregards the fact that there are useful information that could be used to improve the predictive performance, which is exactly what motivates this work.

Covariate shift classification is an important but difficult task for machine learning in non-stationary environments when the testing labels are not available. The original robust bias-aware classifier only optimize logloss and may lose predictive power on testing input space that lacks training support, especially when dimension of features are large. We propose a general robust covariate shift classi-fication framework that is flexible enough to minimize various loss functions and make different feature generalization assumptions for multiview features. We derive different classifiers from the framework and demonstrate a set of assumptions that could be applied to achieve a balance of robustness and informativeness. We use synthetic examples, UCI biased datasets and real multiview datasets to demonstrate the model performance. We also investigate the theoretical property of the framework and are able to bound the worse case testing loss of our model properly.

Massih Amini, Nicolas Usunier, and Cyril Goutte. Learning from multiple partially observed views-an application to multilingual text categorization. In Advances in neural information processing systems, pages 28–36, 2009.

John Blitzer, Koby Crammer, Alex Kulesza, Fernando

Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008.

Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1270–1279, 2016.

Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.

Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Proceedings of the International Conference on Algorithmic Learning Theory, pages 38–53, 2008.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442– 450, 2010.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research, 2(Dec): 265–292, 2001.

Miroslav Dud´ık and Robert E. Schapire. Maximum entropy distribution estimation with generalized regularization. In Learning Theory, pages 123–138. Springer Berlin Heidelberg, 2006.

Wei Fan, Ian Davidson, Bianca Zadrozny, and Philip S. Yu. An improved categorization of classifier’s sensitivity on sample selection bias. In Proc. of the IEEE International Conference on Data Mining, pages 605–608, 2005.

Farzan Farnia and David Tse. A minimax approach to supervised learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4240– 4248. Curran Associates, Inc., 2016.

Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 559– 567. Curran Associates, Inc., 2016.

Rizal Fathony, Mohammad Ali Bashiri, and Brian Ziebart. Adversarial surrogate losses for ordinal regression. In Advances in Neural Information Processing Systems 30. 2017.

Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Sch¨olkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.

Peter D. Gr¨unwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust

Bayesian decision theory. Annals of Statistics, 32:1367– 1433, 2004.

James J Heckman. Sample selection bias as a specifica-tion error (with an application to the estimation of labor supply functions), 1977.

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schlkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601– 608, 2006.

Jing Jiang. A literature survey on domain adaptation of statistical classifiers. URL: http://sifaka. cs. uiuc. edu/jiang4/domainadaptation/survey, 3, 2008.

Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. Efficient direct density ratio estimation for nonstationarity adaptation and outlier detection. In Advances in neural information processing systems, pages 809– 816, 2009.

Anqi Liu and Brian D. Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems, pages 37–45, 2014.

Anqi Liu, Lev Reyzin, and Brian D Ziebart. Shiftpessimistic active learning using robust bias-aware prediction. In AAAI, pages 2764–2770, 2015.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.


Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

Sashank J Reddi, Barnab´as P´oczos, and Alex Smola. Doubly robust covariate shift correction. 2014.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90 (2):227–244, 2000.

J P Siebert. Vehicle recognition using rule based methods. Technical report, Mar 1987.

Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert M¨uller. Covariate shift adaptation by importance weighted cross validation. The Journal of Machine Learning Research, 8:985–1005, 2007.

Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V. Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.

Flemming Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.

Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the International Conference on Machine Learning, pages 903– 910. ACM, 2004. doi: 10.1145/1015330.1015425. URL http://doi.acm.org/10.1145/1015330.1015425.

A.1 Proof for Theorem 1

Proof. Solving the constrained minimax game (4), we follow Gr¨unwald and Dawid (2004) and Liu and Ziebart (2014). The minimax game reduces to a constrained maximum entropy problem:


such that:  ∀v ∈ {1, . . . ,  |V|}, EX∼ ˜Pgen, ˆY |X∼ ˆP�φv(Xv, ˆY )�= E(X,Y )∼ ˜Ptrain


Solving this constrained optimization problem using Lagrangian multiplier method, the Lagrangian is:

L( ˆP(y|x), θ, λ(x)) =  EX∼Ptest, ˆY |X∼ ˆP�−log ˆP( ˆY |X)�+ θ · (EX∼ ˜Pgen, ˆY |X∼ ˆP�φv(Xv,ˆY )�


According to strong Lagrangian duality, the max and min could be switched and preserving the same value:


So, assuming a fixed  θand  λ(x), the internal maximization problem can be solved first. Taking the partial derivative with respect to the conditional probability of a specific y and x, ˆP(y|x),


setting it equal to zero, and solving it, we obtain:



We analytically solve the normalization terms and yielding the conditional probability distribution:


where  Zθ(x) = �y′∈Y e

A.2 Proof for Theorem 2

Proof. Solving the constrained minimax game (2), we follow a similar procedure as in Proof A.1. The minimax game reduces to a constrained maximum entropy problem:


and for:  ∀v′ ∈ Vo, EX∼ ˜Ptrain, ˆY |X∼ ˆP�φv′(Xv′, ˆY )�= E(X,Y )∼ ˜Ptrain [φv′(Xv′, Y )]


Solving this constrained optimization problem using Lagrangian multiplier method, the Lagrangian is:

L( ˆP(y|x), θ, λ(x)) =  EX∼Ptest, ˆY |X∼ ˆP�−log ˆP( ˆY |X)�+ θv · (EX∼ ˜Ptrain, ˆY |X∼ ˆP


Taking the partial derivative with respect to the conditional probability of a specific y and x, ˆP(y|x),

∂∂ˆP(y|x)L( ˆP(y|x), θ, λ(x)) =  −Ptest(x) log ˆP(y|x) − Ptest(x) +�v Ptrain(xv)Ptest(x−v|xv)θv · φv(xv, y)


setting it equal to zero, and solving it, we obtain:

Ptrain(xv)Ptest(xv)θv · φv(xv, y) +�v′


We analytically solve the normalization terms and yielding the conditional probability distribution:


where  Zθ(x) = �y′∈Y e�vPtrain(xv)Ptest(xv) θv·φv(xv,y′)+�v′Ptrain(x)Ptest(x) θv′·φv′(xv′,y′).

A.3 Proof for Theorem 3

Proof. Plugging in the parametric form of ˆP(y|x)into the Lagrangian objective function in Proof A.2, we have:


Ptrain(Xv)Ptest(Xv)θv · φv(Xv, Y )


A.4 Proof for Theorem 4

Proof. We first investigate the empirical approximation of (sub-)gradient ˜Gand see how far it could deviate from the true (sub-)gradient G.


where n is the total dimension of features in  φ(x, y). According to Hoeffding bound:


Then we have the following, with probability  1 − δ,

The reason we are interested in the error in norm-2 of (sub-)gradient is we want to utilize the property that for a strongly convex objective function the following is true:


where M is the constant for strong convexity, i.e.  f(s) ≥ f(t) + ▽f(t)T (s − t) + M2 ||s − t||2. This is also true when the objective function is not smooth, when  ▽f(t)can be replaced by subgradient  g ∈ ∂f(x). Therefore, if we assume mins∈Sf(s), which in our case is the true worse case expected target loss, is reached at  ||G||2 = 0, then the objective function is bounded by


with probability  1 − δ, where Loss(θ)is the worst case loss function in the general game formulation.

A.5 Detailed Information of Datasets

Table 4: Datasets information and sampling details for synthetically biased experiments


A.6 Biased sampling strategy in UCI datasets

1. Separate the data into training and testing portion according to a feature;

2. Perform Principal Component Analysis (PCA) on both portions of data respectively;

3. Generate a random value a and b from different intervals;

4. Randomly choose a principal component i, calculate the weight vector as  normpdf(mi, µi, σi), where  µi =min(mi) + (max(mi) − min(mi))/a, σi = std(mi)/b;

5. Sample examples for the testing data samples from the testing portion of data in proportion to the weight vector values;

6. Follow the same procedure to sample examples for the training data.

A.7 Logistic regression as Density Estimation


A.8 Generalization Criterion

For UCI datasets, we regard each feature dimension as a view. We evaluate the KL-divergence of the training distribution Ptrain(xv)and the testing distribution  Ptest(xv)after density estimation to determine whether we should assume the generalization of each view, i.e.,  v ∈ Voor  v ∈ Vg. We use the threshold of 0.1, that if K < 0.1, we consider  Ptrain(xv)to be similar enough with  Ptest(xv)and  v ∈ Vg, otherwise,  v ∈ Vo. We include both training and testing inputs in the computation of KL-divergence.


For Language datasets, we assume we do not have prior knowledge for which view of features should be generalized. We conduct density estimation on both views and detect the one which is similar between training and testing. In practice, we could rely on both data observation and expert knowledge to choose the generalization criterion.

A.9 Model Selection

For each method, the regularization parameter  λare chosen using 5-fold cross validation, or importance weighted cross validation (IWCV) on a parameter range  λ ∈ [2−16, 2−12, 2−8, 2−4, 1]. Here the traditional cross validation is applied on LR, while IWCV is applied on all the other methods. Note that the traditional cross validation process is not correct anymore in the covariate shift setting where the training marginal data distribution of P(x) is different from the testing distribution (Sugiyama et al., 2007). Therefore, standard cross validation only matches the logistic regression method which ignores the bias. Though IWCV was originally designed for the importance weighting methods, it is proven to be unbiased for any loss function. We apply it to perform model tuning for our robust methods, even though the error estimate variance could be large.

Designed for Accessibility and to further Open Science