
Class Distribution Shifts in Zero-Shot Learning: Learning Robust Representations
Abstract

Zero-shot learning methods typically assume that the new, unseen classes encountered during deployment come from the same distribution as the classes in the training set. However, real-world scenarios often involve class distribution shifts (e.g., in age or gender for person identification), posing challenges for zero-shot classifiers that rely on learned representations from training classes. In this work, we propose and analyze a model that assumes that the attribute responsible for the shift is unknown in advance. We show that in this setting, standard training may lead to non-robust representations. To mitigate this, we develop an algorithm for learning robust representations in which (a) synthetic data environments are constructed via hierarchical sampling, and (b) environment balancing penalization, inspired by out-of-distribution problems, is applied. We show that our algorithm improves generalization to diverse class distributions in both simulations and experiments on real-world datasets.

Zero-shot learning systems [14, 27] are designed to classify instances of new, previously unseen classes at deployment, a scenario known as open-world classification. These systems are widely applied in extreme multi-class applications, such as face or voice recognition [19] for matching observations of the same individual, and more generally, for learning data representations [2].

Class distribution shifts typically refer to changes in the prevalence of a fixed set of classes between training and testing. In zero-shot learning, however, a different challenge arises: the appearance of entirely new classes at test time. This raises a critical question – are these new classes drawn from the same distribution as the training classes? Most zero-shot methods assume that they are, an assumption that not only shapes the design of test sets [57, 16] but also plays an explicit role in assessing the generalization capabilities of zero-shot classifiers [59, 48].

In practice, training classes are often chosen based on convenience and accessibility during data collection. Even when data is carefully collected, the distribution of classes may shift over time. For instance, this could occur when a face recognition system is deployed in a building located in a neighborhood undergoing demographic changes.

Class distribution shifts pose significant challenges to zero-shot classifiers, since they rely on learning data representations from the training classes to distinguish new, unseen ones. Typically, these classifiers are trained by minimizing the loss on the training set to effectively separate the training classes. However, this approach may result in poor performance when confronted with data from distributions that differ significantly from the class distribution in the training data. Notably, in person re-identification, this concern gained attention from a fairness perspective with respect to gender [15, 23], age [5, 33, 50], and racial [39, 54] bias. In all these studies the variable (i.e., gender, age, race) expected to cause the distribution shift was known in advance.

In contrast, in real-world scenarios, the attribute responsible for a future distribution shift is usually unknown during training. In such cases, existing approaches based on collecting balanced datasets or re-weighting training examples [54, 41, 53] are inapplicable. Furthermore, while class distribution shifts have been extensively studied in the standard setting of supervised learning (see Appendix A), previous research assumed a closed-world setting that does not account for new classes at test time. Instead, it only addressed changes in the prevalence of fixed classes between training and testing. Consequently, class distribution shifts in zero-shot learning remain largely unaddressed.

In this paper we first address these limitations by examining the effects of class distribution shifts on contrastive zero-shot learning through a parametric model that we propose and analyze (§3). We identify conditions under which minimizing the loss in this model leads to representations that perform poorly when a distribution shift occurs.

We then use the insights gained from this model to present our second contribution (§4): an algorithm for learning representations that are robust against class distribution shifts in zero-shot classification. In our proposed approach, artificial data environments with diverse attribute distributions are constructed using hierarchical subsampling, and an environment balancing criterion inspired by out-of-distribution (OOD) methods is applied. We assess our method’s effectiveness in both simulations and experiments on real-world datasets, demonstrating its enhanced robustness in §5.

1.1 Problem Setup

Let $\{z_i, c_i\}_{i=1}^{N_z}$ be a labeled set of training data points $z \in \mathcal{Z}$ and classes $c \in \mathcal{C}$, such that $c_i$ is the class of $z_i$.

In this work, we focus on verification algorithms that enable open-world classification by determining whether two data points $x_{ij} := (z_i, z_j)$ belong to the same class. For instance, in person re-identification the task is to identify whether two data points (e.g., face images or voice recordings) belong to the same person. We denote this by $y_{ij}$, where $y_{ij} = 1$ if $z_i$ and $z_j$ belong to the same class and $y_{ij} = 0$ otherwise. When the identity of each data point in the pair is not important, a single index is used for simplicity, namely $(x_k, y_k)$.

We assume that each class $c$ is characterized by some attribute $A$. We further assume that the training classes are sampled from $P_C(c)$, that the test classes are sampled according to $Q_C(c)$, and that the two distributions differ solely due to a shift in the distribution of the attribute $A$:

$$P_C(c) = \sum_{a} P_A(a)\, P_{C|A}(c \mid a), \qquad Q_C(c) = \sum_{a} Q_A(a)\, P_{C|A}(c \mid a), \qquad P_A \neq Q_A. \tag{1}$$

Importantly, we assume that the attribute $A$ is unknown, and that both during training and testing, data points $z \in \mathcal{Z}$ for each class are sampled according to $P_{Z|C}(z|c)$. For instance, revisit the person identification example where each person is a class. If the attribute $A$ is binary (e.g., $a_1$ is blond and $a_2$ is dark-haired), then $P(C|A = a_1)$ represents the distribution of people with blond hair, and $P(C|A = a_2)$ that of individuals with other hair colors. The training classes might be predominantly sampled from the blond population, $P(A = a_1) = \rho_{tr} = 0.8$, while at test time classes are predominantly dark-haired, $Q(A = a_1) = \rho_{te} = 0.1$.

We focus on verification techniques based on deep metric learning methods (for surveys see [43, 34]) such as contrastive learning [17], Siamese neural networks [24], triplet networks [20], and other more recent variations [35, 49, 56, 58]. These methods learn a representation function $g : \mathcal{Z} \to \hat{\mathcal{Z}}$ that maps data points to a representation space, so that examples from the same class are close (according to a predefined distance function $d(\cdot, \cdot)$), while those from different classes are farther apart.

We assume that g is a neural network trained by optimizing a deep-metric-learning loss, such as the contrastive loss [17]:

$$\ell\left(z_i, z_j, y_{ij}; d_g\right) = y_{ij}\, d_g(z_i, z_j)^2 + \left(1 - y_{ij}\right) \max\left\{0,\; m - d_g(z_i, z_j)\right\}^2, \tag{2}$$

where $m \geq 0$ is a predefined margin, and $d_g(z_i, z_j) := d(g(z_i), g(z_j))$ is the distance between the representations of the data points $z_i, z_j$. In our theoretical analysis, we examine the no-hinge contrastive loss (see Appendix B for additional details):

$$\ell\left(z_i, z_j, y_{ij}; d_g\right) = y_{ij}\, d_g(z_i, z_j)^2 + \left(1 - y_{ij}\right)\left(m - d_g(z_i, z_j)\right)^2. \tag{3}$$

To evaluate the class separation capability of a representation, we treat the distances between representations,  dg(zi, zj), as classification scores. Following common practice in the field (e.g., [47, 22]), we use the area under the receiver operating characteristic curve (AUC) to evaluate the representation, enabling threshold-agnostic assessment:

$$\mathrm{AUC}(g) = \mathbb{P}\left(d_g(z_i, z_j) < d_g(z_u, z_v) \mid y_{ij} = 1,\, y_{uv} = 0\right). \tag{4}$$
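To make this evaluation concrete, the following sketch scores pairs by their (negated) representation distances and computes the AUC with scikit-learn. The fixed random linear map standing in for a trained $g$ and the toy pair construction are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                 # stand-in for a trained representation g(z) = zW

def g(z):
    return z @ W

def distances(a, b):
    """Euclidean distance between the representations of paired points."""
    return np.linalg.norm(g(a) - g(b), axis=1)

# Toy pairs: positive pairs share a class center, negative pairs use different centers.
centers = rng.normal(size=(50, 8))
noise = lambda: 0.1 * rng.normal(size=(50, 8))
z1, z2_pos = centers + noise(), centers + noise()
z2_neg = np.roll(centers, 1, axis=0) + noise()

# Smaller distance should indicate a positive pair, so negated distances serve as scores.
scores = -np.concatenate([distances(z1, z2_pos), distances(z1, z2_neg)])
labels = np.concatenate([np.ones(50), np.zeros(50)])
print(roc_auc_score(labels, scores))        # threshold-agnostic evaluation of the representation
```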

Our goal is to learn a representation $g$ that is robust to class attribute shifts; that is, one such that for an unknown shifted attribute distribution $Q_A$, the performance $\mathbb{E}_{Q_A}[\mathrm{AUC}(g)]$ does not deteriorate significantly compared to the performance obtained under the training distribution $P_A$.

The field of OOD generalization has gained attention since the work of Peters et al. [36, 37], which deals with closed-world classification where training data is gathered from multiple environments $\mathcal{E}_{\text{train}}$. In this setting it is assumed that within each environment $e \in \mathcal{E}_{\text{train}}$ examples share the same joint distribution $P^e_{C,Z}(c, z)$, but that the joint distribution changes across environments, often due to variations in $P^e_{Z|C}(z|c)$. A well-known example [1] involving the classification of images of cows and camels demonstrates how an algorithm relying on background cues during training (e.g., cows in green pastures, camels in deserts) performs poorly on new images of cows with sandy backgrounds.

Several approaches that rely on access to diverse training environments were proposed to identify stable relations between the data point z and its class c. Examples of such stable relations include choosing causal variables using statistical tests [42], leveraging conditional independence induced by the common causal mechanism [9], and using multi-environment calibration as a surrogate for OOD performance [52].

Most relevant to our work are methods that aim to balance the loss over multiple environments. These methods consider a representation $g = g_\theta$, a neural network parameterized by $\theta$, trained to optimize an objective of the form

$$\min_\theta\; \sum_{e \in \mathcal{E}_{\text{train}}} \ell^e(g_\theta) + \lambda\, R\left(g_\theta, \mathcal{E}_{\text{train}}\right), \tag{5}$$

where $\ell^e(g_\theta)$ is the empirical loss obtained on the environment $e$, $\mathcal{E}_{\text{train}}$ is the set of all training environments, $R$ is a regularization term designed to balance performance over multiple environments, and $\lambda$ is a regularization factor balancing the tradeoff between the empirical risk minimization (ERM) term and the balance penalty. Below, we describe three such methods, which we refer to later in the paper.

Invariant risk minimization (IRM), presented in [1], aims to find data representations $g_\theta$ such that the optimal classifier $w$ on top of the data representation, $w \circ g_\theta$, is shared across all environments. The authors therefore proposed minimizing the sum of environment losses $\ell^e(w \circ g_\theta)$ over all training environments subject to $w \in \arg\min_{w'} \ell^e(w' \circ g_\theta)$ for all $e \in \mathcal{E}_{\text{train}}$. However, since this objective is too difficult to optimize, a relaxed version was also proposed, taking the form of Equation 5 with a penalty that measures how close $w$ is to minimizing $\ell^e(w \circ g_\theta)$: $R^e_{\mathrm{IRMv1}}(g_\theta) = \left\| \nabla_{w \mid w=1}\, \ell^e(w \cdot g_\theta) \right\|^2$.
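For reference, a minimal PyTorch version of the relaxed IRMv1 penalty as it is commonly implemented: the environment loss is evaluated on predictions scaled by a dummy classifier fixed at 1.0, and the squared gradient with respect to that scale is the penalty. The `loss_fn` argument is a placeholder for any differentiable loss.

```python
import torch

def irmv1_penalty(logits, labels, loss_fn):
    """Squared gradient of the environment loss w.r.t. a dummy scale fixed at 1.0."""
    scale = torch.tensor(1.0, requires_grad=True, device=logits.device)
    env_loss = loss_fn(logits * scale, labels)
    grad = torch.autograd.grad(env_loss, [scale], create_graph=True)[0]
    return grad.pow(2).sum()
```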

Note that for loss functions whose optimal classifiers can be expressed as conditional expectations, the original IRM objective is equivalent to the requirement that for all environments $e, e' \in \mathcal{E}_{\text{train}}$, $\mathbb{E}_{P^e_{C,Z}}[c \mid g(z) = h] = \mathbb{E}_{P^{e'}_{C,Z}}[c \mid g(z) = h]$, where $P^e_{C,Z}$ and $P^{e'}_{C,Z}$ are the joint data distributions in the respective environments.

Calibration Loss Over Environments (CLOvE), presented in [52], leverages the equivalence above to establish a link between multi-environment calibration and invariance for binary predictors ($c \in \{0, 1\}$). The proposed regularizer is based on the maximum mean calibration error (MMCE) [26]. Let $s : \hat{\mathcal{Z}} \to [0, 1]$ be a classification score function applied on the representation, $s \circ g$, and let $s_i = \max\{s \circ g(z_i),\ 1 - s \circ g(z_i)\}$ be the confidence on the $i$-th data point. Denote the correctness $b_i = \mathbb{1}\{|c_i - s_i| < \tfrac{1}{2}\}$, and let $K : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a universal kernel. Let $Z^e$ denote the training data in environment $e$. The authors proposed using the MMCE as a penalty in an objective of the form of Equation 5 with $R^e_{\mathrm{MMCE}}(s, g_\theta) = \frac{1}{m^2} \sum_{z_i, z_j \in Z^e} (b_i - s_i)(b_j - s_j)\, K(s_i, s_j)$.

Variance Risk Extrapolation (VarREx), proposed by Krueger et al. [25], is based on the observation that reducing differences in loss (risk) across training domains can reduce a model's sensitivity to a wide range of distribution shifts. The authors found that using the variance of the losses as a regularizer is more stable and effective than other penalties. They therefore propose the following regularization term for $n$ training environments: $R_{\mathrm{VarREx}}(g_\theta, \mathcal{E}_{\text{train}}) = \operatorname{Var}\left(\ell^{e_1}(g_\theta), \ldots, \ell^{e_n}(g_\theta)\right)$.

While simple and intuitive, this approach assumes that losses across different environments accurately reflect the classifier’s performance. However, as discussed in §4, this is often not true for deep metric learning losses, where significant changes in loss may correspond to only minor variations in performance.

In this section, we introduce a parametric model of class distribution shifts. Our model shows that in zero-shot learning, even if the conditional distribution of data given the class P(z|c) remains the same between training and testing, a shift in the class distribution from P(c) to Q(c) can cause poor performance on newly encountered classes sampled from the shifted distribution Q(c).

Assume that for all classes, the data points $z_i \in \mathbb{R}^d$ are sampled from $z_i \mid c_i \sim \mathcal{N}(c_i, \Sigma_z)$, where $\Sigma_z = \nu_z \cdot I_d$ and $I_d$ is the identity matrix. Let the attribute $A$ indicate the type of a class $c$, with two possible types, $a_1$ and $a_2$. Assume that classes of type $a$ are drawn according to $c_i \sim \mathcal{N}(0, \Sigma_a)$ for $a \in \{a_1, a_2\}$. Finally, assume that in training, $a_1$ is the majority type with $P(a_1) = \rho_{tr} \gg 0.5$, whereas at test time, $a_2$ is the majority type with $Q(a_1) = \rho_{te} \ll 0.5$.

We construct the model such that differences between the training class distribution $P(c)$ and the test distribution $Q(c)$ stem solely from a shift in the mixing probabilities of an unknown attribute $A$ (see Equation 1). Therefore, we define $\Sigma_a$ as a diagonal matrix with replicates of three distinct values on its diagonal, $\nu_0, \nu_+, \nu_-$, where $0 < \nu_- < \nu_z \leq \nu_0 < \nu_+$. Then, in the coordinates corresponding to $\nu_0$ and $\nu_+$, data points from different classes are well separated, whereas in the coordinates corresponding to $\nu_-$ they are not. Assume that the coordinates corresponding to $\nu_0$ are shared by both types, but that $\nu_+$ and $\nu_-$ are swapped:

$$\Sigma_{a_1} = \operatorname{diag}\big(\underbrace{\nu_0, \ldots, \nu_0}_{d_0},\ \underbrace{\nu_+, \ldots, \nu_+}_{d_1},\ \underbrace{\nu_-, \ldots, \nu_-}_{d_2}\big), \qquad \Sigma_{a_2} = \operatorname{diag}\big(\underbrace{\nu_0, \ldots, \nu_0}_{d_0},\ \underbrace{\nu_-, \ldots, \nu_-}_{d_1},\ \underbrace{\nu_+, \ldots, \nu_+}_{d_2}\big).$$

An illustration with one replicate of each value is shown in Figure 1.
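For concreteness, a minimal NumPy sketch that samples classes and data points from this model; the function name and default parameter values are illustrative choices rather than the paper's exact simulation code.

```python
import numpy as np

def sample_model(n_classes, n_points, rho, d0=5, d1=10, d2=10,
                 nu_z=1.0, nu_0=1.0, nu_plus=2.0, nu_minus=0.1, seed=0):
    """Sample classes of two types and Gaussian data points around their centers."""
    rng = np.random.default_rng(seed)
    d = d0 + d1 + d2
    # Diagonal class variances: the nu_+ / nu_- blocks are swapped between the two types.
    var_a1 = np.concatenate([np.full(d0, nu_0), np.full(d1, nu_plus), np.full(d2, nu_minus)])
    var_a2 = np.concatenate([np.full(d0, nu_0), np.full(d1, nu_minus), np.full(d2, nu_plus)])
    is_a1 = rng.random(n_classes) < rho                       # a1 is the training majority type
    class_vars = np.where(is_a1[:, None], var_a1, var_a2)
    centers = rng.normal(0.0, np.sqrt(class_vars))            # c ~ N(0, Sigma_a)
    # Data points z | c ~ N(c, nu_z * I_d)
    z = centers[:, None, :] + rng.normal(0.0, np.sqrt(nu_z), size=(n_classes, n_points, d))
    labels = np.repeat(np.arange(n_classes), n_points)
    return z.reshape(-1, d), labels, is_a1

X, y, class_types = sample_model(n_classes=50, n_points=10, rho=0.9)
```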

The following proposition shows that if the number of dimensions $d_1$ that allow good separation for classes of type $a_1$ is relatively similar to the number of dimensions $d_2$ that enable good separation for classes of type $a_2$, specifically if $h_l(\rho, \nu_z, \nu_0, \nu_1, \nu_2) < \frac{d_2 + 2}{d_1 + 2} < h_u(\rho, \nu_z, \nu_0, \nu_1, \nu_2)$, then the optimal solution for the training distribution prioritizes the components (features) corresponding to $\nu_+$ for classes of type $a_1$. Thus, the prioritized features allow good separation for classes from the majority type in training, but offer poor separation for the shifted test distribution, where most classes are of type $a_2$. Note that if $d_2$ is large, the corresponding components, when combined, may still provide reasonable separation. We define $h_l$ and $h_u$ in Equation 35 and provide the proof of Proposition 1 in Appendix B.2.

Proposition 1. Let the representation be given by a diagonal weight matrix $W \in \mathbb{R}^{d \times d}$ with the squared Euclidean distance $d_g(z_i, z_j) = \|W(z_i - z_j)\|^2$, and let $W^* = \arg\min_W \mathbb{E}\left[\ell(\cdot, \cdot, \cdot; d_g)\right]$. Denote $\bar{w}^{*2}_1 = \frac{1}{d_1}\sum_{k=d_0+1}^{d_0+d_1} w^{*2}_k$ and $\bar{w}^{*2}_2 = \frac{1}{d_2}\sum_{k=d_0+d_1+1}^{d} w^{*2}_k$. Then, for

$$h_l(\rho, \nu_z, \nu_0, \nu_1, \nu_2) < \frac{d_2 + 2}{d_1 + 2} < h_u(\rho, \nu_z, \nu_0, \nu_1, \nu_2),$$

it holds that $d_2\,\bar{w}^{*2}_2 \leq d_1\,\bar{w}^{*2}_1$.


Figure 1: Illustration of the parametric model. Classes of each type are best separated along specific axes: classes of type $a_1$ along the red axis ($z^{(1)}$) and classes of type $a_2$ along the green axis ($z^{(2)}$). On axis $z^{(0)}$ both types can be separated, but not as effectively as on their respective optimal axes.


Figure 2: Optimal weights. Top row: $d_0$ is fixed, $d_1$ and $d_2$ vary. Middle and bottom rows: $d_0, d_1, d_2$ are fixed. Middle: $\nu_0/\nu_-$ varies. Bottom: $\nu_0/\nu_+$ varies.


Note that the conditions outlined in Proposition 1 are sufficient but not necessary. Accordingly, in Appendix B.3, we provide the complete analytical solution for $w^*$ that minimizes the expected loss $\mathbb{E}\left[\ell(\cdot, \cdot, \cdot; d_g)\right]$ under the squared Euclidean distance. According to Proposition 1, larger $d_2$ values favor $\nu_-$ for better aggregated separation. Increasing $\nu_0/\nu_+$ increases the difference between $\bar{w}^{*2}_1$ and $\bar{w}^{*2}_2$, and vice versa for $\nu_0/\nu_-$.

These relationships in the optimal solution are illustrated in Figure 2, showcasing different scenarios. The top row shows that when $d_1 = d_2 = 10$, dimensions favoring classes of type $a_1$ are prioritized for $\rho > 0.5$, while those favoring type $a_2$ are prioritized for $\rho < 0.5$. When $d_1 = 10$ while $d_2 = 5$, dimensions favoring type $a_1$ are prioritized for all values of $\rho$, and vice versa when $d_2$ is significantly larger than $d_1$. The middle and bottom rows further explore the $d_1 = d_2$ case, showing how differences in separability between shared dimensions ($\nu_0$) and type-favoring dimensions affect the weight allocation.

Since components corresponding to $\nu_+$ for classes of type $a_1$ align with $\nu_-$ for classes of type $a_2$, the optimal representation for the training distribution results in poor separation under the shifted test distribution. Therefore, a robust representation should prioritize dimensions that provide effective separation for both class types, i.e., those corresponding to $\nu_0$.

This aligns with a common principle in the OOD generalization field, where robust representations are those that rely on features shared across environments (see §2). This principle is often referred to as invariance.

Motivated by our analysis of the parametric model, we propose a new approach for tackling class distribution shifts in zero-shot learning. Our approach revolves around two key ideas: (i) during training, different mixtures of the attribute A can be produced by sampling small subsets of the classes, forming artificial environments, and (ii) penalizing for differences in performance across these environments is likely to increase robustness to the class mixture encountered at test time.

4.1 Synthetic Environments

Standard ERM training involves sampling pairs of data points $(z_i, z_j)$ uniformly at random from all $N_c$ classes available during training. However, as discussed in §3, this is prone to overfitting to the attribute distribution of the training data. Since the identity of the attribute is unknown, weighted sampling (and similar approaches) cannot be used to create environments with different attribute mixtures.

Yet, our goal is to design artificial environments with diverse compositions of the (unknown) attribute of interest. To do so, we leverage the variability of small samples: while class subsets of size similar to $N_c$ maintain attribute mixtures similar to that of the overall training set, smaller subsets with $k \ll N_c$ classes are likely to exhibit distinct attribute mixtures. Therefore, we propose creating multiple environments, each composed of examples from a few sampled classes.

This results in a hierarchical sampling scheme for the data pairs: first, sample a subset of $k$ classes, $S = \{c_1, \ldots, c_k\}$. Then, for each $c \in S$, sample $2r$ pairs of data points as follows: $r$ positive pairs from within the class $c$, $\{z_i;\ c_i = c\}$, uniformly at random; and $r$ negative pairs, where one point is sampled uniformly at random from $c$ and the other from all data points of the other classes in $S$, $\{z_i;\ c_i \neq c,\ c_i \in S\}$.

Across multiple class subsets $\mathcal{S} = \{S_1, \ldots, S_n\}$, this hierarchical sampling results in diverse mixtures of any unknown attribute (see Figure 3). In particular, in some of the class subsets, classes from the overall minority type constitute the majority in the environment.
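A minimal NumPy sketch of drawing one synthetic environment under this scheme; the flat `labels` array and the function name are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def sample_environment(labels, k, r, rng):
    """One synthetic environment: k classes, with r positive and r negative pairs per class."""
    classes = rng.choice(np.unique(labels), size=k, replace=False)
    pairs, pair_labels = [], []
    for c in classes:
        own = np.flatnonzero(labels == c)
        others = np.flatnonzero(np.isin(labels, classes) & (labels != c))
        for _ in range(r):                                   # positive pairs: both points from c
            i, j = rng.choice(own, size=2, replace=False)
            pairs.append((i, j)); pair_labels.append(1)
        for _ in range(r):                                   # negative pairs: one point from c
            pairs.append((rng.choice(own), rng.choice(others))); pair_labels.append(0)
    return np.array(pairs), np.array(pair_labels)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(6), 10)                          # toy data: 6 classes, 10 points each
environments = [sample_environment(labels, k=3, r=1, rng=rng) for _ in range(5)]
```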


Figure 3: Illustration of the proposed hierarchical sampling. Top: $N_c = 6$ classes, with 2 minority-type classes D, F (in purple). Middle: synthetic environments formed by sampling small ($k = 3$) class subsets; in $1/5$ of the environments, the minority-type classes become the majority, constituting $2/3$ of the classes. Bottom: sampling $r = 1$ positive and $r = 1$ negative pairs for each class in the environment.

4.2 Environment Balancing Algorithm for Class Distribution Shifts

Our goal is to learn data representations that will allow separation between classes without knowing which attribute is expected to change and how significantly. Therefore, we require the learned data representation to perform similarly well on all mixtures obtained on the synthetic environments.

To achieve this, inspired by OOD performance balancing methods (see §2), we optimize a penalized objective:

$$\min_\theta\; \sum_{l=1}^{n} \ell^{S_l}(g_\theta) + \lambda\, R\left(S_1, \ldots, S_n\right),$$

where  R (S1, . . . , Sn)is any balancing term between the constructed synthetic environments.

Note that computing $R(S_1, \ldots, S_n)$ often involves evaluating some value on each environment separately. For a general balancing term, we denote the value in the $l$-th environment by $f(S_l)$ and accordingly express $R(S_1, \ldots, S_n) = \mathring{f}(f(S_1), \ldots, f(S_n))$, where $\mathring{f}$ represents the corresponding aggregation function. Our approach for balancing performance across synthetic environments of class subsets is outlined in Algorithm 1.
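The following PyTorch sketch shows the shape of one such penalized update over synthetic environments; `sample_environment`, `distance`, `contrastive_loss`, and `balance_penalty` are placeholder callables standing in for the components of Algorithm 1, not the authors' released code.

```python
import torch

def training_step(model, optimizer, data, labels, n_envs, k, r, lam,
                  sample_environment, distance, contrastive_loss, balance_penalty, rng):
    """One penalized update: sum of per-environment losses plus lambda times a balancing term."""
    env_losses, env_values = [], []
    for _ in range(n_envs):
        pairs, y = sample_environment(labels, k, r, rng)       # hierarchical sampling (Section 4.1)
        y = torch.as_tensor(y)
        d = distance(model(data[pairs[:, 0]]), model(data[pairs[:, 1]]))
        env_losses.append(contrastive_loss(d, y))
        env_values.append((d, y))                              # kept for performance-based penalties
    objective = torch.stack(env_losses).sum() + lam * balance_penalty(env_values)
    optimizer.zero_grad()
    objective.backward()
    optimizer.step()
    return float(objective.detach())
```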

4.3 Balancing Performance Instead of Loss

In multiple OOD penalties (e.g., IRM and VarREx), $f$ represents the loss in each environment, which, in deep metric learning algorithms, is based on distance. This presents a challenge in zero-shot verification, where sampled tuples often include numerous easy negative examples, leading to a performance plateau early in the learning process, even though the distances themselves still exhibit considerable variation. Strategies such as selecting the most difficult tuples [18] have been proposed to address this issue; however, these methods have been found to generate noisy gradients and loss values [34].

We therefore propose to balance performance directly instead of relying on the losses in the training environments. Denote the set of negative pairs in a synthetic environment by $D^0_l = \{x_{ij} = (z_i, z_j) : c_i, c_j \in S_l,\ y_{ij} = 0\}$ and the set of positive pairs by $D^1_l = \{x_{ij} = (z_i, z_j) : c_i, c_j \in S_l,\ y_{ij} = 1\}$. An unbiased estimator of the AUC on a given synthetic environment $S_l$ is given by

$$\widehat{\mathrm{AUC}}(S_l) = \frac{1}{|D^1_l|\,|D^0_l|} \sum_{x_{ij} \in D^1_l} \sum_{x_{uv} \in D^0_l} \mathbb{1}\left\{d_g(z_i, z_j) < d_g(z_u, z_v)\right\}$$

for $x_{ij} \in D^1_l$ and $x_{uv} \in D^0_l$. Since this estimator is non-differentiable and therefore cannot be used in gradient-descent-based optimization, we use the soft-AUC as an approximation [7]:

$$\widehat{\mathrm{AUC}}_\beta(S_l) = \frac{1}{|D^1_l|\,|D^0_l|} \sum_{x_{ij} \in D^1_l} \sum_{x_{uv} \in D^0_l} \sigma_\beta\left(d_g(z_u, z_v) - d_g(z_i, z_j)\right),$$

where the sigmoid $\sigma_\beta(t) = \frac{1}{1 + e^{-\beta t}}$ approximates the step function. Note that as $\beta \to \infty$, $\sigma_\beta$ converges pointwise to the step function. Consequently, we propose the penalty:

$$R_{\mathrm{VarAUC}}\left(S_1, \ldots, S_n\right) = \operatorname{Var}\left(\widehat{\mathrm{AUC}}_\beta(S_1), \ldots, \widehat{\mathrm{AUC}}_\beta(S_n)\right).$$
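A PyTorch sketch of the soft-AUC and of the VarAUC penalty, written to consume the per-environment (distance, label) tuples from the training-step sketch above; the value of $\beta$ and the function names are illustrative assumptions.

```python
import torch

def soft_auc(dist_pos, dist_neg, beta=5.0):
    """Differentiable soft-AUC: sigmoid of (negative-pair distance minus positive-pair distance)."""
    diffs = dist_neg.unsqueeze(0) - dist_pos.unsqueeze(1)      # all positive/negative pairings
    return torch.sigmoid(beta * diffs).mean()

def var_auc_penalty(env_values, beta=5.0):
    """Variance of the soft-AUC across synthetic environments (the VarAUC penalty)."""
    aucs = [soft_auc(d[y == 1], d[y == 0], beta) for d, y in env_values]
    return torch.stack(aucs).var()
```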

4.4 How Many Environments Are Needed?

The proposed hierarchical sampling scheme allows for the construction of many synthetic environments with various attribute mixtures, influenced by the number of classes in each environment. As shown in the analysis below, this ensures that with high probability there will be at least one environment with a pair of minority type classes, thereby supporting learning to separate negative pairs within the minority type.

In each training iteration, we consider $n$ class subsets (environments) of size $k$. Our goal is to achieve robustness to all attribute values $a$ that are associated with at least a fraction $\rho_{\min} \in (0, 1)$ of the training classes. Note that $\rho_{\min}$ is specified by the practitioner, without knowledge of the true attribute that may cause the shift or of its true prevalence $\rho$ in the training set.

We compute the number of synthetic environments $n$ such that, with high probability $1 - \alpha$, the subsets $S_1, \ldots, S_n$ include at least one subset with at least two classes associated with $a$ (otherwise none of the subsets would contain negative pairs within the attribute value $a$). Denote the probability that a given subset contains no class associated with $a$ by $\phi_0 = \binom{\lceil (1-\rho_{\min})N_c \rceil}{k} \big/ \binom{N_c}{k}$, and the probability that a given subset contains exactly one such class by $\phi_1 = \rho_{\min} N_c \binom{\lceil (1-\rho_{\min})N_c \rceil}{k-1} \big/ \binom{N_c}{k}$. Therefore, the number of environments required to ensure that, with probability at least $1 - \alpha$, at least two minority-type classes appear together in the same environment is

$$n \geq \frac{\log \alpha}{\log\left(\phi_0 + \phi_1\right)}. \tag{10}$$

Note that both $\phi_0$ and $\phi_1$ are normalized by the same factor $\binom{N_c}{k}$.
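To illustrate, a small Python helper computing this bound; the closed form $n \geq \log\alpha / \log(\phi_0 + \phi_1)$ follows the reconstruction in Equation 10 above, and the class count in the usage line is an arbitrary assumption.

```python
from math import ceil, comb, log

def required_environments(n_classes, k, rho_min, alpha):
    """Smallest n such that, w.p. >= 1 - alpha, some subset of size k holds >= 2 minority classes."""
    n_majority = ceil((1.0 - rho_min) * n_classes)
    total = comb(n_classes, k)
    phi0 = comb(n_majority, k) / total                            # subset contains no minority class
    phi1 = rho_min * n_classes * comb(n_majority, k - 1) / total  # subset contains exactly one
    return ceil(log(alpha) / log(phi0 + phi1))

# Example with the Section 5.1 choices rho_min = 0.1, k = 2, alpha = 0.5 and an assumed class count.
print(required_environments(n_classes=100, k=2, rho_min=0.1, alpha=0.5))
```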

Our method enhances standard training with two components: a hierarchical sampling scheme and a balancing term over synthetic environments. To the best of our knowledge, this is the first work addressing OOD generalization for class distribution shifts in zero-shot learning. We therefore benchmark our algorithm against the ERM baseline (uniform random sampling with an unpenalized objective) and a hierarchical sampling baseline (hierarchical sampling with an unpenalized objective).


Figure 4: Average AUC over 10 simulation repetitions for a majority attribute proportion of $\rho = 0.9$ in training (and 0.1 in test). Solid lines: distribution shift. Dashed lines: in-distribution. Our method improves robustness to shifts without compromising results on the training distribution.

Additionally, we tested standard regularization techniques, including dropout and the $L_2$ norm, which did not yield notable improvements in the distribution shift scenario and are therefore not shown.

To ensure a comprehensive comparison, in addition to the proposed VarAUC penalty, we evaluate variants of our algorithm in which the IRM, CLOvE, and VarREx penalties are used instead. While we show that VarAUC consistently outperforms other penalties, the crucial improvement lies in its performance compared to the ERM baseline: application of existing OOD penalties is enabled by the construction of synthetic environments in our algorithm. As discussed in Appendix C, this construction facilitates the formulation of class distribution shifts in zero-shot learning within the OOD setting.

In all of the experiments performed, we trained the network with the contrastive loss (Equation 2) and the normalized cosine distance $d_g(z_1, z_2) = \frac{1}{2}\left(1 - \frac{g(z_1) \cdot g(z_2)}{\|g(z_1)\|\,\|g(z_2)\|}\right)$. The specific setups are detailed below (additional details can be found in Appendix F), and code to reproduce our results is available at https://github.com/YuliSl/Zero_Shot_Robust_Representations.
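A minimal PyTorch sketch of the normalized cosine distance and a margin contrastive loss on pair distances; the squared-margin form follows the classic loss of [17] and, together with the function names, should be read as an assumption about the exact variant rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def cosine_distance(z1, z2):
    """Normalized cosine distance in [0, 1]."""
    return 0.5 * (1.0 - F.cosine_similarity(z1, z2, dim=-1))

def contrastive_loss(dist, y, margin=0.5):
    """Margin contrastive loss on pair distances (y = 1 for same-class pairs)."""
    y = y.float()
    return (y * dist.pow(2) + (1.0 - y) * torch.clamp(margin - dist, min=0.0).pow(2)).mean()
```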

5.1 Simulations: Revisiting the Parametric Model

We now revisit the parametric model presented in §3. To increase the complexity of the problem, we add dimensions in which classes from both types are not well separated. That is, $\Sigma_a$ includes additional diagonal entries set to zero.

Setup We used 68 subsets in each training iteration, each consisting of two classes. This corresponds to choosing $\rho_{\min} = 0.1$ (the desired sensitivity, regardless of the true unknown parameter $\rho \in \{0.05, 0.1, 0.3\}$), with a lenient value of $\alpha = 0.5$, which results in the construction of fewer environments according to Equation 10. For each class, we sampled $2r = 10$ pairs of data points. The representation was defined as $g(z) = wz$ for $w \in \mathbb{R}^{d \times p}$. Here we focus on the case of $p = 16$, $\nu_z = \nu_0 = 1$, $\nu_- = 0.1$, $\nu_+ = 2$, $d_0 = 5$, $d_1 = d_2 = 10$. Results for additional representation sizes $p$, noise ratios $\nu_+/\nu_-$, and varying proportions of positive and negative examples are presented in Appendix D.1.

To assess the importance assigned to each dimension $i$, we examine the magnitude of its weights relative to the other weights, which we denote $\mathrm{Importance}_i$.

Results In Figure 5 we examine the learned representation. The analysis indicates that ERM prioritizes dimensions 5-15, providing good separation for  a1, the dominant type in training,


Figure 5: Average feature importance for $\rho = 0.9$, over 10 repetitions. Our VarAUC penalty favors shared features (blocks 1 and 3), while deprioritizing majority features (block 2). All methods assign low weight to noise features (block 4).

but leading to poor separation after the shift. ERM assigns low weights to dimensions beneficial for both types (0-5) and to those suitable for $a_2$ (15-25). In contrast, our algorithm, particularly with the two variance-based penalties, assigns the lowest weights to dimensions corresponding to $a_1$, and higher weights to shared dimensions and to those that effectively separate $a_2$ classes.

In Figure 4, the learning progress is depicted for $\rho = 0.9$ (a similar analysis for $\rho = 0.95$ and $\rho = 0.7$ can be found in Appendix D). Performance on the same distribution as the training data is similar for ERM and our algorithm, suggesting that applying our algorithm does not negatively impact performance when no distribution shift occurs. However, when there is a distribution shift, our algorithm achieves much better results. The VarREx penalty achieves high AUC values more quickly than the VarAUC penalty, but the VarAUC penalty attains a higher AUC overall. IRM shows noisier convergence, since it is applied directly to the gradients, which have been shown to be noisy in contrastive learning due to high variance across data-pair samples [34]. Means and standard deviations are reported in Appendix D.1, as well as results for additional data dimensions, positive proportions, and variance ratios.

5.2 Experiments on Real Data


Figure 6: Average percentage changes of our method compared to ERM across 10 repetitions, shown for the ETHEC (top) and CelebA (bottom) datasets. Error bars represent $\pm$ one standard deviation.

Experiment 1 - Species Recognition We used the ETHEC dataset [11], which contains 47,978 butterfly images from six families and 561 species (examples of the images are provided in Appendix D). We filtered out species with fewer than five images and focused on images of butterflies from the Lycaenidae and Nymphalidae families. In the training set, 10% of the species were from the Nymphalidae family, while at test time, 90% of the species were from the Nymphalidae family. For each class we sampled $2r = 20$ pairs. We trained the models on 200 synthetic environments at a time, each consisting of two classes, and implemented $g$ as a fully connected neural network with layers of sizes 128, 64, 32, and 16, and ReLU activations between them.

Experiment 2 - Face Recognition We used the CelebA dataset [30], which contains 202,599 images of 10,177 celebrities. We filtered out people for whom the dataset contains fewer than three images. Following Vinyals et al. [51], we implemented $g$ as a convolutional neural network consisting of four modules with $3 \times 3$ convolutions and 64 filters, each followed by batch normalization, a ReLU activation, and $2 \times 2$ max-pooling, and a final fully connected layer of size 32. We used the attribute blond hair for the class distribution shift: for training, we mainly sampled people without blond hair (95%), while at test time, most people (95%) had blond hair. Each training iteration had 150 synthetic environments of two classes and $2r = 20$ data points per class.

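A PyTorch sketch of the four-module convolutional embedding described for the CelebA experiment, following the module structure of Vinyals et al. [51]; the input resolution (64×64 RGB crops) is an assumption, so the flattened feature size below is illustrative.

```python
import torch
import torch.nn as nn

def conv_module(in_ch, out_ch):
    """One module: 3x3 convolution with 64 filters, batch norm, ReLU, and 2x2 max-pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class EmbeddingCNN(nn.Module):
    def __init__(self, embedding_dim=32, image_size=64):
        super().__init__()
        self.features = nn.Sequential(
            conv_module(3, 64), conv_module(64, 64),
            conv_module(64, 64), conv_module(64, 64),
        )
        feat_size = image_size // 16                           # four 2x2 poolings
        self.fc = nn.Linear(64 * feat_size * feat_size, embedding_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(start_dim=1))

g = EmbeddingCNN()
embeddings = g(torch.randn(8, 3, 64, 64))                      # -> shape (8, 32)
```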

Experimental Results As can be seen in Figure 6, while all versions of our algorithm show some improvement over ERM, the best results are achieved with the VarAUC penalty (exact means and standard deviations are reported in Table 3 in Appendix D). One-sided paired t-tests show that the improvement over ERM achieved by our algorithm with the VarAUC penalty is statistically significant, with p-values of < 0.04 on both datasets; p-values for other penalties are reported in Table 4. All p-values were adjusted with FDR [4] correction.

In Appendix D we also provide additional analysis confirming that the main improvement of our algorithm over the ERM baseline stems from improved performance on negative minority pairs.

In this study, we examined class distribution shifts in zero-shot learning, with a focus on shifts induced by unknown attributes. Such shifts pose significant challenges in zero-shot learning where new classes emerge in testing, causing standard techniques trained via ERM to fail on shifted class distributions, even when the conditional distribution of the data given class remains the same.

Previous research (see Appendix A) assumes closed-world classification or a known cause, making these methods unsuitable for zero-shot learning or shifts caused by unknown attributes. In response, we introduced a framework and the first algorithm to address class distribution shifts in zero-shot learning using OOD environment balancing methods.

In the causal terminology of closed-world OOD generalization, our framework employs synthetic environments to intervene on attribute mixtures by sampling small class subsets, thereby manipulating the class distribution. This facilitates the creation of diverse environments with varied attribute mixtures, enhancing the distinction between negative examples. A further comparison of our framework with OOD environment balancing methods is provided in Appendix C. Additionally, our proposed VarAUC penalty, designed for metric losses, enhances the separation of negative examples.

Our results demonstrate improvements compared to the ERM baseline on shifted distributions, without compromising performance on unshifted distributions, enabling the learning of more robust representations for zero-shot tasks and ensuring reliable performance.

While the proposed framework is general, our current experiments address shifts in a binary attribute. We defer exploration of additional scenarios, such as those involving shifts in multiple correlated attributes, to future work. An additional promising direction for future work is the consideration of shifts where the responsible attribute is strongly correlated with additional attributes or covariates. This opens up possibilities to explore structured constructions of synthetic environments that leverage such correlations.

[1] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

[2] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 59–66. IEEE, 2018.

[3] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

[4] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289–300, 1995.

[5] Lacey Best-Rowden and Anil K Jain. Longitudinal study of automatic face recognition. IEEE transactions on pattern analysis and machine intelligence, 40(1):148–162, 2017.

[6] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In International conference on machine learning, pages 872–881. PMLR, 2019.

[7] Toon Calders and Szymon Jaroszewicz. Efficient auc optimization for classification. In European conference on principles of data mining and knowledge discovery, pages 42–53. Springer, 2007.

[8] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems, 32, 2019.

[9] Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. Invariant rationalization. In International Conference on Machine Learning, pages 1448–1458. PMLR, 2020.

[10] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9268–9277, 2019.

[11] Ankit Dhall. ETH Entomological Collection (ETHEC) dataset [Palearctic Macrolepidoptera, Spring 2019]. 2019.

[12] John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.

[13] John C Duchi, Peter W Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3): 946–969, 2021.

[14] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.

[15] Patrick Grother, Mei Ngan, Kayee Hanaoka, et al. Ongoing face recognition vendor test (frvt) part 3: Demographic effects. Nat. Inst. Stand. Technol., Gaithersburg, MA, USA, Rep. NISTIR, 8280, 2019.

[16] Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, and Liqing Zhang. Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1921–1929, 2020.

[17] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.

[18] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

[19] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

[20] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International workshop on similarity-based pattern recognition, pages 84–92. Springer, 2015.

[21] Badr Youbi Idrissi, Martin Arjovsky, Mohammad Pezeshki, and David Lopez-Paz. Simple data balancing achieves competitive worst-group-accuracy. In Conference on Causal Learning and Reasoning, pages 336–351. PMLR, 2022.

[22] Aparna R Joshi, Xavier Suau Cuadros, Nivedha Sivakumar, Luca Zappella, and Nicholas Apostoloff. Fair sa: Sensitivity analysis for fairness in face recognition. In Algorithmic fairness through the lens of causality and robustness workshop, pages 40–58. PMLR, 2022.

[23] Brendan F Klare, Mark J Burge, Joshua C Klontz, Richard W Vorder Bruegge, and Anil K Jain. Face recognition performance: Role of demographic information. IEEE Transactions on information forensics and security, 7(6):1789–1801, 2012.

[24] Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. Siamese neural networks for one-shot image recognition. In International Conference on Machine Learning (ICML) deep learning workshop, volume 2, 2015.

[25] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). In International Conference on Machine Learning, pages 5815–5826. PMLR, 2021.

[26] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2805–2814. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/kumar18a.html.

[27] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. Zero-data learning of new tasks. In AAAI, volume 1, page 3, 2008.

[28] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.

[29] Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In International Conference on Machine Learning, pages 6781–6792. PMLR, 2021.

[30] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

[31] Arakaparampil M Mathai and Serge B Provost. Quadratic forms in random variables: theory and applications. Marcel Dekker, 1992.

[32] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In International Conference on Learning Representations, 2020.

[33] Dana Michalski, Sau Yee Yiu, and Chris Malec. The impact of age and threshold variation on facial recognition algorithm performance using images of children. In 2018 International Conference on Biometrics (ICB), pages 217–224. IEEE, 2018.

[34] Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pages 681–699. Springer, 2020.

[35] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4004–4012, 2016.

[36] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947–1012, 2016.

[37] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.

[38] Vihari Piratla, Praneeth Netrapalli, and Sunita Sarawagi. Focus on the common good: Group distributional robustness follows. In International Conference on Learning Representations, 2021.

[39] Inioluwa Deborah Raji and Joy Buolamwini. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial ai products. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 429–435, 2019.

[40] Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, et al. Balanced meta-softmax for long-tailed visual recognition. Advances in neural information processing systems, 33:4175–4186, 2020.

[41] Joseph P Robinson, Gennady Livitz, Yann Henon, Can Qin, Yun Fu, and Samson Timoner. Face recognition: too bias, or not too bias? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–1, 2020.

[42] Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 19(1):1309–1342, 2018.

[43] Karsten Roth, Timo Milbich, Samarth Sinha, Prateek Gupta, Bjorn Ommer, and Joseph Paul Cohen. Revisiting training strategies and generalization performance in deep metric learning. In International Conference on Machine Learning, pages 8242–8252. PMLR, 2020.

[44] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.

[45] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2019.

[46] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.

[47] Tomáš Sixta, Julio CS Jacques Junior, Pau Buch-Cardona, Eduard Vazquez, and Sergio Escalera. Fairface challenge at eccv 2020: Analyzing bias in face recognition. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 463–481. Springer, 2020.

[48] Yuli Slavutsky and Yuval Benjamini. Predicting classification accuracy when adding new unobserved classes. In International Conference on Learning Representations, ICLR, Conference Track Proceedings, 2021.

[49] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems, 29, 2016.

[50] Nisha Srinivas, Karl Ricanek, Dana Michalski, David S Bolme, and Michael King. Face recognition algorithm bias: Performance differences on images of children and adults. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.

[51] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.

[52] Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. Advances in neural information processing systems, 34:2215–2227, 2021.

[53] Mei Wang and Weihong Deng. Mitigating bias in face recognition using skewness-aware reinforcement learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9322–9331, 2020.

[54] Mei Wang, Weihong Deng, Jiani Hu, Xunqiang Tao, and Yaohai Huang. Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In Proceedings of the ieee/cvf international conference on computer vision, pages 692–702, 2019.

[55] Jiaheng Wei, Harikrishna Narasimhan, Ehsan Amid, Wen-Sheng Chu, Yang Liu, and Abhishek Kumar. Distributionally robust post-hoc classifiers under prior shifts. In International Conference on Learning Representations (ICLR), 2023.

[56] Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE international conference on computer vision, pages 2840–2848, 2017.

[57] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero-and few-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8256–8265, 2019.

[58] Tongtong Yuan, Weihong Deng, Jian Tang, Yinan Tang, and Binghui Chen. Signal-to-noise ratio: A robust distance metric for deep metric learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4815–4824, 2019.

[59] Charles Zheng, Rakesh Achanta, and Yuval Benjamini. Extrapolating expected accuracies for large multi-class problems. The Journal of Machine Learning Research, 19(1):2609–2638, 2018.

[60] Zhisheng Zhong, Jiequan Cui, Shu Liu, and Jiaya Jia. Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16489–16498, 2021.

In class-imbalanced learning [28, 10, 8] it is assumed that some classes are more dominant in training, while in deployment this is no longer the case. Therefore, solutions classically include data or loss re-weighting [46, 6, 40, 32] and calibration of the classification score [44, 60]. A popular framework for addressing class distribution shifts is distributionally robust optimization (DRO) [3, 13, 12, 55], where instead of assuming a specific probability distribution, a set or range of possible distributions is considered, and optimization is performed to achieve the best results on the worst-case distribution. A special case known as group DRO [45, 38], involves a group variable that introduces discriminatory patterns among classes within specific groups. The framework to address this includes methods that assume that the classifier does not have access to the group information, and therefore propose re-weighting high loss examples [29], and data sub-sampling to balance classes and groups [21]. Nevertheless, the methods mentioned above rely on the training and test class sets being identical, making them unsuitable for direct application in zero-shot learning scenarios.

B.1 Derivation of the Loss

We begin by revisiting the parametric model introduced in §3. Let $z_i \mid c_i \sim \mathcal{N}(c_i, \Sigma_z)$, where $\Sigma_z = \nu_z I_d$, $0 < \nu_z \in \mathbb{R}$, and $I_d$ is the $d$-dimensional identity matrix. Classes $c_i$ are drawn according to a Gaussian distribution $c_i \sim \mathcal{N}(0, \Sigma_a)$ corresponding to their type $a \in \{a_1, a_2\}$. Here, we use a simpler (although less intuitive) notation for the values of the diagonal matrices $\Sigma_a$:

$$\Sigma_{a_1} = \operatorname{diag}\big(\underbrace{\nu_0, \ldots, \nu_0}_{d_0},\ \underbrace{\nu_1, \ldots, \nu_1}_{d_1},\ \underbrace{\nu_2, \ldots, \nu_2}_{d_2}\big), \qquad \Sigma_{a_2} = \operatorname{diag}\big(\underbrace{\nu_0, \ldots, \nu_0}_{d_0},\ \underbrace{\nu_2, \ldots, \nu_2}_{d_1},\ \underbrace{\nu_1, \ldots, \nu_1}_{d_2}\big),$$

where  0 < ν2 < νz < ν0 < ν1.

We consider weight representations $g(z) = Wz$, where $W$ is a diagonal matrix with diagonal $w \in \mathbb{R}^d$.

Since  Σzis of full rank, it suffices to consider the no-hinge version of the contrastive loss, that is

image

where $d_g(z_i, z_j) := \|g(z_i - z_j)\|^2 = \|W(z_i - z_j)\|^2$ ($\|\cdot\|$ denotes the Euclidean norm).

For a balanced sample of positive and negative examples, the expected loss is given by

image

To calculate the expression above, we begin by proving the following lemma:

Lemma 1. Let  µ ∈ Rd be a random variable and let  t|µ ∼ N(µ, Σ). If µ ≡ 0(constant), then

1. $\mathbb{E}\|t\|^2 = \operatorname{tr}(\Sigma)$ and $\mathbb{E}\|t\|^4 = 2\operatorname{tr}(\Sigma^2) + \operatorname{tr}^2(\Sigma)$.

If  µ ∼ N(0, Σµ), then

2. $\mathbb{E}\|t\|^2 = \operatorname{tr}(\Sigma) + \operatorname{tr}(\Sigma_\mu)$,

3. $\mathbb{E}\|t\|^4 = 2\operatorname{tr}(\Sigma^2) + 4\operatorname{tr}(\Sigma\Sigma_\mu) + \operatorname{tr}^2(\Sigma) + 2\operatorname{tr}(\Sigma)\operatorname{tr}(\Sigma_\mu) + 2\operatorname{tr}(\Sigma_\mu^2) + \operatorname{tr}^2(\Sigma_\mu)$.

Proof. For any random variable  u ∈ Rd, such that  u ∼ N(µu, Σu), and any symmetric matrix A, we have

$$\mathbb{E}\left[u^T A u\right] = \operatorname{tr}(A\Sigma_u) + \mu_u^T A \mu_u, \tag{14}$$
$$\mathbb{E}\left[\left(u^T A u\right)^2\right] = 2\operatorname{tr}\left((A\Sigma_u)^2\right) + 4\,\mu_u^T A \Sigma_u A \mu_u + \left(\operatorname{tr}(A\Sigma_u) + \mu_u^T A \mu_u\right)^2 \tag{15}$$

(see, for example, Thm. 3.2b.2 in [31]).

First, letting  µu = 0, Σu = Σ and A = Idin (15) we get

image

Now, assume that  µ ∼ N(0, Σµ). From (14) we get  Eµ ∥µ∥2 = Eµ[µT µ] = tr(Σµ), and thus

image

Similarly, from (15) we have

image

By substituting  A = Σin (14) we get  Eµ[µT Σµ] = tr(ΣΣµ), and from (15) we have

image

Therefore,

image

Note that $W(z_i - z_j) \sim \mathcal{N}(\mu, \Sigma)$, with $\mu = W(c_i - c_j)$ and $\Sigma = 2\nu_z W^T W$.

If  yij = 1, then  ziand  zjare from the same class, meaning that  ci = cjand thus  µ = 0. Therefore, by Lemma 1.(1) we have

image

However, for pairs from different classes, that is, when  yij = 0, the mean  µis itself a Gaussian random variable distributed according to  N (0, Σµ), where

image

Therefore, by Lemma 1.(2) we have

image

where

image

By Lemma 1.(3) we have

image

where

image

and so

image

and similarly

image

where we denote for short

image

Finally, due to symmetry, at the optimal solution we have

image

and by combining these results, we get

image

B.2 Analysis of the Optimal Solution (Proof of Proposition 1)

Proposition 1 shows that when $d_1$ and $d_2$ are relatively similar, the optimal solution on the training distribution assigns more weight to components with high variance in the training data than to those with high variance under the shifted test distribution.

We begin by defining the required condition on  d1 and d2. Denote

image

Then for  α, β, γvalues as in equations 24 and 31, we define

image

and the corresponding condition

image

While this condition is sufficient, it is not necessary. Values of $\rho, \nu_z, \nu_0, \nu_1, \nu_2$ and $d_1, d_2$ that satisfy Equation 35 provide an example of the failure of optimization over the training distribution that requires only a simple analysis, without a full characterization of the optimal solution. However, such failures can occur for additional parameter values, and the full characterization is provided in Appendix B.3.

Proof. Without loss of generality assume m=1. Then,

image

and by setting the partial derivative to zero we get

image

Therefore,

image

and similarly

image

Hence,

image

Denoting

image

we have

image

and therefore

image

Denote

image

and thus

image

Note that for  d1, d2 such that

image

we have  ∆ > 0. Additionally, note that

image

and thus  1 − 1ξ (2 + d1) ψ1η12 > 0 iff

image

Combining these conditions reduces to

image

and therefore, for  νz, ν0, ν1, ν2, d1, d2satisfying

image

we have $d_1 u_1^2 - d_2 u_2^2 > 0$.

B.3 Explicit Expression for the Optimal Representation

In order to derive the optimal representation, we differentiate the expected loss with respect to the squared values on the diagonal of $W$, that is, $w_i^2$:

image

Combining these results, we get for  1 ≤ i ≤ d0

image

for d0 + 1 ≤ i ≤ d0 + d1

image

and similarly for  d0 + d1 + 1 ≤ i ≤ d

image

Thus, we can write for the symmetric solution

image

where

image

Therefore, the optimal representation is given by the solution to the following set of linear equations:

image

Table 1: Simulation results. For each mixture ratio we report the mean AUC and the standard deviation across 10 repetitions of the experiment. Results are reported for the in-distribution scenario ($P_C$) and the class distribution shift ($Q_C$). The best result is marked in bold.


where

image

Previous methods in the field of OOD generalization (see §2) exhibit several key differences compared to our setting: (i) They address closed-world classification, whereas in zero-shot learning, new classes are encountered. (ii) The presumed shift is typically in the conditional distribution of the data given the class (e.g., the background given the class being a cow or a camel), whereas we consider shifts in the class distribution P(c). (iii) Existing methods often assume that training data comes from various data environments, providing explicit information about how the distribution might shift, while we assume the attribute A causing the shift is unknown.

Despite these differences, in this work we recast class distribution shifts in zero-shot learning into the environment balancing OOD setting, based on the following observations. First, when posed as verification methods, zero-shot classifiers in fact perform a binary (closed-world) classification task, predicting whether a pair of data points $x_{ij} := (z_i, z_j)$ belongs to the same class, $y_{ij} = \mathbb{1}_{c_i = c_j}$.

Note that the distribution of possible pairs  xij = (zi, zj)given the label  yijchanges with variations in class attribute probabilities, and therefore across synthetic environments S. Thus, in this formulation the shift occurs in the conditional distribution of the data given the class  p(xij|yij).

Another distinction lies in data availability: in the setting of closed-world OOD environment balancing methods, a main drawback is the challenge of securing a sufficient number of diverse training environments. This is essential to ensure that a representation performing well on observed environments, will likely perform similarly on unobserved ones. In contrast, our framework allows for the construction of many synthetic environments via sampling.

D.1 Additional Simulation Results

Simulations Exact means and standard deviations matching Figure 4 are provided in Table 1.

AUC progress during training iterations and feature importance results for the majority class proportion of  ρ = 0.1were shown in the main text. Here, we provide analogous results for  ρ = 0.05and ρ = 0.3. These are summarized in Figure 8.


Figure 7: Additional simulation results. Top row: additional dimensions of the representation. Middle row: additional ratios of the attribute variances. Bottom row: unbalanced sets of positive and negative examples. Bars show mean AUC values on the test set across 5 repetitions of the experiment; whiskers show $\pm$ one standard deviation.

For $\rho = 0.05$ the convergence results are similar to those obtained for $\rho = 0.1$: under distribution shift, the two variance-based methods show significantly better results than the other approaches. Our algorithm with the VarREx penalty achieves high AUC values more quickly than with the VarAUC penalty, but the VarAUC penalty attains a higher AUC overall. The CLOvE penalty achieves an improvement over ERM, but a smaller one than the variance-based methods. IRM converges to the same AUC as ERM. In contrast, on in-distribution data all methods perform well.

For $\rho = 0.3$ the distribution shift is milder and therefore ERM performs very well (an AUC of 0.902 is achieved in the distribution shift scenario, compared to 0.932 in the in-distribution setting). In this case, encouraging similar performance across different data subsets does not benefit the learning process. A slightly better result is achieved with the VarREx penalty (0.911).

The analysis of feature importance for  ρ = 0.05yields results similar to those for  ρ = 0.1. At ρ = 0.3the analysis remains mostly unchanged, except that VarREx assigns higher importance to features corresponding to  ν0(0-5) compared to VarAUC, while in more extreme distribution shifts VarAUC assigns higher importance to the shared features.

D.2 Additional Representation Sizes, Noise Ratios and Positive Proportions

In §5.1 we explored varying values of $\rho$ in a setting where $\nu_+ = 2$, $\nu_- = 0.1$ ($\nu_+/\nu_- = 20$). We now focus on the case of $\rho = 0.1$ and examine additional representation sizes $p$ and noise ratios ($\nu_+/\nu_- \in \{10, 40\}$). Additionally, we examine the original setting, where $p = 16$ and $\nu_+ = 2$, $\nu_- = 0.1$, with varying proportions of positive and negative examples.

The results in Figure 7 show that in all the additional settings our method provides a statistically significant improvement over the baseline. FDR-adjusted p-values for multiple comparisons are provided in Table 2.

Table 2: FDR adjusted p-values for the results reported in Figure 7


Experiments In Table 3, we provide the means and standard deviations for the experiments detailed in §5.2. Additionally, Table 4 presents the adjusted p-values for assessing the performance increase over the ERM baseline achieved by our algorithm with the explored penalties.

Table 3: Experimental results. The mean and standard deviation of AUC values over 5 repetitions are reported for the in-distribution scenario ($P_C$) and the class distribution shift ($Q_C$). The best result is marked in bold.


Table 4: Adjusted p-values for one-sided paired t-tests for testing the improvements over the ERM baseline.


D.3 Analysis of Loss Values

Here we present an analysis of the unpenalized loss after convergence in both real-data experiments. We performed separate analyses on pairs of data points from the type dominant during training (majority) and those from the other type (minority). Additionally, we separated positive pairs ($y = 1$) and negative pairs ($y = 0$). Figure 9 displays histograms illustrating the differences between the losses on the training set obtained with the representation learned using ERM ($g_{\mathrm{ERM}}$) and those obtained using our algorithm with the VarAUC penalty ($g_{\mathrm{VarAUC}}$):

image

Positive values of the differences correspond to higher losses for ERM.

In both experiments, when examining negative pairs from the minority type, shown in the top-left histograms, most of the observed differences are positive. This indicates that the ERM losses for these pairs are higher than the losses obtained with the representation trained with the VarAUC penalty. The disparities are smaller for the other three groups: majority negative pairs, minority positive pairs, and majority positive pairs. Among these groups, ERM performs better on positive pairs.
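The grouping behind these histograms can be sketched as follows; the per-pair losses and the type and label indicators are assumed to be precomputed, and all names are illustrative.

```python
import numpy as np

def grouped_loss_differences(loss_erm, loss_varauc, is_minority, y):
    """Differences (ERM minus VarAUC) of per-pair losses, split into the four groups of Figure 9.

    loss_erm, loss_varauc: per-pair unpenalized losses under the two learned representations.
    is_minority: 1 if the pair comes from the minority type, 0 otherwise.
    y: pair label (1 = same class, 0 = different classes).
    Positive differences mean the ERM loss is higher.
    """
    diff = np.asarray(loss_erm) - np.asarray(loss_varauc)
    groups = {}
    for type_name, type_val in (("minority", 1), ("majority", 0)):
        for label in (0, 1):
            mask = (np.asarray(is_minority) == type_val) & (np.asarray(y) == label)
            groups[(type_name, label)] = diff[mask]  # one histogram panel per group
    return groups
```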

image

Figure 8: Additional Simulation Results. Top row:  ρ = 0.05, Bottom row:  ρ = 0.3. Left: Average AUC progress over 10 repetitions of the simulation. Solid lines correspond to performance on test data (distribution shift scenario), dashed lines show performance on data sampled from the same distribution as training data (in-distribution scenario). Right: Average feature importance results over 10 repetitions.

image

Figure 9: Analysis of Loss Differences. Histograms of differences between ERM and our algorithm with VarAUC penalty are shown for two experiments in separate sub-figures: (a) CelebA dataset, (b) ETHEC dataset. The top rows show differences for negative pairs (y = 0), bottom ones show differences for positive pairs (y = 1). In each sub-figure the left column corresponds to the minority type and right one to the majority. A dotted black line marks a difference of 0. Positive values correspond to higher losses for ERM.

image

Figure 10: Sample Images from the CelebA Dataset. Top: a random sample of the training data with 95% non-blond people. Bottom: a random sample of the test data with 95% blond people.

image

Figure 11: Sample Images from the ETHEC Dataset. Top: a sample of the training data – 9 species of the Lycaenidae family and 1 from the Nymphalidae family. Bottom: a sample of the test data where the proportion of the families is reversed. Nymphalidae species names are marked in bold.

A link to a permanent repository with code to reproduce our results is included in the main text.

The data-related parameters of our experiments are described in the main text. In all our experiments we used a margin of m = 0.5 for the contrastive loss and the Adam optimizer (Kingma & Ba, 2014) to train all models.
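To make the training objective concrete, the following is a minimal sketch of a margin-based contrastive loss with m = 0.5 in TensorFlow (the framework used below). Only the margin value and the use of a contrastive loss with the Adam optimizer are stated in the text; the Euclidean distance and the squared hinge on negative pairs are assumptions of this sketch.

```python
import tensorflow as tf

def contrastive_loss(r_i, r_j, y, margin=0.5):
    """Margin-based contrastive loss on pairs of representations (sketch)."""
    y = tf.cast(y, r_i.dtype)                                  # 1 = same class, 0 = different class
    d = tf.norm(r_i - r_j, axis=-1)                            # Euclidean distance between the two representations
    pos = y * tf.square(d)                                     # pull same-class pairs together
    neg = (1.0 - y) * tf.square(tf.maximum(margin - d, 0.0))   # push different-class pairs beyond the margin
    return tf.reduce_mean(pos + neg)

# optimizer = tf.keras.optimizers.Adam(learning_rate=...)  # learning rates as reported in Table 5
```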

For the CLOvE penalty we used a Laplacian kernel k(r, r′) = exp(−|r − r′| / width) with a width of 0.4, as originally suggested by Kumar et al. (2018).
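As a reference, a direct transcription of this kernel is sketched below; applying it to pairwise differences of scalar prediction scores is an assumption of the sketch, not a description of the exact CLOvE implementation.

```python
import tensorflow as tf

def laplacian_kernel_matrix(scores, width=0.4):
    """Pairwise Laplacian kernel K[a, b] = exp(-|scores[a] - scores[b]| / width)."""
    diffs = tf.abs(scores[:, None] - scores[None, :])  # pairwise absolute differences
    return tf.exp(-diffs / width)
```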

For the optimization of the VarAUC objective, we disregard the finite-sample correction (Ns − n)/(Ns − 1) in the implementation, since n is very small compared to Ns. In practice, we minimize the standard deviation instead of the variance in both variance-based penalties, and the hyperparameters are reported accordingly.
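A minimal sketch of the resulting environment-balancing penalty under these simplifications is given below; the choice of per-environment statistic (mean loss for VarREx, an AUC estimate for VarAUC) and the name reg_factor are illustrative placeholders.

```python
import tensorflow as tf

def environment_balancing_penalty(per_env_stats):
    """Standard deviation of a per-environment statistic (sketch).

    per_env_stats: 1-D tensor with one value per synthetic environment,
    e.g. the mean contrastive loss (VarREx) or an AUC estimate (VarAUC).
    The finite-sample correction is dropped, as described above.
    """
    return tf.math.reduce_std(per_env_stats)

# Sketch of the penalized objective (reg_factor is a placeholder name):
# total_loss = tf.reduce_mean(per_env_losses) + reg_factor * environment_balancing_penalty(per_env_stats)
```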

In our scenario, where the attribute of interest is unknown, we generated a synthetic attribute for hyperparameter selection using principal components (PC). We ranked examples by their first principal component scores, classifying the top 10% as positive and the rest as negative. Hyperparameters for all methods were chosen via grid search in a single experiment repetition, ensuring robustness against this synthetic attribute. Notably, the experiments themselves did not involve the PC attribute; instead, they focused on dimension swapping in the simulations and on attributes such as hair color or species family in the CelebA and ETHEC experiments.
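The synthetic-attribute construction can be sketched as follows; the use of scikit-learn's PCA, the sign convention of the first component, and whether it is computed on raw inputs or intermediate representations are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def synthetic_pc_attribute(X, top_fraction=0.10):
    """Label the top 10% of examples by first-principal-component score as positive (sketch)."""
    pc1 = PCA(n_components=1).fit_transform(X).ravel()  # first principal component scores
    threshold = np.quantile(pc1, 1.0 - top_fraction)    # cut-off for the top 10%
    return (pc1 >= threshold).astype(int)               # 1 = synthetic positive, 0 = negative
```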

The grid search produced almost identical hyperparameters for all three ρ values. We observed that performance converged to the same value when employing hyperparameters derived from cross-validation for one ρ value as when using those selected for another. Therefore, for simplicity, we repeated the simulations using the same hyperparameters, determined from the grid-search results for ρ = 0.1 (the intermediate value). Similarly, minimal differences in optimal learning rates were observed among the methods within an experiment, and therefore a shared learning rate was used for each experiment. To emphasize the improvement of OOD methods over the ERM baseline, we used the learning rate optimized for the ERM method. Large differences were observed in the optimal regularization factors, and therefore these parameters (as well as method-specific parameters) were not shared. All hyperparameters are reported in Table 5.

All models were initialized with identical weights, and trained on identical data splits.

All the code in this work was implemented in Python 3.10. We used the TensorFlow 2.13 and TensorFlow Addons 0.21 packages. For evaluation we used the auc function from scikit-learn 1.2. The CelebA dataset was loaded through TensorFlow Datasets 4.9, and pandas 1.5 was used to process the ETHEC dataset. Statistical tests were performed using the ttest_rel and false_discovery_control functions from scipy.stats 1.11.4. All figures were generated using Matplotlib 3.7.
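For concreteness, the significance-testing procedure with these two functions can be sketched as follows; the argument names and the dictionary layout are illustrative.

```python
from scipy.stats import ttest_rel, false_discovery_control

def fdr_adjusted_pvalues(auc_baseline, auc_by_method):
    """One-sided paired t-tests against the ERM baseline, FDR-adjusted (sketch).

    auc_baseline: AUC values of ERM over repetitions.
    auc_by_method: dict mapping method name -> AUC values paired by repetition.
    """
    methods = list(auc_by_method)
    raw = [ttest_rel(auc_by_method[m], auc_baseline, alternative="greater").pvalue
           for m in methods]                             # H1: method AUC > ERM AUC
    adjusted = false_discovery_control(raw)              # Benjamini-Hochberg adjustment by default
    return dict(zip(methods, adjusted))
```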

The IRM implementation was adapted from the source code of the paper, available at https://github.com/facebookresearch/InvariantRiskMinimization.

We ran all experiments on a single A100 cloud GPU. For simulations, each full repetition of the experiment (comparing all methods) required on average 2.06 hours. Each repetition on the ETHEC dataset took 7.38 hours on average, and on the CelebA dataset 11.52 hours.

Table 5: Hyperparameters.

image

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: Both the abstract and the introduction (last two paragraphs) accurately state the main contributions of the paper. Guidelines:

• The answer NA means that the abstract and introduction do not include the claims made in the paper.

• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Main limitations are discussed in §6. Guidelines:

• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

• The authors are encouraged to create a separate "Limitations" section in their paper.

• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes]

Justification: All assumptions are clearly stated, and full proofs of the theoretical claims appear in Appendix B. Guidelines:

• The answer NA means that the paper does not include theoretical results.

• All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

• All assumptions should be clearly stated or referenced in the statement of any theorems.

• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

• Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: All the details are provided in Section §5 and Appendix F. Guidelines:

• The answer NA means that the paper does not include experiments.

• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: The datasets are publicly available and code implementing all our results is submitted with the paper.

Guidelines:

• The answer NA means that the paper does not include experiments requiring code.

• Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

• While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

• The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyper-parameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes] Justification: We specify all hyperparameters and training details in Appendix F. Guidelines:

• The answer NA means that the paper does not include experiments.

• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

• The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We perform statistical testing to provide significance of our results and report FDR adjusted p-values. For all reported results we include either error bars in the main text, or when other visualizations are chosen we report means and standard deviations in Appendix D.

Guidelines:

• The answer NA means that the paper does not include experiments.

• The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

• The assumptions made should be given (e.g., Normally distributed errors).

• It should be clear whether the error bar is the standard deviation or the standard error of the mean.

• It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

• If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: All resources including GPU information and run times are provided in Appendix F. Guidelines:

• The answer NA means that the paper does not include experiments.

• The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: The paper conforms with the provided code of ethics. Guidelines:

• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA]

Justification: This paper presents work whose goal is to advance the field of learning robust data representations. It is not tied to any particular applications and therefore we do not see an immediate risk for negative societal impact.

Guidelines:

• The answer NA means that there is no societal impact of the work performed.

• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA] Justification: We do not release any new data or models. Guidelines:

• The answer NA means that the paper poses no such risks.

• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

• Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: The only asset used is the IRM implementation. The corresponding paper is cited and we explicitly mention this in Appendix F, while also providing a reference to the code itself.

Guidelines:

• The answer NA means that the paper does not use existing assets.

• The authors should cite the original paper that produced the code package or dataset.

• The authors should state which version of the asset is used and, if possible, include a URL.

• The name of the license (e.g., CC-BY 4.0) should be included for each asset.

• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

• If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA] Justification: We do not release new assets. Guidelines:

• The answer NA means that the paper does not release new assets.

• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

• The paper should discuss whether and how consent was obtained from people whose asset is used.

• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA] Justification: The paper does not involve human subjects and did not use crowd sourcing. Guidelines:

• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA] Justification: The paper does not involve human subjects and did not use crowd sourcing. Guidelines:

• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
