An Investigation of Why Overparameterization Exacerbates Spurious Correlations

2020·arXiv

Abstract

1. Introduction

The typical goal in machine learning is to minimize the average error on a test set that is independent and identically distributed (i.i.d.) to the training set. A large body of prior work has shown that overparameterization—increasing model size beyond the point of zero training error—improves average test error in a variety of settings, both empirically (with neural networks, e.g., Nakkiran et al. (2019)) and theoretically (with linear and random projection models, e.g., Belkin et al. (2019); Mei & Montanari (2019)).

However, recent work has also demonstrated that models with low average error can still fail on particular groups of data points (Blodgett et al., 2016; Hashimoto et al., 2018; Buolamwini & Gebru, 2018). This problem of high worst-group error arises especially in the presence of spurious correlations, such as strong associations between label and background in image classification (McCoy et al., 2019; Sagawa et al., 2020). To mitigate this problem, common approaches reduce the worst-group training loss, e.g., through distributionally robust optimization (DRO) or simply upweighting the minority groups. Sagawa et al. (2020) showed these approaches improve worst-group error on strongly regularized neural networks but fail to help standard neural networks that can achieve zero training error, suggesting that increasing model capacity by reducing regularization (and perhaps by increasing overparameterization as well) can exacerbate spurious correlations.

*Equal contribution. 1Stanford University. Correspondence to: Shiori Sagawa <ssagawa@cs.stanford.edu>, Aditi Raghunathan <aditir@stanford.edu>, Pang Wei Koh <pangwei@cs.stanford.edu>.

Proceedings of the International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Figure 1. Top: Overparameterization hurts test error on the worst group when models are trained with the reweighted objective that upweights minority groups (Equation 3). Without reweighting, models have poor worst-group error regardless of model size (Appendix A.1). Bottom: Consider data points that comprise a core feature (x-axis) and a spurious feature (y-axis). The label y is highly correlated with the spurious feature, except on two minority groups (crosses). Underparameterized models use the core feature (left), but overparameterized models use the spurious feature and memorize the minority points (right).

In this paper, we investigate why overparameterization exacerbates spurious correlations under the above approach of upweighting minority groups. We first confirm on two image datasets (Figure 2) that directly increasing overparameterization (i.e., increasing model size) indeed hurts worst-group error, leading to models that are highly inaccurate on the minority groups where the spurious correlation does not hold (Section 3). In contrast, their underparameterized counterparts obtain much better worst-group error, but do worse on average. We also confirm that models trained via empirical risk minimization (i.e., without upweighting the minority) have poor worst-group test error regardless of whether they are under- or overparameterized. Through simulations on a synthetic setting, we further identify two properties of the training data that modulate the effect of overparameterization: (i) the relative sizes of the majority versus minority groups, and (ii) how informative the spurious features are relative to the core features (Section 4).

Figure 2. We consider two image datasets, CelebA and Waterbirds, where the label y is correlated with a spurious attribute a in a majority of the training data. The % beside each group shows its frequency in the training data. To measure how robust a model is to the spurious attribute, we divide the data into groups based on (y, a) and record the highest error incurred by a group. Figure adapted from Sagawa et al. (2020).

Why does overparameterization exacerbate spurious correlations? Underparameterized models do not rely on spurious features because that would incur high training error on the (upweighted) minority groups where the spurious correlation does not hold. In contrast, overparameterized models can always obtain zero training error by memorizing training examples, and instead rely on their inductive bias to pick a solution (which features to use and which examples to memorize) out of all solutions with zero training error. Our results suggest an intuitive story of why overparameterization can hurt: overparameterized models can have an inductive bias towards “memorizing” fewer examples (Figure 1). If (i) the majority groups are sufficiently large and (ii) the spurious features are more informative than the core features for these groups, then overparameterized models could choose to use the spurious features because it entails less memorization, and therefore suffer high worst-group test error. We test this intuition through simulations and formalize it in a theoretical analysis (Section 5).

Our analysis also leads to the counterintuitive result that on overparameterized models, subsampling the majority groups is much more effective at improving worst-group error than upweighting the minority groups. Indeed, an overparameterized model trained on a subset of <5% of the data performs similarly (on average and on the worst group) to an underparameterized model trained on all the data (Section 6). This suggests a possible tension between using overparameterized models and using all the data; average error benefits from both, but improving worst-group error seems to rely on using only one but not both.
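The contrast between subsampling and upweighting can be illustrated with a small sketch. The helper below, `subsample_to_smallest_group` (a hypothetical name), discards majority examples at random until every group has as many examples as the smallest one. This assumes that balancing to the smallest group's size is the intended subsampling scheme, which this section does not spell out, and that groups are given as integer ids.

```python
import numpy as np

def subsample_to_smallest_group(groups, seed=0):
    """Return sorted indices that keep an equal number of examples per group.

    Assumption: every group is reduced to the size of the smallest group.
    """
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    ids, counts = np.unique(groups, return_counts=True)
    k = counts.min()  # size of the smallest group
    keep = [
        rng.choice(np.flatnonzero(groups == g), size=k, replace=False)
        for g in ids
    ]
    return np.sort(np.concatenate(keep))
```

Training on `X[idx], y[idx]` with `idx = subsample_to_smallest_group(groups)` then uses the ordinary average loss, since all groups already carry equal weight.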

2. Setup

Spurious correlation setup. We adopt the setting studied in Sagawa et al. (2020), where each example comprises the input features x, a label (core attribute) $y \in \mathcal{Y}$, and a spurious attribute $a \in \mathcal{A}$. Each example belongs to a group $g \in \mathcal{G}$, where g = (y, a). Importantly, the spurious attribute a is correlated with the label y in the training set. We focus on the binary setting in which the label and the spurious attribute each take two values, giving four groups.

Applications. We study two image classification tasks (Figure 2). In the first task, the label is spuriously correlated with demographics: specifically, we use the CelebA dataset (Liu et al., 2015) to classify hair color between the labels Y = {blonde, non-blonde}, which are correlated with the gender A = {female, male}. In the second task, the label is spuriously correlated with image background. We use the Waterbirds dataset (based on datasets from Wah et al. (2011); Zhou et al. (2017) and modified by Sagawa et al. (2020)) to classify between the labels Y = {waterbird, landbird}, which are spuriously correlated with the image background A = {water background, land background}. See Appendix A.5 for more dataset details.

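Objectives and metrics. We evaluate a model w by its worst-group error,

$$\text{Err}_{\text{wg}}(w) := \max_{g \in \mathcal{G}} \; \mathbb{E}_{x, y \mid g}\left[\ell_{0\text{-}1}(w; (x, y))\right], \tag{1}$$

where $\ell_{0\text{-}1}$ is the 0-1 loss. In other words, we measure the error (% of examples that are incorrectly labeled) in each group, and then record the highest error across all groups. The standard approach to training models is empirical risk minimization (ERM): given a loss function $\ell$, find the model w that minimizes the average training loss

$$\hat{\mathcal{R}}_{\text{ERM}}(w) = \hat{\mathbb{E}}_{(x, y, g)}\left[\ell(w; (x, y))\right]. \tag{2}$$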

However, in line with Sagawa et al. (2020), we find that models trained via ERM have poor worst-group test error regardless of whether they are under- or overparameterized (Appendix A.1). To achieve low worst-group test error, prior work proposed modified objectives that focus on the worst-group loss, such as group distributionally robust optimization (group DRO), which directly optimizes for the worst-group training loss (Hu et al., 2018; Sagawa et al., 2020), or reweighting (Shimodaira, 2000; Byrd & Lipton, 2019). Sagawa et al. (2020) showed that both approaches can help worst-group loss, though group DRO is typically more effective. For simplicity, we focus on the well-studied reweighting approach, which optimizes

$$\hat{\mathcal{R}}_{\text{reweight}}(w) = \hat{\mathbb{E}}_{(x, y, g)}\left[\frac{1}{\hat{p}_g}\,\ell(w; (x, y))\right], \tag{3}$$

where $\hat{p}_g$ is the fraction of training examples in group g. The intuition behind reweighting is that it makes each group contribute the same weight to the training objective: that is, minority groups are upweighted, while majority groups are downweighted. Note that this approach requires the groups g to be specified at training time, though not at test time.
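The metric and the reweighted objective described above can be sketched in a few lines of NumPy. The function names `worst_group_error` and `reweighted_loss` are illustrative, not from the paper, and the sketch assumes that groups are encoded as integer ids and that each example is weighted by the inverse of its group's training fraction.

```python
import numpy as np

def worst_group_error(y_true, y_pred, groups):
    """Return (worst error, per-group errors) under the 0-1 loss."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        per_group[int(g)] = float(np.mean(y_true[mask] != y_pred[mask]))
    return max(per_group.values()), per_group

def reweighted_loss(losses, groups):
    """Average of per-example losses, each divided by its group's fraction."""
    losses = np.asarray(losses, dtype=float)
    groups = np.asarray(groups)
    _, inverse, counts = np.unique(
        groups, return_inverse=True, return_counts=True
    )
    p_hat = counts / len(groups)  # fraction of training examples per group
    return float(np.mean(losses / p_hat[inverse]))
```

With these weights the objective equals the sum over groups of each group's mean loss, which is one way to see that every group contributes equally regardless of its size.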

3. Overparameterization hurts worst-group error

Sagawa et al. (2020) observed that decreasing regularization hurts worst-group error. Though increasing overparameterization and reducing regularization can have different effects (Zhang et al., 2017; Mei & Montanari, 2019), this suggests that overparameterization might similarly exacerbate spurious correlations. Here, we show that directly increasing overparameterization (model size) indeed hurts worst-group error even though it improves average error.

Models. We study the CelebA and Waterbirds datasets described above. For CelebA, we train a ResNet10 model (He et al., 2016), varying model size by increasing the network width from 1 to 96, as in Nakkiran et al. (2019). For Waterbirds, we use logistic regression over random projections, as in Mei & Montanari (2019). Specifically, let $x \in \mathbb{R}^d$ denote the input features, which we obtain by passing the input image through a pre-trained, fixed ResNet-18 model. We train an unregularized logistic regression model over the feature representation $\text{ReLU}(Wx) \in \mathbb{R}^m$, where $W \in \mathbb{R}^{m \times d}$ is a random matrix with each row sampled uniformly from the unit sphere $\mathbb{S}^{d-1}$. We vary model size by increasing the number of projections m from 1 to 10,000. We train each model by minimizing the reweighted objective (Equation (3)). For more details, see Appendix A.5.
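As a minimal sketch of the random-projection features described above, the function below draws a random matrix with unit-norm rows and applies a ReLU. The name `random_relu_features` and the `seed` argument are illustrative. Rows are drawn by normalizing standard Gaussian vectors, a standard way to sample uniformly from the unit sphere; the paper's own implementation may differ.

```python
import numpy as np

def random_relu_features(X, m, seed=0):
    """Map inputs X of shape (n, d) to ReLU(W x) features of shape (n, m).

    Each row of W is a Gaussian vector rescaled to unit norm, which is
    uniformly distributed on the unit sphere.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((m, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    features = np.maximum(X @ W.T, 0.0)  # elementwise ReLU
    return features, W
```

Model size is then controlled by `m` alone: the logistic regression trained on top has one weight per random projection.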

Results. Overparameterization improves average test error across both datasets, in line with prior work (Belkin et al., 2019; Nakkiran et al., 2019) (Figure 3). However, in stark contrast, overparameterization hurts worst-group error: the best worst-group test error is achieved by an underparameterized model with non-zero training error. On CelebA, the smallest model (width 1) has 12.4% worst-group training error but comparatively low worst-group test error of 25.6%. As width increases, training error goes to zero but worst-group test error gets worse, reaching >60% for overparameterized models with zero training error. Similarly, on Waterbirds, an underparameterized model with 90 random features and worst-group training error of 17.7% obtains the best worst-group test error of 26.6%, while overparameterized models with zero training error yield worst-group test error of 42.4% at best.

In Appendix A.2, we also confirm that stronger regularization improves worst-group error but hurts average error in overparameterized models, while it has little effect on both worst-group and average error in underparameterized models. However, we focus on understanding the effect of overparameterization in the remainder of the paper.

Discussion. Why does overparameterization hurt worst-group test error? We make two observations. First, in the overparameterized regime, the smallest groups incur the highest test error (blonde males in CelebA and waterbirds on land background in Waterbirds), despite having zero training error. In other words, overparameterized models perfectly fit the minority points at training time, but seem to do so by using patterns that do not generalize. We informally refer to this behavior as “memorizing” the minority points.

Second, underparameterized models do obtain low worst-group error by learning patterns that generalize to both majority and minority groups. Therefore, overparameterized models should also be able to learn these patterns while attaining zero training error (e.g., by memorizing the training points that the underparameterized model cannot fit). Despite this, overparameterized models seem to learn patterns that generalize well on the majority but do not work on the minority (such as the spurious attributes a in Figure 2).

What makes overparameterized models memorize the minority instead of learning patterns that generalize well on both majority and minority groups? We study this question in the next two sections: in Section 4, we use simulations to understand properties of the data distribution that give rise to this trend, and in Section 5 we analyze a simplified linear setting and show how the inductive bias of models towards memorizing fewer points can lead to overparameterized models choosing to use spurious correlations.

4. Simulation studies

The discussion in Section 3 suggests two properties of the training distribution that modulate the effect of overparameterization on worst-group error. Intuitively, overparameterized models should be more incentivized to use the spurious features and memorize the minority groups if (i) the proportion of the majority groups, $p_{\text{maj}}$, is higher, and (ii) the ratio of how informative the spurious features are relative to the core features, $r_{s:c}$, is higher. In this section, we use simulations to confirm these intuitions and probe how $p_{\text{maj}}$ and $r_{s:c}$ affect worst-group error in overparameterized models.

Figure 3. Increasing overparameterization (i.e., increasing model size) hurts the worst-group test error even though it improves the average test error. Here, we show results for models trained on the reweighted objective for CelebA (left) and Waterbirds (right).

4.1. Synthetic experiment setup

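Data distribution. We construct a synthetic dataset that replicates the empirical trends in Section 3. As in Section 2, the label $y \in \{1, -1\}$ is spuriously correlated with a spurious attribute $a \in \{1, -1\}$. We divide our training data into four groups accordingly: two majority groups with $a = y$, each of size $n_{\text{maj}}/2$, and two minority groups with $a = -y$, each of size $n_{\text{min}}/2$. We define $n = n_{\text{maj}} + n_{\text{min}}$ as the total number of training points, and $p_{\text{maj}} = n_{\text{maj}}/n$ as the fraction of majority examples. The higher $p_{\text{maj}}$ is, the more strongly a is correlated with y in the training data.

Each (y, a) group has its own distribution over input features $x = [x_{\text{core}}, x_{\text{spu}}] \in \mathbb{R}^{2d}$, comprising core features $x_{\text{core}} \in \mathbb{R}^d$ generated from the label/core attribute y, and spurious features $x_{\text{spu}} \in \mathbb{R}^d$ generated from the spurious attribute a:

$$x_{\text{core}} \mid y \sim \mathcal{N}(y\mathbf{1}, \sigma^2_{\text{core}} I_d), \qquad x_{\text{spu}} \mid a \sim \mathcal{N}(a\mathbf{1}, \sigma^2_{\text{spu}} I_d). \tag{4}$$

The core and spurious features are both noisy and encode their respective attributes at different signal-to-noise ratios. We define the spurious-core information ratio (SCR) as $r_{s:c} = \sigma^2_{\text{core}} / \sigma^2_{\text{spu}}$. The higher the SCR, the more signal there is about the spurious attribute in the spurious features, relative to the signal about the label in the core features.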

Compared to the image datasets we studied in Section 3, this synthetic dataset offers two key simplifications. First, the only differences between groups stem from their differences in (y, a), which isolates the effect of flipping the spurious attribute a. In contrast, in real datasets, groups can differ in other ways, e.g., more label noise in one group. Second, the relative difficulty of estimating y versus a is completely governed by changing the SCR $r_{s:c}$. In contrast, real datasets have additional complications, e.g., estimating y might involve a more complex function of the input x than estimating a, and there might be an inductive bias towards learning a simpler model over a more complex one.

Figure 4. Overparameterization hurts worst-group test error but improves average test error on synthetic data, reproducing the trends we observe in real data.

Figure 5. Overparameterized models have poor worst-group performance on the synthetic data because they rely on spurious features. Left: removing the spurious feature (green) eliminates the detrimental effect of overparameterization. Right: overparameterized models do well on the majority groups where the spurious features match the label, but poorly on the minority groups.

In all of the experiments below, we fix the total number of training points n to 3000, and set d = 100 (so each input x has 2d = 200 dimensions). Unless otherwise specified, we set the majority fraction $p_{\text{maj}}$ and the noise levels $\sigma^2_{\text{core}}$ and $\sigma^2_{\text{spu}}$ to encourage the model to use the spurious features over the core features.
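The synthetic setup can be sketched as a small generator. This is an illustration under stated assumptions, not the authors' code: it assumes labels and attributes coded as +1/-1, Gaussian noise around the means y·1 (core) and a·1 (spurious), and an input laid out as core features followed by spurious features. The name `make_synthetic` and its arguments are hypothetical, and the majority fraction and noise variances are left to the caller rather than fixed to particular values.

```python
import numpy as np

def make_synthetic(n, d, p_maj, sigma2_core, sigma2_spu, seed=0):
    """Sample n points with d core and d spurious features (assumed Gaussian)."""
    rng = np.random.default_rng(seed)
    n_maj = int(round(p_maj * n))
    n_min = n - n_maj
    # Within the majority and minority blocks, half the labels are +1, half -1.
    y_maj = np.repeat([1.0, -1.0], [n_maj // 2, n_maj - n_maj // 2])
    y_min = np.repeat([1.0, -1.0], [n_min // 2, n_min - n_min // 2])
    y = np.concatenate([y_maj, y_min])
    a = np.concatenate([y_maj, -y_min])  # majority: a = y, minority: a = -y
    x_core = y[:, None] + np.sqrt(sigma2_core) * rng.standard_normal((n, d))
    x_spu = a[:, None] + np.sqrt(sigma2_spu) * rng.standard_normal((n, d))
    return np.hstack([x_core, x_spu]), y, a
```

Calling `make_synthetic(3000, 100, p_maj, sigma2_core, sigma2_spu)` matches the stated n = 3000 and d = 100, giving inputs with 200 dimensions.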

Model. To avoid the complexities of optimizing neural networks, we follow the same random features setup we used for Waterbirds in Section 3: unregularized logistic regression using the reweighted objective on the random feature representation $\text{ReLU}(Wx) \in \mathbb{R}^m$, where $W \in \mathbb{R}^{m \times 2d}$ is a random matrix (Mei & Montanari, 2019).
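A minimal sketch of unregularized logistic regression under the reweighted objective is given below. The optimizer is an assumption: the text here does not say how the model is fit, so the sketch uses plain gradient descent with a fixed step size and step count. The name `fit_reweighted_logreg` and its defaults are illustrative; labels are assumed to be +1/-1 and groups to be integer ids.

```python
import numpy as np

def fit_reweighted_logreg(Phi, y, groups, lr=0.1, steps=500):
    """Gradient descent on the group-reweighted logistic loss (no regularizer).

    Phi: (n, m) feature matrix, y: labels in {+1, -1}, groups: integer ids.
    """
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = Phi.shape
    _, inverse, counts = np.unique(
        groups, return_inverse=True, return_counts=True
    )
    weights = (n / counts)[inverse]    # 1 / p_hat_g for each example
    weights = weights / weights.sum()  # rescaling leaves the minimizer unchanged
    theta = np.zeros(m)
    for _ in range(steps):
        margins = y * (Phi @ theta)
        # sigmoid(-margin), written with tanh for numerical stability
        sig = 0.5 * (1.0 - np.tanh(margins / 2.0))
        grad = -(Phi * (weights * y * sig)[:, None]).sum(axis=0)
        theta -= lr * grad
    return theta
```

Predictions are `np.sign(Phi @ theta)`. In the overparameterized regime the training data are typically separable, so an unregularized fit like this keeps growing the weight norm and the number of steps acts as an implicit stopping rule.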

4.2. Observations on synthetic dataset

The synthetic dataset replicates the trends we observe on real datasets. Figure 4 shows how average and worst-group error change with the number of parameters/random projections m. This matches the trends we obtained on CelebA and Waterbirds in Section 3. The best worst-group test error of 28.5% is achieved by an underparameterized model, whereas highly overparameterized models achieve high worst-group test error that plateaus at around 55%. In contrast, the average test error is better for overparameterized models than for underparameterized models.

Overparameterized models use spurious features. Figure 5-Right shows that overparameterized models have high test error on minority groups ($a = -y$) despite zero training error, but perform very well on the majority groups (a = y). Since the only difference between the minority and majority groups in the synthetic dataset is the relative signs of the core and spurious attributes, this suggests overparameterized models are using spurious features and simply memorizing the minority groups to get zero training error, consistent with our discussion in Section 3. In contrast, the underparameterized model has low training and test errors across all groups, suggesting that it relies mainly on core features.

These results imply that the degradation in the worst-group test error is due to the spurious features. We confirm that overparameterization no longer hurts when we “remove” the spurious features by replacing them with noise centered around zero (i.e., we replace the mean of the spurious features $x_{\text{spu}}$ by 0). In this case, the best worst-group test error is now obtained by an overparameterized model, as shown in Figure 5-Left.
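The "removal" of spurious features described above can be sketched as follows. The helper `remove_spurious` is hypothetical and rests on two assumptions: the input is laid out as d core features followed by the spurious features, and the replacement noise is Gaussian with the same variance as the original spurious noise.

```python
import numpy as np

def remove_spurious(X, d, sigma2_spu, seed=0):
    """Replace the spurious block of X with zero-mean noise.

    Assumes columns [0, d) are core features and columns [d, ...) are
    spurious features; the core block is left untouched.
    """
    rng = np.random.default_rng(seed)
    X_new = np.array(X, dtype=float, copy=True)
    n, total = X_new.shape
    X_new[:, d:] = np.sqrt(sigma2_spu) * rng.standard_normal((n, total - d))
    return X_new
```

After this change the spurious block carries no information about either a or y, so any remaining gap between under- and overparameterized models cannot come from the spurious correlation.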

4.3. Distributional properties

What properties of the training data make overparameterization hurt worst-group error? We study (i) the majority fraction $p_{\text{maj}}$, which controls the relative size of majority to minority groups, and (ii) the spurious-core information ratio $r_{s:c}$, the relative informativeness of spurious to core features. In the synthetic dataset, overparameterization hurts worst-group test error only when both are sufficiently high. In contrast, overparameterization helps average test error regardless; see Appendix A.3.

Effect of the majority fraction $p_{\text{maj}}$. We observe that increasing $p_{\text{maj}}$, which controls the relative size of the majority versus minority groups, makes overparameterization hurt worst-group error more (Figure 6). When the groups are perfectly balanced with $p_{\text{maj}} = 0.5$, overparameterization no longer hurts the worst-group test error, with overparameterized models achieving better worst-group test error than all underparameterized models. This suggests that group imbalance can be a key factor inducing the detrimental effect of overparameterization.

Effect of the spurious-core information ratio $r_{s:c}$. Next, we characterize the effect of $r_{s:c}$, which measures the relative informativeness of the spurious versus core features. A high $r_{s:c}$ means that the spurious features are more informative. We vary $r_{s:c}$ by changing the spurious noise level $\sigma^2_{\text{spu}}$ while keeping the core noise level $\sigma^2_{\text{core}}$ fixed, since this does not change the best possible worst-group test error (with a model that uses only the core features $x_{\text{core}}$). Figure 6 shows that the higher the $r_{s:c}$, the more overparameterization hurts. As $r_{s:c}$ increases, the spurious features become more informative, and overparameterized models rely more on them than the core features; underparameterized models outperform overparameterized models only for sufficiently large $r_{s:c}$. Note that increasing $r_{s:c}$ does not significantly affect the worst-group test error in the underparameterized regime, since the core features are unaffected. In contrast, increasing the majority fraction $p_{\text{maj}}$ hurts the worst-group test error in both underparameterized and overparameterized models.

Figure 6. The higher the majority fraction $p_{\text{maj}}$ and the spurious-core information ratio $r_{s:c}$, the more overparameterization hurts the worst-group test error. With sufficiently low $p_{\text{maj}}$ or $r_{s:c}$, overparameterization switches to helping worst-group test error.

4.4. An intuitive story

We return to the question of what makes overparameterized models memorize the minority instead of learning patterns that generalize on both majority and minority groups. The simulation results above show that of all overparameterized models that achieve zero training error, the inductive bias of the model class and training algorithm favors models that use spurious features which generalize only for the majority groups, instead of learning to use core features that also generalize well on the minority groups.

What is the nature of this inductive bias? Consider a model that predicts the label y by returning its estimate of the spurious at