An Investigation of Why Overparameterization Exacerbates Spurious Correlations

2020·Arxiv

Abstract

We study why overparameterization—increasing model size well beyond the point of zero training error—can hurt test error on minority groups despite improving average test error when there are spurious correlations in the data. Through simulations and experiments on two image datasets, we identify two key properties of the training data that drive this behavior: the proportions of majority versus minority groups, and the signal-to-noise ratio of the spurious correlations. We then analyze a linear setting and theoretically show how the inductive bias of models towards “memorizing” fewer examples can cause overparameterization to hurt. Our analysis leads to a counterintuitive approach of subsampling the majority group, which empirically achieves low minority error in the overparameterized regime, even though the standard approach of upweighting the minority fails. Overall, our results suggest a tension between using overparameterized models versus using all the training data for achieving low worst-group error.

1. Introduction

The typical goal in machine learning is to minimize the average error on a test set that is independent and identically distributed (i.i.d.) to the training set. A large body of prior work has shown that overparameterization—increasing model size beyond the point of zero training error—improves average test error in a variety of settings, both empirically (with neural networks, e.g., Nakkiran et al. (2019)) and theoretically (with linear and random projection models, e.g., Belkin et al. (2019); Mei & Montanari (2019)).

However, recent work has also demonstrated that models with low average error can still fail on particular groups of data points (Blodgett et al., 2016; Hashimoto et al., 2018; Buolamwini & Gebru, 2018). This problem of high worst-group error arises especially in the presence of spurious correlations, such as strong associations between label and background in image classification (McCoy et al., 2019; Sagawa et al., 2020). To mitigate this problem, common approaches reduce the worst-group training loss, e.g., through distributionally robust optimization (DRO) or simply upweighting the minority groups. Sagawa et al. (2020) showed these approaches improve worst-group error on strongly regularized neural networks but fail to help standard neural networks that can achieve zero training error, suggesting that increasing model capacity by reducing regularization (and perhaps by increasing overparameterization as well) can exacerbate spurious correlations.

*Equal contribution. ¹Stanford University. Correspondence to: Shiori Sagawa <ssagawa@cs.stanford.edu>, Aditi Raghunathan <aditir@stanford.edu>, Pang Wei Koh <pangwei@cs.stanford.edu>.

Proceedings of the International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Figure 1. Top: Overparameterization hurts test error on the worst group when models are trained with the reweighted objective that upweights minority groups (Equation 3). Without reweighting, models have poor worst-group error regardless of model size (Appendix A.1). Bottom: Consider data points comprising a core feature (x-axis) and a spurious feature (y-axis). The label y is highly correlated with the spurious feature, except on two minority groups (crosses). Underparameterized models use the core feature (left), but overparameterized models use the spurious feature and memorize the minority points (right).

In this paper, we investigate why overparameterization exacerbates spurious correlations under the above approach of upweighting minority groups. We first confirm on two image datasets (Figure 2) that directly increasing overparameterization (i.e., increasing model size) indeed hurts worst-group error, leading to models that are highly inaccurate on the minority groups where the spurious correlation does not hold (Section 3). In contrast, their underparameterized counterparts obtain much better worst-group error, but do worse on average. We also confirm that models trained via empirical risk minimization (i.e., without upweighting the minority) have poor worst-group test error regardless of whether they are under- or overparameterized. Through simulations on a synthetic setting, we further identify two properties of the training data that modulate the effect of overparameterization: (i) the relative sizes of the majority versus minority groups, and (ii) how informative the spurious features are relative to the core features (Section 4).

Figure 2. We consider two image datasets, CelebA and Waterbirds, where the label y is correlated with a spurious attribute a in a majority of the training data. The % beside each group shows its frequency in the training data. To measure how robust a model is to the spurious attribute, we divide the data into groups based on (y, a) and record the highest error incurred by a group. Figure adapted from Sagawa et al. (2020).

Why does overparameterization exacerbate spurious correlations? Underparameterized models do not rely on spurious features because that would incur high training error on the (upweighted) minority groups where the spurious correlation does not hold. In contrast, overparameterized models can always obtain zero training error by memorizing training examples, and instead rely on their inductive bias to pick a solution (which features to use and which examples to memorize) out of all solutions with zero training error. Our results suggest an intuitive story of why overparameterization can hurt: overparameterized models can have an inductive bias towards “memorizing” fewer examples (Figure 1). If (i) the majority groups are sufficiently large and (ii) the spurious features are more informative than the core features for these groups, then overparameterized models could choose to use the spurious features because doing so entails less memorization, and therefore suffer high worst-group test error. We test this intuition through simulations and formalize it in a theoretical analysis (Section 5).

Our analysis also leads to the counterintuitive result that on overparameterized models, subsampling the majority groups is much more effective at improving worst-group error than upweighting the minority groups. Indeed, an overparameterized model trained on a subset of <5% of the data performs similarly (on average and on the worst group) to an underparameterized model trained on all the data (Section 6). This suggests a possible tension between using overparameterized models and using all the data; average error benefits from both, but improving worst-group error seems to rely on using only one but not both.

2. Setup

Spurious correlation setup. We adopt the setting studied in Sagawa et al. (2020), where each example comprises the input features $x$, a label (core attribute) $y \in \mathcal{Y}$, and a spurious attribute $a \in \mathcal{A}$. Each example belongs to a group $g \in \mathcal{G}$, where $g = (y, a)$. Importantly, the spurious attribute $a$ is correlated with the label $y$ in the training set. We focus on the binary setting in which $\mathcal{Y} = \{1, -1\}$ and $\mathcal{A} = \{1, -1\}$.

Applications. We study two image classification tasks (Figure 2). In the first task, the label is spuriously correlated with demographics: specifically, we use the CelebA dataset (Liu et al., 2015) to classify hair color between the labels Y = {blonde, non-blonde}, which are correlated with the gender A = {female, male}. In the second task, the label is spuriously correlated with image background. We use the Waterbirds dataset (based on datasets from Wah et al. (2011); Zhou et al. (2017) and modified by Sagawa et al. (2020)) to classify between the labels Y = {waterbird, landbird}, which are spuriously correlated with the image background A = {water background, land background}. See Appendix A.5 for more dataset details.

Objectives and metrics. We evaluate a model $w$ by its worst-group error,

$$\text{Err}_{\text{wg}}(w) := \max_{g \in \mathcal{G}} \; \mathbb{E}_{x, y \mid g}\left[\ell_{0\text{-}1}(w; (x, y))\right], \tag{1}$$

where $\ell_{0\text{-}1}$ is the 0-1 loss. In other words, we measure the error (% of examples that are incorrectly labeled) in each group, and then record the highest error across all groups. The standard approach to training models is empirical risk minimization (ERM): given a loss function $\ell$, find the model $w$ that minimizes the average training loss

$$\hat{\mathcal{R}}_{\text{ERM}}(w) = \hat{\mathbb{E}}_{(x, y, g)}\left[\ell(w; (x, y))\right]. \tag{2}$$

However, in line with Sagawa et al. (2020), we find that models trained via ERM have poor worst-group test error regardless of whether they are under- or overparameterized (Appendix A.1). To achieve low worst-group test error, prior work proposed modified objectives that focus on the worst-group loss, such as group distributionally robust optimization (group DRO), which directly optimizes for the worst-group training loss (Hu et al., 2018; Sagawa et al., 2020), or reweighting (Shimodaira, 2000; Byrd & Lipton, 2019). Sagawa et al. (2020) showed that both approaches can help worst-group loss, though group DRO is typically more effective. For simplicity, we focus on the well-studied reweighting approach, which optimizes

$$\hat{\mathcal{R}}_{\text{reweight}}(w) = \hat{\mathbb{E}}_{(x, y, g)}\left[\frac{1}{\hat{p}_g}\,\ell(w; (x, y))\right], \tag{3}$$

where $\hat{p}_g$ is the fraction of training examples in group $g$. The intuition behind reweighting is that it makes each group contribute the same weight to the training objective: that is, minority groups are upweighted, while majority groups are downweighted. Note that this approach requires the groups $g$ to be specified at training time, though not at test time.
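As a minimal sketch of these two quantities, the NumPy snippet below computes the worst-group error (the highest per-group 0-1 error) and a reweighted logistic loss in which each example's loss is divided by the training frequency of its group. The function names, the integer group encoding, and the use of the logistic loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def group_ids(y, a):
    """Encode each (y, a) pair, with y and a in {-1, +1}, as an integer in {0, 1, 2, 3}."""
    return 2 * (y > 0).astype(int) + (a > 0).astype(int)

def worst_group_error(y_true, y_pred, groups):
    """Highest 0-1 error over the groups that appear in `groups`."""
    return max(np.mean(y_pred[groups == g] != y_true[groups == g])
               for g in np.unique(groups))

def reweighted_logistic_loss(margins, groups):
    """Mean of logistic_loss / p_g, where p_g is the empirical frequency of group g.

    `margins` holds y * f(x) for every training example.
    """
    values, counts = np.unique(groups, return_counts=True)
    freq = dict(zip(values, counts / len(groups)))
    weights = np.array([1.0 / freq[g] for g in groups])
    losses = np.logaddexp(0.0, -margins)  # log(1 + exp(-margin)), numerically stable
    return np.mean(weights * losses)

# Toy usage: predict the label by returning the spurious attribute a.
y = np.array([1, 1, 1, -1, -1, -1])
a = np.array([1, 1, -1, -1, -1, 1])
groups = group_ids(y, a)
average_error = np.mean(a != y)
worst_error = worst_group_error(y, a, groups)
```

In this toy example, a predictor that simply returns the spurious attribute is wrong on one third of the points on average but on every point in the two minority groups, which is the gap between average and worst-group error that the metric is meant to expose.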

3. Overparameterization hurts worst-group error

Sagawa et al. (2020) observed that decreasing regularization hurts worst-group error. Though increasing overparameterization and reducing regularization can have different effects (Zhang et al., 2017; Mei & Montanari, 2019), this suggests that overparameterization might similarly exacerbate spurious correlations. Here, we show that directly increasing overparameterization (model size) indeed hurts worst-group error even though it improves average error.

Models. We study the CelebA and Waterbirds datasets described above. For CelebA, we train a ResNet10 model (He et al., 2016), varying model size by increasing the network width from 1 to 96, as in Nakkiran et al. (2019). For Waterbirds, we use logistic regression over random projections, as in Mei & Montanari (2019). Specifically, let $x \in \mathbb{R}^d$ denote the input features, which we obtain by passing the input image through a pre-trained, fixed ResNet-18 model. We train an unregularized logistic regression model over the feature representation $\text{ReLU}(Wx) \in \mathbb{R}^m$, where $W \in \mathbb{R}^{m \times d}$ is a random matrix with each row sampled uniformly from the unit sphere $\mathbb{S}^{d-1}$. We vary model size by increasing the number of projections $m$ from 1 to 10,000. We train each model by minimizing the reweighted objective (Equation (3)). For more details, see Appendix A.5.
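The random-projection feature map used for Waterbirds can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the input matrix `X` stands in for the fixed pre-trained features, rows of the projection matrix are drawn uniformly from the unit sphere by normalizing Gaussian vectors, and the helper names are hypothetical.

```python
import numpy as np

def sample_unit_sphere_rows(m, d, rng):
    """Draw m rows uniformly from the unit sphere in R^d by normalizing Gaussian vectors."""
    W = rng.standard_normal((m, d))
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def random_relu_features(X, W):
    """Map inputs X (n x d) to ReLU(X W^T) (n x m)."""
    return np.maximum(X @ W.T, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                    # stand-in for fixed pre-trained features
W = sample_unit_sphere_rows(m=100, d=16, rng=rng)   # model size is controlled by m
Phi = random_relu_features(X, W)                    # features fed to logistic regression
```

Only the number of projections changes between model sizes; the input features and the downstream logistic regression stay the same, which is what makes the number of projections a clean knob for overparameterization.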

Results. Overparameterization improves average test error across both datasets, in line with prior work (Belkin et al., 2019; Nakkiran et al., 2019) (Figure 3). However, in stark contrast, overparameterization hurts worst-group error: the best worst-group test error is achieved by an underparameterized model with non-zero training error. On CelebA, the smallest model (width 1) has 12.4% worst-group training error but comparatively low worst-group test error of 25.6%. As width increases, training error goes to zero but worst-group test error gets worse, reaching >60% for overparameterized models with zero training error. Similarly, on Waterbirds, an underparameterized model with 90 random features and worst-group training error of 17.7% obtains the best worst-group test error of 26.6%, while overparameterized models with zero training error yield worst-group test error of 42.4% at best.

In Appendix A.2, we also confirm that stronger regularization improves worst-group error but hurts average error in overparameterized models, while it has little effect on both worst-group and average error in underparameterized models. However, we focus on understanding the effect of overparameterization in the remainder of the paper.

Discussion. Why does overparameterization hurt worst-group test error? We make two observations. First, in the overparameterized regime, the smallest groups incur the highest test error (blonde males in CelebA and waterbirds on land background in Waterbirds), despite having zero training error. In other words, overparameterized models perfectly fit the minority points at training time, but seem to do so by using patterns that do not generalize. We informally refer to this behavior as “memorizing” the minority points.

Second, underparameterized models do obtain low worst-group error by learning patterns that generalize to both majority and minority groups. Therefore, overparameterized models should also be able to learn these patterns while attaining zero training error (e.g., by memorizing the training points that the underparameterized model cannot fit). Despite this, overparameterized models seem to learn patterns that generalize well on the majority but do not work on the minority (such as the spurious attributes a in Figure 2).

What makes overparameterized models memorize the minority instead of learning patterns that generalize well on both majority and minority groups? We study this question in the next two sections: in Section 4, we use simulations to understand properties of the data distribution that give rise to this trend, and in Section 5 we analyze a simplified linear setting and show how the inductive bias of models towards memorizing fewer points can lead to overparameterized models choosing to use spurious correlations.

4. Simulation studies

The discussion in Section 3 suggests two properties of the training distribution that modulate the effect of overparameterization on worst-group error. Intuitively, overparameterized models should be more incentivized to use the spurious features and memorize the minority groups if (i) the proportion of the majority group, $p_{\text{maj}}$, is higher, and (ii) the ratio of how informative the spurious features are relative to the core features, $r_{s:c}$, is higher. In this section, we use simulations to confirm these intuitions and probe how $p_{\text{maj}}$ and $r_{s:c}$ affect worst-group error in overparameterized models.

Figure 3. Increasing overparameterization (i.e., increasing model size) hurts the worst-group test error even though it improves the average test error. Here, we show results for models trained on the reweighted objective for CelebA (left) and Waterbirds (right).

4.1. Synthetic experiment setup

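Data distribution. We construct a synthetic dataset that replicates the empirical trends in Section 3. As in Section 2, the label $y \in \{1, -1\}$ is spuriously correlated with a spurious attribute $a \in \{1, -1\}$. We divide our training data into four groups accordingly: two majority groups with $a = y$, each of size $n_{\text{maj}}/2$, and two minority groups with $a = -y$, each of size $n_{\text{min}}/2$. We define $n = n_{\text{maj}} + n_{\text{min}}$ as the total number of training points, and $p_{\text{maj}} = n_{\text{maj}}/n$ as the fraction of majority examples. The higher $p_{\text{maj}}$ is, the more strongly $a$ is correlated with $y$ in the training data.

Each $(y, a)$ group has its own distribution over input features $x = [x_{\text{core}}, x_{\text{spu}}] \in \mathbb{R}^{2d}$, comprising core features $x_{\text{core}} \in \mathbb{R}^d$ generated from the label/core attribute $y$, and spurious features $x_{\text{spu}} \in \mathbb{R}^d$ generated from the spurious attribute $a$:

$$x_{\text{core}} \mid y \sim \mathcal{N}(y\mathbf{1}, \sigma^2_{\text{core}} I_d), \qquad x_{\text{spu}} \mid a \sim \mathcal{N}(a\mathbf{1}, \sigma^2_{\text{spu}} I_d).$$

The core and spurious features are both noisy and encode their respective attributes at different signal-to-noise ratios. We define the spurious-core information ratio (SCR) as $r_{s:c} = \sigma^2_{\text{core}} / \sigma^2_{\text{spu}}$. The higher the SCR, the more signal there is about the spurious attribute in the spurious features, relative to the signal about the label in the core features.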

Compared to the image datasets we studied in Section 3, this synthetic dataset offers two key simplifications. First, the only differences between groups stem from their differences in $(y, a)$, which isolates the effect of flipping the spurious attribute $a$. In contrast, in real datasets, groups can differ in other ways, e.g., more label noise in one group. Second, the relative difficulty of estimating $y$ versus $a$ is completely governed by changing the noise levels $\sigma^2_{\text{core}}$ and $\sigma^2_{\text{spu}}$. In contrast, real datasets have additional complications, e.g., estimating $y$ might involve a more complex function of the input $x$ than estimating $a$, and there might be an inductive bias towards learning a simpler model over a more complex one.

Figure 4. Overparameterization hurts worst-group test error but improves average test error on synthetic data, reproducing the trends we observe in real data.

Figure 5. Overparameterized models have poor worst-group performance on the synthetic data because they rely on spurious features. Left: removing the spurious feature (green) eliminates the detrimental effect of overparameterization. Right: overparameterized models do well on the majority groups where the spurious features match the label, but poorly on the minority groups.

In all of the experiments below, we fix the total number of training points $n$ to 3000, and set $d = 100$ (so each input $x$ has $2d = 200$ dimensions). Unless otherwise specified, we set the majority fraction $p_{\text{maj}}$ and the noise levels $\sigma^2_{\text{spu}}$ and $\sigma^2_{\text{core}}$ to encourage the model to use the spurious features over the core features.
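A generator for this kind of synthetic dataset might look like the sketch below. It assumes Gaussian core and spurious features whose means are the attribute times the all-ones vector, and it takes the majority fraction and noise variances as arguments; the values passed in the usage lines are illustrative choices, not necessarily the settings used in the experiments.

```python
import numpy as np

def make_synthetic_data(n, d, p_maj, sigma2_core, sigma2_spu, rng):
    """Return (X, y, a) with X = [x_core, x_spu] in R^{2d}.

    Majority examples have a = y; minority examples have a = -y.
    """
    n_maj = int(round(p_maj * n))
    n_min = n - n_maj
    # Split each of the majority / minority sets evenly between y = +1 and y = -1.
    y_maj = np.repeat([1, -1], [n_maj // 2, n_maj - n_maj // 2])
    y_min = np.repeat([1, -1], [n_min // 2, n_min - n_min // 2])
    y = np.concatenate([y_maj, y_min])
    a = np.concatenate([y_maj, -y_min])
    # Each feature block has mean (attribute * all-ones) and isotropic Gaussian noise.
    x_core = y[:, None] + np.sqrt(sigma2_core) * rng.standard_normal((n, d))
    x_spu = a[:, None] + np.sqrt(sigma2_spu) * rng.standard_normal((n, d))
    return np.hstack([x_core, x_spu]), y, a

rng = np.random.default_rng(0)
# n and d follow the text; p_maj and the variances are illustrative assumptions.
X, y, a = make_synthetic_data(n=3000, d=100, p_maj=0.9,
                              sigma2_core=100.0, sigma2_spu=1.0, rng=rng)
```

Keeping the core variance fixed and shrinking the spurious variance raises the spurious-core information ratio without changing how well a core-only model can do.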

Model. To avoid the complexities of optimizing neural networks, we follow the same random features setup we used for Waterbirds in Section 3: unregularized logistic regression using the reweighted objective on the random feature representation $\text{ReLU}(Wx) \in \mathbb{R}^m$, where $W \in \mathbb{R}^{m \times 2d}$ is a random matrix (Mei & Montanari, 2019).

4.2. Observations on synthetic dataset

The synthetic dataset replicates the trends we observe on real datasets. Figure 4 shows how average and worst-group error change with the number of parameters/random projections m. This matches the trends we obtained on CelebA and Waterbirds in Section 3. The best worst-group test error of 28.5% is achieved by an underparameterized model, whereas highly overparameterized models achieve high worst-group test error that plateaus at around 55%. In contrast, the average test error is better for overparameterized models than for underparameterized models.

Overparameterized models use spurious features. Figure 5-Right shows that overparameterized models have high test error on minority groups ($a = -y$) despite zero training error, but perform very well on the majority groups ($a = y$). Since the only difference between the minority and majority groups in the synthetic dataset is the relative signs of the core and spurious attributes, this suggests overparameterized models are using spurious features and simply memorizing the minority groups to get zero training error, consistent with our discussion in Section 3. In contrast, the underparameterized model has low training and test errors across all groups, suggesting that it relies mainly on core features.

These results imply that the degradation in the worst-group test error is due to the spurious features. We confirm that overparameterization no longer hurts when we “remove” the spurious features by replacing them with noise centered around zero (i.e., we replace the mean of $x_{\text{spu}}$ by 0). In this case, the best worst-group test error is now obtained by an overparameterized model, as shown in Figure 5-Left.

4.3. Distributional properties

What properties of the training data make overparameterization hurt worst-group error? We study (i) $p_{\text{maj}}$, which controls the relative size of majority to minority groups, and (ii) $r_{s:c}$, the relative informativeness of spurious to core features. In the synthetic dataset, overparameterization hurts worst-group test error only when both are sufficiently high. In contrast, overparameterization helps average test error regardless; see Appendix A.3.

Effect of the majority fraction $p_{\text{maj}}$. We observe that increasing $p_{\text{maj}}$, which controls the relative size of the majority versus minority groups, makes overparameterization hurt worst-group error more (Figure 6). When the groups are perfectly balanced with $p_{\text{maj}} = 0.5$, overparameterization no longer hurts the worst-group test error, with overparameterized models achieving better worst-group test error than all underparameterized models. This suggests that group imbalance can be a key factor inducing the detrimental effect of overparameterization.

Effect of the spurious-core information ratio $r_{s:c}$. Next, we characterize the effect of $r_{s:c}$, which measures the relative informativeness of the spurious versus core features. A high $r_{s:c}$ means that the spurious features are more informative. We vary $r_{s:c}$ by changing $\sigma^2_{\text{spu}}$ while keeping $\sigma^2_{\text{core}}$ fixed, since this does not change the best possible worst-group test error (with a model that uses only the core features $x_{\text{core}}$). Figure 6 shows that the higher the $r_{s:c}$, the more overparameterization hurts. As $r_{s:c}$ increases, the spurious features become more informative, and overparameterized models rely more on them than the core features; underparameterized models outperform overparameterized models only for sufficiently large $r_{s:c}$. Note that increasing $r_{s:c}$ does not significantly affect the worst-group test error in the underparameterized regime, since the core features are unaffected. In contrast, increasing the majority fraction $p_{\text{maj}}$ hurts the worst-group test error in both underparameterized and overparameterized models.

Figure 6. The higher the majority fraction $p_{\text{maj}}$ and the spurious-core information ratio $r_{s:c}$, the more overparameterization hurts the worst-group test error. With sufficiently low values of these parameters, overparameterization switches to helping worst-group test error.

4.4. An intuitive story

We return to the question of what makes overparameterized models memorize the minority instead of learning patterns that generalize on both majority and minority groups. The simulation results above show that of all overparameterized models that achieve zero training error, the inductive bias of the model class and training algorithm favors models that use spurious features which generalize only for the majority groups, instead of learning to use core features that also generalize well on the minority groups.

What is the nature of this inductive bias? Consider a model that predicts the label $y$ by returning its estimate of the spurious attribute $a$, taking advantage of the fact that $y$ and $a$ are correlated in the training data. To achieve zero training error, it will need to memorize the points in the minority groups, e.g., by exploiting variations due to noise in the features $x$. On the other hand, consider a model that predicts $y$ by returning a direct estimate of $y$ based on the core features $x_{\text{core}}$. Because $x_{\text{core}}$ provides a noisier estimate of $y$ than $x_{\text{spu}}$ does for $a$, this model will need to memorize all points for which $x_{\text{core}}$ gives an inaccurate prediction of $y$ due to noise. Since the estimators of the core and spurious attributes are equally easy to learn, the main difference between these two models is the number of examples to be memorized.

We therefore hypothesize that the inductive bias favors memorizing as few points as possible. This is consistent with the results above: the model uses $x_{\text{spu}}$ and memorizes the minority points only when the fraction of minority points is small (high majority fraction $p_{\text{maj}}$). Similarly, the model uses $x_{\text{spu}}$ over $x_{\text{core}}$ to fit the majority points only when the spurious features are less noisy (high $r_{s:c}$) and therefore require less memorization to obtain zero training error than the core features. In the next section, we make this intuition formal by analyzing a related but simpler linear setting.
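The counting argument can be made concrete with a back-of-the-envelope sketch. It assumes, purely for illustration, that each strategy estimates its attribute by the sign of the average of $d$ Gaussian coordinates with mean $\pm 1$, so that the estimate is wrong with probability $P(Z > \sqrt{d/\sigma^2})$ for a standard normal $Z$. The function names and parameter values are hypothetical.

```python
import math

def gaussian_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def fraction_to_memorize(p_maj, sigma2_core, sigma2_spu, d):
    """Expected fraction of training points each strategy must memorize.

    Returns (use_core, use_spu): the fraction of points misfit by a predictor
    that relies only on the core features, and by one that relies only on the
    spurious features.
    """
    q_core = gaussian_tail(math.sqrt(d / sigma2_core))  # core estimate of y is wrong
    q_spu = gaussian_tail(math.sqrt(d / sigma2_spu))    # spurious estimate of a is wrong
    use_core = q_core
    # Predicting y by the estimate of a: majority points are misfit when the estimate
    # of a is wrong, minority points (a = -y) are misfit when it is right.
    use_spu = p_maj * q_spu + (1.0 - p_maj) * (1.0 - q_spu)
    return use_core, use_spu

# Illustrative values: imbalanced groups with clean spurious features...
core_imbalanced, spu_imbalanced = fraction_to_memorize(0.9, 100.0, 1.0, 100)
# ...versus perfectly balanced groups with the same noise levels.
core_balanced, spu_balanced = fraction_to_memorize(0.5, 100.0, 1.0, 100)
```

Under these illustrative values, relying on the spurious features requires memorizing fewer points than relying on the core features in the imbalanced case, and more points in the balanced case, in line with the hypothesis that both a high majority fraction and a high information ratio are needed for the spurious solution to win.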

5. Theoretical analysis

In this section, we show how the inductive bias against memorization leads to overparameterization exacerbating spurious correlations. Our analysis explicates the effect of the inductive bias and the importance of the data parameters discussed in Section 4.

The synthetic setting discussed in Section 4 is difficult to analyze because of the non-linear random projections, so we introduce a linear explicit-memorization setting that allows us to precisely define the concept of memorization. For clarity, we refer to the previous synthetic setting in Section 4 as the implicit-memorization setting. In Appendix A.4, we show empirically that models in these two settings behave similarly in the overparameterized regime, though they differ in the underparameterized regime.

In the previous implicit-memorization setting, we varied model size and memorization capacity by varying the number of random projections of the input. In the new explicit-memorization setting, we instead use linear models that act directly on the input and introduce explicit “noise features” that can be used to memorize. We vary the memorization capacity by varying the number of explicit noise features.

5.1. Explicit-memorization setup

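Training data. We consider input features $x = [x_{\text{core}}, x_{\text{spu}}, x_{\text{noise}}]$, where the core feature $x_{\text{core}} \in \mathbb{R}$ and the spurious feature $x_{\text{spu}} \in \mathbb{R}$ are scalars. As in the implicit-memorization setup, they are generated based on the label $y$ and the spurious attribute $a$, respectively:

$$x_{\text{core}} \mid y \sim \mathcal{N}(y, \sigma^2_{\text{core}}), \qquad x_{\text{spu}} \mid a \sim \mathcal{N}(a, \sigma^2_{\text{spu}}).$$

The “noise” features $x_{\text{noise}} \in \mathbb{R}^N$ are generated as

$$x_{\text{noise}} \sim \mathcal{N}\!\left(0, \frac{\sigma^2_{\text{noise}}}{N} I_N\right),$$

where $\sigma^2_{\text{noise}}$ is a constant. The scaling by $1/N$ ensures that for large $N$, the norm of the noise vectors is approximately constant with high probability. Intuitively, when $N$ is large, overparameterized models can use $x_{\text{noise}}$ to fit a training point $x$ without affecting its predictions on other points, thereby memorizing $x$. We formalize this notion of memorization later in Section 5.2.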
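The effect of the $1/N$ scaling can be checked numerically. The sketch below assumes i.i.d. Gaussian noise coordinates with variance $\sigma^2_{\text{noise}}/N$ (the helper name and the sizes are illustrative) and looks at the Gram matrix of the noise vectors: the squared norms concentrate around $\sigma^2_{\text{noise}}$, while inner products between different examples are close to zero.

```python
import numpy as np

def sample_noise_features(n, N, sigma2_noise, rng):
    """Draw n noise vectors in R^N with i.i.d. N(0, sigma2_noise / N) coordinates."""
    return np.sqrt(sigma2_noise / N) * rng.standard_normal((n, N))

rng = np.random.default_rng(0)
Z = sample_noise_features(n=50, N=20000, sigma2_noise=1.0, rng=rng)

gram = Z @ Z.T
sq_norms = np.diag(gram)                      # squared norms, expected near sigma2_noise
off_diag = gram[~np.eye(50, dtype=bool)]      # inner products between distinct examples
```

Because the noise vectors of different examples are nearly orthogonal when $N$ is large, putting weight on one example's noise vector changes the prediction on that example while barely moving the predictions on the others, which is the sense in which the noise features allow memorization.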

As before, the training data is composed of four groups, each corresponding to a combination of the label $y \in \{-1, 1\}$ and the spurious attribute $a \in \{-1, 1\}$: two majority groups with $a = y$, each of size $n_{\text{maj}}/2$, and two minority groups with $a = -y$, each of size $n_{\text{min}}/2$. Combined, there are $n$ training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$.

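Model. We study unregularized logistic regression on the input features $x$. As before, we consider the reweighted estimator $\hat{w}^{\text{rw}}$. When the training data is linearly separable, the minimizer of the unregularized logistic loss on the training data is not well-defined. We therefore define $\hat{w}^{\text{rw}}$ in terms of the sequence of $\ell_2$-regularized models $\hat{w}^{\text{rw}}_\lambda$,

$$\hat{w}^{\text{rw}}_\lambda := \arg\min_{w} \; \hat{\mathbb{E}}_{(x, y, g)}\left[\frac{1}{\hat{p}_g}\,\ell(w; (x, y))\right] + \frac{\lambda}{2}\|w\|_2^2, \tag{5}$$

where $\ell(w; (x, y)) = \log(1 + \exp(-y\, w \cdot x))$ is the logistic loss and $\hat{p}_g$ is the fraction of training examples in group $g$. Since scaling a model does not affect its 0-1 error, we define $\hat{w}^{\text{rw}}$ as the limit of this sequence, scaled to unit norm, as the regularization strength $\lambda \to 0^+$:

$$\hat{w}^{\text{rw}} := \lim_{\lambda \to 0^+} \frac{\hat{w}^{\text{rw}}_\lambda}{\|\hat{w}^{\text{rw}}_\lambda\|_2}.$$

In the underparameterized regime, the training data is not linearly separable and we simply have $\hat{w}^{\text{rw}} = \hat{w}^{\text{rw}}_0 / \|\hat{w}^{\text{rw}}_0\|_2$. In the overparameterized regime where $N \gg n$, the training data is linearly separable, and Rosset et al. (2004) showed that $\hat{w}^{\text{rw}}$ is the max-margin classifier

$$\hat{w}^{\text{mm}} := \arg\max_{\|w\|_2 = 1} \; \min_{i} \; y^{(i)} \left(w \cdot x^{(i)}\right).$$

The equivalence holds regardless of the reweighting by $\hat{p}_g$: if we define the ERM estimator $\hat{w}^{\text{erm}}$ analogously to (5) without the reweighting, it is also equal to $\hat{w}^{\text{mm}}$. We will therefore analyze $\hat{w}^{\text{mm}}$ in the overparameterized regime since it subsumes both $\hat{w}^{\text{rw}}$ and $\hat{w}^{\text{erm}}$.

We also note that if we use gradient descent to directly optimize the unregularized logistic regression objective (either reweighted or not), the resulting solution after scaling to unit norm also converges to $\hat{w}^{\text{mm}}$ as the number of gradient steps goes to infinity (Soudry et al., 2018).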
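To make the explicit-memorization setup tangible, the sketch below generates a small dataset with scalar core and spurious features plus many noise features, then runs plain gradient descent on the unweighted, unregularized logistic loss and normalizes the result. All parameter values, the step size, the number of steps, and the helper names are illustrative assumptions; a finite number of steps only approximates the limiting direction, so this is a way to inspect the behavior rather than an exact max-margin solver.

```python
import numpy as np

def make_explicit_memorization_data(n_maj, n_min, N, s2_core, s2_spu, s2_noise, rng):
    """Scalar core/spurious features plus N low-variance noise features per example."""
    y = np.concatenate([np.repeat([1, -1], [n_maj // 2, n_maj - n_maj // 2]),
                        np.repeat([1, -1], [n_min // 2, n_min - n_min // 2])])
    a = y.copy()
    a[n_maj:] *= -1                                   # minority groups have a = -y
    n = n_maj + n_min
    x_core = y + np.sqrt(s2_core) * rng.standard_normal(n)
    x_spu = a + np.sqrt(s2_spu) * rng.standard_normal(n)
    x_noise = np.sqrt(s2_noise / N) * rng.standard_normal((n, N))
    return np.column_stack([x_core, x_spu, x_noise]), y, a

def gd_direction(X, y, lr, steps):
    """Run gradient descent on the mean logistic loss and return w / ||w||."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        s = np.exp(-np.logaddexp(0.0, margins))       # sigmoid(-margin), computed stably
        w += lr * (X.T @ (s * y)) / len(y)            # step along the negative gradient
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X, y, a = make_explicit_memorization_data(16, 4, 2000, 1.0, 0.01, 1.0, rng)
w = gd_direction(X, y, lr=0.5, steps=3000)
train_margins = y * (X @ w)                           # all positive once the data is fit
w_core, w_spu = w[0], w[1]                            # weights on the two signal features
```

With far more noise features than training points, the data is linearly separable and the returned direction fits every training point; comparing `w_core` and `w_spu` then shows which of the two signal features the solution leans on.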

5.2. Analysis of worst-group error

We now state our main analytical result: in the explicit-memorization setting, the worst-group test error of a sufficiently overparameterized model is greater than 1/2 (worse than random) under certain settings of the data parameters. In contrast, underparameterized models attain reasonable worst-group error even under such a setting.

Theorem 1. For any

, and , there exists $N_0$ such that for all $N > N_0$ (overparameterized regime), with high probability over draws of the data,

where $\hat{w}^{\text{mm}}$ is the max-margin classifier.

However, for N = 0 (underparameterized regime), with , and , and in the asymptotic regime with

where $\hat{w}^{\text{rw}}$ minimizes the reweighted logistic loss.

The result in the overparameterized regime applies to the max-margin classifier $\hat{w}^{\text{mm}}$, which as discussed above subsumes both $\hat{w}^{\text{rw}}$ and $\hat{w}^{\text{erm}}$ when the data is linearly separable. The proof of Theorem 1 appears in Appendix B.

The conditions on $\sigma^2_{\text{core}}$ and $\sigma^2_{\text{spu}}$ in Theorem 1 above imply a high spurious-core information ratio $r_{s:c}$. Theorem 1 therefore provides a setting where high $p_{\text{maj}}$ and high $r_{s:c}$ provably make overparameterized models obtain high worst-group error, matching the trends we observed upon varying $p_{\text{maj}}$ and $r_{s:c}$ in the implicit-memorization setting (Figure 6). Furthermore, underparameterized models obtain reasonable worst-group error despite these conditions, mirroring the observations in earlier sections.

5.3. Overparameterization and memorization

We now sketch the key ideas in the proof of Theorem 1 (full proof in Appendix B), focusing first on the overparameterized regime. We start by establishing an inductive bias towards learning the minimum-norm model that fits the training data. We then define memorization and show how the minimum-norm inductive bias translates into a bias against memorization. Finally, we illustrate how the bias against memorization leads to learning the spurious feature and suffering high worst-group error.

Minimum-norm inductive bias. Define a separator as any model $w$ that correctly classifies all of the training points $(x, y)$ with margin $y(w \cdot x) \geq 1$. Then from standard duality arguments, $\hat{w}^{\text{mm}}$ can be rewritten as $\hat{w}^{\text{mm}} = \hat{w}^{\text{minnorm}} / \|\hat{w}^{\text{minnorm}}\|_2$, the scaled version of the minimum-norm separator

$$\hat{w}^{\text{minnorm}} := \arg\min_{w} \|w\|_2^2 \quad \text{s.t. } w \text{ is a separator}. \tag{9}$$

Since scaling does not affect the 0-1 test error, it suffices to analyze $\hat{w}^{\text{minnorm}}$. Equation (9) shows that out of the set of all separators (which all perfectly fit the training data), the inductive bias favors the separator with the minimum norm. We now discuss how this minimum-norm inductive bias favors less memorization.

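Memorization. For convenience, we denote the three components of a model $w$ as

$$w = [w_{\text{core}}, w_{\text{spu}}, w_{\text{noise}}],$$

where $w_{\text{core}}, w_{\text{spu}} \in \mathbb{R}$, and $w_{\text{noise}} \in \mathbb{R}^N$. By the representer theorem, we can decompose $\hat{w}^{\text{minnorm}}_{\text{noise}}$ as follows:

$$\hat{w}^{\text{minnorm}}_{\text{noise}} = \sum_{i=1}^{n} \alpha^{(i)} x^{(i)}_{\text{noise}}.$$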

In the overparameterized regime when $N \gg n$, a model can “memorize” a training point via