36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2005.04345","publisher":"arxiv","paperJSON":{"title":"An Investigation of Why Overparameterization Exacerbates Spurious Correlations","paperID":"2005.04345","avgLineHeight":11.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We study why overparameterization—increasing model size well beyond the point of zero training error—can hurt test error on minority groups despite improving average test error when there are spurious correlations in the data. Through simulations and experiments on two image datasets, we identify two key properties of the training data that drive this behavior: the proportions of majority versus minority groups, and the signal-to-noise ratio of the spurious correlations. We then analyze a linear setting and theoretically show how the inductive bias of models towards “memorizing” fewer examples can cause overparameterization to hurt. Our analysis leads to a counterintuitive approach of subsampling the majority group, which empirically achieves low minority error in the overparameterized regime, even though the standard approach of upweighting the minority fails. Overall, our results suggest a tension between using overparameterized models versus using all the training data for achieving low worst-group error.","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"The typical goal in machine learning is to minimize the average error on a test set that is independent and identically distributed (i.i.d.) to the training set. A large body of prior work has shown that overparameterization—increasing model size beyond the point of zero training error—improves average test error in a variety of settings, both empirically (with neural networks, e.g., ","element":"span"},{"href":"#id-0","referenceIndex":26,"text":"Nakkiran et al. 
","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"(","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"2019","element":"a"},{"text":")) and theoretically (with linear and random projection models, e.g., ","element":"span"},{"href":"#id-1","referenceIndex":3,"text":"Belkin et al. ","element":"a"},{"href":"#id-1","referenceIndex":3,"text":"(","element":"a"},{"href":"#id-1","referenceIndex":3,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-2","referenceIndex":25,"text":"Mei & Montanari ","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"(","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"2019","element":"a"},{"text":")).","element":"span"}],[{"text":"However, recent work has also demonstrated that models with low average error can still fail on particular groups of","element":"span"}],[{"text":"*","element":"span"},{"text":"Equal contribution ","element":"span"},{"text":"1","element":"span"},{"text":"Stanford University. ","element":"span"},{"text":"Correspondence ","element":"span"},{"text":"to: ","element":"span"},{"text":"Shiori ","element":"span"},{"text":"Sagawa ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"ssagawa@cs.stanford.edu","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":", Aditi Raghunathan ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"aditir@stanford.edu","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":", Pang Wei Koh ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"pangwei@cs.stanford.edu","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proceedings of the 
","element":"span"},{"style":{"height":12.9},"width":70.88,"height":32.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/0-0.png","element":"img","alt":" 37 th ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", Online, PMLR 119, 2020. Copyright 2020 by the author(s).","element":"span"}],[{"style":{"width":"83%"},"width":779,"height":765,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/0-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Top","element":"figcaption","subtype":"caption"},{"text":": Overparameterization ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"hurts ","element":"figcaption","subtype":"caption"},{"text":"test error on the worst group when models are trained with the reweighted objective that upweights minority groups (Equation ","element":"figcaption","subtype":"caption"},{"href":"#id-3","text":"3","element":"a","subtype":"caption"},{"text":"). Without reweighting, models have poor worst-group error regardless of model size (Appendix ","element":"figcaption","subtype":"caption"},{"href":"#id-4","text":"A.1","element":"a","subtype":"caption"},{"text":"). 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Bottom","element":"figcaption","subtype":"caption"},{"text":": Consider data points ","element":"figcaption","subtype":"caption"},{"style":{"height":16.09},"width":306.52,"height":40.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/0-2.png","element":"img","alt":" (x, y), where x ∈ R2","inline":true,"padRight":true},{"text":"comprises a core feature ","element":"figcaption","subtype":"caption"},{"style":{"height":7.99},"width":63.84,"height":19.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/0-3.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"(x-axis) and a spurious feature ","element":"figcaption","subtype":"caption"},{"style":{"height":9.99},"width":56.76,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/0-4.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"(y-axis). The label ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"y ","element":"figcaption","subtype":"caption"},{"text":"is highly correlated with ","element":"figcaption","subtype":"caption"},{"style":{"height":9.99},"width":56.76,"height":24.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/0-5.png","element":"img","alt":" xspu","inline":true},{"text":", except on two minority groups (crosses). 
Underparameterized models use the core feature (left), but overparameterized models use the spurious feature and memorize the minority points (right).","element":"figcaption","subtype":"caption"}],[{"text":"data points (","element":"span"},{"href":"#id-5","referenceIndex":5,"text":"Blodgett et al.","element":"a"},{"href":"#id-5","referenceIndex":5,"text":", ","element":"a"},{"href":"#id-5","referenceIndex":5,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":14,"text":"Hashimoto et al.","element":"a"},{"href":"#id-6","referenceIndex":14,"text":", ","element":"a"},{"href":"#id-6","referenceIndex":14,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":7,"text":"Buolamwini & Gebru","element":"a"},{"href":"#id-7","referenceIndex":7,"text":", ","element":"a"},{"href":"#id-7","referenceIndex":7,"text":"2018","element":"a"},{"text":"). This problem of high worst-group error arises especially in the presence of spurious correlations, such as strong associations between label and background in image classification (","element":"span"},{"href":"#id-8","referenceIndex":24,"text":"McCoy et al.","element":"a"},{"href":"#id-8","referenceIndex":24,"text":", ","element":"a"},{"href":"#id-8","referenceIndex":24,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al.","element":"a"},{"href":"#id-9","referenceIndex":31,"text":", ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"). To mitigate this problem, common approaches reduce the worst-group training loss, e.g., through distributionally robust optimization (DRO) or simply upweighting the minority groups. ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. 
","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":") showed these approaches improve worst-group error on strongly regularized neural networks but fail to help standard neural networks that can achieve zero training error, suggesting that increasing model capacity by reducing regularization—and perhaps by increasing overparameterization as well—can exacerbate spurious correlations.","element":"span"}],[{"text":"In this paper, we investigate why overparameterization exacerbates spurious correlations under the above approach of upweighting minority groups. We first confirm on two","element":"span"}],[{"style":{"width":"93%"},"width":875,"height":649,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-0.png","element":"img"}],[{"id":"id-10","style":{"fontStyle":"italic"},"text":"Figure 2. ","element":"figcaption","subtype":"caption"},{"text":"We consider two image datasets, CelebA and Waterbirds, where the label ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"y ","element":"figcaption","subtype":"caption"},{"text":"is correlated with a spurious attribute ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"a ","element":"figcaption","subtype":"caption"},{"text":"in a majority of the training data. The % beside each group shows its frequency in the training data. To measure how robust a model is to the spurious attribute, we divide the data into groups based on ","element":"figcaption","subtype":"caption"},{"text":"(","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"y, a","element":"figcaption","subtype":"caption"},{"text":") ","element":"figcaption","subtype":"caption"},{"text":"and record the highest error incurred by a group. 
Figure adapted from ","element":"figcaption","subtype":"caption"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. ","element":"a","subtype":"caption"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a","subtype":"caption"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a","subtype":"caption"},{"text":").","element":"figcaption","subtype":"caption"}],[{"text":"image datasets (Figure ","element":"span"},{"href":"#id-10","text":"2","element":"a"},{"text":") that directly increasing overparameterization (i.e., increasing model size) indeed hurts worst-group error, leading to models that are highly inaccurate on the minority groups where the spurious correlation does not hold (Section ","element":"span"},{"text":"3","element":"span"},{"text":"). In contrast, their underparameterized counterparts obtain much better worst-group error, but do worse on average. We also confirm that models trained via empirical risk minimization (i.e., without upweighting the minority) have poor worst-group test error regardless of whether they are under- or overparameterized. Through simulations on a synthetic setting, we further identify two properties of the training data that modulate the effect of overparameterization: (i) the relative sizes of the majority versus minority groups, and (ii) how informative the spurious features are relative to the core features (Section ","element":"span"},{"text":"4","element":"span"},{"text":").","element":"span"}],[{"text":"Why does overparameterization exacerbate spurious correlations? Underparameterized models do not rely on spurious features because that would incur high training error on the (upweighted) minority groups where the spurious correlation does not hold. 
In contrast, overparameterized models can always obtain zero training error by memorizing training examples, and instead rely on their inductive bias to pick a solution (which features to use and which examples to memorize) out of all solutions with zero training error. Our results suggest an intuitive story of why overparameterization can hurt: overparameterized models can have an inductive bias towards “memorizing” fewer examples (Figure ","element":"span"},{"text":"1","element":"span"},{"text":"). If (i) the majority groups are sufficiently large and (ii) the spurious features are more informative than the core features for these groups, then overparameterized models could choose to use the spurious features because it entails less memorization, and therefore suffer high worst-","element":"span"},{"text":"group test error. We test this intuition through simulations and formalize it in a theoretical analysis (Section ","element":"span"},{"text":"5","element":"span"},{"text":").","element":"span"}],[{"text":"Our analysis also leads to the counterintuitive result that on overparameterized models, subsampling the majority groups is much more effective at improving worst-group error than upweighting the minority groups. Indeed, an overparameterized model trained on a subset of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"5% of the data performs similarly (on average and on the worst group) to an underparameterized model trained on all the data (Section ","element":"span"},{"text":"6","element":"span"},{"text":"). This suggests a possible tension between using overparameterized models and using all the data; average error benefits from both, but improving worst-group error seems to rely on using only one but not both.","element":"span"}]]},{"heading":"2. Setup","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Spurious correlation setup. 
","element":"span"},{"text":"We adopt the setting studied in ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"), where each example comprises the input features ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", a label (core attribute) ","element":"span"},{"style":{"height":14},"width":113.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-1.png","element":"img","alt":" y ∈ Y","inline":true},{"text":", and a spurious attribute ","element":"span"},{"style":{"height":12.4},"width":117.25,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-2.png","element":"img","alt":" a ∈ A","inline":true},{"text":". Each example belongs to a group ","element":"span"},{"style":{"height":14.8},"width":288.04,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-3.png","element":"img","alt":" g ∈ G = Y × A","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"y, a","element":"span"},{"text":")","element":"span"},{"text":". Importantly, the spurious attribute ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is correlated with the label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"in the training set. We focus on the binary setting in which ","element":"span"},{"style":{"height":16},"width":512.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-4.png","element":"img","alt":"Y = {1, −1} and A = {1, −1}.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Applications. 
","element":"span"},{"text":"We study two image classification tasks (Figure ","element":"span"},{"href":"#id-10","text":"2","element":"a"},{"text":"). In the first task, the label is spuriously correlated with demographics: specifically, we use the CelebA dataset (","element":"span"},{"href":"#id-11","referenceIndex":23,"text":"Liu et al.","element":"a"},{"href":"#id-11","referenceIndex":23,"text":", ","element":"a"},{"href":"#id-11","referenceIndex":23,"text":"2015","element":"a"},{"text":") to classify hair color between the labels ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"blonde, non-blonde","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", which are correlated with the gender ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"female, male","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". In the second task, the label is spuriously correlated with image background. We use the Waterbirds dataset (based on datasets from ","element":"span"},{"href":"#id-12","referenceIndex":35,"text":"Wah et al. ","element":"a"},{"href":"#id-12","referenceIndex":35,"text":"(","element":"a"},{"href":"#id-12","referenceIndex":35,"text":"2011","element":"a"},{"text":"); ","element":"span"},{"href":"#id-13","referenceIndex":39,"text":"Zhou et al. ","element":"a"},{"href":"#id-13","referenceIndex":39,"text":"(","element":"a"},{"href":"#id-13","referenceIndex":39,"text":"2017","element":"a"},{"text":") and modified by ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. 
","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":")) to classify between the labels ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"waterbird, landbird","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", which are spuriously correlated with the image background ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"water background, land background","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". See Appendix ","element":"span"},{"href":"#id-14","text":"A.5 ","element":"a"},{"text":"for more dataset details.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Objectives and metrics. ","element":"span"},{"text":"We evaluate a model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"by its ","element":"span"},{"style":{"fontStyle":"italic"},"text":"worst-group ","element":"span"},{"text":"error,","element":"span"}],[{"style":{"width":"86%"},"width":808,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":7.6},"width":71.33,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-6.png","element":"img","alt":" ℓ0−1","inline":true,"padRight":true},{"text":"is the 0-1 loss. In other words, we measure the error (% of examples that are incorrectly labeled) in each group, and then record the highest error across all groups. 
The standard approach to training models is empirical risk minimization (ERM): given a loss function ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-7.png","element":"img","alt":" ℓ","inline":true},{"text":", find the model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"that minimizes the average training loss","element":"span"}],[{"id":"id-61","style":{"width":"80%"},"width":751,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/1-8.png","element":"img"}],[{"text":"However, in line with ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"), we find that models trained via ERM have poor worst-group test error ","element":"span"},{"text":"regardless of whether they are under- or overparameterized (Appendix ","element":"span"},{"href":"#id-4","text":"A.1","element":"a"},{"text":"). 
To achieve low worst-group test error, prior work proposed modified objectives that focus on the worst-group loss, such as group distributionally robust optimization (group DRO) which directly optimizes for the worst-group training loss (","element":"span"},{"href":"#id-15","referenceIndex":19,"text":"Hu et al.","element":"a"},{"href":"#id-15","referenceIndex":19,"text":", ","element":"a"},{"href":"#id-15","referenceIndex":19,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al.","element":"a"},{"href":"#id-9","referenceIndex":31,"text":", ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":") or reweighting (","element":"span"},{"href":"#id-16","referenceIndex":32,"text":"Shimodaira","element":"a"},{"href":"#id-16","referenceIndex":32,"text":", ","element":"a"},{"href":"#id-16","referenceIndex":32,"text":"2000","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":8,"text":"Byrd & Lipton","element":"a"},{"href":"#id-17","referenceIndex":8,"text":", ","element":"a"},{"href":"#id-17","referenceIndex":8,"text":"2019","element":"a"},{"text":"). ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":") showed that both approaches can help worst-group loss, though group DRO is typically more effective. 
For simplicity, we focus on the well-studied reweighting approach, which optimizes","element":"span"}],[{"id":"id-3","style":{"width":"85%"},"width":806,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/2-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.59},"width":36.05,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/2-1.png","element":"img","alt":" ˆpg","inline":true,"padRight":true},{"text":"is the fraction of training examples in group ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":". The intuition behind reweighting is that it makes each group contribute the same weight to the training objective: that is, minority groups are upweighted, while majority groups are downweighted. Note that this approach requires the groups ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"to be specified at training time, though not at test time.","element":"span"}]]},{"heading":"3. Overparameterization hurts worst-group error","paragraphs":[[{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":") observed that decreasing ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/2-2.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization hurts worst-group error. 
Though increasing overparameterization and reducing regularization can have different effects (","element":"span"},{"href":"#id-18","referenceIndex":38,"text":"Zhang et al.","element":"a"},{"href":"#id-18","referenceIndex":38,"text":", ","element":"a"},{"href":"#id-18","referenceIndex":38,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":25,"text":"Mei & Montanari","element":"a"},{"href":"#id-2","referenceIndex":25,"text":", ","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"2019","element":"a"},{"text":"), this suggests that overparameterization might similarly exacerbate spurious correlations. Here, we show that directly increasing overparameterization (model size) indeed hurts worst-group error even though it improves average error.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Models. ","element":"span"},{"text":"We study the CelebA and Waterbirds datasets described above. For CelebA, we train a ResNet10 model (","element":"span"},{"href":"#id-19","referenceIndex":16,"text":"He ","element":"a"},{"href":"#id-19","referenceIndex":16,"text":"et al.","element":"a"},{"href":"#id-19","referenceIndex":16,"text":", ","element":"a"},{"href":"#id-19","referenceIndex":16,"text":"2016","element":"a"},{"text":"), varying model size by increasing the network width from 1 to 96, as in ","element":"span"},{"href":"#id-0","referenceIndex":26,"text":"Nakkiran et al. ","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"(","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"2019","element":"a"},{"text":"). For Waterbirds, we use logistic regression over random projections, as in ","element":"span"},{"href":"#id-2","referenceIndex":25,"text":"Mei & Montanari ","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"(","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"2019","element":"a"},{"text":"). 
Specifically, let ","element":"span"},{"style":{"height":14.18},"width":232.6,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/2-3.png","element":"img","alt":" x ∈ Rd denote","inline":true,"padRight":true},{"text":"the input features, which we obtain by passing the input image through a pre-trained, fixed ResNet-18 model. We train an unregularized logistic regression model over the feature representation ReLU","element":"span"},{"style":{"height":16},"width":213.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/2-4.png","element":"img","alt":"(Wx) ∈ Rm","inline":true},{"text":", where ","element":"span"},{"style":{"height":14.18},"width":201.9,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/2-5.png","element":"img","alt":" W ∈ Rm×d","inline":true,"padRight":true},{"text":"is a random matrix with each row sampled uniformly from the unit sphere ","element":"span"},{"style":{"height":13.39},"width":79.64,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/2-6.png","element":"img","alt":" Sd−1","inline":true},{"text":". We vary model size by increasing the number of projections ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"from 1 to 10,000. We train each model by minimizing the reweighted objective (Equation (","element":"span"},{"href":"#id-3","text":"3","element":"a"},{"text":")). For more details, see Appendix ","element":"span"},{"href":"#id-14","text":"A.5","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Results. 
","element":"span"},{"text":"Overparameterization improves average test error across both datasets, in line with prior work (","element":"span"},{"href":"#id-1","referenceIndex":3,"text":"Belkin et al.","element":"a"},{"href":"#id-1","referenceIndex":3,"text":", ","element":"a"},{"href":"#id-1","referenceIndex":3,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-0","referenceIndex":26,"text":"Nakkiran et al.","element":"a"},{"href":"#id-0","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"2019","element":"a"},{"text":") (Figure ","element":"span"},{"href":"#id-20","text":"3","element":"a"},{"text":"). However, in stark contrast, overparameterization ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hurts ","element":"span"},{"text":"worst-group error: the best worst-group test error is achieved by an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"underparameterized ","element":"span"},{"text":"model with non-zero training error. On CelebA, ","element":"span"},{"text":"the smallest model (width 1) has 12.4% worst-group training error but comparatively low worst-group test error of 25.6%. As width increases, training error goes to zero but worst-group test error gets worse, reaching ","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":"60% for overparameterized models with zero training error. 
Similarly, on Waterbirds, an underparameterized model with ","element":"span"},{"text":"90 ","element":"span"},{"text":"random features and worst-group training error of 17.7% obtains the best worst-group test error of 26.6%, while overparameterized models with zero training error yield worst-group test error of 42.4% at best.","element":"span"}],[{"text":"In Appendix ","element":"span"},{"href":"#id-21","text":"A.2","element":"a"},{"text":", we also confirm that stronger regularization improves worst-group error but hurts average error in overparameterized models, while it has little effect on both worst-group and average error in underparameterized models. However, we focus on understanding the effect of overparameterization in the remainder of the paper.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Discussion. ","element":"span"},{"text":"Why does overparameterization hurt worst-group test error? We make two observations. First, in the overparameterized regime, the smallest groups incur the highest test error (blonde males in CelebA and waterbirds on land background in Waterbirds), despite having zero training error. In other words, overparameterized models perfectly fit the minority points at training time, but seem to do so by using patterns that do not generalize. We informally refer to this behavior as “memorizing” the minority points.","element":"span"}],[{"text":"Second, underparameterized models do obtain low worst-group error by learning patterns that generalize to both majority and minority groups. Therefore, overparameterized models should also be able to learn these patterns while attaining zero training error (e.g., by memorizing the training points that the underparameterized model cannot fit). 
Despite this, overparameterized models seem to learn patterns that generalize well on the majority but do not work on the minority (such as the spurious attributes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"in Figure ","element":"span"},{"href":"#id-10","text":"2","element":"a"},{"text":").","element":"span"}],[{"text":"What makes overparameterized models memorize the minority instead of learning patterns that generalize well on both majority and minority groups? We study this question in the next two sections: in Section ","element":"span"},{"text":"4","element":"span"},{"text":", we use simulations to understand properties of the data distribution that give rise to this trend, and in Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"we analyze a simplified linear setting and show how the inductive bias of models towards memorizing fewer points can lead to overparameterized models choosing to use spurious correlations.","element":"span"}]]},{"heading":"4. Simulation studies","paragraphs":[[{"text":"The discussion in Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"suggests two properties of the training distribution that modulate the effect of overparameterization on worst-group error. Intuitively, overparameterized models should be more incentivized to use the spurious features and memorize the minority groups if (i) the proportion of the majority group, ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/2-7.png","element":"img","alt":" pmaj","inline":true},{"text":", is higher, and (ii) the ratio","element":"span"}],[{"style":{"width":"100%"},"width":939,"height":399,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-0.png","element":"img"}],[{"id":"id-22","style":{"fontStyle":"italic"},"text":"Figure 3. 
","element":"figcaption","subtype":"caption"},{"text":"Increasing overparameterization (i.e., increasing model size) hurts the worst-group test error even though it improves the average test error. Here, we show results for models trained on the reweighted objective for CelebA (left) and Waterbirds (right).","element":"figcaption","subtype":"caption"}],[{"text":"of how informative the spurious features are relative to the core features, ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-1.png","element":"img","alt":" rs:c","inline":true},{"text":", is higher. In this section, we use simulations to confirm these intuitions and probe how ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-2.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-3.png","element":"img","alt":"rs:c","inline":true,"padRight":true},{"text":"affect worst-group error in overparameterized models.","element":"span"}],[{"id":"id-23","style":{"fontWeight":"bold"},"text":"4.1. Synthetic experiment setup","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Data distribution. ","element":"span"},{"text":"We construct a synthetic dataset that replicates the empirical trends in Section ","element":"span"},{"text":"3","element":"span"},{"text":". 
As in Section ","element":"span"},{"href":"#id-10","text":"2","element":"a"},{"text":", the label ","element":"span"},{"style":{"height":16},"width":198.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-4.png","element":"img","alt":" y ∈ {1, −1}","inline":true,"padRight":true},{"text":"is spuriously correlated with a spurious attribute ","element":"span"},{"style":{"height":16},"width":198.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-5.png","element":"img","alt":" a ∈ {1, −1}","inline":true},{"text":". We divide our training data into four groups accordingly: two majority groups with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", each of size ","element":"span"},{"style":{"height":16.79},"width":111.51,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-6.png","element":"img","alt":" nmaj/2","inline":true},{"text":", and two minority groups with ","element":"span"},{"style":{"height":10},"width":136.21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-7.png","element":"img","alt":" a = −y,","inline":true,"padRight":true},{"text":"each of size ","element":"span"},{"style":{"height":16},"width":111.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-8.png","element":"img","alt":" nmin/2","inline":true},{"text":". 
We define ","element":"span"},{"style":{"height":13.99},"width":285.77,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-9.png","element":"img","alt":" n = nmaj + nmin","inline":true,"padRight":true},{"text":"as the total number of training points, and ","element":"span"},{"style":{"height":16.79},"width":239.92,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-10.png","element":"img","alt":" pmaj = nmaj/n","inline":true,"padRight":true},{"text":"as the fraction of majority examples. The higher ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-11.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"is, the more strongly ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is correlated with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"in the training data.","element":"span"}],[{"text":"Each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y, a","element":"span"},{"text":") ","element":"span"},{"text":"group has its own distribution over input features ","element":"span"},{"style":{"height":18.18},"width":386.03,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-12.png","element":"img","alt":"x = [xcore, xspu] ∈ R2d","inline":true,"padRight":true},{"text":"comprising core features ","element":"span"},{"style":{"height":11.19},"width":117.79,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-13.png","element":"img","alt":" xcore ∈","inline":true},{"style":{"height":13.38},"width":45.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-14.png","element":"img","alt":"Rd","inline":true,"padRight":true},{"text":"generated from the label/core attribute 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", and spurious features ","element":"span"},{"style":{"height":18.18},"width":161.22,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-15.png","element":"img","alt":" xspu ∈ Rd ","inline":true,"padRight":true},{"text":"generated from the spurious attribute ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":":","element":"span"}],[{"style":{"width":"72%"},"width":678,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-16.png","element":"img"}],[{"text":"The core and spurious features are both noisy and encode their respective attributes at different signal-to-noise ratios. We define the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"spurious-core information ratio ","element":"span"},{"text":"(SCR) as ","element":"span"},{"style":{"height":19.72},"width":281.16,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-17.png","element":"img","alt":"rs:c = σ2core/σ2spu","inline":true},{"text":". The higher the SCR, the more signal ","element":"span"},{"text":"there is about the spurious attribute in the spurious features, relative to the signal about the label in the core features.","element":"span"}],[{"text":"Compared to the image datasets we studied in Section ","element":"span"},{"text":"3","element":"span"},{"text":", this synthetic dataset offers two key simplifications. First, the only differences between groups stem from their differences in ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y, a","element":"span"},{"text":")","element":"span"},{"text":", which isolates the effect of flipping the spurious attribute ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":". 
In contrast, in real datasets, groups can differ in other ways, e.g., more label noise in one group. Second, the relative difficulty of estimating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"versus ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is completely governed by changing ","element":"span"},{"style":{"height":19.72},"width":211.88,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-18.png","element":"img","alt":" σ2core and σ2spu","inline":true},{"text":". In contrast, ","element":"span"},{"text":"real datasets have additional complications, e.g., estimating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"might involve a more complex function of the input ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"than","element":"span"}],[{"style":{"width":"82%"},"width":778,"height":369,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-19.png","element":"img"}],[{"id":"id-20","style":{"fontStyle":"italic"},"text":"Figure 4. ","element":"figcaption","subtype":"caption"},{"text":"Overparameterization hurts worst-group test error but improves average test error on synthetic data, reproducing the trends we observe in real data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":933,"height":327,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 5. ","element":"figcaption","subtype":"caption"},{"text":"Overparameterized models have poor worst-group performance on the synthetic data because they rely on spurious features. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Left","element":"figcaption","subtype":"caption"},{"text":": removing the spurious feature (green) eliminates the detrimental effect of overparameterization. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Right","element":"figcaption","subtype":"caption"},{"text":": overparameterized models do well on the majority groups where the spurious features match the label, but poorly on the minority groups.","element":"figcaption","subtype":"caption"}],[{"text":"estimating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":", and there might be an inductive bias towards learning a simpler model over a more complex one.","element":"span"}],[{"text":"In all of the experiments below, we fix the total number of training points ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"to ","element":"span"},{"text":"3000","element":"span"},{"text":", and set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= 100 ","element":"span"},{"text":"(so each input ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"has ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= 200 ","element":"span"},{"text":"dimensions). 
Unless otherwise specified, we set the majority fraction ","element":"span"},{"style":{"height":15.59},"width":171.84,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-21.png","element":"img","alt":" pmaj = 0.9","inline":true,"padRight":true},{"text":"and the noise levels ","element":"span"},{"style":{"height":19.72},"width":403.37,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-22.png","element":"img","alt":"σ2spu = 1 and σ2core = 100","inline":true,"padRight":true},{"text":"to encourage the model to use the ","element":"span"},{"text":"spurious features over the core features.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Model. ","element":"span"},{"text":"To avoid the complexities of optimizing neural networks, we follow the same random features setup we used for Waterbirds in Section ","element":"span"},{"text":"3","element":"span"},{"text":": unregularized logistic regression using the reweighted objective on the random feature representation ReLU","element":"span"},{"style":{"height":16},"width":205.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-23.png","element":"img","alt":"(Wx) ∈ Rm","inline":true},{"text":", where ","element":"span"},{"style":{"height":14.19},"width":192.31,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/3-24.png","element":"img","alt":" W ∈ Rm×d","inline":true,"padRight":true},{"text":"is a random matrix (","element":"span"},{"href":"#id-2","referenceIndex":25,"text":"Mei & Montanari","element":"a"},{"href":"#id-2","referenceIndex":25,"text":", ","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"2019","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.2. Observations on synthetic dataset","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"The synthetic dataset replicates the trends we observe on real datasets. 
","element":"span"},{"text":"Figure ","element":"span"},{"href":"#id-22","text":"4 ","element":"a"},{"text":"shows how average and worst-group error change with the number of parameters/random projections ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":". This matches the trends we obtained on CelebA and Waterbirds in Section ","element":"span"},{"text":"3","element":"span"},{"text":". The best worst-group test error of 28.5% is achieved by an underparameterized model, whereas highly overparameterized models achieve high worst-group test error that plateaus at around 55%. In contrast, the average test error is better for overparameterized models than for underparameterized models.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Overparameterized models use spurious features. ","element":"span"},{"text":"Figure ","element":"span"},{"href":"#id-23","text":"5","element":"a"},{"text":"-Right shows that overparameterized models have high test error on minority groups (","element":"span"},{"style":{"height":10},"width":125.19,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-0.png","element":"img","alt":"a = −y","inline":true},{"text":") despite zero training error, but perform very well on the majority groups (","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":"). Since the only difference between the minority and majority groups in the synthetic dataset is the relative signs of the core and spurious attributes, this suggests overparameterized ","element":"span"},{"id":"id-25","text":"models are using spurious features and simply memorizing ","element":"span"},{"text":"the minority groups to get zero training error, consistent with our discussion in Section ","element":"span"},{"text":"3","element":"span"},{"text":". 
In contrast, the underparameterized model has low training and test errors across all groups, suggesting that it relies mainly on core features.","element":"span"}],[{"text":"These results imply that the degradation in the worst-group test error is due to the spurious features. We confirm that overparameterization no longer hurts when we “remove” the spurious features by replacing them with noise centered around zero (i.e., we replace the mean of ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-1.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"by 0). In this case, the best worst-group test error is now obtained by an overparameterized model, as shown in Figure ","element":"span"},{"href":"#id-23","text":"5","element":"a"},{"text":"-Left.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.3. Distributional properties","element":"span"}],[{"text":"What properties of the training data make overparameterization hurt worst-group error? We study (i) ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-2.png","element":"img","alt":" pmaj","inline":true},{"text":", which controls the relative size of majority to minority groups, and (ii) ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-3.png","element":"img","alt":" rs:c","inline":true},{"text":", the relative informativeness of spurious to core features. In the synthetic dataset, overparameterization hurts worst-group test error only when both are sufficiently high. 
In contrast, overparameterization helps average test error regardless; see Appendix ","element":"span"},{"href":"#id-24","text":"A.3","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Effect of the majority fraction ","element":"span"},{"style":{"height":11.59},"width":77.71,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-4.png","element":"img","alt":" pmaj.","inline":true,"padRight":true},{"text":"We observe that increasing ","element":"span"},{"style":{"height":16.79},"width":236.46,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-5.png","element":"img","alt":" pmaj = nmaj/n","inline":true},{"text":", which controls the relative size of the majority versus minority groups, makes overparameterization hurt worst-group error more (Figure ","element":"span"},{"href":"#id-25","text":"6","element":"a"},{"text":"). When the groups are perfectly balanced with ","element":"span"},{"style":{"height":15.99},"width":171.84,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-6.png","element":"img","alt":" pmaj = 0.5","inline":true},{"text":", overparameterization no longer hurts the worst-group test error, with overparameterized models achieving better worst-group test error than all underparameterized models. 
This suggests that group imbalance can be a key factor inducing the detrimental effect of overparameterization.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Effect of the spurious-core information ratio ","element":"span"},{"style":{"height":9.19},"width":63.53,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-7.png","element":"img","alt":" rs:c.","inline":true,"padRight":true},{"text":"Next, we characterize the effect of ","element":"span"},{"style":{"height":19.73},"width":266.26,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-8.png","element":"img","alt":" rs:c = σ2core/σ2spu","inline":true},{"text":", which ","element":"span"},{"text":"measures the relative informativeness of the spurious versus core features. A high ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-9.png","element":"img","alt":" rs:c","inline":true,"padRight":true},{"text":"means that the spurious features are more informative. 
We vary ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-10.png","element":"img","alt":" rs:c","inline":true,"padRight":true},{"text":"by changing ","element":"span"},{"style":{"height":19.72},"width":64.44,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-11.png","element":"img","alt":" σ2spu ","inline":true,"padRight":true},{"text":"while ","element":"span"},{"text":"keeping ","element":"span"},{"style":{"height":17.32},"width":192.92,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-12.png","element":"img","alt":" σ2core = 100","inline":true,"padRight":true},{"text":"fixed, since this does not change the best ","element":"span"},{"text":"possible worst-group test error (with a model that uses only the core features ","element":"span"},{"href":"#id-25","style":{"height":14.4},"width":239.58,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-13.png","element":"img","alt":" xcore). Figure 6","inline":true,"padRight":true},{"text":"shows that the higher ","element":"span"},{"style":{"height":12.8},"width":97.5,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-14.png","element":"img","alt":" rs:c is,","inline":true,"padRight":true},{"text":"the more overparameterization hurts. 
As ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-15.png","element":"img","alt":" rs:c","inline":true,"padRight":true},{"text":"increases, the spurious features become more informative, and overparameterized models rely more on them than the core features; underparameterized models outperform overparameterized models only for sufficiently large ","element":"span"},{"style":{"height":13.2},"width":137.23,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-16.png","element":"img","alt":" rs:c ≥ 1","inline":true},{"text":". Note that increasing ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-17.png","element":"img","alt":" rs:c","inline":true,"padRight":true},{"text":"does not significantly affect the worst-group","element":"span"}],[{"style":{"width":"99%"},"width":938,"height":298,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 6. ","element":"figcaption","subtype":"caption"},{"text":"The higher the majority fraction ","element":"figcaption","subtype":"caption"},{"style":{"height":10.01},"width":59.88,"height":25.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-19.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"and the spurious-core information ratio ","element":"figcaption","subtype":"caption"},{"style":{"height":7.99},"width":45.78,"height":19.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-20.png","element":"img","alt":" rs:c","inline":true},{"text":", the more overparameterization hurts the worst-group test error. 
With sufficiently low ","element":"figcaption","subtype":"caption"},{"style":{"height":13.61},"width":195.89,"height":34.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-21.png","element":"img","alt":" pmaj and rs:c,","inline":true,"padRight":true},{"text":"overparameterization switches to helping worst-group test error.","element":"figcaption","subtype":"caption"}],[{"text":"test error in the underparameterized regime, since the core features ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-22.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"are unaffected. In contrast, increasing the majority fraction ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-23.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"hurts the worst-group test error in both underparameterized and overparameterized models.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.4. An intuitive story","element":"span"}],[{"text":"We return to the question of what makes overparameterized models memorize the minority instead of learning patterns that generalize on both majority and minority groups. The simulation results above show that of all overparameterized models that achieve zero training error, the inductive bias of the model class and training algorithm favors models that use spurious features which generalize only for the majority groups, instead of learning to use core features that also generalize well on the minority groups.","element":"span"}],[{"text":"What is the nature of this inductive bias? 
Consider a model that predicts the label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"by returning its estimate of the spurious attribute ","element":"span"},{"style":{"height":15.59},"width":183.2,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-24.png","element":"img","alt":" a from xspu","inline":true},{"text":", taking advantage of the fact that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"are correlated in the training data. To achieve zero training error, it will need to memorize the points in the minority group, e.g., by exploiting variations due to noise in the features ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". On the other hand, consider a model that predicts ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"by returning a direct estimate of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"based on the core features ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-25.png","element":"img","alt":" xcore","inline":true},{"text":". 
Because ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-26.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"provides a noisier estimate of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"than ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-27.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"does for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":", this model will need to memorize all points for which ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-28.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"gives an inaccurate prediction of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"due to noise. Since the estimators of the core and spurious attributes are equally easy to learn, the main difference between these two models is the number of examples to be memorized.","element":"span"}],[{"text":"We therefore hypothesize that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the inductive bias favors memorizing as few points as possible","element":"span"},{"text":". 
This is consistent with the results above: the model uses ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-29.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"and memorizes the minority points only when the fraction of minority points is small (high majority fraction ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-30.png","element":"img","alt":" pmaj","inline":true},{"text":"). Similarly, the model uses ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-31.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"over ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-32.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"to fit the majority points only when the spurious features are less noisy (high ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/4-33.png","element":"img","alt":" rs:c","inline":true},{"text":") and therefore require less memorization to obtain zero training error than the core features. In the next section, we make this intuition formal by analyzing a related but simpler linear setting.","element":"span"}]]},{"heading":"5. Theoretical analysis","paragraphs":[[{"text":"In this section, we show how the inductive bias against memorization leads to overparameterization exacerbating spurious correlations. 
Our analysis explicates the effect of the inductive bias and the importance of the data parameters ","element":"span"},{"style":{"height":15.99},"width":196.55,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-0.png","element":"img","alt":"pmaj and rs:c","inline":true,"padRight":true},{"text":"discussed in Section ","element":"span"},{"text":"4","element":"span"},{"text":".","element":"span"}],[{"text":"The synthetic setting discussed in Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"is difficult to analyze because of the non-linear random projections, so we introduce a linear ","element":"span"},{"style":{"fontStyle":"italic"},"text":"explicit-memorization ","element":"span"},{"text":"setting that allows us to precisely define the concept of memorization. For clarity, we refer to the previous synthetic setting in Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"implicit-memorization ","element":"span"},{"text":"setting. In Appendix ","element":"span"},{"href":"#id-26","text":"A.4","element":"a"},{"text":", we show empirically that models in these two settings behave similarly in the overparameterized regime, though they differ in the underparameterized regime.","element":"span"}],[{"text":"In the previous implicit-memorization setting, we varied model size and memorization capacity by varying the number of random projections of the input. In the new explicit-memorization setting, we instead use linear models that act directly on the input and introduce explicit “noise features” that can be used to memorize. We vary the memorization capacity by varying the number of explicit noise features.","element":"span"}],[{"id":"id-44","style":{"fontWeight":"bold"},"text":"5.1. Explicit-memorization setup","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training data. 
","element":"span"},{"text":"We consider input features ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":16.79},"width":285.96,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-1.png","element":"img","alt":"[xcore, xspu, xnoise]","inline":true},{"text":", where the core feature ","element":"span"},{"style":{"height":13.19},"width":166.65,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-2.png","element":"img","alt":" xcore ∈ R","inline":true,"padRight":true},{"text":"and the spurious feature ","element":"span"},{"style":{"height":15.59},"width":144.44,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-3.png","element":"img","alt":" xspu ∈ R","inline":true,"padRight":true},{"text":"are scalars. As in the implicit-memorization setup, they are generated based on the label and the spurious attribute, respectively:","element":"span"}],[{"style":{"width":"83%"},"width":780,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-4.png","element":"img"}],[{"text":"The “noise” features ","element":"span"},{"style":{"height":15.78},"width":190.98,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-5.png","element":"img","alt":" xnoise ∈ RN ","inline":true,"padRight":true},{"text":"are generated as","element":"span"}],[{"style":{"width":"46%"},"width":440,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.8},"width":84.34,"height":44.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-7.png","element":"img","alt":" σ2noise ","inline":true,"padRight":true},{"text":"is a constant. 
The scaling by ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/N ","element":"span"},{"text":"ensures that ","element":"span"},{"text":"for large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":", the norm of the noise vectors ","element":"span"},{"style":{"height":17.8},"width":281.7,"height":44.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-8.png","element":"img","alt":" ∥xnoise∥22 ≈ σ2noise","inline":true,"padRight":true},{"text":"is approximately constant with high probability. Intuitively, when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is large, overparameterized models can use ","element":"span"},{"style":{"height":9.19},"width":84.34,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-9.png","element":"img","alt":" xnoise","inline":true,"padRight":true},{"text":"to fit a training point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"without affecting its predictions on other points, thereby memorizing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
We formalize this notion of memorization later in Section ","element":"span"},{"href":"#id-27","text":"5.2","element":"a"},{"text":".","element":"span"}],[{"text":"As before, the training data is composed of four groups, each corresponding to a combination of the label ","element":"span"},{"style":{"height":16},"width":205.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-10.png","element":"img","alt":" y ∈ {−1, 1}","inline":true,"padRight":true},{"text":"and the spurious attribute ","element":"span"},{"style":{"height":16},"width":198.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-11.png","element":"img","alt":" a ∈ {−1, 1}","inline":true},{"text":": two majority groups with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", each of size ","element":"span"},{"style":{"height":16.79},"width":111.51,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-12.png","element":"img","alt":" nmaj/2","inline":true},{"text":", and two minority groups with ","element":"span"},{"style":{"height":10},"width":127.4,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-13.png","element":"img","alt":" a = −y","inline":true},{"text":", each of size ","element":"span"},{"style":{"height":16},"width":111.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-14.png","element":"img","alt":" nmin/2","inline":true},{"text":". 
Combined, there are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"training examples ","element":"span"},{"style":{"height":18.33},"width":272.27,"height":45.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-15.png","element":"img","alt":" {(x(i), y(i))}ni=1.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Model. ","element":"span"},{"text":"We study unregularized logistic regression on the input features ","element":"span"},{"style":{"height":14.18},"width":190.9,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-16.png","element":"img","alt":" x ∈ RN+2","inline":true},{"text":". As before, we consider the ","element":"span"},{"text":"reweighted estimator ","element":"span"},{"style":{"height":10.98},"width":59.73,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-17.png","element":"img","alt":" ˆwrw","inline":true},{"text":". When the training data is linearly separable, the minimizer of the unregularized logistic loss on the training data is not well-defined. 
We therefore define ","element":"span"},{"style":{"height":10.99},"width":59.73,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-18.png","element":"img","alt":"ˆwrw ","inline":true,"padRight":true},{"text":"in terms of the sequence of ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-19.png","element":"img","alt":" L2","inline":true},{"text":"-regularized models ","element":"span"},{"style":{"height":15.5},"width":73.37,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-20.png","element":"img","alt":" ˆwrwλ :","inline":true}],[{"style":{"width":"91%"},"width":856,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-21.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-22.png","element":"img","alt":" ℓ","inline":true,"padRight":true},{"text":"is the logistic loss and ","element":"span"},{"style":{"height":15.59},"width":36.05,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-23.png","element":"img","alt":" ˆpg","inline":true,"padRight":true},{"text":"is the fraction of training examples in group ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":". 
Since scaling a model does not affect its 0-1 error, we define ","element":"span"},{"style":{"height":10.98},"width":59.73,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-24.png","element":"img","alt":" ˆwrw","inline":true,"padRight":true},{"text":"as the limit of this sequence, scaled to unit norm, as the regularization strength ","element":"span"},{"style":{"height":13.39},"width":142.62,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-25.png","element":"img","alt":" λ → 0+:","inline":true}],[{"id":"id-29","style":{"width":"69%"},"width":648,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-26.png","element":"img"}],[{"text":"In the underparameterized regime, the training data is not linearly separable and we simply have ","element":"span"},{"style":{"height":16},"width":327.97,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-27.png","element":"img","alt":" ˆwrw = ˆwrw0 /∥ ˆwrw0 ∥2.","inline":true,"padRight":true},{"text":"In the overparameterized regime where ","element":"span"},{"style":{"height":12},"width":122.35,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-28.png","element":"img","alt":" N ≫ n","inline":true},{"text":", the training data is linearly separable, and ","element":"span"},{"href":"#id-28","referenceIndex":30,"text":"Rosset et al. 
","element":"a"},{"href":"#id-28","referenceIndex":30,"text":"(","element":"a"},{"href":"#id-28","referenceIndex":30,"text":"2004","element":"a"},{"text":") showed that ","element":"span"},{"style":{"height":13.6},"width":398.56,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-29.png","element":"img","alt":" ˆwrw = ˆwmm, where ˆwmm ","inline":true,"padRight":true},{"text":"is the max-margin classifier","element":"span"}],[{"style":{"width":"81%"},"width":761,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-30.png","element":"img"}],[{"text":"The equivalence ","element":"span"},{"style":{"height":10.98},"width":227.89,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-31.png","element":"img","alt":" ˆwrw = ˆwmm","inline":true,"padRight":true},{"text":"holds regardless of the reweighting by ","element":"span"},{"style":{"height":16.79},"width":75.9,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-32.png","element":"img","alt":" 1/ˆpg","inline":true},{"text":": if we define the ERM estimator ","element":"span"},{"style":{"height":10.98},"width":76.89,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-33.png","element":"img","alt":" ˆwerm","inline":true,"padRight":true},{"text":"analogously to (","element":"span"},{"href":"#id-29","text":"5","element":"a"},{"text":") without the reweighting, it is also equal to ","element":"span"},{"style":{"height":10.99},"width":77.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-34.png","element":"img","alt":" ˆwmm","inline":true},{"text":". 
We will therefore analyze ","element":"span"},{"style":{"height":10.99},"width":77.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-35.png","element":"img","alt":" ˆwmm ","inline":true,"padRight":true},{"text":"in the overparameterized regime since it subsumes both ","element":"span"},{"style":{"height":11.6},"width":228.27,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-36.png","element":"img","alt":" ˆwrw and ˆwerm.","inline":true}],[{"text":"We also note that if we use gradient descent to directly optimize the unregularized logistic regression objective (either reweighted or not), the resulting solution after scaling to unit norm also converges to ","element":"span"},{"style":{"height":10.98},"width":77.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-37.png","element":"img","alt":" ˆwmm ","inline":true,"padRight":true},{"text":"as the number of gradient steps goes to infinity (","element":"span"},{"href":"#id-30","referenceIndex":33,"text":"Soudry et al.","element":"a"},{"href":"#id-30","referenceIndex":33,"text":", ","element":"a"},{"href":"#id-30","referenceIndex":33,"text":"2018","element":"a"},{"text":").","element":"span"}],[{"id":"id-27","style":{"fontWeight":"bold"},"text":"5.2. 
Analysis of worst-group error","element":"span"}],[{"text":"We now state our main analytical result: in the explicit-memorization setting, the worst-group test error of a sufficiently overparameterized model is greater than ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2 ","element":"span"},{"text":"(worse than random) under certain settings of ","element":"span"},{"style":{"height":19.72},"width":348.5,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-38.png","element":"img","alt":" σ2spu, σ2core, nmaj, nmin.","inline":true,"padRight":true},{"text":"In contrast, underparameterized models attain reasonable worst-group error even under such a setting.","element":"span"}],[{"id":"id-31","style":{"fontWeight":"bold"},"text":"Theorem 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":28.77},"width":846.24,"height":71.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-39.png","element":"img","alt":" pmaj ≥ (1 − 1/2001), σ2core ≥ 1, σ2spu ≤ 1/","inline":true}],[{"style":{"height":10.82},"width":186.21,"height":27.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-40.png","element":"img","alt":"(16 log 100 nmaj)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":18.88},"width":217.76,"height":47.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-41.png","element":"img","alt":" σ2noise ≤ nmaj/600^2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.2},"width":196.51,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-42.png","element":"img","alt":" nmin ≥ 100","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists 
","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-43.png","element":"img","alt":"N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":13.19},"width":138.67,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-44.png","element":"img","alt":" N > N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(overparameterized regime), with high probability over draws of the data,","element":"span"}],[{"style":{"width":"67%"},"width":630,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-45.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":10.99},"width":77.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-46.png","element":"img","alt":" ˆwmm ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the max-margin classifier.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"However, for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= 0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(underparameterized regime), with ","element":"span"},{"style":{"height":19.37},"width":496.61,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-47.png","element":"img","alt":"pmaj = (1 − 1/2001), σ2core = 1","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":19.72},"width":150.48,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-48.png","element":"img","alt":" σ2spu = 0","inline":true},{"style":{"fontStyle":"italic"},"text":", and in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"asymptotic regime with 
","element":"span"},{"style":{"height":15.99},"width":412.66,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-49.png","element":"img","alt":" nmaj, nmin → ∞, we have","inline":true}],[{"style":{"width":"66%"},"width":622,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/5-50.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":10.98},"width":59.73,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-0.png","element":"img","alt":" ˆwrw ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"minimizes the reweighted logistic loss.","element":"span"}],[{"text":"The result in the overparameterized regime applies to the max-margin classifier ","element":"span"},{"style":{"height":10.98},"width":77.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-1.png","element":"img","alt":" ˆwmm","inline":true},{"text":", which as discussed above subsumes both ","element":"span"},{"style":{"height":10.98},"width":211.42,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-2.png","element":"img","alt":" ˆwrw and ˆwerm ","inline":true,"padRight":true},{"text":"when the data is linearly separable. 
The proof of Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"appears in Appendix ","element":"span"},{"text":"B","element":"span"},{"text":".","element":"span"}],[{"text":"The conditions on ","element":"span"},{"style":{"height":19.72},"width":64.44,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-3.png","element":"img","alt":" σ2spu","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.32},"width":73.04,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-4.png","element":"img","alt":" σ2core","inline":true,"padRight":true},{"text":"in Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"above imply a ","element":"span"},{"text":"high spurious-core information ratio ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-5.png","element":"img","alt":" rs:c","inline":true},{"text":". 
Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"therefore provides a setting where high ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-6.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"and high ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-7.png","element":"img","alt":" rs:c","inline":true,"padRight":true},{"text":"provably make overparameterized models obtain high worst-group error, matching the trends we observed upon varying ","element":"span"},{"style":{"height":15.59},"width":194.06,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-8.png","element":"img","alt":"pmaj and rs:c","inline":true,"padRight":true},{"text":"in the implicit-memorization setting (Figure ","element":"span"},{"href":"#id-25","text":"6","element":"a"},{"text":"). Furthermore, underparameterized models obtain reasonable worst-group error despite these conditions, mirroring the observations in earlier sections.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.3. Overparameterization and memorization","element":"span"}],[{"text":"We now sketch the key ideas in the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"(full proof in Appendix ","element":"span"},{"text":"B","element":"span"},{"text":"), focusing first on the overparameterized regime. We start by establishing an inductive bias towards learning the minimum-norm model that fits the training data. We then define memorization and show how the minimum-norm inductive bias translates into a bias against memorization. 
Finally, we illustrate how the bias against memorization leads to learning the spurious feature and suffering high worst-group error.","element":"span"}],[{"id":"id-96","style":{"fontWeight":"bold"},"text":"Minimum-norm inductive bias. ","element":"span"},{"text":"Define a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"separator ","element":"span"},{"text":"as any model that correctly classifies all of the training points ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":") ","element":"span"},{"text":"with margin ","element":"span"},{"style":{"height":14},"width":175.27,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-9.png","element":"img","alt":" yw · x ≥ 1","inline":true},{"text":". Then from standard duality arguments, ","element":"span"},{"style":{"height":10.99},"width":77.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-10.png","element":"img","alt":" ˆwmm","inline":true,"padRight":true},{"text":"can be rewritten as ","element":"span"},{"style":{"height":17.79},"width":340.78,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-11.png","element":"img","alt":" ˆwminnorm/∥ ˆwminnorm∥","inline":true},{"text":", the scaled version of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"minimum-norm separator ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-12.png","element":"img","alt":" ˆwminnorm","inline":true}],[{"id":"id-32","style":{"width":"97%"},"width":913,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-13.png","element":"img"}],[{"text":"Since scaling does not affect the 0-1 test error, it suffices to analyze 
","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-14.png","element":"img","alt":" ˆwminnorm","inline":true},{"text":". Equation (","element":"span"},{"href":"#id-32","text":"9","element":"a"},{"text":") shows that out of the set of all separators (which all perfectly fit the training data), the inductive bias favors the separator with the minimum norm. We now discuss how this minimum-norm inductive bias favors less memorization.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Memorization. ","element":"span"},{"text":"For convenience, we denote the three components of a model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"as","element":"span"}],[{"style":{"width":"71%"},"width":667,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.59},"width":364.67,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-16.png","element":"img","alt":" wcore ∈ R, wspu ∈ R","inline":true},{"text":", and ","element":"span"},{"style":{"height":15.78},"width":212.09,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-17.png","element":"img","alt":" wnoise ∈ RN","inline":true},{"text":". 
By the representer theorem, we can decompose ","element":"span"},{"style":{"height":9.19},"width":90.1,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-18.png","element":"img","alt":" wnoise","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"id":"id-33","style":{"width":"69%"},"width":651,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-19.png","element":"img"}],[{"text":"In the overparameterized regime when ","element":"span"},{"style":{"height":12},"width":132.18,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-20.png","element":"img","alt":" N ≫ n","inline":true},{"text":", a model ","element":"span"},{"id":"id-34","text":"can “memorize” a training point ","element":"span"},{"style":{"height":14.18},"width":58.51,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-21.png","element":"img","alt":" x(i)","inline":true,"padRight":true},{"text":"via ","element":"span"},{"style":{"height":9.19},"width":90.1,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-22.png","element":"img","alt":" wnoise","inline":true},{"text":", in particular by putting a large weight ","element":"span"},{"style":{"height":14.19},"width":61.37,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-23.png","element":"img","alt":" α(i)","inline":true,"padRight":true},{"text":"in the direction of ","element":"span"},{"style":{"height":14.19},"width":58.5,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-24.png","element":"img","alt":" x(i)","inline":true,"padRight":true},{"text":"(Equation (","element":"span"},{"href":"#id-33","text":"11","element":"a"},{"text":")):","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 1 
","element":"span"},{"text":"(","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-25.png","element":"img","alt":"γ","inline":true},{"text":"-memorization)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"style":{"fontStyle":"italic"},"text":"memorizes a point ","element":"span"},{"style":{"height":18.6},"width":387.1,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-26.png","element":"img","alt":" x(i) if |α(i)| ≥ γ2/σ2noise ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some constant ","element":"span"},{"style":{"height":14.4},"width":110.33,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-27.png","element":"img","alt":" γ ∈ R.","inline":true}],[{"text":"Because the noise vectors of the training points (high-dimensional Gaussians) are nearly orthogonal for large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":", the component ","element":"span"},{"style":{"height":21.39},"width":148.16,"height":53.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-28.png","element":"img","alt":" α(i)x(i)noise","inline":true,"padRight":true},{"text":"affects the prediction on ","element":"span"},{"style":{"height":14.18},"width":58.5,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-29.png","element":"img","alt":" x(i)","inline":true},{"text":", but ","element":"span"},{"text":"not on any other training or test points.","element":"span"}],[{"text":"This ability to memorize plays a crucial role in making overparameterized models obtain high worst-group error. 
Intuitively, the minimum-norm inductive bias favors less memorization in overparameterized models. Roughly speaking, models that memorize more have larger weights ","element":"span"},{"style":{"height":18.19},"width":85.88,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-30.png","element":"img","alt":" |α(i)|","inline":true,"padRight":true},{"text":"on the noise vectors ","element":"span"},{"style":{"height":21.39},"width":84.34,"height":53.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-31.png","element":"img","alt":" x(i)noise","inline":true},{"text":". Since these noise vectors are ","element":"span"},{"text":"nearly orthogonal and have similar norms, this translates into a larger norm ","element":"span"},{"style":{"height":17.38},"width":159.98,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-32.png","element":"img","alt":" ∥wnoise∥22.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Comparing using ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-33.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"versus using ","element":"span"},{"style":{"height":11.59},"width":76.74,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-34.png","element":"img","alt":" xspu.","inline":true,"padRight":true},{"text":"To illustrate how the inductive bias against memorization leads to high worst-group error, we consider two extreme sets of separators: (i) ones that use the spurious feature but not the core feature, denoted by ","element":"span"},{"style":{"height":11.79},"width":149.06,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-35.png","element":"img","alt":" Wuse−spu ","inline":true,"padRight":true},{"text":"and (ii) ones that use the core feature but 
not the spurious feature, denoted by ","element":"span"},{"style":{"height":11.78},"width":169.84,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-36.png","element":"img","alt":" Wuse−core.","inline":true}],[{"style":{"width":"97%"},"width":916,"height":174,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-37.png","element":"img"}],[{"text":"In scenario (i), using the spurious feature ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-38.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"alone allows models to fit the majority groups very well. Thus, models that use ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-39.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"only need to memorize the minority points. 
","element":"span"},{"text":"In Proposition ","element":"span"},{"href":"#id-34","text":"1","element":"a"},{"text":", we construct a separator ","element":"span"},{"style":{"height":11.78},"width":347.19,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-40.png","element":"img","alt":"wuse−spu ∈ Wuse−spu","inline":true,"padRight":true},{"text":"and show that its norm ","element":"span"},{"style":{"fontStyle":"italic"},"text":"only ","element":"span"},{"text":"scales with the number of minority points ","element":"span"},{"style":{"height":9.19},"width":81.83,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-41.png","element":"img","alt":" nmin.","inline":true}],[{"text":"Conversely, in scenario (ii), using the core feature ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-42.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"alone allows models to fit all groups equally well. 
However, when ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-43.png","element":"img","alt":" rs:c","inline":true,"padRight":true},{"text":"is high, ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-44.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"is noisier than ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-45.png","element":"img","alt":" xspu","inline":true},{"text":", so models that use ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-46.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"still need to memorize a constant fraction of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"text":"the training points. 
In Proposition ","element":"span"},{"href":"#id-35","text":"2","element":"a"},{"text":", we show that norms of all separators ","element":"span"},{"style":{"height":11.78},"width":353.23,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-47.png","element":"img","alt":" wuse−core ∈ Wuse−core","inline":true,"padRight":true},{"text":"are lower bounded by a quantity linear in the total number of training points ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":".","element":"span"}],[{"text":"When the majority fraction ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-48.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"is sufficiently large such that ","element":"span"},{"style":{"height":11.59},"width":166.96,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-49.png","element":"img","alt":" nmin ≪ n","inline":true},{"text":", the separator ","element":"span"},{"style":{"height":10.99},"width":136.01,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-50.png","element":"img","alt":" wuse−spu","inline":true,"padRight":true},{"text":"that uses ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-51.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"will have a lower norm than any separator ","element":"span"},{"style":{"height":11.78},"width":353.16,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-52.png","element":"img","alt":" wuse−core ∈ Wuse−core","inline":true,"padRight":true},{"text":"that uses 
","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-53.png","element":"img","alt":" xcore","inline":true},{"text":". Since the inductive bias favors the minimum-norm separator, it prefers a separator ","element":"span"},{"style":{"height":10.99},"width":136.01,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-54.png","element":"img","alt":" wuse−spu","inline":true,"padRight":true},{"text":"that memorizes the minority points and suffers high worst-group error over any ","element":"span"},{"style":{"height":11.78},"width":365.36,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-55.png","element":"img","alt":" wuse−core ∈ Wuse−core.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Proposition 1 ","element":"span"},{"text":"(Norm of models using the spurious feature)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"When ","element":"span"},{"style":{"height":19.72},"width":157.38,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-56.png","element":"img","alt":" σ2core, σ2spu","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfy the conditions in Theorem ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", there ","element":"span"},{"style":{"fontStyle":"italic"},"text":"exists ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-57.png","element":"img","alt":" N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":13.19},"width":139.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/6-58.png","element":"img","alt":" N > N0","inline":true},{"style":{"fontStyle":"italic"},"text":", with high probability,","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"there exists a separator ","element":"span"},{"style":{"height":12},"width":493.66,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-0.png","element":"img","alt":" wuse−spu ∈ Wuse−spu such that","inline":true}],[{"style":{"width":"56%"},"width":526,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants ","element":"span"},{"style":{"height":14.4},"width":177.79,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-2.png","element":"img","alt":" γ1, γ2 > 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof sketch. 
","element":"span"},{"text":"To simplify exposition in this sketch, suppose that the noise vectors ","element":"span"},{"style":{"height":21.39},"width":84.34,"height":53.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-3.png","element":"img","alt":" x(i)noise","inline":true,"padRight":true},{"text":"are orthogonal and have con- ","element":"span"},{"text":"stant norm ","element":"span"},{"style":{"height":21.39},"width":297.09,"height":53.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-4.png","element":"img","alt":" ∥x(i)noise∥22 = σ2noise","inline":true},{"text":". We construct a separator ","element":"span"},{"style":{"height":11.79},"width":354.7,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-5.png","element":"img","alt":"wuse−spu ∈ Wuse−spu","inline":true,"padRight":true},{"text":"that does not use the core feature ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-6.png","element":"img","alt":"xcore","inline":true,"padRight":true},{"text":"as follows. Set ","element":"span"},{"style":{"height":17.33},"width":228.08,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-7.png","element":"img","alt":" wuse−spuspu = γ1","inline":true,"padRight":true},{"text":"for some large enough constant ","element":"span"},{"style":{"height":14.4},"width":123.85,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-8.png","element":"img","alt":" γ1 > 0","inline":true},{"text":". This is sufficient to satisfy the margin condition on the majority points: since ","element":"span"},{"style":{"height":19.72},"width":64.43,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-9.png","element":"img","alt":" σ2spu","inline":true,"padRight":true},{"text":"is very small, ","element":"span"},{"text":"w.h.p. 
all majority training points satisfy ","element":"span"},{"style":{"height":21.1},"width":278.44,"height":52.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-10.png","element":"img","alt":" y(i)(x(i)spuγ1) ≥ 1.","inline":true}],[{"text":"However, for the minority training points, the spurious attribute ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"does not match the label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", and in order to satisfy the margin condition with a positive ","element":"span"},{"style":{"height":17.33},"width":136.01,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-11.png","element":"img","alt":" wuse−spuspu","inline":true,"padRight":true},{"text":", these ","element":"span"},{"style":{"height":9.19},"width":69.55,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-12.png","element":"img","alt":" nmin","inline":true,"padRight":true},{"text":"minority points have to be memorized. Since ","element":"span"},{"style":{"height":19.72},"width":64.44,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-13.png","element":"img","alt":" σ2spu","inline":true,"padRight":true},{"text":"is very ","element":"span"},{"text":"small, the decrease in the margin due to ","element":"span"},{"style":{"height":17.32},"width":232.48,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-14.png","element":"img","alt":" wuse−spuspu = γ1","inline":true,"padRight":true},{"text":"is at most ","element":"span"},{"style":{"height":10.4},"width":88.23,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-15.png","element":"img","alt":" −ργ1","inline":true,"padRight":true},{"text":"w.h.p. 
for some constant ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-16.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"that depends on ","element":"span"},{"style":{"height":19.72},"width":64.44,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-17.png","element":"img","alt":"σ2spu","inline":true},{"text":". To satisfy the margin condition, it thus suffices to set ","element":"span"},{"style":{"height":23.05},"width":500.43,"height":57.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-18.png","element":"img","alt":"α(i)use−spu = y(i)(1+ργ1)/σ2noise","inline":true},{"text":", and the bound on the norm ","element":"span"},{"text":"follows. The full proof appears in Section ","element":"span"},{"href":"#id-36","text":"B.2.6","element":"a"},{"text":".","element":"span"}],[{"id":"id-35","style":{"fontWeight":"bold"},"text":"Proposition 2 ","element":"span"},{"text":"(Norm of models using the core feature)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"When ","element":"span"},{"style":{"height":19.73},"width":157.38,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-19.png","element":"img","alt":" σ2core, σ2spu","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfy the conditions in Theorem ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.2},"width":201.33,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-20.png","element":"img","alt":"nmin ≥ 100","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-21.png","element":"img","alt":" N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":13.19},"width":154.04,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-22.png","element":"img","alt":" N > N0","inline":true},{"style":{"fontStyle":"italic"},"text":", with high probability, all separators ","element":"span"},{"style":{"height":11.78},"width":353.16,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-23.png","element":"img","alt":" wuse−core ∈ Wuse−core","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"width":"38%"},"width":358,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constant ","element":"span"},{"style":{"height":14.4},"width":121.57,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-25.png","element":"img","alt":" γ3 > 
0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof sketch. ","element":"span"},{"text":"Any model ","element":"span"},{"style":{"height":11.78},"width":407.38,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-26.png","element":"img","alt":" wuse−core ∈ Wuse−core","inline":true,"padRight":true},{"text":"has ","element":"span"},{"style":{"height":17.32},"width":221.24,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-27.png","element":"img","alt":"wuse−corespu = 0","inline":true,"padRight":true},{"text":"by definition. We show that a constant fraction of training points have to be ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-28.png","element":"img","alt":" γ","inline":true},{"text":"-memorized in order to satisfy the margin condition. We do so by first showing that the probability that a training point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"satisfies the margin condition ","element":"span"},{"style":{"fontStyle":"italic"},"text":"without ","element":"span"},{"text":"being ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-29.png","element":"img","alt":" γ","inline":true},{"text":"-memorized cannot be too large. For simplicity, suppose again that the noise vectors ","element":"span"},{"style":{"height":21.39},"width":84.34,"height":53.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-30.png","element":"img","alt":" x(i)noise","inline":true,"padRight":true},{"text":"are orthogonal and have constant norm ","element":"span"},{"style":{"height":21.39},"width":284.79,"height":53.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-31.png","element":"img","alt":" ∥x(i)noise∥22 = σ2noise","inline":true},{"text":". 
","element":"span"},{"text":"Then this probability is ","element":"span"},{"style":{"height":19.2},"width":527.1,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-32.png","element":"img","alt":" P(xcorewuse−corecore ≤ 1 − γ2) ≥","inline":true},{"style":{"height":16},"width":206.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-33.png","element":"img","alt":"Φ(−1/σcore)","inline":true,"padRight":true},{"text":"for small ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-34.png","element":"img","alt":" γ","inline":true},{"text":", where ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-35.png","element":"img","alt":" Φ","inline":true,"padRight":true},{"text":"is the Gaussian CDF. Hence, in expectation, at least a constant fraction of points ","element":"span"},{"id":"id-42","text":"from the training distribution need to be memorized in order ","element":"span"},{"text":"for ","element":"span"},{"style":{"height":10.98},"width":144.61,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-36.png","element":"img","alt":" wuse−core","inline":true,"padRight":true},{"text":"to satisfy the margin condition. With high probability, this is also true on the training set consisting of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"points (via the DKW inequality) and the bound on the norm follows. 
The full proof appears in Section ","element":"span"},{"href":"#id-37","text":"B.2.7","element":"a"},{"text":".","element":"span"}],[{"text":"In the full proof of Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"B","element":"span"},{"text":", we generalize the above ideas to consider all separators in ","element":"span"},{"style":{"height":13.39},"width":97.48,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-37.png","element":"img","alt":" RN+2","inline":true,"padRight":true},{"text":"instead of just the separators in ","element":"span"},{"style":{"height":16},"width":355.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-38.png","element":"img","alt":" Wuse−spu ∪ Wuse−core","inline":true},{"text":". Note the importance of both ","element":"span"},{"style":{"height":15.99},"width":369.38,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-39.png","element":"img","alt":" rs:c and pmaj: when rs:c","inline":true,"padRight":true},{"text":"is high, models that use ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-40.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"only need to memorize the minority groups (Proposition ","element":"span"},{"href":"#id-34","text":"1","element":"a"},{"text":"), and when ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-41.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"is also high, these models end up memorizing fewer points than models that use ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-42.png","element":"img","alt":" 
xcore","inline":true,"padRight":true},{"text":"and have to memorize a constant fraction of the entire training set (Proposition ","element":"span"},{"href":"#id-35","text":"2","element":"a"},{"text":").","element":"span"}]]},{"heading":"6. Subsampling","paragraphs":[[{"text":"Our results above highlight the role of the majority fraction ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-43.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"in determining if overparameterization hurts worst-group test error. When ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-44.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"is large, the inductive bias favors using spurious features because it entails memorizing only a relatively small number of minority points, while the alternative of using core features requires memorizing a large number of majority points. This suggests that reducing the memorization cost of using core features by directly removing some majority points could induce overparameterized models to obtain low worst-group error.","element":"span"}],[{"text":"Here, we show that this approach of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"subsampling ","element":"span"},{"text":"the majority group achieves good worst-group test error on the datasets studied above. 
Subsampling creates a new ","element":"span"},{"style":{"fontStyle":"italic"},"text":"group-balanced ","element":"span"},{"text":"dataset by randomly removing training points in all other groups to match the number of points from the smallest group (","element":"span"},{"href":"#id-38","referenceIndex":21,"text":"Japkowicz & Stephen","element":"a"},{"href":"#id-38","referenceIndex":21,"text":", ","element":"a"},{"href":"#id-38","referenceIndex":21,"text":"2002","element":"a"},{"text":"; ","element":"span"},{"href":"#id-39","text":"Haixiang et al.","element":"a"},{"href":"#id-39","text":", ","element":"a"},{"href":"#id-39","text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":6,"text":"Buda et al.","element":"a"},{"href":"#id-40","referenceIndex":6,"text":", ","element":"a"},{"href":"#id-40","referenceIndex":6,"text":"2018","element":"a"},{"text":"). We then train a model to minimize the average loss on this subsampled dataset. For a precise description, see Appendix ","element":"span"},{"href":"#id-41","text":"A.6","element":"a"},{"text":".","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-42","text":"7 ","element":"a"},{"text":"shows that overparameterized models trained via subsampling (Equation ","element":"span"},{"href":"#id-43","text":"15","element":"a"},{"text":") obtain low worst-group error on the CelebA, Waterbirds, and synthetic (implicit-memorization) datasets. Across all three datasets, training via subsampling makes increasing overparameterization help ","element":"span"},{"style":{"fontStyle":"italic"},"text":"both ","element":"span"},{"text":"average and worst-group test error. 
Moreover, overparameterized models trained on subsampled data are comparable to or better than the best models trained on the full dataset (i.e., underparameterized models trained with reweighting).","element":"span"}],[{"style":{"width":"98%"},"width":928,"height":293,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/7-45.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 7. ","element":"figcaption","subtype":"caption"},{"text":"Overparameterization helps worst-group test error when training via subsampling, which involves creating a group-balanced dataset by reducing the number of majority points and minimizing average training loss on the new dataset.","element":"figcaption","subtype":"caption"}],[{"text":"Subsampling seems wasteful since it throws away a large fraction of the training data: we only use 3.4% of the full training data for CelebA, 4.6% for Waterbirds, and 10% for the synthetic dataset. However, the results above show that subsampling in overparameterized models matches or outperforms reweighting with underparameterized models. For example, on CelebA, an overparameterized model trained via subsampling obtains 11.1% average test and 15.1% worst-group test error, whereas an underparameterized model trained with reweighting obtains 11.3% average and 25.6% worst-group test error.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Subsampling vs. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"reweighting. ","element":"span"},{"text":"Both subsampling and reweighting artificially balance the groups in the training data, and previous work on imbalanced datasets has concluded that reweighting is typically at least as effective as subsampling (","element":"span"},{"href":"#id-40","referenceIndex":6,"text":"Buda et al.","element":"a"},{"href":"#id-40","referenceIndex":6,"text":", ","element":"a"},{"href":"#id-40","referenceIndex":6,"text":"2018","element":"a"},{"text":"). 
However, we find a clear difference between subsampling and reweighting in the overparameterized regime: increasing overparameterization with reweighting increases worst-group error, while doing so with subsampling decreases worst-group error. The intuition developed in Sections ","element":"span"},{"text":"4 ","element":"span"},{"text":"and ","element":"span"},{"text":"5 ","element":"span"},{"text":"sheds some light on this difference. Consider an overparameterized model: as in Section ","element":"span"},{"href":"#id-44","text":"5.1","element":"a"},{"text":", reweighting does not change the learned model, which is the max-margin classifier. However, subsampling reduces ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/8-0.png","element":"img","alt":" pmaj","inline":true},{"text":". Recall that the inductive bias favors spurious features when the alternative of using core features requires memorizing a large number of training points. By reducing ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/8-1.png","element":"img","alt":"pmaj","inline":true},{"text":", we reduce this memorization cost associated with core features, thereby inducing the model to use core features and achieve low worst-group test error.","element":"span"}]]},{"heading":"7. Related work","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"The effect of overparameterization. ","element":"span"},{"text":"The effect of overparameterization on average test error has been widely studied. In what is commonly referred to as “double descent”, increasing model size beyond zero training error decreases test error, despite conventional wisdom that overfitting should increase test error. 
This behavior has been observed empirically (","element":"span"},{"href":"#id-1","referenceIndex":3,"text":"Belkin et al.","element":"a"},{"href":"#id-1","referenceIndex":3,"text":", ","element":"a"},{"href":"#id-1","referenceIndex":3,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-45","referenceIndex":28,"text":"Opper","element":"a"},{"href":"#id-45","referenceIndex":28,"text":", ","element":"a"},{"href":"#id-45","referenceIndex":28,"text":"1995","element":"a"},{"text":"; ","element":"span"},{"href":"#id-46","referenceIndex":1,"text":"Advani & Saxe","element":"a"},{"href":"#id-46","referenceIndex":1,"text":", ","element":"a"},{"href":"#id-46","referenceIndex":1,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-0","referenceIndex":26,"text":"Nakkiran et al.","element":"a"},{"href":"#id-0","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"2019","element":"a"},{"text":") and shown analytically in high-dimensional regression (","element":"span"},{"href":"#id-47","referenceIndex":15,"text":"Hastie et al.","element":"a"},{"href":"#id-47","referenceIndex":15,"text":", ","element":"a"},{"href":"#id-47","referenceIndex":15,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-48","referenceIndex":2,"text":"Bartlett et al.","element":"a"},{"href":"#id-48","referenceIndex":2,"text":", ","element":"a"},{"href":"#id-48","referenceIndex":2,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":25,"text":"Mei & Montanari","element":"a"},{"href":"#id-2","referenceIndex":25,"text":", ","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"2019","element":"a"},{"text":"). These works focus on average test error and are consistent with our findings there. 
However, our focus is on worst-group test error, particularly when the groups are defined based on spurious attributes, and in this paper we establish that worst-group test error can behave quite differently from average test error.","element":"span"}],[{"text":"Increasing overparameterization can actually improve model robustness to some types of distributional shifts (","element":"span"},{"href":"#id-49","referenceIndex":18,"text":"Hendrycks ","element":"a"},{"href":"#id-49","referenceIndex":18,"text":"et al.","element":"a"},{"href":"#id-49","referenceIndex":18,"text":", ","element":"a"},{"href":"#id-49","referenceIndex":18,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-50","referenceIndex":17,"text":"Hendrycks & Dietterich","element":"a"},{"href":"#id-50","referenceIndex":17,"text":", ","element":"a"},{"href":"#id-50","referenceIndex":17,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-51","referenceIndex":37,"text":"Yang et al.","element":"a"},{"href":"#id-51","referenceIndex":37,"text":", ","element":"a"},{"href":"#id-51","referenceIndex":37,"text":"2020","element":"a"},{"text":"). In this light, our results show that the effect of overparameterization on model robustness can depend heavily ","element":"span"},{"text":"on the dataset (e.g., properties like ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/8-2.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/8-3.png","element":"img","alt":" rs:c","inline":true},{"text":"), the type of distributional shift, and the training procedure.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Worst-group error. 
","element":"span"},{"text":"Prior work on improving worst-group error focused on the underparameterized regime, with methods based on weighting/sampling (","element":"span"},{"href":"#id-16","referenceIndex":32,"text":"Shimodaira","element":"a"},{"href":"#id-16","referenceIndex":32,"text":", ","element":"a"},{"href":"#id-16","referenceIndex":32,"text":"2000","element":"a"},{"text":"; ","element":"span"},{"href":"#id-38","referenceIndex":21,"text":"Japkowicz & Stephen","element":"a"},{"href":"#id-38","referenceIndex":21,"text":", ","element":"a"},{"href":"#id-38","referenceIndex":21,"text":"2002","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":6,"text":"Buda et al.","element":"a"},{"href":"#id-40","referenceIndex":6,"text":", ","element":"a"},{"href":"#id-40","referenceIndex":6,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-52","referenceIndex":10,"text":"Cui et al.","element":"a"},{"href":"#id-52","referenceIndex":10,"text":", ","element":"a"},{"href":"#id-52","referenceIndex":10,"text":"2019","element":"a"},{"text":"), distributionally robust optimization (DRO) (","element":"span"},{"href":"#id-53","referenceIndex":4,"text":"Ben-Tal et al.","element":"a"},{"href":"#id-53","referenceIndex":4,"text":", ","element":"a"},{"href":"#id-53","referenceIndex":4,"text":"2013","element":"a"},{"text":"; ","element":"span"},{"href":"#id-54","referenceIndex":27,"text":"Namkoong & Duchi","element":"a"},{"href":"#id-54","referenceIndex":27,"text":", ","element":"a"},{"href":"#id-54","referenceIndex":27,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-55","referenceIndex":29,"text":"Oren et al.","element":"a"},{"href":"#id-55","referenceIndex":29,"text":", ","element":"a"},{"href":"#id-55","referenceIndex":29,"text":"2019","element":"a"},{"text":"), and fair algorithms 
(","element":"span"},{"href":"#id-56","referenceIndex":11,"text":"Dwork et al.","element":"a"},{"href":"#id-56","referenceIndex":11,"text":", ","element":"a"},{"href":"#id-56","referenceIndex":11,"text":"2012","element":"a"},{"text":"; ","element":"span"},{"href":"#id-57","referenceIndex":13,"text":"Hardt et al.","element":"a"},{"href":"#id-57","referenceIndex":13,"text":", ","element":"a"},{"href":"#id-57","referenceIndex":13,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-58","referenceIndex":22,"text":"Kleinberg et al.","element":"a"},{"href":"#id-58","referenceIndex":22,"text":", ","element":"a"},{"href":"#id-58","referenceIndex":22,"text":"2017","element":"a"},{"text":"). Our focus is on the overparameterized, zero-training-error regime; here, previous methods based on reweighting and DRO are ineffective (","element":"span"},{"href":"#id-59","referenceIndex":36,"text":"Wen et al.","element":"a"},{"href":"#id-59","referenceIndex":36,"text":", ","element":"a"},{"href":"#id-59","referenceIndex":36,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":8,"text":"Byrd & Lipton","element":"a"},{"href":"#id-17","referenceIndex":8,"text":", ","element":"a"},{"href":"#id-17","referenceIndex":8,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al.","element":"a"},{"href":"#id-9","referenceIndex":31,"text":", ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"). As mentioned in Section ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. 
","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":") demonstrated that stronger ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/8-4.png","element":"img","alt":"L2","inline":true},{"text":"-regularization can improve worst-group error on neural networks (when coupled with reweighting or group DRO). Similarly ","element":"span"},{"href":"#id-60","referenceIndex":9,"text":"Cao et al. ","element":"a"},{"href":"#id-60","referenceIndex":9,"text":"(","element":"a"},{"href":"#id-60","referenceIndex":9,"text":"2019","element":"a"},{"text":") show that data-dependent regularization can improve error on rare labels. While their work focuses on developing methods to improve worst-group error, our focus is on understanding the mechanisms by which overparameterization hurts worst-group error.","element":"span"}]]},{"heading":"8. Discussion","paragraphs":[[{"text":"Our work shows that overparameterization hurts worst-group error on real datasets that contain spurious correlations. We studied the implicit- and explicit-memorization settings to provide a potential story for why this might occur: there can be an inductive bias towards solutions that do not need to memorize as many training points, and this can favor models that exploit the spurious correlations.","element":"span"}],[{"text":"However, our synthetic settings make several simplifying assumptions, e.g., they suppose that the model prefers the spurious feature because it is less noisy than the core feature. This assumption need not always apply, and different assumptions might also lead to overparameterization exacerbating spurious correlations. 
For example, there might exist a true classifier based on the core features which has high accuracy but which is relatively more complex (e.g., high parameter norm) and therefore not favored by the training procedure. Studying the effect of overparameterization in settings such as those is important future work.","element":"span"}],[{"text":"We also observed that subsampling allows overparameterized models to achieve low average and worst-group test error, despite eliminating a large fraction of training examples. In contrast, when using the full training data, only underparameterized models attain low worst-group test error under our current training methods. These observations call for future work to develop methods that can exploit both the statistical information in the full training data and the expressivity of overparameterized models, so as to attain good worst-group and average test error.","element":"span"}]]},{"heading":"Acknowledgements","paragraphs":[[{"text":"We are grateful to Yair Carmon, John Duchi, Tatsunori Hashimoto, Ananya Kumar, Yiping Lu, Tengyu Ma, and Jacob Steinhardt for helpful discussions and suggestions. SS was supported by a Stanford Graduate Fellowship, AR was supported by a Google PhD Fellowship and Open Philanthropy Project AI Fellowship, and PWK was supported by the Facebook Fellowship Program.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Reproducibility","element":"span"}],[{"text":"Code ","element":"span"},{"text":"is ","element":"span"},{"text":"available ","element":"span"},{"text":"at ","element":"span"},{"href":"https://github.com/ssagawa/overparam_spur_corr","text":"https://github.com/ssagawa/overparam_spur_corr","element":"a"},
{"text":".","element":"span"}],[{"id":"id-39","style":{"width":"100%"},"width":940,"height":172,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/9-0.png","element":"img"}]]},{"heading":"References","paragraphs":[[{"id":"id-46","text":"Advani, M. S. and Saxe, A. M. High-dimensional dynamics ","element":"span"},{"text":"of generalization error in neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1710.03667","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-48","text":"Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. ","element":"span"},{"text":"Benign overfitting in linear regression. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-1","text":"Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling ","element":"span"},{"text":"modern machine-learning practice and the classical bias–variance trade-off. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the National Academy of Sciences","element":"span"},{"text":", 116(32), 2019.","element":"span"}],[{"id":"id-53","text":"Ben-Tal, A., den Hertog, D., Waegenaere, A. D., Melenberg, ","element":"span"},{"text":"B., and Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Management Science","element":"span"},{"text":", 59:341–357, 2013.","element":"span"}],[{"id":"id-5","text":"Blodgett, S. L., Green, L., and O’Connor, B. ","element":"span"},{"text":"Demographic dialectal variation in social media: A case study of African-American English. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Empirical Methods in Natural Language Processing (EMNLP)","element":"span"},{"text":", pp. 
1119–1130, 2016.","element":"span"}],[{"id":"id-40","text":"Buda, M., Maki, A., and Mazurowski, M. A. A systematic ","element":"span"},{"text":"study of the class imbalance problem in convolutional neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Networks","element":"span"},{"text":", 106:249–259, 2018.","element":"span"}],[{"id":"id-7","text":"Buolamwini, J. and Gebru, T. Gender shades: Intersectional ","element":"span"},{"text":"accuracy disparities in commercial gender classification. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Fairness, Accountability and Transparency","element":"span"},{"text":", pp. 77–91, 2018.","element":"span"}],[{"id":"id-17","text":"Byrd, J. and Lipton, Z. What is the effect of importance ","element":"span"},{"text":"weighting in deep learning? In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", pp. 872–881, 2019.","element":"span"}],[{"id":"id-60","text":"Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, ","element":"span"},{"text":"T. Learning imbalanced datasets with label-distribution-aware margin loss. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-52","text":"Cui, Y., Jia, M., Lin, T., Song, Y., and Belongie, S. ","element":"span"},{"text":"Class-balanced loss based on effective number of samples. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Vision and Pattern Recognition (CVPR)","element":"span"},{"text":", pp. 9268–9277, 2019.","element":"span"}],[{"id":"id-56","text":"Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, ","element":"span"},{"text":"R. Fairness through awareness. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Innovations in Theoretical Computer Science (ITCS)","element":"span"},{"text":", pp. 214–226, 2012.","element":"span"}],[{"text":"Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., and Bing, G. Learning from class-imbalanced data: Review of methods and applications. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Expert Systems with Applications","element":"span"},{"text":", 73:220–239, 2017.","element":"span"}],[{"id":"id-57","text":"Hardt, M., Price, E., and Srebro, N. Equality of opportunity ","element":"span"},{"text":"in supervised learning. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", pp. 3315–3323, 2016.","element":"span"}],[{"id":"id-6","text":"Hashimoto, T. B., Srivastava, M., Namkoong, H., and Liang, ","element":"span"},{"text":"P. Fairness without demographics in repeated loss minimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-47","text":"Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. ","element":"span"},{"text":"Surprises in high-dimensional ridgeless least squares interpolation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1903.08560","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-19","text":"He, K., Zhang, X., Ren, S., and Sun, J. Deep residual ","element":"span"},{"text":"learning for image recognition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Vision and Pattern Recognition (CVPR)","element":"span"},{"text":", 2016.","element":"span"}],[{"id":"id-50","text":"Hendrycks, D. and Dietterich, T. 
Benchmarking neural ","element":"span"},{"text":"network robustness to common corruptions and perturbations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1903.12261","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-49","text":"Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and ","element":"span"},{"text":"Song, D. Natural adversarial examples. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1907.07174","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-15","text":"Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does ","element":"span"},{"text":"distributionally robust supervised learning give robust classifiers? In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-69","text":"Ioffe, S. and Szegedy, C. Batch normalization: ","element":"span"},{"text":"Accelerating deep network training by reducing internal covariate shift. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", pp. 448–456, 2015.","element":"span"}],[{"id":"id-38","text":"Japkowicz, N. and Stephen, S. The class imbalance problem: ","element":"span"},{"text":"A systematic study. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Intelligent Data Analysis","element":"span"},{"text":", 6(5):429–449, 2002.","element":"span"}],[{"id":"id-58","text":"Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent ","element":"span"},{"text":"trade-offs in the fair determination of risk scores. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Innovations in Theoretical Computer Science (ITCS)","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-11","text":"Liu, Z., Luo, P., Wang, X., and Tang, X. 
Deep learning ","element":"span"},{"text":"face attributes in the wild. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE International Conference on Computer Vision","element":"span"},{"text":", pp. 3730–3738, 2015.","element":"span"}],[{"id":"id-8","text":"McCoy, R. T., Pavlick, E., and Linzen, T. Right for the ","element":"span"},{"text":"wrong reasons: Diagnosing syntactic heuristics in natural language inference. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Association for Computational Linguistics (ACL)","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-2","text":"Mei, S. and Montanari, A. The generalization error of ","element":"span"},{"text":"random features regression: Precise asymptotics and double descent curve. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1908.05355","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-0","text":"Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, ","element":"span"},{"text":"B., and Sutskever, I. ","element":"span"},{"text":"Deep double descent: Where bigger models and more data hurt. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1912.02292","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-54","text":"Namkoong, H. and Duchi, J. Variance regularization with ","element":"span"},{"text":"convex objectives. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-45","text":"Opper, M. Statistical mechanics of learning: Generalization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Handbook of Brain Theory and Neural Networks","element":"span"},{"text":", pp. 922–925, 1995.","element":"span"}],[{"id":"id-55","text":"Oren, Y., Sagawa, S., Hashimoto, T., and Liang, P. 
","element":"span"},{"text":"Distributionally robust language modeling. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Empirical Methods in Natural Language Processing (EMNLP)","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-28","text":"Rosset, S., Zhu, J., and Hastie, T. J. Margin maximizing loss ","element":"span"},{"text":"functions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pp. 1237–1244, 2004.","element":"span"}],[{"id":"id-9","text":"Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. ","element":"span"},{"text":"Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations (ICLR)","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-16","text":"Shimodaira, H. Improving predictive inference under ","element":"span"},{"text":"covariate shift by weighting the log-likelihood function. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Statistical Planning and Inference","element":"span"},{"text":", 90:227–244, 2000.","element":"span"}],[{"id":"id-30","text":"Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and ","element":"span"},{"text":"Srebro, N. The implicit bias of gradient descent on separable data. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research (JMLR)","element":"span"},{"text":", 19(1):2822–2878, 2018.","element":"span"}],[{"id":"id-70","text":"Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., ","element":"span"},{"text":"and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research (JMLR)","element":"span"},{"text":", 15(1):1929–1958, 2014.","element":"span"}],[{"id":"id-12","text":"Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, ","element":"span"},{"text":"S. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.","element":"span"}],[{"id":"id-59","text":"Wen, J., Yu, C., and Greiner, R. Robust learning under ","element":"span"},{"text":"uncertain test distributions: Relating covariate shift to model misspecification. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", pp. 631–639, 2014.","element":"span"}],[{"id":"id-51","text":"Yang, Z., Yu, Y., You, C., Steinhardt, J., and Ma, Y. ","element":"span"},{"text":"Rethinking bias-variance trade-off for generalization of neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2002.11328","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-18","text":"Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, ","element":"span"},{"text":"O. Understanding deep learning requires rethinking generalization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations (ICLR)","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-13","text":"Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and ","element":"span"},{"text":"Torralba, A. Places: A 10 million image database for scene recognition. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Pattern Analysis and Machine Intelligence","element":"span"},{"text":", 40(6):1452–1464, 2017.","element":"span"}]]},{"heading":"A. Supplemental experiments","paragraphs":[[{"id":"id-4","style":{"fontWeight":"bold"},"text":"A.1. 
ERM models have poor worst-group error regardless of the degree of overparameterization","element":"span"}],[{"text":"In the main text, we focused on reweighted models, trained with the reweighted objective on the full data (Sections ","element":"span"},{"text":"3","element":"span"},{"text":"-","element":"span"},{"text":"5","element":"span"},{"text":"), as well as subsampled models, trained on subsampled data with the ERM objective (Section ","element":"span"},{"text":"6","element":"span"},{"text":"). Here, we study the effect of overparameterization on ERM models, trained with the ERM objective on the full data. Consistent with prior work, we observe that ERM models obtain poor worst-group error (near or worse than random), regardless of whether the model is underparameterized or overparameterized (","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al.","element":"a"},{"href":"#id-9","referenceIndex":31,"text":", ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"). We also confirm that overparameterization helps average test error (see, e.g., ","element":"span"},{"href":"#id-0","referenceIndex":26,"text":"Nakkiran et al. ","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"(","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-1","referenceIndex":3,"text":"Belkin et al. ","element":"a"},{"href":"#id-1","referenceIndex":3,"text":"(","element":"a"},{"href":"#id-1","referenceIndex":3,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-2","referenceIndex":25,"text":"Mei & Montanari ","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"(","element":"a"},{"href":"#id-2","referenceIndex":25,"text":"2019","element":"a"},{"text":")).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Empirical results. 
","element":"span"},{"text":"We first consider the CelebA and Waterbirds datasets, following the experimental set-up of Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"but now training with the standard ERM objective (Equation (","element":"span"},{"href":"#id-61","text":"2","element":"a"},{"text":")) instead of the reweighted objective (Equation (","element":"span"},{"href":"#id-3","text":"3","element":"a"},{"text":")).","element":"span"}],[{"text":"On these datasets, overparameterization helps the average test error (Figure ","element":"span"},{"href":"#id-62","text":"8","element":"a"},{"text":"). As model size increases past the point of zero training error, the average test error decreases. The best average test error is obtained by highly overparameterized models with zero training error: 4.6% for CelebA at width 96, and 4.2% for Waterbirds at 6,000 random features.","element":"span"}],[{"text":"In contrast, the worst-group error is consistently high across model sizes: it is consistently worse than random (","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":"50%) for CelebA and nearly random (44%) for Waterbirds (Figure ","element":"span"},{"href":"#id-62","text":"8","element":"a"},{"text":"). These worst-group errors are much worse than those obtained by reweighted, underparameterized models (25.6% for CelebA and 26.6% for Waterbirds; see Section ","element":"span"},{"text":"3","element":"span"},{"text":"). 
Thus, while overparameterization helps ERM models achieve better test error, these models all fail to yield good worst-group error regardless of the degree of overparameterization.","element":"span"}],[{"style":{"width":"99%"},"width":1944,"height":606,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/11-0.png","element":"img"}],[{"id":"id-62","style":{"fontStyle":"italic"},"text":"Figure 8. ","element":"figcaption","subtype":"caption"},{"text":"The effect of overparameterization on the average and worst-group error of an ERM model. Increasing model size helps average test error, but worst-group error remains poor across model sizes.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Simulation results. ","element":"span"},{"text":"We also evaluate the effect of overparameterization on ERM models on the synthetic dataset introduced in Section ","element":"span"},{"text":"4","element":"span"},{"text":". As above, ERM models fail to achieve reasonable worst-group test error across model sizes, but improve in average test error as model size increases (Figure ","element":"span"},{"href":"#id-62","text":"8","element":"a"},{"text":"). The best average test error is obtained by a highly overparameterized model with zero training error—9.0% error at 9,000 random features—while the worst-group test error is nearly random or worse (","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"48","element":"span"},{"text":"%) across model sizes.","element":"span"}],[{"id":"id-21","style":{"fontWeight":"bold"},"text":"A.2. 
Stronger ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/11-1.png","element":"img","alt":" L2","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"regularization improves worst-group error in overparameterized reweighted models","element":"span"}],[{"text":"In the main text, we studied models with default/weak or no ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/11-2.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization. In this section, we study the role of ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/11-3.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization in modulating the effect of overparameterization on worst-group error by changing the hyperparameter ","element":"span"},{"style":{"height":11.6},"width":93.18,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/11-4.png","element":"img","alt":" λ that","inline":true,"padRight":true},{"text":"controls ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/11-5.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization strength. Overall, we find that increasing ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/11-6.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization (to the point where models do not have zero training error) improves worst-group error but hurts average error in overparameterized reweighted models. 
In contrast, ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/11-7.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization has little effect on both worst-group and average error in the underparameterized regime.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Strong ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/12-0.png","element":"img","alt":" L2","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"regularization improves worst-group error in overparameterized reweighted models. ","element":"span"},{"text":"In the main text, we trained ResNet10 models with default, weak regularization (","element":"span"},{"style":{"height":10.8},"width":193.02,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/12-1.png","element":"img","alt":"λ = 0.0001","inline":true},{"text":") on the CelebA dataset, and unregularized logistic regression on the Waterbirds and synthetic datasets. Here, we consider strongly-regularized models with ","element":"span"},{"style":{"height":10.8},"width":127.38,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/12-2.png","element":"img","alt":" λ = 0.1","inline":true,"padRight":true},{"text":"for both types of models; unlike before, these models no longer achieve zero training error even when overparameterized. 
Figure ","element":"span"},{"href":"#id-63","text":"9 ","element":"a"},{"text":"shows the results of varying model size on strongly-regularized ERM, reweighted, and subsampled models on the three datasets.","element":"span"}],[{"text":"On all three datasets, with strong regularization, ERM models continue to yield poor worst-group test error across model sizes, with similar or worse worst-group test error compared to weak or no regularization. Conversely, strongly-regularized subsampled models continue to achieve low worst-group test error across model sizes.","element":"span"}],[{"text":"Strong regularization does have a large effect on reweighted models. With reweighting, we find that strong regularization improves worst-group error in overparameterized models: across all three datasets, the worst-group test error in the overparameterized regime is much lower for the strongly-regularized models than for their weakly-regularized or unregularized counterparts (Figure ","element":"span"},{"href":"#id-20","text":"3","element":"a"},{"text":"). These results are consistent with similar observations made in ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"). However, even though strongly-regularized overparameterized models outperform weakly-regularized overparameterized models, overparameterization can still hurt the worst-group error in strongly-regularized reweighted models. 
On the CelebA and synthetic datasets, with ","element":"span"},{"style":{"height":10.8},"width":127.37,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/12-3.png","element":"img","alt":" λ = 0.1","inline":true},{"text":", the best worst-group error is still obtained by an underparameterized model, though overparameterization seems to help worst-group error on the Waterbirds dataset, at least in the range of model sizes studied.","element":"span"}],[{"style":{"width":"94%"},"width":1847,"height":1413,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/12-4.png","element":"img"}],[{"id":"id-63","style":{"fontStyle":"italic"},"text":"Figure 9. ","element":"figcaption","subtype":"caption"},{"text":"Strongly-regularized models have lower worst-group error than their weakly-regularized counterparts in the overparameterized regime (Figure ","element":"figcaption","subtype":"caption"},{"href":"#id-20","text":"3","element":"a","subtype":"caption"},{"text":"). Even under strong regularization, increasing model size can hurt the worst-group error on the CelebA (top) and synthetic (bottom) datasets, although overparameterization seems to improve worst-group error in the Waterbirds dataset (middle) for the range of model sizes studied.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Overparameterized models require strong regularization for worst-group test error but not average test error. 
","element":"span"},{"text":"Given a fixed overparameterized model size, how does its performance change with the ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-0.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization strength ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-1.png","element":"img","alt":" λ","inline":true},{"text":"? We study this with the logistic regression model on the Waterbirds and synthetic datasets, using a model size of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"= 10","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"000 ","element":"span"},{"text":"random features and varying the ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-2.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization strength from ","element":"span"},{"style":{"height":13.78},"width":380.37,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-3.png","element":"img","alt":" λ = 10−9 to λ = 102. 1","inline":true}],[{"text":"Results are in Figure ","element":"span"},{"href":"#id-64","text":"10","element":"a"},{"text":". 
As before, ERM models obtain poor worst-group error regardless of the regularization strength, and subsampled models are relatively insensitive to regularization, achieving reasonable worst-group error at most settings of ","element":"span"},{"style":{"height":11.2},"width":33.24,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-4.png","element":"img","alt":" λ.","inline":true}],[{"text":"For reweighted models, however, having the right level of regularization is critical for obtaining good worst-group test error. On both datasets, the best worst-group test error is obtained by strongly-regularized models that do not achieve zero training error. In contrast, increasing regularization strength hurts average error, with the best average test error attained by models with nearly zero regularization.","element":"span"}],[{"style":{"width":"99%"},"width":1944,"height":987,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-5.png","element":"img"}],[{"id":"id-64","style":{"fontStyle":"italic"},"text":"Figure 10. ","element":"figcaption","subtype":"caption"},{"text":"The effect of regularization on overparameterized random features logistic regression models (","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"m ","element":"figcaption","subtype":"caption"},{"text":"= 10","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":", ","element":"figcaption","subtype":"caption"},{"text":"000","element":"figcaption","subtype":"caption"},{"text":"). ERM models (left) do consistently poorly while subsampled models (right) do consistently well on worst-group error. 
For reweighted models (middle), the best worst-group error is obtained by a strongly-regularized model that does not achieve zero training error.","element":"figcaption","subtype":"caption"}],[{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-6.png","element":"img","alt":"L2","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"regularization affects where worst-group test error plateaus as model size increases. ","element":"span"},{"text":"In the above experiments, we kept either model size or regularization strength fixed, and varied the other. Here, we vary both: we consider ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-7.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization strengths ","element":"span"},{"style":{"height":17.39},"width":529.56,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-8.png","element":"img","alt":" λ ∈ {10−9, 10−6, 0.001, 0.1, 10}","inline":true,"padRight":true},{"text":"and investigate the effect of increasing model size for each ","element":"span"},{"style":{"height":11.2},"width":97.22,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-9.png","element":"img","alt":" λ. We","inline":true,"padRight":true},{"text":"plot the results for Waterbirds and the synthetic dataset in Figure ","element":"span"},{"href":"#id-65","text":"11 ","element":"a"},{"text":"and Figure ","element":"span"},{"href":"#id-66","text":"12 ","element":"a"},{"text":"respectively.","element":"span"}],[{"text":"For reweighted models, the results match what we observed above. 
Strengthening ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-10.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization reduces the detrimental effect of overparameterization on worst-group error. For any fixed model size in the overparameterized regime, the worst-group test error improves as ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-11.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"increases up to a certain value. Worst-group test error seems to plateau at different values as model size increases, depending on the regularization strength, though we note that it is possible that further increasing model size beyond the range we studied might lead models with different regularization strengths to eventually converge. Further empirical studies as well as theoretical characterization of the interaction between regularization and overparameterization are needed to confirm this phenomenon.","element":"span"}],[{"text":"Given sufficiently large ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-12.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"(e.g., ","element":"span"},{"style":{"height":10.8},"width":120.88,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/13-13.png","element":"img","alt":" λ = 10","inline":true,"padRight":true},{"text":"for both Waterbirds and synthetic datasets), overparameterized models seem to ","element":"span"},{"text":"outperform underparameterized models, at least for the range of model sizes studied. 
However, we caution that this trend does not seem to hold on the CelebA dataset (Figure ","element":"span"},{"href":"#id-63","text":"9","element":"a"},{"text":").","element":"span"}],[{"text":"Finally, in contrast with its effects on overparameterized models, regularization seems to only have a modest effect on worst-group test error in the underparameterized regime.","element":"span"}],[{"style":{"width":"99%"},"width":1932,"height":919,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/14-0.png","element":"img"}],[{"id":"id-65","style":{"fontStyle":"italic"},"text":"Figure 11. ","element":"figcaption","subtype":"caption"},{"text":"The effect of overparameterization on models with different ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":40.08,"height":28.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/14-1.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization strengths ","element":"figcaption","subtype":"caption"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/14-2.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"on the Waterbirds dataset. Different regularization strengths are shown in different colors, with training and test errors plotted in light and dark colors, respectively.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1932,"height":919,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/14-3.png","element":"img"}],[{"id":"id-66","style":{"fontStyle":"italic"},"text":"Figure 12. 
","element":"figcaption","subtype":"caption"},{"text":"The effect of overparameterization on models with different ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":40.08,"height":28.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/14-4.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization strengths ","element":"figcaption","subtype":"caption"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/14-5.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"on the synthetic dataset. The plotting scheme follows that of Figure ","element":"figcaption","subtype":"caption"},{"href":"#id-65","text":"11","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"id":"id-24","style":{"fontWeight":"bold"},"text":"A.3. Overparameterization helps average test error on the synthetic data regardless of ","element":"span"},{"style":{"height":15.59},"width":203.24,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-0.png","element":"img","alt":" pmaj and rs:c","inline":true}],[{"text":"Figure ","element":"span"},{"href":"#id-67","text":"13 ","element":"a"},{"text":"shows how the average test error changes as a function of model size under different settings of the majority fraction ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-1.png","element":"img","alt":"pmaj","inline":true,"padRight":true},{"text":"and the spurious-core ratio ","element":"span"},{"style":{"height":9.19},"width":51.36,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-2.png","element":"img","alt":" rs:c","inline":true,"padRight":true},{"text":"on the synthetic dataset introduced in Section 
","element":"span"},{"text":"4","element":"span"},{"text":". As expected, overparameterization helps the average test error regardless of SCR and the majority fraction.","element":"span"}],[{"style":{"width":"69%"},"width":1361,"height":447,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-3.png","element":"img"}],[{"id":"id-67","style":{"fontStyle":"italic"},"text":"Figure 13. ","element":"figcaption","subtype":"caption"},{"text":"The effect of overparameterization on average error of a reweighted model on synthetic data. Different values of ","element":"figcaption","subtype":"caption"},{"style":{"height":13.61},"width":174.9,"height":34.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-4.png","element":"img","alt":" pmaj and rs:c","inline":true,"padRight":true},{"text":"are plotted in different colors, with training and test errors plotted in light and dark colors, respectively. Across all values of ","element":"figcaption","subtype":"caption"},{"style":{"height":13.61},"width":185.84,"height":34.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-5.png","element":"img","alt":" pmaj and rs:c,","inline":true,"padRight":true},{"text":"overparameterization helps the average test error.","element":"figcaption","subtype":"caption"}],[{"id":"id-26","style":{"fontWeight":"bold"},"text":"A.4. 
Comparison between implicit and explicit memorization","element":"span"}],[{"text":"To motivate the explicit-memorization setting, we ran some brief experiments to show that in the overparameterized regime, linear models in the explicit-memorization setting behave similarly to random projection (RP) models in the implicit-memorization setting, with ","element":"span"},{"style":{"height":19.73},"width":215.72,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-6.png","element":"img","alt":" σ2core and σ2spu ","inline":true,"padRight":true},{"text":"in the latter scaled up by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"(Figure ","element":"span"},{"href":"#id-68","text":"14","element":"a"},{"text":"). Recall that in the latter, ","element":"span"},{"style":{"height":15.78},"width":169.72,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-7.png","element":"img","alt":"xcore ∈ Rd ","inline":true,"padRight":true},{"text":"is distributed as ","element":"span"},{"style":{"height":17.38},"width":380.48,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-8.png","element":"img","alt":" xcore|y ∼ N(y, σ2coreId)","inline":true},{"text":". Roughly speaking, all the information about ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"is contained in the mean ","element":"span"},{"style":{"height":21.06},"width":317.45,"height":52.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-9.png","element":"img","alt":"¯xcore = (1/d) Σj xcore,j","inline":true},{"text":", which is distributed as ","element":"span"},{"style":{"height":17.38},"width":260.74,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-10.png","element":"img","alt":" N(y, σ2coreId/d)","inline":true},{"text":". 
In the explicit-memorization setting, we can view ","element":"span"},{"style":{"height":13.19},"width":152.95,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-11.png","element":"img","alt":" xcore ∈ R","inline":true,"padRight":true},{"text":"as equivalent to ","element":"span"},{"style":{"height":11.99},"width":70.4,"height":29.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-12.png","element":"img","alt":" ¯xcore","inline":true,"padRight":true},{"text":"in the implicit-memorization setting (and similarly for ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-13.png","element":"img","alt":" xspu","inline":true},{"text":"), explaining the quantitative fit observed in Figure ","element":"span"},{"href":"#id-68","text":"14","element":"a"},{"text":".","element":"span"}],[{"text":"However, in the highly underparameterized regime, the RP models do poorly because of model misspecification (owing to a small number of random projections), whereas the linear models can still learn to use ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-14.png","element":"img","alt":" xcore","inline":true,"padRight":true},{"text":"and therefore do well.","element":"span"}],[{"style":{"width":"59%"},"width":1166,"height":366,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-15.png","element":"img"}],[{"id":"id-68","style":{"fontStyle":"italic"},"text":"Figure 14. 
","element":"figcaption","subtype":"caption"},{"text":"The effect of overparameterization on the worst-group test error for linear models in the explicit-memorization setting (","element":"figcaption","subtype":"caption"},{"style":{"height":17.02},"width":483.5,"height":42.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-16.png","element":"img","alt":"σ2core = 1, σ2spu = 0.01, σ2noise = 1","inline":true},{"text":") and random projection models in the implicit-memorization setting (","element":"figcaption","subtype":"caption"},{"style":{"height":17.02},"width":474.4,"height":42.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/15-17.png","element":"img","alt":"σ2core = 100, σ2spu = 1, d = 100).","inline":true,"padRight":true},{"text":"The models agree in the overparameterized regime.","element":"figcaption","subtype":"caption"}],[{"id":"id-14","style":{"fontWeight":"bold"},"text":"A.5. Experimental details","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Waterbirds and CelebA datasets. ","element":"span"},{"text":"For the CelebA dataset, we use the official train-val-test split from ","element":"span"},{"href":"#id-11","referenceIndex":23,"text":"Liu et al. ","element":"a"},{"href":"#id-11","referenceIndex":23,"text":"(","element":"a"},{"href":"#id-11","referenceIndex":23,"text":"2015","element":"a"},{"text":"), with the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Blond Hair ","element":"span"},{"text":"attribute as the target ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Male ","element":"span"},{"text":"as the spurious association ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":".","element":"span"}],[{"text":"For the Waterbirds dataset, we follow the setup in ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. 
","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"); for convenience, we reproduce some details of how it was constructed here. This dataset was obtained by combining bird images from the CUB dataset (","element":"span"},{"href":"#id-12","referenceIndex":35,"text":"Wah et al.","element":"a"},{"href":"#id-12","referenceIndex":35,"text":", ","element":"a"},{"href":"#id-12","referenceIndex":35,"text":"2011","element":"a"},{"text":") with backgrounds from the Places dataset (","element":"span"},{"href":"#id-13","referenceIndex":39,"text":"Zhou et al.","element":"a"},{"href":"#id-13","referenceIndex":39,"text":", ","element":"a"},{"href":"#id-13","referenceIndex":39,"text":"2017","element":"a"},{"text":"). The CUB dataset comes with annotations of bird species. For the Waterbirds dataset, each bird was labeled was a waterbird if it was a seabird or waterfowl in the CUB dataset; otherwise, it was labeled as a landbird. Bird images were cropped using the provided segmentation masks and placed on either a land (bamboo forest or broadleaf forest) or water (ocean or natural lake) background obtained from the Places dataset.","element":"span"}],[{"text":"For Waterbirds, we follow the same train-val-test split as in ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"). Note that in these validation and test sets, ","element":"span"},{"text":"landbirds and waterbirds are uniformly distributed on land and water backgrounds so that accuracy on the rare groups can be more accurately estimated. 
When calculating average test accuracy, we therefore first compute the average test accuracy over each group and then report a weighted average, with weights corresponding to the relative proportion of each group in the skewed training dataset.","element":"span"}],[{"text":"We post-process Waterbirds by extracting feature representations taken from the last layer of a ResNet18 model pre-trained on ImageNet. We use the PyTorch ","element":"span"},{"text":"torchvision ","element":"span"},{"text":"implementation of the ResNet18 model for this. All models on the Waterbirds dataset in our paper are logistic regression models trained on top of this (fixed) feature representation.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"ResNet. ","element":"span"},{"text":"We used a modified ResNet10 with variable widths, following the approach in ","element":"span"},{"href":"#id-0","referenceIndex":26,"text":"Nakkiran et al. ","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"(","element":"a"},{"href":"#id-0","referenceIndex":26,"text":"2019","element":"a"},{"text":") and extending the ","element":"span"},{"text":"torchvision ","element":"span"},{"text":"implementation. We trained all ResNet10 models with stochastic gradient descent with momentum of 0.9 and a batch size of 128, with the ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-0.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization parameter ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-1.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"passed in to the optimizer as the weight decay parameter. 
In the experiments in the main text, we used the default setting of ","element":"span"},{"style":{"height":13.38},"width":160.95,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-2.png","element":"img","alt":" λ = 10−4","inline":true},{"text":". We used a fixed learning rate instead of a learning rate schedule and selected the largest learning rate for which optimization was stable, following ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Sagawa et al. ","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":31,"text":"2020","element":"a"},{"text":"). This resulted in learning rates of 0.01 and 0.0001 for ","element":"span"},{"style":{"height":13.79},"width":364.02,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-3.png","element":"img","alt":" λ = 10−4 and λ = 0.1","inline":true},{"text":", respectively, across all training procedures. As in the original ResNet paper (","element":"span"},{"href":"#id-19","referenceIndex":16,"text":"He et al.","element":"a"},{"href":"#id-19","referenceIndex":16,"text":", ","element":"a"},{"href":"#id-19","referenceIndex":16,"text":"2016","element":"a"},{"text":"), we used batch normalization (","element":"span"},{"href":"#id-69","referenceIndex":20,"text":"Ioffe & ","element":"a"},{"href":"#id-69","referenceIndex":20,"text":"Szegedy","element":"a"},{"href":"#id-69","referenceIndex":20,"text":", ","element":"a"},{"href":"#id-69","referenceIndex":20,"text":"2015","element":"a"},{"text":") and no dropout (","element":"span"},{"href":"#id-70","referenceIndex":34,"text":"Srivastava et al.","element":"a"},{"href":"#id-70","referenceIndex":34,"text":", ","element":"a"},{"href":"#id-70","referenceIndex":34,"text":"2014","element":"a"},{"text":"), and for simplicity, we trained all models without data augmentation.","element":"span"}],[{"text":"We trained for 50 epochs for ERM and reweighted 
models and 500 epochs for subsampled models (due to the smaller number of examples per epoch). We found that worst-group error can be unstable across epochs due to the small sample size and relatively large learning rate, so in our results we report the error averaged over the last 10 epochs.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Logistic regression. ","element":"span"},{"text":"We used the logistic regression implementation from ","element":"span"},{"text":"scikit-learn","element":"span"},{"text":", training with the L-BFGS solver until convergence with tolerance 0.0001, and setting the regularization parameter as ","element":"span"},{"style":{"height":16},"width":203.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-4.png","element":"img","alt":" C = 1/(nλ)","inline":true},{"text":". For unregularized models, we set ","element":"span"},{"style":{"height":13.38},"width":157.14,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-5.png","element":"img","alt":" λ = 10−9 ","inline":true,"padRight":true},{"text":"for numerical stability.","element":"span"}],[{"id":"id-41","style":{"fontWeight":"bold"},"text":"A.6. Subsampling","element":"span"}],[{"text":"Formally, given a set of groups ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"and a dataset D comprising a set of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"training points with their group identities ","element":"span"},{"style":{"height":18.18},"width":285.05,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-6.png","element":"img","alt":"{(x(i), y(i), g(i))}","inline":true},{"text":", the subsampling procedure involves two steps. 
First, we group training points based on group identities:","element":"span"}],[{"style":{"width":"68%"},"width":1341,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-7.png","element":"img"}],[{"text":"For each group ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":", we select a subset D","element":"span"},{"style":{"height":19.05},"width":127.79,"height":47.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-8.png","element":"img","alt":"ssg ⊆ Dg ","inline":true,"padRight":true},{"text":"uniformly at random from D","element":"span"},{"style":{"height":7.2},"width":16,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-9.png","element":"img","alt":"g ","inline":true,"padRight":true},{"text":"such that each subset has the same number of ","element":"span"},{"text":"points as the smallest group in the training set. We form a new dataset D","element":"span"},{"style":{"height":6},"width":27.94,"height":15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-10.png","element":"img","alt":"ss ","inline":true,"padRight":true},{"text":"by combining these subsets:","element":"span"}],[{"style":{"width":"63%"},"width":1236,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-11.png","element":"img"}],[{"text":"Note that D","element":"span"},{"style":{"height":6},"width":27.95,"height":15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-12.png","element":"img","alt":"ss ","inline":true,"padRight":true},{"text":"is group-balanced, with ","element":"span"},{"style":{"height":15.99},"width":171.84,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-13.png","element":"img","alt":" pmaj = 0.5","inline":true},{"text":". 
We then train a model by minimizing the average loss on D","element":"span"},{"style":{"height":15.11},"width":39.89,"height":37.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-14.png","element":"img","alt":"ss,","inline":true}],[{"id":"id-43","style":{"width":"68%"},"width":1332,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-15.png","element":"img"}],[{"text":"Since D","element":"span"},{"style":{"height":6},"width":27.94,"height":15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-16.png","element":"img","alt":"ss","inline":true,"padRight":true},{"text":"is group-balanced, the reweighted training loss (Equation ","element":"span"},{"href":"#id-3","text":"3","element":"a"},{"text":") has the same weight on all training points and minimizing the reweighted objective on D","element":"span"},{"style":{"height":6},"width":27.94,"height":15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/16-17.png","element":"img","alt":"ss ","inline":true,"padRight":true},{"text":"is equivalent to minimizing the average loss objective above.","element":"span"}]]},{"heading":"B. Proof of Theorem 1","paragraphs":[[{"text":"Here, we detail the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"presented in Section ","element":"span"},{"text":"5","element":"span"},{"text":". We structure the proof by splitting Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"into two smaller theorems: one for the overparameterized regime (Appendix ","element":"span"},{"href":"#id-71","text":"B.2","element":"a"},{"text":"), and another for the underparameterized regime (Appendix ","element":"span"},{"href":"#id-72","text":"B.3","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.1. 
Notation and definitions.","element":"span"}],[{"text":"We denote the separate components of the weight vector ","element":"span"},{"style":{"height":18.18},"width":697.5,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-0.png","element":"img","alt":" ˆwcore ∈ R, ˆwspu ∈ R, ˆwnoise ∈ RN such that","inline":true}],[{"style":{"width":"60%"},"width":1168,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-1.png","element":"img"}],[{"text":"Further, by the representer theorem, we decompose ","element":"span"},{"style":{"height":13.19},"width":135.91,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-2.png","element":"img","alt":" ˆwnoise as","inline":true}],[{"text":"Note that ","element":"span"},{"style":{"height":18.19},"width":124.91,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-3.png","element":"img","alt":" α(i)(w)","inline":true,"padRight":true},{"text":"is equivalent to the ","element":"span"},{"style":{"height":14.19},"width":61.37,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-4.png","element":"img","alt":" α(i) ","inline":true,"padRight":true},{"text":"referred to in the main text. 
Recall that we define memorization of each training point ","element":"span"},{"style":{"height":14.19},"width":58.5,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-5.png","element":"img","alt":" x(i) ","inline":true,"padRight":true},{"text":"by the weight ","element":"span"},{"style":{"height":14.19},"width":61.37,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-6.png","element":"img","alt":" α(i) ","inline":true,"padRight":true},{"text":"as follows.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 2 ","element":"span"},{"text":"(","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-7.png","element":"img","alt":"γ","inline":true},{"text":"-memorization)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider a separator ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-8.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on training data ","element":"span"},{"style":{"height":18.33},"width":260.39,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-9.png","element":"img","alt":" {(x(i), y(i))}ni=1","inline":true},{"style":{"fontStyle":"italic"},"text":". 
For some constant ","element":"span"},{"style":{"height":14.4},"width":100.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-10.png","element":"img","alt":" γ ∈ R","inline":true},{"style":{"fontStyle":"italic"},"text":", we ","element":"span"},{"style":{"fontStyle":"italic"},"text":"say that a model ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-11.png","element":"img","alt":" γ","inline":true},{"style":{"fontStyle":"italic"},"text":"-memorizes a training point if","element":"span"}],[{"style":{"width":"57%"},"width":1121,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-12.png","element":"img"}],[{"text":"The component ","element":"span"},{"style":{"height":21.39},"width":208.75,"height":53.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-13.png","element":"img","alt":" α(i)( ˆw)x(i)noise ","inline":true,"padRight":true},{"text":"serves to “memorize” ","element":"span"},{"style":{"height":14.59},"width":202.1,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-14.png","element":"img","alt":" x(i) when N","inline":true,"padRight":true},{"text":"is sufficiently large, as it affects the prediction on ","element":"span"},{"style":{"height":14.59},"width":121.47,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-15.png","element":"img","alt":" x(i) but","inline":true,"padRight":true},{"text":"not on any other training or test points (because noise vectors are nearly orthogonal when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is large). 
In the proof, we set the constant ","element":"span"},{"style":{"height":16.98},"width":38.84,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-16.png","element":"img","alt":" γ2 ","inline":true,"padRight":true},{"text":"appropriately (based on other parameter settings in Theorem ","element":"span"},{"href":"#id-31","text":"1","element":"a"},{"text":") to get the required result.","element":"span"}],[{"text":"Finally, let ","element":"span"},{"style":{"height":15.59},"width":173.67,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-17.png","element":"img","alt":" Gmaj, Gmin","inline":true,"padRight":true},{"text":"denote the indices of training points in the majority and minority group respectively.","element":"span"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"B.2. Overparameterized regime","element":"span"}],[{"text":"In our explicit-memorization set-up, sufficiently overparameterized models provably have high worst-group error under certain settings of ","element":"span"},{"style":{"height":19.72},"width":336.23,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-18.png","element":"img","alt":" σ2spu, σ2core, nmaj, nmin ","inline":true,"padRight":true},{"text":"as stated in Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"(restated below as Theorem ","element":"span"},{"href":"#id-73","text":"2","element":"a"},{"text":").","element":"span"}],[{"id":"id-73","style":{"fontWeight":"bold"},"text":"Theorem 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":22.59},"width":814.26,"height":56.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-19.png","element":"img","alt":" pmaj ≥�1 − 12001�, σ2core ≥ 1, σ2spu ≤ 116 log 100nmaj","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":18.88},"width":209.93,"height":47.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-20.png","element":"img","alt":" σ2noise ≤ nmaj6002","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.2},"width":188.67,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-21.png","element":"img","alt":" nmin ≥ 100","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-22.png","element":"img","alt":" N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":13.19},"width":137.52,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-23.png","element":"img","alt":" N > N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(overparametrized regime), with high probability over draws of the data,","element":"span"}],[{"style":{"width":"58%"},"width":1134,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":10.99},"width":77.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-25.png","element":"img","alt":" ˆwmm ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the 
max-margin classifier.","element":"span"}],[{"text":"In Section ","element":"span"},{"text":"5","element":"span"},{"text":", we sketched key ideas in the proof by considering special families of separators: because the minimum-norm inductive bias favors less memorization, models can prefer to learn the spurious feature and memorize the minority examples (entailing high worst-group error), instead of learning the core feature and memorizing some fraction of all training points (possibly attaining reasonable worst-group error). We now provide the full proof of Theorem ","element":"span"},{"href":"#id-73","text":"2","element":"a"},{"text":", generalizing the above key concepts by considering ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"text":"separators.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Recall from Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"that we consider the maximum-margin classifier ","element":"span"},{"style":{"height":14.19},"width":151.46,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-26.png","element":"img","alt":" ˆwminnorm:","inline":true}],[{"style":{"width":"71%"},"width":1400,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-27.png","element":"img"}],[{"text":"In other words, ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-28.png","element":"img","alt":" ˆwminnorm ","inline":true,"padRight":true},{"text":"is the minimum-norm separator, where a separator is a classifier with zero training error and the required margins, satisfying ","element":"span"},{"style":{"height":18.18},"width":412.24,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-29.png","element":"img","alt":" y(i)(w · x(i)) ≥ 1 for all i","inline":true},{"text":". 
We analyze the worst-group error of the minimum-norm separator ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-30.png","element":"img","alt":" ˆwminnorm","inline":true,"padRight":true},{"text":"as outlined below:","element":"span"}],[{"text":"1. We first upper bound the fraction of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"majority ","element":"span"},{"text":"examples memorized by the minimum-norm separator ","element":"span"},{"style":{"height":13.79},"width":303.86,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-31.png","element":"img","alt":" ˆwminnorm. We show","inline":true,"padRight":true},{"text":"that there exists a separator that can use spurious features and needs to memorize only the minority points (Lemma ","element":"span"},{"href":"#id-74","text":"1","element":"a"},{"text":") for the parameter settings in Theorem ","element":"span"},{"href":"#id-73","text":"2 ","element":"a"},{"text":"where ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-32.png","element":"img","alt":" σspu","inline":true,"padRight":true},{"text":"is sufficiently small. Since the norm of a separator roughly scales with the number of points memorized (","element":"span"},{"style":{"height":18.6},"width":350.23,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-33.png","element":"img","alt":"|α(i)( ˆw)| ≥ γ2/σ2noise","inline":true},{"text":"), we have an upper bound on the number of training ","element":"span"},{"text":"points memorized by ","element":"span"},{"style":{"height":13.79},"width":138.96,"height":34.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-34.png","element":"img","alt":" ˆwminnorm","inline":true},{"text":". 
Since the number of majority points is much larger than the number of minority points, this implies that only a small fraction of majority points could be memorized by ","element":"span"},{"style":{"height":14.18},"width":150.46,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/17-35.png","element":"img","alt":" ˆwminnorm.","inline":true}],[{"text":"2. Next, we observe that since the core feature is noisy as per the parameter setting in Theorem ","element":"span"},{"href":"#id-73","text":"2","element":"a"},{"text":", a constant fraction of majority points has to be memorized if spurious features are not used. Conversely, if less than this fraction of majority points can be memorized, the separator must use spurious features. Since using spurious features leads to higher worst-group test error, this reveals a trade-off between the worst-group test error of a separator and the fraction of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"majority points ","element":"span"},{"text":"that it memorizes at training time. Succinctly, a smaller fraction memorized implies the use of spurious features, which in turn implies higher worst-group test error. Conversely, smaller worst-group test error requires eliminating the use of spurious features, which would require memorizing a large fraction of majority points in order for the classifier to be a separator. 
We formalize the above trade-off between the worst-group test error and fraction of majority examples to be memorized in Proposition ","element":"span"},{"href":"#id-75","text":"3","element":"a"},{"text":".","element":"span"}],[{"text":"Combining the two steps together, since ","element":"span"},{"style":{"height":13.79},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-0.png","element":"img","alt":" ˆwminnorm ","inline":true,"padRight":true},{"text":"memorizes only a small fraction of majority points by virtue of being the minimum norm separator, ","element":"span"},{"style":{"height":13.79},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-1.png","element":"img","alt":" ˆwminnorm ","inline":true,"padRight":true},{"text":"suffers high worst-group test error.","element":"span"}],[{"text":"We now formally prove Theorem ","element":"span"},{"href":"#id-73","text":"2","element":"a"},{"text":", invoking propositions that we prove in subsequent sections.","element":"span"}],[{"style":{"width":"77%"},"width":1506,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-2.png","element":"img"}],[{"text":"In the first part of the proof, we show that the minimum-norm separator ","element":"span"},{"style":{"height":13.79},"width":138.96,"height":34.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-3.png","element":"img","alt":" ˆwminnorm","inline":true,"padRight":true},{"text":"“memorizes” a small fraction of the majority examples. Formally, we study the quantity ","element":"span"},{"style":{"height":19.2},"width":252.91,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-4.png","element":"img","alt":" δmaj-train�ˆw, γ2�","inline":true},{"text":"defined as follows.","element":"span"}],[{"id":"id-82","style":{"fontWeight":"bold"},"text":"Definition 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider a separator ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-5.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on training data ","element":"span"},{"style":{"height":18.33},"width":260.39,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-6.png","element":"img","alt":" {(x(i), y(i))}ni=1","inline":true},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":19.2},"width":255.62,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-7.png","element":"img","alt":" δmaj-train�ˆw, γ2�","inline":true},{"style":{"fontStyle":"italic"},"text":"be the fraction of training examples that ","element":"span"},{"style":{"height":14.4},"width":61.57,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-8.png","element":"img","alt":" ˆw γ","inline":true},{"style":{"fontStyle":"italic"},"text":"-memorizes in the majority groups:","element":"span"}],[{"style":{"width":"72%"},"width":1405,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-9.png","element":"img"}],[{"text":"We provide an upper bound on ","element":"span"},{"style":{"height":19.27},"width":363.77,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-10.png","element":"img","alt":" δmaj-train�ˆwminnorm, γ2�","inline":true},{"text":"(Lemma ","element":"span"},{"href":"#id-76","text":"4","element":"a"},{"text":") by first bounding ","element":"span"},{"style":{"height":17.79},"width":180.39,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-11.png","element":"img","alt":" ∥ ˆwminnorm∥","inline":true,"padRight":true},{"text":"and then bounding 
","element":"span"},{"style":{"height":19.27},"width":363.77,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-12.png","element":"img","alt":"δmaj-train�ˆwminnorm, γ2�","inline":true},{"text":"in terms of ","element":"span"},{"style":{"height":17.78},"width":190.32,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-13.png","element":"img","alt":" ∥ ˆwminnorm∥.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Bounding ","element":"span"},{"style":{"height":17.78},"width":180.39,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-14.png","element":"img","alt":" ∥ ˆwminnorm∥","inline":true}],[{"id":"id-74","style":{"fontWeight":"bold"},"text":"Lemma 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists a separator ","element":"span"},{"style":{"height":10.98},"width":136.01,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-15.png","element":"img","alt":" wuse−spu","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"that satisfies ","element":"span"},{"style":{"height":18.98},"width":686.8,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-16.png","element":"img","alt":" y(i)(wuse−spu · x(i)) ≥ 1, ∀i ∈ Gmaj, Gmin","inline":true},{"style":{"fontStyle":"italic"},"text":". The norm of this separator gives a bound on ","element":"span"},{"style":{"height":17.78},"width":180.39,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-17.png","element":"img","alt":" ∥ ˆwminnorm∥","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"as follows. 
For the parameter settings under Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", with high probability, we have","element":"span"}],[{"style":{"width":"78%"},"width":1521,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for constants ","element":"span"},{"style":{"height":23.61},"width":360.28,"height":59.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-19.png","element":"img","alt":" u = 1.3125, s = 2.61σ2noise .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"In order to get an upper bound on ","element":"span"},{"style":{"height":17.78},"width":180.38,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-20.png","element":"img","alt":" ∥ ˆwminnorm∥","inline":true},{"text":", we compute the norm of a particular separator. 
Concretely, we consider a separator ","element":"span"},{"style":{"height":10.99},"width":136.02,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-21.png","element":"img","alt":" wuse−spu ","inline":true,"padRight":true},{"text":"of the following form:","element":"span"}],[{"style":{"width":"34%"},"width":666,"height":349,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-22.png","element":"img"}],[{"text":"First, because we are interested in ","element":"span"},{"style":{"height":10.99},"width":136.02,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-23.png","element":"img","alt":" wuse−spu ","inline":true,"padRight":true},{"text":"that does not use the core feature and relies on the spurious feature instead, we let ","element":"span"},{"style":{"height":17.54},"width":620.44,"height":43.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-24.png","element":"img","alt":"wuse−spucore = 0 and wuse−spuspu = u, u ∈ R","inline":true},{"text":". We set the value ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"appropriately so that none of the majority points are memorized (corresponding to ","element":"span"},{"style":{"height":18.97},"width":561.54,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/18-25.png","element":"img","alt":" α(i)(wuse−spu) = 0 for all i ∈ Gmaj","inline":true},{"text":"). However, since the spurious correlations are reversed in the minority ","element":"span"},{"text":"points and ","element":"span"},{"style":{"height":14.92},"width":212.1,"height":37.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-0.png","element":"img","alt":" wuse−spucore = 0","inline":true},{"text":", the minority points have to be memorized. 
For simplicity, we set ","element":"span"},{"style":{"height":18.18},"width":365.03,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-1.png","element":"img","alt":" α(i)(wuse−spu) = y(i)s","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":13.19},"width":151.67,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-2.png","element":"img","alt":"i ∈ Gmin.","inline":true}],[{"text":"Now it remains to select appropriate values of constants ","element":"span"},{"style":{"height":18.18},"width":660.33,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-3.png","element":"img","alt":" u and s such that y(i)(wuse−spu · x(i)) ≥ 1","inline":true,"padRight":true},{"text":"is satisfied for all training examples.","element":"span"}],[{"text":"For majority points, this involves setting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"large enough such that the less noisy spurious feature can be used to obtain the required margin. Without loss of generality, assume ","element":"span"},{"style":{"height":17.38},"width":132.3,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-4.png","element":"img","alt":" y(i) = 1","inline":true},{"text":". 
Formally, for ","element":"span"},{"style":{"height":15.59},"width":151.43,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-5.png","element":"img","alt":" i ∈ Gmaj,","inline":true}],[{"style":{"width":"70%"},"width":1366,"height":420,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-6.png","element":"img"}],[{"text":"The first inequality follows from the fact that ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-7.png","element":"img","alt":" σspu","inline":true,"padRight":true},{"text":"is small enough under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","text":"2 ","element":"a"},{"text":"to allow a uniform bound on ","element":"span"},{"href":"#id-77","style":{"height":21.1},"width":238.28,"height":52.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-8.png","element":"img","alt":" x(i)spu (Lemma 5","inline":true},{"text":"). The second inequality follows from setting the number of random features ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"to be large ","element":"span"},{"text":"enough so that the noise features are near orthogonal (Lemma ","element":"span"},{"href":"#id-78","text":"8","element":"a"},{"text":"). Conversely, we have","element":"span"}],[{"id":"id-79","style":{"width":"82%"},"width":1596,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-9.png","element":"img"}],[{"text":"Notice that the condition in Equation ","element":"span"},{"href":"#id-79","text":"23 ","element":"a"},{"text":"requires that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"be greater than ","element":"span"},{"text":"0","element":"span"},{"text":". 
Since the minority points have spurious attribute ","element":"span"},{"style":{"height":10},"width":125.2,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-10.png","element":"img","alt":"a = −y","inline":true},{"text":", we need to set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"to be large enough so that ","element":"span"},{"style":{"height":10.99},"width":136.02,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-11.png","element":"img","alt":" wuse−spu ","inline":true,"padRight":true},{"text":"as defined above separates the minority points. Just as before, we set ","element":"span"},{"style":{"height":14.4},"width":600.31,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-12.png","element":"img","alt":" y = 1 WLOG. For i ∈ Gmin, we have","inline":true}],[{"style":{"width":"73%"},"width":1437,"height":420,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-13.png","element":"img"}],[{"text":"The steps are similar to the condition for majority points, with the key difference that the contribution from the noise term involves ","element":"span"},{"href":"#id-80","style":{"height":21.38},"width":361.04,"height":53.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-14.png","element":"img","alt":" s∥x(i)noise∥22 (Lemma 9).","inline":true}],[{"id":"id-81","text":"Conversely, we have","element":"span"}],[{"text":"A set of parameters that satisfies both conditions above Equation ","element":"span"},{"href":"#id-81","text":"24 ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-79","text":"23 ","element":"a"},{"text":"is the following:","element":"span"}],[{"style":{"width":"22%"},"width":441,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-15.png","element":"img"}],[{"text":"We use the fact that 
","element":"span"},{"style":{"height":16},"width":207.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-16.png","element":"img","alt":" c1 < 1/2000","inline":true,"padRight":true},{"text":"(From Lemma ","element":"span"},{"href":"#id-80","text":"9","element":"a"},{"text":").","element":"span"}],[{"text":"Finally, we have w.h.p,","element":"span"}],[{"style":{"width":"100%"},"width":940,"height":149,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/19-17.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Bounding ","element":"span"},{"style":{"height":19.2},"width":265.3,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-0.png","element":"img","alt":" δmaj-train�ˆw, γ2�","inline":true},{"style":{"fontWeight":"bold"},"text":"in terms of ","element":"span"},{"style":{"height":16},"width":69.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-1.png","element":"img","alt":" ∥ ˆw∥","inline":true}],[{"id":"id-83","style":{"fontWeight":"bold"},"text":"Lemma 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For a separator ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-2.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with bounded ","element":"span"},{"style":{"height":23.92},"width":269.25,"height":59.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-3.png","element":"img","alt":" α(i)( ˆw)2 ≤ 10nσ2noise","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, n","element":"span"},{"style":{"fontStyle":"italic"},"text":", its norm can be bounded with high ","element":"span"},{"style":{"fontStyle":"italic"},"text":"probability as","element":"span"}],[{"style":{"width":"71%"},"width":1389,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The result follows from bounded norms (Lemma ","element":"span"},{"href":"#id-80","text":"9","element":"a"},{"text":"), bounded dot products (Lemma ","element":"span"},{"href":"#id-78","text":"8","element":"a"},{"text":"), and the definition of ","element":"span"},{"style":{"height":19.2},"width":252.91,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-5.png","element":"img","alt":" δmaj-train(ˆw, γ2)","inline":true}],[{"text":"(Definition ","element":"span"},{"href":"#id-82","text":"3","element":"a"},{"text":").","element":"span"}],[{"text":"We now apply Lemma ","element":"span"},{"href":"#id-74","text":"1 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-83","text":"2 ","element":"a"},{"text":"in order to bound ","element":"span"},{"style":{"height":19.27},"width":363.77,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-6.png","element":"img","alt":" δmaj-train(ˆwminnorm, γ2)","inline":true},{"text":", showing that the fraction of majority points that are memorized is small for an appropriate choice of ","element":"span"},{"style":{"height":10.4},"width":32.84,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-7.png","element":"img","alt":" γ.","inline":true}],[{"text":"To invoke Lemma 
","element":"span"},{"href":"#id-83","text":"2","element":"a"},{"text":", we first show that the coefficient ","element":"span"},{"style":{"height":18.19},"width":235.78,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-8.png","element":"img","alt":" α(i)( ˆwminnorm)","inline":true,"padRight":true},{"text":"is bounded above with high probability.","element":"span"}],[{"id":"id-84","style":{"fontWeight":"bold"},"text":"Lemma 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", with high probability, ","element":"span"},{"style":{"height":18.19},"width":235.78,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-9.png","element":"img","alt":" α(i)( ˆwminnorm)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is bounded above for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"as","element":"span"}],[{"style":{"width":"99%"},"width":1942,"height":734,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/20-10.png","element":"img"}],[{"text":"From the upper bound on ","element":"span"},{"href":"#id-74","style":{"height":17.78},"width":536.83,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-0.png","element":"img","alt":" ∥ ˆwminnorm∥22 (Lemma 1), we have","inline":true}],[{"text":"Since ","element":"span"},{"style":{"height":17.8},"width":1344.46,"height":44.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-1.png","element":"img","alt":" c1 < 1/2000, and n ≥ 2000, setting u = 1.3125, sσ2noise = 2.61, we get M 2 ≤ 10n.","inline":true}],[{"text":"Now, we are ready to show that ","element":"span"},{"style":{"height":19.27},"width":506.9,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-2.png","element":"img","alt":" δmaj-train�ˆwminnorm, γ2�is small.","inline":true}],[{"id":"id-76","style":{"fontWeight":"bold"},"text":"Lemma 4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", the following is true with high probability.","element":"span"}],[{"style":{"width":"64%"},"width":1250,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Applying Lemma ","element":"span"},{"href":"#id-83","text":"2 ","element":"a"},{"text":"to ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-4.png","element":"img","alt":" ˆwminnorm ","inline":true,"padRight":true},{"text":"by invoking the bounds on ","element":"span"},{"href":"#id-84","style":{"height":18.18},"width":433.41,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-5.png","element":"img","alt":" α(i)( ˆwminnorm) (Lemma 3),","inline":true}],[{"style":{"width":"76%"},"width":1498,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-6.png","element":"img"}],[{"text":"with high probability. Putting this together with Lemma ","element":"span"},{"href":"#id-74","text":"1","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"74%"},"width":1459,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-7.png","element":"img"}],[{"style":{"height":39.89},"width":755.86,"height":99.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-8.png","element":"img","alt":"=⇒ δmaj-train�ˆwminnorm, γ2�≤ u2σ2noiseγ4nmaj(1 − c1)","inline":true},{"style":{"height":19.45},"width":258.68,"height":48.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-9.png","element":"img","alt":"� �� Very small","inline":true}],[{"text":"where in the last step we substitute the constants ","element":"span"},{"style":{"height":18.18},"width":1039.4,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-10.png","element":"img","alt":" γ2 = 9/10, u = 1.3125, sσ2noise = 2.61, nmaj/nmin ≤ 1/2000","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":18.17},"width":360.69,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-11.png","element":"img","alt":"σ2noise ≤ nmaj/360000.","inline":true}],[{"style":{"width":"71%"},"width":667,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-12.png","element":"img"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"Lemma 5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability ","element":"span"},{"style":{"height":22.77},"width":558.88,"height":56.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-13.png","element":"img","alt":" > 1 − 1/100, if σspu ≤ 14√log 100n,","inline":true}],[{"style":{"width":"32%"},"width":303,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-14.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the spurious attribute.","element":"span"}],[{"text":"This follows from standard subgaussian concentration and union bound over ","element":"span"},{"style":{"height":15.59},"width":386.56,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-15.png","element":"img","alt":" n = nmaj + nmin points.","inline":true}],[{"id":"id-85","style":{"fontWeight":"bold"},"text":"Lemma 6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For a vector ","element":"span"},{"style":{"height":17.38},"width":542.26,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-16.png","element":"img","alt":" z ∈ RN such that z ∈ N(0, σ2I),","inline":true}],[{"id":"id-86","style":{"fontWeight":"bold"},"text":"Lemma 7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For two vectors ","element":"span"},{"style":{"height":18.17},"width":665.28,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-17.png","element":"img","alt":" zi, zj ∈ RN such that zi, zj ∼ N(0, σ2I)","inline":true},{"style":{"fontStyle":"italic"},"text":", by Hoeffding’s inequality, we have","element":"span"}],[{"style":{"width":"66%"},"width":1298,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/21-18.png","element":"img"}],[{"id":"id-87","style":{"fontWeight":"bold"},"text":"Corollary 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Combining Lemma ","element":"span"},{"href":"#id-85","style":{"fontStyle":"italic"},"text":"6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and Lemma ","element":"span"},{"href":"#id-86","style":{"fontStyle":"italic"},"text":"7","element":"a"},{"style":{"fontStyle":"italic"},"text":", we get","element":"span"}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"Lemma 8. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":16},"width":273.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-0.png","element":"img","alt":" N = Ω(poly(n))","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability greater than ","element":"span"},{"style":{"height":16},"width":199.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-1.png","element":"img","alt":" 1 − 1/2000,","inline":true}],[{"style":{"width":"65%"},"width":1274,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-2.png","element":"img"}],[{"text":"This follows from Corollary ","element":"span"},{"href":"#id-87","text":"1 ","element":"a"},{"text":"and union bound over ","element":"span"},{"style":{"height":13.38},"width":39.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-3.png","element":"img","alt":" n2 ","inline":true,"padRight":true},{"text":"pairs of training points.","element":"span"}],[{"id":"id-80","style":{"fontWeight":"bold"},"text":"Lemma 9. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":16},"width":273.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-4.png","element":"img","alt":" N = Ω(poly(n))","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability greater than ","element":"span"},{"style":{"height":16},"width":199.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-5.png","element":"img","alt":" 1 − 1/2000,","inline":true}],[{"style":{"width":"66%"},"width":1301,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"This follows from Lemma ","element":"span"},{"href":"#id-85","style":{"fontStyle":"italic"},"text":"6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and union bound over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"training points. 
In particular, we can set ","element":"span"},{"style":{"height":16},"width":207.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-7.png","element":"img","alt":" c1 < 1/2000","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for large enough ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"65%"},"width":1274,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-8.png","element":"img"}],[{"text":"In the previous section, we proved that ","element":"span"},{"style":{"height":19.27},"width":363.77,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-9.png","element":"img","alt":" δmaj-train(ˆwminnorm, γ2)","inline":true},{"text":", the fraction of majority training samples that can have a coefficient on the noise vectors greater than ","element":"span"},{"style":{"height":17.8},"width":144.99,"height":44.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-10.png","element":"img","alt":" γ2/σ2noise ","inline":true,"padRight":true},{"text":"in the max-margin separator ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-11.png","element":"img","alt":" ˆwminnorm","inline":true,"padRight":true},{"text":"is bounded for a suitable value of ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-12.png","element":"img","alt":"γ","inline":true},{"text":". 
We showed this using the fact that the norm of ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-13.png","element":"img","alt":" ˆwminnorm ","inline":true,"padRight":true},{"text":"is the smallest among all separators and the observation that the squared norm of a separator roughly scales proportionally to the number of training points that have a large coefficient along the noise vectors.","element":"span"}],[{"text":"What does a small ","element":"span"},{"style":{"height":19.27},"width":363.77,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-14.png","element":"img","alt":" δmaj-train(ˆwminnorm, γ2)","inline":true},{"text":"imply? We now show that the bound on ","element":"span"},{"style":{"height":19.27},"width":363.77,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-15.png","element":"img","alt":" δmaj-train(ˆwminnorm, γ2)","inline":true},{"text":"has an important consequence for the worst-group error Err","element":"span"},{"style":{"height":18.57},"width":209.43,"height":46.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-16.png","element":"img","alt":"wg( ˆwminnorm)","inline":true},{"text":"; a low ","element":"span"},{"style":{"height":19.27},"width":345.9,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-17.png","element":"img","alt":" δmaj-train(ˆwminnorm, γ)","inline":true},{"text":"would imply high worst-group error Err","element":"span"},{"style":{"height":18.57},"width":209.43,"height":46.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-18.png","element":"img","alt":"wg( ˆwminnorm)","inline":true},{"text":". 
We show that there is a trade-off between the worst-group test error of a separator and the fraction of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"majority points ","element":"span"},{"text":"that it “memorizes” at training time. A model that has low worst-group test error must use the core feature and not the spurious feature, and to obtain zero training error such a model would memorize a potentially large fraction of majority and minority points. In contrast, if the model instead uses only the spurious feature, then the worst-group test error would be high, but it would memorize only a small fraction of majority examples at training time; because we assume that the spurious feature is much less noisy than the core feature (","element":"span"},{"style":{"height":13.99},"width":201.66,"height":34.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-19.png","element":"img","alt":"σcore ≫ σspu","inline":true},{"text":"), far fewer majority examples would need to be memorized. To summarize, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a large ","element":"span"},{"style":{"height":15.59},"width":70.2,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-20.png","element":"img","alt":" ˆwspu","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"would require a smaller fraction of majority points to be memorized ","element":"span"},{"style":{"height":19.2},"width":255.62,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-21.png","element":"img","alt":" δmaj-train(ˆw, γ2)","inline":true}],[{"style":{"fontStyle":"italic"},"text":"but increase the worst-group test error Err","element":"span"},{"style":{"height":16.79},"width":98.57,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-22.png","element":"img","alt":"wg( ˆw)","inline":true},{"text":". 
We formalize the above trade-off between the worst-group error and fraction of majority examples to be memorized in Proposition ","element":"span"},{"href":"#id-75","text":"3","element":"a"},{"text":".","element":"span"}],[{"id":"id-75","style":{"fontWeight":"bold"},"text":"Proposition 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For the minimum norm separator ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-23.png","element":"img","alt":" ˆwminnorm","inline":true},{"style":{"fontStyle":"italic"},"text":", under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", with high probability,","element":"span"}],[{"style":{"width":"77%"},"width":1501,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants ","element":"span"},{"style":{"height":16},"width":369.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-25.png","element":"img","alt":" c3, c4 < 1/1000 and Φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the Gaussian CDF.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"For any separator ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-26.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"that spans the training points and satisfies","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", with high 
probability,","element":"span"}],[{"style":{"width":"77%"},"width":1511,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-27.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants ","element":"span"},{"style":{"height":16},"width":594.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/22-28.png","element":"img","alt":" c1 < 1/2000; c5, c6 < 1/1000 and Φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the Gaussian CDF.","element":"span"}],[{"text":"We prove Proposition ","element":"span"},{"href":"#id-75","text":"3 ","element":"a"},{"text":"in Section ","element":"span"},{"href":"#id-88","text":"B.2.5","element":"a"},{"text":".","element":"span"}],[{"text":"As mentioned before, we see that the spurious component weight ","element":"span"},{"style":{"height":20.12},"width":138.96,"height":50.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-0.png","element":"img","alt":" ˆwminnormspu","inline":true,"padRight":true},{"text":"has opposite effects on the two quantities: Err","element":"span"},{"style":{"height":16.79},"width":98.57,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-1.png","element":"img","alt":"wg( ˆw)","inline":true,"padRight":true},{"text":"increases with an increase in ","element":"span"},{"style":{"height":16.79},"width":379.46,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-2.png","element":"img","alt":" ˆwspu, but δmaj-train ( ˆw, γ)","inline":true,"padRight":true},{"text":"decreases with an increase in ","element":"span"},{"style":{"height":15.59},"width":70.19,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-3.png","element":"img","alt":" ˆwspu","inline":true},{"text":". 
This dependence can be exploited to relate the two quantities to each other as follows.","element":"span"}],[{"id":"id-89","style":{"width":"95%"},"width":1863,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-4.png","element":"img"}],[{"text":"In other words, if ","element":"span"},{"style":{"height":19.27},"width":345.9,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-5.png","element":"img","alt":" δmaj-train(ˆwminnorm, γ)","inline":true},{"text":"is low, then Err","element":"span"},{"style":{"height":18.58},"width":209.43,"height":46.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-6.png","element":"img","alt":"wg( ˆwminnorm)","inline":true,"padRight":true},{"text":"would need to be high.","element":"span"}],[{"style":{"width":"67%"},"width":638,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-7.png","element":"img"}],[{"text":"Recall from part 1 that ","element":"span"},{"style":{"height":19.27},"width":499.05,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-8.png","element":"img","alt":" δmaj-train(ˆwminnorm, γ) < 1/200","inline":true,"padRight":true},{"text":"for an appropriate choice of ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-9.png","element":"img","alt":" γ","inline":true},{"text":", and from part 2 the trade-off between ","element":"span"},{"style":{"height":19.27},"width":683.46,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-10.png","element":"img","alt":"δmaj-train(ˆwminnorm, γ) and Errwg( ˆwminnorm)","inline":true,"padRight":true},{"text":"(Equation (","element":"span"},{"href":"#id-89","text":"50","element":"a"},{"text":")). 
As a final step, we need to bound the quantities on the RHS of Equation (","element":"span"},{"href":"#id-89","text":"50","element":"a"},{"text":"). All the constants are small, and ","element":"span"},{"href":"#id-76","style":{"height":19.27},"width":920.75,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-11.png","element":"img","alt":" γ2 = 9/10, δmaj-train(ˆwminnorm, 9/10) ≤ 1/200 (Lemma 4","inline":true},{"text":"), which allows us to write","element":"span"}],[{"style":{"width":"86%"},"width":1687,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-12.png","element":"img"}],[{"text":"We have hence proved that the minimum-norm separator ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-13.png","element":"img","alt":" ˆwminnorm ","inline":true,"padRight":true},{"text":"incurs high worst-group error with high probability under the specified conditions.","element":"span"}],[{"id":"id-88","style":{"width":"59%"},"width":558,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-14.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For the minimum norm separator ","element":"span"},{"style":{"height":13.79},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-15.png","element":"img","alt":" ˆwminnorm","inline":true},{"style":{"fontStyle":"italic"},"text":", under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", with high probability,","element":"span"}],[{"style":{"width":"77%"},"width":1501,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-16.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants ","element":"span"},{"style":{"height":16},"width":369.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-17.png","element":"img","alt":" c3, c4 < 1/1000 and Φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the Gaussian CDF.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"For any separator ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-18.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"that spans the training points and satisfies","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", with high probability,","element":"span"}],[{"style":{"width":"77%"},"width":1511,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants 
","element":"span"},{"style":{"height":16},"width":594.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-20.png","element":"img","alt":" c1 < 1/2000; c5, c6 < 1/1000 and Φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the Gaussian CDF.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We derive the two bounds below.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Worst-group test error","element":"span"}],[{"text":"We bound the expected worst-group error Err","element":"span"},{"style":{"height":18.58},"width":209.43,"height":46.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-21.png","element":"img","alt":"wg( ˆwminnorm)","inline":true},{"text":", which is the expected worst-group loss over the data distribution. Below, we lower bound the worst-group error Err","element":"span"},{"style":{"height":18.57},"width":209.43,"height":46.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/23-22.png","element":"img","alt":"wg( ˆwminnorm)","inline":true,"padRight":true},{"text":"by bounding the error on a particular group: minority positive points which have label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"and spurious attribute ","element":"span"},{"style":{"height":10.8},"width":125.19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-0.png","element":"img","alt":" a = −1","inline":true},{"text":". The test error is the probability that a test example ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"from this group gets misclassified, i.e. 
","element":"span"},{"style":{"height":14.59},"width":275.08,"height":36.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-1.png","element":"img","alt":" ˆwminnorm · x < 0.","inline":true}],[{"style":{"width":"89%"},"width":1739,"height":206,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-2.png","element":"img"}],[{"text":"In the last step, we rewrite for convenience ","element":"span"},{"style":{"height":17.19},"width":1087.96,"height":42.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-3.png","element":"img","alt":" xcore = y + σcorez1 and xspu = a + σspuz2, where z1, z2 ∼ N(0, 1).","inline":true}],[{"text":"We use the properties of high-dimensional Gaussian random vectors to bound the quantity ","element":"span"},{"style":{"height":18.2},"width":255.12,"height":45.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-4.png","element":"img","alt":" ˆwminnormnoise · xnoise","inline":true},{"text":". 
Recall that ","element":"span"},{"style":{"height":18.2},"width":138.96,"height":45.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-5.png","element":"img","alt":"ˆwminnormnoise","inline":true,"padRight":true},{"text":"can be written as","element":"span"}],[{"style":{"width":"67%"},"width":1317,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-6.png","element":"img"}],[{"text":"From Lemma ","element":"span"},{"href":"#id-84","text":"3","element":"a"},{"text":", we know that ","element":"span"},{"text":"max","element":"span"}],[{"text":"probability ","element":"span"},{"style":{"height":13.19},"width":101.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-7.png","element":"img","alt":" 1 − c4","inline":true,"padRight":true},{"text":"for some small constants ","element":"span"},{"style":{"height":16},"width":376.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-8.png","element":"img","alt":" c3, c4 < 1/1000. Let B","inline":true,"padRight":true},{"text":"denote this high-probability event, in which the dot product satisfies ","element":"span"},{"style":{"height":18.2},"width":363.78,"height":45.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-9.png","element":"img","alt":" |xnoise · ˆwminnormnoise | ≤ c3","inline":true},{"text":". 
Using the fact that ","element":"span"},{"style":{"height":16},"width":451.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-10.png","element":"img","alt":" P(A) ≥ P(A | B) − P(¬B)","inline":true,"padRight":true},{"text":"which follows from simple algebra, we have","element":"span"}],[{"style":{"width":"89%"},"width":1739,"height":403,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-11.png","element":"img"}],[{"text":"From the expression above, we see that Err","element":"span"},{"style":{"height":18.58},"width":209.43,"height":46.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-12.png","element":"img","alt":"wg( ˆwminnorm)","inline":true,"padRight":true},{"text":"increases as the spurious component ","element":"span"},{"style":{"height":20.12},"width":138.96,"height":50.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-13.png","element":"img","alt":" ˆwminnormspu","inline":true,"padRight":true},{"text":"increases. 
This is because in the minority group, the spurious feature is negatively correlated with the label.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Fraction of memorized training examples in majority groups","element":"span"}],[{"text":"We now compute a lower bound on ","element":"span"},{"style":{"height":19.27},"width":363.77,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-14.png","element":"img","alt":" δmaj-train(ˆwminnorm, γ2)","inline":true},{"text":", which is the fraction of majority points (where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":") that are “memorized.” Intuitively, we want to show that the fraction depends on ","element":"span"},{"style":{"height":15.59},"width":201.05,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-15.png","element":"img","alt":" ˆwspu − ˆwcore","inline":true},{"text":". The more the core feature is used relative to the spurious feature, the larger the fraction of points that need to be memorized, because the core feature is noisier.","element":"span"}],[{"text":"First, consider a separator ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-16.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"text":"with some core and spurious components ","element":"span"},{"style":{"height":15.59},"width":227.35,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-17.png","element":"img","alt":" ˆwcore and ˆwspu","inline":true},{"text":". 
Recall that ","element":"span"},{"style":{"height":30.7},"width":402.84,"height":76.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-18.png","element":"img","alt":" ˆwnoise = Σi α(i)( ˆw)x(i)noise","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.18},"width":282.45,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-19.png","element":"img","alt":" y(i)( ˆw · x(i)) ≥ 1","inline":true,"padRight":true},{"text":"by the definition of separators. For a given ","element":"span"},{"style":{"height":15.99},"width":227.86,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-20.png","element":"img","alt":" ˆwcore and ˆwspu","inline":true},{"text":", we want to bound the fraction of majority points (","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":") which can have ","element":"span"},{"style":{"height":27.26},"width":244.7,"height":68.15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-21.png","element":"img","alt":" α(i)( ˆw) < γ2σ2noise","inline":true,"padRight":true},{"text":". We focus only on separators with bounded memorization, i.e. those that ","element":"span"},{"text":"satisfy ","element":"span"},{"style":{"height":27.55},"width":262.36,"height":68.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-22.png","element":"img","alt":" α(i)( ˆw)2 ≤ 10nσ4noise ","inline":true,"padRight":true},{"text":". 
Note that by Lemma ","element":"span"},{"href":"#id-84","text":"3","element":"a"},{"text":", with high probability, the minimum-norm separator ","element":"span"},{"style":{"height":13.78},"width":138.96,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-23.png","element":"img","alt":" ˆwminnorm","inline":true,"padRight":true},{"text":"satisfies this condition.","element":"span"}],[{"text":"We bound the above fraction by bounding a related quantity: the fraction of points that are memorized in expectation under the training distribution. We then use concentration to relate it to the fraction memorized in the training set.","element":"span"}],[{"text":"Formally, we have fixed quantities ","element":"span"},{"style":{"height":15.99},"width":227.72,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-24.png","element":"img","alt":" ˆwcore and ˆwspu","inline":true},{"text":". The training set is generated as per the usual data-generating distribution. As before, we are interested in separators on the training set. For any majority training point, the coefficient ","element":"span"},{"style":{"height":18.18},"width":124.91,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-25.png","element":"img","alt":" α(i)( ˆw)","inline":true,"padRight":true},{"text":"in a separator is a random variable. 
Since training point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is separated, we have","element":"span"}],[{"style":{"width":"61%"},"width":1188,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-26.png","element":"img"}],[{"text":"From Lemma ","element":"span"},{"href":"#id-78","text":"8","element":"a"},{"text":", Lemma ","element":"span"},{"href":"#id-85","text":"6","element":"a"},{"text":", and the condition on ","element":"span"},{"style":{"height":18.18},"width":124.92,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-27.png","element":"img","alt":" α(i)( ˆw)","inline":true},{"text":", this implies with high probability that","element":"span"}],[{"style":{"width":"60%"},"width":1176,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/24-28.png","element":"img"}],[{"text":"for some constant ","element":"span"},{"style":{"height":16},"width":208.27,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-0.png","element":"img","alt":" c5 < 1/1000","inline":true},{"text":". Conditioning on the high probability event just as before (","element":"span"},{"style":{"height":16},"width":453.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-1.png","element":"img","alt":"P(A) ≤ P(A | B) + P(¬B)","inline":true},{"text":"), we get","element":"span"}],[{"style":{"width":"94%"},"width":1840,"height":409,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-2.png","element":"img"}],[{"text":"for some ","element":"span"},{"style":{"height":16},"width":191.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-3.png","element":"img","alt":" δ < 1/2000","inline":true},{"text":". 
Finally, we connect to ","element":"span"},{"style":{"height":18.18},"width":268.06,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-4.png","element":"img","alt":" δmaj-train ( ˆw) (γ2)","inline":true,"padRight":true},{"text":"which is the finite sample version of the quantity ","element":"span"},{"style":{"height":18.19},"width":206.33,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-5.png","element":"img","alt":" P(α(i)( ˆw) ≤","inline":true}],[{"style":{"height":12.81},"width":32.45,"height":32.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-6.png","element":"img","alt":"γ2","inline":true},{"style":{"height":21.74},"width":85.33,"height":54.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-7.png","element":"img","alt":"σ2noise )","inline":true},{"text":". By DKW, we know that the empirical CDF converges to the population CDF. Under the conditions of Theorem ","element":"span"},{"href":"#id-73","text":"2","element":"a"},{"text":", ","element":"span"},{"text":"which lower bounds the number of majority elements, we have with high probability,","element":"span"}],[{"style":{"width":"77%"},"width":1517,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-8.png","element":"img"}],[{"text":"for constants ","element":"span"},{"style":{"height":16},"width":270.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-9.png","element":"img","alt":" c5, c6 < 1/1000.","inline":true}],[{"id":"id-36","style":{"fontWeight":"bold"},"text":"Proposition 1 ","element":"span"},{"text":"(Norm of models using the spurious feature)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"When ","element":"span"},{"style":{"height":19.72},"width":157.37,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-10.png","element":"img","alt":" σ2core, σ2spu","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfy the conditions in Theorem ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", there ","element":"span"},{"style":{"fontStyle":"italic"},"text":"exists ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-11.png","element":"img","alt":" N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":13.19},"width":137.52,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-12.png","element":"img","alt":" N > N0","inline":true},{"style":{"fontStyle":"italic"},"text":", with high probability, there exists a separator ","element":"span"},{"style":{"height":12},"width":493.67,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-13.png","element":"img","alt":" wuse−spu ∈ Wuse−spu such that","inline":true}],[{"style":{"width":"27%"},"width":526,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-14.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants ","element":"span"},{"style":{"height":14.4},"width":177.79,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-15.png","element":"img","alt":" γ1, γ2 > 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"The proposition follows directly from Lemma ","element":"span"},{"href":"#id-74","text":"1","element":"a"},{"text":".","element":"span"}],[{"id":"id-37","style":{"fontWeight":"bold"},"text":"Proposition 2 ","element":"span"},{"text":"(Norm of models using the core feature)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"When ","element":"span"},{"style":{"height":19.72},"width":157.37,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-16.png","element":"img","alt":" σ2core, σ2spu ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfy the conditions in Theorem ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.2},"width":195.73,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-17.png","element":"img","alt":" nmin ≥ 100,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"there exists ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-18.png","element":"img","alt":" N0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":13.19},"width":137.51,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-19.png","element":"img","alt":" N > N0","inline":true},{"style":{"fontStyle":"italic"},"text":", with high probability, all separators ","element":"span"},{"style":{"height":14.18},"width":468.48,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-20.png","element":"img","alt":" wuse−core ∈ Wuse−core 
satisfy","inline":true}],[{"style":{"width":"18%"},"width":358,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constant ","element":"span"},{"style":{"height":14.4},"width":121.57,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/25-22.png","element":"img","alt":" γ3 > 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"To bound the norm for all ","element":"span"},{"style":{"height":11.78},"width":357.48,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-0.png","element":"img","alt":" wuse−core ∈ Wuse−core","inline":true},{"text":", we provide a lower bound on the norm of the minimum-norm separator in the set ","element":"span"},{"style":{"height":11.79},"width":170.84,"height":29.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-1.png","element":"img","alt":" Wuse−core:","inline":true}],[{"style":{"width":"63%"},"width":1243,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-2.png","element":"img"}],[{"text":"We bound the ","element":"span"},{"style":{"height":16},"width":186.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-3.png","element":"img","alt":" ∥ ¯wuse−core∥","inline":true,"padRight":true},{"text":"in two steps:","element":"span"}],[{"text":"1. 
We first provide a lower bound for ","element":"span"},{"style":{"height":16},"width":186.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-4.png","element":"img","alt":" ∥ ¯wuse−core∥","inline":true,"padRight":true},{"text":"in terms of the fraction of training points memorized ","element":"span"},{"style":{"height":19.2},"width":318.98,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-5.png","element":"img","alt":" δtrain(¯wuse−core, γ2)","inline":true,"padRight":true},{"text":"(defined formally below) in Corollary ","element":"span"},{"href":"#id-90","text":"2","element":"a"},{"text":".","element":"span"}],[{"text":"2. We then provide a lower bound for ","element":"span"},{"style":{"height":19.2},"width":318.98,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-6.png","element":"img","alt":" δtrain(¯wuse−core, γ2)","inline":true},{"text":"in Corollary ","element":"span"},{"href":"#id-91","text":"3","element":"a"},{"text":".","element":"span"}],[{"text":"We first formally define ","element":"span"},{"style":{"height":19.2},"width":212.04,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-7.png","element":"img","alt":" δtrain(ˆw, γ2).","inline":true}],[{"id":"id-92","style":{"fontWeight":"bold"},"text":"Definition 4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For a separator ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-8.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on training data ","element":"span"},{"style":{"height":18.33},"width":260.39,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-9.png","element":"img","alt":" {(x(i), y(i))}ni=1","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":19.2},"width":204.48,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-10.png","element":"img","alt":" δtrain�ˆw, γ2�","inline":true},{"style":{"fontStyle":"italic"},"text":"be the fraction of training examples that ","element":"span"},{"style":{"height":14.4},"width":61.57,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-11.png","element":"img","alt":" ˆw γ","inline":true},{"style":{"fontStyle":"italic"},"text":"-memorizes:","element":"span"}],[{"style":{"width":"99%"},"width":1943,"height":247,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-12.png","element":"img"}],[{"id":"id-93","style":{"fontWeight":"bold"},"text":"Lemma 10. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For a separator ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-13.png","element":"img","alt":" ˆw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with bounded ","element":"span"},{"style":{"height":23.92},"width":266.34,"height":59.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-14.png","element":"img","alt":" α(i)( ˆw)2 ≤ 10nσ2noise","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , n","element":"span"},{"style":{"fontStyle":"italic"},"text":", its norm can be bounded with high ","element":"span"},{"style":{"fontStyle":"italic"},"text":"probability as","element":"span"}],[{"style":{"width":"68%"},"width":1339,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-15.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Similarly to the proof of Lemma ","element":"span"},{"href":"#id-83","text":"2","element":"a"},{"text":", the result follows bounded norms (Lemma ","element":"span"},{"href":"#id-80","text":"9","element":"a"},{"text":"), bounded dot products (Lemma ","element":"span"},{"href":"#id-78","text":"8","element":"a"},{"text":"), and the definition of ","element":"span"},{"href":"#id-92","style":{"height":19.2},"width":440.07,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-16.png","element":"img","alt":" δtrain�ˆw, γ2�(Definition 4).","inline":true}],[{"style":{"width":"80%"},"width":1563,"height":483,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-17.png","element":"img"}],[{"id":"id-90","style":{"fontWeight":"bold"},"text":"Corollary 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With high probability,","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The result follows from applying Lemma ","element":"span"},{"href":"#id-93","text":"10 ","element":"a"},{"text":"to ","element":"span"},{"style":{"height":10.98},"width":144.62,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-18.png","element":"img","alt":" ¯wuse−core","inline":true},{"text":", invoking the bounds on any individual component ","element":"span"},{"style":{"height":18.18},"width":242.12,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-19.png","element":"img","alt":"α(i)( ¯wuse−core)","inline":true,"padRight":true},{"text":"obtained below in Lemma ","element":"span"},{"href":"#id-94","text":"11","element":"a"},{"text":".","element":"span"}],[{"text":"Below, we bound ","element":"span"},{"style":{"height":18.19},"width":242.11,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-20.png","element":"img","alt":" α(i)( ¯wuse−core)","inline":true},{"text":", 
where ","element":"span"},{"style":{"height":18.19},"width":242.12,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-21.png","element":"img","alt":" α(i)( ¯wuse−core)","inline":true,"padRight":true},{"text":"is the component of training point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"to the classifier ","element":"span"},{"style":{"height":10.99},"width":144.61,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/26-22.png","element":"img","alt":" ¯wuse−core","inline":true,"padRight":true},{"text":"via the representer theorem.","element":"span"}],[{"id":"id-94","style":{"fontWeight":"bold"},"text":"Lemma 11. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With high probability, ","element":"span"},{"style":{"height":18.18},"width":461.3,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-0.png","element":"img","alt":" i = 1, . . . , n, α(i)( ¯wuse−core)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"can be bounded as follows.","element":"span"}],[{"style":{"width":"60%"},"width":1179,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"As a first step, we upper bound the norm of ","element":"span"},{"style":{"height":10.99},"width":144.61,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-2.png","element":"img","alt":" ¯wuse−core ","inline":true,"padRight":true},{"text":"by the norm of another separator ","element":"span"},{"style":{"height":14.59},"width":460.14,"height":36.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-3.png","element":"img","alt":" wuse−core ∈ Wuse−core, using","inline":true,"padRight":true},{"text":"the fact that ","element":"span"},{"style":{"height":10.99},"width":144.62,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-4.png","element":"img","alt":" ¯wuse−core","inline":true,"padRight":true},{"text":"is the minimum-norm separator in ","element":"span"},{"style":{"height":11.79},"width":157.65,"height":29.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-5.png","element":"img","alt":" Wuse−core","inline":true},{"text":". 
In particular, we construct a separator ","element":"span"},{"style":{"height":11.79},"width":187.95,"height":29.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-6.png","element":"img","alt":" wuse−core ∈","inline":true},{"style":{"height":11.79},"width":157.65,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-7.png","element":"img","alt":"Wuse−core ","inline":true,"padRight":true},{"text":"that “memorizes” all training points, of the following form:","element":"span"}],[{"style":{"width":"36%"},"width":702,"height":172,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-8.png","element":"img"}],[{"text":"This is analogous to the construction of ","element":"span"},{"style":{"height":11.78},"width":336.14,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-9.png","element":"img","alt":" wuse−spu ∈ Wuse−spu","inline":true,"padRight":true},{"text":"(Lemma ","element":"span"},{"href":"#id-74","text":"1","element":"a"},{"text":"), and similar calculations can be used to obtain a suitable value ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-10.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"to ensure that ","element":"span"},{"style":{"height":10.98},"width":144.61,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-11.png","element":"img","alt":" wuse−core","inline":true,"padRight":true},{"text":"is a separator with high probability. We provide it below for completeness. We show that the following condition is sufficient to satisfy the margin constraints ","element":"span"},{"style":{"height":17.38},"width":684.59,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-12.png","element":"img","alt":" y(i)wuse−core · x(i) ≥ 1 for all i = 1, . . . 
, n","inline":true,"padRight":true},{"text":"with high probability:","element":"span"}],[{"style":{"width":"60%"},"width":1179,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-13.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":16},"width":207.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-14.png","element":"img","alt":" c1 < 1/2000","inline":true},{"text":". We obtain the above condition by applying Lemma ","element":"span"},{"href":"#id-78","text":"8 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-80","text":"9 ","element":"a"},{"text":"to the margin condition.","element":"span"}],[{"style":{"width":"73%"},"width":1427,"height":275,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-15.png","element":"img"}],[{"text":"Thus, we can construct ","element":"span"},{"style":{"height":10.99},"width":144.62,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-16.png","element":"img","alt":" wuse−core ","inline":true,"padRight":true},{"text":"by setting some constant ","element":"span"},{"style":{"height":17.8},"width":195.18,"height":44.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-17.png","element":"img","alt":" ασ2noise ≤ 2.","inline":true,"padRight":true},{"text":"Now that we have constructed ","element":"span"},{"style":{"height":10.99},"width":144.62,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-18.png","element":"img","alt":" wuse−core","inline":true},{"text":", we can bound the norm of the minimum norm separator ","element":"span"},{"style":{"height":10.99},"width":144.62,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-19.png","element":"img","alt":" ¯wuse−core ","inline":true,"padRight":true},{"text":"by the norm of 
","element":"span"},{"style":{"height":10.99},"width":144.62,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-20.png","element":"img","alt":"wuse−core","inline":true},{"text":". The following is true with high probability,","element":"span"}],[{"style":{"width":"71%"},"width":1399,"height":292,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-21.png","element":"img"}],[{"id":"id-95","text":"Finally, we bound ","element":"span"},{"style":{"height":18.18},"width":242.11,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-22.png","element":"img","alt":" α(i)( ¯wuse−core)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"by bounding ","element":"span"},{"text":"max ","element":"span"},{"style":{"height":7.2},"width":11,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-23.png","element":"img","alt":"i","inline":true},{"style":{"height":23.92},"width":377.55,"height":59.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-24.png","element":"img","alt":"α(i)( ¯wuse−core) = Mσ2noise","inline":true,"padRight":true},{"text":". 
As we showed in the proof of ","element":"span"},{"text":"Lemma ","element":"span"},{"href":"#id-84","text":"3","element":"a"},{"text":", the following is true with high probability:","element":"span"}],[{"text":"Combined with the upper bound on ","element":"span"},{"style":{"height":17.38},"width":202.66,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-25.png","element":"img","alt":" ∥ ¯wuse−core∥22 ","inline":true,"padRight":true},{"text":"(Equation (","element":"span"},{"href":"#id-95","text":"80","element":"a"},{"text":")), we have","element":"span"}],[{"style":{"width":"99%"},"width":1941,"height":318,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/27-26.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Bounding ","element":"span"},{"style":{"height":19.2},"width":326.74,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-0.png","element":"img","alt":" δtrain(¯wuse−core, γ2)","inline":true}],[{"id":"id-91","style":{"fontWeight":"bold"},"text":"Corollary 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the parameter settings of Theorem ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", with high probability,","element":"span"}],[{"style":{"width":"77%"},"width":1516,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants ","element":"span"},{"style":{"height":16},"width":631.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-2.png","element":"img","alt":" c1 < 1/2000; c5, c6 < 1/1000 where Φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the Gaussian CDF.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"The result follows from applying Proposition ","element":"span"},{"href":"#id-75","text":"3 ","element":"a"},{"text":"(which computes a bound on the majority fraction of points that is ","element":"span"},{"style":{"height":10.4},"width":53.84,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-3.png","element":"img","alt":" γ−","inline":true},{"text":"memorized) to ","element":"span"},{"style":{"height":10.99},"width":144.62,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-4.png","element":"img","alt":" ¯wuse−core","inline":true},{"text":", invoking Lemma ","element":"span"},{"href":"#id-94","text":"11","element":"a"},{"text":", and plugging in ","element":"span"},{"style":{"height":17.33},"width":235.6,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-5.png","element":"img","alt":" ¯wuse−corespu = 0","inline":true},{"text":". Note that when ","element":"span"},{"style":{"height":17.33},"width":235.6,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-6.png","element":"img","alt":" ¯wuse−corespu = 0","inline":true},{"text":", ","element":"span"},{"style":{"height":19.2},"width":752.76,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-7.png","element":"img","alt":"δtrain�¯wuse−core, γ2�= δmaj-train�¯wuse−core, γ2�.","inline":true}],[{"text":"Finally, the above bound on ","element":"span"},{"style":{"height":19.2},"width":318.98,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-8.png","element":"img","alt":" δtrain�¯wuse−core, γ2�","inline":true},{"text":"translates to a bound on the norm ","element":"span"},{"style":{"height":16},"width":186.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-9.png","element":"img","alt":" ∥ ¯wuse−core∥","inline":true,"padRight":true},{"text":"via simple algebra. 
For ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-10.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"that satisfies ","element":"span"},{"style":{"height":17.39},"width":411.99,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-11.png","element":"img","alt":" 1 − (1 + c1)γ2 − c5 > 0:","inline":true}],[{"style":{"width":"76%"},"width":1482,"height":232,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-12.png","element":"img"}],[{"text":"Plugging the above lower bound into the bound on ","element":"span"},{"style":{"height":16},"width":186.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-13.png","element":"img","alt":" ∥ ¯wuse−core∥","inline":true,"padRight":true},{"text":"from Corollary ","element":"span"},{"href":"#id-90","text":"2","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"76%"},"width":1491,"height":383,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-14.png","element":"img"}],[{"text":"for some ","element":"span"},{"style":{"height":16},"width":217.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-15.png","element":"img","alt":" c7 < 1/1000.","inline":true}],[{"id":"id-72","style":{"fontWeight":"bold"},"text":"B.3. Underparameterized regime","element":"span"}],[{"text":"So far, we have studied the overparameterized regime for the data distribution described in Section ","element":"span"},{"text":"5","element":"span"},{"text":". In the overparameterized setting, where the dimension of noise features ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is very large, logistic regression (both ERM and reweighted) leads to max-margin classifiers. 
We showed that for some setting of parameters ","element":"span"},{"style":{"height":11.59},"width":336.31,"height":28.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-16.png","element":"img","alt":" nmaj, nmin, σspu, σcore","inline":true},{"text":", the robust error of such max-margin classifiers can be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"3","element":"span"},{"text":", worse than random guessing. How does the same reweighted logistic regression perform in the underparameterized regime? We focus on the setting where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= 0","element":"span"},{"text":". In this setting, the data is two-dimensional, and w.h.p., the training data is not linearly separable unless ","element":"span"},{"style":{"height":13.19},"width":148.38,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-17.png","element":"img","alt":" σcore = 0","inline":true},{"text":". 
Consequently, the learned model ","element":"span"},{"style":{"height":13.38},"width":107.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-18.png","element":"img","alt":" ˆwrwR2 ","inline":true,"padRight":true},{"text":"that minimizes the reweighted training loss is not generally a max-margin separator.","element":"span"}],[{"text":"For intuition, consider the following two sets of models, which are analogous to what we considered in Equation ","element":"span"},{"href":"#id-96","text":"12 ","element":"a"},{"text":"in the main text for the overparameterized regime:","element":"span"}],[{"style":{"width":"68%"},"width":1334,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-19.png","element":"img"}],[{"text":"The first set ","element":"span"},{"style":{"height":11.78},"width":149.06,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-20.png","element":"img","alt":" Wuse−spu ","inline":true,"padRight":true},{"text":"comprises models that use the spurious feature but not the core feature, and the second set ","element":"span"},{"style":{"height":11.78},"width":157.65,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-21.png","element":"img","alt":" Wuse−core","inline":true,"padRight":true},{"text":"comprises models that use the core feature but not the spurious feature. 
Models in ","element":"span"},{"style":{"height":11.78},"width":149.05,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-22.png","element":"img","alt":" Wuse−spu","inline":true,"padRight":true},{"text":"that exclusively use ","element":"span"},{"style":{"height":11.59},"width":64.44,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-23.png","element":"img","alt":" xspu","inline":true,"padRight":true},{"text":"will have high training loss on the minorities since the minority points cannot be memorized. Due to upweighting the minorities, these models will have high reweighted training loss. On the other hand, models in ","element":"span"},{"style":{"height":11.78},"width":157.65,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/28-24.png","element":"img","alt":" Wuse−core","inline":true,"padRight":true},{"text":"exclusively use the core ","element":"span"},{"text":"features that are informative for the label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"across all groups. Hence they obtain reasonable loss across all groups and have smaller reweighted training loss than models in ","element":"span"},{"style":{"height":11.79},"width":161.34,"height":29.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-0.png","element":"img","alt":" Wuse−spu.","inline":true}],[{"text":"We will show in this section that the population minimizer of the reweighted loss is indeed in ","element":"span"},{"style":{"height":11.78},"width":157.65,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-1.png","element":"img","alt":" Wuse−core","inline":true,"padRight":true},{"text":"and bound the asymptotic variance of the reweighted estimator, leading to the final result in Theorem ","element":"span"},{"href":"#id-31","text":"1","element":"a"},{"text":". 
Our approach is to study the asymptotic behavior of the reweighted estimator when the number of data points ","element":"span"},{"style":{"height":12},"width":116.65,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-2.png","element":"img","alt":" n ≫ d.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Data distribution. ","element":"span"},{"text":"We first recap the data-generating distribution (described in Section ","element":"span"},{"text":"5","element":"span"},{"text":"). ","element":"span"},{"style":{"height":16.79},"width":375.04,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-3.png","element":"img","alt":" x = [xcore, xspu] where,","inline":true}],[{"style":{"width":"40%"},"width":780,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-4.png","element":"img"}],[{"text":"For a ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-5.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"fraction of the points, we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"(majority points), and for a ","element":"span"},{"style":{"height":15.59},"width":136.29,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-6.png","element":"img","alt":" 1 − pmaj","inline":true,"padRight":true},{"text":"fraction of the points, we have ","element":"span"},{"style":{"height":14.4},"width":286.08,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-7.png","element":"img","alt":" a = −y (minority","inline":true,"padRight":true},{"text":"points).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Reweighted logistic loss. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-8.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"be the fraction of the majority group points and ","element":"span"},{"style":{"height":16.79},"width":167.86,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-9.png","element":"img","alt":" (1 − pmaj)","inline":true,"padRight":true},{"text":"be the fraction of minority points. In order to use standard results from the asymptotics of M-estimators, we rewrite the reweighted estimator (defined in Section ","element":"span"},{"href":"#id-10","text":"2","element":"a"},{"text":") as the minimizer of the following loss over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"training points ","element":"span"},{"style":{"height":16.15},"width":172.31,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-10.png","element":"img","alt":" [xi, yi]ni=1.","inline":true}],[{"id":"id-100","style":{"width":"84%"},"width":1645,"height":385,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-11.png","element":"img"}],[{"text":"We follow the standard steps of asymptotic analysis where we:","element":"span"}],[{"text":"1. Compute the population minimizer ","element":"span"},{"style":{"height":6.8},"width":43.6,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-12.png","element":"img","alt":" w⋆ ","inline":true,"padRight":true},{"text":"that satisfies ","element":"span"},{"style":{"height":16},"width":820.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-13.png","element":"img","alt":" ∇Lrw(w⋆) = 0, where Lrw(w⋆) = E[ℓrw(x, y, w⋆)].","inline":true}],[{"text":"2. 
Bound the asymptotic variance ","element":"span"},{"style":{"height":17.39},"width":804.4,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-14.png","element":"img","alt":" ∇2Lrw(w⋆)−1 Cov[∇ℓrw(x, y, w⋆)]∇2Lrw(w⋆)−1.","inline":true}],[{"id":"id-99","style":{"fontWeight":"bold"},"text":"Proposition 4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For the data distribution under study, the population minimizer ","element":"span"},{"style":{"height":6.8},"width":43.6,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-15.png","element":"img","alt":" w⋆","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"that satisfies ","element":"span"},{"style":{"height":16},"width":247.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-16.png","element":"img","alt":" ∇Lrw(w⋆) = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the following.","element":"span"}],[{"style":{"width":"57%"},"width":1111,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-17.png","element":"img"}],[{"text":"This is the key property in the underparameterized regime: the population minimizer uses only the core feature, not the spurious feature, and therefore attains the best possible worst-group error.","element":"span"}],[{"id":"id-98","style":{"fontWeight":"bold"},"text":"Proposition 5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The asymptotic distribution of the reweighted logistic regression estimator is as follows.","element":"span"}],[{"style":{"width":"90%"},"width":1764,"height":203,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-18.png","element":"img"}],[{"id":"id-104","style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":13.6},"width":297.99,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-19.png","element":"img","alt":" σcore ≥ 1, we have","inline":true}],[{"style":{"width":"36%"},"width":339,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants ","element":"span"},{"style":{"height":14},"width":120.43,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/29-21.png","element":"img","alt":" C1, C2.","inline":true}],[{"text":"We see that the asymptotic variance increases as ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-0.png","element":"img","alt":" pmaj","inline":true,"padRight":true},{"text":"increases. This is expected because the reweighted estimator upweights the minority points by the inverse of the group size. As these weights increase, the variance also increases. 
However, as noted above, the population minimizer has small worst-group error. Because the asymptotic variance is finite (for fixed ","element":"span"},{"style":{"height":11.59},"width":67.82,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-1.png","element":"img","alt":" pmaj","inline":true},{"text":"), the estimator approaches the population minimizer as the training set grows, and so it also attains small worst-group error.","element":"span"}],[{"text":"We now prove Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"for the underparameterized regime, restated as Theorem ","element":"span"},{"href":"#id-97","text":"3 ","element":"a"},{"text":"below.","element":"span"}],[{"id":"id-97","style":{"fontWeight":"bold"},"text":"Theorem 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In the underparameterized regime with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":", for ","element":"span"},{"style":{"height":19.37},"width":497.84,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-2.png","element":"img","alt":" pmaj = (1 − 1/2001), σ2core = 1","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":19.72},"width":150.91,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-3.png","element":"img","alt":" σ2spu = 0","inline":true},{"style":{"fontStyle":"italic"},"text":", in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"asymptotic regime with ","element":"span"},{"style":{"height":15.99},"width":412.66,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-4.png","element":"img","alt":" nmaj, nmin → ∞, we 
have","inline":true}],[{"style":{"width":"57%"},"width":1126,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We now put Propositions ","element":"span"},{"href":"#id-98","text":"5 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-99","text":"4 ","element":"a"},{"text":"together. We have ","element":"span"},{"style":{"height":14.92},"width":244.86,"height":37.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-6.png","element":"img","alt":" ˆwrwcore ≥ 2 − ϵ1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.34},"width":187.03,"height":45.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-7.png","element":"img","alt":" | ˆwrwspu| ≤ ϵ2","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":225.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-8.png","element":"img","alt":" ϵ1, ϵ2 < 1/10","inline":true},{"text":", ","element":"span"},{"text":"i.e., the estimator is very close to the population minimizer. This follows from setting ","element":"span"},{"style":{"height":21.62},"width":434.72,"height":54.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-9.png","element":"img","alt":" σcore, σspu, pmaj = nmaj/(nmaj+nmin)","inline":true,"padRight":true},{"text":"to their ","element":"span"},{"text":"corresponding values and setting ","element":"span"},{"style":{"height":13.99},"width":266.76,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-10.png","element":"img","alt":" n = nmaj + nmin","inline":true,"padRight":true},{"text":"to be large enough. 
To compute the worst-group error, consider without loss of generality points with label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"(labels are balanced in the population). For a point from the majority group, the probability of misclassification is as follows.","element":"span"}],[{"style":{"width":"77%"},"width":1507,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-11.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.4},"width":210.54,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-12.png","element":"img","alt":" z ∼ N(0, 1).","inline":true}],[{"text":"Similarly, for the minority group, the probability of misclassification is","element":"span"}],[{"style":{"width":"72%"},"width":1407,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-13.png","element":"img"}],[{"text":"Therefore, the worst-group error of ","element":"span"},{"style":{"height":10.98},"width":59.73,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-14.png","element":"img","alt":" ˆwrw ","inline":true,"padRight":true},{"text":"can be bounded as follows.","element":"span"}],[{"text":"where ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-15.png","element":"img","alt":" Φ","inline":true,"padRight":true},{"text":"is the Gaussian CDF. 
Substituting ","element":"span"},{"style":{"height":18.34},"width":763.14,"height":45.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-16.png","element":"img","alt":" σcore = 1, σspu = 0, ˆwrwcore ≥ 2 − ϵ1, | ˆwrwspu| ≤ ϵ2","inline":true,"padRight":true},{"text":"gives the required result that ","element":"span"},{"text":"Err","element":"span"},{"style":{"height":16.79},"width":251.19,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-17.png","element":"img","alt":"wg( ˆwrw) < 1/4","inline":true},{"text":". In contrast, in the overparameterized regime where ","element":"span"},{"style":{"height":12},"width":129.71,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-18.png","element":"img","alt":" N ≫ n","inline":true},{"text":", even for very large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", the reweighted estimator has high worst-group error, as shown in Theorem ","element":"span"},{"href":"#id-31","text":"1","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"47%"},"width":445,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-19.png","element":"img"}],[{"text":"We now provide the proofs for Proposition ","element":"span"},{"href":"#id-99","text":"4 ","element":"a"},{"text":"and Proposition ","element":"span"},{"href":"#id-98","text":"5 ","element":"a"},{"text":"which mostly follow from straightforward algebra.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For the data distribution under study, the population minimizer ","element":"span"},{"style":{"height":6.8},"width":43.6,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-20.png","element":"img","alt":" w⋆","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"that satisfies ","element":"span"},{"style":{"height":16},"width":247.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-21.png","element":"img","alt":" ∇Lrw(w⋆) = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the following.","element":"span"}],[{"style":{"width":"57%"},"width":1111,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-22.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For convenience, we compute expectations over the majority and minority groups separately and express the population loss L","element":"span"},{"style":{"height":6.4},"width":29.29,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-23.png","element":"img","alt":"rw","inline":true,"padRight":true},{"text":"as the weighted sum of the two terms. 
Recall that we denote ","element":"span"},{"style":{"height":16.79},"width":267.73,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-24.png","element":"img","alt":" x = [xcore, xspu].","inline":true}],[{"id":"id-101","style":{"width":"75%"},"width":1470,"height":178,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-25.png","element":"img"}],[{"text":"We use the following expression for computing the population gradient.","element":"span"}],[{"style":{"width":"73%"},"width":1426,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/30-26.png","element":"img"}],[{"text":"Combining the definition of the reweighted loss and population losses (Equation ","element":"span"},{"href":"#id-100","text":"91 ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-101","text":"102","element":"a"},{"text":") with the gradient expression above gives the following.","element":"span"}],[{"style":{"width":"86%"},"width":1677,"height":256,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/31-0.png","element":"img"}],[{"text":"Now we compute ","element":"span"},{"style":{"height":17.37},"width":949.42,"height":43.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/31-1.png","element":"img","alt":" ∇Lrw(w⋆) = pmaj∇Lrw-maj(w⋆) + (1 − pmaj)∇Lrw-min(w⋆)","inline":true},{"text":". First, we compute the gradient with respect to the spurious attribute, ","element":"span"},{"style":{"height":16.79},"width":215.36,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/31-2.png","element":"img","alt":"∇spuLrw(w⋆)","inline":true},{"text":". 
For convenience, let ","element":"span"},{"style":{"height":22.56},"width":146.45,"height":56.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/31-3.png","element":"img","alt":" c = 2σ2core .","inline":true}],[{"style":{"height":5.6},"width":765.23,"height":14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/31-4.png","element":"img","alt":"","inline":true,"padRight":true},{"text":"Replacing ","element":"span"},{"style":{"height":13.6},"width":596.21,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/31-5.png","element":"img","alt":" xcore ∼ N (−1, σ2core) with −xcore ∼ N (1, σ2core)","inline":true}],[{"text":"Now we take the weighted combination of ","element":"span"},{"style":{"height":17.37},"width":615.02,"height":43.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/31-6.png","element":"img","alt":" ∇spuLrw-maj(w⋆) and ∇spuLrw-min(w⋆)","inline":true},{"text":", based on the fraction of the majority and minority samples in the population, which makes the two terms cancel out.","element":"span"}],[{"style":{"width":"77%"},"width":1517,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/31-7.png","element":"img"}],[{"text":"Now we compute ","element":"span"},{"style":{"height":16},"width":234.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/32-0.png","element":"img","alt":" ∇coreLrw(w⋆).","inline":true}],[{"text":"Similarly, we get ","element":"span"},{"style":{"height":16},"width":351.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/32-1.png","element":"img","alt":" ∇coreLrw-min(w⋆) = 0","inline":true,"padRight":true},{"text":"which proves that ","element":"span"},{"style":{"height":16},"width":307.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/32-2.png","element":"img","alt":" ∇coreLrw(w⋆) = 
0.","inline":true}],[{"id":"id-102","style":{"fontWeight":"bold"},"text":"Lemma 12. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The following is true.","element":"span"}],[{"text":"We now compute the asymptotic variance which involves computing ","element":"span"},{"style":{"height":17.39},"width":496.06,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/32-3.png","element":"img","alt":" ∇2L(w⋆) and Cov[∇ℓrw(w⋆)].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"First, we show that the off-diagonal entries of ","element":"span"},{"style":{"height":16},"width":444.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/32-4.png","element":"img","alt":" Cov[ℓrw(x, y, w⋆)] are zero.","inline":true}],[{"style":{"width":"99%"},"width":1946,"height":1062,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/32-5.png","element":"img"}],[{"text":"Now, we bound the diagonal elements.","element":"span"}],[{"style":{"height":18.18},"width":833.12,"height":45.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/33-0.png","element":"img","alt":"E[∇core(ℓrw(x, y, w⋆))2] − (E[∇coreℓrw(x, y, w⋆)])2","inline":true}],[{"style":{"height":18.18},"width":435.14,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/33-1.png","element":"img","alt":"= E[∇core(ℓrw(x, y, w⋆))2]","inline":true}],[{"style":{"width":"98%"},"width":929,"height":897,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/33-2.png","element":"img"}],[{"text":"Finally,","element":"span"}],[{"style":{"width":"75%"},"width":711,"height":552,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/33-3.png","element":"img"}],[{"id":"id-103","style":{"fontWeight":"bold"},"text":"Lemma 13. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The following is true.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We use the following expression for computing the population gradient.","element":"span"}],[{"style":{"width":"99%"},"width":1941,"height":164,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/33-4.png","element":"img"}],[{"text":"Recall the definition of the population majority and minority losses (Equation ","element":"span"},{"href":"#id-101","text":"102","element":"a"},{"text":").","element":"span"}],[{"style":{"width":"76%"},"width":716,"height":186,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/34-0.png","element":"img"}],[{"text":"Like previously, we first compute the off-diagonal entries.","element":"span"}],[{"text":"Now, we bound the diagonal entries. Recall that ","element":"span"},{"style":{"height":22.56},"width":638.12,"height":56.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/34-1.png","element":"img","alt":" w⋆spu = 0 and w⋆core = c where c = 2σ2core .","inline":true}],[{"style":{"width":"47%"},"width":450,"height":359,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/34-2.png","element":"img"}],[{"text":"Finally, we calculate ","element":"span"},{"style":{"height":18.76},"width":353.64,"height":46.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-0.png","element":"img","alt":" [∇2Lrw-maj(w⋆)]spu, spu","inline":true,"padRight":true},{"text":"as follows.","element":"span"}],[{"style":{"width":"100%"},"width":939,"height":1329,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":13.6},"width":297.99,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-2.png","element":"img","alt":" σcore ≥ 1, we 
have","inline":true}],[{"style":{"width":"36%"},"width":339,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some constants ","element":"span"},{"style":{"height":14},"width":120.43,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-4.png","element":"img","alt":" C1, C2.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"By asymptotic normality, we have ","element":"span"},{"style":{"height":17.38},"width":1086.7,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-5.png","element":"img","alt":"√n( ˆw − w⋆) → N(0, ∇2L(w⋆)−1 Cov[∇ℓ(x, y, w⋆)]∇2L(w⋆)−1)","inline":true},{"text":". Combining Lemma ","element":"span"},{"href":"#id-102","text":"12 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-103","text":"13","element":"a"},{"text":", we get the expression in Equation ","element":"span"},{"href":"#id-104","text":"96","element":"a"},{"text":". 
Each term is decreasing in ","element":"span"},{"style":{"height":9.19},"width":73.04,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-6.png","element":"img","alt":" σcore","inline":true},{"text":", and hence we get the final result by substituting ","element":"span"},{"style":{"height":17.32},"width":148.38,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-7.png","element":"img","alt":" σ2core = 1","inline":true,"padRight":true},{"text":"to obtain the constants ","element":"span"},{"style":{"height":14},"width":108.55,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-8.png","element":"img","alt":" C1, C2 ","inline":true,"padRight":true},{"text":"(and noting that ","element":"span"},{"style":{"height":19.72},"width":163.06,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2005.04345/images/35-9.png","element":"img","alt":" σ2spu ≥ 0).","inline":true}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]