36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"71943","publisher":"neurips","paperJSON":{"title":"Training shallow ReLU networks on noisy data using hinge loss: when do we overfit and is it benign?","paperID":"71943","avgLineHeight":10.88,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3c","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Conventional machine learning wisdom suggests that the generalization error of a complex model will typically be worse than that of a simpler model when both are trained to interpolate data. Indeed, the bias-variance trade-off implies that although choosing a complex model is advantageous in terms of approximation error, it comes at the price of an increased risk of overfitting. The traditional solution to managing this trade-off is to use some form of regularization, allowing the optimizer to select a predictor from a rich class of functions while at the same time encouraging it to choose one that is in some sense simple. However, in recent years this perspective has been challenged by the observation that deep learning models, trained with minimal, if any, regularization, can almost perfectly interpolate noisy data with nominal cost to their generalization performance ","element":"span"},{"href":"#id-0","referenceIndex":38,"text":"(Zhang et al., ","element":"a"},{"href":"#id-0","referenceIndex":38,"text":"2017; ","element":"a"},{"href":"#id-1","referenceIndex":7,"text":"Belkin et al., ","element":"a"},{"href":"#id-1","referenceIndex":7,"text":"2018b, ","element":"a"},{"href":"#id-2","referenceIndex":8,"text":"2019)","element":"a"},{"text":". 
This phenomenon is referred to as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"benign overfitting","element":"span"},{"text":".","element":"span"}],[{"text":"Following these empirical observations, a line of research has emerged aiming to theoretically characterize the conditions under which various machine learning models, trained to zero loss on noisy data, obtain, at least asymptotically, optimal generalization error. To date, the majority of analyses in this regard have focused primarily on linear models, including linear regression ","element":"span"},{"href":"#id-3","referenceIndex":5,"text":"(Bartlett ","element":"a"},{"href":"#id-3","referenceIndex":5,"text":"et al., ","element":"a"},{"href":"#id-3","referenceIndex":5,"text":"2020; ","element":"a"},{"href":"#id-4","referenceIndex":28,"text":"Muthukumar et al., ","element":"a"},{"href":"#id-4","referenceIndex":28,"text":"2020; ","element":"a"},{"href":"#id-5","referenceIndex":35,"text":"Wu & Xu, ","element":"a"},{"href":"#id-5","referenceIndex":35,"text":"2020; ","element":"a"},{"href":"#id-6","referenceIndex":12,"text":"Chatterji & Long, ","element":"a"},{"href":"#id-6","referenceIndex":12,"text":"2021; ","element":"a"},{"href":"#id-7","referenceIndex":39,"text":"Zou et al., ","element":"a"},{"href":"#id-7","referenceIndex":39,"text":"2021; ","element":"a"},{"href":"#id-8","referenceIndex":16,"text":"Hastie et al., ","element":"a"},{"href":"#id-8","referenceIndex":16,"text":"2022; ","element":"a"},{"href":"#id-9","referenceIndex":20,"text":"Koehler et al., ","element":"a"},{"href":"#id-9","referenceIndex":20,"text":"2021; ","element":"a"},{"href":"#id-10","referenceIndex":33,"text":"Wang et al., ","element":"a"},{"href":"#id-10","referenceIndex":33,"text":"2021a; ","element":"a"},{"href":"#id-11","referenceIndex":13,"text":"Chatterji & Long, ","element":"a"},{"href":"#id-11","referenceIndex":13,"text":"2022; ","element":"a"},{"href":"#id-12","referenceIndex":10,"text":"Cao et al., 
","element":"a"},{"href":"#id-12","referenceIndex":10,"text":"2021; ","element":"a"},{"href":"#id-13","referenceIndex":31,"text":"Shamir, ","element":"a"},{"href":"#id-13","referenceIndex":31,"text":"2022)","element":"a"},{"text":", logistic regression ","element":"span"},{"href":"#id-6","referenceIndex":12,"text":"(Chatterji & Long, ","element":"a"},{"href":"#id-6","referenceIndex":12,"text":"2021; ","element":"a"},{"href":"#id-14","referenceIndex":29,"text":"Muthukumar et al., ","element":"a"},{"href":"#id-14","referenceIndex":29,"text":"2021; ","element":"a"},{"href":"#id-15","referenceIndex":34,"text":"Wang ","element":"a"},{"href":"#id-15","referenceIndex":34,"text":"et al., ","element":"a"},{"href":"#id-15","referenceIndex":34,"text":"2021b) ","element":"a"},{"text":"and kernel regression ","element":"span"},{"href":"#id-16","referenceIndex":6,"text":"(Belkin et al., ","element":"a"},{"href":"#id-16","referenceIndex":6,"text":"2018a; ","element":"a"},{"href":"#id-17","referenceIndex":27,"text":"Mei & Montanari, ","element":"a"},{"href":"#id-17","referenceIndex":27,"text":"2019; ","element":"a"},{"href":"#id-18","referenceIndex":23,"text":"Liang & Rakhlin, ","element":"a"},{"href":"#id-18","referenceIndex":23,"text":"2020; ","element":"a"},{"href":"#id-19","referenceIndex":24,"text":"Liang et al., ","element":"a"},{"href":"#id-19","referenceIndex":24,"text":"2019)","element":"a"},{"text":". With regards to understanding benign overfitting in neural networks, in the ","element":"span"},{"text":"Neural Tangent Kernel (NTK) regime ","element":"span"},{"href":"#id-20","referenceIndex":17,"text":"(Jacot et al., ","element":"a"},{"href":"#id-20","referenceIndex":17,"text":"2018) ","element":"a"},{"text":"the prediction of a neural network is well approximated via kernel regression ","element":"span"},{"href":"#id-21","referenceIndex":1,"text":"(Adlam & Pennington, ","element":"a"},{"href":"#id-21","referenceIndex":1,"text":"2020)","element":"a"},{"text":". 
However, this regime typically requires unrealistically large network width and fails to capture feature learning. Indeed, despite neural networks being the initial source of inspiration, when and how they benignly overfit in the rich, feature learning regime is not well understood.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Contributions and related work","element":"span"}],[{"text":"In this work we study benign overfitting in the context of binary classification for two-layer ReLU networks, trained using gradient descent and hinge loss, on label-corrupted, linearly separable data. There are a number of recent and/or concurrent works which prove benign overfitting results in a similar setting ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022, ","element":"a"},{"href":"#id-23","referenceIndex":15,"text":"2023)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-25","referenceIndex":11,"text":"Cao et al. ","element":"a"},{"href":"#id-25","referenceIndex":11,"text":"(2022)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-26","referenceIndex":22,"text":"Kou et al. ","element":"a"},{"href":"#id-26","referenceIndex":22,"text":"(2023)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-27","referenceIndex":21,"text":"Kornowski et al. ","element":"a"},{"href":"#id-27","referenceIndex":21,"text":"(2023)","element":"a"},{"text":"; however, we emphasize that these exclusively study exponentially tailed losses, notably the popular logistic loss. 
Benign overfitting is intimately related to the notion of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"implicit bias","element":"span"},{"text":", the preference of an algorithm for selecting minimizers with certain properties over others. Homogeneous networks trained with gradient descent on an exponentially tailed loss from a low initial loss are known to converge in direction to a Karush-Kuhn-Tucker (KKT) point of the associated max-margin problem ","element":"span"},{"href":"#id-28","referenceIndex":25,"text":"Lyu & Li ","element":"a"},{"href":"#id-28","referenceIndex":25,"text":"(2020)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-29","referenceIndex":18,"text":"Ji & Telgarsky ","element":"a"},{"href":"#id-29","referenceIndex":18,"text":"(2020)","element":"a"},{"text":". This implies, at least intuitively, a certain bias towards margin maximization. In a recent work ","element":"span"},{"href":"#id-23","referenceIndex":15,"text":"Frei et al. ","element":"a"},{"href":"#id-23","referenceIndex":15,"text":"(2023) ","element":"a"},{"text":"it is shown that if the input data is sufficiently orthogonal then a shallow, leaky ReLU network evaluated on such a KKT point is equivalent to a particular linear classifier. Moreover, under additional data assumptions, the authors show such networks benignly overfit. Another recent paper ","element":"span"},{"href":"#id-27","referenceIndex":21,"text":"Kornowski ","element":"a"},{"href":"#id-27","referenceIndex":21,"text":"et al. ","element":"a"},{"href":"#id-27","referenceIndex":21,"text":"(2023) ","element":"a"},{"text":"uses a similar approach to derive benign overfitting results for ReLU networks and also provides a description of the transition between benign and tempered overfitting in the univariate input case. To the best of our knowledge, equivalent results on the implicit bias of homogeneous networks trained with non-exponentially tailed losses have not been characterized. 
Furthermore, training a linear classifier with an exponentially versus non-exponentially tailed loss is known to result in a different implicit bias, with the non-exponentially tailed loss potentially inducing convergence in direction to a classifier with a poor margin ","element":"span"},{"href":"#id-30","referenceIndex":19,"text":"Ji et al. ","element":"a"},{"href":"#id-30","referenceIndex":19,"text":"(2020)","element":"a"},{"text":". As a result, a priori it is not clear if and how the choice of hinge loss impacts the propensity for a shallow ReLU network to overfit.","element":"span"}],[{"text":"There are two main existing lines of work which study benign overfitting in neural networks outside of the kernel regime. In the line of prior work most relevant to our own, ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022) ","element":"a"},{"text":"consider a smooth, leaky ReLU activation function, train the network using the logistic instead of the hinge loss and assume the data is drawn from a mixture of well-separated sub-Gaussian distributions. The key result of this work is that, given a sufficient number of iterations of GD, the network will interpolate the noisy training data while also achieving minimax optimal generalization error up to constants in the exponents. A concurrent work ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a"},{"text":"extends this result to more general activation functions including ReLU, relaxes the assumptions on the noise distribution to being centered with bounded logarithmic Sobolev constant, and also improves the convergence rate. 
As highlighted in ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023)","element":"a"},{"text":", the fact that ReLU is non-smooth and non-leaky significantly complicates the analysis of both the convergence and generalization. A second line of work ","element":"span"},{"href":"#id-25","referenceIndex":11,"text":"(Cao ","element":"a"},{"href":"#id-25","referenceIndex":11,"text":"et al., ","element":"a"},{"href":"#id-25","referenceIndex":11,"text":"2022; ","element":"a"},{"href":"#id-26","referenceIndex":22,"text":"Kou et al., ","element":"a"},{"href":"#id-26","referenceIndex":22,"text":"2023) ","element":"a"},{"text":"studies benign overfitting in two-layer convolutional as opposed to feedforward neural networks. Whereas here and in ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a"},{"text":"each data point is modeled as the sum of a signal and noise component, in ","element":"span"},{"href":"#id-25","referenceIndex":11,"text":"Cao et al. ","element":"a"},{"href":"#id-25","referenceIndex":11,"text":"(2022)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-26","referenceIndex":22,"text":"Kou et al. ","element":"a"},{"href":"#id-26","referenceIndex":22,"text":"(2023) ","element":"a"},{"text":"the signal and noise components lie in disjoint patches. The weight vector of each neuron is applied to both patches separately and a non-linearity, such as ReLU, is applied to the resulting pre-activation. In this setting, the authors prove interpolation of the noisy training data and derive conditions on the clean margin under which the network benignly vs non-benignly overfits. 
We emphasize that the data model studied in these works is very different from the setting we study here, and as a result we primarily restrict our comparison to ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022) ","element":"a"},{"text":"and the concurrent work ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023)","element":"a"},{"text":". Finally, in regard to optimizing shallow ReLU networks using hinge loss, a line of work ","element":"span"},{"href":"#id-31","referenceIndex":9,"text":"(Brutzkus et al., ","element":"a"},{"href":"#id-31","referenceIndex":9,"text":"2018; ","element":"a"},{"href":"#id-32","referenceIndex":32,"text":"Wang et al., ","element":"a"},{"href":"#id-32","referenceIndex":32,"text":"2019; ","element":"a"},{"href":"#id-33","referenceIndex":37,"text":"Yang et al., ","element":"a"},{"href":"#id-33","referenceIndex":37,"text":"2021) ","element":"a"},{"text":"studies the convergence of gradient descent on generic, linearly separable data without label corruptions. These works also require additional assumptions, notably leaky ReLU instead of ReLU, insertion of noise into the optimization algorithm or changes to the loss function.","element":"span"}],[{"text":"Before we discuss our contributions we remark that a previous work ","element":"span"},{"href":"#id-34","referenceIndex":26,"text":"Mallinar et al. ","element":"a"},{"href":"#id-34","referenceIndex":26,"text":"(2022) ","element":"a"},{"text":"describes and experimentally explores a taxonomy of overfitting: benign overfitting, where the generalization error is optimal; catastrophic overfitting, where the generalization error is close to random chance; ","element":"span"},{"text":"and tempered overfitting, which lies in between. 
In this work, we do not consider the full breadth of this taxonomy, and use the terms “non-benign overfitting” or equivalently “harmful overfitting” to refer to overfitting that may be either tempered or catastrophic. We now summarize our contributions: in particular, under certain assumptions on the model hyperparameters, we prove conditions on the clean margin resulting in the three distinct training outcomes highlighted below. We remark also that the prior works discussed primarily focus on deriving positive benign overfitting results.","element":"span"}],[{"text":"1. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Benign overfitting: ","element":"span"},{"text":"Theorem ","element":"span"},{"href":"#id-35","text":"3.1 ","element":"a"},{"text":"provides conditions under which the training loss converges to zero and bounds the generalization error, showing that it is asymptotically optimal. This result is analogous to those of ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a"},{"text":"but for the hinge instead of logistic loss.","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Non-benign overfitting: ","element":"span"},{"text":"Theorem ","element":"span"},{"href":"#id-36","text":"3.6 ","element":"a"},{"text":"provides conditions under which the network achieves zero training loss while generalization error is bounded below by a constant. Unlike ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"et al. 
","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023)","element":"a"},{"text":", this is not due to the non-separability of the data model but is instead a result of the neural network failing to learn the optimal classifier.","element":"span"}],[{"text":"3. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"No overfitting: ","element":"span"},{"text":"Theorem ","element":"span"},{"href":"#id-37","text":"3.8 ","element":"a"},{"text":"provides conditions under which the network achieves zero training loss on points with uncorrupted label signs but nonzero loss on points with corrupted signs. Again the generalization error is bounded and shown to be asymptotically optimal.","element":"span"}],[{"text":"To conclude this section we further remark that our proof techniques are quite different from those used in ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a"},{"text":"and indeed the other works highlighted in this section. Again we emphasize this is due to the fact we study the hinge loss instead of the logistic loss and discuss the differences arising from this in detail in Section ","element":"span"},{"text":"3. ","element":"span"},{"text":"In particular, we set up the problem in such a way that the convergence analysis reduces to counting the number of activations of clean versus corrupt points during various stages of training. 
Our analysis further provides a detailed description of the dynamics of the network’s neurons, thereby allowing us to understand how the network fits both the clean and corrupted data.","element":"span"}]]},{"heading":"2 Preliminaries","paragraphs":[[{"id":"id-39","style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Data model","element":"span"}],[{"text":"We consider a training sample of ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"pairs of points and their labels ","element":"span"},{"style":{"height":17.54},"width":537.12,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-0.png","element":"img","alt":" (x_i, y_i)_{i=1}^{2n} where (x_i, y_i) ∈ R^d ×","inline":true},{"style":{"height":16},"width":159.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-1.png","element":"img","alt":"{−1, +1}","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":128.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-2.png","element":"img","alt":" i ∈ [2n]","inline":true},{"text":". Furthermore, we identify two disjoint subsets ","element":"span"},{"style":{"height":16},"width":413.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-3.png","element":"img","alt":" S_T ⊂ [2n] = {1, ..., 2n}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":463.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-4.png","element":"img","alt":" S_F ⊂ [2n], S_T ∪ S_F = [2n]","inline":true},{"text":", which correspond to the clean and corrupt points in the sample respectively. 
The categorization of a point as clean or corrupted is determined by its label: for all ","element":"span"},{"style":{"height":16},"width":140.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-5.png","element":"img","alt":"i ∈ [2n]","inline":true,"padRight":true},{"text":"we assume ","element":"span"},{"style":{"height":16.99},"width":259.92,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-6.png","element":"img","alt":" y_i = β(i)(−1)^i","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":185.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-7.png","element":"img","alt":" β(i) = −1","inline":true,"padRight":true},{"text":"iff ","element":"span"},{"style":{"height":13.58},"width":122.32,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-8.png","element":"img","alt":" i ∈ S_F","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":154.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-9.png","element":"img","alt":" β(i) = 1","inline":true,"padRight":true},{"text":"otherwise. 
In addition, we assume ","element":"span"},{"style":{"height":16},"width":540.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-10.png","element":"img","alt":" |S_F ∩ [2n]_e| = |S_F ∩ [2n]_o| = k","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":611.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-11.png","element":"img","alt":" |S_T ∩ [2n]_e| = |S_T ∩ [2n]_o| = n − k","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":482.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-12.png","element":"img","alt":" [2n]_e ⊂ [2n] and [2n]_o ⊂ [2n]","inline":true,"padRight":true},{"text":"are the even and odd indices, respectively. We remark that this assumption simplifies the exposition of our results but is not integral to our analysis. Each data point is assumed to have the form","element":"span"}],[{"id":"id-44","style":{"width":"68%"},"width":1088,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-13.png","element":"img"}],[{"text":"Here ","element":"span"},{"style":{"height":14.18},"width":119.4,"height":35.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-14.png","element":"img","alt":" v ∈ R^d","inline":true,"padRight":true},{"text":"satisfies ","element":"span"},{"style":{"height":16},"width":137.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-15.png","element":"img","alt":" ∥v∥ = 1","inline":true,"padRight":true},{"text":"and furthermore we refer to ","element":"span"},{"style":{"fontWeight":"bold"},"text":"v ","element":"span"},{"text":"as the signal vector, since the alignment of a clean point with ","element":"span"},{"style":{"fontWeight":"bold"},"text":"v ","element":"span"},{"text":"determines its sign. 
Indeed, ","element":"span"},{"style":{"height":16.98},"width":479.08,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-16.png","element":"img","alt":" sign(⟨x_i, v⟩) = (−1)^i = y_i","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":13.18},"width":127.68,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-17.png","element":"img","alt":" i ∈ S_T","inline":true,"padRight":true},{"text":"whereas sign","element":"span"},{"style":{"height":16},"width":256.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-18.png","element":"img","alt":"(⟨x_i, v⟩) = −y_i","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":13.58},"width":110.6,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-19.png","element":"img","alt":" i ∈ S_F","inline":true},{"text":". Thus we may view the label of a corrupt point as flipped from its clean state. The vectors ","element":"span"},{"style":{"height":17.54},"width":121.48,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-20.png","element":"img","alt":" (n_i)_{i=1}^{2n}","inline":true,"padRight":true},{"text":"are mutually independent and identically distributed ","element":"span"},{"text":"(i.i.d.) random vectors drawn from the uniform distribution over ","element":"span"},{"style":{"height":17.39},"width":301.36,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-21.png","element":"img","alt":" S^{d−1} ∩ span{v}^⊥","inline":true},{"text":", which we denote ","element":"span"},{"style":{"height":17.39},"width":362,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-22.png","element":"img","alt":" U(S^{d−1} ∩ span{v}^⊥)","inline":true},{"text":". 
Clearly this distribution is symmetric, mean zero and for any ","element":"span"},{"style":{"height":7.2},"width":66,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-23.png","element":"img","alt":" n ∼","inline":true},{"style":{"height":17.39},"width":345.4,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-24.png","element":"img","alt":"U(S^{d−1} ∩ span{v}^⊥)","inline":true,"padRight":true},{"text":"it holds that ","element":"span"},{"style":{"height":16},"width":314.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-25.png","element":"img","alt":" n ⊥ v and ∥n∥ = 1","inline":true},{"text":". We refer to these vectors as noise components because they are independent of the labels of their respective points. The real, scalar quantity ","element":"span"},{"style":{"height":16},"width":159.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-26.png","element":"img","alt":" γ ∈ [0, 1]","inline":true,"padRight":true},{"text":"controls the strength of the signal versus the noise and also defines the clean margin. 
Finally, at test time a clean label ","element":"span"},{"style":{"height":16},"width":265.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-27.png","element":"img","alt":" y ∼ U({−1, 1})","inline":true,"padRight":true},{"text":"is sampled and the corresponding test data point is constructed,","element":"span"}],[{"style":{"width":"63%"},"width":1003,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-28.png","element":"img"}],[{"text":"where again ","element":"span"},{"style":{"height":17.39},"width":448.44,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/2-29.png","element":"img","alt":" n ∼ U(S^{d−1} ∩ span{v}^⊥).","inline":true}],[{"text":"The key idea we use to characterize the training dynamics is to reduce the analysis of the trajectory of each neuron to counting the number of clean versus corrupt updates to it. This combinatorial approach relies on each point having similarly sized signal and noise components. In order to make our analysis as clear as possible, we select a data model which ensures the signal and noise components are consistent in size across all points. We emphasize that these assumptions are not strictly necessary and we believe analogous analyses could be conducted when the signal and noise components are instead appropriately bounded. 
In addition, and as discussed in more detail in Section ","element":"span"},{"href":"#id-38","text":"3.2, ","element":"a"},{"text":"the orthogonality of the signal and noise components allows us to demonstrate non-benign overfitting even when a perfect classifier exists.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Network architecture, optimization and initialization","element":"span"}],[{"text":"We consider a densely connected, single hidden layer feed-forward neural network ","element":"span"},{"style":{"height":16.8},"width":358,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-0.png","element":"img","alt":" f : R^{2m×d} × R^d → R","inline":true,"padRight":true},{"text":"with the following forward pass map,","element":"span"}],[{"style":{"width":"34%"},"width":539,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-1.png","element":"img"}],[{"text":"Here ","element":"span"},{"style":{"height":16},"width":272.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-2.png","element":"img","alt":" ϕ := max{0, z}","inline":true,"padRight":true},{"text":"denotes the ReLU activation function and ","element":"span"},{"style":{"height":12},"width":45,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-3.png","element":"img","alt":" w_j","inline":true,"padRight":true},{"text":"the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th row of the weight matrix ","element":"span"},{"style":{"height":14.19},"width":217.24,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-4.png","element":"img","alt":" W ∈ R^{2m×d}","inline":true},{"text":". 
The network weights are optimized using full batch gradient descent (GD) with step size ","element":"span"},{"style":{"height":14.4},"width":104.92,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-5.png","element":"img","alt":" η > 0","inline":true,"padRight":true},{"text":"in order to minimize the hinge loss over a training sample ","element":"span"},{"style":{"height":17.54},"width":250.92,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-6.png","element":"img","alt":" ((x_i, y_i))_{i=1}^{2n} ⊂","inline":true},{"style":{"height":17.39},"width":291.4,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-7.png","element":"img","alt":"(R^d × {−1, 1})^{2n}","inline":true,"padRight":true},{"text":"sampled as described in Section ","element":"span"},{"href":"#id-39","text":"2.1. ","element":"a"},{"text":"After ","element":"span"},{"style":{"height":12.19},"width":22.52,"height":30.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-8.png","element":"img","alt":" t′","inline":true,"padRight":true},{"text":"iterations this optimization process generates a sequence of weight matrices ","element":"span"},{"style":{"height":18.18},"width":170.48,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-9.png","element":"img","alt":" (W^{(t)})_{t=0}^{t′}","inline":true},{"text":". For convenience, we overload our notation for ","element":"span"},{"text":"the forward pass map of the network and let ","element":"span"},{"style":{"height":18.19},"width":359.44,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-10.png","element":"img","alt":" f(t, x) := f(W^{(t)}, x)","inline":true},{"text":". 
Furthermore, we denote the hinge loss on the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th point at iteration ","element":"span"},{"style":{"height":16},"width":595.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-11.png","element":"img","alt":" t as ℓ(t, i) := max{0, 1 − y_i f(t, x_i)}","inline":true},{"text":". The hinge loss over the entire training sample at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"is therefore ","element":"span"},{"style":{"height":20.38},"width":943.84,"height":50.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-12.png","element":"img","alt":" L(t) := ∑_{i=1}^{2n} ℓ(t, i). Let F(t) := {i ∈ [2n] : ℓ(t, i) > 0}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":23.52},"width":568.48,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-13.png","element":"img","alt":" A_j^{(t)} := {i ∈ [2n] : ⟨w_j^{(t)}, x_i⟩ > 0}","inline":true,"padRight":true},{"text":"denote the sets of point indices that have nonzero loss and ","element":"span"},{"text":"which activate the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th neuron at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"respectively. 
With","element":"span"}],[{"style":{"width":"44%"},"width":713,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-14.png","element":"img"}],[{"text":"then the GD update rule","element":"span"},{"text":"1 ","element":"span"},{"text":"for the neuron weights at iteration ","element":"span"},{"style":{"height":12.99},"width":85.48,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-15.png","element":"img","alt":" t ≥ 0","inline":true,"padRight":true},{"text":"may be written as","element":"span"}],[{"id":"id-51","style":{"width":"78%"},"width":1239,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-16.png","element":"img"}],[{"text":"In regard to the initialization of the network parameters, for convenience we assume each neuron’s weight vector is drawn mutually i.i.d. uniform from the centered sphere with radius ","element":"span"},{"style":{"height":13.6},"width":123,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-17.png","element":"img","alt":" λw > 0","inline":true},{"text":". We remark that results analogous to the ones presented hold if the weights are instead initialized mutually i.i.d. 
as ","element":"span"},{"style":{"height":23.52},"width":281.8,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-18.png","element":"img","alt":" w(0)jc ∼ N(0, σ2w)","inline":true,"padRight":true},{"text":"for sufficiently small ","element":"span"},{"style":{"height":17.39},"width":55,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-19.png","element":"img","alt":" σ2w.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"2.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Notation","element":"span"}],[{"text":"For indices ","element":"span"},{"style":{"height":15.6},"width":166.32,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-20.png","element":"img","alt":" i, j ∈ Z≥1","inline":true,"padRight":true},{"text":"we say ","element":"span"},{"style":{"height":13.81},"width":82,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-21.png","element":"img","alt":" i ∼ j","inline":true,"padRight":true},{"text":"iff ","element":"span"},{"style":{"height":19.22},"width":243.24,"height":48.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-22.png","element":"img","alt":" (−1)i = (−1)j","inline":true},{"text":". We often refer to a data point or neuron by its index alone, e.g. “point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"” refers to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th training point ","element":"span"},{"style":{"height":16},"width":119.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-23.png","element":"img","alt":" (xi, yi)","inline":true},{"text":". 
For two iterations ","element":"span"},{"style":{"height":13.2},"width":77,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-24.png","element":"img","alt":" t0, t1","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":12.59},"width":114,"height":31.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-25.png","element":"img","alt":"t1 > t0","inline":true,"padRight":true},{"text":"we define the following.","element":"span"}],[{"text":"1. ","element":"span"},{"style":{"height":23.54},"width":787.52,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-26.png","element":"img","alt":" Gj(t0, t1) := ∑_{i∈ST} ∑_{τ=t0}^{t1−1} 1(i ∈ A(τ)j ∩ F(τ))","inline":true,"padRight":true},{"text":"is the number of clean updates applied to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th neuron between iterations ","element":"span"},{"style":{"height":14},"width":149,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-27.png","element":"img","alt":" t0 and t1.","inline":true}],[{"text":"2. ","element":"span"},{"style":{"height":23.52},"width":787.04,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-28.png","element":"img","alt":" Bj(t0, t1) := ∑_{i∈SF} ∑_{τ=t0}^{t1−1} 1(i ∈ A(τ)j ∩ F(τ))","inline":true,"padRight":true},{"text":"is the number of corrupt updates applied to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th neuron between iterations ","element":"span"},{"style":{"height":13.58},"width":152,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-29.png","element":"img","alt":" t0 and t1.","inline":true}],[{"text":"3. 
","element":"span"},{"style":{"height":19.58},"width":1115.64,"height":48.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-30.png","element":"img","alt":" G(t0, t1) := ∑_{j∈[2m]} Gj(t0, t1) and B(t0, t1) := ∑_{j∈[2m]} Bj(t0, t1)","inline":true,"padRight":true},{"text":"are the total numbers ","element":"span"},{"text":"of clean and corrupt updates applied to the entire network between iterations ","element":"span"},{"style":{"height":13.6},"width":149,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/3-31.png","element":"img","alt":" t0 and t1.","inline":true}],[{"text":"4. ","element":"span"},{"style":{"height":16},"width":545.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-0.png","element":"img","alt":" T(t0, t1) := G(t0, t1) + B(t0, t1)","inline":true,"padRight":true},{"text":"is the total number of updates from all points applied to the entire network between iterations ","element":"span"},{"style":{"height":13.6},"width":149.04,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-1.png","element":"img","alt":" t0 and t1.","inline":true}],[{"text":"We extend all these definitions to the case ","element":"span"},{"style":{"height":12.61},"width":115.52,"height":31.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-2.png","element":"img","alt":" t0 = t1","inline":true,"padRight":true},{"text":"by letting the empty sum be 0. 
Finally, we use ","element":"span"},{"style":{"height":13.6},"width":268,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-3.png","element":"img","alt":"C ≥ 1 and c ≤ 1","inline":true,"padRight":true},{"text":"to denote generic, positive constants.","element":"span"}]]},{"heading":"3 Results","paragraphs":[[{"text":"The main contributions of this work are Theorem ","element":"span"},{"href":"#id-35","text":"3.1, ","element":"a"},{"text":"Theorem ","element":"span"},{"href":"#id-36","text":"3.6 ","element":"a"},{"text":"and Theorem ","element":"span"},{"href":"#id-37","text":"3.8, ","element":"a"},{"text":"which characterize how the margin of the clean data drives three different training regimes: namely benign overfitting, non-benign (or harmful) overfitting and no-overfitting respectively. We primarily distinguish between the three aforementioned training outcomes based on conditions on the signal strength ","element":"span"},{"style":{"height":16},"width":156.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-4.png","element":"img","alt":" γ ∈ [0, 1]","inline":true,"padRight":true},{"text":"which controls the clean margin. Assuming the corrupt points are the minority in the training sample, then heuristically we might expect the following behavior as ","element":"span"},{"style":{"height":10.59},"width":21.52,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-5.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"varies: if ","element":"span"},{"style":{"height":14.4},"width":132.6,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-6.png","element":"img","alt":" nγ ≫ 1","inline":true},{"text":", then the signal dominates the noise during training, corrupted points are never fitted and the network generalizes well. 
If ","element":"span"},{"style":{"height":14},"width":125,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-7.png","element":"img","alt":" nγ ≪ 1","inline":true},{"text":", then all points are eventually fitted based on their noise component and the network generalizes poorly. As such, we expect to observe benign overfitting when ","element":"span"},{"style":{"height":10.59},"width":21.52,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-8.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"is small but not too small: in this regime the network learns the signal, thus ensuring it generalizes well, but corrupted points can still be fitted based on their noise component, thereby allowing training to zero loss.","element":"span"}],[{"text":"For each theorem we give a proof sketch; full proofs are contained in the Supplementary Materials, which also contain supporting numerical simulations in Appendix ","element":"span"},{"text":"F. ","element":"span"},{"text":"Throughout this section, and in order to establish a common setting in which to observe a variety of different behaviors, we make the following assumptions on the network and data hyperparameters.","element":"span"}],[{"id":"id-40","style":{"fontWeight":"bold"},"text":"Assumption 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For a sufficiently large constant ","element":"span"},{"style":{"height":13.2},"width":99.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-9.png","element":"img","alt":" C ≥ 1","inline":true},{"style":{"fontStyle":"italic"},"text":", failure probability ","element":"span"},{"style":{"height":16},"width":196.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-10.png","element":"img","alt":" δ ∈ (0, 1/2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and noise inner product bound ","element":"span"},{"style":{"height":16},"width":167.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-11.png","element":"img","alt":" ρ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":17.38},"width":649.64,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-12.png","element":"img","alt":" d ≥ Cρ−2 log(n/δ), k ≤ cn, λw ≤ cη","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.4},"width":101.32,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-13.png","element":"img","alt":" η ≤ ξ","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":14.61},"width":17.52,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-14.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"depends on ","element":"span"},{"style":{"height":14.61},"width":278.48,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-15.png","element":"img","alt":" n, m, k, γ, and d.","inline":true}],[{"text":"We remark that the condition 
","element":"span"},{"style":{"height":17.38},"width":349.24,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-16.png","element":"img","alt":" d ≥ Cρ−2 log(n/δ)","inline":true,"padRight":true},{"text":"ensures the noise components are nearly orthogonal: in particular, ","element":"span"},{"style":{"height":16.8},"width":372,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-17.png","element":"img","alt":" maxi̸=ℓ |⟨ni, nℓ⟩| ≤ cρ","inline":true,"padRight":true},{"text":"with high probability for some positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":". This near orthogonality condition on the noise terms is restrictive, but is a common assumption in the related works ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023)","element":"a"},{"text":". We note that the value of ","element":"span"},{"style":{"height":10.8},"width":19.52,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-18.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"required for each of our results to hold varies. 
Likewise, the optimal constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"required in each case also vary and we will not concern ourselves with finding the tightest possible constants.","element":"span"}],[{"text":"While there are differences, the proofs of Theorems ","element":"span"},{"href":"#id-35","text":"3.1, ","element":"a"},{"href":"#id-36","text":"3.6 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-37","text":"3.8 ","element":"a"},{"text":"generally fit the following outline.","element":"span"}],[{"text":"1. Use concentration to show that, with high probability, the training data is nearly orthogonal and a certain initialization pattern is satisfied.","element":"span"}],[{"text":"2. Characterize the activation pattern early in training before any point achieves zero loss.","element":"span"}],[{"text":"3. Bound the activations at an iteration just before any training point achieves zero loss.","element":"span"}],[{"text":"4. Based on bounds on the activations at a given iteration, derive an iteration-independent upper bound on the number of subsequent updates that can occur before convergence. At convergence all points either have zero loss or activate no neurons.","element":"span"}],[{"text":"We emphasize that our proof techniques are significantly different from those used in ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a"},{"text":"due to the differences between the hinge and logistic loss. 
In particular, letting ","element":"span"},{"style":{"height":16},"width":76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-19.png","element":"img","alt":" σ(z)","inline":true,"padRight":true},{"text":"denote the logistic loss, a key step in the proof of these prior works is showing at any iteration ","element":"span"},{"style":{"height":12.99},"width":85.52,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-20.png","element":"img","alt":" t ≥ 0","inline":true,"padRight":true},{"text":"that the ratio ","element":"span"},{"style":{"height":16},"width":465.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-21.png","element":"img","alt":" σ′(yif(t, xi))/σ′(ylf(t, xl))","inline":true,"padRight":true},{"text":"is upper bounded by a constant for all pairs of points ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, l ","element":"span"},{"text":"in the training sample. For the hinge loss this approach is not feasible: indeed, if at an iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"some points achieve zero loss while others have not then this ratio is unbounded.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Benign overfitting","element":"span"}],[{"text":"The following theorem states conditions in particular on ","element":"span"},{"style":{"height":10.61},"width":21.48,"height":26.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/4-22.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"under which the network simultaneously achieves asymptotically optimal test error and achieves zero loss on both the clean and corrupted data after a finite number of iterations. 
A detailed proof of this theorem along with the associated lemmas is provided in Appendix ","element":"span"},{"text":"C.","element":"span"}],[{"id":"id-35","style":{"fontWeight":"bold"},"text":"Theorem 3.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let Assumption ","element":"span"},{"href":"#id-40","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and further assume ","element":"span"},{"style":{"height":16},"width":679.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-0.png","element":"img","alt":" n ≥ C log(1/δ), m ≥ C log(n/δ), ρ ≤ cγ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.2},"width":473.92,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-1.png","element":"img","alt":" C√(log(n/δ)/d) ≤ γ ≤ cn−1","inline":true},{"style":{"fontStyle":"italic"},"text":". Then there exists a sufficiently small step size ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-2.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that with probability at least ","element":"span"},{"style":{"height":11.6},"width":84,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-3.png","element":"img","alt":" 1 − δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"over the randomness of the dataset and network initialization the following hold.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. The training process terminates at an iteration ","element":"span"},{"style":{"height":21.78},"width":181.16,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-4.png","element":"img","alt":" Tend ≤ Cnη .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. 
For all ","element":"span"},{"style":{"height":16},"width":466.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-5.png","element":"img","alt":" i ∈ [2n] then ℓ(Tend, xi) = 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. The generalization error satisfies","element":"span"}],[{"style":{"width":"41%"},"width":661,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof sketch. ","element":"span"},{"text":"Recall the parameter ","element":"span"},{"style":{"height":10.59},"width":19.52,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-7.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"bounds the inner products of the noise components of the training data. Specifically, the conditions on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"given in Assumption ","element":"span"},{"href":"#id-40","text":"1 ","element":"a"},{"text":"ensure ","element":"span"},{"style":{"height":19.89},"width":392.76,"height":49.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-8.png","element":"img","alt":" maxi̸=l |⟨ni, nl⟩| ≤ ρ1−γ","inline":true,"padRight":true},{"text":"with high probability. 
We also identify the following sets of neurons for ","element":"span"},{"style":{"height":16},"width":207.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-9.png","element":"img","alt":" p ∈ {−1, 1},","inline":true}],[{"style":{"width":"81%"},"width":1287,"height":155,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-10.png","element":"img"}],[{"text":"These sets are useful in that neurons in ","element":"span"},{"style":{"height":15.58},"width":41.88,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-11.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"text":"have predictable activation patterns during the early phase of training. Furthermore, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is the index of a corrupted point which activates a neuron in ","element":"span"},{"style":{"height":16.19},"width":54,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-12.png","element":"img","alt":"Θyi","inline":true,"padRight":true},{"text":"at initialization, then this point will continue activating this neuron throughout the early phase of training. A concentration argument shows that ","element":"span"},{"style":{"height":15.6},"width":41.92,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-13.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-14.png","element":"img","alt":" Θp","inline":true,"padRight":true},{"text":"are sufficiently significant subsets of ","element":"span"},{"style":{"height":18.18},"width":109.48,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-15.png","element":"img","alt":"[2m]p2 ","inline":true,"padRight":true},{"text":"with high probability. 
In summary, for benign overfitting we say we have a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"good initialization ","element":"span"},{"text":"if i) ","element":"span"},{"style":{"height":19.9},"width":402.88,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-16.png","element":"img","alt":" maxi̸=l |⟨ni, nl⟩| ≤ ρ1−γ","inline":true,"padRight":true},{"text":", ii) for some small constant ","element":"span"},{"style":{"height":16},"width":173.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-17.png","element":"img","alt":" α ∈ (0, 1)","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":16.8},"width":293.08,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-18.png","element":"img","alt":" |Γp| ≥ (1 − α)m","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":197.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-19.png","element":"img","alt":"p ∈ {−1, 1}","inline":true},{"text":", and iii) for each ","element":"span"},{"style":{"height":13.79},"width":109,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-20.png","element":"img","alt":" i ∈ SF","inline":true,"padRight":true},{"text":"there exists a ","element":"span"},{"style":{"height":23.52},"width":716.28,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-21.png","element":"img","alt":" j ∈ [2m] such that (−1)j = yi and i ∈ A(0)j .","inline":true}],[{"id":"id-63","style":{"fontWeight":"bold"},"text":"Lemma 3.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the assumptions of Theorem ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"3.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and assuming we have a good initialization, suppose at some iteration ","element":"span"},{"style":{"height":12.61},"width":28.48,"height":31.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-22.png","element":"img","alt":" t0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the loss of every clean point is bounded above by ","element":"span"},{"style":{"height":15.6},"width":137.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-23.png","element":"img","alt":" a ∈ R≥0","inline":true},{"style":{"fontStyle":"italic"},"text":", while the loss of every corrupted point is bounded above by ","element":"span"},{"style":{"height":15.58},"width":135.52,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-24.png","element":"img","alt":" b ∈ R≥0","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then for all ","element":"span"},{"style":{"height":12.8},"width":96,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-25.png","element":"img","alt":" t ≥ t0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the total numbers of clean and corrupt updates which occur after ","element":"span"},{"style":{"height":12.59},"width":28.52,"height":31.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-26.png","element":"img","alt":" t0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are upper bounded as follows:","element":"span"}],[{"style":{"width":"71%"},"width":1137,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-27.png","element":"img"}],[{"text":"Because these upper bounds are independent of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", we may conclude that training reaches a steady state after a finite number of iterations. In particular, this means every point either has zero loss or activates no neurons. To prove the network achieves zero loss we need only show that every training point activates at least one neuron after the last training update. This property is simple to prove for clean points: indeed, if ","element":"span"},{"style":{"height":13.81},"width":110,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-28.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"activates every neuron in ","element":"span"},{"style":{"height":15.81},"width":49,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-29.png","element":"img","alt":" Γyi","inline":true,"padRight":true},{"text":"after the first iteration. 
An inductive argument then shows ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"activates a neuron in every subsequent iteration. Showing that every corrupt point activates a neuron at the end of training is not as simple, and requires a more careful consideration of the training dynamics. To this end we say a neuron is a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"carrier ","element":"span"},{"text":"of a training point","element":"span"}],[{"text":"between iterations ","element":"span"},{"style":{"height":18.85},"width":574.44,"height":47.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-30.png","element":"img","alt":" t0 and t if i ∈ A(τ)j for all τ ∈ [t0, t]","inline":true},{"text":". In order to prove the network fits the corrupt data we need to show each corrupt point ","element":"span"},{"style":{"height":16},"width":119.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-31.png","element":"img","alt":" (xi, yi)","inline":true,"padRight":true},{"text":"has a carrier neuron in ","element":"span"},{"style":{"height":15.58},"width":58.16,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-32.png","element":"img","alt":" Θyi","inline":true,"padRight":true},{"text":"throughout training. If too many clean points activate such a neuron, then it is possible it will eventually cease to carry any corrupt points and if a corrupt point loses all of its carrier neurons then it cannot be fitted. We show this event cannot occur by studying the activation patterns of neurons in ","element":"span"},{"style":{"height":13.39},"width":250.48,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-33.png","element":"img","alt":" Γ := Γ1 ∪ Γ−1.","inline":true}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"Lemma 3.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let the assumptions of Theorem ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"3.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and suppose we have a good initialization. Let ","element":"span"},{"style":{"height":14},"width":92.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-34.png","element":"img","alt":" j ∈ Γ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be an iteration such that no point achieves zero loss at or before this iteration. For a point ","element":"span"},{"style":{"height":23.52},"width":479.68,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-35.png","element":"img","alt":" i ∈ ST , then i ∈ A(t)j iff i ∼ j","inline":true},{"style":{"fontStyle":"italic"},"text":". For a point ","element":"span"},{"style":{"height":23.52},"width":639.8,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/5-36.png","element":"img","alt":" i ∈ SF with i ̸∼ j, i ∈ A(t)j iff i ∈ A(1)j .","inline":true}],[{"text":"The next lemma bounds the activations just before any points achieve zero loss.","element":"span"}],[{"id":"id-68","style":{"fontWeight":"bold"},"text":"Lemma 3.4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the assumptions of Theorem ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"3.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and assuming we have a good initialization, there is an iteration ","element":"span"},{"style":{"height":22.18},"width":373.04,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-0.png","element":"img","alt":" T1 ≤ Cηm[1+(γ+ρ)(n−k)] ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"before any point achieves zero loss where the following ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hold for a constant that varies from line to line.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. For all ","element":"span"},{"style":{"height":23.52},"width":1071.72,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-1.png","element":"img","alt":" p ∈ {−1, 1}, j ∈ Γp, i ∼ j, and i ∈ ST , then ⟨w(T1)j , xi⟩ ≥ cm−1.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. For all ","element":"span"},{"style":{"height":23.52},"width":1149.48,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-2.png","element":"img","alt":" p ∈ {−1, 1}, j ∈ Γp, i ̸∼ j, and i ∈ ST , then ⟨w(T1)j , xi⟩ ≤ −cnγm−1.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. 
For all ","element":"span"},{"style":{"height":16},"width":432.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-3.png","element":"img","alt":" i ∈ ST , then ℓ(T1, xi) ≤ c.","inline":true}],[{"text":"Because clean points are the majority and all of them push the network in the same signal direction, immediately after ","element":"span"},{"style":{"height":13.79},"width":34.48,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-4.png","element":"img","alt":" T1","inline":true,"padRight":true},{"text":"the loss of clean points is small and clean points activate all neurons in the relevant ","element":"span"},{"style":{"height":15.58},"width":41.92,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-5.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"text":"strongly. Furthermore, once the loss of a clean point is small it stays small. In subsequent iterations, if the number of corrupt updates since ","element":"span"},{"style":{"height":13.81},"width":34,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-6.png","element":"img","alt":" T1","inline":true,"padRight":true},{"text":"is also small, approximately ","element":"span"},{"style":{"height":16},"width":279.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-7.png","element":"img","alt":"Cεnγ/(η(γ + ρ))","inline":true},{"text":", then each clean point will activate on all but an ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-8.png","element":"img","alt":" ε","inline":true,"padRight":true},{"text":"proportion of neurons in the relevant ","element":"span"},{"style":{"height":15.6},"width":40,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-9.png","element":"img","alt":" Γp","inline":true},{"text":". 
As the hinge loss switches off the updates from a point once it reaches zero loss, eventually clean points do not participate in every iteration. Furthermore, when they do participate their updates are spread over a large proportion of the neurons. This ensures that most neurons in ","element":"span"},{"style":{"height":16},"width":45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-10.png","element":"img","alt":" Θp","inline":true,"padRight":true},{"text":"cannot receive too many clean updates in isolation, thereby ensuring carrier neurons continue to carry corrupted points throughout training.","element":"span"}],[{"text":"Lastly, the generalization result follows from the near orthogonality of the noise components of both the training and test data. Indeed, using the same concentration bound, a test point satisfies the same inner product noise condition as the training data with high probability.","element":"span"}],[{"id":"id-41","style":{"fontWeight":"bold"},"text":"Lemma 3.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider a test label ","element":"span"},{"style":{"height":16},"width":207.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-11.png","element":"img","alt":" y ∈ {−1, 1}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and point ","element":"span"},{"style":{"height":17.7},"width":403,"height":44.24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-12.png","element":"img","alt":" x := y√γv + √1 − γn","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":7.2},"width":69,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-13.png","element":"img","alt":" n ∼","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Uniform","element":"span"},{"style":{"height":17.38},"width":334.28,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-14.png","element":"img","alt":"(Sd−1 ∩ span{v}⊥)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is mutually i.i.d. from the training sample. Assume the conditions of Theorem ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"3.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and that we have a good initialization. 
In addition, suppose that ","element":"span"},{"style":{"height":19.9},"width":250.72,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-15.png","element":"img","alt":" |⟨n, nl⟩| < ρ1−γ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":16},"width":490.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-16.png","element":"img","alt":" l ∈ [2n], then yf(Tend, x) > 0.","inline":true}],[{"id":"id-38","style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Non-benign overfitting","element":"span"}],[{"text":"The next theorem states a harmful overfitting result: for sufficiently small ","element":"span"},{"style":{"height":10.61},"width":21.52,"height":26.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-17.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"the network again achieves zero loss on both the clean and corrupt data after a finite number of iterations, but the probability of misclassification is bounded from below by a constant. A detailed proof of this theorem along with the associated lemmas is provided in Appendix ","element":"span"},{"text":"D.","element":"span"}],[{"id":"id-36","style":{"fontWeight":"bold"},"text":"Theorem 3.6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let Assumption ","element":"span"},{"href":"#id-40","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and further assume ","element":"span"},{"style":{"height":17.39},"width":691.12,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-18.png","element":"img","alt":" m ≥ C log(n/δ), ρ ≤ cn−1, η < 1/(2mn)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":20},"width":143,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-19.png","element":"img","alt":" γ ≤ c√nd","inline":true},{"style":{"fontStyle":"italic"},"text":". Then with probability at least ","element":"span"},{"style":{"height":11.6},"width":84,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-20.png","element":"img","alt":" 1 − δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"over the randomness of the dataset and network ","element":"span"},{"style":{"fontStyle":"italic"},"text":"initialization the following hold.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. The training process terminates at an iteration ","element":"span"},{"style":{"height":21.79},"width":176,"height":54.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-21.png","element":"img","alt":" Tend ≤ Cnη .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. For all ","element":"span"},{"style":{"height":16},"width":466.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-22.png","element":"img","alt":" i ∈ [2n] then ℓ(Tend, xi) = 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. 
The generalization error satisfies","element":"span"}],[{"style":{"width":"29%"},"width":469,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-23.png","element":"img"}],[{"text":"We remark that the above result holds for ","element":"span"},{"style":{"height":12.8},"width":110.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-24.png","element":"img","alt":" n ≥ 1","inline":true,"padRight":true},{"text":"and any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". Indeed, in this regime the noise components dominate the training dynamics and we therefore expect the performance of the network on test points to be close to random. We re-emphasize that, unlike in the data model used by ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023)","element":"a"},{"text":", there does exist a classifier with zero generalization error for arbitrarily small ","element":"span"},{"style":{"height":10.61},"width":21.52,"height":26.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-25.png","element":"img","alt":" γ","inline":true},{"text":". The significance of Theorem ","element":"span"},{"href":"#id-36","text":"3.6 ","element":"a"},{"text":"is that, under the data model considered, GD results in a suboptimal classifier.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof sketch. 
","element":"span"},{"text":"Similar to the proof of Theorem ","element":"span"},{"href":"#id-35","text":"3.1, ","element":"a"},{"text":"in the context of non-benign overfitting we say the initialization is “good” if ","element":"span"},{"style":{"height":19.81},"width":407.48,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/6-26.png","element":"img","alt":" maxi̸=l |⟨ni, nl⟩| ≤ ρ1−γ","inline":true,"padRight":true},{"text":"and if each point in the training sample ","element":"span"},{"text":"activates a neuron of the same sign. Under the conditions of Theorem ","element":"span"},{"href":"#id-36","text":"3.6 ","element":"a"},{"text":"it can be shown that a good initialization in this context happens with high probability.","element":"span"}],[{"id":"id-82","style":{"fontWeight":"bold"},"text":"Lemma 3.7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In addition to the conditions of Theorem ","element":"span"},{"href":"#id-36","style":{"fontStyle":"italic"},"text":"3.6, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"suppose we have a good initialization and that for some iteration ","element":"span"},{"style":{"height":21.78},"width":947.56,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-0.png","element":"img","alt":" t0 then ℓ(t0, xi) ≤ a for all i ∈ [2n]. Then T(t0, t) ≤ Cnaη .","inline":true}],[{"text":"As in the benign overfitting case, we need to show that each training point activates a neuron after the last training iteration. Under the assumptions on ","element":"span"},{"style":{"height":10.59},"width":21.52,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-1.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"it can be shown that the loss of a point decreases during every iteration it participates in, regardless of the status and activations of other points in the training sample. 
All that remains is to lower bound the generalization error. To this end, observe for a test point ","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", y","element":"span"},{"text":") ","element":"span"},{"text":"that","element":"span"}],[{"style":{"width":"55%"},"width":882,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-2.png","element":"img"}],[{"text":"If the right-hand side of this equality is negative we can conclude that either ","element":"span"},{"style":{"height":7.41},"width":131,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-3.png","element":"img","alt":" x or −x","inline":true,"padRight":true},{"text":"is misclassified. That this event occurs with probability bounded below by a constant follows, in turn, from appropriately upper bounding the norm of the network weights in the signal subspace and lower bounding the norm of the network weights in the noise subspace.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"No-overfitting","element":"span"}],[{"text":"The following theorem illustrates that for ","element":"span"},{"style":{"height":10.61},"width":21.48,"height":26.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-4.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"larger than the upper bound required for benign overfitting, only the clean points achieve zero loss after convergence, which occurs in a finite number of iterations. By contrast, the corrupt points cease to activate any neurons and are thus zeroed by the network. The network also achieves asymptotically optimal test error. 
A detailed proof of this theorem along with the associated lemmas is provided in Appendix ","element":"span"},{"text":"E.","element":"span"}],[{"id":"id-37","style":{"fontWeight":"bold"},"text":"Theorem 3.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let Assumption ","element":"span"},{"href":"#id-40","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and further assume ","element":"span"},{"style":{"height":19.2},"width":543.32,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-5.png","element":"img","alt":" m ≥ 2, n ≥ C log(m/δ), ρ ≤ cγ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.98},"width":303.56,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-6.png","element":"img","alt":"cn−1 ≤ γ ≤ ck−1","inline":true},{"style":{"fontStyle":"italic"},"text":". Then there exists a sufficiently small step-size ","element":"span"},{"style":{"height":10.8},"width":19.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-7.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that with probability at least ","element":"span"},{"style":{"height":11.6},"width":87.64,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-8.png","element":"img","alt":" 1 − δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"over the randomness of the dataset and network initialization we have the following.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. The training process terminates at an iteration ","element":"span"},{"style":{"height":21.76},"width":181.16,"height":54.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-9.png","element":"img","alt":" Tend ≤ Cnη .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. 
For all ","element":"span"},{"style":{"height":16},"width":1027.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-10.png","element":"img","alt":" i ∈ ST then ℓ(Tend, xi) = 0 while ℓ(Tend, xi) = 1 for all i ∈ SF .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. The generalization error satisfies","element":"span"}],[{"style":{"width":"41%"},"width":661,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-11.png","element":"img"}],[{"text":"We remark that the upper bound on ","element":"span"},{"style":{"height":10.59},"width":21.48,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-12.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"allows us to re-deploy the same proof technique used to prove convergence in the benign overfitting case, thereby ensuring the training process converges within a finite number of iterations. We conjecture this upper bound can be relaxed but leave such an analysis to future work.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof sketch. ","element":"span"},{"text":"In the context of no-overfitting we identify a “good” initialization as one for which ","element":"span"},{"style":{"height":19.9},"width":392.76,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-13.png","element":"img","alt":"maxi̸=l |⟨ni, nl⟩| ≤ ρ1−γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":387.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-14.png","element":"img","alt":" Γ = Γ−1 ∪ Γ+1 = [2m]","inline":true},{"text":". 
Under the conditions of Theorem ","element":"span"},{"href":"#id-37","text":"3.8 ","element":"a"},{"text":"it can ","element":"span"},{"text":"be shown that a good initialization in this context occurs with high probability; furthermore, the resulting activation pattern early in training is simple to characterize.","element":"span"}],[{"id":"id-93","style":{"fontWeight":"bold"},"text":"Lemma 3.9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that the conditions of Theorem ","element":"span"},{"href":"#id-37","style":{"fontStyle":"italic"},"text":"3.8 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and that we have a good initialization. Consider an arbitrary ","element":"span"},{"style":{"height":16},"width":144.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-15.png","element":"img","alt":" j ∈ [2m]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and iteration ","element":"span"},{"style":{"height":14},"width":178.28,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-16.png","element":"img","alt":" 2 ≤ t ≤ T0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"occurring before a point has achieved zero loss. Then ","element":"span"},{"style":{"height":23.52},"width":281.16,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-17.png","element":"img","alt":" i ∈ A(t)j iff i ∼ j.","inline":true}],[{"text":"Next we bound the activations of the training points just before ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-18.png","element":"img","alt":" T0","inline":true},{"text":", the first iteration at which any training point achieves zero loss. 
In the following we use ","element":"span"},{"style":{"height":14.4},"width":223.56,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/7-19.png","element":"img","alt":" F1, F2 and F3","inline":true,"padRight":true},{"text":"as placeholders for expressions depending on the data and model parameters. Here, for the sake of conveying the ideas in the proof we do not write them in full and refer the reader to Supplementary Material.","element":"span"}],[{"id":"id-94","style":{"fontWeight":"bold"},"text":"Lemma 3.10. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that the conditions of Theorem ","element":"span"},{"href":"#id-37","style":{"fontStyle":"italic"},"text":"3.8 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and that we have a good initialization, then there is an iteration ","element":"span"},{"style":{"height":14},"width":37.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-0.png","element":"img","alt":" T1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"before any point achieves zero loss such that","element":"span"}],[{"style":{"width":"34%"},"width":547,"height":278,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-1.png","element":"img"}],[{"text":"Next we seek to ensure the activation patterns remain mostly fixed: in particular, we show ","element":"span"},{"style":{"height":23.52},"width":169,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-2.png","element":"img","alt":" i ∈ A(t)j if","inline":true},{"style":{"height":23.52},"width":666.16,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-3.png","element":"img","alt":"i ∈ ST and i ∼ j, while i /∈ A(t)j if i ̸∼ j.","inline":true}],[{"id":"id-97","style":{"fontWeight":"bold"},"text":"Lemma 3.11. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that the conditions of Theorem ","element":"span"},{"href":"#id-37","style":{"fontStyle":"italic"},"text":"3.8 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and that we have a good initialization. In addition, for ","element":"span"},{"style":{"height":14},"width":147.68,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-4.png","element":"img","alt":" a, b ∈ R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"assume there is a time ","element":"span"},{"style":{"height":12.61},"width":28.52,"height":31.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-5.png","element":"img","alt":" t0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":223.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-6.png","element":"img","alt":" ℓ(t0, xi) ≤ a","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":13.18},"width":123.68,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-7.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":23.52},"width":302.84,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-8.png","element":"img","alt":"ϕ(⟨w(t0)j , xi⟩) ≤ b","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":13.41},"width":112.48,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-9.png","element":"img","alt":" i ∈ SF","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and 
","element":"span"},{"style":{"height":13.6},"width":87.44,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-10.png","element":"img","alt":" i ∼ j","inline":true},{"style":{"fontStyle":"italic"},"text":". If ","element":"span"},{"style":{"height":14.4},"width":222,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-11.png","element":"img","alt":" i ∈ ST , i ∼ j","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"implies ","element":"span"},{"style":{"height":23.52},"width":140.2,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-12.png","element":"img","alt":" i ∈ A(τ)j","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":15.2},"width":87.48,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-13.png","element":"img","alt":" i ̸∼ j","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"implies ","element":"span"},{"style":{"height":23.52},"width":278.56,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-14.png","element":"img","alt":"i /∈ A(τ)j for all τ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfying ","element":"span"},{"style":{"height":13.6},"width":263.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-15.png","element":"img","alt":" t0 ≤ τ < t, then","inline":true}],[{"style":{"width":"71%"},"width":1128,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-16.png","element":"img"}],[{"text":"As before, this update bound is finite and iteration-independent, therefore GD converges provided the assumptions on the activation patterns are not violated. Furthermore, if these activation patterns do hold, then every clean point activates a neuron and no corrupt point activates a neuron of the same label sign. 
Therefore, under the assumption on the activation pattern, at convergence clean points achieve zero loss while corrupt points have non-zero loss, i.e., they activate no neurons. It therefore suffices to prove the condition on the activation pattern, which we show holds as long as","element":"span"}],[{"style":{"width":"71%"},"width":1134,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-17.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"does not depend on the parameters, we can ensure this condition holds by letting ","element":"span"},{"style":{"height":16},"width":221.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-18.png","element":"img","alt":" Ck(γ + ρ) be","inline":true,"padRight":true},{"text":"sufficiently small. With ","element":"span"},{"style":{"height":16.58},"width":159.24,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-19.png","element":"img","alt":" ρ ≤ cn−1","inline":true},{"text":", we show it suffices that ","element":"span"},{"style":{"height":16.98},"width":159.6,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-20.png","element":"img","alt":" γ < ck−1","inline":true},{"text":". Finally, the generalization result follows in a fashion almost identical to that used for Lemma ","element":"span"},{"href":"#id-41","text":"3.5.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"3.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Comparison of results","element":"span"}],[{"text":"We compare the differing regimes of our results side-by-side with those of ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. 
","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a"},{"text":"in Table ","element":"span"},{"href":"#id-42","text":"1. ","element":"a"},{"text":"We note that comparisons are not like-for-like as ","element":"span"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022) ","element":"a"},{"text":"consider smooth, leaky ReLU and logistic loss, ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a"},{"text":"a generalized family of activation functions, which includes ReLU, and logistic loss, and this paper ReLU and hinge loss. Furthermore, in addition to differences in the noise distribution discussed in Section ","element":"span"},{"href":"#id-39","text":"2.1, ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"Frei et al. ","element":"a"},{"href":"#id-22","referenceIndex":14,"text":"(2022)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a"},{"text":"assume a data model where the norm of each data point is approximately proportional to","element":"span"},{"style":{"height":16.19},"width":51.48,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-21.png","element":"img","alt":"√d","inline":true},{"text":". 
We therefore re-scale their results in order to make comparison with this work in which all data points have unit norm.","element":"span"}],[{"text":"Taken together these results suggest, at least under the type of data model considered, that benign overfitting occurs for signal strengths proportional to between roughly ","element":"span"},{"style":{"height":18.4},"width":253,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-22.png","element":"img","alt":" 1/√dn and 1/n","inline":true},{"text":". Furthermore, our results also suggest that above approximately ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/n ","element":"span"},{"text":"one might expect to see a transition to no-overfitting, while below approximately ","element":"span"},{"style":{"height":18.38},"width":117.96,"height":45.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-23.png","element":"img","alt":" 1/√nd","inline":true,"padRight":true},{"text":"a transition to harmful overfitting. We provide preliminary supporting experiments in the Supplementary Material. 
We again remark that the latter is non-trivial in our setting as for all ","element":"span"},{"style":{"height":14.19},"width":100,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-24.png","element":"img","alt":" γ > 0","inline":true,"padRight":true},{"text":"the classifier ","element":"span"},{"style":{"height":16},"width":329,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/8-25.png","element":"img","alt":" h(x) = sign(⟨v, x⟩)","inline":true,"padRight":true},{"text":"always has perfect accuracy.","element":"span"}],[{"text":"Table 1: across all results ","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":113.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/9-0.png","element":"img","alt":" k ≤ cn","inline":true,"padRight":true},{"text":"while ","element":"figcaption","subtype":"caption"},{"style":{"height":17.2},"width":294.48,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/9-1.png","element":"img","alt":" d ≥ Cn2 log(n/δ)","inline":true,"padRight":true},{"text":"for ","element":"figcaption","subtype":"caption"},{"href":"#id-22","referenceIndex":14,"text":"(Frei et al., ","element":"a","subtype":"caption"},{"href":"#id-22","referenceIndex":14,"text":"2022)","element":"a","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"href":"#id-24","referenceIndex":36,"text":"Xu & Gu ","element":"a","subtype":"caption"},{"href":"#id-24","referenceIndex":36,"text":"(2023) ","element":"a","subtype":"caption"},{"text":"and Theorem ","element":"figcaption","subtype":"caption"},{"id":"id-42","href":"#id-35","text":"3.1.","element":"a","subtype":"caption"}],[{"style":{"width":"99%"},"width":1570,"height":630,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/9-2.png","element":"img"}]]},{"heading":"4 Conclusion","paragraphs":[[{"text":"Developing a theoretical description of benign 
overfitting in neural networks is a nascent area, with mathematical results available only for very limited data models. Furthermore, the conditions describing the transitions between overfitting versus non-overfitting, and benign versus non-benign, are yet to be fully characterized even in these simplified settings. The goal of this work was to address this issue as well as explore the impact of using the hinge loss. In particular, and admittedly for a simple data model, we prove three different training outcomes, corresponding to non-benign overfitting, benign overfitting and no-overfitting, based on conditions on the margin of the clean data. Our analysis also differs significantly from prior works because the ratio of loss between different training points can be unbounded and the implicit bias of using hinge loss versus exponentially tailed loss is poorly understood.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Limitations and future work: ","element":"span"},{"text":"the key limitation of this work is the restrictiveness of the data model. In particular, as in prior and related works, we use a near-orthogonal noise model and assume a rank-one signal; we also place additional conditions on the noise distribution. In addition to generalizing the signal and noise model as well as improving the bounds required for our results to hold, we believe the following themes are important areas for future research: first, relaxing the near-orthogonal noise condition; second, exploring data models beyond those which are linearly separable; third, investigating the role and impact of depth.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"EG, WS and DN were partially supported by NSF DMS 2011140 and NSF DMS 2108479. EG was also partially supported by NSF DGE 2034835.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-21","text":"Ben Adlam and Jeffrey Pennington. 
The neural tangent kernel in high dimensions: Triple descent and ","element":"span"},{"text":"a multi-scale theory of generalization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-87","text":"Hilal Asi and John C. Duchi. Stochastic (approximate) proximal point methods: Convergence, ","element":"span"},{"text":"optimality, and adaptivity. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 29(3):2257–2290, 2019. doi: 10.1137/ 18M1230323. URL ","element":"span"},{"href":"https://doi.org/10.1137/18M1230323","text":"https://doi.org/10.1137/18M1230323","element":"a"},{"text":".","element":"span"}],[{"id":"id-43","text":"Keith Ball. An elementary introduction to modern convex geometry. 1997.","element":"span"}],[{"id":"id-45","text":"Rémi Bardenet and Odalric-Ambrym Maillard. Concentration inequalities for sampling without ","element":"span"},{"text":"replacement. 2015.","element":"span"}],[{"id":"id-3","text":"Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear ","element":"span"},{"text":"regression. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the National Academy of Sciences","element":"span"},{"text":", 117(48):30063–30070, 2020. doi: 10.1073/pnas.1907378117. URL ","element":"span"},{"href":"https://www.pnas.org/doi/abs/10.1073/pnas.1907378117","text":"https://www.pnas.org/doi/abs/10.1073/pnas. ","element":"a"},{"href":"https://www.pnas.org/doi/abs/10.1073/pnas.1907378117","text":"1907378117","element":"a"},{"text":".","element":"span"}],[{"id":"id-16","text":"Mikhail Belkin, Daniel J Hsu, and Partha Mitra. ","element":"span"},{"text":"Overfitting or perfect fitting? ","element":"span"},{"text":"Risk bounds for classification and regression rules that interpolate. 
","element":"span"},{"text":"In S. Bengio, H. Wallach, ","element":"span"},{"text":"H. ","element":"span"},{"text":"Larochelle, ","element":"span"},{"text":"K. ","element":"span"},{"text":"Grauman, ","element":"span"},{"text":"N. ","element":"span"},{"text":"Cesa-Bianchi, ","element":"span"},{"text":"and ","element":"span"},{"text":"R. ","element":"span"},{"text":"Garnett ","element":"span"},{"text":"(eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 31. Curran Associates, Inc., 2018a. ","element":"span"},{"text":"URL ","element":"span"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2018/file/e22312179bf43e61576081a2f250f845-Paper.pdf","text":"https://proceedings.neurips.cc/paper_files/paper/2018/ ","element":"a"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2018/file/e22312179bf43e61576081a2f250f845-Paper.pdf","text":"file/e22312179bf43e61576081a2f250f845-Paper.pdf","element":"a"},{"text":".","element":"span"}],[{"id":"id-1","text":"Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand ","element":"span"},{"text":"kernel learning. In Jennifer Dy and Andreas Krause (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 35th International Conference on Machine Learning","element":"span"},{"text":", volume 80 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pp. 541–549. PMLR, 10–15 Jul 2018b. URL ","element":"span"},{"href":"https://proceedings.mlr.press/v80/belkin18a.html","text":"https://proceedings.mlr.press/v80/ ","element":"a"},{"href":"https://proceedings.mlr.press/v80/belkin18a.html","text":"belkin18a.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-2","text":"Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 
Reconciling modern machine-learning ","element":"span"},{"text":"practice and the classical bias–variance trade-off. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the National Academy of Sciences","element":"span"},{"text":", 116(32):15849–15854, 2019.","element":"span"}],[{"id":"id-31","text":"Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. ","element":"span"},{"text":"SGD learns over-parameterized networks that provably generalize on linearly separable data. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018. URL ","element":"span"},{"href":"https://openreview.net/forum?id=rJ33wwxRb","text":"https://openreview.net/forum? ","element":"a"},{"href":"https://openreview.net/forum?id=rJ33wwxRb","text":"id=rJ33wwxRb","element":"a"},{"text":".","element":"span"}],[{"id":"id-12","text":"Yuan Cao, Quanquan Gu, and Mikhail Belkin. ","element":"span"},{"text":"Risk bounds for over-parameterized maximum margin classification on sub-gaussian mixtures. ","element":"span"},{"text":"In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 34, pp. 8407–8418. Curran Associates, Inc., 2021. ","element":"span"},{"text":"URL ","element":"span"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2021/file/46e0eae7d5217c79c3ef6b4c212b8c6f-Paper.pdf","text":"https://proceedings.neurips.cc/paper_files/paper/2021/ ","element":"a"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2021/file/46e0eae7d5217c79c3ef6b4c212b8c6f-Paper.pdf","text":"file/46e0eae7d5217c79c3ef6b4c212b8c6f-Paper.pdf","element":"a"},{"text":".","element":"span"}],[{"id":"id-25","text":"Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. 
Benign overfitting in two-layer convo- ","element":"span"},{"text":"lutional neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 35:25237–25250, 2022.","element":"span"}],[{"id":"id-6","text":"Niladri S. Chatterji and Philip M. Long. Finite-sample analysis of interpolating linear classifiers in ","element":"span"},{"text":"the overparameterized regime. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 22(129):1–30, 2021. URL ","element":"span"},{"href":"http://jmlr.org/papers/v22/20-974.html","text":"http://jmlr.org/papers/v22/20-974.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-11","text":"Niladri S. Chatterji and Philip M. Long. Foolish crowds support benign overfitting. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 23(125):1–12, 2022. URL ","element":"span"},{"href":"http://jmlr.org/papers/v23/21-1199.html","text":"http://jmlr.org/papers/v23/ ","element":"a"},{"href":"http://jmlr.org/papers/v23/21-1199.html","text":"21-1199.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-22","text":"Spencer Frei, Niladri S Chatterji, and Peter Bartlett. Benign overfitting without linearity: Neural ","element":"span"},{"text":"network classifiers trained by gradient descent for noisy linear data. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pp. 2668–2703. PMLR, 2022.","element":"span"}],[{"id":"id-23","text":"Spencer Frei, Gal Vardi, Peter Bartlett, and Nathan Srebro. Benign overfitting in linear classifiers ","element":"span"},{"text":"and leaky relu networks from kkt conditions for margin maximization. 
In Gergely Neu and Lorenzo Rosasco (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Thirty Sixth Conference on Learning Theory","element":"span"},{"text":", volume 195 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pp. 3173–3228. PMLR, 12–15 Jul 2023. URL ","element":"span"},{"href":"https://proceedings.mlr.press/v195/frei23a.html","text":"https://proceedings.mlr.press/v195/frei23a.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-8","text":"Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high- ","element":"span"},{"text":"dimensional ridgeless least squares interpolation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 50(2):949 – 986, 2022. doi: 10.1214/21-AOS2133. URL ","element":"span"},{"href":"https://doi.org/10.1214/21-AOS2133","text":"https://doi.org/10.1214/21-AOS2133","element":"a"},{"text":".","element":"span"}],[{"id":"id-20","text":"Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and general- ","element":"span"},{"text":"ization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 31. Curran Associates, Inc., 2018. 
URL ","element":"span"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf","text":"https://proceedings.neurips.cc/paper_files/ ","element":"a"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf","text":"paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf","element":"a"},{"text":".","element":"span"}],[{"id":"id-29","text":"Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Neural Information Processing Systems","element":"span"},{"text":", NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.","element":"span"}],[{"id":"id-30","text":"Ziwei Ji, Miroslav Dudík, Robert E. Schapire, and Matus Telgarsky. Gradient descent follows the ","element":"span"},{"text":"regularization path for general losses. In Jacob Abernethy and Shivani Agarwal (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Thirty Third Conference on Learning Theory","element":"span"},{"text":", volume 125 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pp. 2109–2136. PMLR, 09–12 Jul 2020. URL ","element":"span"},{"href":"https://proceedings.mlr.press/v125/ji20a.html","text":"https://proceedings.mlr. ","element":"a"},{"href":"https://proceedings.mlr.press/v125/ji20a.html","text":"press/v125/ji20a.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-9","text":"Frederic Koehler, Lijia Zhou, Danica J. Sutherland, and Nathan Srebro. Uniform convergence of ","element":"span"},{"text":"interpolators: Gaussian width, norm bounds and benign overfitting. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. 
Wortman Vaughan (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 2021. URL ","element":"span"},{"href":"https://openreview.net/forum?id=FyOhThdDBM","text":"https://openreview.net/forum?id=FyOhThdDBM","element":"a"},{"text":".","element":"span"}],[{"id":"id-27","text":"Guy Kornowski, Gilad Yehudai, and Ohad Shamir. From tempered to benign overfitting in relu neural ","element":"span"},{"text":"networks, 2023.","element":"span"}],[{"id":"id-26","text":"Yiwen Kou, Zixiang Chen, Yuanzhou Chen, and Quanquan Gu. Benign overfitting for two-layer relu ","element":"span"},{"text":"networks, 2023.","element":"span"}],[{"id":"id-18","text":"Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “Ridgeless” regression can gen- ","element":"span"},{"text":"eralize. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 48(3):1329 – 1347, 2020. doi: 10.1214/19-AOS1849. URL ","element":"span"},{"href":"https://doi.org/10.1214/19-AOS1849","text":"https://doi.org/10.1214/19-AOS1849","element":"a"},{"text":".","element":"span"}],[{"id":"id-19","text":"Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the multiple descent of minimum-norm ","element":"span"},{"text":"interpolants and restricted lower isometry of kernels. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annual Conference Computational Learning Theory","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-28","text":"Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020","element":"span"},{"text":". OpenReview.net, 2020. 
URL ","element":"span"},{"href":"https://openreview.net/forum?id=SJeLIgBKPS","text":"https://openreview.net/forum?id= ","element":"a"},{"href":"https://openreview.net/forum?id=SJeLIgBKPS","text":"SJeLIgBKPS","element":"a"},{"text":".","element":"span"}],[{"id":"id-34","text":"Neil Rohit Mallinar, James B Simon, Amirhesam Abedsoltan, Parthe Pandit, Misha Belkin, and ","element":"span"},{"text":"Preetum Nakkiran. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 2022. URL ","element":"span"},{"href":"https://openreview.net/forum?id=5oS20NUCJEX","text":"https://openreview.net/forum? ","element":"a"},{"href":"https://openreview.net/forum?id=5oS20NUCJEX","text":"id=5oS20NUCJEX","element":"a"},{"text":".","element":"span"}],[{"id":"id-17","text":"Song Mei and Andrea Montanari. The generalization error of random features regression: Precise ","element":"span"},{"text":"asymptotics and the double descent curve. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Communications on Pure and Applied Mathematics","element":"span"},{"text":", 75, 2019.","element":"span"}],[{"id":"id-4","text":"Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpo- ","element":"span"},{"text":"lation of noisy data in regression. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Journal on Selected Areas in Information Theory","element":"span"},{"text":", 1(1): 67–83, 2020. doi: 10.1109/JSAIT.2020.2984716.","element":"span"}],[{"id":"id-14","text":"Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant ","element":"span"},{"text":"Sahai. Classification vs regression in overparameterized regimes: Does the loss function matter? 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"J. Mach. Learn. Res.","element":"span"},{"text":", 22(1), Jan 2021. ISSN 1532-4435.","element":"span"}],[{"id":"id-86","text":"Li S. Concise formulas for the area and volume of a hyperspherical cap. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Asian Journal of Mathematics and Statistics","element":"span"},{"text":", 4, 01 2011. doi: 10.3923/ajms.2011.66.70.","element":"span"}],[{"id":"id-13","text":"Ohad Shamir. The implicit bias of benign overfitting. In Po-Ling Loh and Maxim Raginsky (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Thirty Fifth Conference on Learning Theory","element":"span"},{"text":", volume 178 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pp. 448–478. PMLR, 02–05 Jul 2022. URL ","element":"span"},{"href":"https://proceedings.mlr.press/v178/shamir22a.html","text":"https://proceedings.mlr. ","element":"a"},{"href":"https://proceedings.mlr.press/v178/shamir22a.html","text":"press/v178/shamir22a.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-32","text":"Gang Wang, Georgios B. Giannakis, and Jie Chen. Learning ReLU networks on linearly separable ","element":"span"},{"text":"data: Algorithm, optimality, and generalization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Signal Processing","element":"span"},{"text":", 67(9): 2357–2370, 2019. doi: 10.1109/TSP.2019.2904921.","element":"span"}],[{"id":"id-10","text":"Guillaume Wang, Konstantin Donhauser, and Fanny Yang. Tight bounds for minimum ","element":"span"},{"style":{"height":13.81},"width":29.48,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/12-0.png","element":"img","alt":" ℓ1","inline":true},{"text":"-norm interpolation of noisy data. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", 2021a.","element":"span"}],[{"id":"id-15","text":"Ke Wang, Vidya Muthukumar, and Christos Thrampoulidis. ","element":"span"},{"text":"Benign overfitting in multiclass classification: ","element":"span"},{"text":"All roads lead to interpolation. ","element":"span"},{"text":"In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 34, pp. 24164–24179. Curran Associates, Inc., 2021b. ","element":"span"},{"text":"URL ","element":"span"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2021/file/caaa29eab72b231b0af62fbdff89bfce-Paper.pdf","text":"https://proceedings.neurips.cc/paper_files/paper/2021/ ","element":"a"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2021/file/caaa29eab72b231b0af62fbdff89bfce-Paper.pdf","text":"file/caaa29eab72b231b0af62fbdff89bfce-Paper.pdf","element":"a"},{"text":".","element":"span"}],[{"id":"id-5","text":"Denny Wu and Ji Xu. On the optimal weighted ","element":"span"},{"style":{"height":13.79},"width":31,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/12-1.png","element":"img","alt":" ℓ2","inline":true,"padRight":true},{"text":"regularization in overparameterized linear regression. ","element":"span"},{"text":"In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 33, pp. 10112–10123. Curran Associates, Inc., 2020. 
URL ","element":"span"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2020/file/72e6d3238361fe70f22fb0ac624a7072-Paper.pdf","text":"https://proceedings.neurips.cc/paper_files/paper/ ","element":"a"},{"href":"https://proceedings.neurips.cc/paper_files/paper/2020/file/72e6d3238361fe70f22fb0ac624a7072-Paper.pdf","text":"2020/file/72e6d3238361fe70f22fb0ac624a7072-Paper.pdf","element":"a"},{"text":".","element":"span"}],[{"id":"id-24","text":"Xingyu Xu and Yuantao Gu. Benign overfitting of non-smooth neural networks beyond lazy training. ","element":"span"},{"text":"In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 26th International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", volume 206 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pp. 11094–11117. PMLR, 25–27 Apr 2023. URL ","element":"span"},{"href":"https://proceedings.mlr.press/v206/xu23k.html","text":"https: ","element":"a"},{"href":"https://proceedings.mlr.press/v206/xu23k.html","text":"//proceedings.mlr.press/v206/xu23k.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-33","text":"Qiuling Yang, Alireza Sadeghi, Gang Wang, and Jian Sun. Learning two-layer ReLU networks ","element":"span"},{"text":"is nearly as easy as learning linear classifiers on separable data. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Signal Processing","element":"span"},{"text":", 69:4416–4427, 2021. doi: 10.1109/TSP.2021.3094911.","element":"span"}],[{"id":"id-0","text":"Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understand- ","element":"span"},{"text":"ing deep learning requires rethinking generalization. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-7","text":"Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Benign overfit- ","element":"span"},{"text":"ting of constant-stepsize SGD for linear regression. In Mikhail Belkin and Samory Kpotufe (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Thirty Fourth Conference on Learning Theory","element":"span"},{"text":", volume 134 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pp. 4633–4635. PMLR, 15–19 Aug 2021. ","element":"span"},{"text":"URL ","element":"span"},{"href":"https://proceedings.mlr.press/v134/zou21a.html","text":"https://proceedings.mlr.press/v134/zou21a.html","element":"a"},{"text":".","element":"span"}]]},{"heading":"Appendix A Properties of the data and network at initialization","paragraphs":[[{"text":"For each of our results to hold we require certain properties on both the network weights and training sample to hold at initialization. Here we bound the probabilities of these events in turn. Later, for each specific setting we combine the relevant conditions using the union bound.","element":"span"}],[{"text":"First, and in order to prove convergence, we require the noise components of the training sample to be approximately orthogonal to one another.","element":"span"}],[{"id":"id-62","style":{"fontWeight":"bold"},"text":"Lemma A.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":200.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-0.png","element":"img","alt":" ρ, δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Given a sequence ","element":"span"},{"style":{"height":17.39},"width":115,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-1.png","element":"img","alt":" (n_i)_{i=1}^{2n}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"of mutually i.i.d. random vectors with ","element":"span"},{"style":{"height":17.39},"width":685,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-2.png","element":"img","alt":"n_i ∼ U(S^{d−1} ∩ span(v)^⊥) for all i ∈ [2n]","inline":true},{"style":{"fontStyle":"italic"},"text":", then assuming ","element":"span"},{"style":{"height":29.2},"width":483.96,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-3.png","element":"img","alt":" d ≥ max{3, 3ρ^{−2} ln(2n^2/δ)}","inline":true}],[{"style":{"width":"42%"},"width":676,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Consider two pairs of mutually i.i.d. 
random vectors ","element":"span"},{"style":{"height":17.39},"width":496.2,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-5.png","element":"img","alt":" n, n′ ∼ U(Sd−1 ∩ span(v)⊥)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":420.56,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-6.png","element":"img","alt":"u, u′ ∼ U(Sd−2), observe","inline":true}],[{"style":{"width":"18%"},"width":288,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-7.png","element":"img"}],[{"text":"Due to the fact that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"u ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":7.2},"width":39.48,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-8.png","element":"img","alt":" u′","inline":true,"padRight":true},{"text":"are independent as well as the rotational invariance of ","element":"span"},{"style":{"height":17.38},"width":144.6,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-9.png","element":"img","alt":" U(Sd−2)","inline":true,"padRight":true},{"text":"it follows that","element":"span"}],[{"style":{"width":"18%"},"width":288,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.38},"width":252.12,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-11.png","element":"img","alt":" e1 := [1, 0..0]T","inline":true,"padRight":true},{"text":". 
Let Cap","element":"span"},{"style":{"height":17.38},"width":608,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-12.png","element":"img","alt":"(e_1, ρ) := {z ∈ S^{d−2} : ⟨e_1, z⟩ ≥ ρ}","inline":true,"padRight":true},{"text":"denote the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"spherical cap ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":13.39},"width":79.64,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-13.png","element":"img","alt":"S^{d−2} ","inline":true,"padRight":true},{"text":"centered on ","element":"span"},{"style":{"height":13.2},"width":209.28,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-14.png","element":"img","alt":" e_1. As d ≥ 3","inline":true,"padRight":true},{"text":"then from ","element":"span"},{"href":"#id-43","referenceIndex":3,"text":"Ball ","element":"a"},{"href":"#id-43","referenceIndex":3,"text":"(1997)","element":"a"},{"text":" [Lemma 2.2] it follows that","element":"span"}],[{"style":{"width":"81%"},"width":1289,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-15.png","element":"img"}],[{"text":"Applying the union bound","element":"span"}],[{"style":{"width":"74%"},"width":1175,"height":321,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-16.png","element":"img"}],[{"text":"Setting ","element":"span"},{"style":{"height":28.8},"width":345.28,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-17.png","element":"img","alt":" δ ≥ 2n^2 exp(−dρ^2/3)","inline":true},{"text":"and rearranging we arrive at the result claimed.","element":"span"}],[{"text":"In addition to requiring the approximate orthogonality property on the training data, our approach also requires a large proportion of the neurons at initialization to satisfy particular conditions in regard to the number of clean versus corrupt activations. 
To this end, we introduce the following terms where ","element":"span"},{"style":{"height":16},"width":205,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-18.png","element":"img","alt":" p ∈ {−1, 1}.","inline":true}],[{"text":"• Let ","element":"span"},{"style":{"height":23.1},"width":1138.72,"height":57.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-19.png","element":"img","alt":" Γp := {j : (−1)j = p, Gj(0, 1)(γ − ρ) − Bj(0, 1)(γ + ρ) ≥ 2λwη }","inline":true,"padRight":true},{"text":"denote the set","element":"span"},{"text":"of neurons with output weight ","element":"span"},{"style":{"height":16},"width":98.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-20.png","element":"img","alt":" (−1)p","inline":true,"padRight":true},{"text":"which have more clean points activating them than corrupt ones at initialization. We will show that these sets of neurons have a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"predictable ","element":"span"},{"text":"behavior early during training before any clean points achieve zero loss. We further let ","element":"span"},{"style":{"height":13.2},"width":242.76,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-21.png","element":"img","alt":"Γ = Γ1 ∪ Γ−1.","inline":true}],[{"text":"• Let ","element":"span"},{"style":{"height":16.78},"width":1228.36,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-22.png","element":"img","alt":" Θp := {j ∼ Γp : Gj(0, 1)(γ + ρ) − Bj(0, 1)(γ − ρ) < 1 − γ + ρ} ⊂ Γp","inline":true},{"text":". For our benign overfitting result we will show that neurons in this subset are able to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"carry ","element":"span"},{"text":"corrupt points throughout training, eventually, at least in the overfitting setting, enabling them to achieve zero loss. 
We further let ","element":"span"},{"style":{"height":13.18},"width":261.08,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-23.png","element":"img","alt":" Θ = Θ1 ∪ Θ−1.","inline":true}],[{"text":"First we show ","element":"span"},{"style":{"height":11.01},"width":22.48,"height":27.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/13-24.png","element":"img","alt":" Γ","inline":true,"padRight":true},{"text":"accounts for a significant proportion of neurons. To this end we first provide the following result.","element":"span"}],[{"id":"id-46","style":{"fontWeight":"bold"},"text":"Lemma A.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Define ","element":"span"},{"style":{"height":20.98},"width":154.72,"height":52.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-0.png","element":"img","alt":" µ := 2kn+k","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and assume ","element":"span"},{"style":{"height":16},"width":161.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-1.png","element":"img","alt":" κ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfies ","element":"span"},{"style":{"height":12.21},"width":98,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-2.png","element":"img","alt":" κ > µ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Given an arbitrary neuron ","element":"span"},{"style":{"height":18.18},"width":248.28,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-3.png","element":"img","alt":"wj ∼ U(Sd−1)","inline":true},{"style":{"fontStyle":"italic"},"text":", we say that a collection of training points is ","element":"span"},{"style":{"height":16},"width":90.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-4.png","element":"img","alt":" (ε, κ)","inline":true},{"style":{"fontStyle":"italic"},"text":"-good iff both ","element":"span"},{"style":{"height":16.8},"width":202.44,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-5.png","element":"img","alt":" Tj(0, 1) ≥ 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.78},"width":340.8,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-6.png","element":"img","alt":"Bj(0, 1) < κTj(0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with probability at least ","element":"span"},{"style":{"height":10.99},"width":80.48,"height":27.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-7.png","element":"img","alt":" 1 − ϵ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"over the randomness of the neuron. 
There exist positive constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, c ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that if ","element":"span"},{"style":{"height":17.39},"width":577.88,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-8.png","element":"img","alt":" δ := exp(−cn(κ − µ2)) and n ≥ C","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then with probability at least ","element":"span"},{"style":{"height":20},"width":86,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-9.png","element":"img","alt":"1 − δϵ ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the training sample is ","element":"span"},{"style":{"height":16},"width":193.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-10.png","element":"img","alt":" (ε, κ)-good.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"First we establish certain pieces of notation specific to what follows: we say a point ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"is positive iff ","element":"span"},{"style":{"height":16},"width":177.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-11.png","element":"img","alt":" ⟨x, v⟩ > 0","inline":true,"padRight":true},{"text":"and is negative iff ","element":"span"},{"style":{"height":16},"width":177.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-12.png","element":"img","alt":" ⟨x, v⟩ < 0","inline":true},{"text":". 
We use ","element":"span"},{"style":{"height":13.2},"width":49.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-13.png","element":"img","alt":" S+","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.79},"width":49,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-14.png","element":"img","alt":" S−","inline":true,"padRight":true},{"text":"to denote these sets of points respectively. Note by construction, see ","element":"span"},{"href":"#id-44","text":"(1)","element":"a"},{"text":", clean and corrupt points of the same sign are mutually i.i.d. As here we only ever consider one neuron and the activations of the training sample on this neuron at initialization, we also drop both the subscript ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"as well as the argument parentheses on the counting functions. We also use ","element":"span"},{"style":{"height":10.8},"width":27.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-15.png","element":"img","alt":" ±","inline":true,"padRight":true},{"text":"superscripts to denote the subsets corresponding to activations from positive and negative points respectively: as examples ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is used as shorthand for the total number of activations, ","element":"span"},{"style":{"height":12.8},"width":53.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-16.png","element":"img","alt":" B+ ","inline":true,"padRight":true},{"text":"is the number of corrupt positive activations and ","element":"span"},{"style":{"height":11.81},"width":52,"height":29.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-17.png","element":"img","alt":" G− ","inline":true,"padRight":true},{"text":"is the number of clean negative 
activations.","element":"span"}],[{"text":"First, by the symmetry of the distribution of ","element":"span"},{"style":{"height":19.38},"width":652.08,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-18.png","element":"img","alt":" w, P(⟨w, v⟩ > 0) = P(⟨w, v⟩ < 0) = 1/2","inline":true},{"text":". As a result","element":"span"}],[{"style":{"width":"69%"},"width":1101,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-19.png","element":"img"}],[{"text":"As the analysis and results derived under either condition will prove identical under reversal of the signs involved, without loss of generality we let ","element":"span"},{"style":{"height":16},"width":175,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-20.png","element":"img","alt":" ⟨w, v⟩ > 0","inline":true},{"text":". Using the union bound","element":"span"}],[{"style":{"height":16},"width":1572.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-21.png","element":"img","alt":"P((B < κT) ∩ (T > 0) | ⟨w, v⟩ > 0) ≥ 1 − P(T = 0 | ⟨w, v⟩ > 0) − P(B ≥ κT | ⟨w, v⟩ > 0),","inline":true,"padRight":true},{"text":"therefore it suffices to upper bound the two probabilities on the right-hand side.","element":"span"}],[{"text":"Observe that if ","element":"span"},{"style":{"height":16},"width":196,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-22.png","element":"img","alt":" ⟨w, v⟩ > 0","inline":true,"padRight":true},{"text":"then for ","element":"span"},{"style":{"height":13.79},"width":140.64,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-23.png","element":"img","alt":" x ∈ S+","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"style":{"height":16},"width":290.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-24.png","element":"img","alt":" P(⟨x, w⟩ > 0) > 1/2","inline":true,"padRight":true},{"text":"and for ","element":"span"},{"style":{"height":11.6},"width":140.64,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-25.png","element":"img","alt":" x ∈ S−","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"style":{"height":16},"width":288.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-26.png","element":"img","alt":"P(⟨x, w⟩ > 0) < 1/2","inline":true},{"text":". By the mutual independence of the preactivations ","element":"span"},{"style":{"height":18.18},"width":205.52,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-27.png","element":"img","alt":" (⟨x_i, w_j⟩)_{i=1}^{2n}","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":16},"width":412.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-28.png","element":"img","alt":"0 | ⟨w, v⟩ > 0) ≤ (1/2)^n","inline":true},{"text":". Consider now a slightly different data model, in which a training sample consists of ","element":"span"},{"style":{"height":10.8},"width":97.8,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-29.png","element":"img","alt":" n − k","inline":true,"padRight":true},{"text":"clean positive points and ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"corrupt positive points. 
Abusing notation, we let ","element":"span"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-30.png","element":"img","alt":"ζ","inline":true,"padRight":true},{"text":"denote the event that we are instead drawing our training sample in this manner and also that ","element":"span"},{"style":{"height":16},"width":180.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-31.png","element":"img","alt":"⟨w, v⟩ > 0","inline":true},{"text":". In this setting ","element":"span"},{"style":{"height":12.98},"width":137.44,"height":32.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-32.png","element":"img","alt":" T + = T","inline":true,"padRight":true},{"text":"and furthermore the event ","element":"span"},{"style":{"height":11.6},"width":137.36,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-33.png","element":"img","alt":" B < κT","inline":true,"padRight":true},{"text":"is equivalent to ","element":"span"},{"style":{"height":13.78},"width":200.08,"height":34.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-34.png","element":"img","alt":" B+ < κT +.","inline":true,"padRight":true},{"text":"Again, as the preactivations are mutually independent the number of positive activations can be lower bounded using a binomial distribution with probability ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2","element":"span"},{"text":". 
Applying a Chernoff bound it follows that","element":"span"}],[{"style":{"width":"43%"},"width":696,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-35.png","element":"img"}],[{"text":"Furthermore, observe sampling positive points which activate ","element":"span"},{"style":{"height":11.98},"width":46.08,"height":29.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-36.png","element":"img","alt":" wj","inline":true,"padRight":true},{"text":"is equivalent to uniformly sampling without replacement ","element":"span"},{"style":{"height":12.99},"width":53.8,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-37.png","element":"img","alt":" T + ","inline":true,"padRight":true},{"text":"points from ","element":"span"},{"style":{"height":15.38},"width":384,"height":38.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-38.png","element":"img","alt":" S+. Let Zℓ = 1 iff the ℓ","inline":true},{"text":"-th element sampled from ","element":"span"},{"style":{"height":16.19},"width":211.16,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-39.png","element":"img","alt":" S+ is corrupt","inline":true,"padRight":true},{"text":"and is ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. 
Using a variant of Hoeffding’s bound for sampling without replacement (see for example Proposition 1.2 of ","element":"span"},{"href":"#id-45","referenceIndex":4,"text":"Bardenet & Maillard ","element":"a"},{"href":"#id-45","referenceIndex":4,"text":"(2015)","element":"a"},{"text":")","element":"span"}],[{"style":{"width":"81%"},"width":1296,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-40.png","element":"img"}],[{"text":"Therefore","element":"span"}],[{"style":{"width":"90%"},"width":1432,"height":280,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-41.png","element":"img"}],[{"text":"Combining these results it follows that","element":"span"}],[{"style":{"width":"98%"},"width":1564,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/14-42.png","element":"img"}],[{"text":"Therefore, there exist constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, c > ","element":"span"},{"text":"0 ","element":"span"},{"text":"such that if ","element":"span"},{"style":{"height":13.6},"width":187.04,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-0.png","element":"img","alt":" n ≥ C then","inline":true}],[{"style":{"width":"48%"},"width":761,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-1.png","element":"img"}],[{"text":"Note if instead we condition on the event ","element":"span"},{"style":{"height":16},"width":182.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-2.png","element":"img","alt":" ⟨x, w⟩ < 0","inline":true,"padRight":true},{"text":"then swapping the roles of the negative and positive points in the argument above gives the same outcome. 
As a result","element":"span"}],[{"style":{"width":"34%"},"width":547,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-3.png","element":"img"}],[{"text":"For convenience, let ","element":"span"},{"style":{"height":17.54},"width":242.56,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-4.png","element":"img","alt":" X := (x_i)_{i=1}^{2n}","inline":true,"padRight":true},{"text":"denote the training sample and ","element":"span"},{"style":{"height":18.34},"width":440.48,"height":45.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-5.png","element":"img","alt":" X^c_{ϵ,κ} := {X : P_w((B ≥","inline":true},{"style":{"height":16},"width":372.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-6.png","element":"img","alt":"κT) ∪ (T = 0)) > ϵ}","inline":true,"padRight":true},{"text":"the set of training samples which are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"style":{"height":16},"width":88.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-7.png","element":"img","alt":" (ϵ, κ)","inline":true},{"text":"-good. 
Note here that the subscript ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"indicates randomness over the neuron alone. Furthermore, by construction","element":"span"}],[{"style":{"width":"43%"},"width":685,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-8.png","element":"img"}],[{"text":"Moreover, as","element":"span"}],[{"style":{"width":"85%"},"width":1363,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-9.png","element":"img"}],[{"text":"it follows that ","element":"span"},{"style":{"height":21.02},"width":305.28,"height":52.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-10.png","element":"img","alt":" P(X ∈ X^c_{ϵ,κ}) ≤ δ/ϵ","inline":true},{"text":". As a result we conclude that the probability of drawing a ","element":"span"},{"style":{"height":16},"width":88.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-11.png","element":"img","alt":"(ϵ, κ)","inline":true},{"text":"-good training sample is at least ","element":"span"},{"style":{"height":20.18},"width":103.68,"height":50.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-12.png","element":"img","alt":" 1 − δ/ϵ.","inline":true}],[{"text":"Based on Lemma ","element":"span"},{"href":"#id-46","text":"A.2, ","element":"a"},{"text":"the following lemma bounds the probability that the cardinality of ","element":"span"},{"style":{"height":15.98},"width":177.48,"height":39.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-13.png","element":"img","alt":" Γp is large.","inline":true,"padRight":true},{"text":"We note that the result presented here on non-overfitting requires ","element":"span"},{"style":{"height":16.78},"width":153.6,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-14.png","element":"img","alt":" |Γp| = m","inline":true,"padRight":true},{"text":"while the 
result on benign overfitting requires that ","element":"span"},{"style":{"height":16.78},"width":278.88,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-15.png","element":"img","alt":" |Γp| ≥ (1 − α)m","inline":true,"padRight":true},{"text":"for some small constant ","element":"span"},{"style":{"height":16},"width":172.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-16.png","element":"img","alt":" α ∈ (0, 1).","inline":true}],[{"id":"id-50","style":{"fontWeight":"bold"},"text":"Lemma A.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"height":19.81},"width":375.2,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-17.png","element":"img","alt":" n ≥ 15k, λ_w ≤ η(γ−ρ)/(4γ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.4},"width":128.76,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-18.png","element":"img","alt":" γ ≥ 4ρ","inline":true},{"style":{"fontStyle":"italic"},"text":". Then there exist positive constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, c > ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that if ","element":"span"},{"style":{"height":13.2},"width":108.04,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-19.png","element":"img","alt":" n ≥ C","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the following are true.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"height":16.78},"width":553,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-20.png","element":"img","alt":" P (|Γp| = m) ≥ 1 − m exp(−cn).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. 
With ","element":"span"},{"style":{"height":16},"width":241.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-21.png","element":"img","alt":" α ∈ (0, 1) then","inline":true}],[{"style":{"width":"40%"},"width":635,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-22.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Again as here we only ever consider the activations at initialization, we write ","element":"span"},{"style":{"height":16.8},"width":231.96,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-23.png","element":"img","alt":" Tj(0, 1) as Tj,","inline":true},{"style":{"height":16.8},"width":137.16,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-24.png","element":"img","alt":"Gj(0, 1)","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":15.6},"width":44.36,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-25.png","element":"img","alt":" Gj","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.8},"width":136.08,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-26.png","element":"img","alt":" Bj(0, 1)","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":15.6},"width":43.24,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-27.png","element":"img","alt":" Bj","inline":true},{"text":". 
Let ","element":"span"},{"style":{"height":16},"width":201.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-28.png","element":"img","alt":" p ∈ {−1, 1}","inline":true,"padRight":true},{"text":"and consider an arbitrary neuron ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":16.99},"width":173.84,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-29.png","element":"img","alt":"(−1)^j = p","inline":true},{"text":". By definition, if","element":"span"}],[{"style":{"width":"33%"},"width":524,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-30.png","element":"img"}],[{"text":"we may conclude ","element":"span"},{"style":{"height":15.58},"width":109.32,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-31.png","element":"img","alt":" j ∈ Γp","inline":true},{"text":". Rearranging this expression, equivalently ","element":"span"},{"style":{"height":15.58},"width":147.76,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-32.png","element":"img","alt":" j ∈ Γp if","inline":true}],[{"style":{"width":"22%"},"width":364,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-33.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":22.03},"width":171.68,"height":55.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-34.png","element":"img","alt":"λ_w/η ≤ (γ−ρ)/(4γ)","inline":true,"padRight":true},{"text":", membership in ","element":"span"},{"style":{"height":15.58},"width":41.88,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-35.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"text":"is guaranteed as long as 
","element":"span"},{"style":{"height":15.58},"width":120.32,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-36.png","element":"img","alt":" Tj ≥ 1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.81},"width":213,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-37.png","element":"img","alt":" B_j ≤ ((γ−ρ)/(4γ)) T_j","inline":true},{"text":". Note that ","element":"span"},{"text":"by the assumptions of the lemma ","element":"span"},{"style":{"height":20.98},"width":249.6,"height":52.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-38.png","element":"img","alt":" µ := 2k/(n+k) ≤ 1/8","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.78},"width":162.6,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-39.png","element":"img","alt":" (γ−ρ)/(4γ) ≥ 3/16","inline":true},{"text":". Conditioning on the event that we ","element":"span"},{"text":"draw a ","element":"span"},{"style":{"height":19.38},"width":109.12,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-40.png","element":"img","alt":" (ε, 3/16)","inline":true},{"text":"-good training sample, the probability that ","element":"span"},{"style":{"height":16.8},"width":115.2,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-41.png","element":"img","alt":" j /∈ Γp","inline":true,"padRight":true},{"text":"is at most ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-42.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"by Lemma ","element":"span"},{"href":"#id-46","text":"A.2. ","element":"a"},{"text":"Furthermore, with the training sample fixed the activations of each neuron are mutually independent. 
Let ","element":"span"},{"style":{"height":17.54},"width":211.2,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-43.png","element":"img","alt":" X = (x_i)_{i=1}^{2n}","inline":true,"padRight":true},{"text":"denote the draw of the training sample and ","element":"span"},{"style":{"height":16.8},"width":589.96,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-44.png","element":"img","alt":" X_{ϵ,κ} = {X : P_w((B ≥ κT) ∪ (T =","inline":true},{"style":{"height":16},"width":150.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-45.png","element":"img","alt":"0)) ≤ ϵ}","inline":true,"padRight":true},{"text":"the set of ","element":"span"},{"style":{"height":16},"width":90.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-46.png","element":"img","alt":" (ε, κ)","inline":true},{"text":"-good training samples. In what follows we assume ","element":"span"},{"style":{"height":16},"width":166.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-47.png","element":"img","alt":" κ = 3/16","inline":true,"padRight":true},{"text":"and let ","element":"span"},{"style":{"height":16},"width":234.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-48.png","element":"img","alt":"ϵ = exp(−cn)","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":14.18},"width":152.04,"height":35.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-49.png","element":"img","alt":" c < 16^{−2}","inline":true,"padRight":true},{"text":"is a sufficiently small positive constant. Then by Lemma ","element":"span"},{"href":"#id-46","text":"A.2 ","element":"a"},{"text":"there exist constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, c ","element":"span"},{"text":"such that if 
","element":"span"},{"style":{"height":13.2},"width":108.08,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-50.png","element":"img","alt":" n ≥ C","inline":true}],[{"style":{"width":"42%"},"width":681,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-51.png","element":"img"}],[{"text":"The first result follows by applying the union bound:","element":"span"}],[{"style":{"width":"57%"},"width":905,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/15-52.png","element":"img"}],[{"text":"For the second result, let ","element":"span"},{"style":{"height":16},"width":174.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-0.png","element":"img","alt":" α′ ∈ (0, 1)","inline":true,"padRight":true},{"text":"denote the smallest scalar satisfying both ","element":"span"},{"style":{"height":16},"width":376.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-1.png","element":"img","alt":" α < α′ and α′m ∈ [m].","inline":true,"padRight":true},{"text":"Observe","element":"span"}],[{"style":{"width":"96%"},"width":1532,"height":268,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-2.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-3.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is a constant, there exist positive constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, c ","element":"span"},{"text":"such that if ","element":"span"},{"style":{"height":13.6},"width":187.08,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-4.png","element":"img","alt":" n ≥ C 
then","inline":true}],[{"style":{"width":"48%"},"width":768,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-5.png","element":"img"}],[{"text":"Therefore","element":"span"}],[{"style":{"width":"72%"},"width":1155,"height":157,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-6.png","element":"img"}],[{"text":"as claimed.","element":"span"}],[{"text":"We now turn our attention to establishing the conditions required at initialization on the corrupt points for the result on benign overfitting. To this end, in the following two lemmas we introduce the notion of an ","element":"span"},{"style":{"height":14},"width":87.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-7.png","element":"img","alt":" ϵ-fine","inline":true,"padRight":true},{"text":"training sample and lower bound the probability of drawing one. Then, by conditioning on drawing such a training sample, we lower bound the cardinality of a set of neurons, which we denote ","element":"span"},{"style":{"height":11.6},"width":28,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-8.png","element":"img","alt":" Λ","inline":true},{"text":", which satisfy a property related to ","element":"span"},{"style":{"height":11.79},"width":36.52,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-9.png","element":"img","alt":" Θ.","inline":true}],[{"id":"id-48","style":{"fontWeight":"bold"},"text":"Lemma A.4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume ","element":"span"},{"style":{"height":19.38},"width":316.52,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-10.png","element":"img","alt":" γ ≤ 45n and γ ≥ 5ρ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
We say a training sample is ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-11.png","element":"img","alt":" ϵ","inline":true},{"style":{"fontStyle":"italic"},"text":"-fine if for a random neuron ","element":"span"},{"style":{"height":12},"width":46.08,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-12.png","element":"img","alt":"wj","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the inequality ","element":"span"},{"style":{"height":19.78},"width":295.84,"height":49.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-13.png","element":"img","alt":" G_j < (4/10) T_j + (5/8) n","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"holds with probability at least ","element":"span"},{"style":{"height":10.8},"width":92.92,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-14.png","element":"img","alt":" 1 − ϵ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
There exist constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, c > ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that if ","element":"span"},{"style":{"height":13.2},"width":108.08,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-15.png","element":"img","alt":" n ≥ C","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then with probability at least ","element":"span"},{"style":{"height":17.39},"width":305.4,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-16.png","element":"img","alt":" 1 − ϵ−1 exp (−cn)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the training sample is ","element":"span"},{"style":{"height":14},"width":96.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-17.png","element":"img","alt":"ϵ-fine.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"This lemma is analogous to Lemma ","element":"span"},{"href":"#id-46","text":"A.2 ","element":"a"},{"text":"and to this end we reuse much of the same notation. In particular, recall a point ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"is positive iff ","element":"span"},{"style":{"height":16},"width":170.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-18.png","element":"img","alt":" ⟨x, v⟩ > 0","inline":true,"padRight":true},{"text":"and is negative iff ","element":"span"},{"style":{"height":16},"width":170.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-19.png","element":"img","alt":" ⟨x, v⟩ < 0","inline":true},{"text":". Note all points with the same sign are mutually i.i.d. by construction. 
We use ","element":"span"},{"style":{"height":12.98},"width":52.12,"height":32.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-20.png","element":"img","alt":" S+","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":52.12,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-21.png","element":"img","alt":" S−","inline":true,"padRight":true},{"text":"to denote these sets of positive and negative points respectively. As here we only consider a single random neuron ","element":"span"},{"style":{"height":16},"width":117.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-22.png","element":"img","alt":" wj and","inline":true,"padRight":true},{"text":"the activations at initialization of the training sample on it, we also drop both the subscript ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"as well as the argument parentheses on the counting functions. We also use ","element":"span"},{"style":{"height":10.8},"width":27.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-23.png","element":"img","alt":" ±","inline":true,"padRight":true},{"text":"superscripts to denote the subsets corresponding to activations from positive and negative points respectively. 
As indicative examples of our notation going forward, we denote ","element":"span"},{"style":{"height":15.58},"width":182.32,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-24.png","element":"img","alt":" wj as w, T","inline":true,"padRight":true},{"text":"is used as shorthand for the total number of activations while ","element":"span"},{"style":{"height":13.39},"width":192.48,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-25.png","element":"img","alt":" B+ and G− ","inline":true,"padRight":true},{"text":"are the number of corrupt positive and clean negative activations respectively.","element":"span"}],[{"text":"By the symmetry of the distribution of ","element":"span"},{"style":{"height":19.38},"width":652.08,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-26.png","element":"img","alt":" w, P(⟨w, v⟩ > 0) = P(⟨w, v⟩ < 0) = 12","inline":true},{"text":". As a result","element":"span"}],[{"style":{"width":"64%"},"width":1015,"height":208,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-27.png","element":"img"}],[{"text":"As the analysis and results derived under either condition will prove identical under reversal of the signs involved, without loss of generality we let ","element":"span"},{"style":{"height":16},"width":184.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-28.png","element":"img","alt":" ⟨w, v⟩ > 0","inline":true},{"text":". Consider this problem for a slightly different data model, in which a training sample consists of ","element":"span"},{"style":{"height":16},"width":147.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-29.png","element":"img","alt":" 2(n − k)","inline":true,"padRight":true},{"text":"clean, positive points and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"negative points. 
Abusing notation, we let ","element":"span"},{"style":{"height":14.4},"width":17.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-30.png","element":"img","alt":" ζ","inline":true,"padRight":true},{"text":"denote the event that we are instead drawing our training sample in this manner and also that ","element":"span"},{"style":{"height":16},"width":320.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-31.png","element":"img","alt":" ⟨w, v⟩ > 0. Clearly","inline":true}],[{"style":{"width":"69%"},"width":1097,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-32.png","element":"img"}],[{"text":"In this setting, only positive points activate ","element":"span"},{"style":{"fontWeight":"bold"},"text":"w","element":"span"},{"text":", therefore all points which activate ","element":"span"},{"style":{"fontWeight":"bold"},"text":"w ","element":"span"},{"text":"are identically distributed. As a result, sampling positive points which activate ","element":"span"},{"style":{"fontWeight":"bold"},"text":"w ","element":"span"},{"text":"is equivalent to uniformly sampling without replacement ","element":"span"},{"style":{"height":12.99},"width":53.8,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-33.png","element":"img","alt":" T +","inline":true,"padRight":true},{"text":"points from ","element":"span"},{"style":{"height":12.99},"width":52.12,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-34.png","element":"img","alt":" S+","inline":true},{"text":". 
Let ","element":"span"},{"style":{"height":13.18},"width":115.68,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-35.png","element":"img","alt":" Zℓ = 1","inline":true,"padRight":true},{"text":"if the ℓ","element":"span"},{"text":"-th element sampled from ","element":"span"},{"style":{"height":12.99},"width":52.12,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/16-36.png","element":"img","alt":" S+","inline":true,"padRight":true},{"text":"is clean ","element":"span"},{"text":"and is ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. Define ","element":"span"},{"style":{"height":21.63},"width":184.48,"height":54.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-0.png","element":"img","alt":" µ = 2(n−k)/(2n−k) ","inline":true,"padRight":true},{"text":", then again using a variant of Hoeffding’s bound for sampling ","element":"span"},{"text":"without replacement ","element":"span"},{"href":"#id-45","referenceIndex":4,"text":"(Bardenet & Maillard, ","element":"a"},{"href":"#id-45","referenceIndex":4,"text":"2015)","element":"a"},{"text":" [Proposition 1.2] and as long as ","element":"span"},{"style":{"height":19.78},"width":296.88,"height":49.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-1.png","element":"img","alt":"4/10 + 5n/(8T+) > µ, we","inline":true,"padRight":true},{"text":"have","element":"span"}],[{"id":"id-47","style":{"width":"90%"},"width":1432,"height":398,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-2.png","element":"img"}],[{"text":"We proceed to lower and upper bound ","element":"span"},{"style":{"height":12.8},"width":50.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-3.png","element":"img","alt":" T +","inline":true},{"text":". To this end, we first lower and upper bound the probability that a positive point activates 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"w ","element":"span"},{"text":"conditioned on the event ","element":"span"},{"style":{"height":17.7},"width":670.44,"height":44.24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-4.png","element":"img","alt":" ⟨w, v⟩ > 0. Let x = √γv + √1 − γn be","inline":true,"padRight":true},{"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"fixed ","element":"span"},{"text":"positive point, then by the symmetry of the distribution of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"w","element":"span"}],[{"style":{"width":"80%"},"width":1282,"height":234,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-5.png","element":"img"}],[{"text":"For convenience, let ","element":"span"},{"style":{"height":13.18},"width":45.4,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-6.png","element":"img","alt":" E1","inline":true,"padRight":true},{"text":"denote the event ","element":"span"},{"style":{"height":16},"width":253.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-7.png","element":"img","alt":" ⟨w, x⟩ > 0, E2","inline":true,"padRight":true},{"text":"the event ","element":"span"},{"style":{"height":16},"width":186.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-8.png","element":"img","alt":" ⟨w, v⟩ > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.18},"width":45.44,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-9.png","element":"img","alt":" E3","inline":true,"padRight":true},{"text":"the event ","element":"span"},{"style":{"height":16},"width":206.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-10.png","element":"img","alt":"|⟨w, v⟩| ≤ φ","inline":true,"padRight":true},{"text":"for some arbitrary 
","element":"span"},{"style":{"height":16},"width":163.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-11.png","element":"img","alt":" φ ∈ (0, 1)","inline":true},{"text":". For the upper bound observe","element":"span"}],[{"style":{"width":"92%"},"width":1464,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-12.png","element":"img"}],[{"text":"Let Cap","element":"span"},{"style":{"height":17.38},"width":594.44,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-13.png","element":"img","alt":"(n, φ) := {z ∈ Sd−1 : ⟨n, z⟩ ≥ φ}","inline":true,"padRight":true},{"text":"denote the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"spherical cap ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":13.38},"width":79.64,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-14.png","element":"img","alt":" Sd−1","inline":true,"padRight":true},{"text":"centered on ","element":"span"},{"style":{"fontWeight":"bold"},"text":"n","element":"span"},{"text":". 
As ","element":"span"},{"style":{"height":17.39},"width":402.72,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-15.png","element":"img","alt":"d ≥ 3 and w ∼ U(S^{d−1})","inline":true,"padRight":true},{"text":"then from ","element":"span"},{"href":"#id-43","referenceIndex":3,"text":"Ball ","element":"a"},{"href":"#id-43","referenceIndex":3,"text":"(1997)","element":"a"},{"text":" [Lemma 2.2] it follows that","element":"span"}],[{"style":{"width":"57%"},"width":914,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-16.png","element":"img"}],[{"text":"Furthermore,","element":"span"}],[{"style":{"width":"85%"},"width":1354,"height":396,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-17.png","element":"img"}],[{"text":"Therefore, noting that under the assumptions of the lemma ","element":"span"},{"style":{"height":21.78},"width":156.44,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-18.png","element":"img","alt":"γ/(1−γ) ≤ 4/n,","inline":true}],[{"style":{"width":"87%"},"width":1382,"height":412,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-19.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":203.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-20.png","element":"img","alt":" ω ∈ (0, 1/2)","inline":true,"padRight":true},{"text":"be an arbitrary constant. Then, as long as ","element":"span"},{"style":{"height":28.8},"width":588.56,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-21.png","element":"img","alt":" n ≥ max{ln²(1/ω), 16 ln⁻²(1/(1−ω))}","inline":true}],[{"style":{"width":"38%"},"width":608,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/17-22.png","element":"img"}],[{"text":"In order to take advantage of concentration, the upper bound on 
","element":"span"},{"style":{"height":12.8},"width":50.48,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-0.png","element":"img","alt":" T + ","inline":true,"padRight":true},{"text":"must be greater than ","element":"span"},{"style":{"height":16.98},"width":174.92,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-1.png","element":"img","alt":" E[T +]. On","inline":true,"padRight":true},{"text":"the other hand, if the upper bound is too large then the condition ","element":"span"},{"style":{"height":21.73},"width":228.88,"height":54.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-2.png","element":"img","alt":"410 + 5n8T +j > µ","inline":true,"padRight":true},{"text":"will be compromised.","element":"span"}],[{"text":"As ","element":"span"},{"style":{"height":14},"width":134.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-3.png","element":"img","alt":" µ < 1 if","inline":true}],[{"style":{"width":"33%"},"width":528,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-4.png","element":"img"}],[{"text":"then this condition is not compromised. 
Setting ","element":"span"},{"style":{"height":20.24},"width":633.16,"height":50.6,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-5.png","element":"img","alt":" ω = 1/48, if n ≥ 16 ln⁻²(48/47) =: C then","inline":true}],[{"style":{"width":"39%"},"width":621,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-6.png","element":"img"}],[{"text":"and therefore, applying a Chernoff bound,","element":"span"}],[{"style":{"width":"45%"},"width":724,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-7.png","element":"img"}],[{"text":"Under the same conditions, to lower bound ","element":"span"},{"style":{"height":12.8},"width":51,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-8.png","element":"img","alt":" T + ","inline":true,"padRight":true},{"text":"we again apply a Chernoff bound, which gives","element":"span"}],[{"style":{"width":"54%"},"width":864,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-9.png","element":"img"}],[{"text":"Therefore, there exist a small positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"and a large positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"such that if","element":"span"}],[{"style":{"height":13.6},"width":187.04,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-10.png","element":"img","alt":"n ≥ C then","inline":true}],[{"style":{"width":"53%"},"width":850,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-11.png","element":"img"}],[{"text":"In combination with the bound in ","element":"span"},{"href":"#id-47","text":"(4)","element":"a"},{"text":", from this result it follows that there also exists a sufficiently small constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c > 
","element":"span"},{"text":"0 ","element":"span"},{"text":"such that","element":"span"}],[{"style":{"width":"66%"},"width":1061,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-12.png","element":"img"}],[{"text":"As a result, there exist positive constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, c ","element":"span"},{"text":"such that if ","element":"span"},{"style":{"height":13.6},"width":187.04,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-13.png","element":"img","alt":" n ≥ C then","inline":true}],[{"style":{"width":"100%"},"width":1710,"height":711,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-14.png","element":"img"}],[{"text":"Note if instead ","element":"span"},{"style":{"height":16},"width":170.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-15.png","element":"img","alt":" ⟨x, v⟩ < 0","inline":true},{"text":", then swapping the roles of the negative and positive points in the argument outlined above gives the same result. 
Therefore, under the assumptions of the lemma,","element":"span"}],[{"style":{"width":"41%"},"width":661,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-16.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":17.54},"width":209.76,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-17.png","element":"img","alt":" X = (x_i)_{i=1}^{2n}","inline":true,"padRight":true},{"text":"denote the training sample and ","element":"span"},{"style":{"height":19.78},"width":630.92,"height":49.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-18.png","element":"img","alt":" X^c_ϵ = {X : P_w(G ≥ (4/10)T + (5/8)n) > ϵ}","inline":true,"padRight":true},{"text":"the set of training samples which are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"style":{"height":7.2},"width":13.52,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-19.png","element":"img","alt":" ϵ","inline":true},{"text":"-fine. Note that the subscript ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"above indicates randomness over the neuron ","element":"span"},{"style":{"fontWeight":"bold"},"text":"w ","element":"span"},{"text":"alone. Clearly, by construction,","element":"span"}],[{"style":{"width":"37%"},"width":602,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/18-20.png","element":"img"}],[{"text":"Furthermore, as","element":"span"}],[{"style":{"width":"86%"},"width":1364,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-0.png","element":"img"}],[{"text":"it follows that ","element":"span"},{"style":{"height":17.39},"width":489.12,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-1.png","element":"img","alt":" P(X ∈ X^c_ϵ) ≤ ϵ⁻¹ exp(−cn)","inline":true},{"text":". 
As a result, we conclude that there exist positive ","element":"span"},{"text":"constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, c ","element":"span"},{"text":"such that if ","element":"span"},{"style":{"height":13.2},"width":108.08,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-2.png","element":"img","alt":" n ≥ C","inline":true,"padRight":true},{"text":"then the probability of drawing an ","element":"span"},{"style":{"height":7.2},"width":14,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-3.png","element":"img","alt":" ϵ","inline":true},{"text":"-fine training sample is at least ","element":"span"},{"style":{"height":17.39},"width":314.92,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-4.png","element":"img","alt":"1 − ϵ⁻¹ exp(−cn).","inline":true}],[{"id":"id-49","style":{"fontWeight":"bold"},"text":"Lemma A.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume ","element":"span"},{"style":{"height":21.78},"width":994.32,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-5.png","element":"img","alt":" γ ≤ 4/(5n), γ ≥ 5ρ and let Λ := {j ∈ [2m] : Gj < ((γ−ρ)/(2γ)) Tj + 1/(2γ)}","inline":true},{"style":{"fontStyle":"italic"},"text":". There exists a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that for any ","element":"span"},{"style":{"height":16},"width":533.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-6.png","element":"img","alt":" δ ∈ (0, 1) if n ≥ C log(1/δ) then","inline":true}],[{"style":{"width":"25%"},"width":411,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"To bound the size of ","element":"span"},{"style":{"height":11.41},"width":26,"height":28.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-8.png","element":"img","alt":" Λ","inline":true,"padRight":true},{"text":"with high probability we follow an approach similar to the one used to bound the size of ","element":"span"},{"style":{"height":15.58},"width":41.92,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-9.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"text":"with high probability. As ","element":"span"},{"style":{"height":19.38},"width":295.12,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-10.png","element":"img","alt":" γ ≤ 4/(5n) and ρ ≤ γ/5 ","inline":true,"padRight":true},{"text":", observe that","element":"span"}],[{"style":{"width":"30%"},"width":491,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-11.png","element":"img"}],[{"text":"Therefore, ","element":"span"},{"style":{"height":19.76},"width":419.76,"height":49.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-12.png","element":"img","alt":" j ∈ Λ if Gj < (4/10)Tj + (5/8)n","inline":true},{"text":". Conditioned on the event that the training sample is ","element":"span"},{"style":{"height":11.2},"width":145.28,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-13.png","element":"img","alt":" ϵ-fine for","inline":true},{"style":{"height":16},"width":244.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-14.png","element":"img","alt":"ϵ := exp(−cn)","inline":true,"padRight":true},{"text":"for some sufficiently small constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":", with the data fixed the preactivations of the data on each neuron are mutually independent and identically distributed by construction. 
As such, in this setting the events ","element":"span"},{"style":{"height":20.83},"width":371.36,"height":52.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-15.png","element":"img","alt":" (Gj < 410Tj + 58n)2mj=1","inline":true,"padRight":true},{"text":"are also mutually independent. Let ","element":"span"},{"style":{"height":17.54},"width":212.6,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-16.png","element":"img","alt":" X = (xi)2ni=1","inline":true,"padRight":true},{"text":"denote the training sample and ","element":"span"},{"style":{"height":13.18},"width":47,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-17.png","element":"img","alt":" Xϵ","inline":true,"padRight":true},{"text":"the set of ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-18.png","element":"img","alt":" ϵ","inline":true},{"text":"-fine training samples, then","element":"span"}],[{"style":{"width":"89%"},"width":1416,"height":219,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-19.png","element":"img"}],[{"text":"It follows that there exists a sufficiently small positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"such that","element":"span"}],[{"style":{"width":"45%"},"width":716,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-20.png","element":"img"}],[{"text":"Therefore, using Lemma ","element":"span"},{"href":"#id-48","text":"A.4 ","element":"a"},{"text":"there exists a sufficiently small positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"such that","element":"span"}],[{"style":{"width":"60%"},"width":954,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-21.png","element":"img"}],[{"text":"from which the result claimed 
follows.","element":"span"}],[{"id":"id-61","style":{"fontWeight":"bold"},"text":"Lemma A.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume ","element":"span"},{"style":{"height":19.38},"width":260.28,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-22.png","element":"img","alt":" γ ≤ 45n, γ ≥ 5ρ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.8},"width":225.36,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-23.png","element":"img","alt":" |Γp| > 0.99m","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":16},"width":198.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-24.png","element":"img","alt":" p ∈ {−1, 1}","inline":true},{"style":{"fontStyle":"italic"},"text":". There exists a positive ","element":"span"},{"style":{"fontStyle":"italic"},"text":"constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that for any ","element":"span"},{"style":{"height":16},"width":156.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-25.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", if ","element":"span"},{"style":{"height":19.38},"width":223.44,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-26.png","element":"img","alt":" n ≥ C log( 1δ )","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.38},"width":247.64,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-27.png","element":"img","alt":" m ≥ C log� kδ�","inline":true},{"style":{"fontStyle":"italic"},"text":", then with probability at least 
","element":"span"},{"style":{"height":14.8},"width":316.88,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-28.png","element":"img","alt":" 1 − δ for all i ∈ SF","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"there exists a ","element":"span"},{"style":{"height":23.54},"width":447.2,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-29.png","element":"img","alt":" j ∈ Θyi for which i ∈ A(0)j .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For convenience here we use ","element":"span"},{"style":{"height":16.78},"width":797.08,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-30.png","element":"img","alt":" Tj, Gj and Bj for Tj(0, 1), Gj(0, 1) and Bj(0, 1)","inline":true,"padRight":true},{"text":"respectively. For a neuron to be in ","element":"span"},{"style":{"height":15.58},"width":48,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-31.png","element":"img","alt":" Θp","inline":true,"padRight":true},{"text":"it must satisfy the following condition,","element":"span"}],[{"style":{"width":"38%"},"width":611,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-32.png","element":"img"}],[{"text":"Adding and subtracting ","element":"span"},{"style":{"height":16.8},"width":163.72,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-33.png","element":"img","alt":" Tj(γ − ρ)","inline":true,"padRight":true},{"text":"to the left-hand-side gives","element":"span"}],[{"style":{"width":"32%"},"width":523,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-34.png","element":"img"}],[{"text":"Rearranging this inequality it follows that the conditions 
","element":"span"},{"style":{"height":21.78},"width":480,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-35.png","element":"img","alt":" j ∈ Γp and Gj < γ−ρ2γ Tj + 12γ ","inline":true,"padRight":true},{"text":"are sufficient ","element":"span"},{"text":"to conclude ","element":"span"},{"style":{"height":15.6},"width":126.12,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-36.png","element":"img","alt":" j ∈ Θp","inline":true},{"text":". Let ","element":"span"},{"style":{"height":21.78},"width":660.04,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-37.png","element":"img","alt":" Λ := {j ∈ [2m] : Gj < γ−ρ2γ Tj + 12γ }","inline":true},{"text":". Therefore, in order to prove ","element":"span"},{"text":"the desired result it suffices to lower bound the probability that for each corrupt point ","element":"span"},{"style":{"height":16},"width":119.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-38.png","element":"img","alt":" (xi, yi)","inline":true},{"text":", the intersection between the set of neurons which ","element":"span"},{"style":{"height":9.79},"width":33.48,"height":24.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-39.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"activates, the set of neurons ","element":"span"},{"style":{"height":15.81},"width":49.48,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-40.png","element":"img","alt":" Γyi","inline":true,"padRight":true},{"text":"and the set of neurons ","element":"span"},{"style":{"height":11.6},"width":28,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/19-41.png","element":"img","alt":" Λ","inline":true,"padRight":true},{"text":"is nonempty.","element":"span"}],[{"text":"By a Chernoff bound there exists a small constant 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"c > ","element":"span"},{"text":"0 ","element":"span"},{"text":"such that with probability at least ","element":"span"},{"style":{"height":10.8},"width":61.84,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-0.png","element":"img","alt":" 1 −","inline":true},{"style":{"height":16},"width":175.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-1.png","element":"img","alt":"exp(−cm)","inline":true,"padRight":true},{"text":"a fixed training point is activated by at least ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"3 ","element":"span"},{"text":"of the neurons of each sign. Therefore, using the union bound, every corrupt training point is activated by at least ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m/","element":"span"},{"text":"3 ","element":"span"},{"text":"of the neurons with matching sign with probability at least ","element":"span"},{"style":{"height":16},"width":294.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-2.png","element":"img","alt":" 1 − 2k exp(−cm)","inline":true},{"text":". 
Conditioning on this event, then under the assumption ","element":"span"},{"style":{"height":16.78},"width":228.16,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-3.png","element":"img","alt":" |Γp| > 0.99m","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":200.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-4.png","element":"img","alt":" p ∈ {−1, 1}","inline":true},{"text":", each corrupt point ","element":"span"},{"style":{"height":16},"width":119.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-5.png","element":"img","alt":" (xi, yi)","inline":true,"padRight":true},{"text":"activates at least ","element":"span"},{"style":{"height":19.76},"width":87.44,"height":49.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-6.png","element":"img","alt":"97300m","inline":true,"padRight":true},{"text":"neurons in ","element":"span"},{"style":{"height":15.58},"width":52.08,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-7.png","element":"img","alt":" Γyi","inline":true},{"text":". 
Therefore, if for instance, ","element":"span"},{"style":{"height":16},"width":217.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-8.png","element":"img","alt":" |Λ| > 1.75m","inline":true},{"text":", we can conclude for each ","element":"span"},{"style":{"height":16},"width":119.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-9.png","element":"img","alt":" (xi, yi)","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":13.58},"width":110.6,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-10.png","element":"img","alt":"i ∈ SF","inline":true,"padRight":true},{"text":"that there exists a ","element":"span"},{"style":{"height":23.52},"width":425.92,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-11.png","element":"img","alt":" j ∈ Θyi such that i ∈ A(0)j ","inline":true,"padRight":true},{"text":". Therefore, under the conditions of the lemma, ","element":"span"},{"text":"using the union bound and Lemmas ","element":"span"},{"href":"#id-49","text":"A.5 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-50","text":"A.3, ","element":"a"},{"text":"we can upper bound the failure probability of this as","element":"span"}],[{"style":{"width":"27%"},"width":443,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-12.png","element":"img"}],[{"text":"Therefore, for ","element":"span"},{"style":{"height":16},"width":167.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-13.png","element":"img","alt":" δ ∈ (0, 1)","inline":true,"padRight":true},{"text":"there exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"such that if 
","element":"span"},{"style":{"height":19.38},"width":234.04,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-14.png","element":"img","alt":" n ≥ C log( 1δ )","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.4},"width":78.52,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-15.png","element":"img","alt":" m ≥","inline":true},{"style":{"height":19.38},"width":159.52,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-16.png","element":"img","alt":"C log� kδ�","inline":true},{"text":"then the probability that for all ","element":"span"},{"style":{"height":13.6},"width":110.56,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-17.png","element":"img","alt":" i ∈ SF","inline":true,"padRight":true},{"text":"there exists a ","element":"span"},{"style":{"height":23.52},"width":424.76,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-18.png","element":"img","alt":" j ∈ Θyi such that i ∈ A(0)j","inline":true,"padRight":true},{"text":"is at least ","element":"span"},{"style":{"height":12},"width":97.84,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-19.png","element":"img","alt":"1 − δ.","inline":true}],[{"text":"The final lemma we provide here states, under mild conditions on the network width, that with high probability every point in the training sample activates a neuron whose output weight matches its label in sign. We use this to prove the result on non-benign overfitting, detailed in Section ","element":"span"},{"text":"D.","element":"span"}],[{"id":"id-81","style":{"fontWeight":"bold"},"text":"Lemma A.7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":19.38},"width":442.36,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-20.png","element":"img","alt":" δ ∈ (0, 1), if m ≥ log2( 2nδ )","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then the probability that for all ","element":"span"},{"style":{"height":16},"width":128.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-21.png","element":"img","alt":" i ∈ [2n]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"there exists a ","element":"span"},{"style":{"height":23.54},"width":707.72,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-22.png","element":"img","alt":"j ∈ [2m] such that (−1)j = yi and i ∈ A(0)j","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is at least ","element":"span"},{"style":{"height":12},"width":97.88,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-23.png","element":"img","alt":" 1 − δ.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Observe by the rotational symmetry of the weight distribution that for any ","element":"span"},{"style":{"height":16},"width":144.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-24.png","element":"img","alt":" j ∈ [2m]","inline":true}],[{"style":{"width":"44%"},"width":704,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-25.png","element":"img"}],[{"text":"By construction, for each element in the training sample ","element":"span"},{"style":{"height":17.54},"width":358.88,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-26.png","element":"img","alt":" (xi, yi)2ni=1 there are m","inline":true,"padRight":true},{"text":"neurons whose output ","element":"span"},{"text":"weight has the same sign. As the preactivations of ","element":"span"},{"style":{"height":9.79},"width":33.52,"height":24.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-27.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"with each neuron are mutually independent from one another, then using the union bound it follows that","element":"span"}],[{"style":{"width":"100%"},"width":1596,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-28.png","element":"img"}],[{"text":"Setting ","element":"span"},{"style":{"height":14},"width":189.04,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-29.png","element":"img","alt":" δ ≥ 2n2−m ","inline":true,"padRight":true},{"text":"and rearranging we arrive at the stated result.","element":"span"}]]},{"heading":"Appendix B Supporting Lemmas","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"B.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bounds on activations and preactivations","element":"span"}],[{"text":"For any pair of iterations 
","element":"span"},{"style":{"height":13.2},"width":62.48,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-30.png","element":"img","alt":" t, t0","inline":true,"padRight":true},{"text":"satisfying ","element":"span"},{"style":{"height":12.4},"width":97.92,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-31.png","element":"img","alt":" t > t0","inline":true},{"text":", unrolling the GD update rule ","element":"span"},{"href":"#id-51","text":"(3) ","element":"a"},{"text":"gives","element":"span"}],[{"style":{"width":"43%"},"width":684,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-32.png","element":"img"}],[{"text":"Using ","element":"span"},{"href":"#id-44","text":"(1) ","element":"a"},{"text":"and the fact that ","element":"span"},{"style":{"height":16},"width":467,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-33.png","element":"img","alt":" ni ⊥ v for any i ∈ [2n], then","inline":true}],[{"id":"id-52","style":{"width":"85%"},"width":1357,"height":396,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-34.png","element":"img"}],[{"text":"where we define ","element":"span"},{"style":{"height":16.98},"width":514.84,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/20-35.png","element":"img","alt":" λiℓ := (−1)ℓ+iβ(i)β(ℓ)⟨xℓ, xi⟩","inline":true},{"text":". Towards the goal of bounding the activation of a neuron with a data point we provide the following results.","element":"span"}],[{"id":"id-53","style":{"fontWeight":"bold"},"text":"Lemma B.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume ","element":"span"},{"style":{"height":19.9},"width":811.32,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-0.png","element":"img","alt":" |⟨ni, nℓ⟩| ≤ ρ1−γ for all i, ℓ ∈ [2n] such that i ̸= ℓ.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"1. If ","element":"span"},{"style":{"height":13.6},"width":304.92,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-1.png","element":"img","alt":" i = ℓ then λiℓ = 1.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. If ","element":"span"},{"style":{"height":16},"width":993.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-2.png","element":"img","alt":" i ̸= ℓ, i ∈ ST , and ℓ ∈ SF , then −(γ + ρ) ≤ λiℓ ≤ −(γ − ρ).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. If ","element":"span"},{"style":{"height":16},"width":993.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-3.png","element":"img","alt":" i ̸= ℓ, i ∈ SF , and ℓ ∈ ST , then −(γ + ρ) ≤ λiℓ ≤ −(γ − ρ).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"4. If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"style":{"height":15.2},"width":733.32,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-4.png","element":"img","alt":" ̸= ℓ and i, ℓ ∈ ST , then γ − ρ ≤ λiℓ ≤ γ + ρ.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"5. If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"style":{"height":15.2},"width":734.68,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-5.png","element":"img","alt":" ̸= ℓ and i, ℓ ∈ SF , then γ − ρ ≤ λiℓ ≤ γ + ρ.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Observe by the data model that","element":"span"}],[{"style":{"width":"52%"},"width":832,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-6.png","element":"img"}],[{"text":"Therefore","element":"span"}],[{"style":{"width":"35%"},"width":568,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-7.png","element":"img"}],[{"text":"from which the results claimed follow.","element":"span"}],[{"id":"id-54","style":{"fontWeight":"bold"},"text":"Lemma B.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume ","element":"span"},{"style":{"height":19.89},"width":800.44,"height":49.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-8.png","element":"img","alt":" |⟨ni, nℓ⟩| ≤ ρ1−γ for all i, ℓ ∈ [2n] such that i ̸= ℓ","inline":true},{"style":{"fontStyle":"italic"},"text":". Then for any ","element":"span"},{"style":{"height":16},"width":201.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-9.png","element":"img","alt":" j ∈ [2m] the","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"following are true.","element":"span"}],[{"style":{"width":"91%"},"width":1443,"height":1083,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Considering ","element":"span"},{"href":"#id-52","text":"(5) ","element":"a"},{"text":"we can further separate the summation term as follows,","element":"span"}],[{"style":{"width":"97%"},"width":1553,"height":193,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-11.png","element":"img"}],[{"text":"Note, with ","element":"span"},{"style":{"height":13.81},"width":108,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-12.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.6},"width":84.08,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-13.png","element":"img","alt":" i ∼ j","inline":true},{"text":", or ","element":"span"},{"style":{"height":13.79},"width":109,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-14.png","element":"img","alt":" i ∈ SF","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.2},"width":84.08,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-15.png","element":"img","alt":" i ̸∼ j","inline":true},{"text":", then ","element":"span"},{"style":{"height":16.98},"width":277.16,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-16.png","element":"img","alt":" (−1)j+iβ(i) = 1","inline":true},{"text":". 
On the other hand, with ","element":"span"},{"style":{"height":13.2},"width":111.16,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-17.png","element":"img","alt":"i ∈ ST","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.2},"width":85.44,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-18.png","element":"img","alt":" i ̸∼ j","inline":true},{"text":", or ","element":"span"},{"style":{"height":13.6},"width":112.12,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-19.png","element":"img","alt":" i ∈ SF","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.6},"width":85.44,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-20.png","element":"img","alt":" i ∼ j","inline":true},{"text":", then ","element":"span"},{"style":{"height":16.98},"width":309.52,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-21.png","element":"img","alt":" (−1)j+iβ(i) = −1","inline":true},{"text":". 
Substituting the relevant bounds on ","element":"span"},{"style":{"height":13.2},"width":48.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-22.png","element":"img","alt":" λiℓ","inline":true,"padRight":true},{"text":"provided in Lemma ","element":"span"},{"href":"#id-53","text":"B.1, ","element":"a"},{"text":"and observing by definition that ","element":"span"},{"style":{"height":23.86},"width":552.96,"height":59.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-23.png","element":"img","alt":" G(i)j (t0, t) = �ℓ∈ST ,ℓ̸=i Tℓj(t0, t)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":23.86},"width":553.84,"height":59.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/21-24.png","element":"img","alt":" B(i)j (t0, t) = �ℓ∈ST ,ℓ̸=i Tℓj(t0, t)","inline":true},{"text":", one arrives at the results claimed.","element":"span"}],[{"text":"We will often make use of the following similar but more pessimistic bounds on the activations. Recall that ","element":"span"},{"style":{"height":14.59},"width":21.52,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-0.png","element":"img","alt":" ϕ","inline":true,"padRight":true},{"text":"is the ReLU function: ","element":"span"},{"style":{"height":16},"width":311.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-1.png","element":"img","alt":" ϕ(a) = max{a, 0}.","inline":true,"padRight":true},{"id":"id-57","style":{"fontWeight":"bold"},"text":"Lemma B.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":144.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-2.png","element":"img","alt":" j ∈ [2m]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and iterations ","element":"span"},{"style":{"height":14.4},"width":252.36,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-3.png","element":"img","alt":" t0, t with t0 ≤ t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the following hold:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. If ","element":"span"},{"style":{"height":14.4},"width":293.6,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-4.png","element":"img","alt":" i ∈ ST , i ∼ j then","inline":true}],[{"style":{"width":"90%"},"width":1437,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"2. If ","element":"span"},{"style":{"height":15.2},"width":293.6,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-6.png","element":"img","alt":" i ∈ ST , i ̸∼ j then","inline":true}],[{"style":{"width":"95%"},"width":1507,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"3. If ","element":"span"},{"style":{"height":14.4},"width":295,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-8.png","element":"img","alt":" i ∈ SF , i ∼ j then","inline":true}],[{"style":{"width":"95%"},"width":1509,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"4. 
If ","element":"span"},{"style":{"height":15.2},"width":295,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-10.png","element":"img","alt":" i ∈ SF , i ̸∼ j then","inline":true}],[{"style":{"width":"90%"},"width":1439,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"The ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-12.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"term in the upper bound for cases 2 and 3 is only necessary if ","element":"span"},{"style":{"height":16.78},"width":229.76,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-13.png","element":"img","alt":" Tij(t0, t) > 0.","inline":true}],[{"text":"We remark that we will often use this result in a setting where ","element":"span"},{"style":{"height":14},"width":95.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-14.png","element":"img","alt":" ρ ≤ γ","inline":true},{"text":". In these cases, the terms that involve ","element":"span"},{"style":{"height":16},"width":147.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-15.png","element":"img","alt":" ϕ(ρ − γ)","inline":true,"padRight":true},{"text":"are zero and will be dropped.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"For each of these results, we make use of Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"style":{"height":16},"width":442.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-16.png","element":"img","alt":" a ≤ ϕ(a) for all a ∈ R, and","inline":true}],[{"style":{"width":"29%"},"width":466,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-17.png","element":"img"}],[{"text":"for all ","element":"span"},{"style":{"height":13.6},"width":146,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-18.png","element":"img","alt":" i, j, t0, t1","inline":true},{"text":". We will only prove the inequalities for ","element":"span"},{"style":{"height":13.2},"width":109.56,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-19.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":"here, as the inequalities for ","element":"span"},{"style":{"height":13.6},"width":110.6,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-20.png","element":"img","alt":" i ∈ SF","inline":true,"padRight":true},{"text":"are analogous.","element":"span"}],[{"text":"For the first inequality in Statement 1 we claim it suffices to show","element":"span"}],[{"id":"id-55","style":{"width":"99%"},"width":1581,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-21.png","element":"img"}],[{"text":"Indeed, if ","element":"span"},{"href":"#id-55","text":"(6) ","element":"a"},{"text":"is true then the result claimed follows as","element":"span"}],[{"style":{"width":"90%"},"width":1437,"height":451,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-22.png","element":"img"}],[{"text":"In order to prove ","element":"span"},{"href":"#id-55","text":"(6) ","element":"a"},{"text":"we 
bound","element":"span"}],[{"style":{"width":"72%"},"width":1152,"height":208,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/22-23.png","element":"img"}],[{"text":"This follows from Statement 1 in Lemma ","element":"span"},{"href":"#id-54","text":"B.2. ","element":"a"},{"text":"From here, we consider two cases: first, if ","element":"span"},{"style":{"height":23.52},"width":206.8,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-0.png","element":"img","alt":" ⟨w(τ)j , xi⟩ ≥","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"text":"then ","element":"span"},{"style":{"height":23.52},"width":445.24,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-1.png","element":"img","alt":" ⟨w(τ)j , xi⟩ = ϕ(⟨w(τ)j , xi⟩)","inline":true,"padRight":true},{"text":"and so ","element":"span"},{"href":"#id-55","text":"(6) ","element":"a"},{"text":"clearly holds. Alternatively, if ","element":"span"},{"style":{"height":23.52},"width":245.32,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-2.png","element":"img","alt":" ⟨w(τ)j , xi⟩ < 0","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":23.52},"width":596.04,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-3.png","element":"img","alt":"Tij(τ, τ + 1) = 0, ϕ(⟨w(τ)j , xi⟩) = 0","inline":true,"padRight":true},{"text":"and as a result the right-hand-side of ","element":"span"},{"href":"#id-55","text":"(6) ","element":"a"},{"text":"is non-positive while the left is non-negative. 
As such ","element":"span"},{"href":"#id-55","text":"(6) ","element":"a"},{"text":"holds trivially.","element":"span"}],[{"text":"For the second inequality in Statement 1 we bound","element":"span"}],[{"style":{"width":"87%"},"width":1379,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-4.png","element":"img"}],[{"text":"Since the right-hand side is non-negative, this inequality is true even if we replace the left-hand side by ","element":"span"},{"style":{"height":23.52},"width":223.56,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-5.png","element":"img","alt":" ϕ(⟨w(t)j , xi⟩).","inline":true}],[{"text":"We now proceed to Statement 2. For the first inequality, notice that if ","element":"span"},{"style":{"height":23.52},"width":304.48,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-6.png","element":"img","alt":" ϕ(⟨w(t0)j , xi⟩) = 0","inline":true,"padRight":true},{"text":"then the right-hand side is non-positive and therefore the inequality trivially holds. Otherwise, it must be the case that ","element":"span"},{"style":{"height":23.54},"width":457.16,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-7.png","element":"img","alt":" ϕ(⟨w(t0)j , xi⟩) = ⟨w(t0)j , xi⟩","inline":true},{"text":". Using Statement 2 from Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"text":"we obtain the bound","element":"span"}],[{"style":{"width":"89%"},"width":1421,"height":221,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-8.png","element":"img"}],[{"text":"We now turn to the second inequality in Statement 2. 
The corresponding statement from Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"yields","element":"span"}],[{"id":"id-56","style":{"width":"96%"},"width":1537,"height":208,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-9.png","element":"img"}],[{"text":"We remark that the reason for the addition of ","element":"span"},{"style":{"height":10.8},"width":19.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-10.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"to the right-hand side will soon become apparent. The desired inequality holds as long as the right-hand side is non-negative; we therefore proceed by induction to prove","element":"span"}],[{"style":{"width":"83%"},"width":1318,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-11.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":12.8},"width":123.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-12.png","element":"img","alt":" τ ≥ t0","inline":true},{"text":". The base case ","element":"span"},{"style":{"height":12.38},"width":123.56,"height":30.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-13.png","element":"img","alt":" τ = t0","inline":true,"padRight":true},{"text":"is trivial. Assume then that the induction hypothesis holds for some ","element":"span"},{"style":{"height":12.8},"width":110,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-14.png","element":"img","alt":" τ ≥ t0","inline":true},{"text":". 
For iteration ","element":"span"},{"style":{"height":12},"width":92.48,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-15.png","element":"img","alt":" τ + 1","inline":true,"padRight":true},{"text":"there are two cases to consider: first, if ","element":"span"},{"style":{"height":23.52},"width":242.4,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-16.png","element":"img","alt":" ⟨w(τ)j , xi⟩ < 0","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":16.8},"width":414.28,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-17.png","element":"img","alt":"Tij(t0, τ+1) = Tij(t0, τ)","inline":true},{"text":". In addition, as ","element":"span"},{"style":{"height":23.52},"width":927.52,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-18.png","element":"img","alt":" Bj(t0, τ) ≤ Bj(t0, τ+1) and G(i)j (t0, τ) ≤ G(i)j (t0, τ+1)","inline":true,"padRight":true},{"text":"then","element":"span"}],[{"style":{"width":"96%"},"width":1532,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-19.png","element":"img"}],[{"text":"by the induction hypothesis. 
Alternatively, if ","element":"span"},{"style":{"height":23.52},"width":256.56,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-20.png","element":"img","alt":" ⟨w(τ)j , xi⟩ ≥ 0","inline":true,"padRight":true},{"text":"one may use the second inequality from ","element":"span"},{"href":"#id-56","text":"(7) ","element":"a"},{"text":"to conclude that","element":"span"}],[{"style":{"width":"93%"},"width":1484,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-21.png","element":"img"}],[{"text":"In addition, as ","element":"span"},{"style":{"height":16.78},"width":498.96,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-22.png","element":"img","alt":" Tij(t0, τ + 1) ≤ Tij(t0, τ) + 1","inline":true,"padRight":true},{"text":"it follows that","element":"span"}],[{"style":{"width":"96%"},"width":1532,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-23.png","element":"img"}],[{"text":"which completes the induction.","element":"span"}],[{"text":"Lastly, we consider the final remark in the statement of the lemma: if ","element":"span"},{"style":{"height":16.8},"width":221.16,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-24.png","element":"img","alt":" Tij(t0, t) = 0","inline":true,"padRight":true},{"text":"then the right-hand side of the second line in ","element":"span"},{"href":"#id-56","text":"(7) ","element":"a"},{"text":"is trivially non-negative, so we do not need the additional ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/23-25.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"term.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence of training","element":"span"}],[{"text":"We say that GD terminates if it 
reaches a finite iteration in which a zero update is applied to the network parameters. The following lemmas are used to show that GD terminates by upper bounding, in turn, the number of clean and corrupt updates. The first lemma is used to bound the hinge loss of clean and corrupt points.","element":"span"}],[{"id":"id-58","style":{"fontWeight":"bold"},"text":"Lemma B.4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any iterations ","element":"span"},{"style":{"height":13.2},"width":62.48,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-0.png","element":"img","alt":" t, t0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfying ","element":"span"},{"style":{"height":12.8},"width":110.8,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-1.png","element":"img","alt":" t ≥ t0,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"1. if ","element":"span"},{"style":{"height":13.58},"width":190.36,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-2.png","element":"img","alt":" i ∈ ST then","inline":true}],[{"style":{"width":"86%"},"width":1376,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"2. if ","element":"span"},{"style":{"height":14},"width":191.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-4.png","element":"img","alt":" i ∈ SF then","inline":true}],[{"style":{"width":"86%"},"width":1376,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Both statements follow from the bounds provided in Lemma ","element":"span"},{"href":"#id-57","text":"B.3. 
","element":"a"},{"text":"For Statement 1","element":"span"}],[{"style":{"width":"99%"},"width":1574,"height":400,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-6.png","element":"img"}],[{"text":"For Statement 2","element":"span"}],[{"style":{"width":"100%"},"width":1621,"height":401,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-7.png","element":"img"}],[{"text":"The following lemma bounds the number of updates of corrupt and clean points in an interval of iterations in terms of their hinge loss at the beginning of the interval as well as the number of clean and corrupt updates.","element":"span"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"Lemma B.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":13.2},"width":310.32,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-8.png","element":"img","alt":" t ≥ t0. For i ∈ ST ,","inline":true}],[{"style":{"width":"69%"},"width":1101,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":13.6},"width":124.04,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-10.png","element":"img","alt":" i ∈ SF ,","inline":true}],[{"style":{"width":"69%"},"width":1101,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We will show this for ","element":"span"},{"style":{"height":14},"width":302,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-12.png","element":"img","alt":" i ∈ ST ; the i ∈ SF","inline":true,"padRight":true},{"text":"case is analogous but with the roles of corrupt and clean points reversed. 
We proceed by induction on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and assume ","element":"span"},{"style":{"height":16},"width":209.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-13.png","element":"img","alt":" ℓ(t0, xi) ≤ a","inline":true},{"text":". If ","element":"span"},{"style":{"height":12.38},"width":97.92,"height":30.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-14.png","element":"img","alt":" t = t0","inline":true,"padRight":true},{"text":"this holds trivially because the left-hand side is zero and the right-hand side is positive. Otherwise, assume the inequality holds at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". By Lemma ","element":"span"},{"href":"#id-58","text":"B.4 ","element":"a"},{"text":"and our assumption on ","element":"span"},{"style":{"height":16},"width":145.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-15.png","element":"img","alt":" ℓ(t0, xi),","inline":true}],[{"style":{"width":"85%"},"width":1357,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/24-16.png","element":"img"}],[{"text":"We consider two cases:","element":"span"}],[{"text":"1. 
If ","element":"span"},{"style":{"height":18.18},"width":949,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-0.png","element":"img","alt":" η(Ti(t0, t)−(γ+ρ)B(t0, t)−ϕ(ρ−γ)G(i)(t0, t)−m) ≥ a","inline":true,"padRight":true},{"text":"then we see that ","element":"span"},{"style":{"height":16},"width":200.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-1.png","element":"img","alt":" ℓ(t, xi) = 0.","inline":true,"padRight":true},{"text":"Therefore,","element":"span"}],[{"style":{"width":"48%"},"width":765,"height":235,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-2.png","element":"img"}],[{"text":"2. Otherwise, ","element":"span"},{"style":{"height":22.08},"width":930.52,"height":55.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-3.png","element":"img","alt":" Ti(t0, t) ≤ a/η + (γ + ρ)B(t0, t) + ϕ(ρ − γ)G(i)(t0, t) + m","inline":true},{"text":". Since there are only ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"neurons, we bound","element":"span"}],[{"style":{"width":"82%"},"width":1313,"height":310,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-4.png","element":"img"}]]},{"heading":"Appendix C Benign overfitting","paragraphs":[[{"id":"id-59","style":{"fontWeight":"bold"},"text":"Assumption 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With ","element":"span"},{"style":{"height":16},"width":303.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-5.png","element":"img","alt":" δ, ρ ∈ (0, 1) and C","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"a generic positive constant, we assume the following conditions on the data and model hyperparameters.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. 
","element":"span"},{"style":{"height":19.38},"width":233.92,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-6.png","element":"img","alt":" n ≥ C log( 3δ ),","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. ","element":"span"},{"style":{"height":19.36},"width":262.64,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-7.png","element":"img","alt":" m ≥ C log( 6kδ ),","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. ","element":"span"},{"style":{"height":29.2},"width":493.56,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-8.png","element":"img","alt":" d ≥ max�3, 3ρ−2 ln�9n2δ ��.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k < ","element":"span"},{"style":{"height":16.58},"width":63.44,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-9.png","element":"img","alt":"n100,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"5. ","element":"span"},{"text":"ln(9 ","element":"span"},{"style":{"height":29.01},"width":382,"height":72.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-10.png","element":"img","alt":"d ≤ γ ≤ 45n,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"λ","element":"span"},{"style":{"height":12.4},"width":110.12,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-11.png","element":"img","alt":"w < η.","inline":true}],[{"text":"In addition to the assumptions detailed in Assumption ","element":"span"},{"href":"#id-59","text":"2, ","element":"a"},{"text":"in our analysis we use three further conditions.","element":"span"}],[{"id":"id-60","style":{"fontWeight":"bold"},"text":"Assumption 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":395.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-12.png","element":"img","alt":" ρ ∈ (0, 1) satisfy γ ≥ 5ρ","inline":true},{"style":{"fontStyle":"italic"},"text":". In addition to the assumptions detailed in Assumption ","element":"span"},{"href":"#id-59","style":{"fontStyle":"italic"},"text":"2, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"assume that the following conditions hold.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"height":16.78},"width":486.96,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-13.png","element":"img","alt":"Γp| > 0.99m for p ∈ {−1, 1}.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. For all ","element":"span"},{"style":{"height":23.54},"width":684.44,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-14.png","element":"img","alt":" i ∈ SF there is j ∈ Γyi such that i ∈ A(0)j .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. 
For all ","element":"span"},{"style":{"height":19.89},"width":626.04,"height":49.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-15.png","element":"img","alt":" i, l ∈ [2n], i ≠ l then |⟨ni, nl⟩| ≤ ρ/(1−γ).","inline":true}],[{"text":"We remark that under these conditions, for sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"the inequalities ","element":"span"},{"style":{"height":13.81},"width":81,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-16.png","element":"img","alt":" ρ ≤","inline":true},{"style":{"height":29.6},"width":1113,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-17.png","element":"img","alt":"min{((n−3k)/(n+k)) γ, 1/(6(n−k))} and γ + ρ < min{…}","inline":true,"padRight":true},{"text":"are satisfied. As shown in the following lemma, these three additional conditions hold with high probability over the randomness of the initialization and training sample.","element":"span"}],[{"id":"id-79","style":{"fontWeight":"bold"},"text":"Lemma C.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that for any ","element":"span"},{"style":{"height":16},"width":158.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-18.png","element":"img","alt":" δ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if ","element":"span"},{"style":{"height":19.38},"width":224.68,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-19.png","element":"img","alt":" n ≥ C log( 3δ )","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the extra conditions of Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold with probability at least ","element":"span"},{"style":{"height":12},"width":97.84,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-20.png","element":"img","alt":" 1 − δ.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Using Lemma ","element":"span"},{"href":"#id-50","text":"A.3, ","element":"a"},{"text":"under the Assumption ","element":"span"},{"href":"#id-60","text":"3 ","element":"a"},{"text":"for sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"there exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"such that the probability the first condition does not hold is at most ","element":"span"},{"style":{"height":16},"width":164.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/25-21.png","element":"img","alt":" exp(−cn)","inline":true},{"text":". 
Alternatively,","element":"span"}],[{"text":"setting ","element":"span"},{"style":{"height":16},"width":273.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-0.png","element":"img","alt":" δ ≥ 3 exp(−cn)","inline":true,"padRight":true},{"text":"and rearranging, as long as ","element":"span"},{"style":{"height":19.38},"width":244.4,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-1.png","element":"img","alt":" n ≥ C log(3/δ)","inline":true},{"text":", the probability that the first condition does not hold is at most ","element":"span"},{"style":{"height":20},"width":16,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-2.png","element":"img","alt":"δ/3","inline":true},{"text":". Conditioned on the first event, by Lemma ","element":"span"},{"href":"#id-61","text":"A.6 ","element":"a"},{"text":"if ","element":"span"},{"style":{"height":19.36},"width":223.44,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-3.png","element":"img","alt":"n ≥ C log(3/δ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.38},"width":263.52,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-4.png","element":"img","alt":" m ≥ C log(6k/δ)","inline":true},{"text":" then the probability that the second condition does not hold is also at most ","element":"span"},{"style":{"height":20},"width":16.52,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-5.png","element":"img","alt":"δ","inline":true},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-6.png","element":"img","alt":"3","inline":true},{"text":". 
Therefore, the probability that the first two events hold is at least ","element":"span"},{"style":{"height":20.18},"width":300.4,"height":50.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-7.png","element":"img","alt":" (1 − δ/3)² ≥ 1 − 2δ/3 ","inline":true,"padRight":true},{"text":". For the third ","element":"span"},{"text":"condition, noting ","element":"span"},{"style":{"height":19.89},"width":350.52,"height":49.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-8.png","element":"img","alt":"ρ/(1−γ) > ρ and ρ ≤ γ/5","inline":true},{"text":", by Lemma ","element":"span"},{"href":"#id-62","text":"A.1 ","element":"a"},{"text":"the probability that the third condition does ","element":"span"},{"text":"not hold is also at most ","element":"span"},{"style":{"height":19.81},"width":16.48,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-9.png","element":"img","alt":"δ/3","inline":true},{"text":". Therefore, by a union bound, all three conditions hold with probability at ","element":"span"},{"text":"least ","element":"span"},{"style":{"height":12},"width":97.84,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-10.png","element":"img","alt":" 1 − δ.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"C.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-63","style":{"fontWeight":"bold"},"text":"3.2","element":"a"}],[{"text":"The following lemma gives an iteration-independent upper bound on the number of clean and corrupt updates. This result is key to proving that GD terminates.","element":"span"}],[{"id":"id-74","style":{"fontWeight":"bold"},"text":"Lemma C.2 ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-63","text":"3.2)","element":"a"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. Suppose further that at some epoch ","element":"span"},{"style":{"height":12.61},"width":28.48,"height":31.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-11.png","element":"img","alt":" t0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the loss of every clean point is bounded above by ","element":"span"},{"style":{"height":15.6},"width":142.32,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-12.png","element":"img","alt":" a ∈ R≥0","inline":true},{"style":{"fontStyle":"italic"},"text":", while the loss of every corrupted point is bounded above by ","element":"span"},{"style":{"height":15.58},"width":135.56,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-13.png","element":"img","alt":" b ∈ R≥0","inline":true},{"style":{"fontStyle":"italic"},"text":". Then the total number of updates which occurs after this epoch is upper bounded as follows,","element":"span"}],[{"style":{"width":"80%"},"width":1275,"height":220,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-14.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":12.8},"width":109.8,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-15.png","element":"img","alt":" t ≥ t0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"From Lemma ","element":"span"},{"href":"#id-64","text":"B.5, ","element":"a"},{"style":{"height":14},"width":95.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-16.png","element":"img","alt":" ρ ≤ γ","inline":true},{"text":", and the assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"69%"},"width":1107,"height":240,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-17.png","element":"img"}],[{"text":"Substituting these bounds into each other, and as ","element":"span"},{"style":{"height":18.19},"width":417.12,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-18.png","element":"img","alt":" γ + ρ < (4(n − k)k)−1/2 ","inline":true,"padRight":true},{"text":"under Assumption ","element":"span"},{"href":"#id-60","text":"3, ","element":"a"},{"text":"we arrive at the iteration independent bound on the number of updates as claimed in the statement of the theorem.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Early training and proof of Lemma ","element":"span"},{"href":"#id-65","style":{"fontWeight":"bold"},"text":"3.3","element":"a"}],[{"id":"id-66","style":{"fontWeight":"bold"},"text":"Lemma C.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"then for ","element":"span"},{"style":{"height":14},"width":281.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-19.png","element":"img","alt":" i ∈ ST and j ∈ Γ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"it follows that ","element":"span"},{"style":{"height":23.54},"width":381.96,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-20.png","element":"img","alt":" ⟨w(1)j , xi⟩ > 0 iff i ∼ j.","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":16.4},"width":423.32,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-21.png","element":"img","alt":" i ∈ SF , j ∈ Θp, and i ̸∼ j","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"it follows that ","element":"span"},{"style":{"height":23.52},"width":424.84,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-22.png","element":"img","alt":" ⟨w(1)j , xi⟩ > 0 if i ∈ A(0)j .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Suppose ","element":"span"},{"style":{"height":14},"width":384.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-23.png","element":"img","alt":" j ∈ Γ, i ∼ j, i ∈ ST","inline":true,"padRight":true},{"text":". 
Recall from the definition of ","element":"span"},{"style":{"height":15.58},"width":41.92,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-24.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"text":"that ","element":"span"},{"style":{"height":23.52},"width":330.32,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-25.png","element":"img","alt":" G(i)j (0, 1)(γ − ρ) −","inline":true},{"style":{"height":24.58},"width":394.8,"height":61.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-26.png","element":"img","alt":"B(i)j (0, 1)(γ + ρ) ≥ 2λw/η ","inline":true,"padRight":true},{"text":". Using Lemma ","element":"span"},{"href":"#id-54","text":"B.2","element":"a"}],[{"style":{"width":"81%"},"width":1293,"height":274,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-27.png","element":"img"}],[{"text":"On the other hand, if ","element":"span"},{"style":{"height":14.99},"width":82,"height":37.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-28.png","element":"img","alt":" i ̸∼ j","inline":true,"padRight":true},{"text":"then again from Lemma ","element":"span"},{"href":"#id-54","text":"B.2","element":"a"}],[{"style":{"width":"81%"},"width":1293,"height":200,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/26-29.png","element":"img"}],[{"style":{"width":"99%"},"width":1582,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-0.png","element":"img"}],[{"id":"id-69","style":{"fontWeight":"bold"},"text":"Lemma C.4 ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-65","text":"3.3)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. 
Let ","element":"span"},{"style":{"height":15.58},"width":117.36,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-1.png","element":"img","alt":" j ∈ Γp","inline":true},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":13.98},"width":194.4,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-2.png","element":"img","alt":" 0 < t < T0","inline":true},{"style":{"fontStyle":"italic"},"text":". A point ","element":"span"},{"style":{"height":23.52},"width":130.76,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-3.png","element":"img","alt":"i ∈ A(t)j","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if one of the following conditions holds:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"height":14.4},"width":275.2,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-4.png","element":"img","alt":" i ∈ ST and i ∼ j","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. ","element":"span"},{"style":{"height":23.52},"width":455.24,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-5.png","element":"img","alt":" i ∈ SF , i ̸∼ j, and i ∈ A(1)j .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Furthermore, if one of the following conditions holds, then ","element":"span"},{"style":{"height":23.54},"width":146.2,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-6.png","element":"img","alt":" i ∉ A(t)j :","inline":true}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"height":15.2},"width":275.2,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-7.png","element":"img","alt":" i ∈ ST and i ̸∼ j","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. 
","element":"span"},{"style":{"height":23.52},"width":455.24,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-8.png","element":"img","alt":" i ∈ SF , i ̸∼ j, and i /∈ A(1)j .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We proceed by induction. For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1","element":"span"},{"text":", the ","element":"span"},{"style":{"height":13.18},"width":117.48,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-9.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":"case was shown in Lemma ","element":"span"},{"href":"#id-66","text":"C.3 ","element":"a"},{"text":"and the ","element":"span"},{"style":{"height":15.2},"width":216.48,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-10.png","element":"img","alt":"i ∈ SF , i ̸∼ j","inline":true,"padRight":true},{"text":"case is clear. Now, suppose the lemma holds for iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and consider iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"+ 1","element":"span"},{"text":". First let ","element":"span"},{"style":{"height":23.52},"width":493.08,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-11.png","element":"img","alt":" i ∈ SF , i ̸∼ j. 
If i ∈ A(1)j then","inline":true}],[{"style":{"width":"94%"},"width":1499,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-12.png","element":"img"}],[{"text":"Here the first line is Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"text":"the second line comes from the inductive hypothesis, and the third line comes from ","element":"span"},{"style":{"height":19.38},"width":242.68,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-13.png","element":"img","alt":" (γ + ρ) < 1/(n−k) ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". If ","element":"span"},{"style":{"height":23.52},"width":215.72,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-14.png","element":"img","alt":" i ∉ A(1)j then","inline":true}],[{"style":{"width":"94%"},"width":1499,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-15.png","element":"img"}],[{"text":"Again, the first line is Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"text":"the second line uses the inductive hypothesis, and the fourth line uses ","element":"span"},{"style":{"height":20.98},"width":183.44,"height":52.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-16.png","element":"img","alt":" ρ ≤ ((n−3k)/(n+k)) γ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":".","element":"span"}],[{"text":"Now, let ","element":"span"},{"style":{"height":13.18},"width":109.56,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-17.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":". 
We again use, in order, Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"text":"the inductive hypothesis, and ","element":"span"},{"style":{"height":20.98},"width":183.4,"height":52.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-18.png","element":"img","alt":" ρ ≤ ((n−3k)/(n+k)) γ","inline":true},{"text":". If ","element":"span"},{"style":{"height":14.4},"width":164.2,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-19.png","element":"img","alt":"i ∼ j then","inline":true}],[{"style":{"width":"95%"},"width":1506,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-20.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":15.2},"width":164.2,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-21.png","element":"img","alt":" i ̸∼ j then","inline":true}],[{"style":{"width":"97%"},"width":1544,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-22.png","element":"img"}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"Lemma C.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. For all ","element":"span"},{"style":{"height":14},"width":221.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-23.png","element":"img","alt":" t0 ≤ t1 < T0,","inline":true}],[{"style":{"width":"43%"},"width":697,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/27-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"First we claim that for all ","element":"span"},{"style":{"height":15.2},"width":421.68,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-0.png","element":"img","alt":" i ∈ ST , j ̸∼ i, and t < T0,","inline":true}],[{"style":{"width":"31%"},"width":501,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-1.png","element":"img"}],[{"text":"We prove the claim by induction. The base case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 ","element":"span"},{"text":"follows because ","element":"span"},{"style":{"height":23.52},"width":272.48,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-2.png","element":"img","alt":" ⟨w(0)j , xi⟩ ≤ λw","inline":true},{"text":". Now ","element":"span"},{"text":"suppose it is true at iteration ","element":"span"},{"style":{"height":23.52},"width":305.16,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-3.png","element":"img","alt":" t. If ⟨w(t)j , xi⟩ > 0","inline":true,"padRight":true},{"text":"then by Lemma ","element":"span"},{"href":"#id-54","text":"B.2,","element":"a"}],[{"style":{"width":"67%"},"width":1064,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-4.png","element":"img"}],[{"text":"using ","element":"span"},{"style":{"height":19.38},"width":182.96,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-5.png","element":"img","alt":" γ + ρ < 12k","inline":true},{"text":". From this, the claim follows. Otherwise,","element":"span"}],[{"style":{"width":"67%"},"width":1064,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-6.png","element":"img"}],[{"text":"We now turn to the statement of the lemma, again proceeding by induction. 
The base case ","element":"span"},{"style":{"height":13.18},"width":154.64,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-7.png","element":"img","alt":" t1 = t0 is","inline":true,"padRight":true},{"text":"clear. Otherwise, we consider two cases:","element":"span"}],[{"style":{"width":"92%"},"width":1468,"height":354,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-8.png","element":"img"}],[{"text":"by the claim and ","element":"span"},{"style":{"height":14.4},"width":122.16,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-9.png","element":"img","alt":" λw < η","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". Therefore, ","element":"span"},{"style":{"height":16.78},"width":610.44,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-10.png","element":"img","alt":" Gj(t0, t1 +1) ≤ Gj(t0, t1)+(n−k).","inline":true}],[{"style":{"width":"94%"},"width":1496,"height":406,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-11.png","element":"img"}],[{"id":"id-76","style":{"fontWeight":"bold"},"text":"Lemma C.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. For all ","element":"span"},{"style":{"height":14.8},"width":426.28,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-12.png","element":"img","alt":" t < T0, i ∈ SF , and i ∼ j,","inline":true}],[{"style":{"width":"66%"},"width":1060,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Consider ","element":"span"},{"style":{"height":14},"width":105.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-14.png","element":"img","alt":" t < T0","inline":true},{"text":". We consider three cases","element":"span"}],[{"text":"1. If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 ","element":"span"},{"text":"then ","element":"span"},{"style":{"height":23.52},"width":275.88,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-15.png","element":"img","alt":"⟨w(0)j , xi⟩ ≤ λw.","inline":true}],[{"text":"2. If ","element":"span"},{"style":{"height":23.54},"width":272.76,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-16.png","element":"img","alt":" ⟨w(t−1)j , xi⟩ ≤ 0","inline":true,"padRight":true},{"text":"then by Lemma ","element":"span"},{"href":"#id-54","text":"B.2","element":"a"}],[{"style":{"width":"83%"},"width":1326,"height":203,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/28-17.png","element":"img"}],[{"text":"3. 
If ","element":"span"},{"style":{"height":23.52},"width":272.8,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-0.png","element":"img","alt":" ⟨w(t−1)j , xi⟩ > 0","inline":true,"padRight":true},{"text":"then let ","element":"span"},{"style":{"height":13.01},"width":91.52,"height":32.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-1.png","element":"img","alt":" t′ < t","inline":true,"padRight":true},{"text":"be the smallest iteration such that ","element":"span"},{"style":{"height":23.52},"width":237.88,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-2.png","element":"img","alt":" ⟨w(τ)j , xi⟩ > 0","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":14.4},"width":166.48,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-3.png","element":"img","alt":"t′ ≤ τ < t","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-67","text":"C.5, ","element":"a"},{"text":"and the previous two cases above,","element":"span"}],[{"style":{"width":"90%"},"width":1438,"height":435,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-68","style":{"fontWeight":"bold"},"text":"3.4","element":"a"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Lemma C.7 ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-68","text":"3.4)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. 
There is an iteration ","element":"span"},{"style":{"height":14},"width":140.88,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-5.png","element":"img","alt":" T1 < T0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"during training and expressions ","element":"span"},{"style":{"height":13.6},"width":246.8,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-6.png","element":"img","alt":" C1, C2, and C3","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where the following hold:","element":"span"}],[{"style":{"width":"59%"},"width":944,"height":501,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Furthermore,","element":"span"}],[{"style":{"width":"44%"},"width":705,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Fix ","element":"span"},{"style":{"height":13.2},"width":109.56,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-9.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":". 
At every iteration ","element":"span"},{"style":{"height":14},"width":356.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-10.png","element":"img","alt":" 1 ≤ t < T0, we bound","inline":true}],[{"style":{"width":"89%"},"width":1416,"height":819,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-11.png","element":"img"}],[{"text":"where we use in order: ","element":"span"},{"style":{"height":14},"width":108.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-12.png","element":"img","alt":" t < T0","inline":true},{"text":", the definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":", Lemma ","element":"span"},{"href":"#id-69","text":"C.4, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-57","text":"B.3, ","element":"a"},{"style":{"height":15.2},"width":231.88,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-13.png","element":"img","alt":" Γ−1 ∩ Γ1 = ∅","inline":true},{"text":", and Lemma ","element":"span"},{"href":"#id-69","text":"C.4 ","element":"a"},{"text":"again. We also use ","element":"span"},{"style":{"height":16.8},"width":224.48,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-14.png","element":"img","alt":" |Γp| ≥ 0.99m","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". 
We further simplify this bound to conclude","element":"span"}],[{"style":{"width":"54%"},"width":867,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/29-15.png","element":"img"}],[{"text":"Additionally, we bound","element":"span"}],[{"style":{"width":"69%"},"width":1099,"height":473,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-0.png","element":"img"}],[{"text":"Therefore, as long as","element":"span"}],[{"style":{"width":"88%"},"width":1398,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-1.png","element":"img"}],[{"text":"then","element":"span"}],[{"style":{"width":"87%"},"width":1390,"height":225,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-2.png","element":"img"}],[{"text":"Notice that this does not depend on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":". Therefore, we can let ","element":"span"},{"style":{"height":13.98},"width":37.72,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-3.png","element":"img","alt":" T1","inline":true,"padRight":true},{"text":"be the largest integer satisfying this bound for ","element":"span"},{"style":{"height":16},"width":643.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-4.png","element":"img","alt":" t and bound ℓ(T1, xl) > 0 for all l ∈ ST","inline":true,"padRight":true},{"text":". 
To verify that ","element":"span"},{"style":{"height":14.38},"width":421.28,"height":35.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-5.png","element":"img","alt":" T1 < T0, consider i ∈ SF :","inline":true}],[{"style":{"width":"66%"},"width":1062,"height":416,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-6.png","element":"img"}],[{"text":"This is less than 1 for all ","element":"span"},{"style":{"height":17.1},"width":308.84,"height":42.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-7.png","element":"img","alt":" t < T1 since k ≤ n3 ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":".","element":"span"}],[{"text":"Now, fix ","element":"span"},{"style":{"height":13.2},"width":109.56,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-8.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":"again. 
For ","element":"span"},{"style":{"height":13.6},"width":83.84,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-9.png","element":"img","alt":" i ∼ j","inline":true},{"text":", we can then use Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-69","text":"C.4:","element":"a"}],[{"style":{"width":"86%"},"width":1369,"height":549,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/30-10.png","element":"img"}],[{"text":"using ","element":"span"},{"style":{"height":22.18},"width":225.4,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-0.png","element":"img","alt":" ρ ≤ 1/(6(n−k)), η","inline":true,"padRight":true},{"text":"is sufficiently small, and ","element":"span"},{"style":{"height":19.38},"width":392.4,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-1.png","element":"img","alt":" γ + ρ < min{1/(99k), 1/100}","inline":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". Now assume ","element":"span"},{"style":{"height":15.2},"width":83.88,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-2.png","element":"img","alt":"i ̸∼ j","inline":true},{"text":". 
Using Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-69","text":"C.4 ","element":"a"},{"text":"we can bound","element":"span"}],[{"style":{"width":"86%"},"width":1369,"height":546,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-3.png","element":"img"}],[{"text":"using ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-4.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"is sufficiently small and ","element":"span"},{"style":{"height":17.41},"width":306.32,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-5.png","element":"img","alt":" k ≤ n100 and ρ ≤ γ5 ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". Likewise, for ","element":"span"},{"style":{"height":14},"width":190.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-6.png","element":"img","alt":" 1 ≤ t < T1,","inline":true}],[{"style":{"width":"92%"},"width":1474,"height":1072,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-7.png","element":"img"}],[{"text":"In the first six lines we use: ","element":"span"},{"style":{"height":14},"width":105.2,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-8.png","element":"img","alt":" t < T0","inline":true},{"text":", the definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":", Lemma ","element":"span"},{"href":"#id-69","text":"C.4, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"and Lemma 
","element":"span"},{"href":"#id-57","text":"B.3, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-69","text":"C.4 ","element":"a"},{"text":"again, and ","element":"span"},{"style":{"height":16.8},"width":224.44,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-9.png","element":"img","alt":" |Γp| ≥ 0.99m","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":", respectively.","element":"span"}],[{"text":"We also bound","element":"span"}],[{"style":{"width":"72%"},"width":1146,"height":474,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/31-10.png","element":"img"}],[{"text":"Combining these two bounds we see that","element":"span"}],[{"style":{"width":"87%"},"width":1384,"height":480,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-0.png","element":"img"}],[{"text":"at this iteration, as desired. Here we use ","element":"span"},{"style":{"height":29.2},"width":561.52,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-1.png","element":"img","alt":" (γ + ρ) ≤ min� 199k, 1100, 1n−k�, η","inline":true,"padRight":true},{"text":"is sufficiently small, and ","element":"span"},{"style":{"height":22.16},"width":181.08,"height":55.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-2.png","element":"img","alt":"ρ ≤ 16(n−k) ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"C.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Late training","element":"span"}],[{"id":"id-75","style":{"fontWeight":"bold"},"text":"Lemma C.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. 
Fix ","element":"span"},{"style":{"height":11.39},"width":89.52,"height":28.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-3.png","element":"img","alt":" ε > 0","inline":true},{"style":{"fontStyle":"italic"},"text":". We will say a neuron is ","element":"span"},{"text":"aligned ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":") if","element":"span"}],[{"style":{"width":"25%"},"width":408,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":14.8},"width":360.64,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-5.png","element":"img","alt":" i ∈ ST . For t ≥ T1, if","inline":true}],[{"style":{"width":"16%"},"width":255,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"then at least ","element":"span"},{"style":{"height":16},"width":153.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-7.png","element":"img","alt":" (1 − ε)m","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"neurons in each ","element":"span"},{"style":{"height":15.6},"width":39.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-8.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"will be aligned.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":200.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-9.png","element":"img","alt":" p ∈ {−1, 1}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"be such that ","element":"span"},{"style":{"height":7.81},"width":52,"height":19.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-10.png","element":"img","alt":" εm","inline":true,"padRight":true},{"text":"different neurons in ","element":"span"},{"style":{"height":15.6},"width":41.88,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-11.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"text":"are unaligned at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". For any neuron index ","element":"span"},{"style":{"height":16},"width":297.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-12.png","element":"img","alt":" j ∈ Γp and i ∈ ST","inline":true,"padRight":true},{"text":", we can use Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"to bound","element":"span"}],[{"style":{"width":"69%"},"width":1097,"height":341,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-13.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":14.4},"width":130.8,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-14.png","element":"img","alt":" nγ < 1","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":", we see that ","element":"span"},{"style":{"height":20.21},"width":363.68,"height":50.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-15.png","element":"img","alt":" min{ 34m, 3nγ4m } = 
3nγ4m","inline":true,"padRight":true},{"text":". If the lower bound above is ","element":"span"},{"text":"positive then ","element":"span"},{"style":{"height":23.52},"width":418,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-16.png","element":"img","alt":" (−1)j sgn⟨w(t)j , xi⟩ = yi","inline":true},{"text":". Therefore, if a neuron ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"is unaligned then ","element":"span"},{"style":{"height":16.78},"width":197.96,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-17.png","element":"img","alt":" Bj(T1, t) ≥","inline":true},{"style":{"height":10},"width":53.6,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-18.png","element":"img","alt":"3nγ","inline":true},{"style":{"height":18.7},"width":192.8,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-19.png","element":"img","alt":"4ηm(γ+ρ) ≥","inline":true},{"style":{"height":22.18},"width":282.4,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-20.png","element":"img","alt":"5n8ηm (using ρ ≤ γ5 ","inline":true,"padRight":true},{"text":"from Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". 
If there are ","element":"span"},{"style":{"height":7.2},"width":53.6,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-21.png","element":"img","alt":" εm","inline":true,"padRight":true},{"text":"unaligned neurons, then","element":"span"}],[{"style":{"width":"34%"},"width":542,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-22.png","element":"img"}],[{"text":"Denote the first iteration after ","element":"span"},{"style":{"height":13.98},"width":37.72,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-23.png","element":"img","alt":" T1","inline":true,"padRight":true},{"text":"where more than ","element":"span"},{"style":{"height":7.2},"width":53.56,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-24.png","element":"img","alt":" εm","inline":true,"padRight":true},{"text":"neurons in one of the ","element":"span"},{"style":{"height":15.58},"width":41.92,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-25.png","element":"img","alt":" Γp","inline":true,"padRight":true},{"text":"are unaligned as ","element":"span"},{"style":{"height":13.98},"width":48.76,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-26.png","element":"img","alt":" Tε.","inline":true,"padRight":true},{"text":"If no such iteration exists, let ","element":"span"},{"style":{"height":13.98},"width":133.68,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-27.png","element":"img","alt":" Tε = ∞","inline":true},{"text":". 
We will eventually show that indeed ","element":"span"},{"style":{"height":14.19},"width":130.48,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-28.png","element":"img","alt":" Tε = ∞","inline":true},{"text":", by showing that the training process reaches zero loss before such an iteration can happen.","element":"span"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"Lemma C.9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds and also ","element":"span"},{"style":{"height":21.63},"width":286.48,"height":54.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-29.png","element":"img","alt":" γ + ρ < 0.99(1−ε)4k","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". There is an iteration ","element":"span"},{"style":{"height":14},"width":130.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-30.png","element":"img","alt":" T2 ≥ T1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"so that for all iterations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfying ","element":"span"},{"style":{"height":14},"width":453.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-31.png","element":"img","alt":" T2 ≤ t < Tε and all i ∈ ST ,","inline":true}],[{"style":{"width":"25%"},"width":401,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-32.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Furthermore, we can choose ","element":"span"},{"style":{"height":13.98},"width":157.88,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-33.png","element":"img","alt":" T2 so 
that","inline":true}],[{"style":{"width":"48%"},"width":776,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/32-34.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Fix ","element":"span"},{"style":{"height":16},"width":721.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-0.png","element":"img","alt":" i ∈ ST and t < Tε − 1. Suppose ℓ(t, xi) > 0","inline":true},{"text":". Using Lemma ","element":"span"},{"href":"#id-57","text":"B.3 ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":14},"width":116.28,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-1.png","element":"img","alt":" t < Tε,","inline":true}],[{"style":{"width":"77%"},"width":1234,"height":332,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-2.png","element":"img"}],[{"text":"Therefore,","element":"span"}],[{"style":{"width":"58%"},"width":932,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-3.png","element":"img"}],[{"text":"By Lemma ","element":"span"},{"href":"#id-70","text":"C.7, ","element":"a"},{"text":"the loss of each clean point at ","element":"span"},{"style":{"height":13.98},"width":37.72,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-4.png","element":"img","alt":" T1","inline":true,"padRight":true},{"text":"is at most ","element":"span"},{"style":{"height":19.39},"width":16.52,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-5.png","element":"img","alt":"13","inline":true},{"text":", so each clean point reaches zero loss ","element":"span"},{"text":"in at most","element":"span"}],[{"style":{"width":"35%"},"width":556,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-6.png","element":"img"}],[{"text":"iterations.","element":"span"}],[{"text":"Now suppose 
","element":"span"},{"style":{"height":16},"width":190.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-7.png","element":"img","alt":" ℓ(t, xi) = 0","inline":true},{"text":". We similarly argue","element":"span"}],[{"style":{"width":"77%"},"width":1227,"height":325,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-8.png","element":"img"}],[{"text":"This implies ","element":"span"},{"style":{"height":16},"width":338.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-9.png","element":"img","alt":" ℓ(t + 1, xi) ≤ 4ηmk","inline":true},{"text":". By induction, we see that if ","element":"span"},{"style":{"height":14},"width":106.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-10.png","element":"img","alt":" t < Tε","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":269.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-11.png","element":"img","alt":" ℓ(t, xi) ≤ 4ηmk","inline":true},{"text":", then ","element":"span"},{"style":{"height":16},"width":347.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-12.png","element":"img","alt":"ℓ(t + 1, xi) ≤ 4ηmk.","inline":true}],[{"id":"id-72","style":{"fontWeight":"bold"},"text":"Lemma C.10. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds and ","element":"span"},{"style":{"height":21.63},"width":293.24,"height":54.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-13.png","element":"img","alt":" γ + ρ < 0.99(1−ε)4k","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". 
For all ","element":"span"},{"style":{"height":13.2},"width":80.4,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-14.png","element":"img","alt":" t1, t2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfying ","element":"span"},{"style":{"height":14},"width":83.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-15.png","element":"img","alt":" T2 ≤","inline":true},{"style":{"height":14},"width":411.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-16.png","element":"img","alt":"t1 ≤ t2 < Tε and i ∈ ST ,","inline":true}],[{"style":{"width":"43%"},"width":689,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Recall Lemma ","element":"span"},{"href":"#id-64","text":"B.5 ","element":"a"},{"text":"(restated in this setting, using Lemma ","element":"span"},{"href":"#id-71","text":"C.9)","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"46%"},"width":739,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-18.png","element":"img"}],[{"text":"Using ","element":"span"},{"style":{"height":13.98},"width":272.36,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-19.png","element":"img","alt":" t < Tε we bound","inline":true}],[{"style":{"width":"63%"},"width":1009,"height":206,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-20.png","element":"img"}],[{"text":"From this, the desired inequality follows.","element":"span"}],[{"id":"id-73","style":{"fontWeight":"bold"},"text":"Lemma C.11. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds and ","element":"span"},{"style":{"height":29.84},"width":829.08,"height":74.6,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-21.png","element":"img","alt":" γ + ρ ≤ min�0.99(1−ε)4k ,�0.99(1−ε)8(n−k)k�. Let i ∈ SF .","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Suppose there is ","element":"span"},{"style":{"height":15.2},"width":240.84,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-22.png","element":"img","alt":" j ̸∼ i such that","inline":true}],[{"style":{"width":"42%"},"width":679,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-23.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfying ","element":"span"},{"style":{"height":23.74},"width":450.88,"height":59.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-24.png","element":"img","alt":" T2 ≤ t < Tε, ⟨w(t)j′ , xi⟩ > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some neuron ","element":"span"},{"style":{"height":15.6},"width":27.48,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/33-25.png","element":"img","alt":" j′","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"depending on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":14},"width":126.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-0.png","element":"img","alt":" τ0 ≥ T2","inline":true,"padRight":true},{"text":"be the first iteration after ","element":"span"},{"style":{"height":16},"width":346.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-1.png","element":"img","alt":" T2 where ℓ(t, xi) = 0","inline":true},{"text":". We will show by induction that for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"satisfying ","element":"span"},{"style":{"height":14},"width":203.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-2.png","element":"img","alt":" T2 ≤ t ≤ τ0","inline":true,"padRight":true},{"text":"that ","element":"span"},{"style":{"height":23.52},"width":236.72,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-3.png","element":"img","alt":" ⟨w(t)j , xi⟩ > 0","inline":true},{"text":". The case ","element":"span"},{"style":{"height":14},"width":109.96,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-4.png","element":"img","alt":" t = T2","inline":true,"padRight":true},{"text":"follows immediately by the ","element":"span"},{"text":"assumption of the lemma. Otherwise, assume ","element":"span"},{"style":{"height":23.52},"width":245.12,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-5.png","element":"img","alt":" ⟨w(t′)j , xi⟩ > 0","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":14},"width":190.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-6.png","element":"img","alt":" T2 ≤ t′ < t","inline":true},{"text":". 
By Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-72","text":"C.10,","element":"a"}],[{"style":{"width":"90%"},"width":1441,"height":739,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-7.png","element":"img"}],[{"text":"We now continue the induction past ","element":"span"},{"style":{"height":9.39},"width":31.48,"height":23.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-8.png","element":"img","alt":" τ0","inline":true},{"text":". If ","element":"span"},{"style":{"height":16},"width":192.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-9.png","element":"img","alt":" ℓ(t, xi) = 0","inline":true},{"text":", then point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"clearly activates some neuron. Let ","element":"span"},{"style":{"height":9.39},"width":30,"height":23.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-10.png","element":"img","alt":" τ1","inline":true,"padRight":true},{"text":"be the first iteration after ","element":"span"},{"style":{"height":16},"width":342.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-11.png","element":"img","alt":" τ0 where ℓ(t, xi) > 0","inline":true},{"text":". 
By Lemma ","element":"span"},{"href":"#id-57","text":"B.3","element":"a"}],[{"style":{"width":"85%"},"width":1352,"height":334,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-12.png","element":"img"}],[{"text":"This means","element":"span"}],[{"style":{"width":"60%"},"width":958,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-13.png","element":"img"}],[{"text":"and there is some ","element":"span"},{"style":{"height":14.4},"width":194.88,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-14.png","element":"img","alt":" j′ satisfying","inline":true}],[{"style":{"width":"66%"},"width":1047,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-15.png","element":"img"}],[{"text":"assuming ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-16.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"is sufficiently small (Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". 
We can run the original induction argument with ","element":"span"},{"style":{"height":9.18},"width":33.44,"height":22.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-17.png","element":"img","alt":"τ1","inline":true,"padRight":true},{"text":"replacing ","element":"span"},{"style":{"height":13.98},"width":37.72,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-18.png","element":"img","alt":" T2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":567.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-19.png","element":"img","alt":" τ2 = min{t ≥ τ1 : ℓ(t, xi) = 0}","inline":true,"padRight":true},{"text":"replacing ","element":"span"},{"style":{"height":9.6},"width":30.96,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-20.png","element":"img","alt":" τ0","inline":true,"padRight":true},{"text":"to verify the conclusion for ","element":"span"},{"style":{"height":12.8},"width":191.04,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-21.png","element":"img","alt":"τ1 ≤ t ≤ τ2","inline":true},{"text":". By alternating between these two arguments, we can show that point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"activates some neuron for all ","element":"span"},{"style":{"height":14},"width":209,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-22.png","element":"img","alt":" T2 ≤ t < Tε.","inline":true}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"Lemma C.12. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds, the training process reaches zero loss.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"In this proof, let ","element":"span"},{"style":{"height":19.38},"width":92.52,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-23.png","element":"img","alt":" ε = 15","inline":true},{"text":". The conditions of Lemma ","element":"span"},{"href":"#id-71","text":"C.9, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-72","text":"C.10, ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-73","text":"C.11 ","element":"a"},{"text":"hold ","element":"span"},{"text":"because ","element":"span"},{"style":{"height":29.6},"width":565.48,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/34-24.png","element":"img","alt":" γ + ρ ≤ min�0.99/5k ,�","inline":true,"padRight":true},{"text":"0 2","element":"span"}],[{"text":"By Lemma ","element":"span"},{"href":"#id-74","text":"C.2, ","element":"a"},{"text":"there is a finite bound on the number of updates, independent of the number of iterations spent training. If we carry out the training procedure for infinitely many iterations, there must be some iteration where we make no updates. Since the training procedure is deterministic, we will not make any updates after this point, and we will have converged. It remains to show that this convergence results in zero training loss. 
The only way for a point to not update any neurons is for that point’s loss to be zero or for that point to activate no neurons.","element":"span"}],[{"text":"Lemma ","element":"span"},{"href":"#id-75","text":"C.8 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-73","text":"C.11 ","element":"a"},{"text":"state that, under certain conditions, every clean point and every corrupted point activates some neuron at each iteration ","element":"span"},{"style":{"height":14},"width":104.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-0.png","element":"img","alt":" t ≤ Tε","inline":true},{"text":". We need only verify that these conditions hold and that ","element":"span"},{"style":{"height":16},"width":135.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-1.png","element":"img","alt":" B(T1, t)","inline":true,"padRight":true},{"text":"remains below the threshold set in Lemma ","element":"span"},{"href":"#id-75","text":"C.8.","element":"a"}],[{"text":"We apply Lemma ","element":"span"},{"href":"#id-74","text":"C.2 ","element":"a"},{"text":"starting at ","element":"span"},{"style":{"height":13.98},"width":131,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-2.png","element":"img","alt":" t0 = T1","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-70","text":"C.7, ","element":"a"},{"style":{"height":19.36},"width":224.16,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-3.png","element":"img","alt":" ℓ(T1, xi) ≤ 1/3","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":13.18},"width":117.44,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-4.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":". 
Using ","element":"span"},{"text":"Lemma ","element":"span"},{"href":"#id-76","text":"C.6, ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":13.58},"width":123.08,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-5.png","element":"img","alt":" i ∈ SF ,","inline":true}],[{"style":{"width":"44%"},"width":699,"height":326,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-6.png","element":"img"}],[{"text":"With these bounds, Lemma ","element":"span"},{"href":"#id-74","text":"C.2 ","element":"a"},{"text":"shows that for all ","element":"span"},{"style":{"height":14},"width":117.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-7.png","element":"img","alt":" t ≥ T1,","inline":true}],[{"style":{"width":"73%"},"width":1166,"height":425,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-8.png","element":"img"}],[{"text":"using ","element":"span"},{"style":{"height":19.38},"width":440.12,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-9.png","element":"img","alt":" γ + ρ ≤ min{ 1n−k, 199k}, η","inline":true,"padRight":true},{"text":"sufficiently small, and ","element":"span"},{"style":{"height":16.58},"width":127.76,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-10.png","element":"img","alt":" k ≤ n100","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". 
By Lemma ","element":"span"},{"href":"#id-75","text":"C.8, ","element":"a"},{"style":{"height":22.18},"width":343.64,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-11.png","element":"img","alt":"Tε = ∞ if n10η < 5εn8η ","inline":true,"padRight":true},{"text":", which is clearly true for ","element":"span"},{"style":{"height":19.38},"width":107.16,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-12.png","element":"img","alt":" ε = 15.","inline":true}],[{"text":"We now show that every training point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"activates at least one neuron each iteration. By Lemma ","element":"span"},{"href":"#id-71","text":"C.9, ","element":"a"},{"text":"this is true if ","element":"span"},{"style":{"height":13.2},"width":113.16,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-13.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":". By Lemma ","element":"span"},{"href":"#id-73","text":"C.11, ","element":"a"},{"text":"this is true for ","element":"span"},{"style":{"height":13.6},"width":114.16,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-14.png","element":"img","alt":" i ∈ SF","inline":true,"padRight":true},{"text":"if there is a neuron ","element":"span"},{"style":{"height":15.2},"width":89.4,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-15.png","element":"img","alt":" j ̸∼ i","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":24.98},"width":758.96,"height":62.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-16.png","element":"img","alt":"⟨w(T2)j , xi⟩ > η 2(n−k)(γ+ρ)(4k+3)0.99(1−ε) . 
Fix i ∈ SF .","inline":true}],[{"text":"First, assume that ","element":"span"},{"style":{"height":16},"width":190.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-17.png","element":"img","alt":" ℓ(t, xi) > 0","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":13.98},"width":105.4,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-18.png","element":"img","alt":" t < T2","inline":true},{"text":". By Assumption ","element":"span"},{"href":"#id-60","text":"3 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-66","text":"C.3, ","element":"a"},{"text":"we know there is ","element":"span"},{"style":{"height":23.52},"width":420.4,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-19.png","element":"img","alt":"j ∈ Γyi such that i ∈ A(1)j ","inline":true,"padRight":true},{"text":". Using Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-69","text":"C.4 ","element":"a"},{"text":"we can bound","element":"span"}],[{"style":{"width":"85%"},"width":1361,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-20.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"91%"},"width":1449,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-21.png","element":"img"}],[{"text":"By induction on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"we see that","element":"span"}],[{"style":{"width":"65%"},"width":1034,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-22.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":14},"width":106.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-23.png","element":"img","alt":" t > T1","inline":true},{"text":". 
The base case ","element":"span"},{"style":{"height":14},"width":177.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-24.png","element":"img","alt":" t = T1 + 1","inline":true,"padRight":true},{"text":"is clear. Suppose the inequality holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". Either ","element":"span"},{"style":{"height":16.8},"width":151.32,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-25.png","element":"img","alt":" Gj(T1, t)","inline":true,"padRight":true},{"text":"increases by at most ","element":"span"},{"style":{"height":10.8},"width":97.2,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-26.png","element":"img","alt":" n − k","inline":true,"padRight":true},{"text":"or there is some ","element":"span"},{"style":{"height":23.52},"width":288.36,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-27.png","element":"img","alt":" i′ ∈ ST ∩ A(t+1)j","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":15.2},"width":104,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-28.png","element":"img","alt":" i′ ̸∼ j","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-70","text":"C.7,","element":"a"}],[{"style":{"width":"78%"},"width":1237,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-29.png","element":"img"}],[{"text":"from which the inequality follows. 
Since ","element":"span"},{"style":{"height":24.37},"width":789.2,"height":60.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-30.png","element":"img","alt":"(γ+ρ)Bj(T1,t′)γ−ρ ≤ 3k(t′ − T1) < (n − k)(t′ − T1)","inline":true,"padRight":true},{"text":"(using ","element":"span"},{"style":{"height":17.41},"width":306.96,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-31.png","element":"img","alt":"ρ < γ5 and k ≤ n100 ","inline":true,"padRight":true},{"text":"from Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":", this maximum occurs at ","element":"span"},{"style":{"height":14},"width":112.8,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-32.png","element":"img","alt":" τ = T1","inline":true},{"text":". This yields","element":"span"}],[{"style":{"width":"72%"},"width":1146,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/35-33.png","element":"img"}],[{"text":"We want to show this bound is larger than a quantity that is ","element":"span"},{"style":{"height":16},"width":84.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-0.png","element":"img","alt":" O(η)","inline":true},{"text":". 
This happens when both of the following hold:","element":"span"}],[{"style":{"width":"33%"},"width":531,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-1.png","element":"img"}],[{"text":"which holds when ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-2.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"is sufficiently small.","element":"span"}],[{"text":"Now suppose ","element":"span"},{"style":{"height":16},"width":195.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-3.png","element":"img","alt":" ℓ(τ, xi) = 0","inline":true,"padRight":true},{"text":"for some iteration ","element":"span"},{"style":{"height":14},"width":112.8,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-4.png","element":"img","alt":" τ ≤ T2","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-70","text":"C.7, ","element":"a"},{"style":{"height":13.98},"width":113.72,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-5.png","element":"img","alt":" T1 < τ","inline":true},{"text":". In this case, we see that","element":"span"}],[{"style":{"width":"77%"},"width":1234,"height":414,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-6.png","element":"img"}],[{"text":"where in the third line we use Lemma ","element":"span"},{"href":"#id-74","text":"C.2 ","element":"a"},{"text":"and the fourth line we use ","element":"span"},{"style":{"height":19.38},"width":496.48,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-7.png","element":"img","alt":" γ + ρ ≤ min{ 1n−k, 199k} and η","inline":true,"padRight":true},{"text":"sufficiently small (Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". 
Since this is positive, there is some neuron ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":15.2},"width":83.84,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-8.png","element":"img","alt":" i ̸∼ j","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":23.52},"width":235.64,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-9.png","element":"img","alt":"ϕ(⟨w(T2)j , xi⟩)","inline":true,"padRight":true},{"text":"is at least ","element":"span"},{"style":{"height":19.38},"width":28,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-10.png","element":"img","alt":"1m ","inline":true,"padRight":true},{"text":"times this bound. This is an ","element":"span"},{"style":{"height":16},"width":80.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-11.png","element":"img","alt":" Ω(1)","inline":true,"padRight":true},{"text":"lower bound. 
Since the required condition is ","element":"span"},{"style":{"height":23.52},"width":317.76,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-12.png","element":"img","alt":"⟨w(T2)j , xi⟩ > O(η)","inline":true},{"text":", this can be achieved by taking ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-13.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"sufficiently small.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-41","style":{"fontWeight":"bold"},"text":"3.5","element":"a"}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"Lemma C.13 ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-41","text":"3.5)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. Let ","element":"span"},{"style":{"height":16},"width":200.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-14.png","element":"img","alt":" y ∈ {−1, 1}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be chosen uniformly and ","element":"span"},{"style":{"height":17.71},"width":394.76,"height":44.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-15.png","element":"img","alt":"x := y√γv + √1 − γn","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":17.39},"width":552.52,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-16.png","element":"img","alt":" n ∼ Uniform(Sd−1 ∩ span{v}⊥)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Suppose that ","element":"span"},{"style":{"height":19.89},"width":257.48,"height":49.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-17.png","element":"img","alt":" |⟨n, nℓ⟩| < ρ1−γ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":16},"width":490.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-18.png","element":"img","alt":" l ∈ [2n], then yf(Tend, x) > 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Following the same steps as in ","element":"span"},{"href":"#id-52","text":"(5) ","element":"a"},{"text":"for any ","element":"span"},{"style":{"height":16},"width":144.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-19.png","element":"img","alt":" j ∈ [2m]","inline":true}],[{"style":{"width":"85%"},"width":1350,"height":616,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-20.png","element":"img"}],[{"text":"Recall, from Lemma ","element":"span"},{"href":"#id-69","text":"C.4 ","element":"a"},{"text":"for any ","element":"span"},{"style":{"height":15.6},"width":130.04,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-21.png","element":"img","alt":" j ∈ Γp","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":16.8},"width":658.8,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-22.png","element":"img","alt":" Gj(1, Tend) ≥ Gj(1, T1) = T1(n − k)","inline":true},{"text":". 
As a consequence, for ","element":"span"},{"style":{"height":16},"width":251.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-23.png","element":"img","alt":" j ∈ Γp we have","inline":true}],[{"style":{"width":"71%"},"width":1138,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-24.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":16.99},"width":432.88,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-25.png","element":"img","alt":" j such that (−1)j = y then","inline":true}],[{"style":{"width":"78%"},"width":1249,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/36-26.png","element":"img"}],[{"style":{"width":"88%"},"width":1409,"height":435,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-0.png","element":"img"}],[{"text":"using ","element":"span"},{"style":{"height":16.78},"width":189.52,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-1.png","element":"img","alt":" |Γp| ≥ 0.99","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-60","text":"3)","element":"a"},{"text":". 
From Lemma ","element":"span"},{"href":"#id-70","text":"C.7","element":"a"}],[{"style":{"width":"45%"},"width":714,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-2.png","element":"img"}],[{"text":"furthermore, combining the assumptions ","element":"span"},{"style":{"height":14.4},"width":208.32,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-3.png","element":"img","alt":" 100k < n, η","inline":true,"padRight":true},{"text":"sufficiently small, and ","element":"span"},{"style":{"height":19.36},"width":221.48,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-4.png","element":"img","alt":" γ + ρ < 1n−k","inline":true,"padRight":true},{"text":"with ","element":"span"},{"text":"Lemma ","element":"span"},{"href":"#id-74","text":"C.2 ","element":"a"},{"text":"we see","element":"span"}],[{"style":{"width":"77%"},"width":1228,"height":195,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-5.png","element":"img"}],[{"text":"Here we also use that ","element":"span"},{"style":{"height":16},"width":673.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-6.png","element":"img","alt":" ℓ(0, xi) ≤ 1 + mλw = 1 + O(η) for all i.","inline":true}],[{"text":"Combining these inequalities it follows that","element":"span"}],[{"style":{"width":"58%"},"width":928,"height":186,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-7.png","element":"img"}],[{"text":"again using ","element":"span"},{"style":{"height":14.4},"width":198.76,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-8.png","element":"img","alt":" 100k < n, η","inline":true,"padRight":true},{"text":"sufficiently small, and ","element":"span"},{"style":{"height":19.38},"width":227.08,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-9.png","element":"img","alt":" γ + ρ < 
1n−k.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"C.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-35","style":{"fontWeight":"bold"},"text":"3.1","element":"a"}],[{"id":"id-104","style":{"fontWeight":"bold"},"text":"Theorem C.14 ","element":"span"},{"text":"(Theorem ","element":"span"},{"href":"#id-35","text":"3.1)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let Assumption ","element":"span"},{"href":"#id-59","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold with ","element":"span"},{"style":{"height":16},"width":136.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-10.png","element":"img","alt":" ρ = γ/5","inline":true},{"style":{"fontStyle":"italic"},"text":". There exists a sufficiently small step-size ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-11.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that with probability at least ","element":"span"},{"style":{"height":11.6},"width":83.48,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-12.png","element":"img","alt":" 1 − δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"over the randomness of the dataset and network initialization the following hold.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. 
There exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that the training process terminates at an iteration ","element":"span"},{"style":{"height":21.78},"width":181.16,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-13.png","element":"img","alt":"Tend ≤ Cnη .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. For all ","element":"span"},{"style":{"height":16},"width":466.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-14.png","element":"img","alt":" i ∈ [2n] then ℓ(Tend, xi) = 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. There exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that the generalization error satisfies","element":"span"}],[{"style":{"width":"42%"},"width":668,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-15.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Under Assumption ","element":"span"},{"href":"#id-60","text":"3, ","element":"a"},{"text":"Statements 1 and 2 follow from Lemma ","element":"span"},{"href":"#id-77","text":"C.12. ","element":"a"},{"text":"Note that the bound in Statement 1 comes from Lemma ","element":"span"},{"href":"#id-74","text":"C.2 ","element":"a"},{"text":"applied from iteration ","element":"span"},{"text":"0 ","element":"span"},{"text":"to iteration ","element":"span"},{"style":{"height":17.39},"width":612.16,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-16.png","element":"img","alt":" Tend, using 4k(n − k)(γ + ρ)2 < 4/99","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":352.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-17.png","element":"img","alt":" ℓ(0, xi) = 1 + O(η)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":145.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-18.png","element":"img","alt":" i ∈ [2n]","inline":true},{"text":". With regard to Statement 3, by Lemma ","element":"span"},{"href":"#id-78","text":"C.13, ","element":"a"},{"text":"if ","element":"span"},{"style":{"height":19.9},"width":261.16,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-19.png","element":"img","alt":"|⟨n, nℓ⟩| < ρ1−γ","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":134.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-20.png","element":"img","alt":" l ∈ [2n]","inline":true,"padRight":true},{"text":"it follows that ","element":"span"},{"style":{"height":16},"width":329.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-21.png","element":"img","alt":" sgn(f(Tend, x)) = y","inline":true},{"text":". 
Therefore, as ","element":"span"},{"style":{"height":19.81},"width":143.6,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-22.png","element":"img","alt":"ρ1−ρ > ρ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"text":"analogous to Lemma ","element":"span"},{"href":"#id-62","text":"A.1, ","element":"a"},{"text":"under Assumption ","element":"span"},{"href":"#id-60","text":"3 ","element":"a"},{"text":"there exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"such that","element":"span"}],[{"style":{"width":"58%"},"width":924,"height":360,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-23.png","element":"img"}],[{"text":"Finally, under Assumption ","element":"span"},{"href":"#id-59","text":"2, ","element":"a"},{"text":"Assumption ","element":"span"},{"href":"#id-60","text":"3 ","element":"a"},{"text":"holds with probability at least ","element":"span"},{"style":{"height":11.6},"width":86.48,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/37-24.png","element":"img","alt":" 1 − δ","inline":true,"padRight":true},{"text":"by Lemma ","element":"span"},{"href":"#id-79","text":"C.1.","element":"a"}]]},{"heading":"Appendix D Non-benign overfitting","paragraphs":[[{"id":"id-88","style":{"fontWeight":"bold"},"text":"Assumption 4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With ","element":"span"},{"style":{"height":16},"width":211.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-0.png","element":"img","alt":" δ, ρ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"we assume the following conditions on the data and model hyperparameters.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. 
","element":"span"},{"style":{"height":13.2},"width":107.96,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-1.png","element":"img","alt":" n ≥ 1,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. ","element":"span"},{"style":{"height":19.38},"width":244.6,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-2.png","element":"img","alt":" m ≥ log2( 4nδ ),","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. ","element":"span"},{"style":{"height":29.2},"width":494.56,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-3.png","element":"img","alt":" d ≥ max{3, 3ρ−2 ln(4n2δ )},","inline":true}],[{"style":{"fontStyle":"italic"},"text":"4. ","element":"span"},{"style":{"height":13.2},"width":106.04,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-4.png","element":"img","alt":" k ≥ 0,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"γ ","element":"span"},{"style":{"height":22.74},"width":141,"height":56.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-5.png","element":"img","alt":" ≤ 16√dn,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"η < ","element":"span"},{"style":{"height":19.38},"width":79.64,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-6.png","element":"img","alt":"12mn,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"λ","element":"span"},{"style":{"height":12.4},"width":110.12,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-7.png","element":"img","alt":"w < η.","inline":true}],[{"text":"In our analysis we require two additional assumptions on the training sample and activations at initialization.","element":"span"}],[{"id":"id-80","style":{"fontWeight":"bold"},"text":"Assumption 5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":160.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-8.png","element":"img","alt":" ρ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfy ","element":"span"},{"style":{"height":20.21},"width":427.04,"height":50.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-9.png","element":"img","alt":" ρ ≤ min{ 1−γ4n , 12n−1 − γ}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and in addition to the conditions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"detailed in Assumption ","element":"span"},{"href":"#id-88","style":{"fontStyle":"italic"},"text":"4, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"assume the following two conditions hold.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. For all ","element":"span"},{"style":{"height":16},"width":128.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-10.png","element":"img","alt":" i ∈ [2n]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"there exists a ","element":"span"},{"style":{"height":23.52},"width":720.2,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-11.png","element":"img","alt":" j ∈ [2m] such that (−1)j = yi and i ∈ A(0)j .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. 
For all ","element":"span"},{"style":{"height":19.9},"width":547.44,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-12.png","element":"img","alt":" i, l ∈ [2n], i ̸= l |⟨ni, nl⟩| ≤ ρ1−γ .","inline":true}],[{"text":"Note under these assumptions that ","element":"span"},{"style":{"height":19.36},"width":146.28,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-13.png","element":"img","alt":" γ < 124n2","inline":true,"padRight":true},{"text":", this implies ","element":"span"},{"style":{"height":20.21},"width":658.52,"height":50.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-14.png","element":"img","alt":"12n−1 > γ and ρ ≤ min{ 1−γ4n , 12n−1 − γ}","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":19.36},"width":114.4,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-15.png","element":"img","alt":" ρ ≤ 15n","inline":true},{"text":". As demonstrated in the following Lemma, these additional two conditions hold with high ","element":"span"},{"text":"probability over the randomness of the initialization and training set.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma D.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The additional conditions of Assumption ","element":"span"},{"href":"#id-80","style":{"fontStyle":"italic"},"text":"5 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold with probability at least ","element":"span"},{"style":{"height":12},"width":97.84,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-16.png","element":"img","alt":" 1 − δ.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Using Lemma ","element":"span"},{"href":"#id-81","text":"A.7, ","element":"a"},{"text":"then as long as ","element":"span"},{"style":{"height":19.38},"width":234.16,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-17.png","element":"img","alt":" m ≥ log2( 4nδ )","inline":true,"padRight":true},{"text":"the probability the first condition does not ","element":"span"},{"text":"hold is at most ","element":"span"},{"style":{"height":16},"width":59.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-18.png","element":"img","alt":" δ/2","inline":true},{"text":". Using Lemma ","element":"span"},{"href":"#id-62","text":"A.1, ","element":"a"},{"text":"and observing ","element":"span"},{"style":{"height":19.81},"width":138.2,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-19.png","element":"img","alt":"ρ1−γ > ρ","inline":true},{"text":", then as long as","element":"span"}],[{"style":{"width":"31%"},"width":506,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-20.png","element":"img"}],[{"text":"the probability that the second condition does not hold is also at most ","element":"span"},{"style":{"height":16},"width":59.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-21.png","element":"img","alt":" δ/2","inline":true},{"text":". 
Using the union bound we conclude that both properties hold with probability at least ","element":"span"},{"style":{"height":12},"width":29.24,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-22.png","element":"img","alt":" δ.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"D.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-82","style":{"fontWeight":"bold"},"text":"3.7","element":"a"}],[{"id":"id-83","style":{"fontWeight":"bold"},"text":"Lemma D.2 ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-82","text":"3.7)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In addition to Assumption ","element":"span"},{"href":"#id-80","style":{"fontStyle":"italic"},"text":"5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"assume also that ","element":"span"},{"style":{"height":16},"width":221.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-23.png","element":"img","alt":" ℓ(t0, xi) ≤ a","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all","element":"span"}],[{"style":{"height":16},"width":230.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-24.png","element":"img","alt":"i ∈ [2n]. Then","inline":true}],[{"style":{"width":"46%"},"width":744,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-25.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"From Lemma ","element":"span"},{"href":"#id-64","text":"B.5, ","element":"a"},{"style":{"height":16},"width":291.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-26.png","element":"img","alt":" ϕ(ρ − γ) ≤ ρ + γ","inline":true},{"text":", and the assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"41%"},"width":662,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-27.png","element":"img"}],[{"text":"If we sum over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", we get","element":"span"}],[{"style":{"width":"52%"},"width":827,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/38-28.png","element":"img"}],[{"text":"from which the result follows.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Supporting lemmas","element":"span"}],[{"id":"id-89","style":{"fontWeight":"bold"},"text":"Lemma D.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumption ","element":"span"},{"href":"#id-80","style":{"fontStyle":"italic"},"text":"5 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds, then the training process converges to zero loss.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"By Lemma ","element":"span"},{"href":"#id-83","text":"D.2, ","element":"a"},{"text":"there is an upper bound on the number of updates independent of iteration. This can only happen if there is some iteration after which we make no updates. In turn, this can only happen if every point is either at zero loss or activates no neurons. 
We prove by induction that every point activates a neuron at every iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"≥ 0","element":"span"},{"text":". Consider an arbitrary point ","element":"span"},{"style":{"height":16},"width":132.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-0.png","element":"img","alt":" i ∈ [2n]","inline":true},{"text":". At ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 ","element":"span"},{"text":"the induction hypothesis is true by Statement 1 of Assumption ","element":"span"},{"href":"#id-80","text":"5. ","element":"a"},{"text":"Suppose the induction hypothesis is true at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". We consider the following two cases separately in order to show that the induction hypothesis also holds at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"+ 1","element":"span"},{"text":".","element":"span"}],[{"text":"1. If ","element":"span"},{"style":{"height":16},"width":190.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-1.png","element":"img","alt":" ℓ(t, xi) > 0","inline":true},{"text":", then by the induction hypothesis we can choose a ","element":"span"},{"style":{"height":16},"width":144.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-2.png","element":"img","alt":" j ∈ [2m]","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":23.52},"width":286.76,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-3.png","element":"img","alt":" ϕ(⟨w(t)j , xi⟩) > 0","inline":true},{"text":". 
","element":"span"},{"text":"We bound","element":"span"}],[{"style":{"width":"63%"},"width":1000,"height":194,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-4.png","element":"img"}],[{"text":"which follows as ","element":"span"},{"style":{"height":19.38},"width":226.56,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-5.png","element":"img","alt":" γ + ρ < 12n−1 ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-80","text":"5)","element":"a"},{"text":".","element":"span"}],[{"text":"2. If ","element":"span"},{"style":{"height":16},"width":278.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-6.png","element":"img","alt":" ℓ(t, xi) = 0, then","inline":true}],[{"style":{"width":"39%"},"width":628,"height":252,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-7.png","element":"img"}],[{"text":"is bounded below by 1. This means that there is some ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":23.52},"width":299.52,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-8.png","element":"img","alt":" ϕ(⟨w(t)j , xi⟩) ≥ 1m","inline":true},{"text":". We ","element":"span"},{"text":"bound","element":"span"}],[{"style":{"width":"58%"},"width":928,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-9.png","element":"img"}],[{"text":"as ","element":"span"},{"style":{"height":19.36},"width":143.28,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-10.png","element":"img","alt":" η < 12mn ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-80","text":"5)","element":"a"},{"text":".","element":"span"}],[{"id":"id-84","style":{"fontWeight":"bold"},"text":"Lemma D.4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that at epoch ","element":"span"},{"style":{"height":7.2},"width":19.48,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-11.png","element":"img","alt":" τ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"every point is at zero loss. Then","element":"span"}],[{"style":{"width":"22%"},"width":358,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"If ","element":"span"},{"style":{"height":16},"width":1202.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-13.png","element":"img","alt":" ℓ(τ, xi) = 0 for all i ∈ [2n], then yif(τ, xi) ≥ 1 for all i ∈ [2n]. We bound","inline":true}],[{"style":{"width":"65%"},"width":1033,"height":291,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-14.png","element":"img"}],[{"text":"Summing over ","element":"span"},{"style":{"height":16},"width":128.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-15.png","element":"img","alt":" i ∈ [2n]","inline":true,"padRight":true},{"text":"we see that","element":"span"}],[{"style":{"width":"59%"},"width":948,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-16.png","element":"img"}],[{"text":"from which the result claimed follows.","element":"span"}],[{"id":"id-90","style":{"fontWeight":"bold"},"text":"Lemma D.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":265.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-17.png","element":"img","alt":" y ∼ U({−1, 1})","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and consider a clean test point ","element":"span"},{"style":{"height":17.71},"width":524.6,"height":44.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-18.png","element":"img","alt":" x = y(√γv + √1 − γn), where","inline":true},{"style":{"height":17.39},"width":389.28,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-19.png","element":"img","alt":"n ∼ U(Sd ∩ span(v)⊥)","inline":true},{"style":{"fontStyle":"italic"},"text":". If Assumption ","element":"span"},{"href":"#id-80","style":{"fontStyle":"italic"},"text":"5 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds, then","element":"span"}],[{"style":{"width":"25%"},"width":401,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/39-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Observe by symmetry of the distributions of both ","element":"span"},{"style":{"height":14},"width":248.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-0.png","element":"img","alt":" y and n that −x","inline":true,"padRight":true},{"text":"is identically distributed to ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"and furthermore that the labels of ","element":"span"},{"style":{"height":10.8},"width":151.56,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-1.png","element":"img","alt":" x and −x","inline":true,"padRight":true},{"text":"are opposite. 
As a result, if ","element":"span"},{"style":{"height":16},"width":480.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-2.png","element":"img","alt":" y(f(Tend, x)−f(Tend, −x)) <","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"text":"then at least one of ","element":"span"},{"style":{"height":16},"width":655,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-3.png","element":"img","alt":" yf(Tend, x) < 0 or −yf(Tend, −x) < 0","inline":true},{"text":" holds, which in turn implies that at least one of the two points is misclassified. By construction, ","element":"span"},{"style":{"height":23.52},"width":526.88,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-4.png","element":"img","alt":" ⟨w(t)j , x⟩ > 0 iff ⟨w(t)j , −x⟩ < 0","inline":true},{"text":", therefore","element":"span"}],[{"style":{"width":"82%"},"width":1306,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-5.png","element":"img"}],[{"text":"Unwinding the GD update to a neuron, we have","element":"span"}],[{"style":{"width":"97%"},"width":1543,"height":766,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-6.png","element":"img"}],[{"text":"where the final equality follows from symmetry of the noise distribution, and where ","element":"span"},{"style":{"height":20.38},"width":392.52,"height":50.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-7.png","element":"img","alt":" z := ∑_{i=1}^{2n} Ti(0, Tend)ni","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.78},"width":291.68,"height":49.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-8.png","element":"img","alt":" u = z/∥z∥. 
Observe","inline":true}],[{"style":{"width":"58%"},"width":928,"height":521,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-9.png","element":"img"}],[{"text":"where the final inequality follows from Jensen’s inequality. By assumption ","element":"span"},{"style":{"height":14.4},"width":228.88,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-10.png","element":"img","alt":" 4nρ < 1 − γ","inline":true},{"text":", and ","element":"span"},{"style":{"height":25.22},"width":346.6,"height":63.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-11.png","element":"img","alt":"10mλw ≤ nγ ≤ √n6√d","inline":true},{"text":", furthermore trivially ","element":"span"},{"style":{"height":16},"width":230.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-12.png","element":"img","alt":" (1 − γ) > 0.8","inline":true},{"text":". Conditioning on the event ","element":"span"},{"style":{"height":16},"width":175.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-13.png","element":"img","alt":" ⟨n, u⟩ > 0","inline":true},{"text":", ","element":"span"},{"text":"which holds with probability ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2","element":"span"},{"text":", these inequalities in combination with Lemma ","element":"span"},{"href":"#id-84","text":"D.4 ","element":"a"},{"text":"give","element":"span"}],[{"style":{"width":"60%"},"width":961,"height":239,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-14.png","element":"img"}],[{"text":"Therefore, if ","element":"span"},{"style":{"height":16},"width":172.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-15.png","element":"img","alt":" ⟨n, u⟩ > 0","inline":true,"padRight":true},{"text":"then the 
condition","element":"span"}],[{"id":"id-85","style":{"width":"57%"},"width":904,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/40-16.png","element":"img"}],[{"text":"implies that at least one of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"or ","element":"span"},{"style":{"height":7.2},"width":55,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-0.png","element":"img","alt":" −x","inline":true,"padRight":true},{"text":"is misclassified. Suppose ","element":"span"},{"style":{"height":17.38},"width":389.32,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-1.png","element":"img","alt":" n ∼ U(Sd ∩ span(v)⊥)","inline":true,"padRight":true},{"text":"is such that ","element":"span"},{"href":"#id-85","text":"(8) ","element":"a"},{"text":"holds. Then, since ","element":"span"},{"style":{"height":16},"width":265.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-2.png","element":"img","alt":" y ∼ U({−1, 1})","inline":true,"padRight":true},{"text":"it follows, given ","element":"span"},{"style":{"fontWeight":"bold"},"text":"n ","element":"span"},{"text":"that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":7.2},"width":52,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-3.png","element":"img","alt":" −x","inline":true,"padRight":true},{"text":"are sampled with equal probability, and thus the chance of misclassification is at least ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2","element":"span"},{"text":". 
As a result, the probability of misclassification is at least","element":"span"}],[{"style":{"width":"27%"},"width":430,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-4.png","element":"img"}],[{"text":"as claimed. We note that the final inequality above follows by showing that the two spherical caps corresponding to the set of unit vectors ","element":"span"},{"style":{"fontWeight":"bold"},"text":"z ","element":"span"},{"text":"satisfying ","element":"span"},{"style":{"height":18.38},"width":297.96,"height":45.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-5.png","element":"img","alt":" ⟨n, u⟩ ≥ 1/(2√d)","inline":true,"padRight":true},{"text":"account for less than half the area of ","element":"span"},{"style":{"height":13.39},"width":79.64,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-6.png","element":"img","alt":" Sd−1","inline":true},{"text":", which can be derived from formulas provided in ","element":"span"},{"href":"#id-86","referenceIndex":30,"text":"(S, ","element":"a"},{"href":"#id-86","referenceIndex":30,"text":"2011)","element":"a"},{"text":". This inequality has also appeared, for instance, in ","element":"span"},{"href":"#id-87","referenceIndex":2,"text":"(Asi & Duchi, ","element":"a"},{"href":"#id-87","referenceIndex":2,"text":"2019)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-36","style":{"fontWeight":"bold"},"text":"3.6","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Theorem D.6 ","element":"span"},{"text":"(Theorem ","element":"span"},{"href":"#id-36","text":"3.6)","element":"a"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-88","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds with ","element":"span"},{"style":{"height":19.38},"width":116.24,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-7.png","element":"img","alt":" ρ = 15n","inline":true},{"style":{"fontStyle":"italic"},"text":". With probability at least ","element":"span"},{"style":{"height":11.6},"width":87.64,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-8.png","element":"img","alt":"1 − δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"over the randomness of the dataset and network initialization the following hold.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. There exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that the training process terminates at an iteration ","element":"span"},{"style":{"height":21.78},"width":181.16,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-9.png","element":"img","alt":"Tend ≤ Cnη .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. For all ","element":"span"},{"style":{"height":16},"width":388.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-10.png","element":"img","alt":" i ∈ [2n] ℓ(Tend, xi) = 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. The generalization error satisfies","element":"span"}],[{"style":{"width":"29%"},"width":469,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Under Assumption ","element":"span"},{"href":"#id-80","text":"5, ","element":"a"},{"text":"Statements 1 and 2 follow from Lemma ","element":"span"},{"href":"#id-89","text":"D.3. ","element":"a"},{"text":"The bound on ","element":"span"},{"style":{"height":13.98},"width":62.04,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-12.png","element":"img","alt":" Tend","inline":true,"padRight":true},{"text":"follows from Lemma ","element":"span"},{"href":"#id-83","text":"D.2 ","element":"a"},{"text":"applied between iterations ","element":"span"},{"text":"0 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":13.98},"width":62.04,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-13.png","element":"img","alt":" Tend","inline":true},{"text":", using ","element":"span"},{"style":{"height":16},"width":328.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-14.png","element":"img","alt":" ℓ(0, xi) = 1 + O(η)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":128.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-15.png","element":"img","alt":" i ∈ [2n]","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.38},"width":340.68,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-16.png","element":"img","alt":" (γ + ρ)(2n − 1) < 23","inline":true},{"text":". Statement 3 follows from Lemma ","element":"span"},{"href":"#id-90","text":"D.5. 
","element":"a"},{"text":"We conclude by observing under ","element":"span"},{"text":"Assumption ","element":"span"},{"href":"#id-88","text":"4 ","element":"a"},{"text":"that Assumption ","element":"span"},{"href":"#id-80","text":"5 ","element":"a"},{"text":"holds with probability at least ","element":"span"},{"style":{"height":11.6},"width":92.48,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-17.png","element":"img","alt":" 1 − δ.","inline":true}]]},{"heading":"Appendix E No-overfitting","paragraphs":[[{"id":"id-91","style":{"fontWeight":"bold"},"text":"Assumption 6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With ","element":"span"},{"style":{"height":16},"width":204.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-18.png","element":"img","alt":" δ, ρ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a generic, positive constant, we assume the following conditions on the data and model hyperparameters.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"height":19.38},"width":239.44,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-19.png","element":"img","alt":" ≥ C log� 2mδ �,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. ","element":"span"},{"style":{"height":13.2},"width":108.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-20.png","element":"img","alt":" m ≥ 2","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. 
","element":"span"},{"style":{"height":29.2},"width":413.72,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-21.png","element":"img","alt":" d ≥�3, 3ρ−2 ln�6n2δ ��,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k < ","element":"span"},{"style":{"height":16.58},"width":63.44,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-22.png","element":"img","alt":"n100,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"5. ","element":"span"},{"style":{"height":19.38},"width":421.28,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-23.png","element":"img","alt":"3n < γ < 136 min{k−1, 1},","inline":true}],[{"style":{"fontStyle":"italic"},"text":"6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"λ","element":"span"},{"style":{"height":12.4},"width":110.12,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-24.png","element":"img","alt":"w < η.","inline":true}],[{"text":"For our analysis we make two additional assumptions. ","element":"span"},{"id":"id-92","style":{"fontWeight":"bold"},"text":"Assumption 7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":395.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-25.png","element":"img","alt":" ρ ∈ (0, 1) satisfy γ ≥ 5ρ","inline":true},{"style":{"fontStyle":"italic"},"text":". In addition to the assumptions detailed in Assumption ","element":"span"},{"href":"#id-91","style":{"fontStyle":"italic"},"text":"6, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"suppose the following conditions hold.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. 
","element":"span"},{"style":{"height":16},"width":165.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-26.png","element":"img","alt":" Γ = [2m].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. For all ","element":"span"},{"style":{"height":19.9},"width":770.48,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-27.png","element":"img","alt":" i, l ∈ [2n] such that i ̸= l then |⟨ni, nl⟩| ≤ ρ1−γ .","inline":true}],[{"text":"We remark under these assumptions that for sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"then ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-28.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"satisfies the inequality ","element":"span"},{"style":{"height":29.22},"width":479.24,"height":73.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/41-29.png","element":"img","alt":"ρ < min�γ(n−3k)−2n+k , γ5 , n11�","inline":true},{"text":". As shown in the following lemma, these two additional conditions hold with high probability over the randomness of the initialization and training set.","element":"span"}],[{"id":"id-105","style":{"fontWeight":"bold"},"text":"Lemma E.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that if ","element":"span"},{"style":{"height":19.38},"width":263.16,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-0.png","element":"img","alt":" n ≥ C log� 2mδ �","inline":true},{"style":{"fontStyle":"italic"},"text":"then the extra conditions of Assumption ","element":"span"},{"href":"#id-92","style":{"fontStyle":"italic"},"text":"7 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold with probability at least ","element":"span"},{"style":{"height":11.6},"width":90.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-1.png","element":"img","alt":" 1 − δ.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Using Lemma ","element":"span"},{"href":"#id-50","text":"A.3, ","element":"a"},{"text":"there exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"such that for ","element":"span"},{"style":{"height":13.2},"width":121.64,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-2.png","element":"img","alt":" n ≥ C","inline":true,"padRight":true},{"text":"there in turn exists a constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"such that the probability the first condition does not hold is at most ","element":"span"},{"style":{"height":16},"width":215.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-3.png","element":"img","alt":" m exp(−cn).","inline":true,"padRight":true},{"text":"Setting ","element":"span"},{"style":{"height":16},"width":298.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-4.png","element":"img","alt":" δ ≥ 
2m exp(−cn)","inline":true,"padRight":true},{"text":"and rearranging, we see that as long as ","element":"span"},{"style":{"height":19.38},"width":263.12,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-5.png","element":"img","alt":" n ≥ C log(2m/δ)","inline":true},{"text":", the probability that the first condition does not hold is at most ","element":"span"},{"style":{"height":16},"width":59.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-6.png","element":"img","alt":" δ/2","inline":true},{"text":". Using Lemma ","element":"span"},{"href":"#id-62","text":"A.1 ","element":"a"},{"text":"and observing that ","element":"span"},{"style":{"height":19.81},"width":138.2,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-7.png","element":"img","alt":"ρ/(1−γ) > ρ","inline":true},{"text":", under the ","element":"span"},{"text":"condition on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"stated in Assumption ","element":"span"},{"href":"#id-92","text":"7, ","element":"a"},{"text":"the probability that the second condition does not hold is also at most ","element":"span"},{"style":{"height":20.18},"width":16,"height":50.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-8.png","element":"img","alt":"δ/2","inline":true},{"text":". 
Using the union bound, we therefore conclude that both properties hold with probability at ","element":"span"},{"text":"least ","element":"span"},{"style":{"height":12},"width":29.24,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-9.png","element":"img","alt":" δ.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"E.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-93","style":{"fontWeight":"bold"},"text":"3.9","element":"a"}],[{"id":"id-95","style":{"fontWeight":"bold"},"text":"Lemma E.2 ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-93","text":"3.9)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-92","style":{"fontStyle":"italic"},"text":"7 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. Consider an arbitrary ","element":"span"},{"style":{"height":16},"width":157.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-10.png","element":"img","alt":" j ∈ [2m]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfying ","element":"span"},{"style":{"height":23.52},"width":573.32,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-11.png","element":"img","alt":" 2 ≤ t < T0. Then i ∈ A(t)j iff i ∼ j.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"First we establish at iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"that for all ","element":"span"},{"style":{"height":23.52},"width":404.76,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-12.png","element":"img","alt":" i ∈ ST , i ∈ A(t)j iff i ∼ j","inline":true},{"text":". The argument here is similar to that of Lemma ","element":"span"},{"href":"#id-66","text":"C.3. ","element":"a"},{"text":"Suppose ","element":"span"},{"style":{"height":13.6},"width":83.84,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-13.png","element":"img","alt":" i ∼ j","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.2},"width":109.56,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-14.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":". By Assumption ","element":"span"},{"href":"#id-92","text":"7 ","element":"a"},{"text":"all neurons are in ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-15.png","element":"img","alt":" Γ","inline":true},{"text":", therefore from the definition of ","element":"span"},{"style":{"height":23.52},"width":1089.88,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-16.png","element":"img","alt":" Γp for all j ∈ [2m] we have G(i)j (0, 1)(γ − ρ) − B(i)j (0, 1)(γ + ρ) ≥","inline":true},{"style":{"height":22.03},"width":54.8,"height":55.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-17.png","element":"img","alt":"2λwη ","inline":true,"padRight":true},{"text":". 
Using Lemma ","element":"span"},{"href":"#id-54","text":"B.2","element":"a"}],[{"style":{"width":"90%"},"width":1438,"height":930,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-18.png","element":"img"}],[{"text":"By assumption ","element":"span"},{"style":{"height":21.63},"width":400.28,"height":54.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-19.png","element":"img","alt":" γ > (2+(n+k)ρ)/(n−3k) > ((n+k)ρ)/(n−3k)","inline":true,"padRight":true},{"text":"and therefore, for ","element":"span"},{"style":{"height":13.18},"width":114.32,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-20.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"style":{"height":23.52},"width":139.4,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-21.png","element":"img","alt":" i ∈ A(2)j","inline":true,"padRight":true},{"text":"iff ","element":"span"},{"style":{"height":13.6},"width":88.64,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-22.png","element":"img","alt":" i ∼ j","inline":true},{"text":". 
Again using Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":14.4},"width":274.4,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-23.png","element":"img","alt":" i ∈ SF and i ∼ j","inline":true}],[{"style":{"width":"90%"},"width":1436,"height":535,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/42-24.png","element":"img"}],[{"text":"Therefore, as ","element":"span"},{"style":{"height":21.63},"width":648.48,"height":54.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-0.png","element":"img","alt":" γ > (2+(n+k)ρ)/(n−3k) then for i ∈ SF and i ∼ j","inline":true}],[{"style":{"width":"91%"},"width":1444,"height":446,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-1.png","element":"img"}],[{"text":"With the base case established, we proceed by induction to prove that if ","element":"span"},{"style":{"height":23.52},"width":540.32,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-2.png","element":"img","alt":" i ∈ A(t−1)j iff i ∼ j, then i ∈ A(t)j","inline":true,"padRight":true},{"text":"iff ","element":"span"},{"style":{"height":13.6},"width":83.84,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-3.png","element":"img","alt":" i ∼ j","inline":true},{"text":". 
By the assumptions on ","element":"span"},{"style":{"height":10.61},"width":21.52,"height":26.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-4.png","element":"img","alt":" γ","inline":true},{"text":", the induction hypothesis and again using Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":13.6},"width":83.88,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-5.png","element":"img","alt":" i ∼ j","inline":true}],[{"style":{"width":"97%"},"width":1539,"height":441,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-6.png","element":"img"}],[{"text":"Therefore, for an epoch ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"satisfying ","element":"span"},{"style":{"height":23.52},"width":604.76,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-7.png","element":"img","alt":" 2 ≤ t ≤ T1 then i ∈ A(t−1)j iff i ∼ j .","inline":true}],[{"style":{"fontWeight":"bold"},"text":"E.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-94","style":{"fontWeight":"bold"},"text":"3.10","element":"a"}],[{"id":"id-100","style":{"fontWeight":"bold"},"text":"Lemma E.3 ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-94","text":"3.10)","element":"a"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-92","style":{"fontStyle":"italic"},"text":"7 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds, then there is an iteration ","element":"span"},{"style":{"height":14},"width":130.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-8.png","element":"img","alt":" T1 < T0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that","element":"span"}],[{"style":{"width":"75%"},"width":1196,"height":315,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Furthermore for ","element":"span"},{"style":{"height":13.18},"width":109.56,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-10.png","element":"img","alt":" i ∈ ST","inline":true}],[{"style":{"width":"53%"},"width":849,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Finally,","element":"span"}],[{"style":{"width":"52%"},"width":830,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":13.98},"width":105.24,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-13.png","element":"img","alt":" t < T0","inline":true},{"text":". 
By Lemma ","element":"span"},{"href":"#id-95","text":"E.2 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"we can bound for ","element":"span"},{"style":{"height":14.4},"width":274.4,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-14.png","element":"img","alt":" i ∈ SF and i ∼ j","inline":true,"padRight":true},{"text":"as follows,","element":"span"}],[{"style":{"width":"78%"},"width":1242,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-15.png","element":"img"}],[{"text":"Similarly, for ","element":"span"},{"style":{"height":15.2},"width":228.52,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-16.png","element":"img","alt":" i ∈ SF , i ̸∼ j,","inline":true}],[{"style":{"width":"78%"},"width":1243,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/43-17.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":14},"width":227.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-0.png","element":"img","alt":" i ∈ ST , i ∼ j,","inline":true}],[{"style":{"width":"78%"},"width":1242,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-1.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"78%"},"width":1242,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-2.png","element":"img"}],[{"text":"Lastly, for ","element":"span"},{"style":{"height":15.2},"width":227.12,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-3.png","element":"img","alt":" i ∈ ST , i ̸∼ j,","inline":true}],[{"style":{"width":"78%"},"width":1242,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-4.png","element":"img"}],[{"text":"Therefore, for 
","element":"span"},{"style":{"height":13.2},"width":121.68,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-5.png","element":"img","alt":" i ∈ ST ,","inline":true}],[{"style":{"width":"37%"},"width":588,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-6.png","element":"img"}],[{"text":"from which we conclude","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"ηmt","element":"span"},{"style":{"height":16},"width":1513.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-7.png","element":"img","alt":"(1+γ(n−2k)−ρn−(γ−ρ))+O(η) ≤ f(t, xi) ≤ ηmt(1+γ(n−2k)+ρn−(γ−ρ))+O(η).","inline":true}],[{"text":"Therefore, as long as","element":"span"}],[{"id":"id-96","style":{"width":"75%"},"width":1199,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-8.png","element":"img"}],[{"text":"then ","element":"span"},{"style":{"height":16},"width":313.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-9.png","element":"img","alt":" ℓ(t, xi) > 0. Let T1","inline":true,"padRight":true},{"text":"be the largest value of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"satisfying ","element":"span"},{"href":"#id-96","text":"(9) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":13.98},"width":105.24,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-10.png","element":"img","alt":" t < T0","inline":true},{"text":". 
We see that","element":"span"}],[{"style":{"width":"52%"},"width":830,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-11.png","element":"img"}],[{"text":"From this, the bounds claimed follow.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-97","style":{"fontWeight":"bold"},"text":"3.11","element":"a"}],[{"id":"id-101","style":{"fontWeight":"bold"},"text":"Lemma E.4 ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-97","text":"3.11)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let Assumption ","element":"span"},{"href":"#id-92","style":{"fontStyle":"italic"},"text":"7 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold. Suppose at iteration ","element":"span"},{"style":{"height":12.4},"width":28.48,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-12.png","element":"img","alt":" t0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the following conditions are satisfied.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"a. ","element":"span"},{"style":{"height":16},"width":450.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-13.png","element":"img","alt":" ℓ(t0, xi) ≤ a for all i ∈ ST ,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"b. ","element":"span"},{"style":{"height":23.52},"width":707.12,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-14.png","element":"img","alt":" ϕ(⟨w(t0)j , xi⟩) ≤ b for all i ∈ SF and i ∼ j,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"c. 
For all iterations ","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-15.png","element":"img","alt":" τ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfying ","element":"span"},{"style":{"height":12.8},"width":174.48,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-16.png","element":"img","alt":" t0 ≤ τ ≤ t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"it holds that ","element":"span"},{"style":{"height":23.54},"width":356.28,"height":58.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-17.png","element":"img","alt":" i ∈ A(τ)j only if i ∼ j,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"d. For all iterations ","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-18.png","element":"img","alt":" τ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfying ","element":"span"},{"style":{"height":23.52},"width":662.88,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-19.png","element":"img","alt":" t0 ≤ τ ≤ t, i ∈ A(τ)j if i ∼ j and i ∈ ST .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Then for ","element":"span"},{"style":{"height":16},"width":561.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-20.png","element":"img","alt":" j ∈ [2m] and p ∈ {−1, 1} we have","inline":true}],[{"style":{"width":"34%"},"width":545,"height":224,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/44-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Consider an arbitrary neuron ","element":"span"},{"style":{"height":16},"width":144.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-0.png","element":"img","alt":" j ∈ [2m]","inline":true},{"text":". Using Lemma ","element":"span"},{"href":"#id-57","text":"B.3 ","element":"a"},{"text":"and assumption (b), we bound for ","element":"span"},{"style":{"height":14.4},"width":461.24,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-1.png","element":"img","alt":"t0 < τ ≤ t, i ∈ SF , and i ∼ j","inline":true}],[{"style":{"width":"59%"},"width":950,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-2.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":23.52},"width":292.6,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-3.png","element":"img","alt":" ϕ(⟨w(τ)j , xi⟩) ≥ 0","inline":true,"padRight":true},{"text":"in general we may conclude that","element":"span"}],[{"style":{"width":"40%"},"width":644,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-4.png","element":"img"}],[{"text":"Summing over all ","element":"span"},{"style":{"height":14.4},"width":359.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-5.png","element":"img","alt":" i ∈ SF such that i ∼ j","inline":true,"padRight":true},{"text":"then by assumption (c)","element":"span"}],[{"style":{"width":"28%"},"width":454,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-6.png","element":"img"}],[{"text":"Combining these expressions it follows that","element":"span"}],[{"style":{"width":"42%"},"width":674,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-7.png","element":"img"}],[{"text":"As the number of clean updates on a pair of neurons 
","element":"span"},{"style":{"height":11.6},"width":16,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-8.png","element":"img","alt":" ℓ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":13.6},"width":86.72,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-9.png","element":"img","alt":" ℓ ∼ j","inline":true,"padRight":true},{"text":"is the same by assumptions (c) and (d), then we may rewrite this bound as","element":"span"}],[{"id":"id-98","style":{"width":"73%"},"width":1164,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-10.png","element":"img"}],[{"style":{"width":"100%"},"width":1590,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-11.png","element":"img"}],[{"style":{"width":"98%"},"width":1556,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-12.png","element":"img"}],[{"text":"By assumptions (c) and (d)","element":"span"}],[{"style":{"width":"80%"},"width":1274,"height":559,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-13.png","element":"img"}],[{"text":"Note either ","element":"span"},{"style":{"height":16},"width":649.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-14.png","element":"img","alt":" Ti(t0, τ) = Ti(t0, τ − 1) or ℓ(τ, xi) > 0","inline":true},{"text":". 
Consider the case where the latter holds, then","element":"span"}],[{"style":{"width":"90%"},"width":1442,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-15.png","element":"img"}],[{"text":"Furthermore, suppose ","element":"span"},{"style":{"height":14.4},"width":108,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-16.png","element":"img","alt":" τ ′ ≤ τ","inline":true,"padRight":true},{"text":"is the first iteration before ","element":"span"},{"style":{"height":6.99},"width":20,"height":17.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-17.png","element":"img","alt":" τ","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16.78},"width":227.08,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-18.png","element":"img","alt":" Gj(τ ′, τ) = 0","inline":true,"padRight":true},{"text":"and let ","element":"span"},{"style":{"height":13.18},"width":111.64,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-19.png","element":"img","alt":" i ∈ ST","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":10.4},"width":85.88,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-20.png","element":"img","alt":"i ∼ s","inline":true,"padRight":true},{"text":"be a point that makes an update at iteration ","element":"span"},{"style":{"height":12.4},"width":98,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-21.png","element":"img","alt":" τ ′ − 1","inline":true},{"text":". 
Using the above bound it follows that","element":"span"}],[{"style":{"width":"96%"},"width":1530,"height":426,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/45-22.png","element":"img"}],[{"text":"By the construction of ","element":"span"},{"style":{"height":12},"width":30,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-0.png","element":"img","alt":" τ ′ ","inline":true,"padRight":true},{"text":"it follows that","element":"span"}],[{"style":{"width":"69%"},"width":1106,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-1.png","element":"img"}],[{"text":"From this we get the bound","element":"span"}],[{"id":"id-99","style":{"width":"81%"},"width":1287,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-2.png","element":"img"}],[{"text":"Combining ","element":"span"},{"href":"#id-98","text":"(10) ","element":"a"},{"text":"summed over ","element":"span"},{"style":{"height":13.6},"width":90.84,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-3.png","element":"img","alt":" j ∼ s","inline":true,"padRight":true},{"text":"with ","element":"span"},{"href":"#id-99","text":"(11)","element":"a"},{"text":", then","element":"span"}],[{"style":{"width":"88%"},"width":1407,"height":256,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-4.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"74%"},"width":1178,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-5.png","element":"img"}],[{"text":"Using ","element":"span"},{"style":{"height":17.41},"width":141.68,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-6.png","element":"img","alt":" ρ ≤ γ5 , η","inline":true,"padRight":true},{"text":"sufficiently small and 
","element":"span"},{"style":{"height":19.38},"width":129.52,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-7.png","element":"img","alt":" γ ≤ 136k ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-92","text":"7) ","element":"a"},{"text":"these bounds simplify to","element":"span"}],[{"style":{"width":"34%"},"width":552,"height":230,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-8.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"E.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Late training","element":"span"}],[{"id":"id-102","style":{"fontWeight":"bold"},"text":"Lemma E.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-92","style":{"fontStyle":"italic"},"text":"7 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"the training process terminates at an iteration ","element":"span"},{"style":{"height":14.21},"width":62,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-9.png","element":"img","alt":" Tend","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfying","element":"span"}],[{"style":{"width":"60%"},"width":959,"height":173,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":16},"width":347.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-11.png","element":"img","alt":" i ∈ SF and j ∈ [2m].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"By Lemma ","element":"span"},{"href":"#id-100","text":"E.3, ","element":"a"},{"text":"at iteration ","element":"span"},{"style":{"height":13.98},"width":188,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-12.png","element":"img","alt":" t = T1 with","inline":true}],[{"style":{"width":"49%"},"width":779,"height":201,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-13.png","element":"img"}],[{"text":"then the first two conditions of Lemma ","element":"span"},{"href":"#id-101","text":"E.4 ","element":"a"},{"text":"are satisfied. Next, using Lemma ","element":"span"},{"href":"#id-54","text":"B.2, ","element":"a"},{"text":"we see by induction on ","element":"span"},{"style":{"height":14},"width":213.88,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-14.png","element":"img","alt":" t ≥ T1 that if","inline":true}],[{"style":{"width":"66%"},"width":1052,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-15.png","element":"img"}],[{"text":"then for ","element":"span"},{"style":{"height":15.2},"width":355.2,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-16.png","element":"img","alt":" i ̸∼ j and T1 ≤ τ ≤ t,","inline":true}],[{"style":{"width":"46%"},"width":742,"height":181,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/46-17.png","element":"img"}],[{"text":"and for ","element":"span"},{"style":{"height":14.8},"width":496.76,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-0.png","element":"img","alt":" i ∈ ST , i ∼ j, and T1 ≤ τ ≤ t,","inline":true}],[{"style":{"width":"58%"},"width":921,"height":286,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-1.png","element":"img"}],[{"text":"for 
","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-2.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"sufficiently small. Thus we have shown under an additional assumption on ","element":"span"},{"style":{"height":16.78},"width":300.08,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-3.png","element":"img","alt":" Bj(T1, t) that with","inline":true},{"style":{"height":13.98},"width":314.6,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-4.png","element":"img","alt":"t0 = T1 and a and b","inline":true,"padRight":true},{"text":"as defined above, all four conditions of Lemma ","element":"span"},{"href":"#id-101","text":"E.4 ","element":"a"},{"text":"are satisfied. As a result, GD converges or terminates as long as","element":"span"}],[{"style":{"width":"72%"},"width":1154,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-5.png","element":"img"}],[{"text":"which is equivalent to","element":"span"}],[{"style":{"width":"80%"},"width":1283,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-6.png","element":"img"}],[{"text":"This is true by Assumption ","element":"span"},{"href":"#id-92","text":"7, ","element":"a"},{"text":"as","element":"span"}],[{"style":{"width":"90%"},"width":1440,"height":247,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-7.png","element":"img"}],[{"text":"where above we used ","element":"span"},{"style":{"height":19.38},"width":681.36,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-8.png","element":"img","alt":" γ ≤ min{ 136k, 136} and ρ ≤ min{ γ5 , n11}.","inline":true}],[{"id":"id-103","style":{"fontWeight":"bold"},"text":"Lemma E.6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose Assumption ","element":"span"},{"href":"#id-92","style":{"fontStyle":"italic"},"text":"7 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. Let ","element":"span"},{"style":{"height":16},"width":206.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-9.png","element":"img","alt":" y ∈ {−1, 1}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be drawn uniformly at random and ","element":"span"},{"style":{"height":17.71},"width":397.32,"height":44.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-10.png","element":"img","alt":"x := y√γ v + √(1 − γ) n","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":17.39},"width":555.08,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-11.png","element":"img","alt":" n ∼ Uniform(S^{d−1} ∩ span{v}^⊥)","inline":true},{"style":{"fontStyle":"italic"},"text":". Suppose that ","element":"span"},{"style":{"height":19.89},"width":259.28,"height":49.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-12.png","element":"img","alt":" |⟨n, nℓ⟩| < ρ/(1−γ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":16},"width":490.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-13.png","element":"img","alt":" ℓ ∈ [2n], then yf(Tend, x) > 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We proceed as in the proof of Lemma ","element":"span"},{"href":"#id-78","text":"C.13. 
","element":"a"},{"text":"Following the same steps as in ","element":"span"},{"href":"#id-52","text":"(5)","element":"a"},{"text":", for any ","element":"span"},{"style":{"height":16},"width":144.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-14.png","element":"img","alt":"j ∈ [2m]","inline":true}],[{"style":{"width":"69%"},"width":1098,"height":404,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.9},"width":837.4,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-16.png","element":"img","alt":" λ′ℓ := (−1)lyβ(ℓ)⟨xℓ, x⟩ = β(ℓ)γ + (1 − γ)⟨nℓ, n⟩","inline":true},{"text":". Then as in Lemma ","element":"span"},{"href":"#id-53","text":"B.1","element":"a"}],[{"style":{"width":"40%"},"width":640,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-17.png","element":"img"}],[{"text":"Recall, from Lemma ","element":"span"},{"href":"#id-95","text":"E.2, ","element":"a"},{"text":"for any ","element":"span"},{"style":{"height":16},"width":144.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-18.png","element":"img","alt":" j ∈ [2m]","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":16.8},"width":708.48,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-19.png","element":"img","alt":" Gj(2, Tend) ≥ Gj(2, T1) = (T1 − 2)(n − k)","inline":true},{"text":". 
As a consequence, for ","element":"span"},{"style":{"height":16},"width":251.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-20.png","element":"img","alt":" j ∈ Γp we have","inline":true}],[{"style":{"width":"71%"},"width":1138,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-21.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":16.99},"width":432.88,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-22.png","element":"img","alt":" j such that (−1)j = y then","inline":true}],[{"style":{"width":"78%"},"width":1249,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/47-23.png","element":"img"}],[{"text":"As a result","element":"span"}],[{"style":{"width":"89%"},"width":1411,"height":397,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-0.png","element":"img"}],[{"text":"Decompose ","element":"span"},{"style":{"height":16},"width":592.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-1.png","element":"img","alt":" B(2, Tend) = B(2, T1) + B(T1, Tend)","inline":true,"padRight":true},{"text":"and observe from Lemma ","element":"span"},{"href":"#id-95","text":"E.2 ","element":"a"},{"text":"that","element":"span"}],[{"style":{"width":"45%"},"width":718,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-2.png","element":"img"}],[{"text":"From Lemma ","element":"span"},{"href":"#id-100","text":"E.3 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-101","text":"E.4, ","element":"a"},{"text":"using the assumptions ","element":"span"},{"style":{"height":16.58},"width":167.48,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-3.png","element":"img","alt":" ρ ≤ n11, η","inline":true,"padRight":true},{"text":"sufficiently small and 
","element":"span"},{"style":{"height":13.6},"width":66,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-4.png","element":"img","alt":" γ ≤","inline":true},{"style":{"height":19.36},"width":285.32,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-5.png","element":"img","alt":"min{ k36, 136} then","inline":true}],[{"style":{"width":"77%"},"width":1228,"height":424,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-6.png","element":"img"}],[{"text":"Using the assumption that ","element":"span"},{"style":{"height":16.7},"width":229.84,"height":41.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-7.png","element":"img","alt":" k ≤ n100 and η","inline":true,"padRight":true},{"text":"is sufficiently small we see that","element":"span"}],[{"style":{"width":"27%"},"width":442,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-8.png","element":"img"}],[{"text":"As","element":"span"}],[{"style":{"width":"61%"},"width":970,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-9.png","element":"img"}],[{"text":"then ","element":"span"},{"style":{"height":16},"width":182.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-10.png","element":"img","alt":" yf(Tend, x)","inline":true,"padRight":true},{"text":"is positive provided ","element":"span"},{"style":{"height":16},"width":325.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-11.png","element":"img","alt":" (γ − ρ) − (γ + ρ)/9","inline":true,"padRight":true},{"text":"is positive and ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-12.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"is sufficiently small. 
Both these conditions are guaranteed by Assumption ","element":"span"},{"href":"#id-92","text":"7 ","element":"a"},{"text":"and thus the test point is correctly classified.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-37","style":{"fontWeight":"bold"},"text":"3.8","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Theorem E.7 ","element":"span"},{"text":"(Theorem ","element":"span"},{"href":"#id-37","text":"3.8)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let Assumption ","element":"span"},{"href":"#id-91","style":{"fontStyle":"italic"},"text":"6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold with ","element":"span"},{"style":{"height":16},"width":136.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-13.png","element":"img","alt":" ρ = γ/5","inline":true},{"style":{"fontStyle":"italic"},"text":". There exists a sufficiently small step-size ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-14.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that with probability at least ","element":"span"},{"style":{"height":11.6},"width":87.44,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-15.png","element":"img","alt":" 1 − δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"over the randomness of the dataset and network initialization we have the following.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. 
There exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that the training process terminates at an iteration ","element":"span"},{"style":{"height":21.76},"width":181.16,"height":54.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-16.png","element":"img","alt":"Tend ≤ Cnη .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. For all ","element":"span"},{"style":{"height":16},"width":1027.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-17.png","element":"img","alt":" i ∈ ST then ℓ(Tend, xi) = 0 while ℓ(Tend, xi) = 1 for all i ∈ SF .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. There exists a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that the generalization error satisfies","element":"span"}],[{"style":{"width":"41%"},"width":661,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Under Assumption ","element":"span"},{"href":"#id-92","text":"7, ","element":"a"},{"text":"Statements 1 and 2 follow from Lemma ","element":"span"},{"href":"#id-102","text":"E.5. 
","element":"a"},{"text":"The bound on ","element":"span"},{"style":{"height":13.98},"width":189.48,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-19.png","element":"img","alt":" Tend follows","inline":true,"padRight":true},{"text":"from Lemma ","element":"span"},{"href":"#id-101","text":"E.4 ","element":"a"},{"text":"applied at ","element":"span"},{"style":{"height":13.18},"width":108.52,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-20.png","element":"img","alt":" t0 = 2","inline":true},{"text":", indeed the number of iterations cannot exceed the number of updates which, as ","element":"span"},{"style":{"height":19.38},"width":100.8,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-21.png","element":"img","alt":" γ > 3n","inline":true},{"text":", is bounded as","element":"span"}],[{"style":{"width":"57%"},"width":908,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-22.png","element":"img"}],[{"text":"Again under Assumption ","element":"span"},{"href":"#id-92","text":"7 ","element":"a"},{"text":"using Lemma ","element":"span"},{"href":"#id-103","text":"E.6 ","element":"a"},{"text":"then Statement 3 follows in exactly the same manner as the proof of Statement 3 for Theorem ","element":"span"},{"href":"#id-104","text":"C.14. 
","element":"a"},{"text":"Finally, Lemma ","element":"span"},{"href":"#id-105","text":"E.1 ","element":"a"},{"text":"implies that under Assumption ","element":"span"},{"href":"#id-91","text":"6 ","element":"a"},{"text":"then Assumption ","element":"span"},{"href":"#id-92","text":"7 ","element":"a"},{"text":"holds with probability at least ","element":"span"},{"style":{"height":12},"width":97.88,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/48-23.png","element":"img","alt":" 1 − δ.","inline":true}]]},{"heading":"Appendix F Numerical simulations","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Reproducibility statement: ","element":"span"},{"text":"the code used to generate the following figures can be found at ","element":"span"},{"href":"https://github.com/wswartworth/benign_overfitting","text":"https: ","element":"a"},{"href":"https://github.com/wswartworth/benign_overfitting","text":"//github.com/wswartworth/benign_overfitting","element":"a"},{"text":".","element":"span"}],[{"text":"To investigate our theory we train two-layer neural networks with ReLU activations using full-batch gradient descent and a fixed step size. We train on a synthetic binary classification dataset generated as per Section ","element":"span"},{"href":"#id-39","text":"2.1. ","element":"a"},{"text":"Finally, we train using both the hinge and logistic loss.","element":"span"}],[{"style":{"width":"99%"},"width":1573,"height":749,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/49-0.png","element":"img"}],[{"text":"Figure 1: from left to right, the first row shows the clean, corrupt, and test losses as a function of epoch (or iteration). The second row shows the fraction of clean, corrupt, and test points that are classified correctly. 
These plots were generated with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"= 100","element":"figcaption","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"d ","element":"figcaption","subtype":"caption"},{"text":"= 800","element":"figcaption","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k/n ","element":"figcaption","subtype":"caption"},{"text":"= 0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"1","element":"figcaption","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"m ","element":"figcaption","subtype":"caption"},{"text":"= 100","element":"figcaption","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"style":{"height":14.8},"width":166.84,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/49-1.png","element":"img","alt":"γ = 0.015","inline":true},{"id":"id-106","text":", and a step size of ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":155.2,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/49-2.png","element":"img","alt":" η = 0.01.","inline":true}],[{"text":"In Figure ","element":"span"},{"href":"#id-106","text":"1 ","element":"a"},{"text":"we call attention to the difference in the training dynamics of hinge loss versus logistic loss. Perhaps the key difference is that, under the hinge loss, the contribution of any given point to the parameter update does not shrink as the point approaches ","element":"span"},{"text":"0 ","element":"span"},{"text":"loss. 
Furthermore, unlike with the logistic loss, points can actually attain zero hinge loss after a finite number of epochs. While a point has zero loss, it ceases to contribute to the update of the network parameters. As a result, points close to zero hinge loss periodically activate and deactivate, giving rise to the chaotic behavior observed as the training loss approaches zero. We emphasize that managing this behavior required a careful analysis distinct from that of prior works analysing the logistic loss.","element":"span"}],[{"text":"In Figure ","element":"span"},{"href":"#id-107","text":"2 ","element":"a"},{"text":"we call particular attention to the bottom-right plot. Our theory predicts a phase transition between benign overfitting and non-benign overfitting when ","element":"span"},{"style":{"height":16},"width":137.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/49-3.png","element":"img","alt":" γ ≈ c/n","inline":true},{"text":": the phase transition we observe empirically in the bottom-right heatmap suggests this estimate is reasonable. With regard to the hinge loss over the corrupt points, displayed in the top-right heatmap, we observe another phase transition, this time between overfitting and non-overfitting. The top and bottom heatmaps of the left-hand column display the hinge loss over the clean training set and total training set respectively; these appear very similar because clean points make up ","element":"span"},{"text":"95% ","element":"span"},{"text":"of the training set. The clean points fail to achieve zero, or close to zero, hinge loss only when ","element":"span"},{"style":{"height":10.61},"width":21.48,"height":26.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/49-4.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"is small and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"is large. 
As stated in the caption, in these experiments ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is fixed, and thus as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"increases, the near-orthogonality condition we require on the noise components to prove convergence to zero clean loss is compromised. As a result, when ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/49-5.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"is small and the correlations between noise vectors are potentially large, it is possible for pairs of points with opposite labels to be significantly correlated.","element":"span"}],[{"style":{"width":"88%"},"width":1410,"height":1075,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/50-0.png","element":"img"}],[{"text":"Figure 2: from left to right in the top row we show the loss on clean training and corrupt training points after training. In the bottom row, again from left to right, we show the total loss after training and the test loss on ","element":"figcaption","subtype":"caption"},{"id":"id-107","text":"10000 ","element":"figcaption","subtype":"caption"},{"text":"randomly generated points. For each plot we set ","element":"figcaption","subtype":"caption"},{"style":{"height":14.82},"width":482.16,"height":37.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/50-1.png","element":"img","alt":" d = 1000, m = 30, η = 0.005","inline":true,"padRight":true},{"text":"and train for ","element":"figcaption","subtype":"caption"},{"text":"5000 ","element":"figcaption","subtype":"caption"},{"text":"iterations of gradient descent using hinge loss. 
In each plot we vary ","element":"figcaption","subtype":"caption"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/50-2.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"and hold the fraction of corrupt points constant at ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"05","element":"figcaption","subtype":"caption"},{"text":". In the bottom right plot we also graph the curve ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":248.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71943/images/50-3.png","element":"img","alt":"c/n for c ≈ 0.6","inline":true}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]