36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2003.02218","publisher":"arxiv","paperJSON":{"title":"The large learning rate phase of deep learning: the catapult mechanism","paperID":"2003.02218","avgLineHeight":11.94,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3c","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"Deep learning has shown remarkable success across a variety of machine learning tasks. At the same time, our theoretical understanding of deep learning methods remains limited. In particular, the interplay between training dynamics, properties of the learned network, and generalization remains a largely open problem.","element":"span"}],[{"text":"In this work we take a step toward addressing these questions. We present a dynamical mechanism that allows deep networks trained using SGD to find flat minima and achieve superior performance. Our theoretical predictions agree well with empirical results in a variety of deep learning settings. In many cases we are able to predict the regime of learning rates where optimal performance is achieved. Figure ","element":"span"},{"href":"#id-0","text":"1 ","element":"a"},{"text":"summarizes our main results. This work builds on several existing results, which we now review.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.1. Large learning rate SGD improves generalization","element":"span"}],[{"text":"SGD training with large initial learning rates often leads to improved performance over training with small initial learning rates (see ","element":"span"},{"href":"#id-1","referenceIndex":20,"text":"Li et al. 
","element":"a"},{"href":"#id-1","referenceIndex":20,"text":"(","element":"a"},{"href":"#id-1","referenceIndex":20,"text":"2019","element":"a"},{"text":"); ","element":"span"},{"href":"#id-2","referenceIndex":16,"text":"Leclerc & Madry ","element":"a"},{"href":"#id-2","referenceIndex":16,"text":"(","element":"a"},{"href":"#id-2","referenceIndex":16,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-3","referenceIndex":35,"text":"Xie et al. ","element":"a"},{"href":"#id-3","referenceIndex":35,"text":"(","element":"a"},{"href":"#id-3","referenceIndex":35,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-4","referenceIndex":9,"text":"Frankle et al. ","element":"a"},{"href":"#id-4","referenceIndex":9,"text":"(","element":"a"},{"href":"#id-4","referenceIndex":9,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"Jastrzebski et al. ","element":"a"},{"href":"#id-5","referenceIndex":13,"text":"(","element":"a"},{"href":"#id-5","referenceIndex":13,"text":"2020","element":"a"},{"text":") for recent discussions). 
It has been suggested that one of the mechanisms underlying the benefit of large learning rates is that noise from stochastic gradient descent leads to flat minima, and that flat minima generalize better than sharp minima (","element":"span"},{"href":"#id-6","referenceIndex":10,"text":"Hochreiter & Schmidhuber","element":"a"},{"href":"#id-6","referenceIndex":10,"text":", ","element":"a"},{"href":"#id-6","referenceIndex":10,"text":"1997","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"Keskar et al.","element":"a"},{"href":"#id-7","referenceIndex":15,"text":", ","element":"a"},{"href":"#id-7","referenceIndex":15,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"Smith & Le","element":"a"},{"href":"#id-8","referenceIndex":30,"text":", ","element":"a"},{"href":"#id-8","referenceIndex":30,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":14,"text":"Jiang et al.","element":"a"},{"href":"#id-9","referenceIndex":14,"text":", ","element":"a"},{"href":"#id-9","referenceIndex":14,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":27,"text":"Park et al.","element":"a"},{"href":"#id-10","referenceIndex":27,"text":", ","element":"a"},{"href":"#id-10","referenceIndex":27,"text":"2019","element":"a"},{"text":") (though see ","element":"span"},{"href":"#id-11","referenceIndex":6,"text":"Dinh et al. ","element":"a"},{"href":"#id-11","referenceIndex":6,"text":"(","element":"a"},{"href":"#id-11","referenceIndex":6,"text":"2017","element":"a"},{"text":") for discussion of some caveats). 
According to this suggestion, training with a large learning rate (or with a small batch size) can improve performance because it leads to more stochasticity during training (","element":"span"},{"href":"#id-12","referenceIndex":21,"text":"Mandt et al.","element":"a"},{"href":"#id-12","referenceIndex":21,"text":", ","element":"a"},{"href":"#id-12","referenceIndex":21,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":31,"text":"Smith et al.","element":"a"},{"href":"#id-13","referenceIndex":31,"text":", ","element":"a"},{"href":"#id-13","referenceIndex":31,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"Smith & Le","element":"a"},{"href":"#id-8","referenceIndex":30,"text":", ","element":"a"},{"href":"#id-8","referenceIndex":30,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-14","referenceIndex":32,"text":"Smith et al.","element":"a"},{"href":"#id-14","referenceIndex":32,"text":", ","element":"a"},{"href":"#id-14","referenceIndex":32,"text":"2018","element":"a"},{"text":").","element":"span"}],[{"text":"We will develop a connection between large learning rate and flatness of minima in models trained via SGD. Unlike the relationship explored in most previous work though, this connection is not driven by SGD noise, but arises solely as a result of training with a large initial learning rate, and holds even for full batch gradient descent.","element":"span"}],[{"id":"id-0","style":{"width":"99%"},"width":1945,"height":727,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/1-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 1. ","element":"figcaption","subtype":"caption"},{"text":"A summary of our main results. (a) A visualization of gradient descent dynamics derived in our theoretical setup. 
A 2D slice of parameter space is shown, where lighter color indicates higher loss and dots represent points visited during optimization. Initially, the loss grows rapidly while local curvature decreases. Once curvature is sufficiently low, gradient descent converges to a flat minimum. We call this the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"catapult effect","element":"figcaption","subtype":"caption"},{"text":". See Figures ","element":"figcaption","subtype":"caption"},{"href":"#id-15","text":"2 ","element":"a","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"href":"#id-16","text":"S1 ","element":"a","subtype":"caption"},{"text":"for more details. (b) Confirmation of our theoretical predictions in a practical deep learning setting. The line shows the test accuracy of a Wide ResNet trained on CIFAR-10 as a function of learning rate, each trained for a fixed number of steps. Dashed lines show our predictions for the boundaries of the large learning rate regime (the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"catapult phase","element":"figcaption","subtype":"caption"},{"text":"), where we expect optimal performance to occur. Maximal performance is achieved between the dashed lines, confirming our predictions. See Section ","element":"figcaption","subtype":"caption"},{"text":"3 ","element":"span","subtype":"caption"},{"text":"for details.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"1.2. 
The existing theory of infinite width networks is insufficient to describe large learning rates","element":"span"}],[{"text":"A recent body of work has investigated the gradient descent dynamics of deep networks in the limit of infinite width (","element":"span"},{"href":"#id-17","referenceIndex":5,"text":"Daniely","element":"a"},{"href":"#id-17","referenceIndex":5,"text":", ","element":"a"},{"href":"#id-17","referenceIndex":5,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-18","referenceIndex":12,"text":"Jacot et al.","element":"a"},{"href":"#id-18","referenceIndex":12,"text":", ","element":"a"},{"href":"#id-18","referenceIndex":12,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"Lee et al.","element":"a"},{"href":"#id-19","referenceIndex":18,"text":", ","element":"a"},{"href":"#id-19","referenceIndex":18,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-20","referenceIndex":7,"text":"Du et al.","element":"a"},{"href":"#id-20","referenceIndex":7,"text":", ","element":"a"},{"href":"#id-20","referenceIndex":7,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-21","referenceIndex":37,"text":"Zou et al.","element":"a"},{"href":"#id-21","referenceIndex":37,"text":", ","element":"a"},{"href":"#id-21","referenceIndex":37,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-22","referenceIndex":1,"text":"Allen-Zhu et al.","element":"a"},{"href":"#id-22","referenceIndex":1,"text":", ","element":"a"},{"href":"#id-22","referenceIndex":1,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-23","referenceIndex":19,"text":"Li & Liang","element":"a"},{"href":"#id-23","referenceIndex":19,"text":", ","element":"a"},{"href":"#id-23","referenceIndex":19,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":4,"text":"Chizat 
","element":"a"},{"href":"#id-24","referenceIndex":4,"text":"et al.","element":"a"},{"href":"#id-24","referenceIndex":4,"text":", ","element":"a"},{"href":"#id-24","referenceIndex":4,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-25","referenceIndex":23,"text":"Mei et al.","element":"a"},{"href":"#id-25","referenceIndex":23,"text":", ","element":"a"},{"href":"#id-25","referenceIndex":23,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-26","referenceIndex":28,"text":"Rotskoff & Vanden-Eijnden","element":"a"},{"href":"#id-26","referenceIndex":28,"text":", ","element":"a"},{"href":"#id-26","referenceIndex":28,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-27","referenceIndex":29,"text":"Sirignano & Spiliopoulos","element":"a"},{"href":"#id-27","referenceIndex":29,"text":", ","element":"a"},{"href":"#id-27","referenceIndex":29,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-28","referenceIndex":33,"text":"Woodworth et al.","element":"a"},{"href":"#id-28","referenceIndex":33,"text":", ","element":"a"},{"href":"#id-28","referenceIndex":33,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-29","referenceIndex":24,"text":"Naveh et al.","element":"a"},{"text":"). Of particular relevance is the work by ","element":"span"},{"href":"#id-18","referenceIndex":12,"text":"Jacot et al. ","element":"a"},{"href":"#id-18","referenceIndex":12,"text":"(","element":"a"},{"href":"#id-18","referenceIndex":12,"text":"2018","element":"a"},{"text":") showing that gradient flow in the space of functions is governed by a dynamical quantity called the Neural Tangent Kernel (NTK) which is fixed at its initial value in this limit. ","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"Lee et al. 
","element":"a"},{"href":"#id-19","referenceIndex":18,"text":"(","element":"a"},{"href":"#id-19","referenceIndex":18,"text":"2019","element":"a"},{"text":") showed this result is equivalent to training the linearization of a model around its initialization in parameter space. Finally, moving away from the strict limit of infinite width by working perturbatively, ","element":"span"},{"href":"#id-30","referenceIndex":8,"text":"Dyer & Gur-Ari ","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"(","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-31","referenceIndex":11,"text":"Huang ","element":"a"},{"href":"#id-31","referenceIndex":11,"text":"& Yau ","element":"a"},{"href":"#id-31","referenceIndex":11,"text":"(","element":"a"},{"href":"#id-31","referenceIndex":11,"text":"2019","element":"a"},{"text":") introduced an approach to computing the finite-width corrections to network evolution.","element":"span"}],[{"text":"Despite this progress, it seems these results are insufficient to capture the full dynamics of deep networks, as well as their superior performance, in regimes applicable to practice. 
Prior work has focused on comparisons between various infinite-width kernels associated with deep networks and their finite-width, SGD-trained counterparts (","element":"span"},{"href":"#id-32","referenceIndex":17,"text":"Lee et al.","element":"a"},{"href":"#id-32","referenceIndex":17,"text":", ","element":"a"},{"href":"#id-32","referenceIndex":17,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-33","referenceIndex":25,"text":"Novak ","element":"a"},{"href":"#id-33","referenceIndex":25,"text":"et al.","element":"a"},{"href":"#id-33","referenceIndex":25,"text":", ","element":"a"},{"href":"#id-33","referenceIndex":25,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-34","referenceIndex":2,"text":"Arora et al.","element":"a"},{"href":"#id-34","referenceIndex":2,"text":", ","element":"a"},{"href":"#id-34","referenceIndex":2,"text":"2019","element":"a"},{"text":"). Specific findings vary depending on precise choices for architecture and hyperparameters. However, dramatic performance gaps are consistently observed between non-linear CNNs and their limiting kernels, implying that the theory is not sufficient to explain the performance of deep networks in this realistic setup. Furthermore, some hyperparameter settings in finite-width models have no known analogue in the infinite width limit, and it is these settings that often lead to optimal performance.","element":"span"}],[{"text":"In particular, finite width networks are often trained with large learning rates that would cause divergence for infinite width linearized models. Further, these large learning rates cause finite width networks to converge to flat minima. For infinite width linearized models, trained with MSE loss, all minima have the same curvature, and the notion of flat minima does not apply. 
We argue that the reduction in curvature during optimization, and support for learning rates that are infeasible for infinite width linearized models, may thus partially explain performance gaps observed between linear and non-linear models.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.3. Our contribution: three learning rate regimes","element":"span"}],[{"text":"In this work, we identify a dynamical mechanism which enables finite-width networks to stably access large learning rates. We show that this mechanism causes training to converge to flatter minima and is associated with improved generalization. We further show that this same mechanism can describe the behavior of infinite width networks, if training time is increased along with network width.","element":"span"}],[{"text":"This new mechanism enables a characterization of gradient descent training in terms of three learning rate regimes, or phases: the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"lazy phase","element":"span"},{"text":", the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"catapult phase","element":"span"},{"text":", and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"divergent phase","element":"span"},{"text":". In Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"we analytically derive the behavior in these three learning rate regimes for one hidden layer linear networks with large but finite width, trained with MSE loss. We confirm experimentally in Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"that these phases also apply to deep nonlinear fully-connected, convolutional, and residual architectures. 
In Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"we study additional predictions of the analytic solution.","element":"span"}],[{"text":"We now summarize all three phases, using ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-0.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"to indicate the learning rate, and ","element":"span"},{"style":{"height":13.19},"width":39.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-1.png","element":"img","alt":" λ0","inline":true,"padRight":true},{"text":"to indicate the initial curvature (defined precisely in Section ","element":"span"},{"href":"#id-35","text":"2.1","element":"a"},{"text":"). The phase is determined by the curvature at initialization and by the learning rate, despite the fact that the curvature may change significantly during training. Based on the experimental evidence we expect the behavior described below to apply in typical deep learning settings, when training sufficiently wide networks using SGD.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lazy phase: ","element":"span"},{"style":{"height":16},"width":175.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-2.png","element":"img","alt":" η < 2/λ0 .","inline":true,"padRight":true},{"text":"For sufficiently small learning rate, the curvature ","element":"span"},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-3.png","element":"img","alt":" λt","inline":true,"padRight":true},{"text":"at training step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"remains constant during the initial part of training. 
The model behaves (loosely) as a model linearized about its initial parameters (","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"Lee et al.","element":"a"},{"href":"#id-19","referenceIndex":18,"text":", ","element":"a"},{"href":"#id-19","referenceIndex":18,"text":"2019","element":"a"},{"text":"); this becomes exact in the infinite width limit, where these dynamics are sometimes called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"lazy training ","element":"span"},{"text":"(","element":"span"},{"href":"#id-18","referenceIndex":12,"text":"Jacot et al.","element":"a"},{"href":"#id-18","referenceIndex":12,"text":", ","element":"a"},{"href":"#id-18","referenceIndex":12,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"Lee et al.","element":"a"},{"href":"#id-19","referenceIndex":18,"text":", ","element":"a"},{"href":"#id-19","referenceIndex":18,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-20","referenceIndex":7,"text":"Du et al.","element":"a"},{"href":"#id-20","referenceIndex":7,"text":", ","element":"a"},{"href":"#id-20","referenceIndex":7,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-23","referenceIndex":19,"text":"Li & Liang","element":"a"},{"href":"#id-23","referenceIndex":19,"text":", ","element":"a"},{"href":"#id-23","referenceIndex":19,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-21","referenceIndex":37,"text":"Zou et al.","element":"a"},{"href":"#id-21","referenceIndex":37,"text":", ","element":"a"},{"href":"#id-21","referenceIndex":37,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-22","referenceIndex":1,"text":"Allen-Zhu et al.","element":"a"},{"href":"#id-22","referenceIndex":1,"text":", ","element":"a"},{"href":"#id-22","referenceIndex":1,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":4,"text":"Chizat et 
al.","element":"a"},{"href":"#id-24","referenceIndex":4,"text":", ","element":"a"},{"href":"#id-24","referenceIndex":4,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-30","referenceIndex":8,"text":"Dyer & ","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"Gur-Ari","element":"a"},{"href":"#id-30","referenceIndex":8,"text":", ","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"2020","element":"a"},{"text":"). For a discussion of trainability and the connection to the NTK in the lazy phase see ","element":"span"},{"href":"#id-36","referenceIndex":34,"text":"Xiao et al. ","element":"a"},{"href":"#id-36","referenceIndex":34,"text":"(","element":"a"},{"href":"#id-36","referenceIndex":34,"text":"2019","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Catapult phase: ","element":"span"},{"style":{"height":16},"width":308.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-4.png","element":"img","alt":" 2/λ0 < η < ηmax .","inline":true,"padRight":true},{"text":"In this phase, the curvature at initialization is too high for training to converge to a nearby point, and the linear approximation quickly breaks down. Optimization begins with a period of exponential growth in the loss, coupled with a rapid decrease in curvature, until curvature stabilizes at a value ","element":"span"},{"style":{"height":16},"width":198.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-5.png","element":"img","alt":" λfinal < 2/η","inline":true},{"text":". Once the curvature drops below ","element":"span"},{"style":{"height":16},"width":59.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-6.png","element":"img","alt":" 2/η","inline":true},{"text":", training converges, ultimately reaching a minimum that is flatter than those found in the lazy phase. 
This initial period lasts for a number of training steps that is of order ","element":"span"},{"text":"log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":")","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"is the network width, and is therefore quite short for realistic networks (often lasting less than a single epoch). Optimal performance is often achieved when the initial learning rate is in this range. The gradient descent dynamics in this phase are visualized in SM Figure ","element":"span"},{"href":"#id-16","text":"S1 ","element":"a"},{"text":"and in Figure ","element":"span"},{"href":"#id-0","text":"1","element":"a"},{"text":".","element":"span"}],[{"text":"The maximum learning rate is approximately given by ","element":"span"},{"style":{"height":16},"width":461.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-7.png","element":"img","alt":" ηmax = cact./λ0, where cact.","inline":true,"padRight":true},{"text":"is an architecture-dependent constant. Empirically, we find that this constant depends strongly on the non-linearity but only weakly on other aspects of the architecture. For networks with ReLU non-linearity we find empirically that ","element":"span"},{"style":{"height":13.19},"width":164.3,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-8.png","element":"img","alt":" cact. ≈ 12","inline":true},{"text":". For the theoretical model, we show that ","element":"span"},{"style":{"height":13.19},"width":154.26,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-9.png","element":"img","alt":" cact. 
= 4.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Divergent phase: ","element":"span"},{"style":{"height":12.4},"width":174.9,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-10.png","element":"img","alt":" η > ηmax .","inline":true,"padRight":true},{"text":"When the learning rate is above the maximum learning rate of the model, the loss diverges and the model does not train.","element":"span"}]]},{"heading":"2. Theoretical results","paragraphs":[[{"text":"We now present our main theoretical result, an analysis of gradient descent dynamics for a neural network with large but finite width.","element":"span"}],[{"text":"Given a network function ","element":"span"},{"style":{"height":16.58},"width":195.36,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-11.png","element":"img","alt":" f : Rd → R","inline":true,"padRight":true},{"text":"with model parameters ","element":"span"},{"style":{"height":11.6},"width":114.31,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-12.png","element":"img","alt":" θ ∈ Rp","inline":true},{"text":", and a training set ","element":"span"},{"style":{"height":16},"width":237.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-13.png","element":"img","alt":" {(xα, yα)}mα=1","inline":true},{"text":", the MSE loss is","element":"span"}],[{"style":{"width":"61%"},"width":1204,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-14.png","element":"img"}],[{"text":"The NTK ","element":"span"},{"style":{"height":13.78},"width":298.62,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-15.png","element":"img","alt":" Θ : Rd × Rd → R","inline":true,"padRight":true},{"text":"is defined 
by","element":"span"}],[{"style":{"width":"64%"},"width":1253,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-16.png","element":"img"}],[{"text":"We denote by ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-17.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"the maximum eigenvalue of the kernel. In large width models, ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-18.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"provides a local measure of the loss landscape curvature that is similar to the top eigenvalue of the Hessian (","element":"span"},{"href":"#id-30","referenceIndex":8,"text":"Dyer & Gur-Ari","element":"a"},{"href":"#id-30","referenceIndex":8,"text":", ","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"2020","element":"a"},{"text":").","element":"span"}],[{"text":"In this section, we will consider a network with one hidden layer and linear activations, where the network function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is given by","element":"span"}],[{"style":{"width":"58%"},"width":1138,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/2-19.png","element":"img"}],[{"text":"Here ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"is the width (number of neurons in the hidden layer), ","element":"span"},{"style":{"height":14.18},"width":359.07,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-0.png","element":"img","alt":" v ∈ Rn and u ∈ Rn×d ","inline":true,"padRight":true},{"text":"are the model parameters (collectively denoted 
","element":"span"},{"style":{"height":16.59},"width":237.78,"height":41.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-1.png","element":"img","alt":" θ), and x ∈ Rd ","inline":true,"padRight":true},{"text":"is the training input. At initialization, the weights are drawn from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1)","element":"span"},{"text":".","element":"span"}],[{"id":"id-35","style":{"fontWeight":"bold"},"text":"2.1. Warmup: a simplified model","element":"span"}],[{"text":"Before analyzing the dynamics of the model, we analyze a simpler setting which captures the most important aspects of the full solution. Consider a dataset with 1D inputs, and with a single training sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"with label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"= 0","element":"span"},{"text":". The network function evaluated on this input is then ","element":"span"},{"style":{"height":17.38},"width":246.68,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-2.png","element":"img","alt":" f = n−1/2vT u","inline":true},{"text":", with ","element":"span"},{"style":{"height":14},"width":160.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-3.png","element":"img","alt":" u, v ∈ Rn","inline":true},{"text":", and the loss is ","element":"span"},{"style":{"height":17.38},"width":163.84,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-4.png","element":"img","alt":" L = f 2/2","inline":true},{"text":". 
The gradient descent equations at training step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"are","element":"span"}],[{"id":"id-37","style":{"width":"71%"},"width":1397,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-5.png","element":"img"}],[{"text":"Next, consider the update equations in function space. These can be written in terms of the Neural Tangent Kernel. For this model, the kernel evaluated on the training set is a scalar which is equal to ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-6.png","element":"img","alt":" λ","inline":true},{"text":", its top eigenvalue, and is given by","element":"span"}],[{"style":{"width":"64%"},"width":1264,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-7.png","element":"img"}],[{"text":"At initialization, both ","element":"span"},{"style":{"height":16.59},"width":39.8,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-8.png","element":"img","alt":" f 2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-9.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"scale as ","element":"span"},{"style":{"height":13.39},"width":116.87,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-10.png","element":"img","alt":" n0 = 1","inline":true,"padRight":true},{"text":"with width. 
The following update equations for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-11.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"at step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"can be derived from (","element":"span"},{"href":"#id-37","text":"4","element":"a"},{"text":").","element":"span"}],[{"id":"id-38","style":{"width":"62%"},"width":1225,"height":202,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-12.png","element":"img"}],[{"id":"id-39","text":"It is important to note that these are the exact update equations for this model, and that no higher-order terms were neglected. ","element":"span"},{"text":"We now analyze these dynamical equations assuming the width ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"is large. Two learning rates that will be important in the analysis are ","element":"span"},{"style":{"height":16},"width":493.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-13.png","element":"img","alt":" ηcrit = 2/λ0 and ηmax = 4/λ0","inline":true},{"text":". In terms of the notation introduced above, the architecture-dependent constant that determines the maximum learning rate in this model is ","element":"span"},{"style":{"height":13.19},"width":154.28,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-14.png","element":"img","alt":" cact. = 4.","inline":true}],[{"text":"2.1.1. 
L","element":"span"},{"text":"AZY PHASE","element":"span"}],[{"text":"Taking the strict infinite width limit, equations (","element":"span"},{"href":"#id-38","text":"6","element":"a"},{"text":") and (","element":"span"},{"href":"#id-39","text":"7","element":"a"},{"text":") become","element":"span"}],[{"id":"id-40","style":{"width":"64%"},"width":1256,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-15.png","element":"img"}],[{"text":"When ","element":"span"},{"style":{"height":14.4},"width":197.47,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-16.png","element":"img","alt":" η < ηcrit, λ","inline":true,"padRight":true},{"text":"remains constant throughout training. This is a special case of NTK dynamics, where the kernel is constant and the network evolves as a linear model (","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"Lee et al.","element":"a"},{"href":"#id-19","referenceIndex":18,"text":", ","element":"a"},{"href":"#id-19","referenceIndex":18,"text":"2019","element":"a"},{"text":"). The function and the loss both shrink to zero because the multiplicative factor obeys ","element":"span"},{"style":{"height":16},"width":222.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-17.png","element":"img","alt":" |1 − ηλt| < 1","inline":true},{"text":". This convergence happens in ","element":"span"},{"style":{"height":17.38},"width":342.18,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-18.png","element":"img","alt":" O(n0) = O(1) steps.","inline":true}],[{"text":"2.1.2. C","element":"span"},{"text":"ATAPULT PHASE","element":"span"}],[{"text":"When ","element":"span"},{"style":{"height":12.4},"width":276.25,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-19.png","element":"img","alt":" ηcrit < η < ηmax","inline":true},{"text":", the loss diverges in the infinite width limit. 
Indeed, from (","element":"span"},{"href":"#id-40","text":"8","element":"a"},{"text":") we see that the kernel is constant in the limit, while ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"receives multiplicative updates where ","element":"span"},{"style":{"height":16},"width":219.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-20.png","element":"img","alt":" |1 − ηλt| > 1","inline":true},{"text":". This is the well known instability of gradient descent dynamics for linear models with MSE loss. However, the underlying model is not linear in its parameters, and finite width contributions turn out to be important. We therefore relax the infinite width limit and analyze equations (","element":"span"},{"href":"#id-38","text":"6","element":"a"},{"text":",","element":"span"},{"href":"#id-39","text":"7","element":"a"},{"text":") for large but finite width, ","element":"span"},{"style":{"height":12},"width":115.84,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-21.png","element":"img","alt":" n ≫ 1.","inline":true}],[{"text":"First, note that ","element":"span"},{"style":{"height":14.4},"width":204.16,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-22.png","element":"img","alt":" ηλ0 − 4 < 0","inline":true,"padRight":true},{"text":"by assumption, and therefore the (additive) kernel updates are negative for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
During early training, ","element":"span"},{"style":{"height":16},"width":55.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-23.png","element":"img","alt":" |ft|","inline":true,"padRight":true},{"text":"grows (as in the infinite width limit) while ","element":"span"},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-24.png","element":"img","alt":" λt","inline":true,"padRight":true},{"text":"remains constant up to small ","element":"span"},{"style":{"height":17.38},"width":131.03,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-25.png","element":"img","alt":" O(n−1)","inline":true,"padRight":true},{"text":"updates. After ","element":"span"},{"style":{"height":16},"width":174.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-26.png","element":"img","alt":" t ∼ log(n)","inline":true,"padRight":true},{"text":"steps, ","element":"span"},{"style":{"height":16},"width":55.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-27.png","element":"img","alt":" |ft|","inline":true,"padRight":true},{"text":"grows to order ","element":"span"},{"style":{"height":14.18},"width":72.13,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-28.png","element":"img","alt":" n1/2","inline":true},{"text":". At this point, the kernel updates are no longer negligible because ","element":"span"},{"style":{"height":17.38},"width":85.6,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-29.png","element":"img","alt":" f 2t /n","inline":true,"padRight":true},{"text":"is of order ","element":"span"},{"style":{"height":13.78},"width":126.49,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-30.png","element":"img","alt":" n0. 
The","inline":true,"padRight":true},{"text":"kernel ","element":"span"},{"style":{"height":13.19},"width":35.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-31.png","element":"img","alt":" λt","inline":true,"padRight":true},{"text":"receives negative, non-negligible updates while both ","element":"span"},{"style":{"height":14},"width":31.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-32.png","element":"img","alt":" ft","inline":true,"padRight":true},{"text":"and the loss continue to grow (for now, we ignore the term in (","element":"span"},{"href":"#id-38","text":"6","element":"a"},{"text":") with an explicit ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/n ","element":"span"},{"text":"dependence). This continues until the kernel is sufficiently small that the condition ","element":"span"},{"style":{"height":15.2},"width":131.63,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-33.png","element":"img","alt":" ηλt ≲ 2","inline":true,"padRight":true},{"text":"is met.","element":"span"},{"text":"1 ","element":"span"},{"text":"We call this curvature-reduction effect the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"catapult effect","element":"span"},{"text":". Beyond this point, ","element":"span"},{"style":{"height":16},"width":388.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-34.png","element":"img","alt":" |1 − ηλt| < 1 holds, |ft|","inline":true,"padRight":true},{"text":"shrinks, and the loss converges to a global minimum. 
The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"dependence of the steps until optimization converges is ","element":"span"},{"text":"log (","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":")","element":"span"},{"text":".","element":"span"}],[{"text":"It remains to show that the term in (","element":"span"},{"href":"#id-38","text":"6","element":"a"},{"text":") with an explicit ","element":"span"},{"style":{"height":13.38},"width":64.83,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-35.png","element":"img","alt":" n−1 ","inline":true,"padRight":true},{"text":"dependence does not affect these conclusions. Once ","element":"span"},{"style":{"height":16},"width":162.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-36.png","element":"img","alt":" |ft| grows","inline":true,"padRight":true},{"text":"to order ","element":"span"},{"style":{"height":14.18},"width":72.13,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-37.png","element":"img","alt":" n1/2","inline":true},{"text":", this term is no longer negligible and can cause the multiplicative factor in front of ","element":"span"},{"style":{"height":14},"width":31.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/3-38.png","element":"img","alt":" ft","inline":true,"padRight":true},{"text":"to become smaller than 1 ","element":"span"},{"text":"in absolute value, causing ","element":"span"},{"style":{"height":16},"width":55.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-0.png","element":"img","alt":" |ft|","inline":true,"padRight":true},{"text":"to start shrinking. 
However, once ","element":"span"},{"style":{"height":16},"width":55.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-1.png","element":"img","alt":" |ft|","inline":true,"padRight":true},{"text":"shrinks sufficiently this term again becomes negligible. Therefore, the loss will not converge to zero unless the curvature eventually drops below ","element":"span"},{"style":{"height":16},"width":59.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-2.png","element":"img","alt":" 2/η","inline":true},{"text":". Conversely, notice that this term cannot cause ","element":"span"},{"style":{"height":16},"width":55.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-3.png","element":"img","alt":" |ft|","inline":true,"padRight":true},{"text":"to diverge for learning rates below ","element":"span"},{"style":{"height":10.4},"width":78.88,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-4.png","element":"img","alt":" ηmax","inline":true},{"text":". Indeed, if this were to happen then equation (","element":"span"},{"href":"#id-39","text":"7","element":"a"},{"text":") would drive ","element":"span"},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-5.png","element":"img","alt":"λt","inline":true,"padRight":true},{"text":"to negative values, leading to a contradiction. 
This completes the analysis in this phase.","element":"span"}],[{"text":"Let us make a few comments about the catapult phase.","element":"span"}],[{"text":"It is important for the analysis that we take a modified large width limit, in which the number of training steps grows like ","element":"span"},{"text":"log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") ","element":"span"},{"text":"as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"becomes large. This is different from the large width limit commonly studied in the literature, in which the number of steps is kept fixed as the width is taken large. With this modified limit, the analysis above holds even in the limit of infinite width. Note as well that the catapult effect takes place over ","element":"span"},{"text":"log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") ","element":"span"},{"text":"steps, and for practical networks will occur within the first 100 steps or so of training.","element":"span"}],[{"text":"In the catapult phase, the kernel at the end of training is smaller by an order ","element":"span"},{"style":{"height":13.39},"width":39.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-6.png","element":"img","alt":" n0","inline":true,"padRight":true},{"text":"amount compared with its value at initialization. The kernel provides a local measure of the loss curvature. Therefore, the minima that SGD finds in the catapult phase are flatter than those it finds in the lazy phase. Contrast this situation, in which the kernel receives non-negligible updates, with the conclusions of ","element":"span"},{"href":"#id-18","referenceIndex":12,"text":"Jacot et al. 
","element":"a"},{"href":"#id-18","referenceIndex":12,"text":"(","element":"a"},{"href":"#id-18","referenceIndex":12,"text":"2018","element":"a"},{"text":") where the kernel is constant throughout training. The difference is due to the large learning rate, which leads to a breakdown of the linearized approximation even at large width.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-15","text":"2 ","element":"a"},{"text":"illustrates the dynamics in the catapult phase. For learning rates ","element":"span"},{"style":{"height":12.4},"width":276.24,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-7.png","element":"img","alt":" ηcrit < η < ηmax","inline":true,"padRight":true},{"text":"we observe the catapult effect: the loss goes up before converging to zero. The curvature exhibits the expected sharp transitions as a function of the learning rate: it is constant in the lazy phase, decreases in the catapult phase, and diverges for ","element":"span"},{"style":{"height":12.4},"width":164.94,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-8.png","element":"img","alt":" η > ηmax.","inline":true}],[{"id":"id-15","style":{"width":"98%"},"width":1924,"height":515,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 2. 
","element":"figcaption","subtype":"caption"},{"text":"Empirical results for the gradient descent dynamics of the warmup model with ","element":"figcaption","subtype":"caption"},{"style":{"height":12.49},"width":123.24,"height":31.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-10.png","element":"img","alt":" n = 103","inline":true},{"text":", for which ","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":132.05,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-11.png","element":"img","alt":" ηcrit ≈ 1","inline":true},{"text":". (a) Training loss for different learning rates. (b) Maximum NTK eigenvalue as a function of time. For ","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":138.9,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-12.png","element":"img","alt":" η > 1, λt","inline":true,"padRight":true},{"text":"decreases rapidly to a fixed value. (c) Maximum NTK eigenvalue at ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":136.75,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-13.png","element":"img","alt":" t = 25/η","inline":true},{"text":". The shaded area indicates learning rates for which training diverges empirically. The results are presented as a function of ","element":"figcaption","subtype":"caption"},{"style":{"height":12.4},"width":58.94,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-14.png","element":"img","alt":" t · η","inline":true,"padRight":true},{"text":"(rather than ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"t","element":"figcaption","subtype":"caption"},{"text":") for convenience.","element":"figcaption","subtype":"caption"}],[{"text":"2.1.3. 
D","element":"span"},{"text":"IVERGENT PHASE","element":"span"}],[{"text":"Completing the analysis of this model, when ","element":"span"},{"style":{"height":12.4},"width":162.28,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-15.png","element":"img","alt":" η > ηmax","inline":true,"padRight":true},{"text":"the loss diverges because the kernel receives positive updates, accelerating the rate of growth of the function. Therefore, ","element":"span"},{"style":{"height":16},"width":212.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-16.png","element":"img","alt":" ηmax = 4/λ0","inline":true,"padRight":true},{"text":"is the maximum learning rate of the model.","element":"span"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"2.2. Full model","element":"span"}],[{"text":"We now turn to analyzing the model presented at the beginning of this section, with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"-dimensional inputs and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"training samples with general labels. The full analysis is presented in SM Section ","element":"span"},{"href":"#id-41","text":"D.1","element":"a"},{"text":"; here we summarize the argument. 
The conclusions are essentially the same as those of the warmup model.","element":"span"}],[{"text":"We introduce the notation ","element":"span"},{"style":{"height":16},"width":207.27,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-17.png","element":"img","alt":" fα := f(xα)","inline":true,"padRight":true},{"text":"for the function evaluated on a training sample, ","element":"span"},{"style":{"height":18.21},"width":237.98,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-18.png","element":"img","alt":"˜fα := fα − yα","inline":true,"padRight":true},{"text":"for the error, and ","element":"span"},{"style":{"height":16.79},"width":307.38,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-19.png","element":"img","alt":"Θαβ := Θ(xα, xβ)","inline":true,"padRight":true},{"text":"for the kernel elements. We will treat ","element":"span"},{"style":{"height":18.21},"width":67.87,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-20.png","element":"img","alt":" f, ˜f","inline":true,"padRight":true},{"text":"evaluated on the training set as vectors in ","element":"span"},{"style":{"height":10.8},"width":56.78,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-21.png","element":"img","alt":" Rm","inline":true},{"text":", whose elements are ","element":"span"},{"style":{"height":18.2},"width":100.44,"height":45.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/4-22.png","element":"img","alt":" fα, ˜fα","inline":true},{"text":". Consider the following update equation for the error, which can be derived from the update equations for the","element":"span"}],[{"text":"parameters. 
Note that this is the exact update equation for this model; no higher-order terms were neglected.","element":"span"}],[{"style":{"width":"99%"},"width":1942,"height":231,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-0.png","element":"img"}],[{"text":"We again take the modified large width limit ","element":"span"},{"style":{"height":8.8},"width":125.92,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-1.png","element":"img","alt":" n → ∞","inline":true},{"text":", allowing the number of steps to scale logarithmically in the width. At initialization, ","element":"span"},{"style":{"height":14},"width":40.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-2.png","element":"img","alt":" fα","inline":true},{"text":", ","element":"span"},{"style":{"height":18.21},"width":40.52,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-3.png","element":"img","alt":"˜fα","inline":true},{"text":", and ","element":"span"},{"style":{"height":15.59},"width":69.72,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-4.png","element":"img","alt":" Θαβ","inline":true,"padRight":true},{"text":"are all of order ","element":"span"},{"style":{"height":13.39},"width":39.92,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-5.png","element":"img","alt":" n0","inline":true},{"text":". We now analyze the gradient descent dynamics as a function of the learning rate.","element":"span"}],[{"text":"The maximum eigenvalue of the kernel at step ","element":"span"},{"style":{"height":19.01},"width":631.95,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-6.png","element":"img","alt":" t is λt. 
When η < ηcrit, the norm ∥ ˜f t∥2","inline":true,"padRight":true},{"text":"shrinks to zero in ","element":"span"},{"style":{"height":17.39},"width":106.13,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-7.png","element":"img","alt":" O(n0)","inline":true,"padRight":true},{"text":"time while the kernel receives ","element":"span"},{"style":{"height":17.38},"width":131.03,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-8.png","element":"img","alt":" O(n−1)","inline":true,"padRight":true},{"text":"corrections. Therefore, in the limit the kernel remains constant until convergence. This is a special case of the NTK result (","element":"span"},{"href":"#id-18","referenceIndex":12,"text":"Jacot et al.","element":"a"},{"href":"#id-18","referenceIndex":12,"text":", ","element":"a"},{"href":"#id-18","referenceIndex":12,"text":"2018","element":"a"},{"text":"), and the model evolves as a linear model.","element":"span"}],[{"text":"Next, suppose that ","element":"span"},{"style":{"height":12.4},"width":286.53,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-9.png","element":"img","alt":" ηcrit < η < ηmax","inline":true},{"text":". Early during training ","element":"span"},{"style":{"height":19.01},"width":79.65,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-10.png","element":"img","alt":" ∥ ˜f∥2","inline":true,"padRight":true},{"text":"grows, with the fastest growth taking place along the direction of the top kernel eigenvector, ","element":"span"},{"style":{"height":14.74},"width":184.84,"height":36.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-11.png","element":"img","alt":" emaxt ∈ Rm","inline":true},{"text":". 
During this part of training the kernel receives ","element":"span"},{"style":{"height":17.38},"width":131.03,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-12.png","element":"img","alt":" O(n−1)","inline":true,"padRight":true},{"text":"updates, and so ","element":"span"},{"style":{"height":14.52},"width":77.65,"height":36.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-13.png","element":"img","alt":"emaxt","inline":true,"padRight":true},{"text":"does not change much. As a result, ","element":"span"},{"style":{"height":18.21},"width":31.51,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-14.png","element":"img","alt":"˜ft","inline":true,"padRight":true},{"text":"becomes aligned with ","element":"span"},{"style":{"height":14.52},"width":77.65,"height":36.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-15.png","element":"img","alt":" emaxt","inline":true,"padRight":true},{"text":". In addition, ","element":"span"},{"style":{"height":14},"width":31.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-16.png","element":"img","alt":" ft","inline":true,"padRight":true},{"text":"becomes close to ","element":"span"},{"style":{"height":18.61},"width":311.08,"height":46.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-17.png","element":"img","alt":"˜ft because ft grows","inline":true,"padRight":true},{"text":"while the label is constant. 
We therefore consider the following approximate update equations for ","element":"span"},{"style":{"height":19.79},"width":408.68,"height":49.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-18.png","element":"img","alt":"˜f max := Σα ˜fαemaxα and","inline":true,"padRight":true},{"text":"for the maximum eigenvalue ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-19.png","element":"img","alt":" λ","inline":true},{"text":", which can be approximated by ","element":"span"},{"style":{"height":19.01},"width":215.15,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-20.png","element":"img","alt":"˜f T Θ ˜f/∥ ˜f∥22.","inline":true}],[{"id":"id-49","style":{"width":"64%"},"width":1261,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-21.png","element":"img"}],[{"id":"id-50","text":"We note in passing the similarity between these equations and ","element":"span"},{"text":"(","element":"span"},{"href":"#id-38","text":"6","element":"a"},{"text":"), (","element":"span"},{"href":"#id-39","text":"7","element":"a"},{"text":"). 
We see that once ","element":"span"},{"style":{"height":18.21},"width":180.81,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-22.png","element":"img","alt":"˜f max and ζ","inline":true,"padRight":true},{"text":"become of order ","element":"span"},{"style":{"height":16.58},"width":84,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-23.png","element":"img","alt":" n1/2,","inline":true},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-24.png","element":"img","alt":"λt","inline":true,"padRight":true},{"text":"receives non-negligible negative corrections of order ","element":"span"},{"style":{"height":13.38},"width":39.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-25.png","element":"img","alt":" n0","inline":true},{"text":". This evolution continues until ","element":"span"},{"style":{"height":16},"width":150.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-26.png","element":"img","alt":" λt ≲ 2/η","inline":true},{"text":", after which the error converges to zero. Finally, if ","element":"span"},{"style":{"height":12.4},"width":158.34,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-27.png","element":"img","alt":" η > ηmax","inline":true},{"text":", the error grows while ","element":"span"},{"style":{"height":13.19},"width":35.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-28.png","element":"img","alt":" λt","inline":true,"padRight":true},{"text":"receives positive updates, and the loss diverges. 
This concludes the discussion of the theoretical model; further details can be found in Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"and in SM Section ","element":"span"},{"href":"#id-41","text":"D.1","element":"a"},{"text":".","element":"span"}]]},{"heading":"3. Experimental results","paragraphs":[[{"text":"In this section we test the extent to which the behavior of our theoretical model describes the dynamics of deep networks in practical settings. The theoretical results of Section ","element":"span"},{"text":"2","element":"span"},{"text":", describing distinct learning rate phases, are not guaranteed to hold beyond the model analyzed there. We treat these results as predictions to be tested empirically, including the values ","element":"span"},{"style":{"height":14.4},"width":136.35,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-29.png","element":"img","alt":" ηcrit and","inline":true},{"style":{"height":10.4},"width":78.88,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-30.png","element":"img","alt":"ηmax","inline":true,"padRight":true},{"text":"of the learning rates that separate the three phases.","element":"span"}],[{"text":"In a variety of deep learning settings, we find clear evidence of the different phases predicted by the model. The experiments all use MSE loss, sufficiently wide networks, and SGD","element":"span"},{"text":"2","element":"span"},{"text":". 
Parameters such as network architecture, choice of non-linearity, weight parameterization, and regularization do not significantly affect this conclusion.","element":"span"}],[{"text":"In terms of the learning rates that determine the location of the transitions, the only modification needed to obtain good agreement with experiment is to replace the theoretical maximum learning rate, ","element":"span"},{"style":{"height":16},"width":79.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-31.png","element":"img","alt":" 4/λ0","inline":true},{"text":", with a 1-parameter function ","element":"span"},{"style":{"height":16},"width":276.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-32.png","element":"img","alt":"ηmax = cact./λ0","inline":true},{"text":", where ","element":"span"},{"style":{"height":9.19},"width":68.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-33.png","element":"img","alt":" cact.","inline":true,"padRight":true},{"text":"is an architecture-dependent constant. We find that ","element":"span"},{"style":{"height":13.19},"width":176.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-34.png","element":"img","alt":" cact. ≈ 12","inline":true,"padRight":true},{"text":"for all networks that use the ReLU non-linearity, and this parameter appears to depend only weakly on other details of the architecture. 
We find the level of agreement with the experiments surprising, given that our theoretical model involves a shallow network without non-linearities.","element":"span"}],[{"text":"Building on the observed correlation between lower curvature and generalization performance (","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"Keskar et al.","element":"a"},{"href":"#id-7","referenceIndex":15,"text":", ","element":"a"},{"href":"#id-7","referenceIndex":15,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":14,"text":"Jiang ","element":"a"},{"href":"#id-9","referenceIndex":14,"text":"et al.","element":"a"},{"href":"#id-9","referenceIndex":14,"text":", ","element":"a"},{"href":"#id-9","referenceIndex":14,"text":"2020","element":"a"},{"text":"), we conjecture that optimal performance occurs in the large learning rate (catapult) phase, where the loss converges to a flatter minimum. For a fixed computational budget, we find that this conjecture holds in all cases we tried. Even when comparing models trained with different learning rates for a fixed amount of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"physical time ","element":"span"},{"style":{"height":14.79},"width":194.13,"height":36.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/5-35.png","element":"img","alt":" tphys = t · η","inline":true},{"text":", we find that the performance of models trained in the catapult phase either matches or exceeds that of models trained in the lazy phase.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1. Early time curvature dynamics","element":"span"}],[{"text":"Our theoretical model makes detailed predictions for the gradient descent evolution of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-0.png","element":"img","alt":" λ","inline":true},{"text":", the top eigenvalue of the NTK. 
Here we test these predictions against empirical results in a variety of deep learning models (see the Supplement for additional experimental results).","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-42","text":"3 ","element":"a"},{"text":"shows ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-1.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"during the early part of training for two deep learning settings. The results are compared against the theoretical predictions of a phase transition at ","element":"span"},{"style":{"height":16},"width":209.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-2.png","element":"img","alt":" ηcrit = 2/λ0","inline":true},{"text":", and a maximum learning rate of ","element":"span"},{"style":{"height":16},"width":79.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-3.png","element":"img","alt":" 4/λ0","inline":true},{"text":". Here ","element":"span"},{"style":{"height":13.19},"width":39.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-4.png","element":"img","alt":" λ0","inline":true,"padRight":true},{"text":"is the top eigenvalue of the empirical NTK at initialization.","element":"span"}],[{"text":"For learning rates ","element":"span"},{"style":{"height":12.4},"width":141.88,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-5.png","element":"img","alt":" η < ηcrit","inline":true},{"text":", we find that ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-6.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"is independent of the learning rate and constant throughout training, as expected in the lazy phase. 
For ","element":"span"},{"style":{"height":16},"width":287.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-7.png","element":"img","alt":" ηcrit < η < 4/λ0","inline":true,"padRight":true},{"text":"we find that ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-8.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"decreases during training to below ","element":"span"},{"style":{"height":16},"width":59.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-9.png","element":"img","alt":" 2/η","inline":true},{"text":", matching the predicted behavior in the catapult phase (note that in the Wide ResNet example, ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-10.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"initially increases before reaching its stable value).","element":"span"}],[{"text":"The large learning rate behavior predicted by the model appears to persist up to the maximum learning rate, which is larger in these experiments than in the theoretical model. In these and other experiments involving ReLU networks, we find that ","element":"span"},{"style":{"height":16},"width":241.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-11.png","element":"img","alt":"ηmax ≈ 12/λ0","inline":true,"padRight":true},{"text":"is a good predictor of the maximum learning rate (in the SM ","element":"span"},{"href":"#id-43","text":"C.4 ","element":"a"},{"text":"we discuss other nonlinearities). 
We conjecture that this is the typical maximum learning rate of networks with ReLU non-linearities.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-42","text":"3 ","element":"a"},{"text":"also shows the loss initially increasing before converging in the catapult phase, confirming another prediction of the model. This transient behavior is very short, taking less than 10 steps to complete.","element":"span"}],[{"id":"id-42","style":{"width":"85%"},"width":2091,"height":1146,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-12.png","element":"img"}],[{"id":"id-54","style":{"fontStyle":"italic"},"text":"Figure 3. ","element":"figcaption","subtype":"caption"},{"text":"Early time dynamics. (a,b,c) A 3 hidden layer fully-connected network with ReLU non-linearity trained on MNIST (","element":"figcaption","subtype":"caption"},{"style":{"height":13.6},"width":200.28,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-13.png","element":"img","alt":"ηcrit = 6.25).","inline":true,"padRight":true},{"text":"(d,e,f) Wide ResNet 28-10 trained on CIFAR-10 (","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":182.28,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-14.png","element":"img","alt":"ηcrit = 0.18","inline":true},{"text":"). Both networks are trained with vanilla SGD; for more experimental details see SM Section ","element":"figcaption","subtype":"caption"},{"text":"A","element":"span","subtype":"caption"},{"text":". (a,d) Early time dynamics of the training loss for learning rates in the linear and catapult phases. (b,e) Early time dynamics of the curvature for learning rates in the linear and catapult phase. 
(c,f) ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":32.51,"height":28.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-15.png","element":"img","alt":" λt","inline":true,"padRight":true},{"text":"measured at ","element":"figcaption","subtype":"caption"},{"style":{"height":13.6},"width":163.25,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-16.png","element":"img","alt":" t · η = 250","inline":true,"padRight":true},{"text":"(for FC) and ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":144.82,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/6-17.png","element":"img","alt":" t · η = 30","inline":true,"padRight":true},{"text":"(for WRN), as a function of learning rate, compared with theoretical predictions for the locations of phase transitions. Training diverges for learning rates in the shaded region.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"3.2. Generalization performance","element":"span"}],[{"text":"We now consider the performance of trained models in the different phases discussed in this work. ","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"Keskar et al. ","element":"a"},{"href":"#id-7","referenceIndex":15,"text":"(","element":"a"},{"href":"#id-7","referenceIndex":15,"text":"2016","element":"a"},{"text":") observed a correlation between the flatness of a minimum found by SGD and the generalization performance (see ","element":"span"},{"href":"#id-9","referenceIndex":14,"text":"Jiang et al. ","element":"a"},{"href":"#id-9","referenceIndex":14,"text":"(","element":"a"},{"href":"#id-9","referenceIndex":14,"text":"2020","element":"a"},{"text":") for additional empirical confirmation of this correlation). In this work, we showed that the minima SGD finds are flatter in the catapult phase, as measured by the top kernel eigenvalue. 
Our measure of flatness differs from that of ","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"Keskar ","element":"a"},{"href":"#id-7","referenceIndex":15,"text":"et al. ","element":"a"},{"href":"#id-7","referenceIndex":15,"text":"(","element":"a"},{"href":"#id-7","referenceIndex":15,"text":"2016","element":"a"},{"text":"), but we expect that these measures are correlated.","element":"span"}],[{"text":"We therefore conjecture that optimal performance is often obtained for learning rates above ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-0.png","element":"img","alt":" ηcrit","inline":true,"padRight":true},{"text":"and below the maximum learning rate.","element":"span"}],[{"text":"In this section we test this conjecture empirically. We find that performance in the large learning rate range always matches or exceeds the performance when ","element":"span"},{"style":{"height":12.4},"width":141.76,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-1.png","element":"img","alt":" η < ηcrit","inline":true},{"text":". For a fixed compute budget, we find that the best performance is always found in the catapult phase.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-44","text":"4 ","element":"a"},{"text":"shows the accuracy as a function of the learning rate for a fully-connected ReLU network trained on a subset of MNIST. 
We find that the optimal performance is achieved above ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-2.png","element":"img","alt":" ηcrit","inline":true,"padRight":true},{"text":"and close to ","element":"span"},{"style":{"height":16},"width":232.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-3.png","element":"img","alt":" ηmax = 12/λ0","inline":true},{"text":", the expected maximum learning rate.","element":"span"}],[{"id":"id-44","style":{"width":"44%"},"width":870,"height":657,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 4. ","element":"figcaption","subtype":"caption"},{"text":"Final accuracy versus learning rate for a fully-connected 1 hidden layer ReLU network, trained on 512 samples of MNIST with full-batch gradient descent until training accuracy reaches 1 or 700k physical steps (see SM Section ","element":"figcaption","subtype":"caption"},{"text":"A ","element":"span","subtype":"caption"},{"text":"for details). We used a subset of samples to accentuate the performance difference between phases. The optimal performance is obtained when the learning rate is above ","element":"figcaption","subtype":"caption"},{"style":{"height":9.6},"width":63.41,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-5.png","element":"img","alt":"ηcrit","inline":true},{"text":", and close to ","element":"figcaption","subtype":"caption"},{"style":{"height":9.6},"width":83.31,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-6.png","element":"img","alt":" ηmax.","inline":true}],[{"text":"Next, Figure ","element":"span"},{"href":"#id-45","text":"5 ","element":"a"},{"text":"shows the performance of a convolutional network and a Wide ResNet (WRN) trained on CIFAR-10. 
The experimental setup, which we now describe, was chosen to ensure a fair comparison of the performance across different learning rates. The network is trained with different initial learning rates, followed by a decay at a fixed physical time ","element":"span"},{"style":{"height":13.6},"width":102.18,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-7.png","element":"img","alt":" t · η to","inline":true,"padRight":true},{"text":"the same final learning rate. This schedule is introduced in order to ensure that all experiments have the same level of SGD noise toward the end of training.","element":"span"}],[{"text":"We present results using two different stopping conditions. In Figure ","element":"span"},{"href":"#id-45","text":"5a","element":"a"},{"text":", ","element":"span"},{"href":"#id-46","text":"5c","element":"a"},{"text":", all models were trained for a fixed number of training steps. We find a significant performance gap between small and large learning rates, with the optimal learning rate above ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-8.png","element":"img","alt":" ηcrit","inline":true,"padRight":true},{"text":"and close to ","element":"span"},{"style":{"height":10.4},"width":78.88,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-9.png","element":"img","alt":" ηmax","inline":true},{"text":". Beyond this learning rate, performance drops sharply.","element":"span"}],[{"text":"The fixed compute stopping condition, while of practical interest, biases the results in favor of large learning rates. Indeed, in the limit of small learning rate, training for a fixed number of steps will keep the model close to initialization. 
To control for this, in Figure ","element":"span"},{"href":"#id-45","text":"5b","element":"a"},{"text":",","element":"span"},{"href":"#id-46","text":"5d ","element":"a"},{"text":"models were trained for the same amount of physical time ","element":"span"},{"style":{"height":13.6},"width":63,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-10.png","element":"img","alt":" t · η","inline":true},{"text":". For the CNN of figure ","element":"span"},{"href":"#id-45","text":"5b","element":"a"},{"text":", decaying the learning rate does not have a significant effect on performance and we observe that performance is flat up to ","element":"span"},{"style":{"height":14.8},"width":158.08,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-11.png","element":"img","alt":" ηmax, and","inline":true,"padRight":true},{"text":"there is no correlation between our measure of curvature and generalization performance. Figure ","element":"span"},{"href":"#id-46","text":"5d ","element":"a"},{"text":"shows the analogous experiment for WRN. When decaying the learning rate toward the end of training to control for SGD noise, we find that optimal performance is achieved above ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-12.png","element":"img","alt":" ηcrit","inline":true},{"text":". In all these cases, ","element":"span"},{"style":{"height":10.4},"width":78.88,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/7-13.png","element":"img","alt":" ηmax","inline":true,"padRight":true},{"text":"is a good predictor of the maximal learning rate, despite significant differences in the architectures. 
Notice that by tuning the learning rate to the catapult phase, we are able to achieve performance using MSE loss, and without momentum, that is competitive with the best reported results for this","element":"span"}],[{"id":"id-45","style":{"width":"93%"},"width":1826,"height":1438,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-0.png","element":"img"}],[{"id":"id-46","style":{"fontStyle":"italic"},"text":"Figure 5. ","element":"figcaption","subtype":"caption"},{"text":"Test accuracy vs learning rate for (a,b) a CNN trained on CIFAR-10 using SGD with batch size 256 and ","element":"figcaption","subtype":"caption"},{"style":{"height":11.59},"width":40.08,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-1.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization (","element":"figcaption","subtype":"caption"},{"style":{"height":15.69},"width":188.93,"height":39.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-2.png","element":"img","alt":"ηcrit ≈ 10−4","inline":true},{"text":") and (c,d) WRN28-10 trained on CIFAR-10 using SGD with batch size 1024, ","element":"figcaption","subtype":"caption"},{"style":{"height":11.59},"width":40.08,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-3.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization, and data augmentation (","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":352.75,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-4.png","element":"img","alt":"ηcrit ≈ 0.14); see SM A","inline":true,"padRight":true},{"text":"for details. (a,c) have a fixed compute budget: (a) 437k steps and (c) 12k steps. 
(b,d) have been evolved for a fixed amount of physical time: (b) was evolved for 475","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":37.44,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-5.png","element":"img","alt":"/η","inline":true,"padRight":true},{"text":"steps (purple) and evolved for 50k more steps at learning rate ","element":"figcaption","subtype":"caption"},{"style":{"height":15.29},"width":256.48,"height":38.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-6.png","element":"img","alt":" 2 · 10−5 (red) and","inline":true,"padRight":true},{"text":"(d) was evolved for ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":111.15,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-7.png","element":"img","alt":" 3360/η","inline":true,"padRight":true},{"text":"steps with learning rate ","element":"figcaption","subtype":"caption"},{"style":{"height":9.6},"width":19,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-8.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"(purple) and then evolved for 4800 more steps at learning rate ","element":"figcaption","subtype":"caption"},{"text":"0.035 ","element":"figcaption","subtype":"caption"},{"text":"(red). 
In all cases, optimal performance is achieved above ","element":"figcaption","subtype":"caption"},{"style":{"height":9.6},"width":63.41,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-9.png","element":"img","alt":" ηcrit","inline":true,"padRight":true},{"text":"and close to the expected maximum learning rate, in agreement with our predictions.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"33%"},"width":651,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-10.png","element":"img"}],[{"text":"In SM ","element":"span"},{"href":"#id-47","text":"B.1","element":"a"},{"text":", we present additional results for WRN on CIFAR-100, with similar conclusions as those for WRN on CIFAR-10.","element":"span"}]]},{"heading":"4. Additional properties of the model","paragraphs":[[{"text":"So far we have focused on the generalization performance and curvature of the large learning rate phase. Here we investigate additional predictions made by our model.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1. Restoration of linear dynamics","element":"span"}],[{"text":"One striking prediction of the model is that after a period of excursion, the logit differences settle back to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"values, the NTK stops changing, and evolution is again well approximated by a linear model with constant kernel at large width.","element":"span"}],[{"text":"We speculate that the return to linearity and constancy of the kernel may hold asymptotically in width for more general models for a range of learning rates above ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/8-11.png","element":"img","alt":" ηcrit","inline":true},{"text":". 
We test this by evolving the model for order ","element":"span"},{"text":"log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") ","element":"span"},{"text":"steps until the catapult effect is over, linearizing the model, and comparing the evolution of the two models beyond this point. Figure ","element":"span"},{"href":"#id-48","text":"6 ","element":"a"},{"text":"shows an ","element":"span"},{"text":"example of this. At fixed width, the accuracies of the linear and non-linear networks match for a range of learning rates above the transition up to ","element":"span"},{"style":{"height":16},"width":79.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-0.png","element":"img","alt":" 4/λ0","inline":true},{"text":". We present additional evidence for this asymptotic linearization behavior in the Supplement.","element":"span"}],[{"id":"id-48","style":{"width":"48%"},"width":940,"height":733,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 6. ","element":"figcaption","subtype":"caption"},{"text":"Evidence for linear dynamics after the catapult effect is over. Here we show the same model as in Figure ","element":"figcaption","subtype":"caption"},{"href":"#id-44","text":"4 ","element":"a","subtype":"caption"},{"text":"with the addition of one model linearized at step ","element":"figcaption","subtype":"caption"},{"text":"0 ","element":"figcaption","subtype":"caption"},{"text":"and another linearized at step ","element":"figcaption","subtype":"caption"},{"text":"10","element":"figcaption","subtype":"caption"},{"text":". 
We observe that the model linearized after ","element":"figcaption","subtype":"caption"},{"text":"10 ","element":"figcaption","subtype":"caption"},{"text":"steps tracks the non-linear performance in the catapult phase up to ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":152.81,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-2.png","element":"img","alt":" η ≈ 4/λ0.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"4.2. Non-perturbative phase transition","element":"span"}],[{"text":"The large width analysis of the small learning rate phase has been the subject of much work. In this phase, at infinite width, the network map evolves as a linear random features model, ","element":"span"},{"style":{"height":22.53},"width":530.86,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-3.png","element":"img","alt":" f (0)t+1 = f (0)t − Θf (0)t , where f (0) ","inline":true,"padRight":true},{"text":"is the function of the linearized model. 
At large but finite width, corrections to this linear evolution can be systematically incorporated via a perturbative expansion (Taylor expansion) around infinite width (","element":"span"},{"href":"#id-30","referenceIndex":8,"text":"Dyer & Gur-Ari","element":"a"},{"href":"#id-30","referenceIndex":8,"text":", ","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-31","referenceIndex":11,"text":"Huang & Yau","element":"a"},{"href":"#id-31","referenceIndex":11,"text":", ","element":"a"},{"href":"#id-31","referenceIndex":11,"text":"2019","element":"a"},{"text":").","element":"span"}],[{"id":"id-51","style":{"width":"60%"},"width":1179,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-4.png","element":"img"}],[{"text":"The evolution equations (","element":"span"},{"href":"#id-49","text":"10","element":"a"},{"text":") and (","element":"span"},{"href":"#id-50","text":"11","element":"a"},{"text":") of the solvable model are an example of this. At large width and in the small learning rate phase, the ","element":"span"},{"style":{"height":17.38},"width":129.71,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-5.png","element":"img","alt":" O(n−1)","inline":true,"padRight":true},{"text":"terms are suppressed for all times. In contrast, the leading order dynamics of ","element":"span"},{"style":{"height":20.6},"width":64.14,"height":51.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-6.png","element":"img","alt":" f (0)t","inline":true,"padRight":true},{"text":"diverge when ","element":"span"},{"style":{"height":12.4},"width":141.76,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-7.png","element":"img","alt":"η > ηcrit","inline":true},{"text":", and so the true evolution cannot be described by the linear model. 
Indeed, the logits grow to ","element":"span"},{"style":{"height":18.18},"width":138.34,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-8.png","element":"img","alt":" O(n1/2)","inline":true,"padRight":true},{"text":"and thus all terms in (","element":"span"},{"href":"#id-49","text":"10","element":"a"},{"text":") and (","element":"span"},{"href":"#id-50","text":"11","element":"a"},{"text":") are of the same order. Similarly, the growth observed empirically in the catapult phase for more general models cannot be described by truncating the series (","element":"span"},{"href":"#id-51","text":"12","element":"a"},{"text":") at any order, because the terms all become comparable.","element":"span"}]]},{"heading":"5. Discussion","paragraphs":[[{"text":"In this work we took a step toward understanding the role of large learning rates in deep learning. We presented a dynamical mechanism that allows deep networks to be trained at larger learning rates than those accessible to their linear counterparts. For MSE loss, linear model training diverges when the learning rate is above the critical value ","element":"span"},{"style":{"height":16},"width":463.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/9-9.png","element":"img","alt":" ηcrit = 2/λ0, where λ0 is the","inline":true,"padRight":true},{"text":"curvature at initialization. We showed that deep networks can train for larger learning rates by navigating to an area of the landscape that has sufficiently low curvature. Perhaps counterintuitively, training in this regime involves an initial period during which the loss increases before converging to its final, small value. We call this the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"catapult effect","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.1. 
A tractable model illustrating catapult dynamics","element":"span"}],[{"text":"These observations are made concrete in our theoretical model, where we fully analyze the gradient descent dynamics as a function of the learning rate. The analysis involves a modified large width limit, in which both the width and training time are taken to be large. Sweeping the learning rate from small to large, and working in the limit, we find sharp transitions from a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"lazy phase ","element":"span"},{"text":"where linearized model training is stable, to a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"catapult phase ","element":"span"},{"text":"in which only the full model converges, and finally to a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"divergent phase ","element":"span"},{"text":"in which training is unstable. These transitions have the hallmarks of phase transitions that commonly appear in physical systems such as ferromagnets or water, as one changes parameters such as temperature. In ","element":"span"},{"text":"particular, these transitions are non-perturbative: a Taylor series expansion of the linearized model that takes into account finite width corrections is not sufficient to describe the behavior beyond the critical learning rate.","element":"span"}],[{"text":"We derive the learning rates at which these transitions occur as a function of the curvature at initialization. We then treat these theoretical results as predictions, to be tested beyond the regime where they are guaranteed to hold, and find good quantitative agreement with empirical results across a variety of realistic deep learning settings.","element":"span"}],[{"text":"We find it striking that a relatively simple theoretical model can correctly predict the behavior of realistic deep learning models. 
In particular, we conjecture that the maximum learning rate is typically a simple function of the curvature at initialization, with a single parameter ","element":"span"},{"style":{"height":9.19},"width":68.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-0.png","element":"img","alt":" cact.","inline":true,"padRight":true},{"text":"that seems to depend only on the non-linearity. For ReLU networks, we conjecture that the maximum learning rate is approximately ","element":"span"},{"style":{"height":16},"width":99.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-1.png","element":"img","alt":" 12/λ0","inline":true},{"text":", which we confirm in many cases.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.2. Reducing misalignment of activations and gradients","element":"span"}],[{"text":"The catapult dynamics for the simplified model in Section ","element":"span"},{"href":"#id-35","text":"2.1 ","element":"a"},{"text":"reduce curvature by shrinking the component of the first layer weights ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"which is orthogonal to the second layer weights ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":", and shrinking the component of the second layer weights ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v ","element":"span"},{"text":"which is orthogonal to the first layer weights ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":". 
We can rewrite the simplified model in terms of a hidden layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ux","element":"span"},{"text":", where ","element":"span"},{"style":{"height":18.18},"width":300.27,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-2.png","element":"img","alt":" f(x) = n^(−1/2) v⊤h","inline":true},{"text":". The gradient with respect to this hidden layer is ","element":"span"},{"style":{"height":19.77},"width":294,"height":49.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-3.png","element":"img","alt":"∂L/∂h = n^(−1/2) f(x) v","inline":true},{"text":". These hidden layer gradients ","element":"span"},{"style":{"height":8},"width":37.72,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-4.png","element":"img","alt":"∂L/∂h","inline":true,"padRight":true},{"text":"thus point in the same direction as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":", while the hidden activations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"point in the same direction as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":". An alternative ","element":"span"},{"text":"interpretation of the catapult dynamics is then that they reduce the components of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":19.77},"width":40.72,"height":49.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-5.png","element":"img","alt":"∂L/∂h","inline":true,"padRight":true},{"text":"which are orthogonal to ","element":"span"},{"text":"each other. 
The catapult dynamics thus serve, in this simplified model, to reduce the misalignment between feedforward activations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"and backpropagated gradients ","element":"span"},{"style":{"height":19.77},"width":40.72,"height":49.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-6.png","element":"img","alt":"∂L/∂h","inline":true,"padRight":true},{"text":". We hypothesize that this reduction of misalignment between activations ","element":"span"},{"text":"and gradients may be a feature of large learning rates and catapult dynamics in deep, as well as shallow, networks. We further hypothesize that it may play a directly beneficial role in generalization, for instance by making the model output less sensitive to orthogonal, out-of-distribution perturbations of activations.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.3. Catapult dynamics often improve generalization","element":"span"}],[{"text":"Our results shed light on the regularizing effect of training at large learning rates. The effect presented here is independent of the regularizing effect of stochastic gradient noise, which has been studied extensively. Building on previous works, we noted the observed correlation between flatness and generalization performance. Based on these observations, we expect the optimal performance to often occur for learning rates larger than ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-7.png","element":"img","alt":" ηcrit","inline":true},{"text":", where the linearized model is unstable. Observing this effect required controlling for several confounding factors that affect the comparison of performance between different learning rates. 
Under a fair comparison, and also for a fixed compute budget, we find that this expectation holds in practice.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.4. Beyond infinite linear models","element":"span"}],[{"text":"One outcome of our work is to address the performance gap between ordinary neural networks, and linear models inspired by the theory of wide networks. Optimal performance is often obtained at large learning rates which are inaccessible to linearized models. In such cases, we expect the performance gap to persist even at arbitrarily large widths. We hope our work can further improve the understanding of deep learning methods.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.5. Other open questions","element":"span"}],[{"text":"There are several remaining open questions. While the model predicts a maximum learning rate of ","element":"span"},{"style":{"height":16},"width":79.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/10-8.png","element":"img","alt":" 4/λ0","inline":true},{"text":", for models with ReLU activations we find that the maximum learning rate is consistently higher. This may be due to a separate dynamical curvature-reduction mechanism that relies on ReLU. In addition, we do not explore the degree to which our results extend to softmax classification. While we expect qualitatively similar behavior there, the non-constant Hessian of the softmax cross entropy makes controlled experiments more challenging. Similarly, behavior for other optimizers such as SGD with momentum may differ. For example, the maximum learning rate when training a linear model is larger for gradient descent with momentum than for vanilla gradient descent, and therefore the transition to the catapult phase (if it exists) will occur at a higher learning rate. 
We leave these questions to future work.","element":"span"}]]},{"heading":"Acknowledgements","paragraphs":[[{"text":"The authors would like to thank Kyle Aitken, Dar Gilboa, Justin Gilmer, Boris Hanin, Tengyu Ma, Andrea Montanari, and Behnam Neyshabur for useful discussions. We would also like to thank Jaehoon Lee for early discussions about empirical properties of the lazy phase.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-22","text":"Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In Chaudhuri, K. ","element":"span"},{"text":"and Salakhutdinov, R. (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 36th International Conference on Machine Learning","element":"span"},{"text":", volume 97 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pp. 242–252, Long Beach, California, USA, 09–15 Jun 2019. PMLR.","element":"span"}],[{"id":"id-34","text":"Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural ","element":"span"},{"text":"net. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pp. 8139–8148, 2019.","element":"span"}],[{"id":"id-52","text":"Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. JAX: composable ","element":"span"},{"text":"transformations of Python+NumPy programs, 2018. URL ","element":"span"},{"href":"http://github.com/google/jax","text":"http://github.com/google/jax","element":"a"},{"text":".","element":"span"}],[{"id":"id-24","text":"Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. 
In Wallach, H., Larochelle, ","element":"span"},{"text":"H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 32","element":"span"},{"text":", pp. 2933–2943. Curran Associates, Inc., 2019. ","element":"span"},{"text":"URL ","element":"span"},{"href":"http://papers.nips.cc/paper/8559-on-lazy-training-in-differentiable-programming.pdf","text":"http://papers.nips.cc/paper/8559-on-lazy-training-in-differentiable-programming.pdf","element":"a"},{"text":".","element":"span"}],[{"id":"id-17","text":"Daniely, A. SGD learns the conjugate kernel class of the network. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pp. 2422–2430, 2017.","element":"span"}],[{"id":"id-11","text":"Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning-Volume 70","element":"span"},{"text":", pp. 1019–1028. JMLR.org, 2017.","element":"span"}],[{"id":"id-20","text":"Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA","element":"span"},{"text":", pp. 1675–1685, 2019. URL ","element":"span"},{"href":"http://proceedings.mlr.press/v97/du19c.html","text":"http://proceedings.mlr.press/v97/du19c.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-30","text":"Dyer, E. 
and Gur-Ari, G. Asymptotics of wide networks from Feynman diagrams. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2020. URL ","element":"span"},{"href":"https://openreview.net/forum?id=S1gFvANKDS","text":"https://openreview.net/forum?id=S1gFvANKDS","element":"a"},{"text":".","element":"span"}],[{"id":"id-4","text":"Frankle, J., Schwab, D. J., and Morcos, A. S. The early phase of neural network training. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2002.10365","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-6","text":"Hochreiter, S. and Schmidhuber, J. Flat minima. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Computation","element":"span"},{"text":", 9(1):1–42, 1997.","element":"span"}],[{"id":"id-31","text":"Huang, J. and Yau, H.-T. Dynamics of Deep Neural Networks and Neural Tangent Hierarchy. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv e-prints","element":"span"},{"text":", art. arXiv:1909.08156, Sep 2019.","element":"span"}],[{"id":"id-18","text":"Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Bengio, ","element":"span"},{"text":"S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 31","element":"span"},{"text":", pp. 8571–8580. Curran Associates, Inc., 2018.","element":"span"}],[{"id":"id-5","text":"Jastrzebski, S., Szymczak, M., Fort, S., Arpit, D., Tabor, J., Cho, K., and Geras, K. The break-even point on optimization ","element":"span"},{"text":"trajectories of deep neural networks. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2002.09572","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-9","text":"Jiang, Y., Neyshabur, B., Krishnan, D., Mobahi, H., and Bengio, S. Fantastic generalization measures and where to find ","element":"span"},{"text":"them. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2020. URL ","element":"span"},{"href":"https://openreview.net/forum?id=SJgIPJBFvH","text":"https://openreview.net/forum?id=SJgIPJBFvH","element":"a"},{"text":".","element":"span"}],[{"id":"id-7","text":"Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: ","element":"span"},{"text":"Generalization gap and sharp minima. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1609.04836, 2016. URL ","element":"span"},{"href":"http://arxiv.org/abs/1609.04836","text":"http://arxiv.org/abs/1609.04836","element":"a"},{"text":".","element":"span"}],[{"id":"id-2","text":"Leclerc, G. and Madry, A. The two regimes of deep network training, 2020.","element":"span"}],[{"id":"id-32","text":"Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian ","element":"span"},{"text":"processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018. 
URL ","element":"span"},{"href":"https://openreview.net/forum?id=B1EA-M-0Z","text":"https://openreview.net/forum?id=B1EA-M-0Z","element":"a"},{"text":".","element":"span"}],[{"id":"id-19","text":"Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any ","element":"span"},{"text":"depth evolve as linear models under gradient descent. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 32","element":"span"},{"text":", pp. 8570–8581. Curran Associates, Inc., 2019. URL ","element":"span"},{"href":"http://papers.nips.cc/paper/9063-wide-neural-networks-of-any-depth-evolve-as-linear-models-under-gradient-descent.pdf","text":"http://papers.nips.cc/paper/9063-wide-neural-networks-of-any-depth-evolve-as-linear-models-under-gradient-descent.pdf","element":"a"},{"text":".","element":"span"}],[{"id":"id-23","text":"Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pp. 8157–8166, 2018.","element":"span"}],[{"id":"id-1","text":"Li, Y., Wei, C., and Ma, T. Towards explaining the regularization effect of initial large learning rate in training neural ","element":"span"},{"text":"networks. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. 
(eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 32","element":"span"},{"text":", pp. 11669–11680. Curran Associates, Inc., 2019.","element":"span"}],[{"id":"id-12","text":"Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Journal of Machine Learning Research","element":"span"},{"text":", 18(1):4873–4907, 2017.","element":"span"}],[{"id":"id-68","text":"May, R. M. Simple mathematical models with very complicated dynamics. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nature","element":"span"},{"text":", 261(5560):459–467, 1976.","element":"span"}],[{"id":"id-25","text":"Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the National Academy of Sciences","element":"span"},{"text":", 115(33):E7665–E7671, 2018. doi: 10.1073/pnas.1806579115.","element":"span"}],[{"id":"id-29","text":"Naveh, Ben-David, Sompolinsky, and Ringel. to be published.","element":"span"}],[{"id":"id-33","text":"Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian ","element":"span"},{"text":"deep convolutional networks with many channels are Gaussian processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2019. URL ","element":"span"},{"href":"https://openreview.net/forum?id=B1g30j0qF7","text":"https://openreview.net/forum?id=B1g30j0qF7","element":"a"},{"text":".","element":"span"}],[{"id":"id-53","text":"Novak, R., Xiao, L., Hron, J., Lee, J., Alemi, A. A., Sohl-Dickstein, J., and Schoenholz, S. S. Neural tangents: Fast ","element":"span"},{"text":"and easy infinite neural networks in Python. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2020. URL ","element":"span"},{"href":"https://github.com/google/neural-tangents","text":"https://github.com/google/neural-tangents","element":"a"},{"text":".","element":"span"}],[{"id":"id-10","text":"Park, D. S., Sohl-Dickstein, J., Le, Q. V., and Smith, S. L. The effect of network width on stochastic gradient descent and ","element":"span"},{"text":"generalization: an empirical study. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1905.03776, 2019. URL ","element":"span"},{"href":"http://arxiv.org/abs/1905.03776","text":"http://arxiv.org/abs/1905.03776","element":"a"},{"text":".","element":"span"}],[{"id":"id-26","text":"Rotskoff, G. and Vanden-Eijnden, E. Parameters as interacting particles: long time convergence and asymptotic error scaling ","element":"span"},{"text":"of neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pp. 7146–7155, 2018.","element":"span"}],[{"id":"id-27","text":"Sirignano, J. and Spiliopoulos, K. Mean field analysis of neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1805.01053","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-8","text":"Smith, S. L. and Le, Q. V. A Bayesian perspective on generalization and stochastic gradient descent. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018. URL ","element":"span"},{"href":"https://openreview.net/forum?id=BJij4yg0Z","text":"https://openreview.net/forum?id=BJij4yg0Z","element":"a"},{"text":".","element":"span"}],[{"id":"id-13","text":"Smith, S. L., Kindermans, P.-J., Ying, C., and Le, Q. V. 
Don’t Decay the Learning Rate, Increase the Batch Size. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv e-prints","element":"span"},{"text":", art. arXiv:1711.00489, Nov 2017.","element":"span"}],[{"id":"id-14","text":"Smith, S. L., Duckworth, D., Rezchikov, S., Le, Q. V., and Sohl-Dickstein, J. Stochastic natural gradient descent draws ","element":"span"},{"text":"posterior samples in function space. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1806.09597","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-28","text":"Woodworth, B., Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Kernel and deep regimes in overparametrized models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.05827","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-36","text":"Xiao, L., Pennington, J., and Schoenholz, S. S. Disentangling trainability and generalization in deep learning, 2019.","element":"span"}],[{"id":"id-3","text":"Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent escapes from ","element":"span"},{"text":"sharp minima exponentially fast. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2002.03495","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-60","text":"Zagoruyko, S. and Komodakis, N. Wide residual networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1605.07146, 2016. URL ","element":"span"},{"href":"http://arxiv.org/abs/1605.07146","text":"http://arxiv.org/abs/1605.07146","element":"a"},{"text":".","element":"span"}],[{"id":"id-21","text":"Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1811.08888","element":"span"},{"text":", 2018.","element":"span"}]]},{"heading":"Supplementary materials A. Experimental details","paragraphs":[[{"text":"We are using JAX (","element":"span"},{"href":"#id-52","referenceIndex":3,"text":"Bradbury et al.","element":"a"},{"href":"#id-52","referenceIndex":3,"text":", ","element":"a"},{"href":"#id-52","referenceIndex":3,"text":"2018","element":"a"},{"text":") and the Neural Tangents Library for our experiments (","element":"span"},{"href":"#id-53","referenceIndex":26,"text":"Novak et al.","element":"a"},{"href":"#id-53","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-53","referenceIndex":26,"text":"2020","element":"a"},{"text":").","element":"span"}],[{"text":"All the models have been trained with Mean Squared Error normalized as ","element":"span"},{"style":{"height":22.17},"width":736.49,"height":55.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-0.png","element":"img","alt":"L(\\{x, y\\}_B) = \\frac{1}{2k|B|} \\sum_{(x,y) \\in B,\\, i} (f^i(x) - y^i)^2","inline":true},{"text":", ","element":"span"},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"is the number of classes and ","element":"span"},{"style":{"height":16.18},"width":31.97,"height":40.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-1.png","element":"img","alt":"y^i","inline":true,"padRight":true},{"text":"are one-hot targets.","element":"span"}],[{"text":"In a similar way, we have normalized the NTK as ","element":"span"},{"style":{"height":22.17},"width":652.08,"height":55.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-2.png","element":"img","alt":"\\Theta^{ij}(x, x') = \\frac{1}{k|B|} \\sum_\\alpha \\partial_\\alpha f^i(x) \\, \\partial_\\alpha f^j(x')","inline":true,"padRight":true},{"text":"so that the eigenvalues of the 
","element":"span"},{"text":"NTK are the same as the non-zero eigenvalues of the Fisher information: 
","element":"span"},{"style":{"height":22.17},"width":488.02,"height":55.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-3.png","element":"img","alt":"\\frac{1}{k|B|} \\sum_{x \\in B,\\, i} \\partial_\\alpha f^i(x) \\, \\partial_\\beta f^i(x).","inline":true}],[{"text":"In our experiments we measure the top eigenvalue of the NTK using Lanczos’ algorithm. We construct the NTK on a small batch of data, typically several hundred samples, compute the top eigenvalue, and then average over batches. In this work, we do not focus on precision aspects such as fluctuations in the top eigenvalue across batches.","element":"span"}],[{"text":"All experiments that compare different learning rates use the same seed for the weights at initialization, and we consider only one such initialization (unless otherwise stated), although we have not seen much variance in the phenomena described. We let ","element":"span"},{"style":{"height":10},"width":103.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-4.png","element":"img","alt":" σw, σb","inline":true,"padRight":true},{"text":"denote the constant (width-independent) coefficient of the standard deviation of the weight and bias initializations, respectively.","element":"span"}],[{"text":"Here we describe the experimental settings specific to each figure.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Figure ","element":"span"},{"href":"#id-42","style":{"fontWeight":"bold"},"text":"3a","element":"a"},{"href":"#id-42","style":{"fontWeight":"bold"},"text":",","element":"a"},{"href":"#id-42","style":{"fontWeight":"bold"},"text":"3b","element":"a"},{"href":"#id-42","style":{"fontWeight":"bold"},"text":",","element":"a"},{"href":"#id-42","style":{"fontWeight":"bold"},"text":"3c","element":"a"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"text":"Fully connected, three hidden layers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"= 2048","element":"span"},{"text":", ReLU non-linearity trained using SGD (no momentum) on MNIST. Batch size","element":"span"},{"text":"= 512","element":"span"},{"text":", using NTK normalization, ","element":"span"},{"style":{"height":17.19},"width":294.38,"height":42.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-5.png","element":"img","alt":" σw =√2, σb = 0.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Figures ","element":"span"},{"href":"#id-54","style":{"fontWeight":"bold"},"text":"3d","element":"a"},{"href":"#id-54","style":{"fontWeight":"bold"},"text":",","element":"a"},{"href":"#id-54","style":{"fontWeight":"bold"},"text":"3e","element":"a"},{"href":"#id-54","style":{"fontWeight":"bold"},"text":",","element":"a"},{"href":"#id-54","style":{"fontWeight":"bold"},"text":"3f","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"Wide ResNet 28-18 trained on CIFAR10 with SGD (no momentum). Batch size of ","element":"span"},{"text":"128","element":"span"},{"text":", LeCun initialization with ","element":"span"},{"style":{"height":17.19},"width":432.37,"height":42.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-6.png","element":"img","alt":" σw =√2, σb = 0, L2 = 0.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Figures ","element":"span"},{"href":"#id-44","style":{"fontWeight":"bold"},"text":"4","element":"a"},{"style":{"fontWeight":"bold"},"text":",","element":"span"},{"href":"#id-48","style":{"fontWeight":"bold"},"text":"6 ","element":"a"},{"text":"Fully connected network with one hidden layer and ReLU non-linearity trained on 512 samples of MNIST with SGD (no momentum). 
Batch size of ","element":"span"},{"text":"512","element":"span"},{"text":", NTK initialization with ","element":"span"},{"style":{"height":17.19},"width":294.38,"height":42.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-7.png","element":"img","alt":" σw =√2, σb = 0.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Figures ","element":"span"},{"href":"#id-45","style":{"fontWeight":"bold"},"text":"5a","element":"a"},{"style":{"fontWeight":"bold"},"text":",","element":"span"},{"href":"#id-45","style":{"fontWeight":"bold"},"text":"5b","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"The convolutional network has the following architecture: Conv","element":"span"},{"style":{"height":16},"width":609.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-8.png","element":"img","alt":"1(320) → ReLU → Conv2(320) →","inline":true,"padRight":true},{"text":"ReLU ","element":"span"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-9.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"MaxPool((2,2), ’VALID’) ","element":"span"},{"style":{"height":16},"width":804.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-10.png","element":"img","alt":" → Conv1(320) → ReLU → Conv2(128) →","inline":true,"padRight":true},{"text":"MaxPool((2,2), ’VALID’) ","element":"span"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-11.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"Flatten() ","element":"span"},{"style":{"height":16},"width":673.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-12.png","element":"img","alt":" → Dense(256) → ReLU → Dense(10)","inline":true},{"text":". 
Dense","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") ","element":"span"},{"text":"denotes a fully-connected layer with output dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":". Conv","element":"span"},{"style":{"height":16},"width":248.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-13.png","element":"img","alt":"1(n), Conv2(n)","inline":true,"padRight":true},{"text":"denote convolutional layers with ’SAME’ or ’VALID’ padding and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"filters, respectively; all convolutional layers use ","element":"span"},{"text":"(3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3) ","element":"span"},{"text":"filters. MaxPool((2,2), ’VALID’) performs max pooling with ’VALID’ padding and a (2,2) window size. LeCun initialization is used, with the standard deviation of the weights and biases drawn as ","element":"span"},{"style":{"height":16.38},"width":166.14,"height":40.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-14.png","element":"img","alt":" σw =√2","inline":true},{"text":", ","element":"span"},{"style":{"height":14.8},"width":1699.27,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-15.png","element":"img","alt":"σb = 0.05, respectively. 
Trained on CIFAR-10 with SGD, batch size of 256 and L2 regularization = 0.001.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Figures ","element":"span"},{"href":"#id-0","style":{"fontWeight":"bold"},"text":"1","element":"a"},{"style":{"fontWeight":"bold"},"text":", ","element":"span"},{"href":"#id-46","style":{"fontWeight":"bold"},"text":"5c","element":"a"},{"href":"#id-46","style":{"fontWeight":"bold"},"text":",","element":"a"},{"href":"#id-46","style":{"fontWeight":"bold"},"text":"5d","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"Wide ResNet on CIFAR10 using SGD (no momentum). Training on v3-8 TPUs with a total batch size of ","element":"span"},{"text":"1024 ","element":"span"},{"text":"(and per device batch size of ","element":"span"},{"text":"128","element":"span"},{"text":"). They all use ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-16.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"0005","element":"span"},{"text":", LeCun initialization with ","element":"span"},{"style":{"height":14},"width":261.17,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-17.png","element":"img","alt":" σw = 1, σb = 0.","inline":true,"padRight":true},{"text":"There is also data augmentation: we use flip, crop and mixup. With softmax classification, these models can get test accuracy of ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"965 ","element":"span"},{"text":"if one uses cosine decay, so we don’t observe a big performance decay due to using MSE. 
Furthermore, we use JAX’s implementation of Batch Norm, which doesn’t keep track of training batch statistics for test mode evaluation. We have not tuned the learning rate or the ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-18.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization parameter.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Figures ","element":"span"},{"href":"#id-55","style":{"fontWeight":"bold"},"text":"S2","element":"a"},{"style":{"fontWeight":"bold"},"text":",","element":"span"},{"href":"#id-56","style":{"fontWeight":"bold"},"text":"S3","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"Wide ResNet on CIFAR100 using SGD (no momentum). Same setting as Figures ","element":"span"},{"href":"#id-46","text":"5c","element":"a"},{"href":"#id-46","text":", ","element":"a"},{"href":"#id-46","text":"5d ","element":"a"},{"text":"except for the dataset, a different L2 regularization ","element":"span"},{"text":"= 0.000025 ","element":"span"},{"text":"and label smoothing (we have subtracted ","element":"span"},{"text":"0.01 ","element":"span"},{"text":"from the target one-hot labels).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Figure ","element":"span"},{"href":"#id-57","style":{"fontWeight":"bold"},"text":"S7","element":"a"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"text":"Two-hidden-layer ReLU network for one data point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", y ","element":"span"},{"text":"= 1","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Figure ","element":"span"},{"href":"#id-58","style":{"fontWeight":"bold"},"text":"S10","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"Fully connected network with two hidden layers and tanh non-linearity trained on MNIST with SGD (no momentum). Batch size of ","element":"span"},{"text":"512","element":"span"},{"text":", LeCun initialization with ","element":"span"},{"style":{"height":14},"width":261.17,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-19.png","element":"img","alt":" σw = 1, σb = 0.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Figure ","element":"span"},{"href":"#id-59","style":{"fontWeight":"bold"},"text":"S8a","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"Two-hidden-layer fully connected network trained on MNIST with batch size ","element":"span"},{"text":"512","element":"span"},{"text":", NTK normalization with ","element":"span"},{"style":{"height":17.19},"width":284.51,"height":42.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-20.png","element":"img","alt":"σw =√2, σb = 0","inline":true},{"text":". Trained using both SGD with momentum ","element":"span"},{"style":{"height":14.4},"width":127,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-21.png","element":"img","alt":" γ = 0.9","inline":true,"padRight":true},{"text":"and vanilla SGD for three different non-linearities: tanh, ReLU and identity (no non-linearity). 
The learning rate for each non-linearity was chosen to correspond to ","element":"span"},{"style":{"height":20.97},"width":128.37,"height":52.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/14-22.png","element":"img","alt":" η = 1λ0 .","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Rest of SM figures. ","element":"span"},{"text":"Small modifications of experiments in previous figures, specified in captions.","element":"span"}],[{"id":"id-16","style":{"width":"74%"},"width":1458,"height":794,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S1. ","element":"figcaption","subtype":"caption"},{"text":"Visualization of training dynamics in all three phases. In the ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"lazy phase","element":"figcaption","subtype":"caption"},{"text":", the network is approximately linear in its parameters, and converges exponentially to a global minimum. In the ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"catapult phase","element":"figcaption","subtype":"caption"},{"text":", the loss initially grows, while the weight norm and curvature decrease. Once the curvature is low enough, optimization converges. In the ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"divergent phase","element":"figcaption","subtype":"caption"},{"text":", both the loss and parameter magnitudes diverge. ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(a)-(d) ","element":"figcaption","subtype":"caption"},{"text":"Loss surface and training dynamics visualized in a 2d linear subspace. 
The network has a single hidden layer with width ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"= 500","element":"figcaption","subtype":"caption"},{"text":", linear activations, and is trained with MSE loss on a single 1D sample ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"x ","element":"figcaption","subtype":"caption"},{"text":"= 1 ","element":"figcaption","subtype":"caption"},{"text":"with label ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"y ","element":"figcaption","subtype":"caption"},{"text":"= 0","element":"figcaption","subtype":"caption"},{"text":". The parameter subspace is defined by ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":955.02,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-1.png","element":"img","alt":" u = [dim1] r + [dim2] s, v = [dim1] r − [dim2] s, where r and s","inline":true,"padRight":true},{"text":"are orthonormal vectors, ","element":"figcaption","subtype":"caption"},{"style":{"height":13.29},"width":145.74,"height":33.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-2.png","element":"img","alt":" u, v ∈ Rn ","inline":true,"padRight":true},{"text":"are the weight vectors, and ","element":"figcaption","subtype":"caption"},{"text":"[dim1]","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":", ","element":"figcaption","subtype":"caption"},{"text":"[dim2] ","element":"figcaption","subtype":"caption"},{"text":"are the coordinates in the subspace. 
If initialized in this 2d subspace, ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":134.08,"height":28.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-3.png","element":"img","alt":" ut and vt","inline":true,"padRight":true},{"text":"remain in the subspace throughout training, and so training dynamics can be fully visualized with a two dimensional plot. ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(e) ","element":"figcaption","subtype":"caption"},{"text":"Visualization of the loss surface and training dynamics in terms of a nonlinear reparameterization, providing interpretable properties: ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"x-axis ","element":"figcaption","subtype":"caption"},{"text":"correlation between weight vectors, ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"y-axis ","element":"figcaption","subtype":"caption"},{"text":"curvature ","element":"figcaption","subtype":"caption"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-4.png","element":"img","alt":" λ","inline":true},{"text":". The trajectory shown is identical to that in (c), and in Figure ","element":"figcaption","subtype":"caption"},{"href":"#id-0","text":"1","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}]]},{"heading":"B. Experimental results: Late time performance","paragraphs":[[{"id":"id-47","style":{"fontWeight":"bold"},"text":"B.1. CIFAR-100 performance","element":"span"}],[{"text":"We can also repeat the performance experiments for CIFAR-100 and the same Wide ResNet 28-10 setup. 
In this case, with MSE loss and SGD, we need to evolve the system for longer, which requires a smaller ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-5.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization. We did not tune it, but found that ","element":"span"},{"style":{"height":13.79},"width":180.4,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-6.png","element":"img","alt":" 2.5 × 10−5 ","inline":true,"padRight":true},{"text":"works. With only one learning rate decay we get within ","element":"span"},{"text":"3% ","element":"span"},{"text":"of the ","element":"span"},{"href":"#id-60","referenceIndex":36,"text":"Zagoruyko & Komodakis ","element":"a"},{"href":"#id-60","referenceIndex":36,"text":"(","element":"a"},{"href":"#id-60","referenceIndex":36,"text":"2016","element":"a"},{"text":") performance, which was obtained with softmax classification and two learning rate decays. However, longer training is needed: we found that different learning rates converge at ","element":"span"},{"style":{"height":10.8},"width":121.84,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-7.png","element":"img","alt":" ≈ 2000","inline":true,"padRight":true},{"text":"physical epochs. As in the main text experiments, we observe that if we decay after evolving for the same number of physical epochs, larger learning rates do better. See figure ","element":"span"},{"href":"#id-55","text":"S2","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.2. 
Different learning rates converge at the same physical time","element":"span"}],[{"text":"Plotting the test accuracy against physical time for different learning rates shows that, for vanilla SGD, the performance curves lie nearly on top of each other, which is why we consider the fair comparison between learning rates to be at the same physical time.","element":"span"}],[{"text":"We picked a subset of the learning rates from the WRN28-10 CIFAR100 experiment of SM ","element":"span"},{"href":"#id-47","text":"B.1","element":"a"},{"text":". In figure ","element":"span"},{"href":"#id-56","text":"S3","element":"a"},{"text":", we see that although the curves differ slightly, they converge to roughly the same accuracy. The only curve that stands out slightly is ","element":"span"},{"style":{"height":14.8},"width":125.35,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-8.png","element":"img","alt":" η = 2.5","inline":true,"padRight":true},{"text":"which is a rather high learning rate (close to ","element":"span"},{"style":{"height":20.97},"width":62.5,"height":52.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-9.png","element":"img","alt":"12λ0 ).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"B.3. 
Comparison of learning rates for different ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-10.png","element":"img","alt":" L2","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"regularization for WRN28-10 on CIFAR10","element":"span"}],[{"text":"Although the main section considers a model with fixed ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-11.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization, we can also study the effect without ","element":"span"},{"style":{"height":13.19},"width":88.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-12.png","element":"img","alt":" L2 or","inline":true,"padRight":true},{"text":"with a different value. In these two examples, we use the same setup as figures ","element":"span"},{"href":"#id-46","text":"5c","element":"a"},{"href":"#id-46","text":",","element":"a"},{"href":"#id-46","text":"5d","element":"a"},{"text":".","element":"span"}],[{"text":"Without ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-13.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization, we see that the larger learning rate does better even in the absence of learning rate decay, although training takes a very long time. 
In our experience, comparing this setup with state of the art, ","element":"span"},{"style":{"height":13.19},"width":118.15,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/15-14.png","element":"img","alt":" L2 = 0","inline":true,"padRight":true},{"text":"regularization makes","element":"span"}],[{"id":"id-55","style":{"width":"90%"},"width":1761,"height":688,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-0.png","element":"img"}],[{"id":"id-62","style":{"fontStyle":"italic"},"text":"Figure S2. ","element":"figcaption","subtype":"caption"},{"text":"Test accuracy vs learning rate for WRN28-10 and CIFAR100 with vanilla SGD, ","element":"figcaption","subtype":"caption"},{"style":{"height":11.59},"width":40.08,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-1.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization, data augmentation, label smoothing and batch size 1024. The critical learning rate is ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":160.72,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-2.png","element":"img","alt":" ηcrit ≈ 0.4","inline":true},{"text":". (a) Evolved for 38400 steps. 
(b) Evolved for 96000","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":37.43,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-3.png","element":"img","alt":"/η","inline":true,"padRight":true},{"text":"steps with learning rate ","element":"figcaption","subtype":"caption"},{"style":{"height":9.6},"width":19,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-4.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"(blue) and then evolved for 7200 more steps at learning rate ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"01 ","element":"figcaption","subtype":"caption"},{"text":"(red).","element":"figcaption","subtype":"caption"}],[{"id":"id-56","style":{"width":"48%"},"width":936,"height":737,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S3. 
","element":"figcaption","subtype":"caption"},{"text":"Test accuracy vs physical time for different learning rates in the WRN CIFAR100 experiment of the previous section ","element":"figcaption","subtype":"caption"},{"href":"#id-47","text":"B.1","element":"a","subtype":"caption"}],[{"text":"the experiment take longer before convergence but does not influence performance much.","element":"span"}],[{"text":"In the presence of ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-6.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization we picked the particular value ","element":"span"},{"style":{"height":13.59},"width":208.9,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-7.png","element":"img","alt":" L2 = 0.0005","inline":true,"padRight":true},{"text":"in order to make sure that our conclusion is not dependent on the choice of ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-8.png","element":"img","alt":" L2","inline":true},{"text":", the only hyperparameter (other than ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-9.png","element":"img","alt":" η","inline":true},{"text":"), we have considered a larger ","element":"span"},{"style":{"height":13.19},"width":321.19,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/16-10.png","element":"img","alt":" L2 = 0.001. We see","inline":true,"padRight":true},{"text":"that the optimal performance in physical time is also peaked in the catapult phase, although the difference here is smaller.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.4. 
Training accuracy plots","element":"span"}],[{"text":"The training accuracies of the previous experiments are shown in figure ","element":"span"},{"href":"#id-61","text":"S6","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"45%"},"width":889,"height":672,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/17-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S4. ","element":"figcaption","subtype":"caption"},{"text":"WRN28-10 on CIFAR10 without ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":40.08,"height":28.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/17-1.png","element":"img","alt":" L2","inline":true},{"text":". Same setup as ","element":"figcaption","subtype":"caption"},{"href":"#id-46","text":"5d ","element":"a","subtype":"caption"},{"text":"but evolved for longer times.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"97%"},"width":1892,"height":737,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/17-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S5. ","element":"figcaption","subtype":"caption"},{"text":"Test accuracies for a larger ","element":"figcaption","subtype":"caption"},{"style":{"height":11.59},"width":40.08,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/17-3.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"CIFAR10 experiment like that of the main section. (a) WRN CIFAR-10 7200 steps as in figure ","element":"figcaption","subtype":"caption"},{"href":"#id-46","text":"5c","element":"a","subtype":"caption"},{"text":". 
(b) WRN CIFAR10 2400 physical steps and then 4800 more steps at learning rate ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"01 ","element":"figcaption","subtype":"caption"},{"text":"as in figure ","element":"figcaption","subtype":"caption"},{"href":"#id-46","text":"5d","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"id":"id-61","style":{"width":"93%"},"width":1809,"height":1425,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/18-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S6. ","element":"figcaption","subtype":"caption"},{"text":"Training accuracies for the performance experiments. Smaller learning rates have higher training accuracy when compared in physical time. However, they still perform worse for a fixed number of steps. (a) WRN CIFAR-10 12000 steps as in figure ","element":"figcaption","subtype":"caption"},{"href":"#id-46","text":"5c","element":"a","subtype":"caption"},{"text":". (b) WRN CIFAR10 3360 physical steps as in figure ","element":"figcaption","subtype":"caption"},{"href":"#id-46","text":"5d","element":"a","subtype":"caption"},{"text":". (c) WRN CIFAR100 ","element":"figcaption","subtype":"caption"},{"text":"38400 ","element":"figcaption","subtype":"caption"},{"text":"steps as in figure ","element":"figcaption","subtype":"caption"},{"href":"#id-55","text":"S2a","element":"a","subtype":"caption"},{"text":".(d) WRN CIFAR100 ","element":"figcaption","subtype":"caption"},{"text":"96000 ","element":"figcaption","subtype":"caption"},{"text":"physical steps as in figure ","element":"figcaption","subtype":"caption"},{"href":"#id-62","text":"S2b","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}]]},{"heading":"C. 
Experimental results: Early time dynamics","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"C.1. ReLU activations for the simple model","element":"span"}],[{"text":"In the main text we have been using ReLU non-linearities. Compared with the simple model with no non-linearities, ReLU networks remain trainable over a broader regime, beyond ","element":"span"},{"style":{"height":20.97},"width":112.04,"height":52.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-0.png","element":"img","alt":" η = 4λ0","inline":true,"padRight":true},{"text":". These networks appear to train well generically up to ","element":"span"},{"style":{"height":20.97},"width":112.38,"height":52.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-1.png","element":"img","alt":" η = 12λ0","inline":true,"padRight":true},{"text":". ","element":"span"},{"text":"This is a generic feature of deep ReLU networks and can already be observed for the model of section ","element":"span"},{"text":"2 ","element":"span"},{"text":"with a target ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"= 1","element":"span"},{"text":", two hidden layers and a ReLU non-linearity: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u.ReLU","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"w.ReLU","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"))","element":"span"},{"text":" (see figure ","element":"span"},{"href":"#id-57","text":"S7","element":"a"},{"text":"). 
In this single-sample setting, for ","element":"span"},{"style":{"height":19.37},"width":115.73,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-2.png","element":"img","alt":" η ≥ 12λ","inline":true,"padRight":true},{"text":", the loss does not diverge, but the neurons die and the network ends up computing the trivial ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"= 0 ","element":"span"},{"text":"function. For deep ","element":"span"},{"text":"networks with more than one hidden layer and multiple samples, as discussed in the main text, we observe that the loss diverges after ","element":"span"},{"style":{"height":19.37},"width":93.4,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-3.png","element":"img","alt":" ∼ 12λ .","inline":true}],[{"id":"id-57","style":{"width":"97%"},"width":1888,"height":735,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S7. ","element":"figcaption","subtype":"caption"},{"text":"Simple model with ReLU non-linearity (","element":"figcaption","subtype":"caption"},{"style":{"height":13.6},"width":179.15,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-5.png","element":"img","alt":"ηcrit = 2.54","inline":true},{"text":"). (b) is evaluated at physical time ","element":"figcaption","subtype":"caption"},{"text":"100","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"C.2. Momenta","element":"span"}],[{"text":"The choice of optimizer also affects these dynamics. 
If we consider a similar setup with momenta, we first expect a linear model to converge over a broader range ","element":"span"},{"style":{"height":20.97},"width":241.18,"height":52.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-6.png","element":"img","alt":" η < 2λ0 (1 + γ)","inline":true},{"text":". For smooth non-linearities, we observe that for ","element":"span"},{"style":{"height":20.97},"width":268.96,"height":52.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-7.png","element":"img","alt":" η < 2λ0 , the λt is","inline":true,"padRight":true},{"text":"constant. However, this is not true for ReLU; see figure ","element":"span"},{"href":"#id-59","text":"S8a","element":"a"},{"text":". In fact, for ReLU networks, we observe that there is a small learning rate, roughly ","element":"span"},{"style":{"height":19.81},"width":234.65,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-8.png","element":"img","alt":" ηeff,crit = ηcrit1−γ","inline":true,"padRight":true},{"text":", below which the time dynamics of ","element":"span"},{"style":{"height":13.19},"width":35.24,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-9.png","element":"img","alt":" λt","inline":true,"padRight":true},{"text":"are similar (but non-constant). However, for ","element":"span"},{"style":{"height":13.59},"width":186.52,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-10.png","element":"img","alt":"η > ηeff,crit","inline":true},{"text":", there are strong time dynamics; we illustrate this in figure ","element":"span"},{"href":"#id-59","text":"S8b ","element":"a"},{"text":"with a three-hidden-layer ReLU network.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.3. 
Effect of ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-11.png","element":"img","alt":" L2","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"regularization on early time dynamics","element":"span"}],[{"text":"We don’t expect ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-12.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization to affect the early time dynamics, but because of the strong rearrangement that goes on in the first steps, it could potentially have a non-trivial effect; among other things, the Hessian spectrum is necessarily decaying. We see that the dynamics driving the rearrangement are roughly the same, even if the maximum eigenvalue at early times is decreasing slowly.","element":"span"}],[{"id":"id-43","style":{"fontWeight":"bold"},"text":"C.4. Tanh activations","element":"span"}],[{"text":"We observe that for Tanh activation, ","element":"span"},{"style":{"height":10.4},"width":78.88,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-13.png","element":"img","alt":" ηmax","inline":true,"padRight":true},{"text":"is closer to the simple model expectation ","element":"span"},{"style":{"height":20.97},"width":32.9,"height":52.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-14.png","element":"img","alt":"4λ0 ","inline":true,"padRight":true},{"text":", see figure ","element":"span"},{"href":"#id-58","text":"S10","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.5. 
WRN NTK Normalization","element":"span"}],[{"text":"As illustrated in the text in figures ","element":"span"},{"href":"#id-42","text":"3","element":"a"},{"href":"#id-42","style":{"fontStyle":"italic"},"text":"b","element":"a"},{"href":"#id-42","style":{"fontStyle":"italic"},"text":", ","element":"a"},{"href":"#id-42","text":"3","element":"a"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"we also see this behaviour for NTK normalization. For completeness we include the WRN model with NTK normalization. From the linearized intuition, we expect the phases to also be determined by the quantity ","element":"span"},{"style":{"height":14.4},"width":56.46,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/19-15.png","element":"img","alt":" ηλt","inline":true},{"text":", independently of the normalization. Figure ","element":"span"},{"href":"#id-63","text":"S11 ","element":"a"},{"text":"has the same setup as in figure ","element":"span"},{"href":"#id-42","text":"3","element":"a"},{"text":".","element":"span"}],[{"id":"id-59","style":{"width":"91%"},"width":1771,"height":696,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S8. 
","element":"figcaption","subtype":"caption"},{"text":"(a) Evolution of the normalized curvature ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":396.91,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-1.png","element":"img","alt":" λt/λ0 for d = 2 w = 2048","inline":true,"padRight":true},{"text":"FC connected networks evolved with momenta (same networks with SGD with dashed line for reference) evolved for ","element":"figcaption","subtype":"caption"},{"style":{"height":18.99},"width":105.22,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-2.png","element":"img","alt":" η = 1λ0 ","inline":true,"padRight":true},{"text":". We observe that ReLU networks evolved with momenta doesn’t ","element":"figcaption","subtype":"caption"},{"text":"have a constant kernel in the naive ‘lazy’ phase. (b) ","element":"figcaption","subtype":"caption"},{"style":{"height":13.3},"width":416.79,"height":33.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-3.png","element":"img","alt":" ηcrit = 6.96, ηcrit,eff = 0.69","inline":true,"padRight":true},{"text":"Same setup as the FC network of figure ","element":"figcaption","subtype":"caption"},{"href":"#id-42","text":"3 ","element":"a","subtype":"caption"},{"text":"with momenta ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":116.91,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-4.png","element":"img","alt":"γ = 0.9","inline":true},{"text":": fully connected, three hidden layers ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"w ","element":"figcaption","subtype":"caption"},{"text":"= 2048","element":"figcaption","subtype":"caption"},{"text":", ReLU non-linearity. 
","element":"figcaption","subtype":"caption"},{"style":{"height":9.6},"width":63.4,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-5.png","element":"img","alt":" ηcrit","inline":true,"padRight":true},{"text":"is slightly different due to variations at initialization.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"97%"},"width":1888,"height":734,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S9. ","element":"figcaption","subtype":"caption"},{"text":"Same WRN as figure ","element":"figcaption","subtype":"caption"},{"href":"#id-42","text":"3","element":"a","subtype":"caption"},{"text":"d,f with ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":40.08,"height":28.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-7.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"regularization","element":"figcaption","subtype":"caption"},{"text":"= 0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"0005","element":"figcaption","subtype":"caption"},{"text":". Dynamics in physical steps of the ","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":444.69,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-8.png","element":"img","alt":" λt and λt vs η. 
ηcrit = 0.18 a)","inline":true},{"style":{"height":12.8},"width":124.18,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/20-9.png","element":"img","alt":"λt, b) λt","inline":true,"padRight":true},{"text":"at physical time ","element":"figcaption","subtype":"caption"},{"text":"25","element":"figcaption","subtype":"caption"}],[{"id":"id-58","style":{"width":"97%"},"width":1888,"height":735,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/21-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S10. ","element":"span"},{"text":"Maximum NTK eigenvalue ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/21-1.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"at early times for a 2 hidden layer fully connected network with tanh non-linearity trained on MNIST, with ","element":"span"},{"style":{"height":12.8},"width":179.15,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/21-2.png","element":"img","alt":" ηcrit = 0.06","inline":true},{"text":". (a) Early time dynamics of the curvature for learning rates in the linear and catapult phase. (b) ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/21-3.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"measured at ","element":"span"},{"style":{"height":12.8},"width":109.58,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/21-4.png","element":"img","alt":"ηt = 3.","inline":true}],[{"id":"id-63","style":{"width":"98%"},"width":1918,"height":817,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/21-5.png","element":"img"}]]},{"heading":"D. Theoretical details","paragraphs":[[{"id":"id-41","style":{"fontWeight":"bold"},"text":"D.1. 
Full model analysis","element":"span"}],[{"text":"Here we provide additional details on the theoretical analysis of the full model in Section ","element":"span"},{"href":"#id-64","text":"2.2","element":"a"},{"text":". The gradient descent update equations are","element":"span"}],[{"style":{"width":"76%"},"width":1481,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-0.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"64%"},"width":1264,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-1.png","element":"img"}],[{"text":"The update equations for the error and kernel evaluated on training set inputs are","element":"span"}],[{"id":"id-65","style":{"width":"76%"},"width":1483,"height":303,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.79},"width":393.1,"height":49.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-3.png","element":"img","alt":" ζ := ∑α ˜fαxα/m ∈ Rd","inline":true},{"text":". We now consider the dynamics of the kernel projected onto the ","element":"span"},{"style":{"height":18.2},"width":28.58,"height":45.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-4.png","element":"img","alt":" ˜f","inline":true,"padRight":true},{"text":"direction, which is given ","element":"span"},{"text":"by","element":"span"}],[{"id":"id-66","style":{"width":"70%"},"width":1372,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-5.png","element":"img"}],[{"text":"Let us now analyze the phase structure of (","element":"span"},{"href":"#id-65","text":"S3","element":"a"},{"text":") and (","element":"span"},{"href":"#id-66","text":"S5","element":"a"},{"text":"). 
For now, we neglect the last term on the right-hand side of (","element":"span"},{"href":"#id-65","text":"S3","element":"a"},{"text":") (at initialization this term is of order ","element":"span"},{"style":{"height":13.38},"width":64.82,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-6.png","element":"img","alt":" n−1","inline":true,"padRight":true},{"text":"and is negligible at large width). Let ","element":"span"},{"style":{"height":13.19},"width":39.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-7.png","element":"img","alt":" λ0","inline":true,"padRight":true},{"text":"be the maximal eigenvalue of the kernel at initialization, and let ","element":"span"},{"style":{"height":11.6},"width":184.85,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-8.png","element":"img","alt":" emax ∈ Rm ","inline":true,"padRight":true},{"text":"be the corresponding eigenvector. Notice that ","element":"span"},{"style":{"height":18.21},"width":28.58,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-9.png","element":"img","alt":"˜f","inline":true,"padRight":true},{"text":"projected onto the top eigenvector evolves as","element":"span"}],[{"id":"id-67","style":{"width":"68%"},"width":1333,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lazy phase. 
","element":"span"},{"text":"When ","element":"span"},{"style":{"height":14.4},"width":135.47,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-11.png","element":"img","alt":" ηλ0 < 2","inline":true},{"text":", we see that ","element":"span"},{"style":{"height":19.01},"width":164.36,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-12.png","element":"img","alt":" |emaxT ˜f t|","inline":true,"padRight":true},{"text":"shrinks during training. The kernel updates are of order ","element":"span"},{"style":{"height":13.38},"width":64.83,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-13.png","element":"img","alt":" n−1","inline":true},{"text":", while convergence happens in order ","element":"span"},{"style":{"height":13.38},"width":39.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-14.png","element":"img","alt":" n0 ","inline":true,"padRight":true},{"text":"steps. Therefore the kernel does not change by much during training. This is a special case of the NTK result (","element":"span"},{"href":"#id-18","referenceIndex":12,"text":"Jacot et al.","element":"a"},{"href":"#id-18","referenceIndex":12,"text":", ","element":"a"},{"href":"#id-18","referenceIndex":12,"text":"2018","element":"a"},{"text":"). Effectively, the model evolves as a linear model in this phase.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Catapult phase. 
","element":"span"},{"text":"When ","element":"span"},{"style":{"height":19.01},"width":307.4,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-15.png","element":"img","alt":" 2 < ηλ0 < 4, ∥ ˜f∥2","inline":true,"padRight":true},{"text":"grows exponentially fast, and it grows fastest in the ","element":"span"},{"style":{"height":10.58},"width":77.65,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-16.png","element":"img","alt":" emax ","inline":true,"padRight":true},{"text":"direction. Therefore, the vector ","element":"span"},{"style":{"height":18.21},"width":28.58,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-17.png","element":"img","alt":"˜f","inline":true,"padRight":true},{"text":"becomes aligned with ","element":"span"},{"style":{"height":10.58},"width":77.65,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-18.png","element":"img","alt":" emax","inline":true,"padRight":true},{"text":"after a number of steps that is of order ","element":"span"},{"style":{"height":13.38},"width":39.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-19.png","element":"img","alt":" n0","inline":true},{"text":". Also, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"itself grows quickly while the label is constant, and so we find that ","element":"span"},{"style":{"height":19},"width":390.77,"height":47.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-20.png","element":"img","alt":" f ≈ ˜f ≈ (emaxT ˜f)emax ","inline":true,"padRight":true},{"text":"after a similar number of steps. 
When these approximations hold, notice that ","element":"span"},{"style":{"height":19.01},"width":286.18,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-21.png","element":"img","alt":"˜f T Θ ˜f ≈ λ · ∥ ˜f∥22","inline":true},{"text":". From equation (","element":"span"},{"href":"#id-66","text":"S5","element":"a"},{"text":") we can then derive an approximate equation for the evolution of the ","element":"span"},{"text":"top NTK eigenvalue.","element":"span"}],[{"style":{"width":"61%"},"width":1200,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-22.png","element":"img"}],[{"text":"While ","element":"span"},{"style":{"height":18.2},"width":28.58,"height":45.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-23.png","element":"img","alt":"˜f","inline":true,"padRight":true},{"text":"grows exponentially fast, so will ","element":"span"},{"style":{"height":14},"width":19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-24.png","element":"img","alt":" ζ","inline":true},{"text":". 
When ","element":"span"},{"style":{"height":14},"width":29.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-25.png","element":"img","alt":" ζt","inline":true,"padRight":true},{"text":"becomes of order ","element":"span"},{"style":{"height":14.18},"width":72.14,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-26.png","element":"img","alt":" n1/2","inline":true},{"text":", the updates to the top eigenvalue become of order ","element":"span"},{"style":{"height":13.38},"width":39.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-27.png","element":"img","alt":" n0","inline":true,"padRight":true},{"text":"(and negative), causing ","element":"span"},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-28.png","element":"img","alt":" λt","inline":true,"padRight":true},{"text":"to decrease by a non-negligible amount. This will continue until ","element":"span"},{"style":{"height":16},"width":150.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-29.png","element":"img","alt":" λt < 2/η","inline":true},{"text":", at which point ","element":"span"},{"style":{"height":18.2},"width":31.51,"height":45.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-30.png","element":"img","alt":"˜ft","inline":true,"padRight":true},{"text":"will start converging to zero. 
Eventually, after a number of steps of order ","element":"span"},{"text":"log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":")","element":"span"},{"text":", gradient descent will converge to a global minimum that has a lower curvature than the curvature at initialization.","element":"span"}],[{"text":"The justification for dropping the order ","element":"span"},{"style":{"height":13.38},"width":64.82,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-31.png","element":"img","alt":" n−1","inline":true,"padRight":true},{"text":"term in (","element":"span"},{"href":"#id-67","text":"S6","element":"a"},{"text":") was explained in the warmup model: While this term may affect the details of the dynamics, eventually the maximum kernel eigenvalue must drop below ","element":"span"},{"style":{"height":16},"width":59.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-32.png","element":"img","alt":" 2/η","inline":true,"padRight":true},{"text":"for the component ","element":"span"},{"style":{"height":18.21},"width":173.95,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-33.png","element":"img","alt":" emaxT ˜f of","inline":true,"padRight":true},{"text":"the error (and therefore for the loss) to converge to zero.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Divergent phase. 
","element":"span"},{"text":"When ","element":"span"},{"style":{"height":14.4},"width":143.86,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-34.png","element":"img","alt":" ηλ0 > 4","inline":true},{"text":", both ","element":"span"},{"style":{"height":19.01},"width":79.65,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-35.png","element":"img","alt":" ∥ ˜f∥22","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-36.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"will grow, and optimization will diverge. Therefore, ","element":"span"},{"style":{"height":16},"width":79.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/22-37.png","element":"img","alt":" 4/λ0","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"text":"maximum learning rate for this model.","element":"span"}]]},{"heading":"E. Model dynamics close to the critical learning rate","paragraphs":[[{"text":"Here we consider the gradient descent dynamics of the model analyzed in Section ","element":"span"},{"text":"2","element":"span"},{"text":", for learning rates ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-0.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"that are close to the critical point ","element":"span"},{"style":{"height":16},"width":202.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-1.png","element":"img","alt":" ηcrit = 2/λ0","inline":true},{"text":". The analysis reveals that the gradient descent dynamics of the model are qualitatively different above and below this point. 
For example, the loss decreases monotonically during training when ","element":"span"},{"style":{"height":12.4},"width":141.84,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-2.png","element":"img","alt":" η < ηcrit","inline":true},{"text":", but not when ","element":"span"},{"style":{"height":12.4},"width":147.38,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-3.png","element":"img","alt":"η > ηcrit","inline":true},{"text":". In this section we show that the transition from small to large learning rate becomes sharp once we take the modified large width limit, in the following sense: certain functions of the learning rate become non-analytic at ","element":"span"},{"style":{"height":14.8},"width":169.64,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-4.png","element":"img","alt":" ηcrit in the","inline":true,"padRight":true},{"text":"limit. This sharp transition bears close resemblance to phase transitions of the kind found in physical systems, such as the transition between the liquid and gaseous phases of water. In particular, our case involves a dynamical system, where the dynamics are governed by the gradient descent equations. These dynamics undergo a phase transition as a function of the learning rate — an external parameter. We point to the logistic map (","element":"span"},{"href":"#id-68","referenceIndex":22,"text":"May","element":"a"},{"href":"#id-68","referenceIndex":22,"text":", ","element":"a"},{"href":"#id-68","referenceIndex":22,"text":"1976","element":"a"},{"text":") as a well-known example of a dynamical system that undergoes phase transitions as a function of an external parameter.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.1. Non-perturbative dynamics","element":"span"}],[{"text":"A phase transition is a drastic change in a system’s behavior incurred under a small change in external parameters. 
Mathematically, it is a non-analyticity in some property of the system as a function of these parameters. For example, consider the property ","element":"span"},{"style":{"height":16},"width":94.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-5.png","element":"img","alt":" λ∗(η)","inline":true},{"text":", the curvature of the model at the end of training as a function of the learning rate. In the modified large width limit, ","element":"span"},{"style":{"height":16},"width":94.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-6.png","element":"img","alt":" λ∗(η)","inline":true,"padRight":true},{"text":"is constant for ","element":"span"},{"style":{"height":12.4},"width":141.84,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-7.png","element":"img","alt":" η < ηcrit","inline":true},{"text":", but not for ","element":"span"},{"style":{"height":12.4},"width":141.84,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-8.png","element":"img","alt":" η > ηcrit","inline":true},{"text":". Therefore, this function is not analytic at ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-9.png","element":"img","alt":"ηcrit","inline":true},{"text":". Notice that this statement is true in the limit but not necessarily at finite width, where the final curvature may be an analytic function of the learning rate even at ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-10.png","element":"img","alt":" ηcrit","inline":true},{"text":". It is well known in physics that phase transitions only occur in a limit where the number of dynamical variables (in this case the number of model parameters) is taken to infinity. 
One immediate consequence of the non-analyticity at ","element":"span"},{"style":{"height":10.4},"width":67.4,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-11.png","element":"img","alt":" ηcrit","inline":true,"padRight":true},{"text":"is that the large learning rate phase is inaccessible from the small learning rate phase via a perturbative expansion. In other words, we cannot describe all properties of the model for some ","element":"span"},{"style":{"height":12.4},"width":141.78,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-12.png","element":"img","alt":" η > ηcrit","inline":true,"padRight":true},{"text":"by doing a Taylor expansion around a point ","element":"span"},{"style":{"height":12.4},"width":158.2,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-13.png","element":"img","alt":" η0 < ηcrit","inline":true,"padRight":true},{"text":"and keeping a finite number of terms.","element":"span"}],[{"href":"#id-30","referenceIndex":8,"text":"Dyer & Gur-Ari ","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"(","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-31","referenceIndex":11,"text":"Huang & Yau ","element":"a"},{"href":"#id-31","referenceIndex":11,"text":"(","element":"a"},{"href":"#id-31","referenceIndex":11,"text":"2019","element":"a"},{"text":") developed a formalism that allows one to compute finite-width corrections to various properties of deep networks, using a perturbative expansion around the infinite width limit. 
We have argued that the usual infinite width approximation to the training dynamics is not valid for learning rates above ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-14.png","element":"img","alt":" ηcrit","inline":true},{"text":", and that a full analysis must account for large finite-width effects. One may have hoped that including the perturbative finite-width corrections discussed in ","element":"span"},{"href":"#id-30","referenceIndex":8,"text":"Dyer & Gur-Ari ","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"(","element":"a"},{"href":"#id-30","referenceIndex":8,"text":"2020","element":"a"},{"text":"); ","element":"span"},{"href":"#id-31","referenceIndex":11,"text":"Huang & Yau ","element":"a"},{"href":"#id-31","referenceIndex":11,"text":"(","element":"a"},{"href":"#id-31","referenceIndex":11,"text":"2019","element":"a"},{"text":") would allow us to regain analytic control over the dynamics. The results presented here suggest that this is not the case: For ","element":"span"},{"style":{"height":12.4},"width":144.98,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-15.png","element":"img","alt":" η > ηcrit","inline":true},{"text":", we expect that the perturbative expansion will not provide a good approximation to the gradient descent dynamics at any finite order in inverse width.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.2. Critical exponents","element":"span"}],[{"text":"When the external parameters are close to a phase transition, one often finds that the dynamical properties of the system obey power law behavior. 
The exponents of these power laws (called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"critical exponents","element":"span"},{"text":") are of interest because they are often universal, in the sense that the same set of exponents describes the phase transitions of completely different physical systems.","element":"span"}],[{"text":"Here we consider ","element":"span"},{"style":{"height":16},"width":85.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-16.png","element":"img","alt":" t∗(η)","inline":true},{"text":", the number of steps until convergence, as a function of the learning rate. We will now show that ","element":"span"},{"style":{"height":12.39},"width":30.39,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-17.png","element":"img","alt":" t∗","inline":true,"padRight":true},{"text":"exhibits power-law behavior when ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-18.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"is close to ","element":"span"},{"style":{"height":10.4},"width":67.41,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-19.png","element":"img","alt":" ηcrit","inline":true},{"text":". For simplicity we consider the warmup model studied in Section ","element":"span"},{"text":"2","element":"span"},{"text":". 
First, suppose that we are below the transition, setting ","element":"span"},{"style":{"height":14.4},"width":224.47,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-20.png","element":"img","alt":" ηλ0 = 2 − ϵ","inline":true,"padRight":true},{"text":"for some small ","element":"span"},{"style":{"height":11.6},"width":103.86,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-21.png","element":"img","alt":" ϵ > 0","inline":true},{"text":". From the update equation, ","element":"span"},{"style":{"height":16},"width":555.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-22.png","element":"img","alt":" ft+1 ≈ (1 − ηλt)ft ≈ −(1 − ϵ)ft","inline":true,"padRight":true},{"text":"we see that ","element":"span"},{"style":{"height":14},"width":31.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-23.png","element":"img","alt":" ft","inline":true,"padRight":true},{"text":"will converge to some fixed small value ","element":"span"},{"style":{"height":14},"width":35.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-24.png","element":"img","alt":" f∗","inline":true,"padRight":true},{"text":"after time ","element":"span"},{"style":{"height":17.38},"width":410.72,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-25.png","element":"img","alt":"t∗ ≈ ϵ−1 log(f −1∗ ) ∼ ϵ−1","inline":true},{"text":". 
Here we assumed that ","element":"span"},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-26.png","element":"img","alt":" λt ","inline":true,"padRight":true},{"text":"is constant in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", which is true as long as ","element":"span"},{"style":{"height":12.39},"width":30.39,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-27.png","element":"img","alt":" t∗ ","inline":true,"padRight":true},{"text":"is independent of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"(namely ","element":"span"},{"text":"we fix ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-28.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"and then take ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"large). Therefore, the convergence time below the transition scales as ","element":"span"},{"style":{"height":17.39},"width":297.62,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-29.png","element":"img","alt":" t∗ ∼ (ηcrit − η)−1","inline":true},{"text":", and the critical exponent is -1.","element":"span"}],[{"text":"Next, suppose that ","element":"span"},{"style":{"height":14.8},"width":379.92,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-30.png","element":"img","alt":" ηλ0 = 2 + ϵ with ϵ > 0","inline":true},{"text":". Now the update equation reads ","element":"span"},{"style":{"height":16},"width":305.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-31.png","element":"img","alt":" ft+1 ≈ −(1 + ϵ)ft","inline":true},{"text":". This approximation holds early during training, when the curvature updates are small. 
Initially, ","element":"span"},{"style":{"height":16},"width":55.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-32.png","element":"img","alt":" |ft|","inline":true,"padRight":true},{"text":"will grow until it is of order ","element":"span"},{"style":{"height":16},"width":57.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-33.png","element":"img","alt":"√n","inline":true},{"text":", at which point the updates to ","element":"span"},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-34.png","element":"img","alt":" λt","inline":true,"padRight":true},{"text":"become of order ","element":"span"},{"style":{"height":13.39},"width":39.92,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-35.png","element":"img","alt":" n0","inline":true},{"text":". This will happen in time ","element":"span"},{"style":{"height":17.46},"width":248.45,"height":43.65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-36.png","element":"img","alt":"ˆt ∼ ϵ−1 log √n","inline":true},{"text":". Following this, the optimizer will converge. At this point ","element":"span"},{"style":{"height":14.4},"width":56.46,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-37.png","element":"img","alt":" ηλt","inline":true,"padRight":true},{"text":"is no longer tuned to be close to the transition, and so the convergence time measured from this point on will not be sensitive to ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-38.png","element":"img","alt":" ϵ","inline":true},{"text":". 
Therefore, for small ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-39.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"the convergence time will be dominated by the early part of training, namely ","element":"span"},{"style":{"height":16.13},"width":210.45,"height":40.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/23-40.png","element":"img","alt":"t∗ ≈ ˆt ∼ ϵ−1","inline":true},{"text":". The critical exponent is again -1. Figure ","element":"span"},{"href":"#id-69","text":"S12 ","element":"a"},{"text":"shows an empirical verification of this behavior.","element":"span"}],[{"id":"id-69","style":{"width":"57%"},"width":1119,"height":902,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/24-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S12. ","element":"span"},{"text":"The convergence time diverges when the learning rate is close to the critical value ","element":"span"},{"style":{"height":9.6},"width":63.41,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/24-1.png","element":"img","alt":" ηcrit","inline":true},{"text":", indicated by the solid green line. The measured exponents (shown in parentheses) are close to the predicted value of -1. The experiment uses the warmup model of Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"with width ","element":"span"},{"text":"16","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"},{"text":"000","element":"span"},{"text":".","element":"span"}]]},{"heading":"F. Additional evidence for linearization in the catapult phase.","paragraphs":[[{"text":"Here we present some more detailed evidence for the re-emergence of linear dynamics in the catapult phase. 
Figure ","element":"span"},{"href":"#id-70","text":"S13 ","element":"a"},{"text":"shows results for models trained on subsets of MNIST with learning rates ","element":"span"},{"style":{"height":12.4},"width":141.81,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/24-2.png","element":"img","alt":" η > ηcrit","inline":true},{"text":". In Figure ","element":"span"},{"href":"#id-70","text":"S13a ","element":"a"},{"text":"we see that for a one-hidden-layer fully connected model trained on 512 MNIST images, the performance of the full non-linear model and that of the model linearized after 10 steps track each other closely. Models evolve as linear models when the NTK is constant. In Figure ","element":"span"},{"href":"#id-70","text":"S13b ","element":"a"},{"text":"we give evidence that as networks become wider, the change in the kernel decreases.","element":"span"}],[{"id":"id-70","style":{"width":"96%"},"width":1886,"height":749,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/24-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure S13. ","element":"span"},{"text":"Evidence for a return of linear dynamics after ","element":"span"},{"style":{"height":10.8},"width":38.61,"height":26.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/24-4.png","element":"img","alt":" tlin","inline":true},{"text":". (a, b) The same model as in Figure ","element":"span"},{"href":"#id-44","text":"4 ","element":"a"},{"text":"with the addition of models linearized at steps ","element":"span"},{"text":"0 ","element":"span"},{"text":"and ","element":"span"},{"text":"10","element":"span"},{"text":". 
We observe that the model linearized after 10 steps tracks the performance of the non-linear model in the ‘catapult’ phase up to ","element":"span"},{"style":{"height":18.99},"width":105.38,"height":47.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/24-5.png","element":"img","alt":" η ∼ 4/λ0 ","inline":true,"padRight":true},{"text":"(c) The change in the NTK between ","element":"span"},{"style":{"height":13.6},"width":415.64,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.02218/images/24-6.png","element":"img","alt":" tlin = 50 steps and t = 1000","inline":true,"padRight":true},{"text":"steps decreases as the width increases. Here we consider ","element":"span"},{"text":"2-class MNIST with 100 samples per class.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]