36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"1912.13053","publisher":"arxiv","paperJSON":{"title":"Disentangling trainability and generalization in deep learning","paperID":"1912.13053","avgLineHeight":11.91,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3c","element":"span"},{"href":"https://colab.research.google.com/github/google/neural-tangents/blob/master/notebooks/Disentangling_Trainability_and_Generalization.ipynb","text":"colab","element":"a"},{"text":"1","element":"span"}],[{"text":"notebook that reproduces the essential results of the paper.","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"Machine learning models based on deep neural networks have attained state-of-the-art performance across a dizzying array of tasks including vision (","element":"span"},{"href":"#id-0","referenceIndex":9,"text":"Cubuk et al.","element":"a"},{"href":"#id-0","referenceIndex":9,"text":", ","element":"a"},{"href":"#id-0","referenceIndex":9,"text":"2019","element":"a"},{"text":"), speech recognition (","element":"span"},{"href":"#id-1","referenceIndex":32,"text":"Park et al.","element":"a"},{"href":"#id-1","referenceIndex":32,"text":", ","element":"a"},{"href":"#id-1","referenceIndex":32,"text":"2019","element":"a"},{"text":"), machine translation (","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"Bah","element":"a"},{"href":"#id-2","referenceIndex":3,"text":"danau et al.","element":"a"},{"href":"#id-2","referenceIndex":3,"text":", ","element":"a"},{"href":"#id-2","referenceIndex":3,"text":"2014","element":"a"},{"text":"), chemical property prediction (","element":"span"},{"href":"#id-3","referenceIndex":16,"text":"Gilmer ","element":"a"},{"href":"#id-3","referenceIndex":16,"text":"et al.","element":"a"},{"href":"#id-3","referenceIndex":16,"text":", ","element":"a"},{"href":"#id-3","referenceIndex":16,"text":"2017","element":"a"},{"text":"), diagnosing medical conditions 
(","element":"span"},{"href":"#id-4","referenceIndex":36,"text":"Raghu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":36,"text":"2019","element":"a"},{"text":"), and playing games (","element":"span"},{"href":"#id-5","referenceIndex":39,"text":"Silver et al.","element":"a"},{"href":"#id-5","referenceIndex":39,"text":", ","element":"a"},{"href":"#id-5","referenceIndex":39,"text":"2018","element":"a"},{"text":"). Historically, the rampant success of deep learning models has lacked a sturdy theoretical foundation: architectures, hyperparameters, and learning algorithms are often selected by brute-force search (","element":"span"},{"href":"#id-6","referenceIndex":4,"text":"Bergstra & Bengio","element":"a"},{"href":"#id-6","referenceIndex":4,"text":", ","element":"a"},{"href":"#id-6","referenceIndex":4,"text":"2012","element":"a"},{"text":") and heuristics (","element":"span"},{"href":"#id-7","referenceIndex":17,"text":"Glorot & Bengio","element":"a"},{"href":"#id-7","referenceIndex":17,"text":", ","element":"a"},{"href":"#id-7","referenceIndex":17,"text":"2010","element":"a"},{"text":"). Recently, significant theoretical progress has been made on several fronts that have shown promise in making neural network design more systematic. In particular, in the infinite-width (or channel) limit, the distribution of functions induced by neural networks with random weights and biases has been precisely characterized before, during, and after training.","element":"span"}],[{"text":"The study of infinite networks dates back to seminal work by ","element":"span"},{"href":"#id-8","referenceIndex":29,"text":"Neal ","element":"a"},{"href":"#id-8","referenceIndex":29,"text":"(","element":"a"},{"href":"#id-8","referenceIndex":29,"text":"1994","element":"a"},{"text":"), who showed that the distribution of functions given by single hidden-layer networks with random weights and biases in the infinite-width limit is a Gaussian Process (GP). 
Recently, there has been renewed interest in studying random, infinite, networks starting with concurrent work on “conjugate kernels” (","element":"span"},{"href":"#id-9","referenceIndex":11,"text":"Daniely et al.","element":"a"},{"href":"#id-9","referenceIndex":11,"text":", ","element":"a"},{"href":"#id-9","referenceIndex":11,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":10,"text":"Daniely","element":"a"},{"href":"#id-10","referenceIndex":10,"text":", ","element":"a"},{"href":"#id-10","referenceIndex":10,"text":"2017","element":"a"},{"text":") and “mean-field theory” (","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"Poole et al.","element":"a"},{"href":"#id-11","referenceIndex":35,"text":", ","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":38,"text":"Schoenholz et al.","element":"a"},{"href":"#id-12","referenceIndex":38,"text":", ","element":"a"},{"href":"#id-12","referenceIndex":38,"text":"2017","element":"a"},{"text":"). Among numerous contributions, the pair of papers by Daniely ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al. ","element":"span"},{"text":"argued that the empirical covariance matrix of pre-activations becomes deterministic in the infinite-width limit and called this the conjugate kernel of the network. Meanwhile, from a mean-field perspective, the latter two papers studied the properties of these limiting kernels. In particular, the spectrum of the conjugate kernel of wide, fully-connected, networks approaches a well-defined and data-independent limit when the depth exceeds a certain scale, ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/0-0.png","element":"img","alt":" ξ","inline":true},{"text":". 
Networks with ","element":"span"},{"text":"tanh","element":"span"},{"text":"-nonlinearities (among other bounded activations) exhibit a phase transition between two limiting spectral distributions of the conjugate kernel as a function of their hyperparameters with ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/0-1.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"diverging at the transition. It was additionally hypothesized that networks were untrainable when the conjugate kernel was sufficiently close to its limit.","element":"span"}],[{"text":"Since then, this analysis has been extended to include a wide range of architectures such as convolutions (","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"Xiao et al.","element":"a"},{"href":"#id-13","referenceIndex":41,"text":", ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2018","element":"a"},{"text":"), recurrent networks (","element":"span"},{"href":"#id-14","referenceIndex":6,"text":"Chen et al.","element":"a"},{"href":"#id-14","referenceIndex":6,"text":", ","element":"a"},{"href":"#id-14","referenceIndex":6,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-15","referenceIndex":15,"text":"Gilboa et al.","element":"a"},{"href":"#id-15","referenceIndex":15,"text":", ","element":"a"},{"href":"#id-15","referenceIndex":15,"text":"2019","element":"a"},{"text":"), networks with residual connections (","element":"span"},{"href":"#id-16","referenceIndex":43,"text":"Yang & Schoen","element":"a"},{"href":"#id-16","referenceIndex":43,"text":"holz","element":"a"},{"href":"#id-16","referenceIndex":43,"text":", ","element":"a"},{"href":"#id-16","referenceIndex":43,"text":"2017","element":"a"},{"text":"), networks with quantized activations (","element":"span"},{"href":"#id-17","referenceIndex":5,"text":"Blumen","element":"a"},{"href":"#id-17","referenceIndex":5,"text":"feld et al.","element":"a"},{"href":"#id-17","referenceIndex":5,"text":", ","element":"a"},{"href":"#id-17","referenceIndex":5,"text":"2019","element":"a"},{"text":"), the spectrum of the Fisher information matrix (","element":"span"},{"href":"#id-18","referenceIndex":24,"text":"Karakida et al.","element":"a"},{"href":"#id-18","referenceIndex":24,"text":", ","element":"a"},{"href":"#id-18","referenceIndex":24,"text":"2018","element":"a"},{"text":"), a range of activation functions (","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"Hayou et al.","element":"a"},{"href":"#id-19","referenceIndex":18,"text":", ","element":"a"},{"href":"#id-19","referenceIndex":18,"text":"2018","element":"a"},{"text":"), and batch normalization (","element":"span"},{"href":"#id-20","referenceIndex":44,"text":"Yang et al.","element":"a"},{"href":"#id-20","referenceIndex":44,"text":", ","element":"a"},{"href":"#id-20","referenceIndex":44,"text":"2019","element":"a"},{"text":"). In each case, it was observed that the spectra of the kernels correlated strongly with whether or not the architectures were trainable. 
While these papers studied the properties of the conjugate kernels, especially the spectrum in the large-depth limit, a branch of concurrent work took a Bayesian perspective: that many networks converge to Gaussian Processes as their width becomes large (","element":"span"},{"href":"#id-21","referenceIndex":25,"text":"Lee et al.","element":"a"},{"href":"#id-21","referenceIndex":25,"text":", ","element":"a"},{"href":"#id-21","referenceIndex":25,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-22","referenceIndex":27,"text":"Matthews et al.","element":"a"},{"href":"#id-22","referenceIndex":27,"text":", ","element":"a"},{"href":"#id-22","referenceIndex":27,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-23","referenceIndex":31,"text":"Novak et al.","element":"a"},{"href":"#id-23","referenceIndex":31,"text":", ","element":"a"},{"href":"#id-23","referenceIndex":31,"text":"2019b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":14,"text":"Garriga-Alonso et al.","element":"a"},{"href":"#id-24","referenceIndex":14,"text":", ","element":"a"},{"href":"#id-24","referenceIndex":14,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-25","referenceIndex":42,"text":"Yang","element":"a"},{"href":"#id-25","referenceIndex":42,"text":", ","element":"a"},{"href":"#id-25","referenceIndex":42,"text":"2019","element":"a"},{"text":"). In this case, the Conjugate Kernel was referred to as the Neural Network Gaussian Process (NNGP) kernel, which is used to train neural networks in a fully Bayesian fashion. As such, the NNGP kernel characterizes performance of the corresponding NNGP.","element":"span"}],[{"text":"Together this work offered a significant advance to our understanding of wide neural networks; however, this theoretical progress was limited to networks at initialization or after Bayesian posterior estimation and provided no link to gradient descent. 
Moreover, there was some preliminary evidence that suggested the situation might be more nuanced than the qualitative link between the NNGP spectrum and trainability might suggest. For example, ","element":"span"},{"href":"#id-26","referenceIndex":34,"text":"Philipp et al. ","element":"a"},{"href":"#id-26","referenceIndex":34,"text":"(","element":"a"},{"href":"#id-26","referenceIndex":34,"text":"2017","element":"a"},{"text":") showed that deep ","element":"span"},{"text":"tanh ","element":"span"},{"text":"FCNs could be trained after the kernel reached its large-depth, data-independent, limit but that these networks did not generalize to unseen data.","element":"span"}],[{"text":"Recently, significant theoretical clarity has been reached regarding the relationship between the GP prior and the distribution following gradient descent. In particular, ","element":"span"},{"href":"#id-27","referenceIndex":22,"text":"Jacot ","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"et al. ","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"(","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"2018","element":"a"},{"text":") along with followup work (","element":"span"},{"href":"#id-28","referenceIndex":26,"text":"Lee et al.","element":"a"},{"href":"#id-28","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-29","referenceIndex":8,"text":"Chizat et al.","element":"a"},{"href":"#id-29","referenceIndex":8,"text":", ","element":"a"},{"href":"#id-29","referenceIndex":8,"text":"2019","element":"a"},{"text":") showed that the distribution of functions induced by gradient descent for infinite-width networks is a Gaussian Process with a particular compositional kernel known as the Neural Tangent Kernel (NTK). 
In addition to characterizing the distribution over functions following gradient descent in the wide network limit, the learning dynamics can be solved analytically throughout optimization.","element":"span"}],[{"text":"In this paper, we leverage these developments and revisit the relationship between architecture, hyperparameters, trainability, and generalization in the large-depth limit for a variety of neural networks. In particular, we make the following contributions:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Trainability. ","element":"span"},{"text":"We compute the large-depth asymptotics of several quantities related to trainability, including the largest/smallest eigenvalue of the NTK, ","element":"span"},{"style":{"height":16.48},"width":131.07,"height":41.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/1-0.png","element":"img","alt":" λmax/min","inline":true},{"text":", and the condition number ","element":"span"},{"style":{"height":16},"width":236,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/1-1.png","element":"img","alt":" κ = λmax/λmin","inline":true},{"text":"; see Table ","element":"span"},{"href":"#id-30","text":"1","element":"a"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Generalization. ","element":"span"},{"text":"We characterize the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"mean predictor ","element":"span"},{"style":{"height":16},"width":93.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/1-2.png","element":"img","alt":"P(Θ)","inline":true},{"text":", which is intimately related to the prediction of wide neural networks on the test set following gradient descent training. As such, the mean predictor is intimately related to the model’s ability to generalize. 
In particular, we argue that networks fail to generalize if the mean predictor becomes data-independent.","element":"span"}],[{"id":"id-30","style":{"width":"99%"},"width":937,"height":409,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/1-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Table 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Evolution of the NTK spectra and ","element":"figcaption","subtype":"caption"},{"style":{"height":16.89},"width":120.28,"height":42.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/1-4.png","element":"img","alt":" P(Θ(l))","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"as a function of depth ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"l","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":". ","element":"figcaption","subtype":"caption"},{"text":"The NTKs of FCN and CNN without pooling (CNN-F) are essentially the same and the scaling of ","element":"figcaption","subtype":"caption"},{"style":{"height":18.65},"width":184.05,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/1-5.png","element":"img","alt":" λ(l)max, λ(l)bulk,","inline":true},{"style":{"height":15.7},"width":199.1,"height":39.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/1-6.png","element":"img","alt":"κ(l), and ∆(l) ","inline":true,"padRight":true},{"text":"for these networks is written in black. 
Corrections to these quantities due to the addition of an average pooling layer (","element":"figcaption","subtype":"caption"},{"text":"CNN-P","element":"figcaption","subtype":"caption"},{"text":") with window size ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"d ","element":"figcaption","subtype":"caption"},{"text":"are written in blue.","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"We show that the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ordered ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"chaotic ","element":"span"},{"text":"phases identified in ","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"Poole et al. ","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"(","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"2016","element":"a"},{"text":") lead to markedly different limiting spectra of the NTK. In the ordered phase the trainability of neural networks degrades at large depths, but their ability to generalize persists. 
By contrast, in the chaotic phase we show that trainability improves with depth, but generalization degrades and neural networks behave like hash functions.","element":"span"}],[{"text":"A corollary of these differences in the spectra is that, as a function of depth, the optimal learning rates ought to decay exponentially in the chaotic phase, linearly on the order-to-chaos transition line, and remain roughly constant in the ordered phase.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"We examine the differences in the above quantities for fully-connected networks (FCNs) and convolutional networks (CNNs) with and without pooling and precisely characterize the effect of pooling on the interplay between trainability, generalization, and depth.","element":"span"}],[{"text":"In each case, we provide empirical evidence to support our theoretical conclusions. Together these results provide a complete, analytically tractable, and dataset-independent theory for learning in very deep and wide networks. Philosophically, we find that trainability and generalization are distinct notions that are, at least in this case, at odds with one another. Indeed, good conditioning of the NTK (which is a necessary condition for training) seems necessarily to lead to poor generalization performance. It will be interesting to see whether these results carry over to shallower and narrower networks. The tractable nature of the wide and deep regime leads us to conclude that these models will be an interesting testbed to investigate various theories of generalization in deep learning.","element":"span"}]]},{"heading":"2. Related Work","paragraphs":[[{"text":"Recent work ","element":"span"},{"href":"#id-27","referenceIndex":22,"text":"Jacot et al. 
","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"(","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-31","referenceIndex":13,"text":"Du et al. ","element":"a"},{"href":"#id-31","referenceIndex":13,"text":"(","element":"a"},{"href":"#id-31","referenceIndex":13,"text":"2018b","element":"a"},{"text":"); ","element":"span"},{"href":"#id-32","referenceIndex":1,"text":"Allen-","element":"a"},{"href":"#id-32","referenceIndex":1,"text":"Zhu et al. ","element":"a"},{"href":"#id-32","referenceIndex":1,"text":"(","element":"a"},{"href":"#id-32","referenceIndex":1,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-33","referenceIndex":12,"text":"Du et al. ","element":"a"},{"href":"#id-33","referenceIndex":12,"text":"(","element":"a"},{"href":"#id-33","referenceIndex":12,"text":"2018a","element":"a"},{"text":"); ","element":"span"},{"href":"#id-34","referenceIndex":45,"text":"Zou et al. ","element":"a"},{"href":"#id-34","referenceIndex":45,"text":"(","element":"a"},{"href":"#id-34","referenceIndex":45,"text":"2018","element":"a"},{"text":") and many others proved global convergence of over-parameterized deep networks by showing that the NTK essentially remains constant over the course of training. However, in a different scaling limit the NTK changes over the course of training and global convergence is much more difficult to obtain and is known for neural networks with one hidden layer ","element":"span"},{"href":"#id-35","referenceIndex":28,"text":"Mei et al. 
","element":"a"},{"href":"#id-35","referenceIndex":28,"text":"(","element":"a"},{"href":"#id-35","referenceIndex":28,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-36","referenceIndex":7,"text":"Chizat & Bach ","element":"a"},{"href":"#id-36","referenceIndex":7,"text":"(","element":"a"},{"href":"#id-36","referenceIndex":7,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-37","referenceIndex":40,"text":"Sirig","element":"a"},{"href":"#id-37","referenceIndex":40,"text":"nano & Spiliopoulos ","element":"a"},{"href":"#id-37","referenceIndex":40,"text":"(","element":"a"},{"href":"#id-37","referenceIndex":40,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-38","referenceIndex":37,"text":"Rotskoff & Vanden-Eijnden ","element":"a"},{"href":"#id-38","referenceIndex":37,"text":"(","element":"a"},{"href":"#id-38","referenceIndex":37,"text":"2018","element":"a"},{"text":"). Therefore, understanding the training and generalization properties in this scaling limit remains a very challenging open question.","element":"span"}],[{"text":"Two other excellent recent works (","element":"span"},{"href":"#id-39","referenceIndex":19,"text":"Hayou et al.","element":"a"},{"href":"#id-39","referenceIndex":19,"text":", ","element":"a"},{"href":"#id-39","referenceIndex":19,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":23,"text":"Jacot et al.","element":"a"},{"href":"#id-40","referenceIndex":23,"text":", ","element":"a"},{"href":"#id-40","referenceIndex":23,"text":"2019","element":"a"},{"text":") also study the dynamics of ","element":"span"},{"style":{"height":18.18},"width":174.08,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-0.png","element":"img","alt":" Θ(l)(x, x′)","inline":true,"padRight":true},{"text":"for FCNs (and deconvolutions in (","element":"span"},{"href":"#id-40","referenceIndex":23,"text":"Jacot et 
al.","element":"a"},{"href":"#id-40","referenceIndex":23,"text":", ","element":"a"},{"href":"#id-40","referenceIndex":23,"text":"2019","element":"a"},{"text":")) as a function of depth and variances of the weights and biases. (","element":"span"},{"href":"#id-39","referenceIndex":19,"text":"Hayou et al.","element":"a"},{"href":"#id-39","referenceIndex":19,"text":", ","element":"a"},{"href":"#id-39","referenceIndex":19,"text":"2019","element":"a"},{"text":") investigate the role of activation functions (smooth vs. non-smooth) and skip connections. (","element":"span"},{"href":"#id-40","referenceIndex":23,"text":"Jacot et al.","element":"a"},{"href":"#id-40","referenceIndex":23,"text":", ","element":"a"},{"href":"#id-40","referenceIndex":23,"text":"2019","element":"a"},{"text":") demonstrate that batch normalization helps remove the “ordered phase” (as in (","element":"span"},{"href":"#id-20","referenceIndex":44,"text":"Yang et al.","element":"a"},{"href":"#id-20","referenceIndex":44,"text":", ","element":"a"},{"href":"#id-20","referenceIndex":44,"text":"2019","element":"a"},{"text":")) and that a layer-dependent learning rate allows every layer in a network to contribute to learning.","element":"span"}]]},{"heading":"3. Background","paragraphs":[[{"text":"We summarize recent developments in the study of wide random networks. We will keep our discussion relatively informal; see e.g. (","element":"span"},{"href":"#id-23","referenceIndex":31,"text":"Novak et al.","element":"a"},{"href":"#id-23","referenceIndex":31,"text":", ","element":"a"},{"href":"#id-23","referenceIndex":31,"text":"2019b","element":"a"},{"text":") for a more rigorous version of these arguments. To simplify this discussion and as a warm-up for the main text, we will consider the case of FCNs. 
Consider a fully-connected network of depth ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"where each layer has a width ","element":"span"},{"style":{"height":14.19},"width":71.06,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-1.png","element":"img","alt":" N (l) ","inline":true,"padRight":true},{"text":"and an activation function ","element":"span"},{"style":{"height":14},"width":195.18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-2.png","element":"img","alt":"φ : R → R","inline":true},{"text":". In the main text we will restrict our discussion to ","element":"span"},{"style":{"height":14},"width":137.86,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-3.png","element":"img","alt":" φ = erf","inline":true,"padRight":true},{"text":"or ","element":"span"},{"text":"tanh ","element":"span"},{"text":"for clarity; however, we include results for a range of architectures including ","element":"span"},{"style":{"height":14},"width":186.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-4.png","element":"img","alt":" φ = ReLU","inline":true,"padRight":true},{"text":"with and without skip connections and layer normalization in the supplementary material (see Sec. ","element":"span"},{"text":"C","element":"span"},{"text":"). We find that the high-level picture described here applies to a wide range of architectural components, though important specifics, such as the phase diagram, can vary substantially. For simplicity, we will take the width of the hidden layers to infinity sequentially: ","element":"span"},{"style":{"height":17.39},"width":497.48,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-5.png","element":"img","alt":" N (1) → ∞, . . . , N (L−1) → ∞","inline":true},{"text":". 
The network is parameterized by weights and biases that we take to be randomly initialized with ","element":"span"},{"style":{"height":23.52},"width":333.2,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-6.png","element":"img","alt":" W (l)ij , b(l)i ∼ N(0, 1)","inline":true,"padRight":true},{"text":"along with hyperparameters, ","element":"span"},{"style":{"height":13.19},"width":160.29,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-7.png","element":"img","alt":" σw and σb","inline":true,"padRight":true},{"text":"that set the scale of the weights and biases respectively. Letting the ","element":"span"},{"style":{"height":13.78},"width":35.48,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-8.png","element":"img","alt":" ith ","inline":true,"padRight":true},{"text":"pre-activation in the ","element":"span"},{"style":{"height":13.79},"width":34.43,"height":34.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-9.png","element":"img","alt":"lth","inline":true,"padRight":true},{"text":"layer due to an input ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"be given by ","element":"span"},{"style":{"height":21.12},"width":111.7,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-10.png","element":"img","alt":" z(l)i (x)","inline":true},{"text":", the network","element":"span"}],[{"text":"is then described by the recursion, for ","element":"span"},{"style":{"height":13.2},"width":244.62,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-11.png","element":"img","alt":" 0 ≤ l ≤ L − 1,","inline":true}],[{"id":"id-41","style":{"width":"97%"},"width":913,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-12.png","element":"img"}],[{"text":"Notice that as 
","element":"span"},{"style":{"height":14.58},"width":184.98,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-13.png","element":"img","alt":" N (l) → ∞","inline":true},{"text":", the sum ends up being over a large number of random variables and we can invoke the central limit theorem to conclude that the ","element":"span"},{"style":{"height":23.16},"width":283.82,"height":57.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-14.png","element":"img","alt":" {z(l+1)i }i∈[N (l+1)]","inline":true,"padRight":true},{"text":"are i.i.d. Gaussian with zero mean. Given a dataset of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"points, the distribution over pre-activations can therefore be described completely by the covariance matrix, i.e. the NNGP kernel, between neurons in different inputs ","element":"span"},{"style":{"height":21.12},"width":519.93,"height":52.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-15.png","element":"img","alt":"K(l)(x, x′) = E[z(l)i (x)z(l)i (x′)].","inline":true,"padRight":true},{"text":"Inspecting Equation ","element":"span"},{"href":"#id-41","text":"1","element":"a"},{"text":", we ","element":"span"},{"text":"see that ","element":"span"},{"style":{"height":14.18},"width":106,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-16.png","element":"img","alt":" K(l+1) ","inline":true,"padRight":true},{"text":"can be computed in terms of ","element":"span"},{"style":{"height":14.58},"width":111.74,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-17.png","element":"img","alt":" K(l) as","inline":true}],[{"id":"id-42","style":{"width":"84%"},"width":790,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-18.png","element":"img"}],[{"text":"Equation ","element":"span"},{"href":"#id-42","text":"2 
","element":"a"},{"text":"describes a dynamical system on positive semi-definite matrices ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":". It was shown in ","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"Poole et al. ","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"(","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"2016","element":"a"},{"text":") that fixed points, ","element":"span"},{"style":{"height":16},"width":155.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-19.png","element":"img","alt":" K∗(x, x′)","inline":true},{"text":", of these dynamics exist such that ","element":"span"},{"style":{"height":18.18},"width":544.29,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-20.png","element":"img","alt":" liml→∞ K(l)(x, x′) = K∗(x, x′)","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":16},"width":209.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-21.png","element":"img","alt":" K∗(x, x′) =","inline":true},{"style":{"height":16.79},"width":401.45,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-22.png","element":"img","alt":"q∗[δx,x′ + c∗(1 − δx,x′)]","inline":true,"padRight":true},{"text":"independent of the inputs ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-23.png","element":"img","alt":"x′","inline":true},{"text":". 
The values of ","element":"span"},{"style":{"height":14.4},"width":148.43,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-24.png","element":"img","alt":" q∗ and c∗ ","inline":true,"padRight":true},{"text":"are determined by the hyperparameters, ","element":"span"},{"style":{"height":13.59},"width":162.48,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-25.png","element":"img","alt":" σw and σb","inline":true},{"text":". However, Equation ","element":"span"},{"href":"#id-42","text":"2 ","element":"a"},{"text":"admits multiple fixed points (e.g. ","element":"span"},{"style":{"height":14.18},"width":146.34,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-26.png","element":"img","alt":" c∗ = 0, 1","inline":true},{"text":") and the stability of these fixed points plays a significant role in determining the properties of the network. Generically, there are large regions of the ","element":"span"},{"style":{"height":16},"width":136.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-27.png","element":"img","alt":"(σw, σb)","inline":true,"padRight":true},{"text":"plane in which the fixed-point structure is constant, punctuated by curves, called phase transitions, where the structure changes; see Fig. ","element":"span"},{"href":"#id-43","text":"5 ","element":"a"},{"text":"for ","element":"span"},{"text":"tanh","element":"span"},{"text":"-networks.","element":"span"}],[{"text":"The rate at which ","element":"span"},{"style":{"height":16},"width":136.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-28.png","element":"img","alt":" K(x, x′)","inline":true,"padRight":true},{"text":"approaches or departs ","element":"span"},{"style":{"height":16},"width":155.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-29.png","element":"img","alt":" K∗(x, 
x′)","inline":true,"padRight":true},{"text":"can be determined by expanding Equation ","element":"span"},{"href":"#id-42","text":"2 ","element":"a"},{"text":"about its fixed point, ","element":"span"},{"style":{"height":16},"width":661.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-30.png","element":"img","alt":" δK(x, x′) = K(x, x′) − K∗(x, x′) to find","inline":true}],[{"style":{"width":"89%"},"width":838,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-31.png","element":"img"}],[{"text":"with ","element":"span"},{"style":{"height":20.68},"width":627.06,"height":51.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-32.png","element":"img","alt":"˙T (K) = E(z1,z2)∼N (0,K)[ ˙φ(z1) ˙φ(z2)]","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.21},"width":24,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-33.png","element":"img","alt":"˙φ","inline":true,"padRight":true},{"text":"is the derivative of ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-34.png","element":"img","alt":" φ","inline":true},{"text":". 
This expansion naturally exhibits exponential convergence to - or divergence from - the fixed-point as ","element":"span"},{"style":{"height":18.18},"width":432.18,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-35.png","element":"img","alt":" δK(l)(x, x′) ∼ χ(x, x′)l","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":195.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-36.png","element":"img","alt":" χ(x, x′) =","inline":true},{"style":{"height":18.83},"width":266.6,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-37.png","element":"img","alt":"σ2w ˙T (K∗(x, x′))","inline":true},{"text":". Since ","element":"span"},{"style":{"height":16},"width":155.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-38.png","element":"img","alt":" K∗(x, x′)","inline":true,"padRight":true},{"text":"does not depend on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"or ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-39.png","element":"img","alt":" x′ ","inline":true,"padRight":true},{"text":"it follows that ","element":"span"},{"style":{"height":16},"width":130.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-40.png","element":"img","alt":" χ(x, x′)","inline":true,"padRight":true},{"text":"will take on a single value, ","element":"span"},{"style":{"height":10},"width":67.82,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-41.png","element":"img","alt":" χc∗,","inline":true,"padRight":true},{"text":"whenever ","element":"span"},{"style":{"height":15.2},"width":121.06,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-42.png","element":"img","alt":" x ̸= 
x′","inline":true},{"text":". If ","element":"span"},{"style":{"height":14},"width":139.34,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-43.png","element":"img","alt":" χc∗ < 1","inline":true,"padRight":true},{"text":"then this ","element":"span"},{"style":{"height":10.98},"width":46.94,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-44.png","element":"img","alt":" K∗","inline":true,"padRight":true},{"text":"fixed point is stable, but if ","element":"span"},{"style":{"height":14},"width":134.71,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-45.png","element":"img","alt":" χc∗ > 1","inline":true,"padRight":true},{"text":"then the fixed point is unstable and, as discussed above, the system will converge to a different fixed point. If ","element":"span"},{"style":{"height":14},"width":139.34,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-46.png","element":"img","alt":" χc∗ = 1","inline":true,"padRight":true},{"text":"then the hyperparameters lie at a phase transition and convergence is non-exponential. As was shown in ","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"Poole et al. ","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"(","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"2016","element":"a"},{"text":"), there is always a fixed-point at ","element":"span"},{"style":{"height":10.98},"width":111.66,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-47.png","element":"img","alt":" c∗ = 1","inline":true,"padRight":true},{"text":"whose stability is determined by ","element":"span"},{"style":{"height":10},"width":40.93,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-48.png","element":"img","alt":" χ1","inline":true},{"text":". 
When this fixed point is stable, the network is in the so-called ordered phase, since any pair of inputs will converge to identical outputs. The line defined by ","element":"span"},{"style":{"height":14},"width":118.67,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/2-49.png","element":"img","alt":" χ1 = 1","inline":true,"padRight":true},{"text":"defines the order-to-chaos transition separating the ordered ","element":"span"},{"text":"phase from the “chaotic” phase (where ","element":"span"},{"style":{"height":11.78},"width":108.75,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-0.png","element":"img","alt":" c∗ < 1","inline":true},{"text":"). Note that ","element":"span"},{"style":{"height":10},"width":54.18,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-1.png","element":"img","alt":"χc∗","inline":true,"padRight":true},{"text":"can be used to define a depth-scale, ","element":"span"},{"style":{"height":16},"width":321.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-2.png","element":"img","alt":" ξc∗ = −1/ log(χc∗)","inline":true,"padRight":true},{"text":"that describes the number of layers over which ","element":"span"},{"style":{"height":14.19},"width":65.63,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-3.png","element":"img","alt":" K(l)","inline":true,"padRight":true},{"text":"approaches ","element":"span"},{"style":{"height":10.99},"width":60.26,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-4.png","element":"img","alt":" K∗.","inline":true}],[{"text":"This provides a precise characterization of the NNGP kernel at large depths. 
As discussed above, recent work (","element":"span"},{"href":"#id-27","referenceIndex":22,"text":"Jacot et al.","element":"a"},{"href":"#id-27","referenceIndex":22,"text":", ","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-28","referenceIndex":26,"text":"Lee et al.","element":"a"},{"href":"#id-28","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-29","referenceIndex":8,"text":"Chizat et al.","element":"a"},{"href":"#id-29","referenceIndex":8,"text":", ","element":"a"},{"href":"#id-29","referenceIndex":8,"text":"2019","element":"a"},{"text":") has connected the prior described by the NNGP with the result of gradient descent training using a quantity called the NTK. To construct the NTK, suppose we enumerate all the parameters in the fully-connected network described above by ","element":"span"},{"style":{"height":13.19},"width":39.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-5.png","element":"img","alt":" θα","inline":true},{"text":". The finite width NTK is defined by ","element":"span"},{"style":{"height":18.83},"width":394.43,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-6.png","element":"img","alt":"ˆΘ(x, x′) = J(x)J(x′)T","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":17.93},"width":322.19,"height":44.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-7.png","element":"img","alt":"Jiα(x) = ∂θαzLi (x)","inline":true,"padRight":true},{"text":"is the Jacobian evaluated at a point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". ","element":"span"},{"text":"The main result in ","element":"span"},{"href":"#id-27","referenceIndex":22,"text":"Jacot et al. 
","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"(","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"2018","element":"a"},{"text":") was to show that in the infinite-width limit, the NTK converges to a deterministic kernel ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-8.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"and remains constant over the course of training. As such, at a time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"during gradient descent training with an MSE loss, the expected outputs of an infinitely wide network, ","element":"span"},{"style":{"height":17.93},"width":291.5,"height":44.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-9.png","element":"img","alt":" µt(x) = E[zLi (x)]","inline":true},{"text":", evolve as","element":"span"}],[{"id":"id-44","style":{"width":"96%"},"width":902,"height":163,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-10.png","element":"img"}],[{"id":"id-45","text":"for train and test points respectively; see Section 2 in ","element":"span"},{"href":"#id-28","referenceIndex":26,"text":"Lee ","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"et al. ","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"(","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"2019","element":"a"},{"text":"). 
Here ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-11.png","element":"img","alt":" Θ","inline":true},{"text":"test, train ","element":"span"},{"text":"denotes the NTK between the test inputs ","element":"span"},{"style":{"height":13.19},"width":72,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-12.png","element":"img","alt":" Xtest","inline":true,"padRight":true},{"text":"and training inputs ","element":"span"},{"style":{"height":13.19},"width":84.2,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-13.png","element":"img","alt":" Xtrain","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-14.png","element":"img","alt":" Θ","inline":true},{"text":"train, train ","element":"span"},{"text":"is defined similarly. 
Since ","element":"span"},{"style":{"height":14.83},"width":31,"height":37.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-15.png","element":"img","alt":"ˆΘ","inline":true,"padRight":true},{"text":"converges to ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-16.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"as the network’s width approaches infinity, the gradient flow dynamics of real networks also converge to the dynamics described by Equation ","element":"span"},{"href":"#id-44","text":"5 ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-45","text":"6 ","element":"a"},{"text":"(","element":"span"},{"href":"#id-27","referenceIndex":22,"text":"Jacot et al.","element":"a"},{"href":"#id-27","referenceIndex":22,"text":", ","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-28","referenceIndex":26,"text":"Lee et al.","element":"a"},{"href":"#id-28","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-29","referenceIndex":8,"text":"Chizat et al.","element":"a"},{"href":"#id-29","referenceIndex":8,"text":", ","element":"a"},{"href":"#id-29","referenceIndex":8,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-25","referenceIndex":42,"text":"Yang","element":"a"},{"href":"#id-25","referenceIndex":42,"text":", ","element":"a"},{"href":"#id-25","referenceIndex":42,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-46","referenceIndex":2,"text":"Arora et al.","element":"a"},{"href":"#id-46","referenceIndex":2,"text":", ","element":"a"},{"href":"#id-46","referenceIndex":2,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-47","referenceIndex":21,"text":"Huang & 
Yau","element":"a"},{"href":"#id-47","referenceIndex":21,"text":", ","element":"a"},{"href":"#id-47","referenceIndex":21,"text":"2019","element":"a"},{"text":"). As the training time, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", tends to infinity we note that these equations reduce to ","element":"span"},{"style":{"height":16},"width":336.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-17.png","element":"img","alt":" µ(Xtrain) = Ytrain and","inline":true},{"style":{"height":16},"width":212.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-18.png","element":"img","alt":"µ(Xtest) = Θ","inline":true},{"text":"test, train","element":"span"},{"style":{"height":14.42},"width":71.9,"height":36.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-19.png","element":"img","alt":"Θ−1","inline":true},{"text":"train, train","element":"span"},{"style":{"height":13.19},"width":74.32,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-20.png","element":"img","alt":"Ytrain","inline":true},{"text":". Consequently we call","element":"span"}],[{"style":{"width":"72%"},"width":681,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-21.png","element":"img"}],[{"text":"the “mean predictor”. We can also compute the mean predictor of the NNGP kernel, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":")","element":"span"},{"text":", which analogously can be used to find the mean of the posterior after Bayesian inference. 
We will discuss the connection between the mean predictor and generalization in the next section.","element":"span"}],[{"text":"In addition to showing that the NTK describes networks during gradient descent, ","element":"span"},{"href":"#id-27","referenceIndex":22,"text":"Jacot et al. ","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"(","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"2018","element":"a"},{"text":") showed that the NTK could be computed in closed form in terms of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":", ","element":"span"},{"style":{"height":16.03},"width":34,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-22.png","element":"img","alt":"˙T","inline":true,"padRight":true},{"text":", and the NNGP as,","element":"span"}],[{"id":"id-49","style":{"width":"97%"},"width":912,"height":77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-23.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.19},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-24.png","element":"img","alt":" Θ(l) ","inline":true,"padRight":true},{"text":"is the NTK for the pre-activations at layer-","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":".","element":"span"}]]},{"heading":"4. Metrics for Trainability and Generalization at Large Depth","paragraphs":[[{"text":"We begin by discussing the interplay between the conditioning of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-25.png","element":"img","alt":" Θ","inline":true},{"text":"train, train ","element":"span"},{"text":"and the trainability of wide networks. 
We can write Equation ","element":"span"},{"href":"#id-44","text":"5 ","element":"a"},{"text":"in terms of the spectrum of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-26.png","element":"img","alt":" Θ","inline":true},{"text":"train, train","element":"span"},{"text":". To do this we write the eigendecomposition of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-27.png","element":"img","alt":" Θ","inline":true},{"text":"train, train ","element":"span"},{"text":"as ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-28.png","element":"img","alt":"Θ","inline":true},{"text":"train, train ","element":"span"},{"style":{"height":13.39},"width":163.83,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-29.png","element":"img","alt":" = U T DU","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"a diagonal matrix of eigenvalues and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"a unitary matrix. 
In this case Equation ","element":"span"},{"href":"#id-44","text":"5 ","element":"a"},{"text":"can be written as,","element":"span"}],[{"id":"id-48","style":{"width":"78%"},"width":734,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-30.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.19},"width":34.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-31.png","element":"img","alt":" λi","inline":true,"padRight":true},{"text":"are the eigenvalues of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-32.png","element":"img","alt":" Θ","inline":true},{"text":"train, train ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":16},"width":197.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-33.png","element":"img","alt":" ˜µt(Xtrain) =","inline":true},{"style":{"height":16},"width":187.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-34.png","element":"img","alt":"Uµt(Xtrain)","inline":true},{"text":", ","element":"span"},{"style":{"height":17.22},"width":246.9,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-35.png","element":"img","alt":"˜Ytrain = UYtrain","inline":true,"padRight":true},{"text":"are the mean prediction and the labels respectively written in the eigenbasis of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-36.png","element":"img","alt":" Θ","inline":true},{"text":"train,train","element":"span"},{"text":". 
If we order the eigenvalues such that ","element":"span"},{"style":{"height":13.2},"width":247.16,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-37.png","element":"img","alt":" λ0 ≥ · · · ≥ λm","inline":true,"padRight":true},{"text":"then it has been hypothesized","element":"span"},{"text":"2 ","element":"span"},{"text":"in e.g. ","element":"span"},{"href":"#id-28","referenceIndex":26,"text":"Lee et al. ","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"(","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"2019","element":"a"},{"text":") that the maximum feasible learning rate scales as ","element":"span"},{"style":{"height":16},"width":174.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-38.png","element":"img","alt":" η ∼ 2/λ0","inline":true,"padRight":true},{"text":"as we verify empirically in section 4. Plugging this scaling for ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-39.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"into Equation ","element":"span"},{"href":"#id-48","text":"9 ","element":"a"},{"text":"we see that the mode associated with the smallest eigenvalue will converge exponentially at a rate given by ","element":"span"},{"style":{"height":16},"width":62.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-40.png","element":"img","alt":" 1/κ","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":207.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-41.png","element":"img","alt":"κ = λ0/λm","inline":true,"padRight":true},{"text":"is the condition number. 
It follows that if the condition number of the NTK associated with a neural network diverges then it will become untrainable and so we use ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-42.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"as a metric for trainability.","element":"span"}],[{"text":"We will see that at large depths, the spectrum of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-43.png","element":"img","alt":" Θ","inline":true},{"text":"train, train ","element":"span"},{"text":"typically features a single large eigenvalue, ","element":"span"},{"style":{"height":13.19},"width":71.34,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-44.png","element":"img","alt":" λmax","inline":true},{"text":", and then a gap that is large compared with the rest of the spectrum. We therefore will often refer to a typical eigenvalue in the bulk as ","element":"span"},{"style":{"height":13.19},"width":84.46,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-45.png","element":"img","alt":" λbulk","inline":true,"padRight":true},{"text":"and approximate the condition number as ","element":"span"},{"style":{"height":16},"width":265.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-46.png","element":"img","alt":"κ = λmax/λbulk.","inline":true}],[{"text":"We now turn our attention to generalization. 
At large depths, we will see that ","element":"span"},{"style":{"height":16.68},"width":65.68,"height":41.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-47.png","element":"img","alt":" Θ(l)","inline":true},{"text":"test, train ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":16.68},"width":65.69,"height":41.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-48.png","element":"img","alt":" Θ(l)","inline":true},{"text":"train, train ","element":"span"},{"text":"converge to their fixed ","element":"span"},{"text":"points independent of the data distribution. Consequently it is often the case that ","element":"span"},{"style":{"height":16},"width":111.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-49.png","element":"img","alt":" P(Θ∗)","inline":true,"padRight":true},{"text":"will be data-independent and the network will fail to generalize. In this case, by symmetry, it is necessarily true that ","element":"span"},{"style":{"height":16},"width":111.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-50.png","element":"img","alt":" P(Θ∗)","inline":true,"padRight":true},{"text":"will be a constant matrix. Contracting this matrix with a vector of labels ","element":"span"},{"style":{"height":13.19},"width":74.32,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-51.png","element":"img","alt":" Ytrain","inline":true,"padRight":true},{"text":"that have been standardized to have zero mean, it follows that ","element":"span"},{"style":{"height":16},"width":268.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-52.png","element":"img","alt":" P(Θ∗)Ytrain = 0","inline":true,"padRight":true},{"text":"and the network will output zero in expectation on all test points. Clearly, in this setting the network will not be able to generalize. 
At large, but finite, depths the generalization performance of the network can be quantified by considering the rate at which ","element":"span"},{"style":{"height":18.19},"width":204.57,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-53.png","element":"img","alt":" P(Θ(l))Ytrain","inline":true,"padRight":true},{"text":"decays to zero. There are cases, however, where despite the data-independence of ","element":"span"},{"style":{"height":18.18},"width":419.16,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-54.png","element":"img","alt":" Θ∗, liml→∞ P(Θ(l))Ytrain","inline":true,"padRight":true},{"text":"remains nonzero and the network can continue to generalize even in the asymptotic limit. In either case, we will show that precisely characterizing ","element":"span"},{"style":{"height":18.19},"width":204.57,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/3-55.png","element":"img","alt":" P(Θ(l))Ytrain","inline":true,"padRight":true},{"text":"allows us to understand exactly where networks can, and cannot, generalize.","element":"span"}],[{"text":"Our goal is therefore to characterize the evolution of the two metrics ","element":"span"},{"style":{"height":14.19},"width":57.65,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-0.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.19},"width":130.76,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-1.png","element":"img","alt":" P(Θ(l))","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":". We follow the methodology outlined in ","element":"span"},{"href":"#id-12","referenceIndex":38,"text":"Schoenholz et al. 
","element":"a"},{"href":"#id-12","referenceIndex":38,"text":"(","element":"a"},{"href":"#id-12","referenceIndex":38,"text":"2017","element":"a"},{"text":"); ","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"Xiao et al. ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"(","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2018","element":"a"},{"text":") to explore the spectrum of the NTK as a function of depth. We will use this to make precise predictions relating trainability and generalization to the hyperparameters ","element":"span"},{"style":{"height":16},"width":166.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-2.png","element":"img","alt":" (σw, σb, l)","inline":true},{"text":". Our main results are summarized in Table ","element":"span"},{"href":"#id-30","text":"1 ","element":"a"},{"text":"which describes the evolution of ","element":"span"},{"style":{"height":18.54},"width":82.34,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-3.png","element":"img","alt":" λ(l)max ","inline":true,"padRight":true},{"text":"(the largest eigenvalue of ","element":"span"},{"style":{"height":21.49},"width":253.93,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-4.png","element":"img","alt":" Θ(l)), λ(l)bulk (the","inline":true,"padRight":true},{"text":"remaining eigenvalues), ","element":"span"},{"style":{"height":14.19},"width":57.66,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-5.png","element":"img","alt":" κ(l)","inline":true},{"text":", and ","element":"span"},{"style":{"height":18.19},"width":130.75,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-6.png","element":"img","alt":" P(Θ(l))","inline":true,"padRight":true},{"text":"as a function of depth for three different network configurations (the ordered phase, the chaotic phase, and the phase 
transition). We study the dependence on: the size of the training set, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"; the choices of architecture including fully-connected networks (FCN), convolutional networks with flattening (CNN-F), and convolutions with pooling (CNN-P); and the size, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", of the window in the pooling layer (which we always take to be the penultimate layer).","element":"span"}],[{"text":"Before discussing the methodology it is useful to first give a qualitative overview of the phenomenology. We find identical phenomenology between FCNs and CNN-F architectures. In the ordered phase, ","element":"span"},{"style":{"height":17.39},"width":238.55,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-7.png","element":"img","alt":" Θ(l) → p∗11T","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":19.88},"width":218.31,"height":49.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-8.png","element":"img","alt":" λ(l)max → mp∗","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.49},"width":278.8,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-9.png","element":"img","alt":" λ(l)bulk = O(lχl1)","inline":true},{"text":". 
At large depths, since ","element":"span"},{"style":{"height":14},"width":135.6,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-10.png","element":"img","alt":" χ1 < 1","inline":true,"padRight":true},{"text":"it ","element":"span"},{"text":"follows that ","element":"span"},{"style":{"height":18.18},"width":298.38,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-11.png","element":"img","alt":" κ(l) ≳ mp∗/(lχl1)","inline":true,"padRight":true},{"text":"and so the condition number ","element":"span"},{"text":"diverges exponentially quickly. Thus, in the ordered phase we expect networks not to be trainable (or, specifically, the time they take to learn will grow exponentially in their depth). Here ","element":"span"},{"style":{"height":18.18},"width":130.76,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-12.png","element":"img","alt":" P(Θ(l))","inline":true,"padRight":true},{"text":"converges to a data-dependent constant independent of depth; thus, in the ordered phase networks fail to train but can generalize indefinitely.","element":"span"}],[{"text":"By contrast, in the chaotic phase we see that there is no gap between ","element":"span"},{"style":{"height":18.54},"width":82.34,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-13.png","element":"img","alt":" λ(l)max","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.49},"width":84.46,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-14.png","element":"img","alt":" λ(l)bulk","inline":true,"padRight":true},{"text":"and networks become perfectly ","element":"span"},{"text":"conditioned and are trainable everywhere. 
However, in this regime we see that the mean predictor scales as ","element":"span"},{"style":{"height":17.39},"width":186.47,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-15.png","element":"img","alt":" l(χc∗/χ1)l.","inline":true,"padRight":true},{"text":"Since in the chaotic phase ","element":"span"},{"style":{"height":14},"width":134.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-16.png","element":"img","alt":" χc∗ < 1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":119.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-17.png","element":"img","alt":" χ1 > 1","inline":true,"padRight":true},{"text":"it follows that ","element":"span"},{"style":{"height":18.18},"width":230.46,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-18.png","element":"img","alt":" P(Θ(l)) → 0","inline":true,"padRight":true},{"text":"over a depth ","element":"span"},{"style":{"height":16},"width":388.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-19.png","element":"img","alt":" ξ∗ = −1/ log(χc∗/χ1)","inline":true},{"text":". Thus, in the chaotic phase, networks fail to generalize at a finite depth but remain trainable indefinitely. Finally, introducing pooling modestly augments the depth over which networks can generalize in the chaotic phase but reduces the depth in the ordered phase. We will explore all of these predictions in detail in section ","element":"span"},{"text":"7","element":"span"},{"text":".","element":"span"}]]},{"heading":"5. A Toy Example: RBF Kernel","paragraphs":[[{"text":"To provide more intuition about our analysis, we present a toy example using RBF kernels which already shares some core observations for deep neural networks. 
Consider a Gaussian process along with the RBF kernel given by,","element":"span"}],[{"style":{"width":"78%"},"width":736,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-20.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14},"width":211.7,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-21.png","element":"img","alt":" x, x′ ∈ Xtrain","inline":true,"padRight":true},{"text":"along with a bandwidth ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h > ","element":"span"},{"text":"0","element":"span"},{"text":". Note that ","element":"span"},{"style":{"height":16},"width":221.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-22.png","element":"img","alt":" Kh(x, x) = 1","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
Consider the following two cases.","element":"span"}],[{"text":"If the bandwidth is given by ","element":"span"},{"style":{"height":13.38},"width":123.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-23.png","element":"img","alt":" h = 2l","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.2},"width":132.44,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-24.png","element":"img","alt":" l → ∞","inline":true},{"text":", then ","element":"span"},{"style":{"height":17.38},"width":515.36,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-25.png","element":"img","alt":"Kh(x, x′) ≈ 1 − 2−l∥x − x′∥22","inline":true,"padRight":true},{"text":"which converges to ","element":"span"},{"text":"1 ","element":"span"},{"text":"exponentially ","element":"span"},{"text":"fast. Thus, the largest eigenvalue of ","element":"span"},{"style":{"height":13.19},"width":52.84,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-26.png","element":"img","alt":" Kh","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":16},"width":238.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-27.png","element":"img","alt":"λmax ≈ |Xtrain|","inline":true,"padRight":true},{"text":"and the bulk is of order ","element":"span"},{"style":{"height":15.78},"width":185.8,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-28.png","element":"img","alt":" λbulk ≈ 2−l","inline":true},{"text":". 
Thus, the condition number satisfies ","element":"span"},{"style":{"height":16.99},"width":115.52,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-29.png","element":"img","alt":" κ ≳ 2l","inline":true,"padRight":true},{"text":", which diverges with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":". We will see that in the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Ordered Phase ","element":"span"},{"style":{"height":14.18},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-30.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"behaves qualitatively similarly to this setting.","element":"span"}],[{"text":"On the other hand, if the bandwidth is given by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/l ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":11.2},"width":124,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-31.png","element":"img","alt":"l → ∞","inline":true,"padRight":true},{"text":"then the off-diagonals ","element":"span"},{"style":{"height":16},"width":426.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-32.png","element":"img","alt":" Kh(x, x′) = exp(−l∥x −","inline":true},{"style":{"height":17.38},"width":181.9,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-33.png","element":"img","alt":"x′∥22) → 0","inline":true},{"text":". For large ","element":"span"},{"style":{"height":13.2},"width":89.92,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-34.png","element":"img","alt":" l, Kh","inline":true,"padRight":true},{"text":"is very close to the identity ","element":"span"},{"text":"matrix and its condition number is almost 1. 
In the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Chaotic Phase","element":"span"},{"text":", ","element":"span"},{"style":{"height":14.19},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-35.png","element":"img","alt":" Θ(l) ","inline":true,"padRight":true},{"text":"is qualitatively similar to ","element":"span"},{"style":{"height":13.19},"width":64.5,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-36.png","element":"img","alt":" Kh.","inline":true}]]},{"heading":"6. Large-Depth Asymptotics of the NNGP and NTK","paragraphs":[[{"text":"We now give a brief derivation of the results in Table ","element":"span"},{"href":"#id-30","text":"1","element":"a"},{"text":". Details can be found in Secs. ","element":"span"},{"text":"B","element":"span"},{"text":" and ","element":"span"},{"text":"D ","element":"span"},{"text":"in the appendix. To simplify notation, we will discuss fully-connected networks and then extend the results to CNNs with pooling (CNN-P) and without pooling (CNN-F).","element":"span"}],[{"text":"As in Sec. ","element":"span"},{"text":"3","element":"span"},{"text":", we will be concerned with the fixed points of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-37.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"as well as the linearization of Equation ","element":"span"},{"href":"#id-49","text":"8 ","element":"a"},{"text":"about its fixed point. Recall that the fixed point structure is invariant within a phase, so it suffices to consider the ordered phase, the chaotic phase, and the critical line separately. 
In cases where a stable fixed point exists, we will describe how ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-38.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"converges to the fixed point. We will see that in the chaotic phase and on the critical line, ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-39.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"has no stable fixed point, and in that case we will describe its divergence. As above, in each case the fixed points of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-40.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"have a simple structure with ","element":"span"},{"style":{"height":17.38},"width":509.86,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-41.png","element":"img","alt":" Θ∗ = p∗((1 − ˆc∗)Id + ˆc∗11T ).","inline":true}],[{"text":"To simplify the forthcoming analysis, without loss of generality, we assume the inputs are normalized to have variance ","element":"span"},{"style":{"height":16.58},"width":61.49,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-42.png","element":"img","alt":" q∗ 3","inline":true},{"text":". 
As such, we can treat ","element":"span"},{"style":{"height":16.03},"width":143.16,"height":40.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-43.png","element":"img","alt":" T and ˙T","inline":true,"padRight":true},{"text":", restricted to ","element":"span"},{"style":{"height":18.19},"width":117.93,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-44.png","element":"img","alt":"{K(l)}l","inline":true},{"text":", as point-wise functions. To see this, note that with this normalization ","element":"span"},{"style":{"height":18.19},"width":474.29,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-45.png","element":"img","alt":" K(l)(x, x) = q∗ for all l and x","inline":true},{"text":". It follows that both ","element":"span"},{"style":{"height":18.18},"width":277.2,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-46.png","element":"img","alt":" T (K(l+1))(x, x′)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.83},"width":277.2,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-47.png","element":"img","alt":"˙T (K(l+1))(x, x′)","inline":true,"padRight":true},{"text":"depend only on ","element":"span"},{"style":{"height":18.18},"width":183.52,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-48.png","element":"img","alt":" K(l)(x, x′).","inline":true}],[{"text":"Since all of the off-diagonal elements approach the same fixed point at the same rate, we use ","element":"span"},{"style":{"height":21.49},"width":307.82,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-49.png","element":"img","alt":" q(l)ab ≡ K(l)(x, x′)","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":21.49},"width":304.15,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-50.png","element":"img","alt":" p(l)ab ≡ Θ(l)(x, x′)","inline":true,"padRight":true},{"text":"to denote any off diagonal entry ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":14.59},"width":210.5,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-51.png","element":"img","alt":" K(l) and Θ(l) ","inline":true,"padRight":true},{"text":"respectively. We will similarly use ","element":"span"},{"style":{"height":15.71},"width":118.17,"height":39.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-52.png","element":"img","alt":" q∗ab and","inline":true},{"style":{"height":15.5},"width":51.34,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-53.png","element":"img","alt":"p∗ab","inline":true,"padRight":true},{"text":"to denote the limits, ","element":"span"},{"style":{"height":21.49},"width":447.83,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-54.png","element":"img","alt":" liml→∞ q(l)ab = q∗ab = c∗q∗","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.49},"width":426.4,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-55.png","element":"img","alt":"liml→∞ p(l)ab = p∗ab = ˆc∗p∗","inline":true},{"text":". 
Finally, although the diagonal ","element":"span"},{"text":"entries of ","element":"span"},{"style":{"height":14.19},"width":65.63,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-56.png","element":"img","alt":" K(l)","inline":true,"padRight":true},{"text":"are all ","element":"span"},{"style":{"height":14.19},"width":35.22,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-57.png","element":"img","alt":" q∗","inline":true},{"text":", the diagonal entries of ","element":"span"},{"style":{"height":14.19},"width":65.69,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/4-58.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"can","element":"span"}],[{"id":"id-51","style":{"width":"99%"},"width":1935,"height":1097,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Condition number and mean predictor of NTKs and their rate of convergence for FCN, CNN-F and CNN-P. ","element":"figcaption","subtype":"caption"},{"text":"(a) In the chaotic phase, ","element":"figcaption","subtype":"caption"},{"style":{"height":13.29},"width":54.35,"height":33.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-1.png","element":"img","alt":" κ(l) ","inline":true,"padRight":true},{"text":"converges to 1 for all architectures. 
(b) We plot ","element":"figcaption","subtype":"caption"},{"style":{"height":16.09},"width":93.94,"height":40.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-2.png","element":"img","alt":" χl1κ(l)","inline":true},{"text":", confirming that ","element":"figcaption","subtype":"caption"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-3.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"explodes with rate ","element":"figcaption","subtype":"caption"},{"style":{"height":16.49},"width":86.44,"height":41.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-4.png","element":"img","alt":" 1/lχl1 ","inline":true,"padRight":true},{"text":"in the ordered ","element":"figcaption","subtype":"caption"},{"text":"phase. In (c) and (d), the solid lines are ","element":"figcaption","subtype":"caption"},{"style":{"height":13.29},"width":54.36,"height":33.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-5.png","element":"img","alt":" κ(l) ","inline":true,"padRight":true},{"text":"and dashed lines are the ratio between first and second eigenvalues. 
We see that, on the order-to-chaos transition, these two numbers converge to ","element":"figcaption","subtype":"caption"},{"style":{"height":17.55},"width":225.04,"height":43.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-6.png","element":"img","alt":"m+22 and dm+22","inline":true,"padRight":true},{"text":"(horizontal lines) for FC/CNN-F and CNN-P respectively, where ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"m ","element":"figcaption","subtype":"caption"},{"text":"= 12 ","element":"figcaption","subtype":"caption"},{"text":"or ","element":"figcaption","subtype":"caption"},{"text":"20 ","element":"figcaption","subtype":"caption"},{"text":"is the batch size and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"d ","element":"figcaption","subtype":"caption"},{"text":"= 36 ","element":"figcaption","subtype":"caption"},{"text":"is the spatial dimension. (e) In the chaotic phase, the mean predictor decays to zero exponentially fast. (f) In the ordered phase the mean predictor converges to a data dependent value.","element":"figcaption","subtype":"caption"}],[{"text":"vary and we denote them ","element":"span"},{"style":{"height":17.38},"width":67.2,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-7.png","element":"img","alt":" p(l).","inline":true}],[{"text":"In what follows, we split the discussion into three sections according to the values of ","element":"span"},{"style":{"height":18.83},"width":242.99,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-8.png","element":"img","alt":" χ1 ≡ σ2ω ˙T (q∗)","inline":true,"padRight":true},{"text":"recalling that in ","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"Poole et al. 
","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"(","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"2016","element":"a"},{"text":"); ","element":"span"},{"href":"#id-12","referenceIndex":38,"text":"Schoenholz et al. ","element":"a"},{"href":"#id-12","referenceIndex":38,"text":"(","element":"a"},{"href":"#id-12","referenceIndex":38,"text":"2017","element":"a"},{"text":") it was shown that ","element":"span"},{"style":{"height":10},"width":40.93,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-9.png","element":"img","alt":" χ1","inline":true,"padRight":true},{"text":"controls the fixed point structure. In each section, we analyze the evolution of (1) the entries of ","element":"span"},{"style":{"height":17.39},"width":223.42,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-10.png","element":"img","alt":" Θ(l), i.e., p(l),","inline":true},{"style":{"height":21.49},"width":54.74,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-11.png","element":"img","alt":"p(l)ab","inline":true},{"text":", (2) the spectrum ","element":"span"},{"style":{"height":18.54},"width":82.34,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-12.png","element":"img","alt":" λ(l)max","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.49},"width":84.46,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-13.png","element":"img","alt":" λ(l)bulk","inline":true},{"text":", (3) the trainability ","element":"span"},{"text":"and generalization metrics ","element":"span"},{"style":{"height":18.18},"width":267.87,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-14.png","element":"img","alt":" κ(l) and P(Θ(l))","inline":true},{"text":", and finally (4) discuss the impact on finite width 
networks.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"6.1. The Chaotic Phase ","element":"span"},{"style":{"fontStyle":"italic"},"text":"χ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"1","element":"span"},{"style":{"fontWeight":"bold"},"text":":","element":"span"}],[{"text":"The chaotic phase is so-named because it has a stable fixed-point ","element":"span"},{"style":{"height":11.79},"width":114.78,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-15.png","element":"img","alt":" c∗ < 1","inline":true},{"text":"; as such similar inputs become increasingly uncorrelated as they pass through the network. Our first result is to show that (see Sec. ","element":"span"},{"href":"#id-50","text":"B.1","element":"a"},{"text":"),","element":"span"}],[{"style":{"width":"93%"},"width":881,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-16.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"91%"},"width":856,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-17.png","element":"img"}],[{"text":"Note that ","element":"span"},{"style":{"height":10},"width":54.18,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-18.png","element":"img","alt":" χc∗","inline":true,"padRight":true},{"text":"controls the convergence of the ","element":"span"},{"style":{"height":21.49},"width":53.91,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-19.png","element":"img","alt":" q(l)ab","inline":true,"padRight":true},{"text":"and is ","element":"span"},{"text":"always less than 1 in the chaotic phase (","element":"span"},{"href":"#id-11","referenceIndex":35,"text":"Poole et al.","element":"a"},{"href":"#id-11","referenceIndex":35,"text":", 
","element":"a"},{"href":"#id-11","referenceIndex":35,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":38,"text":"Schoenholz et al.","element":"a"},{"href":"#id-12","referenceIndex":38,"text":", ","element":"a"},{"href":"#id-12","referenceIndex":38,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"Xiao et al.","element":"a"},{"href":"#id-13","referenceIndex":41,"text":", ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2018","element":"a"},{"text":"). Since ","element":"span"},{"style":{"height":14},"width":118.03,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-20.png","element":"img","alt":" χ1 > 1","inline":true},{"text":", ","element":"span"},{"style":{"height":17.38},"width":54.74,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-21.png","element":"img","alt":"p(l) ","inline":true,"padRight":true},{"text":"diverges with rate ","element":"span"},{"style":{"height":21.49},"width":199.61,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-22.png","element":"img","alt":" χl1 while p(l)ab ","inline":true,"padRight":true},{"text":"remains finite. It follows ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":18.18},"width":310.93,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-23.png","element":"img","alt":" (p(l))−1Θ(l) → Id","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":11.2},"width":127.02,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-24.png","element":"img","alt":" l → ∞","inline":true},{"text":". Thus, in the chaotic phase, the spectrum of the NTK for very deep networks approaches the diverging constant multiplying the identity. 
This implies","element":"span"}],[{"style":{"width":"92%"},"width":863,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-25.png","element":"img"}],[{"text":"Figure ","element":"span"},{"href":"#id-51","text":"1a ","element":"a"},{"text":"plots the evolution of ","element":"span"},{"style":{"height":14.19},"width":57.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-26.png","element":"img","alt":" κ(l) ","inline":true,"padRight":true},{"text":"in this phase, confirming ","element":"span"},{"style":{"height":14.59},"width":142.09,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-27.png","element":"img","alt":"κ(l) → 1","inline":true,"padRight":true},{"text":"for all three different architectures (FCN, CNN-F and CNN-P).","element":"span"}],[{"text":"We now describe the asymptotic behavior of the mean predictor. Since ","element":"span"},{"style":{"height":13.38},"width":41,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-28.png","element":"img","alt":" Θl","inline":true},{"text":"test, train ","element":"span"},{"text":"has no diagonal elements, it follows ","element":"span"},{"text":"that it remains finite at large depths and so ","element":"span"},{"style":{"height":16},"width":270.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-29.png","element":"img","alt":" P(Θ∗)Ytrain = 0.","inline":true,"padRight":true},{"text":"It follows that in the chaotic phase, the predictions of asymptotically deep neural networks on unseen test points will converge to zero exponentially quickly (see Sec. 
","element":"span"},{"href":"#id-52","text":"D.1","element":"a"},{"text":"),","element":"span"}],[{"style":{"width":"81%"},"width":763,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/5-30.png","element":"img"}],[{"text":"Neglecting the relatively slowly varying polynomial term, this implies that we expect chaotic networks to fail to generalize when their depth is much larger than a scale set by ","element":"span"},{"style":{"height":16},"width":369.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-0.png","element":"img","alt":"ξ∗ = −1/ log(χc∗/χ1)","inline":true},{"text":". We confirm this scaling in Fig ","element":"span"},{"href":"#id-51","text":"1e","element":"a"},{"text":".","element":"span"}],[{"text":"We confirm these predictions for finite-width neural network training using SGD as well as gradient-flow on infinite networks in the experimental results; see Fig ","element":"span"},{"href":"#id-53","text":"2","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"6.2. 
The Ordered Phase ","element":"span"},{"style":{"fontStyle":"italic"},"text":"χ","element":"span"},{"text":"1 ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"σ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"ω ","element":"span"},{"text":"˙","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"style":{"fontStyle":"italic"},"text":"∗","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< ","element":"span"},{"text":"1","element":"span"},{"style":{"fontWeight":"bold"},"text":":","element":"span"}],[{"text":"The ordered phase is defined by the stability of the ","element":"span"},{"style":{"height":10.98},"width":109.11,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-1.png","element":"img","alt":" c∗ = 1","inline":true,"padRight":true},{"text":"fixed point. Here disparate inputs will end up converging to the same output at the end of the network. We show in Sec. ","element":"span"},{"href":"#id-54","text":"B.2 ","element":"a"},{"text":"that elements of the NNGP kernel and NTK have asymptotic dynamics given by,","element":"span"}],[{"id":"id-55","style":{"width":"94%"},"width":884,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":307.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-3.png","element":"img","alt":" p∗ = q∗/(1 − χ1)","inline":true},{"text":". 
Here all of the entries of ","element":"span"},{"style":{"height":14.18},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-4.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"converge to the same value, ","element":"span"},{"style":{"height":14.18},"width":36.06,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-5.png","element":"img","alt":" p∗","inline":true},{"text":", and the limiting kernel has the form ","element":"span"},{"style":{"height":17.33},"width":409.7,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-6.png","element":"img","alt":" Θ∗ = p∗1n1Tm where 1m ","inline":true,"padRight":true},{"text":"is the all-ones vector of ","element":"span"},{"text":"dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"(typically ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"will correspond to the number of datapoints in the training set). The NNGP kernel has the same structure with ","element":"span"},{"style":{"height":14.18},"width":135.58,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-7.png","element":"img","alt":" p∗ ↔ q∗","inline":true},{"text":". 
Consequently, both the NNGP kernel and the NTK are highly singular and feature a single non-zero eigenvalue, ","element":"span"},{"style":{"height":14.18},"width":197.44,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-8.png","element":"img","alt":" λmax = mp∗","inline":true},{"text":", with eigenvector ","element":"span"},{"style":{"height":12.79},"width":64.85,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-9.png","element":"img","alt":" 1m.","inline":true}],[{"text":"For large-but-finite depths, ","element":"span"},{"style":{"height":14.18},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-10.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"has (approximately) two eigenspaces: the first eigenspace corresponds to finite-depth corrections to ","element":"span"},{"style":{"height":13.2},"width":83.27,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-11.png","element":"img","alt":" λmax,","inline":true}],[{"style":{"width":"92%"},"width":868,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-12.png","element":"img"}],[{"text":"The second eigenspace, which comes from lifting the degenerate zero-modes, has dimension ","element":"span"},{"style":{"height":16},"width":136.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-13.png","element":"img","alt":" (m − 1)","inline":true,"padRight":true},{"text":"with eigenvalues that scale like ","element":"span"},{"style":{"height":21.49},"width":549.6,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-14.png","element":"img","alt":" λ(l)bulk = O(p(l) − p(l)ab) = O(lχl1).","inline":true,"padRight":true},{"text":"It follows that 
","element":"span"},{"style":{"height":18.18},"width":246.59,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-15.png","element":"img","alt":"κ(l) ≳ (lχl1)−1","inline":true,"padRight":true},{"text":"and so the conditioning number explodes ","element":"span"},{"text":"exponentially quickly. We confirm the presence of the ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/l ","element":"span"},{"text":"correction term in ","element":"span"},{"style":{"height":14.19},"width":57.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-16.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"by plotting ","element":"span"},{"style":{"height":18.14},"width":100.46,"height":45.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-17.png","element":"img","alt":" χl1κ(l)","inline":true,"padRight":true},{"text":"against ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"in Fig- ","element":"span"},{"text":"ure ","element":"span"},{"href":"#id-51","text":"1b","element":"a"},{"text":". Neglecting this correction, we expect networks in the ordered phase to become untrainable when their depth exceeds a scale given by ","element":"span"},{"style":{"height":16},"width":276.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-18.png","element":"img","alt":" ξ1 = −1/ log χ1.","inline":true}],[{"text":"We now turn our discussion to the mean predictor. Equation ","element":"span"},{"href":"#id-55","text":"14 ","element":"a"},{"text":"shows that we can write the finite-depth corrections to the NTK as ","element":"span"},{"style":{"height":18.14},"width":408.42,"height":45.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-19.png","element":"img","alt":" Θ(l) = p∗11T + A(l)lχl1","inline":true},{"text":". 
Here ","element":"span"},{"style":{"height":14.18},"width":69.34,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-20.png","element":"img","alt":" A(l)","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"text":"data-dependent piece that lifts the zero eigenvalues. As shown in the appendix, ","element":"span"},{"style":{"height":14.19},"width":69.34,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-21.png","element":"img","alt":" A(l) ","inline":true,"padRight":true},{"text":"converges to ","element":"span"},{"style":{"height":12},"width":201.76,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-22.png","element":"img","alt":" A as l → ∞","inline":true},{"text":"; see Lemma ","element":"span"},{"href":"#id-56","text":"2","element":"a"},{"text":". In Sec. ","element":"span"},{"href":"#id-57","text":"D.3 ","element":"a"},{"text":"we show that despite the singular nature of ","element":"span"},{"style":{"height":13.38},"width":116.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-23.png","element":"img","alt":" Θ∗, the","inline":true,"padRight":true},{"text":"mean has a well-defined limit as,","element":"span"}],[{"style":{"width":"96%"},"width":906,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-24.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.88},"width":35,"height":37.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-25.png","element":"img","alt":"ˆA","inline":true,"padRight":true},{"text":"is some correction term. Thus, the mean predictor remains well-behaved and data-dependent even in the ","element":"span"},{"text":"infinite-depth limit. We therefore suspect that networks in the ordered phase should be able to generalize whenever they can be trained. 
We confirm the asymptotic data-dependence of the mean predictor in Figure ","element":"span"},{"href":"#id-51","text":"1f","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"6.3. The Critical Line ","element":"span"},{"style":{"fontStyle":"italic"},"text":"χ","element":"span"},{"text":"1 ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"σ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"ω ","element":"span"},{"text":"˙","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"style":{"fontStyle":"italic"},"text":"∗","element":"span"},{"text":") = 1","element":"span"}],[{"text":"On the critical line the ","element":"span"},{"style":{"height":10.98},"width":123.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-26.png","element":"img","alt":" c∗ = 1","inline":true,"padRight":true},{"text":"fixed point is marginally stable and the dynamics follow a power law. Here, both the diagonal and the off-diagonal elements of ","element":"span"},{"style":{"height":14.19},"width":65.69,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-27.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"diverge linearly in the depth with ","element":"span"},{"style":{"height":21.42},"width":401.41,"height":53.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-28.png","element":"img","alt":"1l Θ(l) → q∗3 (11T + 2Id)","inline":true},{"text":". 
The condition ","element":"span"},{"text":"number ","element":"span"},{"style":{"height":14.18},"width":57.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-29.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"converges to a finite value and the network is always trainable. However, the mean predictor decreases linearly with depth. In particular we show in Sec. ","element":"span"},{"href":"#id-58","text":"B.3","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"93%"},"width":876,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-30.png","element":"img"}],[{"text":"For large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"it follows that ","element":"span"},{"style":{"height":14.18},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-31.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"essentially has two eigenspaces: one has dimension one and the other has dimension ","element":"span"},{"style":{"height":16},"width":215.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-32.png","element":"img","alt":" (m − 1) with","inline":true}],[{"style":{"width":"95%"},"width":893,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-33.png","element":"img"}],[{"text":"It follows that the condition number ","element":"span"},{"style":{"height":19.68},"width":281.71,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-34.png","element":"img","alt":" κ(l) = m+22 +","inline":true},{"style":{"height":19.37},"width":487.7,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-35.png","element":"img","alt":"mO(l−1) → m+22 as l → ∞","inline":true},{"text":". 
Unlike in the chaotic and ordered phases, here ","element":"span"},{"style":{"height":14.18},"width":57.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-36.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"converges with rate ","element":"span"},{"style":{"height":17.38},"width":119.79,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-37.png","element":"img","alt":" O(l−1)","inline":true},{"text":". Figure ","element":"span"},{"href":"#id-51","text":"1c ","element":"a"},{"text":"confirms the limit ","element":"span"},{"style":{"height":19.68},"width":195.61,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-38.png","element":"img","alt":" κ(l) → m+22","inline":true,"padRight":true},{"text":"for both FCN and CNN-F (global average pooling in CNNs introduces a correction term that we will discuss below). A similar calculation gives ","element":"span"},{"style":{"height":18.19},"width":303.17,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-39.png","element":"img","alt":"P(Θ(l)) = O(l−1)","inline":true,"padRight":true},{"text":"on the critical line.","element":"span"}],[{"text":"In summary, ","element":"span"},{"style":{"height":14.19},"width":57.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-40.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"converges to a finite number and the network ought to be trainable for arbitrary depth, but the mean predictor ","element":"span"},{"style":{"height":18.19},"width":130.75,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-41.png","element":"img","alt":" P(Θ(l))","inline":true,"padRight":true},{"text":"decays as a power law. 
Decay as ","element":"span"},{"style":{"height":13.39},"width":53.58,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-42.png","element":"img","alt":" l−1","inline":true,"padRight":true},{"text":"is much slower than exponential and is slow on the scale of neural networks. This explains why critically initialized networks with thousands of layers can still generalize (","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"Xiao ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"et al.","element":"a"},{"href":"#id-13","referenceIndex":41,"text":", ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2018","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"6.4. The Effect of Convolutions","element":"span"}],[{"text":"The above theory can be extended to CNNs. We will provide an informal description here, with details in Sec. ","element":"span"},{"text":"F","element":"span"},{"text":". For input images of shape ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"m, k, k, ","element":"span"},{"text":"3) ","element":"span"},{"text":"the NTK and NNGP kernels will have shape ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"m, k, k, m, k, k","element":"span"},{"text":") ","element":"span"},{"text":"and will contain information about the covariance between each pair of pixels in each image. For convenience we will let ","element":"span"},{"style":{"height":13.38},"width":111.97,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-43.png","element":"img","alt":" d = k2","inline":true},{"text":". 
In the large-depth setting, deviations of both kernels from their fixed point decompose via a Fourier transform in the spatial dimensions as,","element":"span"}],[{"style":{"width":"71%"},"width":673,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-44.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"denotes the Fourier mode with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"= 0 ","element":"span"},{"text":"being the zero-frequency (uniform) mode and ","element":"span"},{"style":{"height":11.59},"width":35.6,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/6-45.png","element":"img","alt":" ρq","inline":true,"padRight":true},{"text":"are the eigenvalues of a certain ","element":"span"},{"text":"convolution operator. Here ","element":"span"},{"style":{"height":18.18},"width":138.07,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-0.png","element":"img","alt":" δΘ(l)(q)","inline":true,"padRight":true},{"text":"are deviations from the fixed point for the ","element":"span"},{"style":{"height":21.35},"width":583.7,"height":53.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-1.png","element":"img","alt":" qth mode with δΘ(l)(q) ∝ δΘ(l)FCN the","inline":true,"padRight":true},{"text":"fully-connected deviation described above. 
We show that ","element":"span"},{"style":{"height":16.79},"width":400.71,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-2.png","element":"img","alt":"ρq=0 = 1 and |ρq̸=0| < 1","inline":true,"padRight":true},{"text":"which implies that asymptotically the nonuniform modes become subleading as ","element":"span"},{"style":{"height":19.72},"width":230.02,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-3.png","element":"img","alt":" ρlq → 0. Thus,","inline":true,"padRight":true},{"text":"at large depths different pixels evolve identically as FCNs.","element":"span"}],[{"text":"In Sec. ","element":"span"},{"href":"#id-59","text":"F.2 ","element":"a"},{"text":"we discuss the differences that arise when one combines a CNN with a flattening layer compared with an average pooling layer at the readout. In the case of flattening, the pixel-pixel correlations are discarded and ","element":"span"},{"style":{"height":21.36},"width":306.06,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-4.png","element":"img","alt":"Θ(l)CNN−F ≈ Θ(l)FCN","inline":true},{"text":". 
The plots in the first row of Figure ","element":"span"},{"href":"#id-51","text":"1 ","element":"a"},{"text":"confirm that the ","element":"span"},{"style":{"height":14.18},"width":57.66,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-5.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":21.36},"width":146.7,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-6.png","element":"img","alt":" Θ(l)CNN−F","inline":true,"padRight":true},{"text":"and of ","element":"span"},{"style":{"height":21.36},"width":97.45,"height":53.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-7.png","element":"img","alt":" Θ(l)FCN","inline":true,"padRight":true},{"text":"evolve almost ","element":"span"},{"text":"identically in all phases. Note that this clarifies an empirical observation in ","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"Xiao et al. ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"(","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2018","element":"a"},{"text":") (Figure 3 of ","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"Xiao ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"et al. ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"(","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2018","element":"a"},{"text":")) that the test performance of critically initialized CNNs degrades towards that of FCNs as depth increases. This is because (i) in the large-width limit, the prediction of neural networks is characterized by the NTK and (ii) the NTKs of the two models are almost identical for large depth. 
However, when CNNs are combined with global average pooling, a correction to the spectrum of the NTK (NNGP) emerges owing to pixel-pixel correlations; this alters the dynamics of ","element":"span"},{"style":{"height":14.19},"width":57.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-8.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.19},"width":130.75,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-9.png","element":"img","alt":" P(Θ(l))","inline":true},{"text":". In particular, we find that global average pooling increases ","element":"span"},{"style":{"height":14.18},"width":57.66,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-10.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"in the ordered phase and on the critical line; see Table ","element":"span"},{"href":"#id-30","text":"1 ","element":"a"},{"text":"for the exact correction as well as Figure ","element":"span"},{"href":"#id-51","text":"1d ","element":"a"},{"text":"for experimental evidence of this correction.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"6.5. Dropout, Relu and Skip-connection","element":"span"}],[{"text":"Adding dropout to the penultimate layer has a similar effect to adding a diagonal regularization term to the NTK, which significantly improves the conditioning of the NTK in the ordered phase. 
In particular, adding a single dropout layer can cause ","element":"span"},{"style":{"height":14.19},"width":57.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-11.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"to converge to a finite ","element":"span"},{"style":{"height":10.99},"width":38.96,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-12.png","element":"img","alt":" κ∗","inline":true,"padRight":true},{"text":"rather than diverge exponentially; see Figure ","element":"span"},{"href":"#id-60","text":"4 ","element":"a"},{"text":"and Sec. ","element":"span"},{"text":"E","element":"span"},{"text":".","element":"span"}],[{"text":"For critically initialized ReLU networks (also known as He initialization (","element":"span"},{"href":"#id-61","referenceIndex":20,"text":"He et al.","element":"a"},{"href":"#id-61","referenceIndex":20,"text":", ","element":"a"},{"href":"#id-61","referenceIndex":20,"text":"2015","element":"a"},{"text":")), the entries of the NTK also diverge linearly, with ","element":"span"},{"style":{"height":19.68},"width":202.23,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-13.png","element":"img","alt":" κ(l) → m+33","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.18},"width":306.84,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-14.png","element":"img","alt":" P(Θ(l)) = O(1/l)","inline":true},{"text":"; see Table ","element":"span"},{"href":"#id-62","text":"2 ","element":"a"},{"text":"and Figure ","element":"span"},{"href":"#id-58","text":"3","element":"a"},{"text":". In addition, adding skip connections causes all entries of the NTK to diverge exponentially, resulting in exploding gradients. 
However, we find that skip connections do not alter the dynamics of ","element":"span"},{"style":{"height":14.19},"width":57.66,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-15.png","element":"img","alt":" κ(l)","inline":true},{"text":". Finally, layer normalization could help address the issue of exploding gradients; see Sec. ","element":"span"},{"text":"C","element":"span"},{"text":".","element":"span"}]]},{"heading":"7. Experiments","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Evolution of ","element":"span"},{"href":"#id-51","style":{"height":17.78},"width":249.22,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-16.png","element":"img","alt":" κ(l) (Figure 1).","inline":true,"padRight":true},{"text":"We randomly sample inputs with shape ","element":"span"},{"style":{"height":16},"width":760.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-17.png","element":"img","alt":" (m, k, k, 3) where m ∈ {12, 20} and k = 6. We","inline":true,"padRight":true},{"text":"compute the exact NTK with activation function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Erf ","element":"span"},{"text":"using the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Tangents ","element":"span"},{"text":"library (","element":"span"},{"href":"#id-63","referenceIndex":30,"text":"Novak et al.","element":"a"},{"href":"#id-63","referenceIndex":30,"text":", ","element":"a"},{"href":"#id-63","referenceIndex":30,"text":"2019a","element":"a"},{"text":"). We see excellent agreement between the theoretical calculation of ","element":"span"},{"style":{"height":14.18},"width":206.58,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-18.png","element":"img","alt":"κ(l) in Sec. 
6","inline":true,"padRight":true},{"text":"(summarized in Table ","element":"span"},{"href":"#id-30","text":"1","element":"a"},{"text":") and the experimental results in Figure ","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Maximum Learning Rates (Figure ","element":"span"},{"href":"#id-53","style":{"fontWeight":"bold"},"text":"2 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"(c)). ","element":"span"},{"text":"In practice, given a set of hyperparameters of a network, knowing the range of feasible learning rates is extremely valuable. As discussed above, in the infinite-width setting, Equation ","element":"span"},{"href":"#id-44","text":"5 ","element":"a"},{"text":"implies that the maximal convergent learning rate is given by ","element":"span"},{"style":{"height":21.47},"width":286.84,"height":53.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-19.png","element":"img","alt":"ηtheory ≡ 2/λ(l)max","inline":true},{"text":". From our theoretical results above, varying ","element":"span"},{"text":"the hyperparameters of our network allows us to vary ","element":"span"},{"style":{"height":18.54},"width":82.34,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-20.png","element":"img","alt":"λ(l)max","inline":true,"padRight":true},{"text":"over a wide range and test this hypothesis. 
This is ","element":"span"},{"text":"shown for depth 10 networks varying ","element":"span"},{"style":{"height":18.18},"width":353.3,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-21.png","element":"img","alt":" σ2w with η = ρηtheory.","inline":true,"padRight":true},{"text":"We see that networks become untrainable when ","element":"span"},{"style":{"height":14},"width":183.7,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-22.png","element":"img","alt":" ρ exceeds 2","inline":true,"padRight":true},{"text":"as predicted.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Trainability vs Generalization (Figure ","element":"span"},{"href":"#id-53","style":{"fontWeight":"bold"},"text":"2 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"(a,b)). ","element":"span"},{"text":"We conduct an experiment training finite-width CNN-F networks with 1k training samples from CIFAR-10 with ","element":"span"},{"style":{"height":10.8},"width":136.82,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-23.png","element":"img","alt":" 20 × 20","inline":true,"padRight":true},{"text":"different ","element":"span"},{"style":{"height":17.38},"width":107.72,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-24.png","element":"img","alt":" (σ2ω, l)","inline":true,"padRight":true},{"text":"configurations. We train each network ","element":"span"},{"text":"using SGD with batch size ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"= 256 ","element":"span"},{"text":"and learning rate ","element":"span"},{"style":{"height":15.59},"width":248.69,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-25.png","element":"img","alt":"η = 0.1ηtheory","inline":true},{"text":". 
We see in Figure ","element":"span"},{"href":"#id-53","text":"2 ","element":"a"},{"text":"(a) that deep in the chaotic phase all configurations reach perfect training accuracy, but the network completely fails to generalize, in the sense that test accuracy is around ","element":"span"},{"text":"10%","element":"span"},{"text":". As expected, in the ordered phase, although the training accuracy degrades, generalization improves. We also see that the depth scales ","element":"span"},{"style":{"height":14},"width":33.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-26.png","element":"img","alt":" ξ1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":33.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-27.png","element":"img","alt":" ξ∗","inline":true,"padRight":true},{"text":"control trainability in the ordered phase and generalization in the chaotic phase, respectively. We also conduct extra experiments for FCN with more training points (16k); see Figure ","element":"span"},{"href":"#id-64","text":"6","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"CNN-P vs. CNN-F: spatial correction (Figure ","element":"span"},{"href":"#id-53","style":{"fontWeight":"bold"},"text":"2 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"(d-f)). ","element":"span"},{"text":"We compute the test accuracy using the analytic equations for gradient flow, Equation ","element":"span"},{"href":"#id-45","text":"6","element":"a"},{"text":", which corresponds to the test accuracy of an ensemble of gradient-descent-trained neural networks in the infinite-width limit. 
As above, we use ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"training points and consider a ","element":"span"},{"style":{"height":10.8},"width":121.73,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-28.png","element":"img","alt":" 20×20","inline":true,"padRight":true},{"text":"grid of configurations for ","element":"span"},{"style":{"height":17.38},"width":107.72,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-29.png","element":"img","alt":" (σ2ω, l)","inline":true},{"text":". We plot the test performance of CNN-P and ","element":"span"},{"text":"CNN-F and the performance difference in Figure ","element":"span"},{"href":"#id-53","text":"2 ","element":"a"},{"text":"(d-f). As expected, we see that the performance of both CNN-P and CNN-F is captured by ","element":"span"},{"style":{"height":16},"width":297.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-30.png","element":"img","alt":" ξ1 = −1/ log(χ1)","inline":true,"padRight":true},{"text":"in the ordered phase and by ","element":"span"},{"style":{"height":16},"width":412.11,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-31.png","element":"img","alt":" ξ∗ = −1/(log ξc−log ξ1)","inline":true,"padRight":true},{"text":"in the chaotic phase. We see that the test performance difference between CNN-P and CNN-F exhibits a region in the ordered phase (a blue strip) where CNN-F outperforms CNN-P by a large margin. 
This performance difference is due to the correction term ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"as predicted by the ","element":"span"},{"style":{"height":18.18},"width":130.75,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-32.png","element":"img","alt":" P(Θ(l))","inline":true},{"text":"-row of Table ","element":"span"},{"href":"#id-30","text":"1","element":"a"},{"text":". We also conduct extra experiments densely varying ","element":"span"},{"href":"#id-65","style":{"height":17.9},"width":265.66,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-33.png","element":"img","alt":" σ2b; see Sec. G.4.","inline":true,"padRight":true},{"text":"Together these results provide an extremely stringent test of our theory.","element":"span"}]]},{"heading":"8. Conclusion and Future Work","paragraphs":[[{"text":"In this work, we identify several quantities (","element":"span"},{"style":{"height":13.2},"width":192.43,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-34.png","element":"img","alt":"λmax, λbulk","inline":true},{"text":", ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-35.png","element":"img","alt":"κ","inline":true},{"text":", and ","element":"span"},{"style":{"height":18.19},"width":130.75,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/7-36.png","element":"img","alt":" P(Θ(l))","inline":true},{"text":") related to the spectrum of the NTK that","element":"span"}],[{"id":"id-53","style":{"width":"92%"},"width":1808,"height":1017,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/8-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 2. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Trainability and generalization are captured by ","element":"figcaption","subtype":"caption"},{"style":{"height":16.89},"width":262.76,"height":42.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/8-1.png","element":"img","alt":" κ(l) and P(Θ(l)).","inline":true,"padRight":true},{"text":"(a,b) The training and test accuracy of CNN-F trained with SGD. The network is untrainable above the green line because ","element":"figcaption","subtype":"caption"},{"style":{"height":13.29},"width":54.35,"height":33.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/8-2.png","element":"img","alt":" κ(l) ","inline":true,"padRight":true},{"text":"is too large and is ungeneralizable above the orange line because ","element":"figcaption","subtype":"caption"},{"style":{"height":16.89},"width":120.28,"height":42.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/8-3.png","element":"img","alt":"P(Θ(l))","inline":true,"padRight":true},{"text":"is too small. (c) The accuracy vs learning rate for FCNs trained with SGD sweeping over the weight variance. (d,e) The test accuracy of CNN-P and CNN-F using kernel regression. (f) The difference in accuracy between CNN-P and CNN-F networks.","element":"figcaption","subtype":"caption"}],[{"text":"control trainability and generalization of deep networks. We offer a precise characterization of these quantities and provide substantial experimental evidence supporting their role in predicting the training and generalization performance of deep neural networks. Future work might extend our framework to other architectures (for example, residual networks with batch-norm or attention architectures). 
Understanding the role of the nonuniform Fourier modes in the NTK in determining the test performance of CNNs is also an important research direction.","element":"span"}],[{"text":"In practice, the correspondence between the NTK and neural networks is often broken due to, e.g., insufficient width, using a large learning rate, or changing the parameterization. Our theory does not directly apply to this setting. As such, developing an understanding of training and generalization away from the NTK regime remains an important research direction.","element":"span"}]]},{"heading":"Acknowledgements","paragraphs":[[{"text":"We thank Jascha Sohl-dickstein, Greg Yang, Ben Adlam, Jaehoon Lee, Roman Novak and Yasaman Bahri for useful discussions and feedback. We also thank anonymous reviewers for feedback that helped improve the manuscript.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-32","text":"Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for ","element":"span"},{"text":"deep learning via over-parameterization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1811.03962","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-46","text":"Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and ","element":"span"},{"text":"Wang, R. On exact computation with an infinitely wide neural net. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1904.11955","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-2","text":"Bahdanau, D., Cho, K., and Bengio, Y. Neural machine ","element":"span"},{"text":"translation by jointly learning to align and translate. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1409.0473","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-6","text":"Bergstra, J. and Bengio, Y. ","element":"span"},{"text":"Random search for hyperparameter optimization. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 13(Feb):281–305, 2012.","element":"span"}],[{"id":"id-17","text":"Blumenfeld, Y., Gilboa, D., and Soudry, D. A mean field ","element":"span"},{"text":"theory of quantized deep networks: The quantizationdepth trade-off. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.00771","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-14","text":"Chen, M., Pennington, J., and Schoenholz, S. Dynamical ","element":"span"},{"text":"isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-36","text":"Chizat, L. and Bach, F. On the global convergence of gradi- ","element":"span"},{"text":"ent descent for over-parameterized models using optimal","element":"span"}],[{"text":"transport. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pp. 3040–3050, 2018.","element":"span"}],[{"id":"id-29","text":"Chizat, L., Oyallon, E., and Bach, F. On lazy training in ","element":"span"},{"text":"differentiable programming. 2019.","element":"span"}],[{"id":"id-0","text":"Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, ","element":"span"},{"text":"Q. V. Autoaugment: Learning augmentation strategies from data. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","element":"span"},{"text":", June 2019.","element":"span"}],[{"id":"id-10","text":"Daniely, A. SGD learns the conjugate kernel class of the ","element":"span"},{"text":"network. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 30","element":"span"},{"text":". 2017.","element":"span"}],[{"id":"id-9","text":"Daniely, A., Frostig, R., and Singer, Y. Toward deeper under- ","element":"span"},{"text":"standing of neural networks: The power of initialization and a dual view on expressivity. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances In Neural Information Processing Systems","element":"span"},{"text":", 2016.","element":"span"}],[{"id":"id-33","text":"Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient ","element":"span"},{"text":"descent finds global minima of deep neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1811.03804","element":"span"},{"text":", 2018a.","element":"span"}],[{"id":"id-31","text":"Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient ","element":"span"},{"text":"descent provably optimizes over-parameterized neural networks, 2018b.","element":"span"}],[{"id":"id-24","text":"Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. ","element":"span"},{"text":"Deep convolutional networks as shallow gaussian processes, 2018.","element":"span"}],[{"id":"id-15","text":"Gilboa, D., Chang, B., Chen, M., Yang, G., Schoenholz, ","element":"span"},{"text":"S. S., Chi, E. H., and Pennington, J. Dynamical isometry and a mean field theory of lstms and grus. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1901.08987, 2019. URL ","element":"span"},{"href":"http://arxiv.org/abs/1901.08987","text":"http://arxiv.org/ ","element":"a"},{"href":"http://arxiv.org/abs/1901.08987","text":"abs/1901.08987","element":"a"},{"text":".","element":"span"}],[{"id":"id-3","text":"Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and ","element":"span"},{"text":"Dahl, G. E. Neural message passing for quantum chemistry. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning - Volume 70","element":"span"},{"text":", ICML’17, pp. 1263– 1272. JMLR.org, 2017. URL ","element":"span"},{"href":"http://dl.acm.org/citation.cfm?id=3305381.3305512","text":"http://dl.acm.org/ ","element":"a"},{"href":"http://dl.acm.org/citation.cfm?id=3305381.3305512","text":"citation.cfm?id=3305381.3305512","element":"a"},{"text":".","element":"span"}],[{"id":"id-7","text":"Glorot, X. and Bengio, Y. Understanding the difficulty of ","element":"span"},{"text":"training deep feedforward neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pp. 249–256, 2010.","element":"span"}],[{"id":"id-19","text":"Hayou, S., Doucet, A., and Rousseau, J. On the selection ","element":"span"},{"text":"of initialization and activation function for deep neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1805.08266","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-39","text":"Hayou, S., Doucet, A., and Rousseau, J. Mean-field be- ","element":"span"},{"text":"haviour of neural tangent kernel for deep neural networks, 2019.","element":"span"}],[{"id":"id-61","text":"He, K., Zhang, X., Ren, S., and Sun, J. Delving deep ","element":"span"},{"text":"into rectifiers: Surpassing human-level performance on imagenet classification. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1502.01852, 2015. URL ","element":"span"},{"href":"http://arxiv.org/abs/1502.01852","text":"http://arxiv.org/abs/1502.01852","element":"a"},{"text":".","element":"span"}],[{"id":"id-47","text":"Huang, J. and Yau, H.-T. Dynamics of deep neural net- ","element":"span"},{"text":"works and neural tangent hierarchy. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1909.08156","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-27","text":"Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: ","element":"span"},{"text":"Convergence and generalization in neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 31","element":"span"},{"text":". 2018.","element":"span"}],[{"id":"id-40","text":"Jacot, A., Gabriel, F., and Hongler, C. Freeze and chaos for ","element":"span"},{"text":"dnns: an ntk view of batch normalization, checkerboard and boundary effects, 2019.","element":"span"}],[{"id":"id-18","text":"Karakida, R., Akaho, S., and Amari, S.-i. Universal statistics ","element":"span"},{"text":"of fisher information in deep neural networks: mean field approach. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1806.01316","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-21","text":"Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, ","element":"span"},{"text":"J., and Sohl-dickstein, J. Deep neural networks as gaussian processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-28","text":"Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl- ","element":"span"},{"text":"Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1902.06720","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-22","text":"Matthews, A., Hron, J., Rowland, M., Turner, R. E., and ","element":"span"},{"text":"Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 4 2018. ","element":"span"},{"text":"URL ","element":"span"},{"href":"https://openreview.net/forum?id=H1-nGgWC-","text":"https:// ","element":"a"},{"href":"https://openreview.net/forum?id=H1-nGgWC-","text":"openreview.net/forum?id=H1-nGgWC-","element":"a"},{"text":".","element":"span"}],[{"id":"id-35","text":"Mei, S., Montanari, A., and Nguyen, P.-M. A mean field ","element":"span"},{"text":"view of the landscape of two-layer neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the National Academy of Sciences","element":"span"},{"text":", 115(33): E7665–E7671, 2018.","element":"span"}],[{"id":"id-8","text":"Neal, R. M. Priors for infinite networks (tech. rep. no. crg- ","element":"span"},{"text":"tr-94-1). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"University of Toronto","element":"span"},{"text":", 1994.","element":"span"}],[{"id":"id-63","text":"Novak, R., Lee, L. X. J., Sohl-Dickstein, J., and Schoenholz, ","element":"span"},{"text":"S. S. Neural tangents: Fast and easy infinite neural networks in python, 2019a. URL ","element":"span"},{"href":"http://github.com/google/neural-tangents","text":"http://github.com/ ","element":"a"},{"href":"http://github.com/google/neural-tangents","text":"google/neural-tangents","element":"a"},{"text":".","element":"span"}],[{"id":"id-23","text":"Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., ","element":"span"},{"text":"Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are gaussian processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2019b.","element":"span"}],[{"id":"id-1","text":"Park, D. 
S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., ","element":"span"},{"text":"Cubuk, E. D., and Le, Q. V. Specaugment: A simple data augmentation method for automatic speech recognition. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1904.08779","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-80","text":"Pennington, J., Schoenholz, S. S., and Ganguli, S. The emer- ","element":"span"},{"text":"gence of spectral universality in deep networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1802.09979","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-26","text":"Philipp, G., Song, D., and Carbonell, J. G. The explod- ","element":"span"},{"text":"ing gradient problem demystified-definition, prevalence, impact, origin, tradeoffs, and solutions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1712.05577","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-11","text":"Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and ","element":"span"},{"text":"Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances In Neural Information Processing Systems","element":"span"},{"text":", pp. 3360–3368, 2016.","element":"span"}],[{"id":"id-4","text":"Raghu, M., Zhang, C., Kleinberg, J., and Bengio, S. Trans- ","element":"span"},{"text":"fusion: Understanding transfer learning with applications to medical imaging. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1902.07208","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-38","text":"Rotskoff, G. M. and Vanden-Eijnden, E. Neural networks as ","element":"span"},{"text":"interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1805.00915","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-12","text":"Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl- ","element":"span"},{"text":"Dickstein, J. Deep information propagation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-5","text":"Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, ","element":"span"},{"text":"M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Science","element":"span"},{"text":", 362(6419):1140–1144, 2018. ","element":"span"},{"text":"ISSN 0036-8075. ","element":"span"},{"text":"doi: 10.1126/science.aar6404. URL ","element":"span"},{"href":"https://science.sciencemag.org/content/362/6419/1140","text":"https://science. ","element":"a"},{"href":"https://science.sciencemag.org/content/362/6419/1140","text":"sciencemag.org/content/362/6419/1140","element":"a"},{"text":".","element":"span"}],[{"id":"id-37","text":"Sirignano, J. and Spiliopoulos, K. Mean field analysis of ","element":"span"},{"text":"neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1805.01053","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-13","text":"Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and ","element":"span"},{"text":"Pennington, J. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-25","text":"Yang, G. ","element":"span"},{"text":"Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1902.04760","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-16","text":"Yang, G. and Schoenholz, S. Mean field residual networks: ","element":"span"},{"text":"On the edge of chaos. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":". 2017.","element":"span"}],[{"id":"id-20","text":"Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., and ","element":"span"},{"text":"Schoenholz, S. S. A mean field theory of batch normalization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1902.08129","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-34","text":"Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient ","element":"span"},{"text":"descent optimizes over-parameterized deep ReLU networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1811.08888","element":"span"},{"text":", 2018.","element":"span"}]]},{"heading":"A. Related Work","paragraphs":[[{"text":"Recent work ","element":"span"},{"href":"#id-27","referenceIndex":22,"text":"Jacot et al. ","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"(","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-31","referenceIndex":13,"text":"Du et al. 
","element":"a"},{"href":"#id-31","referenceIndex":13,"text":"(","element":"a"},{"href":"#id-31","referenceIndex":13,"text":"2018b","element":"a"},{"text":"); ","element":"span"},{"href":"#id-32","referenceIndex":1,"text":"Allen-Zhu et al. ","element":"a"},{"href":"#id-32","referenceIndex":1,"text":"(","element":"a"},{"href":"#id-32","referenceIndex":1,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-33","referenceIndex":12,"text":"Du et al. ","element":"a"},{"href":"#id-33","referenceIndex":12,"text":"(","element":"a"},{"href":"#id-33","referenceIndex":12,"text":"2018a","element":"a"},{"text":"); ","element":"span"},{"href":"#id-34","referenceIndex":45,"text":"Zou et al. ","element":"a"},{"href":"#id-34","referenceIndex":45,"text":"(","element":"a"},{"href":"#id-34","referenceIndex":45,"text":"2018","element":"a"},{"text":") proved global convergence of over-parameterized deep networks by showing that the NTK essentially remains constant over the course of training. However, in a different scaling limit the NTK changes over the course of training, and global convergence is much more difficult to obtain; it is known for neural networks with one hidden layer ","element":"span"},{"href":"#id-35","referenceIndex":28,"text":"Mei et al. 
","element":"a"},{"href":"#id-35","referenceIndex":28,"text":"(","element":"a"},{"href":"#id-35","referenceIndex":28,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-36","referenceIndex":7,"text":"Chizat & Bach ","element":"a"},{"href":"#id-36","referenceIndex":7,"text":"(","element":"a"},{"href":"#id-36","referenceIndex":7,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-37","referenceIndex":40,"text":"Sirignano & Spiliopoulos ","element":"a"},{"href":"#id-37","referenceIndex":40,"text":"(","element":"a"},{"href":"#id-37","referenceIndex":40,"text":"2018","element":"a"},{"text":"); ","element":"span"},{"href":"#id-38","referenceIndex":37,"text":"Rotskoff & Vanden-Eijnden ","element":"a"},{"href":"#id-38","referenceIndex":37,"text":"(","element":"a"},{"href":"#id-38","referenceIndex":37,"text":"2018","element":"a"},{"text":"). Therefore, understanding the training and generalization properties in this scaling limit remains a very challenging open question.","element":"span"}],[{"text":"Two excellent concurrent works (","element":"span"},{"href":"#id-39","referenceIndex":19,"text":"Hayou et al.","element":"a"},{"href":"#id-39","referenceIndex":19,"text":", ","element":"a"},{"href":"#id-39","referenceIndex":19,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":23,"text":"Jacot et al.","element":"a"},{"href":"#id-40","referenceIndex":23,"text":", ","element":"a"},{"href":"#id-40","referenceIndex":23,"text":"2019","element":"a"},{"text":") also study the dynamics of ","element":"span"},{"style":{"height":18.18},"width":174.08,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-0.png","element":"img","alt":" Θ(l)(x, x′)","inline":true,"padRight":true},{"text":"for FCNs (and deconvolutions in (","element":"span"},{"href":"#id-40","referenceIndex":23,"text":"Jacot et al.","element":"a"},{"href":"#id-40","referenceIndex":23,"text":", 
","element":"a"},{"href":"#id-40","referenceIndex":23,"text":"2019","element":"a"},{"text":")) as a function of depth and variances of the weights and biases. (","element":"span"},{"href":"#id-39","referenceIndex":19,"text":"Hayou et al.","element":"a"},{"href":"#id-39","referenceIndex":19,"text":", ","element":"a"},{"href":"#id-39","referenceIndex":19,"text":"2019","element":"a"},{"text":") investigate the role of activation functions (smooth vs. non-smooth) and skip connections. (","element":"span"},{"href":"#id-40","referenceIndex":23,"text":"Jacot et al.","element":"a"},{"href":"#id-40","referenceIndex":23,"text":", ","element":"a"},{"href":"#id-40","referenceIndex":23,"text":"2019","element":"a"},{"text":") demonstrate that batch normalization helps remove the “ordered phase” (as in (","element":"span"},{"href":"#id-20","referenceIndex":44,"text":"Yang et al.","element":"a"},{"href":"#id-20","referenceIndex":44,"text":", ","element":"a"},{"href":"#id-20","referenceIndex":44,"text":"2019","element":"a"},{"text":")) and a layer-dependent learning rate allows every layer in a network to contribute to learning. In contrast to these contributions, here we focus our effort on understanding trainability and generalization in this context. We also provide a theory for a wider range of architectures than these other efforts.","element":"span"}]]},{"heading":"B. Signal propagation of NNGP and NTK","paragraphs":[[{"text":"In this section, we assume that the activation function ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-1.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"has a continuous third derivative. 
Recall that the recursive formulas for NNGP ","element":"span"},{"style":{"height":14.18},"width":65.63,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-2.png","element":"img","alt":" K(l) ","inline":true,"padRight":true},{"text":"and the NTK ","element":"span"},{"style":{"height":14.18},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-3.png","element":"img","alt":" Θ(l) ","inline":true,"padRight":true},{"text":"are given by","element":"span"}],[{"id":"id-74","style":{"width":"75%"},"width":1467,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-4.png","element":"img"}],[{"text":"where","element":"span"}],[{"text":"Note that we have normalized each input to have variance ","element":"span"},{"style":{"height":14.19},"width":35.22,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-5.png","element":"img","alt":" q∗","inline":true,"padRight":true},{"text":"and the diagonals of ","element":"span"},{"style":{"height":14.19},"width":65.63,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-6.png","element":"img","alt":" K(l)","inline":true,"padRight":true},{"text":"are equal to ","element":"span"},{"style":{"height":14.19},"width":35.22,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-7.png","element":"img","alt":" q∗","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":". 
The off-diagonal terms of ","element":"span"},{"style":{"height":14.18},"width":65.63,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-8.png","element":"img","alt":" K(l)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-9.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"are denoted by ","element":"span"},{"style":{"height":21.49},"width":53.91,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-10.png","element":"img","alt":" q(l)ab","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.49},"width":54.74,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-11.png","element":"img","alt":" p(l)ab","inline":true},{"text":", resp. and the diagonal terms are ","element":"span"},{"style":{"height":17.38},"width":53.91,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-12.png","element":"img","alt":" q(l)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":54.74,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-13.png","element":"img","alt":" p(l)","inline":true},{"text":", resp. 
The ","element":"span"},{"text":"above equations can be simplified to","element":"span"}],[{"id":"id-67","style":{"width":"83%"},"width":1621,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-14.png","element":"img"}],[{"id":"id-66","text":"In what follows, we compute the evolution of ","element":"span"},{"style":{"height":21.49},"width":208.16,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-15.png","element":"img","alt":" q(l)ab , p(l)ab, p(l)","inline":true,"padRight":true},{"text":"and the spectrum and condition numbers of ","element":"span"},{"style":{"height":14.58},"width":288.54,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-16.png","element":"img","alt":" K(l) and Θ(l). We","inline":true,"padRight":true},{"text":"will use ","element":"span"},{"style":{"height":18.18},"width":797.94,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-17.png","element":"img","alt":" λmax(Θ(l))/λmax(K(l)), λbulk(Θ(l))/λbulk(K(l))","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.18},"width":264.54,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-18.png","element":"img","alt":" κ(Θ(l))/κ(K(l))","inline":true,"padRight":true},{"text":"to denote the maximum eigenvalues, the bulk eigenvalues and the condition number of ","element":"span"},{"style":{"height":18.19},"width":252.45,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-19.png","element":"img","alt":" Θ(l)/K(l), resp.","inline":true}],[{"id":"id-50","style":{"fontWeight":"bold"},"text":"B.1. Chaotic Phase","element":"span"}],[{"text":"B.1.1. 
C","element":"span"},{"text":"ORRECTION OF THE OFF","element":"span"},{"text":"-","element":"span"},{"text":"DIAGONAL","element":"span"},{"text":"/","element":"span"},{"text":"DIAGONAL","element":"span"}],[{"text":"The diagonal terms are relatively simple to compute. Equation ","element":"span"},{"href":"#id-66","text":"24 ","element":"a"},{"text":"gives","element":"span"}],[{"style":{"width":"58%"},"width":1140,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-20.png","element":"img"}],[{"text":"i.e.","element":"span"}],[{"text":"In the chaotic phase, ","element":"span"},{"style":{"height":18.67},"width":416.84,"height":46.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/11-21.png","element":"img","alt":" χ1 > 1 and p(l) ≈ χl−11 q∗","inline":true},{"text":", i.e., it diverges exponentially quickly.","element":"span"}],[{"id":"id-62","style":{"fontWeight":"bold"},"text":"NTK ","element":"span"},{"style":{"height":16.59},"width":75.2,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-0.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"of FC/CNN-F, ","element":"span"},{"text":"CNN-P","element":"span"}],[{"style":{"width":"89%"},"width":1751,"height":1164,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Table 2. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Evolution of the NTK/NNGP spectrum and ","element":"figcaption","subtype":"caption"},{"style":{"height":16.89},"width":404.88,"height":42.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-2.png","element":"img","alt":" P(Θ(l))Ytrain/P(K(l))Ytrain","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"as a function of depth ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"l","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":". ","element":"figcaption","subtype":"caption"},{"text":"The NTKs of FCN and CNN without pooling (CNN-F) are essentially the same and the scaling of ","element":"figcaption","subtype":"caption"},{"style":{"height":18.66},"width":392.6,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-3.png","element":"img","alt":" λ(l)max, λ(l)bulk, κ(l), and ∆(l)","inline":true,"padRight":true},{"text":"for these networks is written in ","element":"figcaption","subtype":"caption"},{"text":"black. Corrections to these quantities due to the addition of an average pooling layer (","element":"figcaption","subtype":"caption"},{"text":"CNN-P","element":"figcaption","subtype":"caption"},{"text":") with window size ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"d ","element":"figcaption","subtype":"caption"},{"text":"are written in blue.","element":"figcaption","subtype":"caption"}],[{"text":"Now we compute the off-diagonal terms. 
Since ","element":"span"},{"style":{"height":19.34},"width":344.02,"height":48.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-4.png","element":"img","alt":" χc∗ = σ2ω ˙T (q∗ab) < 1","inline":true,"padRight":true},{"text":"in the chaotic phase, ","element":"span"},{"style":{"height":15.5},"width":51.34,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-5.png","element":"img","alt":" p∗ab ","inline":true,"padRight":true},{"text":"exists and is finite. Indeed, letting ","element":"span"},{"style":{"height":11.2},"width":114.68,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-6.png","element":"img","alt":"l → ∞","inline":true,"padRight":true},{"text":"in equation ","element":"span"},{"href":"#id-67","text":"23","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"60%"},"width":1186,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-7.png","element":"img"}],[{"text":"which gives","element":"span"}],[{"style":{"width":"9%"},"width":89,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-8.png","element":"img"}],[{"text":"To compute the finite depth correction, let","element":"span"}],[{"style":{"width":"10%"},"width":95,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-9.png","element":"img"}],[{"text":"Applying Taylor’s expansion to the first equation of ","element":"span"},{"href":"#id-67","text":"23 ","element":"a"},{"text":"gives","element":"span"}],[{"style":{"width":"48%"},"width":452,"height":196,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/12-10.png","element":"img"}],[{"text":"That is","element":"span"}],[{"text":"Similarly, applying Taylor’s expansion to the second equation of ","element":"span"},{"href":"#id-67","text":"23 
","element":"a"},{"text":"gives","element":"span"}],[{"id":"id-68","style":{"width":"99%"},"width":1943,"height":507,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/13-0.png","element":"img"}],[{"id":"id-69","style":{"fontWeight":"bold"},"text":"Lemma 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists a finite number ","element":"span"},{"style":{"height":14.4},"width":206.02,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/13-1.png","element":"img","alt":" ζab such that","inline":true}],[{"text":"We emphasize that the limits are data-dependent, which was verified in Fig. ","element":"span"},{"href":"#id-51","text":"1e ","element":"a"},{"href":"#id-51","text":"and ","element":"a"},{"href":"#id-51","text":"1f ","element":"a"},{"text":"empirically.","element":"span"}],[{"style":{"width":"99%"},"width":1943,"height":442,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/13-2.png","element":"img"}],[{"text":"Equation ","element":"span"},{"href":"#id-68","text":"37 ","element":"a"},{"text":"gives","element":"span"}],[{"style":{"width":"99%"},"width":938,"height":300,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/13-3.png","element":"img"}],[{"text":"Summing over all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"implies","element":"span"}],[{"style":{"width":"27%"},"width":260,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/13-4.png","element":"img"}],[{"style":{"width":"88%"},"width":828,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-0.png","element":"img"}],[{"text":"We consider the spectrum of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"and 
","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-1.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"in this phase. For ","element":"span"},{"style":{"height":14.19},"width":65.63,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-2.png","element":"img","alt":" K(l)","inline":true},{"text":", we have ","element":"span"},{"style":{"height":15.5},"width":193.05,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-3.png","element":"img","alt":" q∗ab = c∗q∗","inline":true,"padRight":true},{"text":"(with ","element":"span"},{"style":{"height":17.39},"width":329.2,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-4.png","element":"img","alt":" c∗ < 1), q(l) = q∗","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.49},"width":433.45,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-5.png","element":"img","alt":"q(l)ab = q∗ab + O(χlc∗). 
Thus","inline":true}],[{"style":{"width":"57%"},"width":1110,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-6.png","element":"img"}],[{"text":"where","element":"span"}],[{"text":"The NNGP ","element":"span"},{"style":{"height":10.98},"width":46.94,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-7.png","element":"img","alt":" K∗ ","inline":true,"padRight":true},{"text":"has two different eigenvalues: ","element":"span"},{"style":{"height":16},"width":307.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-8.png","element":"img","alt":" q∗(1 + (m − 1)c∗)","inline":true,"padRight":true},{"text":"of order 1 and ","element":"span"},{"style":{"height":16},"width":654.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-9.png","element":"img","alt":" q∗(1 − c∗) of order (m − 1), where m is","inline":true,"padRight":true},{"text":"the size of the dataset. 
For large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":", since the spectral norm of ","element":"span"},{"style":{"height":17.38},"width":205.49,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-10.png","element":"img","alt":" El is O(χlc∗)","inline":true},{"text":", the spectrum and condition number of ","element":"span"},{"style":{"height":14.58},"width":127,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-11.png","element":"img","alt":" K(l) are","inline":true}],[{"style":{"width":"68%"},"width":1327,"height":221,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-12.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":26.44},"width":1264.96,"height":66.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-13.png","element":"img","alt":" Θ(l), we have p(l)ab = p∗ab + O(lχlc∗) → p∗ab < ∞ and p(l) = 1−χl11−χ1 q∗ → ∞, i.e.","inline":true}],[{"style":{"width":"63%"},"width":1234,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-14.png","element":"img"}],[{"text":"Thus ","element":"span"},{"style":{"height":14.19},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-15.png","element":"img","alt":" Θ(l) ","inline":true,"padRight":true},{"text":"is essentially a diverging constant multiplying the identity and","element":"span"}],[{"style":{"width":"62%"},"width":1219,"height":183,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-16.png","element":"img"}],[{"id":"id-54","style":{"fontWeight":"bold"},"text":"B.2. Ordered Phase","element":"span"}],[{"text":"B.2.1. 
T","element":"span"},{"text":"HE CORRECTION OF THE DIAGONAL","element":"span"},{"text":"/","element":"span"},{"text":"OFF","element":"span"},{"text":"-","element":"span"},{"text":"DIAGONAL","element":"span"}],[{"style":{"width":"100%"},"width":939,"height":605,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-17.png","element":"img"}],[{"text":"Similarly, in the ordered phase we have the following. ","element":"span"},{"id":"id-56","style":{"fontWeight":"bold"},"text":"Lemma 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists ","element":"span"},{"style":{"height":14.4},"width":206.03,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-18.png","element":"img","alt":" ζab such that","inline":true}],[{"style":{"width":"54%"},"width":509,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/14-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Therefore the following limits exist","element":"span"}],[{"text":"Since the proof is almost identical to that of Lemma ","element":"span"},{"href":"#id-69","text":"1","element":"a"},{"text":", we omit the details.","element":"span"}],[{"style":{"width":"88%"},"width":828,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-0.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":21.49},"width":1019.31,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-1.png","element":"img","alt":" K(l), we have q∗ab = q∗, q(l)ab = q∗ + O(χl1) and q(l) = q∗. 
Thus","inline":true}],[{"style":{"width":"59%"},"width":1162,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-2.png","element":"img"}],[{"text":"which implies","element":"span"}],[{"style":{"width":"99%"},"width":938,"height":423,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-3.png","element":"img"}],[{"text":"which implies","element":"span"}],[{"style":{"width":"21%"},"width":204,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-4.png","element":"img"}],[{"id":"id-58","style":{"fontWeight":"bold"},"text":"B.3. The critical line.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Figure 3. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Condition numbers of NNGP and their rate of convergence. ","element":"figcaption","subtype":"caption"},{"text":"In the chaotic phase, ","element":"figcaption","subtype":"caption"},{"style":{"height":16.9},"width":112.64,"height":42.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-5.png","element":"img","alt":" κ(K(l))","inline":true,"padRight":true},{"text":"converges to a constant (see Table ","element":"figcaption","subtype":"caption"},{"href":"#id-62","text":"2","element":"a","subtype":"caption"},{"text":") for FCN, CNN-F (a) and CNN-P (b). However, it diverges exponentially in the ordered phase (c) and linearly on the critical line (d). 
For critical RELU network, ","element":"figcaption","subtype":"caption"},{"style":{"height":16.9},"width":112.64,"height":42.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-6.png","element":"img","alt":" κ(K(l))","inline":true,"padRight":true},{"text":"diverges quadratically (e) while ","element":"figcaption","subtype":"caption"},{"style":{"height":16.9},"width":112.86,"height":42.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-7.png","element":"img","alt":" κ(Θ(l))","inline":true,"padRight":true},{"text":"converges to a fixed number with rate ","element":"figcaption","subtype":"caption"},{"style":{"height":16.1},"width":147.09,"height":40.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-8.png","element":"img","alt":" (l−1) (see","inline":true,"padRight":true},{"text":"Equation ","element":"figcaption","subtype":"caption"},{"href":"#id-70","text":"92","element":"a","subtype":"caption"},{"text":") and we plot the value of ","element":"figcaption","subtype":"caption"},{"style":{"height":16.89},"width":282.44,"height":42.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/15-9.png","element":"img","alt":" (κ(Θ(l)) − κ(Θ∗))","inline":true,"padRight":true},{"text":"of the NTK in (f).","element":"figcaption","subtype":"caption"}],[{"text":"B.3.1. C","element":"span"},{"text":"ORRECTION OF THE DIAGONALS","element":"span"},{"text":"/","element":"span"},{"text":"OFF","element":"span"},{"text":"-","element":"span"},{"text":"DIAGONALS","element":"span"},{"text":".","element":"span"}],[{"text":"We have ","element":"span"},{"style":{"height":14},"width":115.96,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-0.png","element":"img","alt":" χ1 = 1","inline":true,"padRight":true},{"text":"on the critical line. 
Equation ","element":"span"},{"href":"#id-66","text":"24 ","element":"a"},{"text":"implies ","element":"span"},{"style":{"height":17.39},"width":158.22,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-1.png","element":"img","alt":" p(l) = lq∗","inline":true},{"text":", i.e. the diagonal terms diverge linearly. To capture the linear divergence of ","element":"span"},{"style":{"height":21.49},"width":174.81,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-2.png","element":"img","alt":" p(l)ab, define","inline":true}],[{"style":{"width":"56%"},"width":1102,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-3.png","element":"img"}],[{"text":"We need to expand the first equation of ","element":"span"},{"href":"#id-67","text":"23 ","element":"a"},{"text":"to second order","element":"span"}],[{"text":"Here we assume ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"has a continuous third derivative (for which it suffices that the activation ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-4.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"has a continuous third derivative). 
The above equation implies","element":"span"}],[{"id":"id-71","style":{"width":"59%"},"width":1161,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-5.png","element":"img"}],[{"text":"Then","element":"span"}],[{"style":{"width":"52%"},"width":492,"height":198,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-6.png","element":"img"}],[{"text":"Plugging Equation ","element":"span"},{"href":"#id-71","text":"73 ","element":"a"},{"text":"into the above equation gives","element":"span"}],[{"id":"id-73","style":{"width":"15%"},"width":150,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-7.png","element":"img"}],[{"text":"B.3.2. T","element":"span"},{"text":"HE ","element":"span"},{"text":"S","element":"span"},{"text":"PECTRUM OF ","element":"span"},{"text":"NNGP ","element":"span"},{"text":"AND ","element":"span"},{"text":"NTK","element":"span"}],[{"style":{"width":"99%"},"width":938,"height":663,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-8.png","element":"img"}]]},{"heading":"C. NNGP and NTK of Relu networks.","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"C.1. Critical Relu.","element":"span"}],[{"text":"We only consider the critical initialization (i.e. He’s initialization (","element":"span"},{"href":"#id-61","referenceIndex":20,"text":"He et al.","element":"a"},{"href":"#id-61","referenceIndex":20,"text":", ","element":"a"},{"href":"#id-61","referenceIndex":20,"text":"2015","element":"a"},{"text":")) ","element":"span"},{"style":{"height":17.9},"width":310.36,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-9.png","element":"img","alt":" σ2ω = 2 and σ2b = 0","inline":true},{"text":", which preserves the ","element":"span"},{"text":"norm of an input from layer to layer. We also normalize the inputs to have unit variance, i.e. 
","element":"span"},{"style":{"height":17.38},"width":459.77,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-10.png","element":"img","alt":" q∗ = q(l) = q(0) = 1. Recall","inline":true,"padRight":true},{"text":"that","element":"span"}],[{"style":{"width":"64%"},"width":1263,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/16-11.png","element":"img"}],[{"text":"This implies","element":"span"}],[{"text":"which gives ","element":"span"},{"style":{"height":17.39},"width":122.32,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-0.png","element":"img","alt":" p(l) = l","inline":true},{"text":". Using the equations in Appendix C of (","element":"span"},{"href":"#id-28","referenceIndex":26,"text":"Lee et al.","element":"a"},{"href":"#id-28","referenceIndex":26,"text":", ","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"2019","element":"a"},{"text":") gives","element":"span"}],[{"style":{"width":"67%"},"width":1309,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-1.png","element":"img"}],[{"text":"and taking the derivative w.r.t. 
","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-2.png","element":"img","alt":" ϵ","inline":true}],[{"style":{"width":"41%"},"width":387,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-3.png","element":"img"}],[{"text":"Thus","element":"span"}],[{"style":{"width":"41%"},"width":388,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-4.png","element":"img"}],[{"text":"This is enough to conclude (similar to the above calculation)","element":"span"}],[{"style":{"width":"19%"},"width":179,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-5.png","element":"img"}],[{"text":"and","element":"span"}],[{"text":"Recall that the diagonals of ","element":"span"},{"style":{"height":17.38},"width":607.98,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-6.png","element":"img","alt":" K(l) and Θ(l) are q(l) = 1 and p(l) = l","inline":true},{"text":", resp. Therefore the spectrum and the condition numbers","element":"span"}],[{"id":"id-70","style":{"width":"99%"},"width":1941,"height":239,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-7.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.2. Residual Relu","element":"span"}],[{"text":"We consider the following “continuum” residual network","element":"span"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"denotes the ‘depth’ and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"dt > ","element":"span"},{"text":"0 ","element":"span"},{"text":"is sufficiently small and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"are the weights and biases. 
We also set ","element":"span"},{"style":{"height":17.33},"width":189.5,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-8.png","element":"img","alt":" σ2ω = 2 (i.e.","inline":true},{"style":{"height":17.9},"width":651.29,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-9.png","element":"img","alt":"E[WW T ] = 2Id) and σ2b = 0 (i.e. b = 0","inline":true},{"text":"). The NNGP and NTK have the following form","element":"span"}],[{"style":{"width":"71%"},"width":1388,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-10.png","element":"img"}],[{"text":"Taking the limit ","element":"span"},{"style":{"height":14.4},"width":210.04,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-11.png","element":"img","alt":" dt → 0 gives","inline":true}],[{"id":"id-72","text":"Using the fact that ","element":"span"},{"style":{"height":17.38},"width":135.17,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-12.png","element":"img","alt":" q(0) = 1","inline":true,"padRight":true},{"text":"(i.e. 
the inputs have unit variance), we can compute the diagonal terms ","element":"span"},{"style":{"height":17.38},"width":389.51,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-13.png","element":"img","alt":" q(t) = et and p(t) = tet.","inline":true,"padRight":true},{"text":"Letting ","element":"span"},{"style":{"height":21.49},"width":197.61,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-14.png","element":"img","alt":" q(t)ab = etc(t)ab ","inline":true,"padRight":true},{"text":"and applying the above fractional Taylor expansion to ","element":"span"},{"style":{"height":17.23},"width":291.64,"height":43.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-15.png","element":"img","alt":" T and ˙T , we have","inline":true}],[{"style":{"width":"67%"},"width":1317,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/17-16.png","element":"img"}],[{"text":"Ignoring the higher-order terms and setting ","element":"span"},{"style":{"height":21.49},"width":425.8,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-0.png","element":"img","alt":" y(t) = (1 − c(t)ab ), we have","inline":true}],[{"style":{"width":"99%"},"width":1941,"height":290,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-1.png","element":"img"}],[{"text":"Applying this estimate to Equation ","element":"span"},{"href":"#id-72","text":"97 ","element":"a"},{"text":"gives","element":"span"}],[{"text":"Thus the limiting condition number of the NTK is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m/","element":"span"},{"text":"3 + 1","element":"span"},{"text":". 
This is the same as the above non-residual Relu case, although the entries of ","element":"span"},{"style":{"height":14.58},"width":214.83,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-2.png","element":"img","alt":" K(t) and Θ(t) ","inline":true,"padRight":true},{"text":"blow up exponentially with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.3. Residual Relu + Layer Norm","element":"span"}],[{"text":"As we saw above, all the entries of ","element":"span"},{"style":{"height":14.18},"width":65.63,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-3.png","element":"img","alt":" K(l)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-4.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"of a residual Relu network blow up exponentially, and so do its gradients. In what follows, we show that normalization can help avoid this issue. We consider the following “continuum” residual network with “layer norm”","element":"span"}],[{"style":{"width":"69%"},"width":1347,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-5.png","element":"img"}],[{"text":"We also set ","element":"span"},{"style":{"height":17.38},"width":473.98,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-6.png","element":"img","alt":" σ2ω = 2 (i.e. E[WW T ] = 2Id","inline":true},{"text":"). 
The normalization term ","element":"span"},{"style":{"height":22.73},"width":95.13,"height":56.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-7.png","element":"img","alt":"1/√(1+dt)","inline":true,"padRight":true},{"text":"ensures that ","element":"span"},{"style":{"height":14.18},"width":112.36,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-8.png","element":"img","alt":" x(t+dt) ","inline":true,"padRight":true},{"text":"has unit norm and removes the exponential factor ","element":"span"},{"style":{"height":12.98},"width":30.56,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-9.png","element":"img","alt":" et ","inline":true,"padRight":true},{"text":"from both the NNGP and the NTK. To see this, note that","element":"span"}],[{"style":{"width":"72%"},"width":1406,"height":185,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-10.png","element":"img"}],[{"text":"Taking the limit ","element":"span"},{"style":{"height":14.4},"width":210.04,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-11.png","element":"img","alt":" dt → 0 gives","inline":true}],[{"text":"Using the fact that ","element":"span"},{"style":{"height":17.38},"width":135.16,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-12.png","element":"img","alt":" q(0) = 1","inline":true,"padRight":true},{"text":"(i.e. 
the inputs have unit variance) and the mapping ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is norm preserving, we see that ","element":"span"},{"style":{"height":17.38},"width":131.31,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-13.png","element":"img","alt":" q(t) = 1","inline":true,"padRight":true},{"text":"because","element":"span"}],[{"style":{"width":"61%"},"width":1206,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-14.png","element":"img"}],[{"text":"This implies ","element":"span"},{"style":{"height":17.39},"width":126.34,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-15.png","element":"img","alt":" p(t) = t","inline":true,"padRight":true},{"text":"(note that ","element":"span"},{"style":{"height":17.39},"width":243.88,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-16.png","element":"img","alt":" ˙p(t) = q(t) = 1","inline":true,"padRight":true},{"text":"and we assume the initial value ","element":"span"},{"style":{"height":17.39},"width":136.22,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-17.png","element":"img","alt":" p(0) = 0","inline":true},{"text":".) The off-diagonal terms can be computed similarly and","element":"span"}],[{"style":{"width":"61%"},"width":1205,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-18.png","element":"img"}],[{"text":"Thus the condition number of the NTK is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m/","element":"span"},{"text":"3 + 1","element":"span"},{"text":". This is the same as the non-residual Relu case discussed above.","element":"span"}]]},{"heading":"D. 
Asymptotic of P(Θ(l))","paragraphs":[[{"text":"To keep the notation simple, we denote ","element":"span"},{"style":{"height":13.2},"width":547.59,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-19.png","element":"img","alt":" Xd = Xtrain, Yd = Ytrain, Θtd = Θ","inline":true},{"text":"test, train","element":"span"},{"text":", ","element":"span"},{"style":{"height":13.19},"width":150.32,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-20.png","element":"img","alt":" Θdd = Θ","inline":true},{"text":"train, train","element":"span"},{"text":". Recall that","element":"span"}],[{"style":{"width":"64%"},"width":1252,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/18-21.png","element":"img"}],[{"text":"We split our calculation into three parts.","element":"span"}],[{"id":"id-52","style":{"fontWeight":"bold"},"text":"D.1. Chaotic phase","element":"span"}],[{"text":"In this case the diagonal ","element":"span"},{"style":{"height":17.39},"width":54.74,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-0.png","element":"img","alt":" p(l)","inline":true,"padRight":true},{"text":"diverges exponentially and the off-diagonals ","element":"span"},{"style":{"height":21.49},"width":54.74,"height":53.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-1.png","element":"img","alt":" p(l)ab","inline":true,"padRight":true},{"text":"converge to a bounded constant ","element":"span"},{"style":{"height":15.5},"width":51.34,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-2.png","element":"img","alt":" p∗ab","inline":true},{"text":". 
We ","element":"span"},{"text":"further assume the input labels are centered in the sense that ","element":"span"},{"style":{"height":13.19},"width":40.13,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-3.png","element":"img","alt":" Yd","inline":true,"padRight":true},{"text":"contains the same number of positive (+1) and negative (-1) labels","element":"span"},{"text":"4","element":"span"},{"text":". We expand ","element":"span"},{"style":{"height":14.19},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-4.png","element":"img","alt":" Θ(l) ","inline":true,"padRight":true},{"text":"about its “fixed point”","element":"span"}],[{"style":{"width":"84%"},"width":1640,"height":487,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-5.png","element":"img"}],[{"text":"In the last equation, we have used the fact that ","element":"span"},{"style":{"height":17.9},"width":587.5,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-6.png","element":"img","alt":" 11T Yd = 0 and Θ∗tdYd = 0 since Yd","inline":true,"padRight":true},{"text":"is balanced. Therefore","element":"span"}],[{"style":{"width":"70%"},"width":1363,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-7.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Remark 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Without centering the labels ","element":"span"},{"style":{"height":13.19},"width":40.14,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-8.png","element":"img","alt":" Yd","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and normalizing each input in ","element":"span"},{"style":{"height":13.19},"width":50.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-9.png","element":"img","alt":" Xd","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"to have the same variance, we will get a ","element":"span"},{"style":{"height":17.34},"width":40.94,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-10.png","element":"img","alt":" χl1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"decay for ","element":"span"},{"style":{"height":18.19},"width":170.38,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-11.png","element":"img","alt":" P(Θ(l))Yd","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"instead of ","element":"span"},{"style":{"height":17.39},"width":186.46,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-12.png","element":"img","alt":" l(χc∗/χ1)l.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"D.2. Critical line","element":"span"}],[{"text":"Note that in this phase, both the diagonals and the off-diagonals diverge linearly. 
In this case","element":"span"}],[{"style":{"width":"75%"},"width":1471,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-13.png","element":"img"}],[{"text":"Here we use ","element":"span"},{"style":{"height":12.79},"width":39.92,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-14.png","element":"img","alt":" 1d","inline":true,"padRight":true},{"text":"to denote the all ‘1’ (column) vector with length equal to the number of training points in ","element":"span"},{"style":{"height":13.19},"width":50.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-15.png","element":"img","alt":" Xd","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.79},"width":34.92,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-16.png","element":"img","alt":" 1t","inline":true,"padRight":true},{"text":"is defined similarly. Note that the constant matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"is invertible. By Equation ","element":"span"},{"href":"#id-73","text":"77","element":"a"}],[{"style":{"width":"99%"},"width":1943,"height":594,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-17.png","element":"img"}],[{"id":"id-57","style":{"fontWeight":"bold"},"text":"D.3. 
Ordered Phase","element":"span"}],[{"text":"In the ordered phase, we have that ","element":"span"},{"style":{"height":21.49},"width":600.85,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-18.png","element":"img","alt":" Θ(l)dd = p∗1d1Td +lχl1A(l)dd where A(l)dd","inline":true},{"text":", a symmetric matrix, represents the data-dependent ","element":"span"},{"text":"piece of ","element":"span"},{"href":"#id-56","style":{"height":21.49},"width":689.88,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-19.png","element":"img","alt":" Θ(l)dd. By Lemma 2, A(l)dd → Add as l → ∞","inline":true},{"text":". To simplify the notation, in the calculation below we will replace ","element":"span"},{"style":{"height":21.49},"width":69.34,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-20.png","element":"img","alt":" A(l)dd","inline":true,"padRight":true},{"text":"by ","element":"span"},{"style":{"height":13.99},"width":68.24,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-21.png","element":"img","alt":" Add","inline":true},{"text":". We also assume ","element":"span"},{"style":{"height":13.99},"width":68.25,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-22.png","element":"img","alt":" Add","inline":true,"padRight":true},{"text":"is invertible. 
To compute the mean predictor, ","element":"span"},{"style":{"height":18.18},"width":130.75,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/19-23.png","element":"img","alt":" P(Θ(l))","inline":true},{"text":", asymptotically, we begin by computing","element":"span"}],[{"style":{"width":"99%"},"width":1943,"height":421,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-0.png","element":"img"}],[{"text":"where we have set","element":"span"}],[{"text":"Note that there is no divergence in ","element":"span"},{"style":{"height":18.18},"width":130.75,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-1.png","element":"img","alt":" P(Θ(l))","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":11.2},"width":114.68,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-2.png","element":"img","alt":" l → ∞","inline":true,"padRight":true},{"text":"and the limit is well-defined. The term ","element":"span"},{"style":{"height":16.58},"width":105.21,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-3.png","element":"img","alt":" ˆp1taT","inline":true,"padRight":true},{"text":"is independent of the input data.","element":"span"}],[{"style":{"width":"84%"},"width":1637,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-4.png","element":"img"}],[{"text":"We therefore see that even in the infinite-depth limit the mean predictor retains its data-dependence and we expect these networks to be able to generalize indefinitely.","element":"span"}]]},{"heading":"E. Dropout","paragraphs":[[{"text":"In this section, we investigate the effect of adding a dropout layer to the penultimate layer. 
Let ","element":"span"},{"style":{"height":23.52},"width":465.45,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-5.png","element":"img","alt":" 0 < ρ ≤ 1 and γ(L)j (x) be iid","inline":true,"padRight":true},{"text":"random variables","element":"span"}],[{"style":{"width":"99%"},"width":1942,"height":294,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-6.png","element":"img"}],[{"text":"and for the output layer,","element":"span"}],[{"text":"where ","element":"span"},{"style":{"height":23.52},"width":77.86,"height":58.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-7.png","element":"img","alt":" W (l)ij","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.12},"width":51.8,"height":52.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-8.png","element":"img","alt":" b(l)i","inline":true,"padRight":true},{"text":"are iid Gaussians ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1)","element":"span"},{"text":". 
Since no dropout is applied in the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"layers, the NNGP kernel ","element":"span"},{"style":{"height":14.18},"width":65.63,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-9.png","element":"img","alt":" K(l)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":65.69,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-10.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"can be computed using Equation ","element":"span"},{"href":"#id-74","text":"20 ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-49","text":"8","element":"a"},{"text":". Let ","element":"span"},{"style":{"height":20.94},"width":117.62,"height":52.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-11.png","element":"img","alt":" K(L+1)ρ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.94},"width":117.68,"height":52.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-12.png","element":"img","alt":" Θ(L+1)ρ","inline":true,"padRight":true},{"text":"denote the NNGP and NTK of the ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"+ 1)","element":"span"},{"text":"-th layer. Note that when ","element":"span"},{"style":{"height":20.93},"width":772.56,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-13.png","element":"img","alt":" ρ = 1, K(L+1)1 = K(L+1) and Θ(L+1)1 = Θ(L+1) ","inline":true,"padRight":true},{"text":". We will compute the correction induced by ","element":"span"},{"style":{"height":14.4},"width":248.17,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-14.png","element":"img","alt":" ρ < 1. 
The fact","inline":true}],[{"style":{"width":"71%"},"width":1389,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/20-15.png","element":"img"}],[{"text":"implies that the NNGP kernel ","element":"span"},{"style":{"height":20.94},"width":143.01,"height":52.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-0.png","element":"img","alt":" K(L+1)ρ (","inline":true},{"href":"#id-12","referenceIndex":38,"text":"Schoenholz et al.","element":"a"},{"href":"#id-12","referenceIndex":38,"text":", ","element":"a"},{"href":"#id-12","referenceIndex":38,"text":"2017","element":"a"},{"text":") is","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"σ","element":"span"},{"style":{"height":18.7},"width":598.99,"height":46.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-1.png","element":"img","alt":"2wT (K(L)(x, x′)) + σ2b, if x ̸= x′","inline":true}],[{"style":{"width":"99%"},"width":929,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-2.png","element":"img"}],[{"text":"Now we compute the NTK ","element":"span"},{"style":{"height":20.94},"width":117.68,"height":52.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-3.png","element":"img","alt":" Θ(L+1)ρ","inline":true,"padRight":true},{"text":", which is a sum of two terms","element":"span"}],[{"id":"id-75","style":{"width":"88%"},"width":1726,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-4.png","element":"img"}],[{"text":"Here ","element":"span"},{"style":{"height":14.18},"width":106.5,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-5.png","element":"img","alt":" θ(L+1) ","inline":true,"padRight":true},{"text":"denotes the parameters in the ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"+ 1) ","element":"span"},{"text":"layer, namely, 
","element":"span"},{"style":{"height":23.52},"width":484.93,"height":58.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-6.png","element":"img","alt":" W (L+1)ij and b(L+1)i and θ(≤L) ","inline":true,"padRight":true},{"text":"the remaining parameters.","element":"span"}],[{"text":"Note that the first term in Equation ","element":"span"},{"href":"#id-75","text":"138 ","element":"a"},{"text":"is equal to ","element":"span"},{"style":{"height":20.94},"width":225.99,"height":52.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-7.png","element":"img","alt":" K(L+1)ρ (x, x′)","inline":true},{"text":". Using the chain rule, the second term is equal to","element":"span"}],[{"style":{"width":"88%"},"width":832,"height":656,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-8.png","element":"img"}],[{"text":"In sum, we see that dropout only modifies the diagonal terms","element":"span"}],[{"style":{"width":"40%"},"width":379,"height":178,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-9.png","element":"img"}],[{"text":"In the ordered phase, we see","element":"span"}],[{"style":{"width":"53%"},"width":502,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-10.png","element":"img"}],[{"id":"id-76","text":"and the condition number","element":"span"}],[{"text":"In Fig ","element":"span"},{"href":"#id-60","text":"4","element":"a"},{"text":", we plot the evolution of ","element":"span"},{"style":{"height":20.95},"width":537.35,"height":52.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-11.png","element":"img","alt":" κ(L)ρ for ρ = 0.8, 0.95, 0.99 and 1","inline":true},{"text":", confirming Equation ","element":"span"},{"href":"#id-76","text":"145","element":"a"},{"text":".","element":"span"}]]},{"heading":"F. 
Convolutions","paragraphs":[[{"text":"In this section, we compute the evolution of ","element":"span"},{"style":{"height":14.58},"width":244.15,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/21-12.png","element":"img","alt":" Θ(l) for CNNs.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"General setup. ","element":"span"},{"text":"For simplicity of presentation we consider 1D convolutional networks with circular padding as in ","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"Xiao ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"et al. ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"(","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2018","element":"a"},{"text":"). This setup reduces to the fully-connected case introduced above if the image size is set to one, so many of the same concepts and equations carry over schematically from the fully-connected case. The theory of two- or higher-dimensional convolutions proceeds identically but with more indices.","element":"span"}],[{"id":"id-60","style":{"width":"49%"},"width":972,"height":729,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 4. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Dropout improves conditioning of the NTK. 
","element":"figcaption","subtype":"caption"},{"text":"In the ordered phase, the condition number ","element":"figcaption","subtype":"caption"},{"style":{"height":13.29},"width":54.35,"height":33.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-1.png","element":"img","alt":" κ(l) ","inline":true,"padRight":true},{"text":"explodes exponentially (yellow) as ","element":"figcaption","subtype":"caption"},{"style":{"height":10.4},"width":105.92,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-2.png","element":"img","alt":" l → ∞","inline":true},{"text":". However, a dropout layer can significantly improve the conditioning, making ","element":"figcaption","subtype":"caption"},{"style":{"height":13.29},"width":54.36,"height":33.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-3.png","element":"img","alt":" κ(l) ","inline":true,"padRight":true},{"text":"converge to a finite constant (horizontal lines); see Equation ","element":"figcaption","subtype":"caption"},{"href":"#id-76","text":"145","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Random weights and biases. 
","element":"span"},{"text":"The parameters of the network are the convolutional filters and biases, ","element":"span"},{"style":{"height":23.89},"width":350.04,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-4.png","element":"img","alt":" ω(l)ij,β and µ(l)i , respectively,","inline":true,"padRight":true},{"text":"with outgoing (incoming) channel index ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":") and filter relative spatial location ","element":"span"},{"style":{"height":17.79},"width":595.86,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-5.png","element":"img","alt":" β ∈ [±k] ≡ {−k, . . . , 0, . . . , k}.5 As","inline":true,"padRight":true},{"text":"above, we will assume a Gaussian prior on both the filter weights and biases,","element":"span"}],[{"style":{"width":"88%"},"width":1727,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-6.png","element":"img"}],[{"text":"As above, ","element":"span"},{"style":{"height":17.9},"width":160.74,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-7.png","element":"img","alt":" σ2ω and σ2b ","inline":true,"padRight":true},{"text":"are hyperparameters that control the variance of the weights and biases respectively. 
","element":"span"},{"style":{"height":14.19},"width":71.06,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-8.png","element":"img","alt":" N (l)","inline":true,"padRight":true},{"text":"is the number ","element":"span"},{"text":"of channels (filters) in layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"is the filter size.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Inputs, pre-activations, and activations. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"denote a set of input images. The network has activations ","element":"span"},{"style":{"height":18.19},"width":112.38,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-9.png","element":"img","alt":" y(l)(x)","inline":true,"padRight":true},{"text":"and pre-activations ","element":"span"},{"style":{"height":18.18},"width":111.7,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-10.png","element":"img","alt":" z(l)(x)","inline":true,"padRight":true},{"text":"for each input image ","element":"span"},{"style":{"height":19},"width":270.05,"height":47.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-11.png","element":"img","alt":" x ∈ X ⊆ RN (0)d","inline":true},{"text":", with input channel count ","element":"span"},{"style":{"height":14.98},"width":156.86,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-12.png","element":"img","alt":" N (0) ∈ N","inline":true},{"text":", number of pixels 
","element":"span"},{"style":{"height":13.2},"width":108.24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-13.png","element":"img","alt":" d ∈ N,","inline":true,"padRight":true},{"text":"where","element":"span"}],[{"style":{"width":"89%"},"width":1747,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-14.png","element":"img"}],[{"style":{"height":14},"width":176.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-15.png","element":"img","alt":"φ : R → R","inline":true,"padRight":true},{"text":"is a point-wise activation function. Since we assume circular padding for all the convolutional layers, the spatial size ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"remains constant throughout the network until the readout layer.","element":"span"}],[{"text":"For each ","element":"span"},{"style":{"height":18.19},"width":845.63,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-16.png","element":"img","alt":" l > 0, as min{N 1 . . . 
, N (l−1)} → ∞, for each i ∈ N","inline":true},{"text":", the pre-activation converges in distribution to a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"-dimensional Gaussian with mean ","element":"span"},{"style":{"fontWeight":"bold"},"text":"0 ","element":"span"},{"text":"and covariance matrix ","element":"span"},{"style":{"height":14.19},"width":65.64,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-17.png","element":"img","alt":" K(l)","inline":true},{"text":", which can be computed recursively (","element":"span"},{"href":"#id-23","referenceIndex":31,"text":"Novak et al.","element":"a"},{"href":"#id-23","referenceIndex":31,"text":", ","element":"a"},{"href":"#id-23","referenceIndex":31,"text":"2019b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"Xiao et al.","element":"a"},{"href":"#id-13","referenceIndex":41,"text":", ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2018","element":"a"},{"text":")","element":"span"}],[{"id":"id-79","style":{"width":"76%"},"width":1483,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-18.png","element":"img"}],[{"text":"Here ","element":"span"},{"style":{"height":23.76},"width":630.84,"height":59.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-19.png","element":"img","alt":" K(l) ≡ [K(l)α,α′(x, x′)]α,α′∈[d],x,x′∈X , T","inline":true,"padRight":true},{"text":"is a non-linear transformation related to its fully-connected counterpart, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is a convolution acting on ","element":"span"},{"style":{"height":10.8},"width":158.97,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-20.png","element":"img","alt":" Xd × Xd","inline":true,"padRight":true},{"text":"PSD 
matrices","element":"span"}],[{"id":"id-77","style":{"width":"99%"},"width":1943,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/22-21.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"F.1. The Neural Tangent Kernel","element":"span"}],[{"text":"To understand how the neural tangent kernel evolves with depth, we define the NTK of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"-th hidden layer to be ","element":"span"},{"style":{"height":14.83},"width":65.69,"height":37.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-0.png","element":"img","alt":"ˆΘ(l)","inline":true}],[{"style":{"width":"68%"},"width":1340,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.38},"width":54.72,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-2.png","element":"img","alt":" θ≤l ","inline":true,"padRight":true},{"text":"denotes all of the parameters in layers at-or-below the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"’th layer. 
It does not matter which channel index ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is used because as the number of channels approaches infinity, this kernel will also converge in distribution to a deterministic kernel ","element":"span"},{"href":"#id-25","referenceIndex":42,"style":{"height":17.79},"width":311.66,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-3.png","element":"img","alt":"Θ(l+1) (Yang, 2019","inline":true},{"text":"), which can also be computed recursively in a similar manner to the NTK for fully-connected networks as (","element":"span"},{"href":"#id-25","referenceIndex":42,"text":"Yang","element":"a"},{"href":"#id-25","referenceIndex":42,"text":", ","element":"a"},{"href":"#id-25","referenceIndex":42,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-46","referenceIndex":2,"text":"Arora et al.","element":"a"},{"href":"#id-46","referenceIndex":2,"text":", ","element":"a"},{"href":"#id-46","referenceIndex":2,"text":"2019","element":"a"},{"text":"),","element":"span"}],[{"id":"id-78","style":{"width":"68%"},"width":1330,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.03},"width":34,"height":40.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-5.png","element":"img","alt":"˙T","inline":true,"padRight":true},{"text":"is given by Equation ","element":"span"},{"href":"#id-77","text":"149 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-6.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"replaced by its derivative 
","element":"span"},{"style":{"height":14},"width":37.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-7.png","element":"img","alt":" φ′","inline":true},{"text":". We will also normalize the variance of the inputs to ","element":"span"},{"style":{"height":14.18},"width":35.22,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-8.png","element":"img","alt":"q∗ ","inline":true,"padRight":true},{"text":"and hence treat ","element":"span"},{"style":{"height":16.03},"width":141.74,"height":40.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-9.png","element":"img","alt":" T and ˙T","inline":true,"padRight":true},{"text":"as pointwise functions. We will only present the treatment in the chaotic phase to showcase how to deal with the operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". The treatment of the other phases is similar. Note that the diagonal entries of ","element":"span"},{"style":{"height":14.18},"width":65.63,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-10.png","element":"img","alt":" K(l)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":65.68,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-11.png","element":"img","alt":" Θ(l)","inline":true,"padRight":true},{"text":"are exactly the same as in the fully-connected setting, which are ","element":"span"},{"style":{"height":14.18},"width":35.22,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-12.png","element":"img","alt":" q∗","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":160.6,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-13.png","element":"img","alt":" p(l) = lq∗","inline":true},{"text":", respectively. 
We only need to consider the off-diagonal terms. Letting ","element":"span"},{"style":{"height":11.2},"width":114.68,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-14.png","element":"img","alt":" l → ∞","inline":true,"padRight":true},{"text":"in Equation ","element":"span"},{"href":"#id-78","text":"152 ","element":"a"},{"text":"we see that all the off-diagonal terms also converge to ","element":"span"},{"style":{"height":15.5},"width":51.34,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-15.png","element":"img","alt":" p∗ab","inline":true},{"text":". Note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"does not mix terms from different diagonals and it suffices to handle each off-diagonal separately. Let ","element":"span"},{"style":{"height":21.49},"width":50.87,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-16.png","element":"img","alt":" ϵ(l)ab","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.49},"width":53.91,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-17.png","element":"img","alt":" δ(l)ab","inline":true,"padRight":true},{"text":"denote ","element":"span"},{"text":"the correction of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th diagonal of ","element":"span"},{"style":{"height":14.59},"width":211.24,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-18.png","element":"img","alt":" K(l) and Θ(l) ","inline":true,"padRight":true},{"text":"to the fixed points. 
Linearizing Equation ","element":"span"},{"href":"#id-79","text":"148 ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-78","text":"152 ","element":"a"},{"text":"gives","element":"span"}],[{"style":{"width":"68%"},"width":1337,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-19.png","element":"img"}],[{"text":"Next let ","element":"span"},{"style":{"height":16},"width":104.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-20.png","element":"img","alt":" {ρα}α","inline":true,"padRight":true},{"text":"be the eigenvalues of ","element":"span"},{"style":{"height":23.89},"width":343.86,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-21.png","element":"img","alt":" A and ϵ(l)ab,α and δ(l)ab,α ","inline":true,"padRight":true},{"text":"be the projection of ","element":"span"},{"style":{"height":21.49},"width":359.49,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-22.png","element":"img","alt":" ϵ(l)ab and δ(l)ab onto the α","inline":true},{"text":"-th eigenvector of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", ","element":"span"},{"text":"respectively. 
Then for each ","element":"span"},{"style":{"height":9.2},"width":35.64,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-23.png","element":"img","alt":" α,","inline":true}],[{"style":{"width":"69%"},"width":1354,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-24.png","element":"img"}],[{"text":"which gives","element":"span"}],[{"text":"Therefore, the correction ","element":"span"},{"style":{"height":14.18},"width":166.67,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-25.png","element":"img","alt":" Θ(l) − Θ∗","inline":true,"padRight":true},{"text":"propagates independently through different Fourier modes. In each mode, up to a scaling factor ","element":"span"},{"style":{"height":17.32},"width":41.6,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-26.png","element":"img","alt":" ρlα","inline":true},{"text":", the correction is the same as the correction of FCN. Since the subdominant modes (with ","element":"span"},{"style":{"height":16},"width":254.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-27.png","element":"img","alt":" |ρα| < 1) decay","inline":true,"padRight":true},{"text":"exponentially faster than the dominant mode (with ","element":"span"},{"style":{"height":14},"width":116.46,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-28.png","element":"img","alt":" ρα = 1","inline":true},{"text":"), for large depth, the NTK of CNN is essentially the same as that of FCN.","element":"span"}],[{"id":"id-59","style":{"fontWeight":"bold"},"text":"F.2. The effect of pooling and flattening of CNNs","element":"span"}],[{"text":"With the bulk of the theory in hand, we now turn our attention to CNN-F and CNN-P. 
We have shown that the dominant mode in CNNs behaves exactly like the fully-connected case; however, we will see that the readout can significantly affect the spectrum. The NNGP and NTK of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"-th hidden layer CNN are 4D tensors ","element":"span"},{"style":{"height":23.76},"width":199.98,"height":59.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-29.png","element":"img","alt":" K(l)α,α′(x, x′)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":23.76},"width":200.61,"height":59.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-30.png","element":"img","alt":" Θ(l)α,α′(x, x′)","inline":true},{"text":", ","element":"span"},{"text":"where ","element":"span"},{"style":{"height":16},"width":484,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-31.png","element":"img","alt":" α, α′ ∈ [d] ≡ [0, 1, . . . , d − 1]","inline":true,"padRight":true},{"text":"denote the pixel locations. To perform tasks like image classification or regression, “flattening” and “pooling” (more precisely, global average pooling) are two popular readout strategies that transform the last convolution layer into the logits layer. The former strategy “flattens” an image of size ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"d, N","element":"span"},{"text":") ","element":"span"},{"text":"into a vector in ","element":"span"},{"style":{"height":13.78},"width":143.12,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/23-32.png","element":"img","alt":" RdN and","inline":true,"padRight":true},{"text":"stacks a fully-connected layer on top. 
The latter projects the ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"d, N","element":"span"},{"text":") ","element":"span"},{"text":"image into a vector of dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"via averaging out the spatial dimension and then stacks a fully-connected layer on top. The actions of “flattening” and “pooling” on the image correspond to computing the mean of the trace and the mean of the pixel-to-pixel covariance on the NNGP/NTK,","element":"span"}],[{"text":"respectively, i.e.,","element":"span"}],[{"text":"where ","element":"span"},{"style":{"height":21.49},"width":122,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-0.png","element":"img","alt":" Θ(l)flatten","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":23.89},"width":91.1,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-1.png","element":"img","alt":"Θ(l)pool","inline":true},{"text":") denotes the NTK right after flattening (pooling) the last convolution. We will also use ","element":"span"},{"style":{"height":21.49},"width":65.69,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-2.png","element":"img","alt":" Θ(l)fc","inline":true,"padRight":true},{"text":"to ","element":"span"},{"text":"denote the NTK of FC. ","element":"span"},{"style":{"height":23.89},"width":377.69,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-3.png","element":"img","alt":" K(l)flatten, K(l)pool and K(l)fc ","inline":true,"padRight":true},{"text":"are defined similarly. 
As discussed above, in the large depth setting, all the ","element":"span"},{"text":"diagonals ","element":"span"},{"style":{"height":20.94},"width":286.02,"height":52.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-4.png","element":"img","alt":" Θ(l)α,α(x, x) = p(l)","inline":true,"padRight":true},{"text":"(since the inputs are normalized to have variance ","element":"span"},{"style":{"height":14.18},"width":35.22,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-5.png","element":"img","alt":" q∗","inline":true,"padRight":true},{"text":"for each pixel) and similar to ","element":"span"},{"style":{"height":21.49},"width":186.28,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-6.png","element":"img","alt":" Θ(l)fc , all the","inline":true,"padRight":true},{"text":"off-diagonals ","element":"span"},{"style":{"height":23.76},"width":200.61,"height":59.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-7.png","element":"img","alt":" Θ(l)α′,α(x, x′)","inline":true,"padRight":true},{"text":"are almost equal (in the sense that they have the same order of correction to ","element":"span"},{"style":{"height":15.5},"width":51.34,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-8.png","element":"img","alt":" p∗ab","inline":true,"padRight":true},{"text":"if it exists). 
Without ","element":"span"},{"text":"loss of generality, we assume all off-diagonals are the same and equal to ","element":"span"},{"style":{"height":21.49},"width":54.74,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-9.png","element":"img","alt":" p(l)ab","inline":true,"padRight":true},{"text":"(the leading correction of ","element":"span"},{"style":{"height":21.49},"width":53.91,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-10.png","element":"img","alt":" q(l)ab","inline":true,"padRight":true},{"text":"for CNN and ","element":"span"},{"text":"FCN are of the same order.) Applying flattening and pooling, the NTKs become","element":"span"}],[{"style":{"width":"77%"},"width":1505,"height":226,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-11.png","element":"img"}],[{"text":"respectively. As we can see, ","element":"span"},{"style":{"height":21.49},"width":121.99,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-12.png","element":"img","alt":" Θ(l)flatten ","inline":true,"padRight":true},{"text":"is essentially the same as its FCN counterpart ","element":"span"},{"style":{"height":21.49},"width":65.69,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-13.png","element":"img","alt":" Θ(l)fc ","inline":true,"padRight":true},{"text":"up to sub-dominant Fourier modes ","element":"span"},{"text":"which decay exponentially faster than the dominant Fourier modes. 
Therefore the spectral properties of ","element":"span"},{"style":{"height":21.49},"width":266.6,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-14.png","element":"img","alt":" Θ(l)flatten and Θ(l)fc","inline":true,"padRight":true},{"text":"are essentially the same for large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"; see Figure ","element":"span"},{"href":"#id-51","text":"1 ","element":"a"},{"text":"(a - c).","element":"span"}],[{"text":"However, pooling alters the NTK/NNGP spectrum in an interesting way. Notably, the contribution from ","element":"span"},{"style":{"height":17.38},"width":54.74,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-15.png","element":"img","alt":" p(l) ","inline":true,"padRight":true},{"text":"is discounted by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":". On the critical line, asymptotically, the on- and off-diagonal terms are","element":"span"}],[{"style":{"width":"63%"},"width":1235,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-16.png","element":"img"}],[{"text":"This implies","element":"span"}],[{"text":"Here we use blue color to indicate the changes of such quantities against their ","element":"span"},{"style":{"height":21.49},"width":122,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-17.png","element":"img","alt":" Θ(l)flatten","inline":true,"padRight":true},{"text":"counterpart. 
Alternatively, one can ","element":"span"},{"text":"consider ","element":"span"},{"style":{"height":21.49},"width":121.99,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-18.png","element":"img","alt":" Θ(l)flatten","inline":true,"padRight":true},{"text":"as a special case (with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= 1","element":"span"},{"text":") of ","element":"span"},{"style":{"height":23.89},"width":91.1,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-19.png","element":"img","alt":" Θ(l)pool","inline":true},{"text":". Thus pooling decreases ","element":"span"},{"style":{"height":21.49},"width":84.46,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-20.png","element":"img","alt":" λ(l)bulk","inline":true,"padRight":true},{"text":"roughly by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"and ","element":"span"},{"text":"increases the condition number by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"compared to flattening. In the chaotic phase, pooling does not change the off-diagonals ","element":"span"},{"style":{"height":21.49},"width":193.75,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-21.png","element":"img","alt":" q(l)ab = O(1)","inline":true,"padRight":true},{"text":"but does slow down the growth of the diagonals by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", i.e. ","element":"span"},{"style":{"height":18.18},"width":258.13,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-22.png","element":"img","alt":" p(l) = O(χl1/d)","inline":true},{"text":". 
This ","element":"span"},{"text":"improves ","element":"span"},{"style":{"height":18.18},"width":130.75,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-23.png","element":"img","alt":" P(Θ(l))","inline":true,"padRight":true},{"text":"by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":". This suggests that, in the chaotic phase, there exists a transient regime of depths where CNN-F hardly performs while CNN-P performs well. In the ordered phase, pooling does not affect ","element":"span"},{"style":{"height":18.54},"width":82.34,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-24.png","element":"img","alt":" λ(l)max","inline":true,"padRight":true},{"text":"much but ","element":"span"},{"text":"does decrease ","element":"span"},{"style":{"height":21.49},"width":84.46,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-25.png","element":"img","alt":" λ(l)bulk ","inline":true,"padRight":true},{"text":"by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"and the condition number ","element":"span"},{"style":{"height":14.18},"width":57.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-26.png","element":"img","alt":" κ(l)","inline":true,"padRight":true},{"text":"grows approximately like ","element":"span"},{"style":{"height":18.67},"width":136.32,"height":46.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/24-27.png","element":"img","alt":" dlχ−l1 , d","inline":true,"padRight":true},{"text":"times bigger than its ","element":"span"},{"text":"flattening and fully-connected network counterparts. This suggests the existence of a transient regime of depths, in which CNN-F outperforms CNN-P. This might be surprising because it is commonly believed that CNN-P usually outperforms CNN-F. 
These statements are supported empirically in Figure ","element":"span"},{"href":"#id-53","text":"2","element":"a"},{"text":".","element":"span"}]]},{"heading":"G. Figure Zoo","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"G.1. Phase Diagrams: Figure ","element":"span"},{"href":"#id-43","style":{"fontWeight":"bold"},"text":"5","element":"a"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"text":"We plot the phase diagrams for the Erf function and the ","element":"span"},{"text":"tanh ","element":"span"},{"text":"function (adopted from (","element":"span"},{"href":"#id-80","referenceIndex":33,"text":"Pennington et al.","element":"a"},{"href":"#id-80","referenceIndex":33,"text":", ","element":"a"},{"href":"#id-80","referenceIndex":33,"text":"2018","element":"a"},{"text":")).","element":"span"}],[{"id":"id-43","style":{"width":"96%"},"width":1886,"height":925,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 5. ","element":"figcaption","subtype":"caption"},{"text":"Phase Diagram for ","element":"figcaption","subtype":"caption"},{"text":"tanh ","element":"figcaption","subtype":"caption"},{"text":"(left) and Erf (right).","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"G.2. 
SGD on FCN on Larger Dataset: Figure ","element":"span"},{"href":"#id-64","style":{"fontWeight":"bold"},"text":"6","element":"a"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"text":"We report the training and test accuracy of FCN trained on a subset (16k training points) of CIFAR-10 using SGD with 20 ","element":"span"},{"style":{"height":8},"width":31,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-1.png","element":"img","alt":" ×","inline":true,"padRight":true},{"text":"20 different ","element":"span"},{"style":{"height":17.38},"width":107.72,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-2.png","element":"img","alt":" (σ2ω, l)","inline":true,"padRight":true},{"text":"configurations.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.3. NNGP vs NTK prediction: Figure ","element":"span"},{"href":"#id-81","style":{"fontWeight":"bold"},"text":"7","element":"a"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"text":"Here we compare the test performance of the NNGP and NTK with different ","element":"span"},{"style":{"height":17.39},"width":107.72,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-3.png","element":"img","alt":" (σ2ω, l)","inline":true,"padRight":true},{"text":"configurations. In the chaotic phase, ","element":"span"},{"text":"the generalizable depth-scale of the NNGP is captured by ","element":"span"},{"style":{"height":16},"width":323.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-4.png","element":"img","alt":" ξc∗ = −1/ log(χc∗)","inline":true},{"text":". In contrast, the generalizable depth-scale of the NTK is captured by ","element":"span"},{"style":{"height":16},"width":507.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-5.png","element":"img","alt":" ξ∗ = −1/(log(χc∗) − log(χ1))","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":14},"width":117.17,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-6.png","element":"img","alt":" χ1 > 1","inline":true,"padRight":true},{"text":"in the chaotic phase, ","element":"span"},{"style":{"height":14},"width":138.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-7.png","element":"img","alt":" ξc∗ > ξ∗","inline":true},{"text":". Thus for larger depth, the NNGP kernel performs better than the NTK. Corrections due to an additional average pooling layer are plotted in the third column of Figure ","element":"span"},{"href":"#id-81","text":"7","element":"a"},{"text":".","element":"span"}],[{"id":"id-64","style":{"width":"99%"},"width":1944,"height":648,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 6. ","element":"figcaption","subtype":"caption"},{"text":"Training and Test Accuracy for FCN for different ","element":"figcaption","subtype":"caption"},{"style":{"height":16.09},"width":98.89,"height":40.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/25-9.png","element":"img","alt":" (σ2ω, l)","inline":true,"padRight":true},{"text":"configurations.","element":"figcaption","subtype":"caption"}],[{"id":"id-81","style":{"width":"99%"},"width":1946,"height":1041,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 7. ","element":"figcaption","subtype":"caption"},{"text":"Test Accuracy for NTK (top) and NNGP prediction for different ","element":"figcaption","subtype":"caption"},{"style":{"height":16.09},"width":98.89,"height":40.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-1.png","element":"img","alt":" (σ2ω, l)","inline":true,"padRight":true},{"text":"configurations. 
First/second column: CNN with/without ","element":"figcaption","subtype":"caption"},{"text":"pooling. Last column: difference between the first and second columns.","element":"figcaption","subtype":"caption"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"G.4. Densely Sweeping Over ","element":"span"},{"href":"#id-82","style":{"height":17.9},"width":209.84,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-2.png","element":"img","alt":" σ2b: Figure 8","inline":true}],[{"text":"We demonstrate that our prediction for the generalizable depth-scales for the NTK (","element":"span"},{"style":{"height":14},"width":33.43,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-3.png","element":"img","alt":"ξ∗","inline":true},{"text":") and NNGP (","element":"span"},{"style":{"height":14},"width":31.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-4.png","element":"img","alt":"ξc","inline":true},{"text":") are robust across a variety of hyperparameters. We densely sweep over 9 different values of ","element":"span"},{"style":{"height":17.9},"width":234.88,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-5.png","element":"img","alt":" σ2b ∈ [0.2, 1.8]","inline":true},{"text":". For each ","element":"span"},{"style":{"height":17.9},"width":40.2,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-6.png","element":"img","alt":" σ2b","inline":true,"padRight":true},{"text":"we compute the ","element":"span"},{"text":"NTK/NNGP test accuracy for 20 * 50 different configurations of (l, ","element":"span"},{"style":{"height":17.38},"width":674.05,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-7.png","element":"img","alt":" σ2ω) with l ∈ [1, 100] and σ2ω ∈ [0.12, 4.92]","inline":true},{"text":". 
The training ","element":"span"},{"text":"set is an ","element":"span"},{"text":"8","element":"span"},{"text":"k subset of CIFAR-10.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.5. Densely Sweeping Over the Regularization Strength ","element":"span"},{"href":"#id-83","style":{"height":14.8},"width":191.97,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-8.png","element":"img","alt":" σ: Figure 9","inline":true}],[{"text":"Similar to the above setup, we fix ","element":"span"},{"style":{"height":17.9},"width":146.21,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-9.png","element":"img","alt":" σ2b = 1.6","inline":true,"padRight":true},{"text":"and densely vary ","element":"span"},{"style":{"height":17.38},"width":389.31,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/26-10.png","element":"img","alt":" σ ∈ {0, 10−6, . . . , 100}.","inline":true}],[{"id":"id-82","style":{"width":"99%"},"width":1932,"height":2053,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/27-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 8. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Generalization metrics for NTK/NNGP vs Test Accuracy vs ","element":"figcaption","subtype":"caption"},{"style":{"height":15.46},"width":47.97,"height":38.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/27-1.png","element":"img","alt":" σ2b.","inline":true}],[{"id":"id-83","style":{"width":"99%"},"width":1932,"height":1643,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/28-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 9. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Generalization metrics for NTK/NNGP vs Test Accuracy vs ","element":"figcaption","subtype":"caption"},{"style":{"height":6.4},"width":31.36,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1912.13053/images/28-1.png","element":"img","alt":" σ.","inline":true}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]