b

DiscoverSearch
About
My stuff
Theoretical Understanding of Batch-normalization: A Markov Chain Perspective
2020·arXiv
Abstract
Abstract

Randomly initialized neural networks are known to become harder to train with increasing depth, unless architectural enhancements like residual connections and batch normalization are used. We here investigate this phenomenon by revisiting the connection between random initialization in deep networks and spectral instabilities in products of random matrices. Given the rich literature on random matrices, it is not surprising to find that the rank of the intermediate representations in unnormalized networks collapses quickly with depth. In this work we highlight the fact that batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks. Leveraging tools from Markov chain theory, we derive a meaningful lower rank bound in deep linear networks. Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets.Finally, we conduct an extensive set of experiments on real-world data sets, which confirm that rank stability is indeed a crucial condition for training modern-day deep neural architectures.

Depth is known to play an important role in the expressive power of neural networks [27]. Yet, increased depth typically leads to a drastic slow down of learning with gradient-based methods, which is commonly attributed to unstable gradient norms in deep networks [14]. One key aspect of the training process concerns the way the layer weights are initialized. When training contemporary neural networks, both practitioners and theoreticians advocate the use of randomly initialized layer weights with i.i.d. entries from a zero mean (Gaussian or uniform) distribution. This initialization strategy is commonly scaled such that the variance of the layer activation stays constant across layers [12, 13]. However, this approach can not avoid spectral instabilities as the depth of the network increases. For example,

[25] observes that for linear neural networks, such initialization lets all but one singular values of the last layers activation collapse towards zero as the depth increases.

Nevertheless, recent advances in neural architectures have allowed the training of very deep neural networks with standard i.i.d. initialization schemes despite the above mentioned shortcomings. Among these, both residual connections and normalization layers have proven particularly effective and are thus in widespread use (see [16, 23, 13] to name just a few). Our goal here is to bridge the explanatory gap between these two observations by studying the effect of architectural enhancements on the spectral properties of randomly initialized neural networks. We also provide evidence for a strong link of the latter with the performance of gradient-based optimization algorithms.

One particularly interesting architectural component of modern day neural networks is Batch Normalization (Bn) [16]. This simple heuristics that normalizes the pre-activation of hidden units across a mini-batch, has proven tremendously effective when training deep neural networks with gradient-based methods. Yet, despite of its ubiquitous use and strong empirical benefits, the research community has not yet reached a broad consensus, when it comes to a theoretical explanation for its practical success. Recently, several alternatives to the original “internal covariate shift” hypothesis [16] have appeared in the literature: decoupling optimization of direction and length of the parameters [19], auto-tuning of the learning rate for stochastic gradient descent [3], widening the learning rate range [6], alleviating sharpness of the Fisher information matrix [17], and smoothing the optimization landscape [24]. Yet, most of these candidate justifications are still actively debated within the community. For example, [24] first made a strong empirical case against the original internal covariate shift hypothesis. Secondly, they argued that batch normalization simplifies optimization by smoothing the loss landscape. However, their analysis is on a per-layer basis and treats only the largest eigenvalue. Furthermore, even more recent empirical studies again dispute these findings, by observing the exact opposite behaviour of Bn on a ResNet20 network [33].

1.1 On random initialization and gradient based training

In light of the above discussion, we take a step back – namely to the beginning of training – to find an interesting property that is provably present in batch normalized networks and can serve as a solid basis for a more complete theoretical understanding.

The difficulty of training randomly initialized, un-normalized deep networks with gradient methods is a long-known fact, that is commonly attributed to the so-called vanishing gradient effect, i.e., a decreasing gradient norm as the networks grow in depth (see, e.g., [26]). A more recent line of research tries to explain this effect by the condition number of the input-output Jacobian (see, e.g., [31, 32, 22, 6]). Here, we study the spectral properties of the above introduced initialization with a particular focus on the rank of the hidden layer activations over a batch of samples. The question at hand is whether or not the network preserves a diverse data representation which is necessary to disentangle the input in the final classification layer.

As a motivation, consider the results of Fig. 1, which plots accuracy and output rank when training batch-normalized and un-normalized neural networks of growing depth on the FashionMNIST dataset [30]. As can be seen, the rank in the last hidden layer of the vanilla networks collapses with depth and they are essentially unable to learn (in a limited number of epochs) as soon as the number of layers is above 10. The rank collapse indicates that the direction of the output vector has become independent of the actual input. In other words, the randomly initialized network no longer preserves information about the input. Batch-normalized networks, however, preserve a high rank across all network sizes and their training accuracy drops only very mildly as the networks reach depth 32.

The above example shows that both rank and optimization of even moderately-sized, unnormalized networks scale poorly with depth. Batch-normalization, however, stabilizes the rank in this setting and the obvious question is whether this effect is just a slow-down or even simply a numerical phenomenon, or whether it actually generalizes to networks of infinite depth.

In this work we make a strong case for the latter option by showing a remarkable stationarity aspect of Bn. Consider for example the case of passing  N samples xi ∈ Rd arranged column-wise in an input matrix  X ∈ Rd×N through a very deep network with fully-connected layers. Ideally, from an information propagation perspective, the network should be able to differentiate between

image

Figure 1: Effect of depth on rank and learning, on the Fashion-MNIST dataset with ReLU multilayer perceptrons (MLPs) of depth 1-32 and width 128 hidden units. Left: Rank2 after random initialization as in PyTorch [21, 12]. Right: Training accuracy after training 75 epochs with SGD, batch size 128 and grid-searched learning rate. Mean and 95% confidence interval of 5 independent runs.

individual samples, regardless of its depth [26]. However, as can be seen in Fig. 2, the hidden representation of X collapses to a rank one matrix in vanilla networks, thus mapping all  xito the same line in  Rd. Hence, the hidden layer activations and along with it the individual gradient directions become independent from the input  xias depth goes to infinity. We call this effect “directional” gradient vanishing (see Section 3 for a more thorough explanation).

Interestingly, this effect does not happen in batch-normalized networks, which yield – as we shall prove in Theorem 2 – a stable rank for any depth, thereby preserving a disentangled representation of the input and hence allowing the training of very deep networks. These results substantiate earlier empirical observations made by [6] for random Bn-nets, and also validates the claim that Bn helps with deep information propagation [26].

image

Figure 2: Rank comparison of last hidden activation: Log(rank) of the last hidden layer’s activation over total number of layers (blue for Bn- and orange for vanilla-networks) for Gaussian inputs. Networks are MLPs of width d = 32. (Left) Linear activations, (Right) ReLU activations. Mean and 95% confidence interval of 10 independent runs. While the rank quickly drops in depth for both networks, BN stabilizes the rank above√d.

1.2 Contributions

In summary, the work at hand makes the following two key contributions: (i) We theoretically prove that Bn indeed avoids rank collapse for deep linear neural nets under standard initialization and for any depth. In particular, we show that Bn can be seen as a

computationally cheap rank preservation operator, which may not yield hidden matrices with full rank but still preserves sufficient modes of variation in the data to achieve a scaling of the rank with  Ω(√d), where dis the width of the network. Subsequently, we leverage existing results from random matrix theory [8] to complete the picture with a simple proof of the above observed rank collapse for linear vanilla networks, which interestingly holds regardless of the presence of residual connections (Lemma 3). Finally, we connect the rank to difficulties in gradient based training of deep nets by showing that a rank collapse makes the directional component of the gradients independent of the input.

(ii) We empirically show that the rank is indeed a crucial quantity for gradient-based learning. In particular, we show that both the rank and the final training accuracy quickly diminish in depth unless Bn layers are incorporated in both simple feed-forward and convolutional neural nets. To take this reasoning beyond mere correlations, we actively intervene with the rank of networks before training and show that (a) one can break the training stability of Bn by initializing in a way that reduces its rank-preserving properties, and (b) a rank-increasing pre-training procedure for vanilla networks can recover their training ability even for large depth. Interestingly, our pre-training method allows vanilla SGD to outperform Bn on very deep MLPs. In all of our experiments, we find that SGD updates preserve the order of the initial rank throughout optimization, which underscores the importance of the rank at initialization for the entire convergence behavior.

Network description. We consider a given input  X ∈ Rd×Ncontaining N samples in  Rd. Let  1k ∈ Rk denote the k-dimensional all one vector and  H(γ)ℓdenote the hidden representation of  X in layer ℓ of a Bn-network with residual connections. The following recurrence summarizes the network mapping

image

where  Wℓ ∈ Rd×dand  γregulates the skip connection strength (in the limit,  γ = ∞recovers a network without skip connection)3. Throughout this work, we consider the network weights  Wℓto be initialized as follows.

Definition 1 (Standard weight initialization). The elements of weight matrices  Wℓare i.i.d. samples from a distribution P that has zero-mean, unit-variance, and its density is symmetric around zero4. We use the notation  µfor the probability distribution of the weight matrices.

We define the BN operator  Bnα,βas in the original paper [16], namely

image

where  ◦is a row-wise product. Both  α ∈ Rdand  β ∈ Rdare trainable parameters. Throughout this work we assume the initialization  α = 0 and β = 1d, and also omit corrections of the mean activity. As demonstrated empirically in Fig. 5, and theoretically in App. C this simplification does not change the performance of Bn in our settings.

Rank notions. To circumvent numerical issues involved in rank computations we introduce a soft notion of the rank denoted by rankτ(H)(soft rank). Specifically, let  σ1, . . . , σdbe the singular values of H. Then, given a  τ > 0, we define rankτ(H) as

image

Lemma 1. For an arbitrary matrix  H ∈ Rd×d, rank(H) ≥ r(H). For the sequence  {H(γ)ℓ }∞ℓ=1defined in Eq. (2), rankτ(H(γ)ℓ ) ≥ (1 − τ)2r(H(γ)ℓ ) holds for τ ∈ [0, 1].

Since our empirical observations hold equally for both non-linear and linear networks, we here focus on improving the theoretical understanding in the linear case, which constitutes a growing area of research [25, 18, 5, 2]. First, inspired by [9] and leveraging tools from Markov Chain theory, our main result proves that the rank of linear batch-normalized networks scales with their width as  Ω(√width). Secondly, we leverage results from random matrix theory [7] to contrast our main result to unnormalized linear networks which we show to provably collapse to rank one, even in the presence of residual connections.

3.1 Main result

In the following we state our main result which proves that batch normalization indeed prevents the rank of all hidden layer activations from collapsing to one. Please see Appendix E for the more formal version of this theorem statement.

Theorem 2. [Informal] Suppose that the rank(X) = d and that the weights  Wℓare initialized in a standard i.i.d. zero-mean fashion (see Def. 1). Then, the following limits exist such that

image

holds almost surely for a sufficiently small  γ(independent of  ℓ) and any  τ ∈ [0, 1), under some additional technical assumptions. Please see Theorem 14 in the Appendix for the formal statement.

Theorem 2 yields a non trivial width-dependency. Namely, by setting for example  τ := 1/2, the result states that the average number of singular values with absolute value greater than �N/2is at least  Ω(√d)on average. To put this into context: If one were to replace diag(M)−1/2by the  full inverse (M)−1/2 in Eq. (2), then Bnwould effectively constitute a classical whitening operation such that all  {H(γ)ℓ }Lℓ=1 would be full rank (equal to d). However, as noted in the original Bn paper [16], whitening is obviously expensive to compute and furthermore prohibitively costly to incorporate in back-propagation. As such, Bn can be seen as a computationally inexpensive approximation of whitening, which does not yield full rank hidden matrices but still preserves sufficient variation in the data to provide a rank scaling as  Ω(√d). Although the lower-bound in Thm. 2 is established on the average over infinite depth (i.e.,  L → ∞), Corollary 15 (in App. E) proves that the same bound holds for all rank(Hℓ) and rankτ(Hℓ).

Necessary assumptions. The above result relies on two key assumptions: (i) First, the input X needs to be full rank. (ii) Second, the weights have to be drawn according to the standard initialization scheme. We believe that both assumptions are indeed necessary for Bn to yield a robust rank.

Regarding (i), we consider a high input rank a natural condition since linear neural nets cannot possibly increase the rank when propagating information through their layers. Of course, full rank is easily achieved by an appropriate data pre-processing. Yet, even when the matrix is close to low rank we find that Bn is actually able to amplify small variations in the data (see Fig. 3.b).5Regarding (ii), we derive – based on our theoretical insights – an adversarial initialization strategy that corrupts both the rank robustness and optimization performance of batch-normalized networks, thus suggesting that the success of Bn indeed relies heavily on the standard i.i.d. zero-mean initialization.

Experimental validation. In order to underline the validity of Theorem 2 we run multiple simulations by feeding Gaussian data of dimensionality d = N into networks of growing size and with different residual strengths. For each network, we compute the mean and standard deviation of the soft rank rankτ with τ = 0.5. As depicted in Fig. 3, the curves clearly indicate a Ω(√d)dependency for  limL→∞�Lℓ=1 rankτ(Hℓ)/L, just as predicted in the Theorem. Although the established guarantee requires the weight on the parametric branch (i.e.,  γ) to be small, the results of Fig. 3 indicate that the established lower bound holds for a much wider range including the case where no residual connections are used at all (γ = ∞).

image

Figure 3: a) Result of Theorem 2 for different values of  γ, where γ = ∞depicts networks without skip connections. Each point is the average rank1/2over depth (L = 106) of nets of width  d ∈ {8, 16, .., 256}an on x-axis. b) Top 10 singular values of  H(γ)ℓfor increasing values of  ℓgiven nearly collinear inputs. As can be seen, Bn quickly amplifies smaller variations in the data while reducing the largest one.

3.2 Comparison with unnormalized networks

In order to stress the importance of the above result, we now compare the predicted rank of  Hℓwith the rank of unnormalized linear networks, which essentially constitute a linear mapping in the form of a product of random matrices. The spectral distribution of products of random matrices with i.i.d. standard Gaussian elements has been studied extensively [6, 11, 20]. Interestingly, one can show that the gap between the top and the second largest singular value increases with the number of products (i.e.,  ℓ) at an exponential rate6[11, 20]. Hence, the matrix converges to a rank one matrix after normalizing by the norm. In the following, we extend this result to products of random matrices with a residual branch that is obtained by adding the identity matrices. Particularly, we consider the hidden states �Hℓof the following linear residual network:

image

Since the norm of �Hℓis not necessarily bounded, we normalize as �Hℓ = BℓX/∥Bℓ∥. The nextlemma characterizes the limit behaviour of  { �Hℓ}.

Lemma 3. Suppose that  γ ∈ (0, 1)and assume the weights  Wℓto be initialized as in Def. 1 with element-wise distribution P. Then we have for linear networks, which follow recursion (6), that:

image

This results stands in striking contrast to the result of Theorem 2 established for batch-normalized networks.7 Interestingly, even residual skip connections cannot avoid rank collapse for very deep neural networks, unless one is willing to incorporate a depth dependent down-scaling of the parametric branch as for example done in [1], who set  γ = O( 1L). Remarkably, Theorem 2 shows that Bn layers provably avoid rank collapse without requiring the networks to become closer and closer to identity.

Implications of rank collapse on gradient based learning. In order to explain the severe consequence of rank collapse on optimization performance reported in Fig. 1, we study the effect of rank one hidden-layer representations on the gradient of the training loss for distinct input samples. Let  Lidenote the training loss for datapoint i on a vanilla network as in Eq. (6). Furthermore, let the final classification layer be parametrized by  WL+1 ∈ Rdout×d. Then, given that the hidden presentation at the last hidden layer L is rank one, the normalized gradients of the loss with respect to weights of individual neurons  k ∈ 1, ..., doutin the classification layer (denoted by  ∇WL+1,kLi, where  ∥∇WL+1,kLi∥ = 1) are collinear for any two datapoints i and j, i.e.  ∇WL+1,kLi = ∓∇WL+1,kLj. A formal statement is presented in Prop. 19 in the Appendix alongside empirical validations on a VGG19 network (Fig. 10). This result implies that the commonly accepted vanishing gradient norm hypothesis is not descriptive enough since SGD does not take small steps into the right direction, but into a random direction that is independent from the input. In other words, deep neural networks are prone to directional gradient vanishing after initialization, which is caused by the collapse of the last hidden layer activations to a very small subspace (one line in  Rd in the extreme case of rank one activations).

The preceding sections highlight that the rank of the hidden representations is a key difference between random vanilla and Bn networks. We now provide three experimental findings that substantiate the particular importance of the rank at the beginning of training: First, we find that an unsupervised, rank-increasing pre-training allows SGD on vanilla networks to outperform Bn networks. Second, we show that the performance of Bn-networks is closely tied to a high rank at initialization. Third, we report that SGD updates preserve the initial rank magnitude throughout the optimization process.

Outperforming BN using a pre-training step. As discussed above, batch normalization layers are very effective at avoiding rank collapse. Yet, this is of course not the only way to preserve rank. Based upon our theoretical insights, we leverage the lower bound established in Eq. (4) to design a pre-training step that not only avoids rank collapse but also accelerates the convergence of SGD. Our proposed procedure is both simple and computationally cheap. Specifically, we maximize the lower-bound  r(Hℓ)(in Eq. (4)) on the rank of the hidden presentation  Hℓin each layer  ℓ. Since this function is differentiable with respect to its input, it can be optimized sufficiently by just a few steps of (stochastic) gradient ascent (see Section G in the Appendix for more details).

image

Figure 4: Pre-training versus Bn: Loss over epochs on CIFAR-10 for MLPs of increasing depth with 128 hidden units and ReLU activation. Trained with SGD (batchsize 64) and grid-searched stepsize. See Fig. 11 for the corresponding test loss and accuracy as well as Fig. 12 for FashionMNIST results.

Fig. 4 compares the convergence rate of SGD on pre-trained vanilla networks and Bn-networks. As can be seen, the slow down in depth is much less severe for the pre-trained networks. This improvement is, also, reflected both in terms of training accuracy and test loss (see Fig. 11 in Appendix). Interestingly, the pre-training is not only faster than Bn on deep networks, but it is also straight-forward to use in settings where the application of Bn is rather cumbersome such as for very small batch sizes or on unseen data [15, 29].

Breaking batch normalization. Some scholars hypothesize that the effectiveness of Bn stems from a global landscape smoothing [24] or a certain learning rate tuning [3], that are thought to be induced by the normalization. Under these hypotheses, one would expect that SGD converges fast on Bn-nets regardless of the initialization. Yet, we here show that the way that networks are initialized does play a crucial role for the subsequent optimization performance of Bn-nets.

image

Figure 5: Breaking Batchnorm: CIFAR-10 on VGG19 with standard PyTorch initialization as well as a uniform initialization of same variance. (Left) training accuracy, (Right) Rank of last hidden layer computed using torch.matrix_rank(). Plot also shows results for standard initialization and Bn without mean deduction. Avg. and 95% CI of 5 independent runs. (See Fig. 13 in Appendix for similar results on ResNet-50).

Particularly, we train two MLPs with batchnorm, but change the initialization for the second net from the standard PyTorch way  Wl,i,j ∼ uniform�−1/√dl, 1/√dl�[21, 12] to Wl,i,j ∼uniform�0, +2/√dl�, where  dlis the layer size. As can be seen to the right, this small change reduces the rank preserving quality of BN significantly, which is reflected in much slower learning behaviour. Even sophisticated modern day architectures such as VGG and ResNet networks are unable to fit the CIFAR-10 dataset after changing the initialization in this way (see Fig. 5).

Rank through the optimization process. The theoretical result of Theorem 2 considers the rank at random initialization. To conclude, we perform two further experiments which confirm that the initial rank strongly influences the speed of SGD throughout the entire optimization process. In this regard, Fig. 6 reports that SGD preserves the initial magnitude of the rank to a large extent, regardless of the specific network type. This is particularly obvious when comparing the two Bn initializations. A further noteworthy aspect is the clear correlation between the level of pre-training and optimization performance on vanilla nets. Interestingly, this result does again not only hold on simple MLPs but also generalizes to modern day networks such as the VGG-19 (see Fig. 5) and ResNet50 architecture (see Appendix I).

image

Figure 6: Pretraining: Fashion-MNIST on MLPs of depth 32 and width 128. (Left) Training accuracy, (Right) Lower bound on rank. Blue line is a ReLU network with standard initialization. Other solid lines are pre-trained layer-wise with 25 (orange) and 75 (green) iterations to increase the rank. Dashed lines are batchnorm networks with standard and asymmetric initialization. Average and 95% confidence interval of 5 independent runs.

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[2] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.

[3] Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981, 2018.

[4] Boaz Barak and David Steurer. Proofs, beliefs, and algorithms through the lens of sum-of-squares. Course notes: http://www. sumofsquares. org/public/index. html, 2016.

[5] Peter L. Bartlett, David P. Helmbold, and Philip M. Long. Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks. Neural computation, 31(3):477–502, 2019.

[6] Nils Bjorck, Carla P. Gomes, Bart Selman, and Kilian Q. Weinberger. Understanding batch normalization, 2018.

[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

[8] Philippe Bougerol.  Products of Random Matrices with Applications to Schrödinger Operators,volume 8. Springer Science & Business Media, 2012.

[9] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.

[10] Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier. Markov Chains. Springer, 2018.

[11] Peter J. Forrester. Lyapunov exponents for products of complex Gaussian random matrices. Journal of Statistical Physics, 151(5):796–808, 2013.

[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

[14] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[15] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in neural information processing systems, pages 1945–1953, 2017.

[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[17] Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. The normalization method for alleviating pathological sharpness in wide neural networks. In Advances in Neural Information Processing Systems, pages 6403–6413, 2019.

[18] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in neural information processing systems, pages 586–594, 2016.

[19] Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas Hofmann. Exponential convergence rates for batch normalization: The power of lengthdirection decoupling in non-convex optimization. arXiv preprint arXiv:1805.10694, 2018.

[20] Dang-Zheng Liu, Dong Wang, and Lun Zhang. Bulk and soft-edge universality for singular values of products of ginibre random matrices. In  Annales de l’Institut Henri Poincaré,Probabilités et Statistiques, volume 52, pages 1734–1762. Institut Henri Poincaré, 2016.

[21] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.

[22] Jeffrey Pennington, Samuel S Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. arXiv preprint arXiv:1802.09979, 2018.

[23] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.

[24] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization?(no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.

[25] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[26] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.

[27] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.

[28] Joel A. Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

[29] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.

[30] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[31] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pages 7103–7114, 2017.

[32] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019.

[33] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael Mahoney. PyHessian: Neural networks through the lens of the Hessian. arXiv preprint arXiv:1912.07145, 2019.

A Preliminaries

Recall that  H(γ)ℓdenotes the hidden representations in layer  ℓ. These matrices make a Markov chain that obeys the recurrence of Eq. (1), which we restate here

image

where we use the compact notation Bn for  Bn0,1d. Let  M(γ)ℓbe second moment matrix of the hidden representations  H(γ)ℓ, i.e.  M(γ)ℓ := H(γ)ℓ �H(γ)ℓ �⊤/N. Batch normalization ensures that the rows of  Hℓhave the same norm√N for ℓ > 0 –where Nis the size of mini-batch. Let H be space of  d × d-matrices that obey this propery. This property enforces two key characteristics on

image

Property (p.1) directly yields that the trace of  M(γ)ℓ(and hence the sum of its eigenvalues) is equal to d. We will repeatedly use these properties in our analysis.

Furthermore, the sequence  {H(γ)ℓ }∞ℓ=1constitute a Markov chain. Under mild assumptions, this chain admits an invariant distribution that is defined bellow[10].

Definition 2. Distribution  νis an invariant distribution of the hidden representations  {H(γ)ℓ }∞ℓ=1if it obeys

image

where  µdenotes the probability measure of random weights.

Later, we will see that the above invariance property allows us to determine the commutative behaviour of the sequence of hidden presentations.

B Lower bounds on the (soft) rank

Recall that we introduced the ratio  r(H) = Tr(M(H))2/∥M(H)∥2F in Eq. (4) as a lower bound on both the rank(H) as well as the soft rank rankτ(H)(stated in Lemma 1). This section establishes these lower bounds.

Proof of Lemma 1 (part 1). We first prove that rank(H) ≥ r(H). Let  M := M(H) = HH⊤/N. Since the eigenvalues of H are obtained by a constant scaling factor of the squared singular values of H, these two matrices have the same rank. We now establish a lower bound on rank(M). Let λ ∈ Rd contain the eigenvalues of matrix  M, hence ∥λ∥1 = Tr(M) and ∥λ∥22 = ∥M∥2F . Given λ,we define the vector  w ∈ Rd as

image

To proof the assertion, we make use if a straightforward application of Cauchy-Schwartz

image

Replacing  ∥λ∥2 = ∥M∥F and ∥λ∥1 = Tr(M)into the above equation concludes the result. Note that the above proof technique has been used in the planted sparse vector problem [4].

Proof of Lemma 1 (part 2). Now, we prove that rankτ(H(γ)ℓ ) ≥ (1 − τ)2r(H(γ)ℓ ). Let λ ∈ Rd+ bea vector containing the eigenvalues of the matrix  M(γ)ℓ = M(H(γ)ℓ ). Let  σ ∈ Rd+contain the singular values of H. Then, one can readily check that  σ2i /N = λi. Furthermore,  ∥λ∥1 = ddue to (p.1) in Eq. (8). Furthermore, we have by definition that

image

Let us now define a vector  w ∈ Rd with entries

image

Then, we use Cauchy-Schwartz to get

image

It is easy to check that  ∥w∥2 = hτ(λ)−1/2 holds. Furthermore,

image

where we used the fact that  ∥λ∥1 = din the last inequality. Replacing this result into the bound of Eq. (17) yields

image

which conludes the proof.

C Initialization consequences

The particular weight initialization scheme consider through out this work (recall Def. 1), imposes an interesting structure in the invariant distribution of the sequence of hidden presentations (defined in Def. 2).

Lemma 4. Suppose that the chain  {H(γ)ℓ }∞ℓ=1(defined in Eq. 7) admits a unique invariant distribution  νγand H is drawn from  νγ; then, the law of  Hi:equates the law of  −Hi:where  Hi:denotes the ith row of matrix H.

Proof. Let S be a sign filliping matrix: it is diagonal and its diagonal elements are in  {+1, −1}.Then  SW d= Wholds for a random matrix W whose elements are drawn i.i.d. from a symmetric distribution. Let H be drawn from the invariant distribution of the chain denoted by  νγ; Leveraging the invariance property, we get

image

By multiplying both sides with S, we get

image

Note that we use the fact that diagonal matrices commute in the above derivation. According to the definition,  S2 = Iholds. Considering this fact, we get

image

Replacing the above result into Eq. (29) yields

image

Hence the law of SH is invariant too. Since the invariant distribution is assumed to be unique, SH d= Hholds and thus  Hi: d= −Hi:.

image

Comment on Bn-centering Let  νγbe the unique invariant distribution associated with Markov chain  {H(γ)ℓ }. A straightforward implication of last Lemma is  E [Hi] = 0for  H ∼ νγ, hence the rows of  H(γ)ℓare mean-zero, hence their average is close to zero 8and the mean-zero operation in Bn is redundant. Although this theoretical argument is established for linear networks, we empirically observed that Bn without centering also works well on modern neural architectures. For example, Fig. 7 shows that the centering does not affect the performance of Bn on a VGG net when training CIFAR-10.

D Main Theorem: warm-up analysis

As a warm-up analysis, the next lemma proves that rank(H(γ)ℓ ) ≥ 2holds. Later, we will prove a stronger result. Yet, this initial results provides valuable insights into our proof technique. Furthermore, we will use the following result in the next steps.

image

Figure 7: Centering for Bn. The experiment is conducted on a VGG network. The blue line indicates the original Bn network and the orange line is Bn without mean adaption. The vertical axis in the left plot is training accuracy. In the right plot it is  r(HL), where  HLis the data representation in the last hidden layer L. The horizontal axis indicates the number of iterations.

Lemma 5. Suppose that each element of the weight matrices is independently drawn from distribution P that is zero-mean, unit-variance, and its support lies in interval  [−B, B]. If the Markov chain  {Hℓ}ℓ≥1admits a unique invariant distribution, then

image

holds almost surely for all integers  ℓ and γ ≤ 1/(8d).

Proof. Let the weights  {Wℓ}be drawn from the distribution  µ, defined in Def. 1. Such a distribution obeys an important property: element-wise symmetricity. That is,  [Wℓ]ijis distributed as  −[Wℓ]ij. Such an initialization enforces an interesting structural property for the invariant distribution  νγthat is stated in Lemma 4. It is easy to check that this implies

image

for any  i ̸= j. Recall,  M(H) = HH⊤/N. The above property enforces  [M(H)]2ijto be small and hence  ∥M(γ)ℓ ∥2F is small as well. Now, as rank(H(γ)ℓ )is proportional to  1/∥M(γ)ℓ ∥2F (compareEq. (4)), it must consequently stay large. The rest of the proof is based on this intuition. Given the uniqueness of the invariant distribution, we can invoke Birkhoff’s Ergodic Theorem for Markov Chains (Theorem 5.2.1 and 5.2.6 [10]) which yields

image

This allows us to conclude the proof by a simple contradiction. Assume that rank(H(γ)k ) is indeedone. Then, as established in the following Lemma, in the limit all entries of  M(H(γ)ℓ )are constant and either  −1 or 1.

Lemma 6. Suppose the assumptions of Lemma 5 hold. If rank(H(γ)k ) = 1for an integer k, then M(H(γ)ℓ ) = M(H(γ)k )holds for all  ℓ > k. Furthermore, all elements of all matrices  {M(H(γ)ℓ )}ℓ≥khave absolute value one, hence

image

holds.

As a result, leveraging the ergodicity established in (61), we get that then

image

must also hold. However, this contradicts the consequence of the symmetricity (Eq. (31)) which states that for any  j ̸= iwe have  EH∼νγ [[M(H)]ij] = −EH∼νγ [[M(H)]ij] = 0. Thus, the rank one assumption cannot hold, which proves the assertion.

To complete the proof of the last theorem, we prove Lemma 6.

Proof of Lemma 6. For the sake of simplicity, we omit all superscripts  (γ)throughout the proof. Suppose that rank(Hk) = 1, then rank(Hℓ) = 1for all  ℓ ≥ kas the sequence  {rank(Hℓ)}is non-increasing 9. Invoking the established rank bound from Lemma 1, we get

image

The above inequality together with properties (p.1) and (p.2) (presented in Eqs 8 and 9) yield Tr(Mℓ) = d. Replacing this into the above equation gives that  ∥Mℓ∥2F ≥ d2must hold for the rank of  Hℓto be one. Yet, recalling property (p.2), this can only be the case if  [Mℓ]ij ∈ {+1, −1}for all i, j. Replacing the definition  M(H) = HH⊤/Ninto updates of hidden presentation in Eq. 1 obtains

image

where

image

We now prove that the sign of  [Mℓ]ij and [Mℓ+1]ijare the same for  [Mℓ]ij ∈ {+1, −1}. The above update formula implies that the sign of  [Mℓ+1]ijequates that of  [Mℓ+1/2]ij. Furthermore, it is easy to check that  |[∆Mℓ]ij| ≤ 4γB. For  γ ≤ 1/(8Bd), this bound yields  |[∆Mℓ]ij| ≤ 12. Therefore, the sign of  [Mℓ+1/2]ijis equal to the one of  [Mℓ]ij. Since furthermore  [Mℓ+1]ij ∈ {1, −1}holds, we conclude that all elements of  Mℓremain constant for all  ℓ ≥ k, which yields the limit stated in Eq. 33 .

E Main theorem: Proof

In this section, we prove that  Bn yields an Ω(√d)-rank for hidden representation.

Proof sketch for Thm. 2. The proof is based on an application of ergodic theory (as detailed for example in Section 5 of [10]). In fact, the chain of hidden representations, denoted by  H(γ)ℓ(1), constitutes a Markov chain in a compact space. This chain admits at least one invariant distribution  νfor which the following holds

image

for every bounded Borel function  g : Rd×d → Rd. The above invariance property provides an interesting characterization of the invariant measure  ν. Particularly, we show in Lemma 13 that

image

holds, where r(H) is the established lower-bound on the rank (see Lemma 1). Under weak assumptions, the chain obey Birkhoff’s Ergodicity, which yields that the average behaviour of the hidden representations is determined by the invariant measure  ν:

image

Finally, the established lower bound in Lemma 1 allows us to directly extend this result to a lower bound on the soft rank itself.

Characterizing the change in Frobenius norm Recall the established lower bound on the

rank denoted by r(H), for which

image

holds for all  Hℓdefined in Eq. 1.10Therefore,  ∥Mℓ∥2Fdirectly influences rankτ(Hℓ)(and also rank(Hℓ)) according to Lemma 1. Here, we characterize the change in  ∥M(H)∥2F after applying one step of the recurrence in Eq. 7 to H, i.e. passing it trough one hidden layer. This yields

image

Let M = M(H) and  M+ = M(H+)for simplicity. The next lemma estimates the expectation (taken over the randomness of W) of the difference between the Frobenius norms of  M+ and M.

image

holds as long as the support of distribution P (in Def. 1) lies in a finite interval  [−B, B].

The proof of the above lemma is based on a Taylor expansion of the Bn non-linear operator. We postpone the detailed proof to the end of this section. While the above equation seems complicated at first glance, it provides some interesting insights.

Interlude: Intuition behind Lemma 7. In order to gain more understanding of the implications of the result derived in Lemma 7, we make the simplifying assumption that all the rows of matrix M have the same norm. We emphasize that this assumption is purely for intuition purposes and is not necessary for the proof of our main theorem. Under such an assumption, the next proposition shows that the change in the Frobenius norm directly relates to the spectral properties of matrix M.

Proposition 8. Suppose that all the rows of matrix M have the same norm. Let  λ ∈ Rd containthe eigenvalues of matrix M. Then,

image

holds and hence

image

We postpone the proof to the end of this section. This proposition re-expresses the polynomial of Lemma 7 in terms of the eigenspectrum of M.

Based on the above proposition, we can make sense of interesting empirical observation reported in Figure 3.b. This figure plots the evolution of the eigenvalues of  M(H(γ)ℓ )after starting from a matrix  M(H0)whose leading eigenvalue is large and all other eigenvalues are very small. We observe that a certain fraction of the small eigenvalues of  M(H(γ)ℓ )grow quickly with  ℓ, while the leading eigenvalue is decreases in magnitude. In the next example, we show that the result of the last proposition actually predicts this observation.

Example 9. Suppose that M is a matrix whose rows have the same norm. Let  λ1 ≥ λ2, . . . , λd bethe eigenvalues associated with the matrix  M such that λd = λd−1 = λ2 = γ2 and λ1 = d−γ2(d−1).In this setting, Prop. 8 implies that  EW ∥M+∥2F < ∥M∥2F − γ4d2for a sufficiently small  γ. Thischange has two consequences in expectation:(i.) the leading eigenvalue of  M+ is O(−γ4d) smallerthan the leading eigenvalue of M, and (ii.) some small eigenvalues of  M+are greater than those of M (see Fig. 3.b).

We provide a more detailed justification for the above statement at the end of this section. This example illustrates that the change in Frobenius norm (characterized in Lemma 7) can predict the change in the eigenvalues of  M(H(γ)ℓ )(singular values of  H(γ)ℓ) and hence the desired rank. Inspired by this, we base the proof of Theorem 2 on leveraging the invariance property of the unique invariant distribution with respect to Frobenius norm – i.e. setting  g(H) = ∥M(H)∥2Fin Def. 2. An observation: regularity of the invariant distribution We now return to the result de-

rived in Lemma 7 that characterizes the change in Frobenius norm of M(H) after the recurrence of Eq. (7). We show how such a result can be used to leverage the invariance property with respect to the Frobenius norm. First, we observe that the term Tr(M(H)3)in the expansion can be shown to dominate the term Tr(diag(M(H)2)2)in expectation. The next definition states this dominance formally.

Definition 3. (Regularity constant  α) Let νbe a distribution over  H ∈ H. Then the regularity constant associated with  νis defined as the following ratio:

image

The next lemma states that the regularity constant  αassociated with the invariant distribution νγis always less than one. Our analysis will in fact directly rely on  α < 1.

Lemma 10. Suppose that the chain  {H(γ)ℓ }admits the unique invariant distribution  νγ(in Def. 2). Then, the regularity constant of  νγ(in Def. 3) is less than one for a sufficiently small  γ.

Proof. We use a proof by contradiction where we suppose that the regularity constant of distribution  νγis greater than one. In this case, we prove that the distribution cannot be invariant with respect to the Frobenius norm.

If the regularity constant  αis greater than one, then

image

holds. According to Theorem 5, the rank of M(H) is at least 2. Since the sum of the eigenvalues is constant d, the leading eigenvalue is less than d. This leads to

image

Plugging the above inequality together with inequality 47 into the established bound in Lemma 7 yields

image

for a sufficiently small  γ. Therefore,  νγdoes not obey the invariance property for g(H) = ∥M(H)∥2F in Def. 2.

We can experimentally estimate the regularity constant  αusing the Ergodicity of the chain. Assuming that the chain is Ergodic11,

image

holds almost surely for every Borel bounded function  g : H → R. By setting  g1(H) = Tr(M(H)3)and  g2(H) = Tr(diag(M(H)2)2), we can estimate  EH∼νγ [gi(H)]for i = 1, and 2. Given these estimates,  αcan be estimated. Our experiments in Fig. 8 show that the regularity constant of the invariant distribution  νγis less than 0.9 for d > 10.

image

Figure 8: Regularity constant of the invariant distribution. The vertical axis is the estimated regularity constant  αand the horizontal axis is  d. We use L = 105 (in Eq. (49)).

Interlude: intuition behind the regularity We highlight the regularity constant does by itself not yield the desired rank property in Theorem 2. This is illustrated in the next example that shows how the regularity constant relates to the spectral properties of M(H).

Example 11. Suppose that the support of distribution  νcontains only matrices  H ∈ H for whichall rows of M(H) have the same norm. If the regularity constant of  νis greater than or equal to one, then all non-zero eigenvalues of matrix M(H) are equal.

A detailed justification of the above statement is presented at the end of this section. This example shows that the regularity constant does not necessarily relate to the rank of H, but instead it is determined by how much non-zero eigenvalues are close to each other. We believe that a sufficient variation in non-zero eigenvalues of M(H) imposes the regularity of the law of H with a constant less than one (i.e.  α < 1in Def. 3). The next example demonstrates this.

Example 12. Suppose the support of distribution  νcontains matrices  H ∈ Hfor which all rows of M(H) have the same norm. Let  λ ∈ Rdcontain sorted eigenvalues of M(H). If  λ1 = Θ(dβ)and  λi = o(dβ)for i > 1 and  β < 1,12then the regularity constant  αassociated with  νis less than 0.9 for sufficiently large d.

We later provide further details about this example. Invariance consequence The next lemma establishes a key result on the invariant distribution

νγ.

Lemma 13. Suppose that the chain  {H(γ)ℓ }(see Eq. 7) admits the invariant distribution  νγ (seeDef. 2). If the regularity constant associated with  νγ is α < 1(defined in Def. 3), then

image

holds for a sufficiently small  γ.

Proof. Leveraging invariance property in Def. 2,

image

holds where the expectation is taken with respect to the randomness of W and  νγ.13Invoking the result of Lemma 7, we get

image

Having a regularity constant less than one for  νγ implies

image

holds for sufficiently small  γ. Let  λ ∈ Rdbe a random vector containing the eigenvalues of the random matrix  M(H).14The eigenvalues of  M3are  λ3, hence the invariance result can be written alternatively as

image

The above equation leads to the following interesting spectral property:

image

A straightforward application of Cauchy-schwarz yields:

image

Given (i) the above bound, (ii) an application of Jensen’s inequality, (iii) and the result of Eq. (55), we conclude with the desired result:

image

Notably, the invariant distribution is observed to have a regularity constant less than 0.9 (in Fig. 8) for sufficiently large d. This implies that an upper-bound O�d3/2�is achievable on the Frobenius norm. Leveraging Ergodicity (with respect to Frobenius norm in Eq. (49)), we experimentally validate the result of the last lemma in Fig. 9.

Proof of the Main Theorem Here, we give a formal statement of the main Theorem that

contains all required additional details (which we omitted for simplicity in the original statement).

image

Figure 9: Dependency of  Eνγ∥M(H)∥2F on d. The horizontal axis is  log2(d)and the vertical axis shows  log2( 1L�Lℓ=1 ∥M(H(γ)ℓ )∥2F ) for L = 105. The green dashed-line plots  log2(d1.5).

Theorem 14 (Formal statement of Theorem 2). Suppose that rank(X) = d, γis sufficiently small, and all elements of the weight matrices  {Wℓ}are drawn i.i.d. from a zero-mean, unit variance distribution whose support lies in  [−B, B]and its law is symmertic around zero. Furthermore, assume that the Markov chain  {H(γ)ℓ }(defined in Eq. 1) admits a unique invariant distribution. Then, the regularity constant  α > 0associated with  νγ(see Def. 3) is less than one and the following limits exist such that

image

holds almost surely for all  τ ∈ [0, 1]. Assuming that the regularity constant  αdoes not increase with respect to d, the above lower-bound is proportional to  (1 − α)1/2√d = Ω(√d).

Remarkably, we experimentally observed (in Fig. 8) that the regularity constant  αis decreasing with respect to d. Examples 11 and 12 provide insights about the regularity constant. We believe that it is possible to prove that the constant  αis non-increasing with respect to d.

Proof of Theorem 2 . Lemma 10 proves that the regularity constant  αis less than one for the unique invariant distribution. Suppose that  H ∈ His a random matrix whose law is the one of the unique invariant distribution of the chain. For  H ∈ H, we get Tr(M(H)) = d. A straightforward application of Jensen’s inequality yields the following lower bound on the expectation of r(H) (i.e. the lower bound on the rank):

image

where the expectation is taken over the randomness of H (i.e. the invariant distribution). Invoking the result of Lemma 13, we get an upper-bound on the expectation of the Frobenius norm – in the right-side of the above equation. Therefore,

image

holds. The uniqueness of the invariant distribution allows us to invoke Birkhoff’s Ergodic Theorem for Markov Chains (Theorem 5.2.1 and 5.2.6 [10]) to get

image

The established lower bound on rankτ(H(γ)ℓ )–in terms of  r(H(γ)ℓ )– in Lemma 1 concludes

image

As shown in the following corollary, one can extend the result of Theorem 14 for any finite  ℓ.

Corollary 15. Under the setting of Thm. 14, rank(Hℓ) = Ω(√d)holds almost surely for all finite integer  ℓ. Assuming that  {rankτ(Hℓ)}is a monotonically no-increasing sequence, then rankτ(Hℓ) = Ω((1 − τ)2√d)holds almost surely for all finite  ℓ.

Proof. The proof is based on the no-increasing property of the rank15. Next lemma presents a straightforward implication of this property.

Lemma 16. Consider a sequence of non-increasing bounded finite integers  {yk}∞k=1. If limN→∞�Nk=1 yk/Nexists and is greater than  α, then yk ≥ αfor all finite k.

The proof of the last lemma is provided at the end of this section. Replacing the result of Thm. 14 into the above lemma concludes the proof of the corollary.

image

A remark on the number of hidden units. The focus of our analysis was networks with the same number of hidden units in each layer. Yet, this result extends to more general architectures. Most of modern neural architectures consists of blocks in which the number of hidden units are constant. For example, VGG19-Nets and ResNets are consist of blocks convolutional layers with 64, 128, 256, and 512 channels where the number channels are equal in each block. An analogy of such an architecture is an MLP with different blocks of hidden layers where the numbers of hidden units are the same in each block. According to Cor. 15, the rank preservation property holds in each block after applying BN. In this way, one can extend the established results of Thm. 14 and Cor. 15 to a general family of architectures with varying number of hidden units.

Postponed proofs.

Proof of Lemma 7. The proof is based on a Taylor expansion of the Bn non-linear recurrence function, which we restate here for simplicity:

image

Consider the covariance matrices  M = M(H) and M+ = M(H+)which obey

image

For the sake of simplicity, we use the compact notation  g := gij for i ̸= j. We further introduce the set of indices S = {ii, ij, jj}. A taylor expansion of g at M yields

image

Note that the choice of the element-wise uniform distribution over  [−√3,√3]allows us to deterministically bound the Taylor remainder term by O(γ3). Now, we compute the derivatives and expectations that appear in the above expansion individually. Let us start with the term  T1.The first-order partial derivative term in  T1is computed bellow.

image

The expectation term in  T1 is

image

Given the above formula, we reach the following compact expression for  T1:

image

The compute  T2we need to compute second-order partial derivatives of g and also estimate the following expectation:

image

We now compute  Kpq,kmin the above formula

image

The second-order partial derivatives of g reads as

image

Now, we replace the computed partial derivatives and the expectations into  T2:

image

Plugging terms  T1 and T2into the Taylor expansion yields

image

Summing over  i ̸= jconcludes the proof (note that the diagonal elements are one for the both of matrices  M and M+).

Proof of Proposition 8. Consider the spectral decomposition of matrix  M as M = Udiag(λ)U ⊤,then  Mk = Udiag(λk)U ⊤. Since Tr(Mk)is equal to the sum of the eigenvalues of  Mk, we get

image

for k = 2 and k = 3. The sum of the squared norm of the rows in M is equal to the Frobenius norm of M. Assuming that the rows have equal norm, we get

image

Therefore,

image

holds.

image

Details of Example 9. Under the assumptions stated in Example 9, we get

image

where the approximations are obtained by a first-order Taylor approximation of the norms at λ′ = (d, 0, . . . , 0), and all small terms  o(γ2)are omitted. Using the result of Proposition 8, we get

image

Let  λ+be the eigenvalues of the matrix  M+, then

image

Let  j = arg maxi E�[λ+]2i�. A straight-forward application of Jensen’s inequality yields

image

Hence the leading eigenvalue of  M+is smaller than the one of M. Since the sum of eigenvalues λ+and  λare equal, some of the eigenvalues  λ+are greater than those of  λ(in expectation) to compensate  E[λ+]j < λ1.

Details of Example 11. Invoking Prop. 8, we get

image

where  λ ∈ Rdcontains the eigenvalues of M(H). Since  H ∈ H, ∥λ∥1 = d. If the regularity constant is greater than or equal to one, then

image

A straightforward application of Cauchy-Schwartz yields:

image

The above result together with inequality 84 yields that

image

Finally, the above equality is met only when all non-zero eigenvalues are equal.

image

Details of Example 12. Since  λ1 = Θ(dβ) and λi>1 = o(dβ), we get

image

Thus, Prop. 8 yields

image

Therefore,

image

As a result,  αis less than 0.9 for sufficiently large d.

Proof of Lemma 16. The proof is based on a contradiction. Suppose that there exits a finite n such that  yn < α. Since the sequence is non-increasing,  ym < αfor holds for all m > n. This yields

image

where we used the fact that all  ykare bounded. The above result contradicts the fact that limn→∞�Nk=1 yk/N > α.

F Analysis for Vanilla Linear Networks.

In this section, we prove Lemma 3 that states the rank vanishing problem for vanilla linear networks. Since the proof relies on existing results on products of random matrices (PRM) [8], we first shortly review these results. Let T be the set of  d × dmatrices. Then, we review two notions for T: contractiveness and strong irreducibility.

Definition 4 (Contracting set [8]). T is contracting if there exists a sequence  {Mn ∈ T, n ≥ 0}such that  Mn/∥Mn∥converges to a rank one matrix.

Definition 5 (Invariant union of proper subspaces [8]). Consider a family of finite proper linear subspace  V1, . . . , Vk ⊂ Rd. The union of these subspaces is invariant with respect to T, if Mv ∈ V1 or V2 or . . . or Vk holds for ∀v ∈ V1 or V2 or . . . or Vk and ∀M ∈ T.

Example 17. Consider the following sets

image

then, union of  V1and  V2is invariant with respect to T because  αTv1 ∈ V2and  αTv2 ∈ V1hold for  α ̸= 0.

Definition 6 (Strongly irreducible set [8]). The set T is strongly irreducible if there does not exist a finite family of proper linear subspaces of  Rd such that their union is invariant with respect to T.

For example, the set T defined in Example 17 is not strongly irreducible.

Lemma 18 (Thm 3.1 of [8]). Let W1, W2, . . . be random d×dmatrices drawn independently from a distribution  µ. Let  Bn = �nk=1 Wk. If the support of  µis strongly irreducible and contracting, then any limit point of  {Bn/∥Bn∥}∞n=1 is a rank one matrix almost surely.

This result allows us to prove Lemma 3.

image

Proof of Lemma 3. Recall the structure of the random weight matrices as �Wk = I + γWk wherethe coordinates  Wkare i.i.d. from (a.) standard Gaussian, (b.) uniform[−√3,√3](i.e. with variance 1). One can readily check that for the Gaussian weights, the contracting and strong irreducibility hold and one can directly invoke the result of lemma 18 to get part (a.) of Lemma 3. Now, we prove part (b.). Let m be a random integer that obeys the law  p(m = k) = 2−k. Giventhe random variable m, we define the random matrix  Y = �mk=1 �Wkand use the notation  µ′ forits law. Let  {Yi = �mij=1 �Wk}ki=1 be drawn i.i.d. from  µ′. Then, Ck := Yk . . . Y2Y1is distributed as Bℓk := �Wℓk . . . �W2�W1 for ℓk = �ki=1 mi. We prove that every limit point of  {Ck/∥Ck∥} convergesto a rank one matrix, which equates the convergence of limit points of  {Bℓk/∥Bℓk∥}to a rank one matrix. To this end, we prove that the support of  µ′ denoted by Tµ′is contractive and strongly contractive. Then, Lemma 18 implies that the limit points of  {Ck/∥Ck∥}are rank one.

Contracting. Let  e1 ∈ Rdbe the first standard basis vector. Since  An := (I + γe1e⊤1 )n ∈ Tµ′and its limit point  {An/∥An∥}converges to a rank one matrix,  Tµ′is contractive.

Strong irreduciblity. Consider an arbitrary family of linear proper subspace of  Rdas {V1, . . . , Vq}. Let v be an arbitrary unit norm vector which belongs to one of the subspaces {Vi}qi=1. Given v, we define an indexed family of matrices  {Mα ∈ Tµ′|α ∈ Rd, |αi| ≤ 1} such that

image

where  eiis the i-th standard basis16. Then, we get

image

Therefore,  {Mαv||αi| ≤ 1}is not contained in any union of finite proper (m < k)-dimensional linear subspace of  Rd, hence Tµ′is strongly irreducible.

image

G Details: Pretraining algorithm

In Section 4, we introduced a pre-training method that effectively obtains a better optimization performance compared ot Bn. In this section, we provide more details about the pre-training step. Recall  X ∈ Rd×N is a minibatch of d-dimensional inputs of size  N. Let HL(X) ∈ Rd×N bethe hidden representation of input X in the last layer of a MLP. Using gradient descent method, we optimize  r(HL(X))–with respect to the parameters of networks– over different minibatches X. Algorithm 1 presents our pretraining method. As can be seen, the procedure is very simple.

image

H Details: Why the rank matters for gradient based learning.

We now provide an intuitive explanation of why rank one hidden representations prevent randomly initialized networks from learning. Particularly, we argue that these networks essentially map all inputs to a very small subspace17 such that the final classification layer can no longer disentangle the hidden representations. As a result, the gradients of that layer also align, yielding a learning signal that becomes independent of the input.

To make this claim more precise, consider training the linear network from Eq. (6) on a dataset  X ∈ Rd×N, where  xi ∈ Rdwith  douttargets  yi ∈ Rdout, i = 1, . . . , N. Each column �H(γ)L,iof the hidden representations in the last hidden layer  �H(γ)Lis the latent representation of datapoint i, which is fed into a final classification layer parametrized by  WL+1 ∈ Rdout×d. We optimize L(W), where W is a tensor containing all weights  W1, . . . , WL+1 and �H(γ)L,i is a function of  W1, . . . , WL(as detailed in Eq. (6):

image

and  ℓ : Rdout → R+is a differentiable loss function. Now, if the the hidden representations become rank one (as predicted by Lemma 3 and Fig. 2), one can readily check that the stochastic gradients of any neuron k in the last linear layer, i.e.,  ∇WL,[k,:]Li(W) = (∇ℓi)k �H(γ)L,i, align for both linear and ReLU networks.

Proposition 19. Consider a network with rank one hidden representations in the last layer �H(γ)L (W1, ..., WL), then for any neuron k and any two datapoints i, j with non-zero errors  Li and

image

∀i, j. That is, all stochastic gradients of neuron k in the final classification layer align along one single direction in  Rd.

Proof. The result follows directly from a simple application of the chain rule

image

To validate this claim, we again train CIFAR-10 on the VGG19 network from Figure 5 (top).

image

Figure 10: Directional gradient vanishing CIFAR-10 on a VGG19 network with BN, SGD, SGD with 100x learning rate and SGD on random data. Average and 95% confidence interval of 5 independent runs.

As expected, the network shows perfectly aligned gradients without Bn (right hand side of Fig. 10), which renders it un-trainable. In a next step, we replace the input by images generated randomly from a uniform distribution between 0 and 255 and find that SGD takes almost the exact same path on this data (compare log accuracy on the left hand side). Thus, our results suggest that the commonly accepted vanishing gradient norm hypothesis is not descriptive enough since SGD does not take small steps into the right direction- but into a random one after initialization in deep neural networks. As a result, even a 100x increase in the learning rate does not allow training. We consider our observation as a potential starting point for novel theoretical analysis focusing on understanding the propagation of information through neural networks, whose importance has also been highlighted by [6].

I Additional Experiments

image

Figure 11: CIFAR-10: Same setting as Fig.4 but now showing accuracy and test loss

Outperforming Bn The following Figure shows the result of the experiment of Fig. 4 that is repeated for FashionMNIST dataset.

image

Figure 12: Results of Fig.4 for FashionMNIST

Breaking Bn In the following result, we repeated the experiment of Fig. 5 for ResNets.

image

Figure 13: Breaking Batchnorm: CIFAR-10 on a ResNet-50 with standard PyTorch initialization as well as a uniform initialization of same variance in  R+. Average and 95% confidence interval of 5 independent runs. This plot also shows results for a Bn network without mean deduction/adaption, validating our claim from Section 2.


Designed for Accessibility and to further Open Science