Theoretical Understanding of Batch-normalization: A Markov Chain Perspective

2020·Arxiv

Abstract

Abstract

Randomly initialized neural networks are known to become harder to train with increasing depth, unless architectural enhancements like residual connections and batch normalization are used. We here investigate this phenomenon by revisiting the connection between random initialization in deep networks and spectral instabilities in products of random matrices. Given the rich literature on random matrices, it is not surprising to find that the rank of the intermediate representations in unnormalized networks collapses quickly with depth. In this work we highlight the fact that batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks. Leveraging tools from Markov chain theory, we derive a meaningful lower rank bound in deep linear networks. Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets.Finally, we conduct an extensive set of experiments on real-world data sets, which confirm that rank stability is indeed a crucial condition for training modern-day deep neural architectures.

1 Introduction and related work

Depth is known to play an important role in the expressive power of neural networks [27]. Yet, increased depth typically leads to a drastic slow down of learning with gradient-based methods, which is commonly attributed to unstable gradient norms in deep networks [14]. One key aspect of the training process concerns the way the layer weights are initialized. When training contemporary neural networks, both practitioners and theoreticians advocate the use of randomly initialized layer weights with i.i.d. entries from a zero mean (Gaussian or uniform) distribution. This initialization strategy is commonly scaled such that the variance of the layer activation stays constant across layers [12, 13]. However, this approach can not avoid spectral instabilities as the depth of the network increases. For example,

[25] observes that for linear neural networks, such initialization lets all but one singular values of the last layers activation collapse towards zero as the depth increases.

Nevertheless, recent advances in neural architectures have allowed the training of very deep neural networks with standard i.i.d. initialization schemes despite the above mentioned shortcomings. Among these, both residual connections and normalization layers have proven particularly effective and are thus in widespread use (see [16, 23, 13] to name just a few). Our goal here is to bridge the explanatory gap between these two observations by studying the effect of architectural enhancements on the spectral properties of randomly initialized neural networks. We also provide evidence for a strong link of the latter with the performance of gradient-based optimization algorithms.

One particularly interesting architectural component of modern day neural networks is Batch Normalization (Bn) [16]. This simple heuristics that normalizes the pre-activation of hidden units across a mini-batch, has proven tremendously effective when training deep neural networks with gradient-based methods. Yet, despite of its ubiquitous use and strong empirical benefits, the research community has not yet reached a broad consensus, when it comes to a theoretical explanation for its practical success. Recently, several alternatives to the original “internal covariate shift” hypothesis [16] have appeared in the literature: decoupling optimization of direction and length of the parameters [19], auto-tuning of the learning rate for stochastic gradient descent [3], widening the learning rate range [6], alleviating sharpness of the Fisher information matrix [17], and smoothing the optimization landscape [24]. Yet, most of these candidate justifications are still actively debated within the community. For example, [24] first made a strong empirical case against the original internal covariate shift hypothesis. Secondly, they argued that batch normalization simplifies optimization by smoothing the loss landscape. However, their analysis is on a per-layer basis and treats only the largest eigenvalue. Furthermore, even more recent empirical studies again dispute these findings, by observing the exact opposite behaviour of Bn on a ResNet20 network [33].

1.1 On random initialization and gradient based training

In light of the above discussion, we take a step back – namely to the beginning of training – to find an interesting property that is provably present in batch normalized networks and can serve as a solid basis for a more complete theoretical understanding.

The difficulty of training randomly initialized, un-normalized deep networks with gradient methods is a long-known fact, that is commonly attributed to the so-called vanishing gradient effect, i.e., a decreasing gradient norm as the networks grow in depth (see, e.g., [26]). A more recent line of research tries to explain this effect by the condition number of the input-output Jacobian (see, e.g., [31, 32, 22, 6]). Here, we study the spectral properties of the above introduced initialization with a particular focus on the rank of the hidden layer activations over a batch of samples. The question at hand is whether or not the network preserves a diverse data representation which is necessary to disentangle the input in the final classification layer.

As a motivation, consider the results of Fig. 1, which plots accuracy and output rank when training batch-normalized and un-normalized neural networks of growing depth on the FashionMNIST dataset [30]. As can be seen, the rank in the last hidden layer of the vanilla networks collapses with depth and they are essentially unable to learn (in a limited number of epochs) as soon as the number of layers is above 10. The rank collapse indicates that the direction of the output vector has become independent of the actual input. In other words, the randomly initialized network no longer preserves information about the input. Batch-normalized networks, however, preserve a high rank across all network sizes and their training accuracy drops only very mildly as the networks reach depth 32.

The above example shows that both rank and optimization of even moderately-sized, unnormalized networks scale poorly with depth. Batch-normalization, however, stabilizes the rank in this setting and the obvious question is whether this effect is just a slow-down or even simply a numerical phenomenon, or whether it actually generalizes to networks of infinite depth.

In this work we make a strong case for the latter option by showing a remarkable stationarity aspect of Bn. Consider for example the case of passing arranged column-wise in an input matrix through a very deep network with fully-connected layers. Ideally, from an information propagation perspective, the network should be able to differentiate between

Figure 1: Effect of depth on rank and learning, on the Fashion-MNIST dataset with ReLU multilayer perceptrons (MLPs) of depth 1-32 and width 128 hidden units. Left: Rankafter random initialization as in PyTorch [21, 12]. Right: Training accuracy after training 75 epochs with SGD, batch size 128 and grid-searched learning rate. Mean and 95% confidence interval of 5 independent runs.

individual samples, regardless of its depth [26]. However, as can be seen in Fig. 2, the hidden representation of X collapses to a rank one matrix in vanilla networks, thus mapping all to the same line in . Hence, the hidden layer activations and along with it the individual gradient directions become independent from the input as depth goes to infinity. We call this effect “directional” gradient vanishing (see Section 3 for a more thorough explanation).

Interestingly, this effect does not happen in batch-normalized networks, which yield – as we shall prove in Theorem 2 – a stable rank for any depth, thereby preserving a disentangled representation of the input and hence allowing the training of very deep networks. These results substantiate earlier empirical observations made by [6] for random Bn-nets, and also validates the claim that Bn helps with deep information propagation [26].

Figure 2: Rank comparison of last hidden activation: Log(rank) of the last hidden layer’s activation over total number of layers (blue for Bn- and orange for vanilla-networks) for Gaussian inputs. Networks are MLPs of width d = 32. (Left) Linear activations, (Right) ReLU activations. Mean and 95% confidence interval of 10 independent runs. While the rank quickly drops in depth for both networks, BN stabilizes the rank above

1.2 Contributions

In summary, the work at hand makes the following two key contributions: (i) We theoretically prove that Bn indeed avoids rank collapse for deep linear neural nets under standard initialization and for any depth. In particular, we show that Bn can be seen as a

computationally cheap rank preservation operator, which may not yield hidden matrices with full rank but still preserves sufficient modes of variation in the data to achieve a scaling of the rank with is the width of the network. Subsequently, we leverage existing results from random matrix theory [8] to complete the picture with a simple proof of the above observed rank collapse for linear vanilla networks, which interestingly holds regardless of the presence of residual connections (Lemma 3). Finally, we connect the rank to difficulties in gradient based training of deep nets by showing that a rank collapse makes the directional component of the gradients independent of the input.

(ii) We empirically show that the rank is indeed a crucial quantity for gradient-based learning. In particular, we show that both the rank and the final training accuracy quickly diminish in depth unless Bn layers are incorporated in both simple feed-forward and convolutional neural nets. To take this reasoning beyond mere correlations, we actively intervene with the rank of networks before training and show that (a) one can break the training stability of Bn by initializing in a way that reduces its rank-preserving properties, and (b) a rank-increasing pre-training procedure for vanilla networks can recover their training ability even for large depth. Interestingly, our pre-training method allows vanilla SGD to outperform Bn on very deep MLPs. In all of our experiments, we find that SGD updates preserve the order of the initial rank throughout optimization, which underscores the importance of the rank at initialization for the entire convergence behavior.

2 Background and Preliminaries

Network description. We consider a given input containing N samples in . Let denote the k-dimensional all one vector and denote the hidden representation of -network with residual connections. The following recurrence summarizes the network mapping

where and regulates the skip connection strength (in the limit, recovers a network without skip connection). Throughout this work, we consider the network weights to be initialized as follows.

Definition 1 (Standard weight initialization). The elements of weight matrices are i.i.d. samples from a distribution P that has zero-mean, unit-variance, and its density is symmetric around zero. We use the notation for the probability distribution of the weight matrices.

We define the BN operator as in the original paper [16], namely

where is a row-wise product. Both and are trainable parameters. Throughout this work we assume the initialization , and also omit corrections of the mean activity. As demonstrated empirically in Fig. 5, and theoretically in App. C this simplification does not change the performance of Bn in our settings.

Rank notions. To circumvent numerical issues involved in rank computations we introduce a soft notion of the rank denoted by rank(soft rank). Specifically, let be the singular values of H. Then, given a , we define rank

Lemma 1. For an arbitrary matrix , rank. For the sequence defined in Eq. (2), rank

3 Batch normalization provably prevents rank collapse

Since our empirical observations hold equally for both non-linear and linear networks, we here focus on improving the theoretical understanding in the linear case, which constitutes a growing area of research [25, 18, 5, 2]. First, inspired by [9] and leveraging tools from Markov Chain theory, our main result proves that the rank of linear batch-normalized networks scales with their width as . Secondly, we leverage results from random matrix theory [7] to contrast our main result to unnormalized linear networks which we show to provably collapse to rank one, even in the presence of residual connections.

3.1 Main result

In the following we state our main result which proves that batch normalization indeed prevents the rank of all hidden layer activations from collapsing to one. Please see Appendix E for the more formal version of this theorem statement.

Theorem 2. [Informal] Suppose that the rank(X) = d and that the weights are initialized in a standard i.i.d. zero-mean fashion (see Def. 1). Then, the following limits exist such that

holds almost surely for a sufficiently small (independent of ) and any , under some additional technical assumptions. Please see Theorem 14 in the Appendix for the formal statement.

Theorem 2 yields a non trivial width-dependency. Namely, by setting for example , the result states that the average number of singular values with absolute value greater than is at least on average. To put this into context: If one were to replace diagby the would effectively constitute a classical whitening operation such that all would be full rank (equal to d). However, as noted in the original Bn paper [16], whitening is obviously expensive to compute and furthermore prohibitively costly to incorporate in back-propagation. As such, Bn can be seen as a computationally inexpensive approximation of whitening, which does not yield full rank hidden matrices but still preserves sufficient variation in the data to provide a rank scaling as ). Although the lower-bound in Thm. 2 is established on the average over infinite depth (i.e., ), Corollary 15 (in App. E) proves that the same bound holds for all rank

Necessary assumptions. The above result relies on two key assumptions: (i) First, the input X needs to be full rank. (ii) Second, the weights have to be drawn according to the standard initialization scheme. We believe that both assumptions are indeed necessary for Bn to yield a robust rank.

Regarding (i), we consider a high input rank a natural condition since linear neural nets cannot possibly increase the rank when propagating information through their layers. Of course, full rank is easily achieved by an appropriate data pre-processing. Yet, even when the matrix is close to low rank we find that Bn is actually able to amplify small variations in the data (see Fig. 3.b).Regarding (ii), we derive – based on our theoretical insights – an adversarial initialization strategy that corrupts both the rank robustness and optimization performance of batch-normalized networks, thus suggesting that the success of Bn indeed relies heavily on the standard i.i.d. zero-mean initialization.

Experimental validation. In order to underline the validity of Theorem 2 we run multiple simulations by feeding Gaussian data of dimensionality d = N into networks of growing size and with different residual strengths. For each network, we compute the mean and standard deviation of the soft rank rank. As depicted in Fig. 3, the curves clearly indicate a dependency for , just as predicted in the Theorem. Although the established guarantee requires the weight on the parametric branch (i.e., ) to be small, the results of Fig. 3 indicate that the established lower bound holds for a much wider range including the case where no residual connections are used at all (

Figure 3: a) Result of Theorem 2 for different values of depicts networks without skip connections. Each point is the average rankover depth () of nets of width an on x-axis. b) Top 10 singular values of for increasing values of given nearly collinear inputs. As can be seen, Bn quickly amplifies smaller variations in the data while reducing the largest one.

3.2 Comparison with unnormalized networks

In order to stress the importance of the above result, we now compare the predicted rank of with the rank of unnormalized linear networks, which essentially constitute a linear mapping in the form of a product of random matrices. The spectral distribution of products of random matrices with i.i.d. standard Gaussian elements has been studied extensively [6, 11, 20]. Interestingly, one can show that the gap between the top and the second largest singular value increases with the number of products (i.e., ) at an exponential rate[11, 20]. Hence, the matrix converges to a rank one matrix after normalizing by the norm. In the following, we extend this result to products of random matrices with a residual branch that is obtained by adding the identity matrices. Particularly, we consider the hidden states of the following linear residual network:

Since the norm of is not necessarily bounded, we normalize as lemma characterizes the limit behaviour of

Lemma 3. Suppose that and assume the weights to be initialized as in Def. 1 with element-wise distribution P. Then we have for linear networks, which follow recursion (6), that:

This results stands in striking contrast to the result of Theorem 2 established for batch-normalized networks.Interestingly, even residual skip connections cannot avoid rank collapse for very deep neural networks, unless one is willing to incorporate a depth dependent down-scaling of the parametric branch as for example done in [1], who set . Remarkably, Theorem 2 shows that Bn layers provably avoid rank collapse without requiring the networks to become closer and closer to identity.

Implications of rank collapse on gradient based learning. In order to explain the severe consequence of rank collapse on optimization performance reported in Fig. 1, we study the effect of rank one hidden-layer representations on the gradient of the training loss for distinct input samples. Let denote the training loss for datapoint i on a vanilla network as in Eq. (6). Furthermore, let the final classification layer be parametrized by . Then, given that the hidden presentation at the last hidden layer L is rank one, the normalized gradients of the loss with respect to weights of individual neurons in the classification layer (denoted by , where ) are collinear for any two datapoints i and j, i.e. . A formal statement is presented in Prop. 19 in the Appendix alongside empirical validations on a VGG19 network (Fig. 10). This result implies that the commonly accepted vanishing gradient norm hypothesis is not descriptive enough since SGD does not take small steps into the right direction, but into a random direction that is independent from the input. In other words, deep neural networks are prone to directional gradient vanishing after initialization, which is caused by the collapse of the last hidden layer activations to a very small subspace (one line in in the extreme case of rank one activations).

4 The important role of the rank

The preceding sections highlight that the rank of the hidden representations is a key difference between random vanilla and Bn networks. We now provide three experimental findings that substantiate the particular importance of the rank at the beginning of training: First, we find that an unsupervised, rank-increasing pre-training allows SGD on vanilla networks to outperform Bn networks. Second, we show that the performance of Bn-networks is closely tied to a high rank at initialization. Third, we report that SGD updates preserve the initial rank magnitude throughout the optimization process.

Outperforming BN using a pre-training step. As discussed above, batch normalization layers are very effective at avoiding rank collapse. Yet, this is of course not the only way to preserve rank. Based upon our theoretical insights, we leverage the lower bound established in Eq. (4) to design a pre-training step that not only avoids rank collapse but also accelerates the convergence of SGD. Our proposed procedure is both simple and computationally cheap. Specifically, we maximize the lower-bound (in Eq. (4)) on the rank of the hidden presentation in each layer . Since this function is differentiable with respect to its input, it can be optimized sufficiently by just a few steps of (stochastic) gradient ascent (see Section G in the Appendix for more details).

Figure 4: Pre-training versus Bn: Loss over epochs on CIFAR-10 for MLPs of increasing depth with 128 hidden units and ReLU activation. Trained with SGD (batchsize 64) and grid-searched stepsize. See Fig. 11 for the corresponding test loss and accuracy as well as Fig. 12 for FashionMNIST results.

Fig. 4 compares the convergence rate of SGD on pre-trained vanilla networks and Bn-networks. As can be seen, the slow down in depth is much less severe for the pre-trained networks. This improvement is, also, reflected both in terms of training accuracy and test loss (see Fig. 11 in Appendix). Interestingly, the pre-training is not only faster than Bn on deep networks, but it is also straight-forward to use in settings where the application of Bn is rather cumbersome such as for very small batch sizes or on unseen data [15, 29].

Breaking batch normalization. Some scholars hypothesize that the effectiveness of Bn stems from a global landscape smoothing [24] or a certain learning rate tuning [3], that are thought to be induced by the normalization. Under these hypotheses, one would expect that SGD converges fast on Bn-nets regardless of the initialization. Yet, we here show that the way that networks are initialized does play a crucial role for the subsequent optimization performance of Bn-nets.

Figure 5: Breaking Batchnorm: CIFAR-10 on VGG19 with standard PyTorch initialization as well as a uniform initialization of same variance. (Left) training accuracy, (Right) Rank of last hidden layer computed using torch.matrix_rank(). Plot also shows results for standard initialization and Bn without mean deduction. Avg. and 95% CI of 5 independent runs. (See Fig. 13 in Appendix for similar results on ResNet-50).

Particularly, we train two MLPs with batchnorm, but change the initialization for the second net from the standard PyTorch way uniform, where is the layer size. As can be seen to the right, this small change reduces the rank preserving quality of BN significantly, which is reflected in much slower learning behaviour. Even sophisticated modern day architectures such as VGG and ResNet networks are unable to fit the CIFAR-10 dataset after changing the initialization in this way (see Fig. 5).

Rank through the optimization process. The theoretical result of Theorem 2 considers the rank at random initialization. To conclude, we perform two further experiments which confirm that the initial rank strongly influences the speed of SGD throughout the entire optimization process. In this regard, Fig. 6 reports that SGD preserves the initial magnitude of the rank to a large extent, regardless of the specific network type. This is particularly obvious when comparing the two Bn initializations. A further noteworthy aspect is the clear correlation between the level of pre-training and optimization performance on vanilla nets. Interestingly, this result does again not only hold on simple MLPs but also generalizes to modern day networks such as the VGG-19 (see Fig. 5) and ResNet50 architecture (see Appendix I).

Figure 6: Pretraining: Fashion-MNIST on MLPs of depth 32 and width 128. (Left) Training accuracy, (Right) Lower bound on rank. Blue line is a ReLU network with standard initialization. Other solid lines are pre-trained layer-wise with 25 (orange) and 75 (green) iterations to increase the rank. Dashed lines are batchnorm networks with standard and asymmetric initialization. Average and 95% confidence interval of 5 independent runs.

References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[2] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.

[3] Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981, 2018.

[4] Boaz Barak and David Steurer. Proofs, beliefs, and algorithms through the lens of sum-of-squares. Course notes: http://www. sumofsquares. org/public/index. html, 2016.

[5] Peter L. Bartlett, David P. Helmbold, and Philip M. Long. Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks. Neural computation, 31(3):477–502, 2019.

[6] Nils Bjorck, Carla P. Gomes, Bart Selman, and Kilian Q. Weinberger. Understanding batch normalization, 2018.

[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

[8] Philippe Bougerol. volume 8. Springer Science & Business Media, 2012.

[9] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.

[10] Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier. Markov Chains. Springer, 2018.

[11] Peter J. Forrester. Lyapunov exponents for products of complex Gaussian random matrices. Journal of Statistical Physics, 151(5):796–808, 2013.

[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

[14] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[15] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in neural information processing systems, pages 1945–1953, 2017.

[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[17] Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. The normalization method for alleviating pathological sharpness in wide neural networks. In Advances in Neural Information Processing Systems, pages 6403–6413, 2019.

[18] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in neural information processing systems, pages 586–594, 2016.

[19] Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas Hofmann. Exponential convergence rates for batch normalization: The power of lengthdirection decoupling in non-convex optimization. arXiv preprint arXiv:1805.10694, 2018.

[20] Dang-Zheng Liu, Dong Wang, and Lun Zhang. Bulk and soft-edge universality for singular values of products of ginibre random matrices. In , volume 52, pages 1734–1762. Institut Henri Poincaré, 2016.

[21] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.

[22] Jeffrey Pennington, Samuel S Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. arXiv preprint arXiv:1802.09979, 2018.

[23] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.

[24] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization?(no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.

[25] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[26] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.

[27] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.

[28] Joel A. Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

[29] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.

[30] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[31] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pages 7103–7114, 2017.

[32] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019.

[33] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael Mahoney. PyHessian: Neural networks through the lens of the Hessian. arXiv preprint arXiv:1912.07145, 2019.

Appendix

A Preliminaries

Recall that denotes the hidden representations in layer . These matrices make a Markov chain that obeys the recurrence of Eq. (1), which we restate here

where we use the compact notation Bn for . Let be second moment matrix of the hidden representations , i.e. . Batch normalization ensures that the rows of have the same normis the size of mini-batch. Let H be space of -matrices that obey this propery. This property enforces two key characteristics on

Property (p.1) directly yields that the trace of (and hence the sum of its eigenvalues) is equal to d. We will repeatedly use these properties in our analysis.

Furthermore, the sequence constitute a Markov chain. Under mild assumptions, this chain admits an invariant distribution that is defined bellow[10].

Definition 2. Distribution is an invariant distribution of the hidden representations if it obeys

where denotes the probability measure of random weights.

Later, we will see that the above invariance property allows us to determine the commutative behaviour of the sequence of hidden presentations.

B Lower bounds on the (soft) rank

Recall that we introduced the ratio in Eq. (4) as a lower bound on both the rank(H) as well as the soft rank rank(stated in Lemma 1). This section establishes these lower bounds.

Proof of Lemma 1 (part 1). We first prove that rank. Let . Since the eigenvalues of H are obtained by a constant scaling factor of the squared singular values of H, these two matrices have the same rank. We now establish a lower bound on rank(M). Let contain the eigenvalues of matrix we define the vector

To proof the assertion, we make use if a straightforward application of Cauchy-Schwartz

Replacing into the above equation concludes the result. Note that the above proof technique has been used in the planted sparse vector problem [4].

Proof of Lemma 1 (part 2). Now, we prove that ranka vector containing the eigenvalues of the matrix . Let contain the singular values of H. Then, one can readily check that . Furthermore, due to (p.1) in Eq. (8). Furthermore, we have by definition that

Let us now define a vector with entries

Then, we use Cauchy-Schwartz to get

It is easy to check that holds. Furthermore,

where we used the fact that in the last inequality. Replacing this result into the bound of Eq. (17) yields

which conludes the proof.

C Initialization consequences

The particular weight initialization scheme consider through out this work (recall Def. 1), imposes an interesting structure in the invariant distribution of the sequence of hidden presentations (defined in Def. 2).

Lemma 4. Suppose that the chain (defined in Eq. 7) admits a unique invariant distribution and H is drawn from ; then, the law of equates the law of where denotes the ith row of matrix H.

Proof. Let S be a sign filliping matrix: it is diagonal and its diagonal elements are in Then holds for a random matrix W whose elements are drawn i.i.d. from a symmetric distribution. Let H be drawn from the invariant distribution of the chain denoted by ; Leveraging the invariance property, we get

By multiplying both sides with S, we get

Note that we use the fact that diagonal matrices commute in the above derivation. According to the definition, holds. Considering this fact, we get

Replacing the above result into Eq. (29) yields

Hence the law of SH is invariant too. Since the invariant distribution is assumed to be unique, holds and thus

Comment on Bn-centering Let be the unique invariant distribution associated with Markov chain . A straightforward implication of last Lemma is for , hence the rows of are mean-zero, hence their average is close to zero and the mean-zero operation in Bn is redundant. Although this theoretical argument is established for linear networks, we empirically observed that Bn without centering also works well on modern neural architectures. For example, Fig. 7 shows that the centering does not affect the performance of Bn on a VGG net when training CIFAR-10.

D Main Theorem: warm-up analysis

As a warm-up analysis, the next lemma proves that rankholds. Later, we will prove a stronger result. Yet, this initial results provides valuable insights into our proof technique. Furthermore, we will use the following result in the next steps.

Figure 7: Centering for Bn. The experiment is conducted on a VGG network. The blue line indicates the original Bn network and the orange line is Bn without mean adaption. The vertical axis in the left plot is training accuracy. In the right plot it is , where is the data representation in the last hidden layer L. The horizontal axis indicates the number of iterations.

Lemma 5. Suppose that each element of the weight matrices is independently drawn from distribution P that is zero-mean, unit-variance, and its support lies in interval . If the Markov chain admits a unique invariant distribution, then

holds almost surely for all integers

Proof. Let the weights be drawn from the distribution , defined in Def. 1. Such a distribution obeys an important property: element-wise symmetricity. That is, is distributed as . Such an initialization enforces an interesting structural property for the invariant distribution that is stated in Lemma 4. It is easy to check that this implies

for any . Recall, . The above property enforces to be small and hence is small as well. Now, as rankis proportional to Eq. (4)), it must consequently stay large. The rest of the proof is based on this intuition. Given the uniqueness of the invariant distribution, we can invoke Birkhoff’s Ergodic Theorem for Markov Chains (Theorem 5.2.1 and 5.2.6 [10]) which yields

This allows us to conclude the proof by a simple contradiction. Assume that rankone. Then, as established in the following Lemma, in the limit all entries of are constant and either

Lemma 6. Suppose the assumptions of Lemma 5 hold. If rankfor an integer k, then holds for all . Furthermore, all elements of all matrices have absolute value one, hence

holds.

As a result, leveraging the ergodicity established in (61), we get that then

must also hold. However, this contradicts the consequence of the symmetricity (Eq. (31)) which states that for any we have . Thus, the rank one assumption cannot hold, which proves the assertion.

To complete the proof of the last theorem, we prove Lemma 6.

Proof of Lemma 6. For the sake of simplicity, we omit all superscripts throughout the proof. Suppose that rank, then rankfor all as the sequence is non-increasing . Invoking the established rank bound from Lemma 1, we get

The above inequality together with properties (p.1) and (p.2) (presented in Eqs 8 and 9) yield Tr. Replacing this into the above equation gives that must hold for the rank of to be one. Yet, recalling property (p.2), this can only be the case if for all i, j. Replacing the definition into updates of hidden presentation in Eq. 1 obtains

where

We now prove that the sign of are the same for . The above update formula implies that the sign of equates that of . Furthermore, it is easy to check that . For , this bound yields . Therefore, the sign of is equal to the one of . Since furthermore holds, we conclude that all elements of remain constant for all , which yields the limit stated in Eq. 33 .

E Main theorem: Proof

In this section, we prove that -rank for hidden representation.

Proof sketch for Thm. 2. The proof is based on an application of ergodic theory (as detailed for example in Section 5 of [10]). In fact, the chain of hidden representations, denoted by (1), constitutes a Markov chain in a compact space. This chain admits at least one invariant distribution for which the following holds

for every bounded Borel function . The above invariance property provides an interesting characterization of the invariant measure . Particularly, we show in Lemma 13 that

holds, where r(H) is the established lower-bound on the rank (see Lemma 1). Under weak assumptions, the chain obey Birkhoff’s Ergodicity, which yields that the average behaviour of the hidden representations is determined by the invariant measure

Finally, the established lower bound in Lemma 1 allows us to directly extend this result to a lower bound on the soft rank itself.

Characterizing the change in Frobenius norm Recall the established lower bound on the

rank denoted by r(H), for which

holds for all defined in Eq. 1.Therefore, directly influences rank(and also rank) according to Lemma 1. Here, we characterize the change in after applying one step of the recurrence in Eq. 7 to H, i.e. passing it trough one hidden layer. This yields

Let M = M(H) and for simplicity. The next lemma estimates the expectation (taken over the randomness of W) of the difference between the Frobenius norms of

holds as long as the support of distribution P (in Def. 1) lies in a finite interval

The proof of the above lemma is based on a Taylor expansion of the Bn non-linear operator. We postpone the detailed proof to the end of this section. While the above equation seems complicated at first glance, it provides some interesting insights.

Interlude: Intuition behind Lemma 7. In order to gain more understanding of the implications of the result derived in Lemma 7, we make the simplifying assumption that all the rows of matrix M have the same norm. We emphasize that this assumption is purely for intuition purposes and is not necessary for the proof of our main theorem. Under such an assumption, the next proposition shows that the change in the Frobenius norm directly relates to the spectral properties of matrix M.

Proposition 8. Suppose that all the rows of matrix M have the same norm. Let the eigenvalues of matrix M. Then,

holds and hence

We postpone the proof to the end of this section. This proposition re-expresses the polynomial of Lemma 7 in terms of the eigenspectrum of M.

Based on the above proposition, we can make sense of interesting empirical observation reported in Figure 3.b. This figure plots the evolution of the eigenvalues of after starting from a matrix whose leading eigenvalue is large and all other eigenvalues are very small. We observe that a certain fraction of the small eigenvalues of grow quickly with , while the leading eigenvalue is decreases in magnitude. In the next example, we show that the result of the last proposition actually predicts this observation.

Example 9. Suppose that M is a matrix whose rows have the same norm. Let the eigenvalues associated with the matrix In this setting, Prop. 8 implies that for a sufficiently small change has two consequences in expectation:(i.) the leading eigenvalue of than the leading eigenvalue of M, and (ii.) some small eigenvalues of are greater than those of M (see Fig. 3.b).

We provide a more detailed justification for the above statement at the end of this section. This example illustrates that the change in Frobenius norm (characterized in Lemma 7) can predict the change in the eigenvalues of (singular values of ) and hence the desired rank. Inspired by this, we base the proof of Theorem 2 on leveraging the invariance property of the unique invariant distribution with respect to Frobenius norm – i.e. setting in Def. 2. An observation: regularity of the invariant distribution We now return to the result de-

rived in Lemma 7 that characterizes the change in Frobenius norm of M(H) after the recurrence of Eq. (7). We show how such a result can be used to leverage the invariance property with respect to the Frobenius norm. First, we observe that the term Trin the expansion can be shown to dominate the term Trin expectation. The next definition states this dominance formally.

Definition 3. (Regularity constant be a distribution over . Then the regularity constant associated with is defined as the following ratio:

The next lemma states that the regularity constant associated with the invariant distribution is always less than one. Our analysis will in fact directly rely on

Lemma 10. Suppose that the chain admits the unique invariant distribution (in Def. 2). Then, the regularity constant of (in Def. 3) is less than one for a sufficiently small

Proof. We use a proof by contradiction where we suppose that the regularity constant of distribution is greater than one. In this case, we prove that the distribution cannot be invariant with respect to the Frobenius norm.

If the regularity constant is greater than one, then

holds. According to Theorem 5, the rank of M(H) is at least 2. Since the sum of the eigenvalues is constant d, the leading eigenvalue is less than d. This leads to

Plugging the above inequality together with inequality 47 into the established bound in Lemma 7 yields

for a sufficiently small . Therefore, does not obey the invariance property for g(H) =

We can experimentally estimate the regularity constant using the Ergodicity of the chain. Assuming that the chain is Ergodic

holds almost surely for every Borel bounded function . By setting and , we can estimate for i = 1, and 2. Given these estimates, can be estimated. Our experiments in Fig. 8 show that the regularity constant of the invariant distribution is less than 0.9 for d > 10.

Figure 8: Regularity constant of the invariant distribution. The vertical axis is the estimated regularity constant and the horizontal axis is (in Eq. (49)).

Interlude: intuition behind the regularity We highlight the regularity constant does by itself not yield the desired rank property in Theorem 2. This is illustrated in the next example that shows how the regularity constant relates to the spectral properties of M(H).

Example 11. Suppose that the support of distribution contains only matrices all rows of M(H) have the same norm. If the regularity constant of is greater than or equal to one, then all non-zero eigenvalues of matrix M(H) are equal.

A detailed justification of the above statement is presented at the end of this section. This example shows that the regularity constant does not necessarily relate to the rank of H, but instead it is determined by how much non-zero eigenvalues are close to each other. We believe that a sufficient variation in non-zero eigenvalues of M(H) imposes the regularity of the law of H with a constant less than one (i.e. in Def. 3). The next example demonstrates this.

Example 12. Suppose the support of distribution contains matrices for which all rows of M(H) have the same norm. Let contain sorted eigenvalues of M(H). If and for i > 1 and then the regularity constant associated with is less than 0.9 for sufficiently large d.

We later provide further details about this example. Invariance consequence The next lemma establishes a key result on the invariant distribution

ν

Lemma 13. Suppose that the chain (see Eq. 7) admits the invariant distribution Def. 2). If the regularity constant associated with (defined in Def. 3), then

holds for a sufficiently small

Proof. Leveraging invariance property in Def. 2,

holds where the expectation is taken with respect to the randomness of W and Invoking the result of Lemma 7, we get

Having a regularity constant less than one for

holds for sufficiently small . Let be a random vector containing the eigenvalues of the random matrix The eigenvalues of are , hence the invariance result can be written alternatively as

The above equation leads to the following interesting spectral property:

A straightforward application of Cauchy-schwarz yields:

Given (i) the above bound, (ii) an application of Jensen’s inequality, (iii) and the result of Eq. (55), we conclude with the desired result:

Notably, the invariant distribution is observed to have a regularity constant less than 0.9 (in Fig. 8) for sufficiently large d. This implies that an upper-bound Ois achievable on the Frobenius norm. Leveraging Ergodicity (with respect to Frobenius norm in Eq. (49)), we experimentally validate the result of the last lemma in Fig. 9.

Proof of the Main Theorem Here, we give a formal statement of the main Theorem that

contains all required additional details (which we omitted for simplicity in the original statement).

Figure 9: Dependency of . The horizontal axis is and the vertical axis shows . The green dashed-line plots

Theorem 14 (Formal statement of Theorem 2). Suppose that rankis sufficiently small, and all elements of the weight matrices are drawn i.i.d. from a zero-mean, unit variance distribution whose support lies in and its law is symmertic around zero. Furthermore, assume that the Markov chain (defined in Eq. 1) admits a unique invariant distribution. Then, the regularity constant associated with (see Def. 3) is less than one and the following limits exist such that

holds almost surely for all . Assuming that the regularity constant does not increase with respect to d, the above lower-bound is proportional to

Remarkably, we experimentally observed (in Fig. 8) that the regularity constant is decreasing with respect to d. Examples 11 and 12 provide insights about the regularity constant. We believe that it is possible to prove that the constant is non-increasing with respect to d.

Proof of Theorem 2 . Lemma 10 proves that the regularity constant is less than one for the unique invariant distribution. Suppose that is a random matrix whose law is the one of the unique invariant distribution of the chain. For . A straightforward application of Jensen’s inequality yields the following lower bound on the expectation of r(H) (i.e. the lower bound on the rank):

where the expectation is taken over the randomness of H (i.e. the invariant distribution). Invoking the result of Lemma 13, we get an upper-bound on the expectation of the Frobenius norm – in the right-side of the above equation. Therefore,

holds. The uniqueness of the invariant distribution allows us to invoke Birkhoff’s Ergodic Theorem for Markov Chains (Theorem 5.2.1 and 5.2.6 [10]) to get

The established lower bound on rank–in terms of – in Lemma 1 concludes

As shown in the following corollary, one can extend the result of Theorem 14 for any finite

Corollary 15. Under the setting of Thm. 14, rankholds almost surely for all finite integer . Assuming that is a monotonically no-increasing sequence, then rankholds almost surely for all finite

Proof. The proof is based on the no-increasing property of the rank. Next lemma presents a straightforward implication of this property.

Lemma 16. Consider a sequence of non-increasing bounded finite integers exists and is greater than for all finite k.

The proof of the last lemma is provided at the end of this section. Replacing the result of Thm. 14 into the above lemma concludes the proof of the corollary.

A remark on the number of hidden units. The focus of our analysis was networks with the same number of hidden units in each layer. Yet, this result extends to more general architectures. Most of modern neural architectures consists of blocks in which the number of hidden units are constant. For example, VGG19-Nets and ResNets are consist of blocks convolutional layers with 64, 128, 256, and 512 channels where the number channels are equal in each block. An analogy of such an architecture is an MLP with different blocks of hidden layers where the numbers of hidden units are the same in each block. According to Cor. 15, the rank preservation property holds in each block after applying BN. In this way, one can extend the established results of Thm. 14 and Cor. 15 to a general family of architectures with varying number of hidden units.

Postponed proofs.

Proof of Lemma 7. The proof is based on a Taylor expansion of the Bn non-linear recurrence function, which we restate here for simplicity:

Consider the covariance matrices which obey

For the sake of simplicity, we use the compact notation . We further introduce the set of indices S = {ii, ij, jj}. A taylor expansion of g at M yields

Note that the choice of the element-wise uniform distribution over allows us to deterministically bound the Taylor remainder term by O. Now, we compute the derivatives and expectations that appear in the above expansion individually. Let us start with the term The first-order partial derivative term in is computed bellow.

The expectation term in

Given the above formula, we reach the following compact expression for

The compute we need to compute second-order partial derivatives of g and also estimate the following expectation:

We now compute in the above formula

The second-order partial derivatives of g reads as

Now, we replace the computed partial derivatives and the expectations into

Plugging terms into the Taylor expansion yields

Summing over concludes the proof (note that the diagonal elements are one for the both of matrices

Proof of Proposition 8. Consider the spectral decomposition of matrix then is equal to the sum of the eigenvalues of

for k = 2 and k = 3. The sum of the squared norm of the rows in M is equal to the Frobenius norm of M. Assuming that the rows have equal norm, we get

Therefore,

holds.

Details of Example 9. Under the assumptions stated in Example 9, we get

where the approximations are obtained by a first-order Taylor approximation of the norms at , and all small terms are omitted. Using the result of Proposition 8, we get

Let be the eigenvalues of the matrix

Let . A straight-forward application of Jensen’s inequality yields

Hence the leading eigenvalue of is smaller than the one of M. Since the sum of eigenvalues and are equal, some of the eigenvalues are greater than those of (in expectation) to compensate

Details of Example 11. Invoking Prop. 8, we get

where contains the eigenvalues of M(H). Since . If the regularity constant is greater than or equal to one, then

A straightforward application of Cauchy-Schwartz yields:

The above result together with inequality 84 yields that

Finally, the above equality is met only when all non-zero eigenvalues are equal.

Details of Example 12. Since

Thus, Prop. 8 yields

Therefore,

As a result, is less than 0.9 for sufficiently large d.

Proof of Lemma 16. The proof is based on a contradiction. Suppose that there exits a finite n such that . Since the sequence is non-increasing, for holds for all m > n. This yields

where we used the fact that all are bounded. The above result contradicts the fact that

F Analysis for Vanilla Linear Networks.

In this section, we prove Lemma 3 that states the rank vanishing problem for vanilla linear networks. Since the proof relies on existing results on products of random matrices (PRM) [8], we first shortly review these results. Let T be the set of matrices. Then, we review two notions for T: contractiveness and strong irreducibility.

Definition 4 (Contracting set [8]). T is contracting if there exists a sequence such that converges to a rank one matrix.

Definition 5 (Invariant union of proper subspaces [8]). Consider a family of finite proper linear subspace . The union of these subspaces is invariant with respect to T, if

Example 17. Consider the following sets

then, union of and is invariant with respect to T because and hold for

Definition 6 (Strongly irreducible set [8]). The set T is strongly irreducible if there does not exist a finite family of proper linear subspaces of such that their union is invariant with respect to T.

For example, the set T defined in Example 17 is not strongly irreducible.

Lemma 18 (Thm 3.1 of [8])matrices drawn independently from a distribution . Let . If the support of is strongly irreducible and contracting, then any limit point of is a rank one matrix almost surely.

This result allows us to prove Lemma 3.

Proof of Lemma 3. Recall the structure of the random weight matrices as the coordinates are i.i.d. from (a.) standard Gaussian, (b.) uniform(i.e. with variance 1). One can readily check that for the Gaussian weights, the contracting and strong irreducibility hold and one can directly invoke the result of lemma 18 to get part (a.) of Lemma 3. Now, we prove part (b.). Let m be a random integer that obeys the law the random variable m, we define the random matrix and use the notation its law. Let be drawn i.i.d. from is distributed as . We prove that every limit point of to a rank one matrix, which equates the convergence of limit points of to a rank one matrix. To this end, we prove that the support of is contractive and strongly contractive. Then, Lemma 18 implies that the limit points of are rank one.

Contracting. Let be the first standard basis vector. Since and its limit point converges to a rank one matrix, is contractive.

Strong irreduciblity. Consider an arbitrary family of linear proper subspace of as . Let v be an arbitrary unit norm vector which belongs to one of the subspaces , we define an indexed family of matrices

where is the i-th standard basis. Then, we get

Therefore, is not contained in any union of finite proper (m < k)-dimensional linear subspace of is strongly irreducible.

G Details: Pretraining algorithm

In Section 4, we introduced a pre-training method that effectively obtains a better optimization performance compared ot Bn. In this section, we provide more details about the pre-training step. Recall is a minibatch of d-dimensional inputs of size the hidden representation of input X in the last layer of a MLP. Using gradient descent method, we optimize –with respect to the parameters of networks– over different minibatches X. Algorithm 1 presents our pretraining method. As can be seen, the procedure is very simple.

H Details: Why the rank matters for gradient based learning.

We now provide an intuitive explanation of why rank one hidden representations prevent randomly initialized networks from learning. Particularly, we argue that these networks essentially map all inputs to a very small subspacesuch that the final classification layer can no longer disentangle the hidden representations. As a result, the gradients of that layer also align, yielding a learning signal that becomes independent of the input.

To make this claim more precise, consider training the linear network from Eq. (6) on a dataset , where with targets . Each column of the hidden representations in the last hidden layer is the latent representation of datapoint i, which is fed into a final classification layer parametrized by . We optimize L(W), where W is a tensor containing all weights is a function of (as detailed in Eq. (6):

and is a differentiable loss function. Now, if the the hidden representations become rank one (as predicted by Lemma 3 and Fig. 2), one can readily check that the stochastic gradients of any neuron k in the last linear layer, i.e., , align for both linear and ReLU networks.

Proposition 19. Consider a network with rank one hidden representations in the last layer , then for any neuron k and any two datapoints i, j with non-zero errors

. That is, all stochastic gradients of neuron k in the final classification layer align along one single direction in

Proof. The result follows directly from a simple application of the chain rule

To validate this claim, we again train CIFAR-10 on the VGG19 network from Figure 5 (top).

Figure 10: Directional gradient vanishing CIFAR-10 on a VGG19 network with BN, SGD, SGD with 100x learning rate and SGD on random data. Average and 95% confidence interval of 5 independent runs.

As expected, the network shows perfectly aligned gradients without Bn (right hand side of Fig. 10), which renders it un-trainable. In a next step, we replace the input by images generated randomly from a uniform distribution between 0 and 255 and find that SGD takes almost the exact same path on this data (compare log accuracy on the left hand side). Thus, our results suggest that the commonly accepted vanishing gradient norm hypothesis is not descriptive enough since SGD does not take small steps into the right direction- but into a random one after initialization in deep neural networks. As a result, even a 100x increase in the learning rate does not allow training. We consider our observation as a potential starting point for novel theoretical analysis focusing on understanding the propagation of information through neural networks, whose importance has also been highlighted by [6].

I Additional Experiments

Figure 11: CIFAR-10: Same setting as Fig.4 but now showing accuracy and test loss

Outperforming Bn The following Figure shows the result of the experiment of Fig. 4 that is repeated for FashionMNIST dataset.

Figure 12: Results of Fig.4 for FashionMNIST

Breaking Bn In the following result, we repeated the experiment of Fig. 5 for ResNets.

Figure 13: Breaking Batchnorm: CIFAR-10 on a ResNet-50 with standard PyTorch initialization as well as a uniform initialization of same variance in . Average and 95% confidence interval of 5 independent runs. This plot also shows results for a Bn network without mean deduction/adaption, validating our claim from Section 2.