It has frequently been observed (Choromanska et al., 2015) that deep neural networks are relatively easy to optimize using straightforward SGD algorithms, despite having complex non-convex loss surfaces. A great body of work in recent years has focused on the properties of SGD and its ability to maneuver through saddle points and complex surface topologies, in order to explain the relative ease of the optimization process (Daneshmand et al., 2018). A complementary approach to analyzing the optimization process, is to consider the trajectories of the gradients during optimization, with the guiding intuition that while the loss surface defined by modern neural networks might be incredibly complex and hard to analyze, the trajectories of gradient descent on these loss surfaces might prove rather simple.
An additional direction of research focuses on the spectrum of the Hessian of the loss during training, as the knowledge of the second order information about the loss surface can tell us quite a bit about the behavior of the optimizer, and how we could modify our algorithms to make them converge faster and to better solutions (Gupta et al., 2018; Martens & Grosse, 2015). Recent works have empirically shown that the distribution of the eigenvalues of the Hessian is composed of a bulk, which is concentrated around zero, and the edges which are scattered away from zero (Sagun et al., 2016). Modern neural networks, however, contain millions of parameters, resulting in a hefty Hessian with millions of eigenvalues, the overwhelming majority of which are irrelevant.
In our work, we seek to study the curvature of the Hessian in directions corresponding to the trajectory of the gradients during training, by focusing on the Hessian gradient product. Specifically, we aim to empirically and theoretically analyze the path of gradient descent algorithms by analyzing the Hessian at initialization.
We identify and empirically demonstrate novel phenomena relating to the trajectory of gradient descent neural networks under certain conditions. We observe that the gradients follow a convex path during training when performing SGD, such that the curvature of the loss is seldom negative in the direction of the gradients. This effect is more prominent in wider networks, i.e., in CNNs with a larger number of kernels. Furthermore, we theoretically prove that when using common initialization methods, the gradients of random feedforward networks are indeed pointing in a direction of positive curvature when the network is wide enough, and the input is large enough.
1.1. Terminology
Given the output of some neural network for input
, with
denoting the model weights at iteration t, and a convex, twice differentiable loss function
, we denote the gradient on a batch of N samples by
, and the Hessian by
. A positive curvature during training at step t is characterized by the following condition:
Figure 1. The approximated curvature in the direction of the gradient, for networks with different width multiplier K = 1..5. (a,c) as a function of the epoch for multiple network widths, shown for CIFAR-100, and tiny imagenet, respectively. The different plots denote different layer width. (b,d) the test loss for the same networks trained on CIFAR-100 and tiny imagenet.
We approximate the Hessian gradient product ing the training of deep neural networks, using the following formula:
Which directly leads to:
where is the learning rate, and the loss L is computed on a minibatch
We track the behavior of across the epochs for networks of varying widths. In our experiments, we employ the ResNet-50 network (He et al., 2016), which has layers with 16, 32, or 64 kernels. We make these layers wider by multiplying the number of kernels by a factor K = 1, . . . , 5. We halve the learning rate at epochs 40, 80, and 120. A batch size of 100 is used with a vanilla SGD optimizer.
The results are shown in Fig. 1 for two popular datasets: CIFAR-100 (Krizhevsky, 2009) and tiny imagenet (Le & Yang, 2015). In both datasets, one can observe that (i) throughout the majority of the training run, positive; (ii) at the start of training, and when the network converges, it obtains larger values; and (iii) typically, the wider the network, the bigger this quantity is.
Fig. 1 also presents the test error for the same benchmarks
Figure 2. The curvature in the direction of the gradient, decomposed to its components (blue), as a function of epoch, for fully connected networks with layers’ width of (a) 50, (b) 200, and (c) 400, trained on random data and labels. In all cases,
is mostly dominated by
starting closer to zero for wider networks.
and the same networks. We observe that wider networks lead to smaller test errors, which means that their tendency toward a convex-mimicking behavior during training does not come at the expanse of the generalization ability.
We further conduct small scale experiments, in which one can efficiently decompose the empirical Hessian into its components H and G, and calculate the quantities and
during training. Specifically, we train a single output, 8 layer fully connected network with layer width of 50, 200 and 400 on a set of 1000 predetermined random inputs sampled from a Gaussian distribution with dimensionality matching the width of the network, which are then normalized to have unit norm. The network are trained to fit with a standard L2 loss random labels sampled from a binary distribution.
The results are shown in Fig. 2. As can be seen, in all cases, the network exhibits a convex-mimicking behaviour, similar to the results of our CNN experiments. Specifically, the curvature observed in seems to closely follow that of
. In addition, as predicted in our analysis, the curve of
starts closer to zero for wider networks.
An initial analysis is performed for linear networks (no non-linearities) at initialization, where the weights of the networks are randomly sampled. We make very few assumptions on the nature of the networks’ initialization, and the popular initialization techniques satisfy our assumptions.
3.1. Overview
In our theoretical analysis, we focus on the Hessian of a single output linear network with L layers, where the output of layer l = 1, . . . , L is given by where
, and the input, which is also denoted by
is given by a fixed dataset
. We denote by W the concatenation of all the parameters of the network in a column vector.
As mentioned, we consider a twice differentiable convex loss function L. We further assume that the loss has bounded first and second derivatives. The Hessian matrix is given by:
where is the gradient of the network output unit for input s with respect to the weights. The first term of the Hessian is comprised of a sum of positive semidefinite matrices, and is, therefore, positive semidefinite, while the second term is not, and is dependent on the architecture. In the following, we aim to show that under certain conditions pretaining to the width of the network and the dimensionality of the input, which are usually met in practice, the trajectory of GD is influenced primarily by G. That is, denoting by index t the evaluation of a tensor at iteration t, it holds that:
3.2. Main Results
In our analysis, we consider the weights sampled iid from a symmet- ric distribution with moments
, such that
, where
are the moments of the distribution. Notice that this condition applies to the common initialization scheme (He et al., 2015). We first consider the gradient and Hessian evaluated on a single sample with unit norm, i.e.,
, and then discuss multiple samples. We make the following notations: the derivative of the output unit with respect to a single weight ij in layer k is denoted by
we denote
where the triplets of indices ijk, uvl parameterize row and column indices in the matrix (note that the ranges of ij, uv depend on the layer indices k and l). The next theorem gives the expected mean and variance of
for wide networks at initialization:
Theorem 1. and for
,
it holds that
The Hessian gradient product is not necessarily indicative of the local curvature, since we do not take into account the norm of the gradient. As shown in (Arpit & Bengio, 2019), a constant width network ensures that the norm of the gradient approximately equals the input. More specifically, by having , we ensure that
omitted for brevity). The next theorem gives a bound on the probability of the local curvature to be above some the full Hessian.
Theorem 2. Assume that , and
Prob(
then it holds that:
The above analysis provides a rather surprising result, which is that the dimensionality of the input also plays a role in determining the local curvature at initialization, in addition to the width of the network.
We have discussed a single sample in our analysis so far. In the case of linear neural networks, the result of Thm. 1 also hold for the batch case since we did not assume anything on the input. Specifically, since and considering the gradient of sample
on sample v, then it holds that
The term can be significantly smaller in the batch case, even when the gradient is large. Indeed, under mild conditions, when gradients of individual samples are orthogonal to each other, this term can be made small while keeping the overall gradient large, insuring a faster convergence to the optimum. Precise statement and analysis are kept for future work.
The literature has already linked the width of a neural network to the property of norm-preserving. For example, (Arpit & Bengio, 2019) show that under different sets of assumptions, the wider the network, the more likely the norm of the activations will be preserved with depth. Our work is the first one, as far as we know, to show a link between a convex-mimicking behavior during optimization and the network’s width.
The literature also teaches us about the properties of the Hessian during training (Sagun et al., 2016). As far as we know, we are unique in that we look at the Hessian in the direction of the gradient, where it matters most to the optimization process.
The line of research can lead to a better understanding of the link between different architectures and their trainability. Moreover, the signal we track can be readily used in order to drive the optimization process. Specifically, it may be derivable to adjust the hyperparameters (learning rate, batch size, batch composition, etc.) such that the gradients are large while the product of the gradient with the Hessian implies close to zero curvature.
This project has received funding from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant ERC CoG 725974).
Arpit, D. and Bengio, Y. The benefits of overparameterization at initialization in deep relu networks. arXiv preprint arXiv:1901.03611, 2019.
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The Loss Surfaces of Multilayer Networks. In Lebanon, G. and Vishwanathan, S. V. N. (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 192–204, 2015.
Daneshmand, H., Kohler, J., Lucchi, A., and Hofmann, T. Escaping saddles with stochastic gradients. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1155– 1164, Stockholmsmssan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr. press/v80/daneshmand18a.html.
Gupta, V., Koren, T., and Singer, Y. Shampoo: Precondi- tioned stochastic tensor optimization. In International Conference on Machine Learning, pp. 1837–1845, 2018.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE
International Conference on Computer Vision (ICCV), ICCV ’15, pp. 1026–1034, Washington, DC, USA, 2015. IEEE Computer Society. ISBN 978-1-4673-8391-2. doi: 10.1109/ICCV.2015.123. URL http://dx.doi. org/10.1109/ICCV.2015.123.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Le, Y. and Yang, X. J. Tiny imagenet visual recognition challenge. Technical report, 2015.
Martens, J. and Grosse, R. B. Optimizing neural networks with kronecker-factored approximate curvature. In ICML, 2015.
Sagun, L., Bottou, L., and LeCun, Y. Singularity of the hes- sian in deep learning. arXiv preprint arXiv:1611.07476, 2016.
Proof. The derivative of unit j in layer k with respect to unit u in layer l is denoted by. Dropping the unit indexes j or u means deriving the entire layer, or with respect to the entire layer respectively. Dropping all superscripts means derivative of the output unit, so that
. It then holds that
. Finally, we denote by
the
column and row vectors of
respectively. We have:
where the triplets of indices ijk, uvl parameterize row and column indices in the matrix (note that the ranges of ij, uv depend on the layer indices k and l).
The Hessian gradient product is given by:
The claim is trivially true, since each term in the Hessian gradient products will contain odd moments of the weights, which zero out for symmetric distributions. For deep networks, we have that:
and so:
where we assumed in the last transition that . Applying expectation on the last weight layer
that:
Plugging each term in Eq. 12 yields the same asymptotic behaviour. For the sake of brevity, we will show this with the first term. Without loss of generality, we assume , we now recursively apply expectations over
, while noticing that the following hold for fixed vectors
and so:
Recursively going through
Summing over