Iterative Averaging in the Quest for Best Test Error

2020·Arxiv

Abstract

Abstract

We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surface on the high dimensional quadratic. We derive three phenomena from our theoretical results: (1) The importance of combining iterate averaging (IA) with large learning rates and regularisation for improved regularisation. (2) Justification for less frequent averaging. (3) That we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results compared to stochastic gradient descent (SGD), require less tuning and do not require early stopping or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets on a variety of modern and classical network architectures.

1. Introduction

Deep Neural Network (DNN) models achieve state of the art performance in a plethora of problems, such as speech recognition, visual object image recognition, object detection, drug discovery and genomics (LeCun et al., 2015). Of key interest is the ability of DNNs to “generalise” to unseen data, even when the parameter number greatly exceeds the dataset size (Zhang et al., 2016). DNNs are typically trained using stochastic gradient descent (SGD), in which model parameters at each optimisation step, , are updated using the gradient of the minibatch loss at the previous step,

where denotes the learning rate at iteration k. Whilst careful monitoring of the validation metrics, along with weight decay (Krogh and Hertz, 1992), layer-wise normalisation (Ioffe and Szegedy, 2015) and data-augmentation (Shorten and Khoshgoftaar, 2019; Zhang et al., 2017) help protect against over-fitting to the training data, the initial value of and its schedule throughout training has a large impact on generalisation (Jastrzebski et al., 2017; Li et al., 2019), making it a key hyperparameter to set correctly. Theoretical results for optimal asymptotic training set convergence prescribe a learning rate proportional to the inverse square root of the number of iterations (Nesterov, 2013) or a decay at this rate (Duchi, 2018). However, such schedules often result in poor test set performance for DNNs. Curiously, Merity et al. (2017) and Izmailov et al. (2018) demonstrate that combining tail iterate averaging (i.e. taking an average of the last iterates in training) with large learning rate SGD increases DNN generalisation at the expense of training accuracy. However, quite why and how this works is still something of a mystery, limiting its widespread adoption.

One proposal in the literature to limit sensitivity to the learning rate and its schedule has been the development of adaptive gradient optimisers, which invoke a per-parameter learning rate based on the history of gradients. Popular examples include Adam (Kingma and Ba, 2014), AdaDelta (Zeiler, 2012) and RMSprop (Tieleman and Hinton, 2012). Ignoring momentum and explicit regularisation, the iteration of a general adaptive optimiser is given by:

where the preconditioning matrix B typically approximates curvature information. Crucially, however, the generalisation of solutions found using adaptive methods, as measured in terms of test and validation error, significantly underperforms SGD (Wilson et al., 2017). Due to this, state-of-the-art convolutional neural networks, especially for image classification datasets such as CIFAR (Yun et al., 2019) and ImageNet (Xie et al., 2019; Cubuk et al., 2019) are still trained using SGD with momentum (Nesterov, 2013). Furthermore, despite Iterate Averaging (IA) being mentioned as a potential amendment in the original Adam paper (Kingma and Ba, 2014) and being required in the convergence proof (Reddi et al., 2019), it is not widely used for computer vision or other complex problems.

2. Contributions

The key contribution of this paper are: • We investigate the impact of IA on generalisation by considering high-dimensional SGD on the quadratic model of the true risk perturbed by i.i.d. Gaussian noise, showing that the iterate average

attains the global minimum, whereas the final point, despite multiple learning rate drops, or increases in batch size during training, does not.

• We extend the framework to a Gaussian process perturbation model between the true and batch gradients. We find that as long as certain technical conditions (well met in practice) are satisfied, the simplified result holds. Crucially, distance in weight space or relative weight space (depending on kernel choice) are pivotal to the effect, justifying in practice the need for large learning rates in conjunction with iterate averaging.

• We show that adaptive gradient methods have identical properties under the iterate average, but we expect them to converge faster than their non-adaptive counterparts.

• Motivated by these results, we consider why adaptive methods are not typically used in conjunction with iterate averaging? We find that ineffective regularisation, which limits the effective distance and prediction diversity between the iterates, is the main culprit and propose a simple, yet effective solution. We propose two Adam-based algorithms: Gadam and GadamX. Both outperform baselines tested for all networks and datasets we consider. GadamX achieves a Top-1 error of 22.69% on ImageNet using ResNet-50, outperforming a well-tuned SGD baseline of 23.85% (Chintala et al., 2017). To put this into perspective, the gain attributed to widely-adopted cosine schedules increases accuracy by 0.3% (Bello et al., 2021).

• Showing that adaptive methods can outperform SGD and SGD with IA provides a practical framework that can be used, even for large-scale problems.

Related Work & Motivation: To the best of our knowledge there has been no explicit theoretical work analysing the generalisation benefit of iterate averaging. Whilst Izmailov et al. (2018) propose that iterate averaging leads to “flatter minima which generalise better”, flatness metrics are known to have limitations as a proxy for generalisation (Dinh et al., 2017), in addition to which we show in the appendix Section E.1 that adaptive methods can find very sharp minima with good generalisation properties. Martens (2014) show that the IA convergence rate for both SGD and second-order methods are identical, but argue that second-order methods have an optimal pre-asymptotic convergence rate on a quadratic loss surface. Here, pre-asymptotic means before taking the number of iterations and quadratic means that the Hessian is constant at all points in weight-space. The analysis does not extend to generalisation and no connection is made to adaptive gradient methods, nor to the importance of the high parameter-space dimensionality of the problem, two major contributions of our work. Amendments to improve the generalisation of adaptive methods include switching between Adam and SGD (Keskar and Socher, 2017) and decoupled weight decay (Loshchilov and Hutter, 2019), limiting the extent of adaptivity (Chen and Gu, 2018; Zhuang et al., 2020). We incorporate these insights into our algorithms but significantly outperform them experimentally. The closest algorithmic contribution to our work is Lookahead (Zhang et al., 2019b), which combines adaptive methods with an exponentially moving average scheme. We analyse this algorithm both theoretically (see appendix Section B) and experimentally.

3. Iterate Averaging: A New Theory for Generalisation

The iterate average (Polyak and Juditsky, 1992) is the arithmetic mean of the model parameters over the optimisation trajectory . It is a classical variance reducing technique in optimisation and offers optimal asymptotic convergence rates and greater robustness to the choice of learning rate (Kushner and Yin, 2003). Indeed, popular regret bounds that form the basis of gradientbased convergence proofs (Duchi et al., 2011; Reddi et al., 2019) often consider convergence for the iterate average (Duchi, 2018). Further, theoretical extensions have shown that the rate of convergence can be improved by a factor of log T (where T is the iteration number) by suffix averaging (Rakhlin et al., 2011), which considers a fraction of the last iterates, polynomial decay averaging (Shamir and Zhang, 2013) which decays the influence of the previous iterates, or weighted averaging (Lacoste- Julien et al., 2012) which weights the iterate by its iteration number. That the final iterate of SGD is sub-optimal in terms of its convergence rate, by this logarithmic factor, has been proved by Harvey et al. (2019). However, under an alternative decay schedule it can be shown to be equal to that of averaged schemes (Jain et al., 2019).

For networks with batch normalisation (Ioffe and Szegedy, 2015), a naïve application of IA (in which we simply average the batch normalisation statistics) is known to lead to poor results (Defazio and Bottou, 2019). However, by computing the batch normalisation statistics for the iterate average using a forward pass of the data at the IA point, Izmailov et al. (2018) show that the performance of small-scale image experiments such as CIFAR-10/100 and pretrained ImageNet can be significantly improved. Even for small experiments this computation is expensive, so they further approximate IA by taking the average at the end of each epoch instead of each iteration, referred to as stochastic weight averaging (SWA). We show experimentally in Section 10 that the two approaches produce almost identical results, with SWA slightly outperforming IA. Since SWA can be seen as IA with a lower averaging frequency, we retain the terminology IA - however, in our theoretical analysis we also include analysis for reduced frequency iterate averaging.

Notation:

• With some variable , and scalar-valued functions f, g, f(n) = o(g(n)) is shorthand for . Similarly, f(n) = O(g(n)) is shorthand for for some constant c > 0. In particular O(1) can be read as shorthand for any fixed non-zero constant, and o(1) for any term which decays to 0. For example f(n) = 3n + 2 + 1/n can be abbreviated as f(n) = O(n), or f(n) = 3n + O(1), or f(n) = 3n + 2 + o(1), depending on the the level of precision required. We will also employ the asymptotic equivalence notation1

• For matrices denotes the Frobenius norm and denotes the operator norm.

• For random vectors

3.1 A High-Dimensional Geometry Perspective

We examine the variance reducing effect of IA in the context of a quadratic approximation to the true loss combined with additive perturbation models for the batch training loss.

The theory we present is high-dimensional (i.e. large number of parameters, P) and considers the small batch size (small B) regime, which we term the “deep learning limit”.

Intuitively, any given example from the training set , will contain general features, which hold over the data generating distribution and instance specific features (which are relevant only to the training sample in question). For example, for a training image of a dog, we may have that:

Under a quadratic approximation to the true loss2 , where is the Hessian of the true loss with respect to the weights and we sample a mini-batch gradient of size B at point . The observed gradient is perturbed by from the true loss gradient (due to instance specific features). Under this model the component of the ’th iterate along the j’th eigenvector of the true loss when running SGD with learning rate can be written:

in which are the eigenvalues of H. The simplest tractable model for the gradient noise assume samples from i.i.d. an isotropic, multivariate Normal. In particular, this assumption removes any dependence on and precludes the existence of any distinguished directions in the gradient noise. Using this assumption, we obtain Theorem 2 below, which relies on an intermediate result, found in Vershynin (2018).

Lemma 1 (Vershynin (2018) Theorem 6.3.2) Let matrix, and let be a random vector with independent mean-zero unit-variance sub-Gaussian coordinates. Then

where is a constant.

Theorem 2 Assume the aforementioned quadratic loss and i.i.d. Gaussian gradient noise model. Assume further that for all i and for all i. Then there exists a constant c > 0 such that for all

where is the batch size.

Proof Let be a random sub-Gaussian vector with independent components. Let

Lemma 1 then applies, to give

We have for some constant C > 0 (Vershynin (2018), exercise 2.5.8), and . Hence we obtain

for some new constant c > 0. The proof is then completed if we compute the means and variances of . To that end, with

Summation then gives

With all the being i.i.d. , we need simply to compute the sums

and similarly

Now assuming , and taking

and similarly

Thus it follows that

and

where in both cases we have used . The proof is now completed by applying (6) and noting that are finite so long as i (as we have already assumed).

The final iterate attains exponential convergence in the mean of , but does not control the variance term. Whereas for , although the convergence in the mean is worse (linear), the variance vanishes asymptotically – this motivates tail averaging, to get the best of both worlds. Another key implication of Theorem 2 lies in its dependence on P. With P being a rough gauge of the model complexity, the result implies that in more complex, over-parameterised models, we expect the benefit of IA over the final iterate to be larger due to the corresponding variance reduction. We show this explicitly in our experiments in Figure 4c. Note the limited extra improvement possible by simply increasing the batch size, compared to IA asymptotically.

3.2 A dependent model for the perturbation

We proceed now to propose a relaxation of the gradient perturbation independence assumption. (3) can be written equivalently as

where is a scalar field with . Note that we have neglected an irrelevant arbitrary constant in Equation (16) and also that we have rather than , but this amounts to scaling the per-sample noise variance by the inverse batch size . We model as a Gaussian process GP(m, k), where k is some kernel function is some mean function3 As an example, taking and restricting w to a hypersphere results in taking the exact form of a spherical p-spin glass, studied previously for DNNs (Choromanska et al., 2015; Gardner and Derrida, 1988; Mezard et al., 1987; Ros et al., 2019; Mannelli et al., 2019; Baskerville et al., 2021a,b). We are not proposing to model the loss surface (batch or true) as a spin glass (or more generally, a Gaussian process), rather we are modelling the perturbation between the loss surfaces in this way. We emphasise that this model is a strict generalisation of the i.i.d. assumption above, and presents a rich, but tractable, model of isotropic Gaussian gradient perturbations in which the noise for different iterates is neither independent nor identically distributed.

Following from our Gaussian process definition, the covariance of gradient perturbations can be computed using a well-known result (see Adler and Taylor (2009) equation 5.5.4):

Further assuming a stationary kernel

Thus we have a non-trivial covariance between gradient perturbation at different points in weight-space. This covariance structure can be used to prove the upcoming variance reduction result. Its proof relies on some technical Lemmas (proved in the appendix, Section A.1) which we now state.

Lemma 3 Let be a sequence of multivariate Gaussian random variables in

for all be any deterministic element of Define the events

Consider (note that need not diverge with P, but they can). Then

Lemma 4 Assume the covariance structure (18). Take any

where we define

Theorem 5 Let be defined as in Theorem 2 and let the gradient perturbation be given by the covariance structure in (17). Assume that the kernel function k is such that and decay as , and define . Assume further that . Let are multivariate Gaussian random variables and, with probability which approaches unity as the iterates are all mutually at least

Proof We will prove the result in the case for the sake of clarity. The same reasoning can be repeated in the more general case; where one gets below, one need only replace it with , exploiting linearity of the trace. We will also vacuously replace on notation. For weight iterates , we have the recurrence

which leads to

and then

Now define

Next we will apply Lemma 4 and utilise Lemma 3 to bound the variance of and . We first gather the following facts (which were also computed and used in the proof of Theorem 1:

The sum of squares for the is simple to obtain similarly

We now use the assumption that (required for the convergence of gradient descent) which gives, as

Summing explicitly is possible but unhelpfully complicated. Instead, some elementary bounds give

and

so in particular . Now define the events as in Lemma 3 using in place of . Further, choose large enough so that and are decreasing for . Define . Lemma 4 gives

where we note that we have only upper-bounded the second term in (30), so using (27) and (28) and taking large enough we obtain

Turning now to we similarly obtain

and, as before, taking large enough we can obtain

Finally recalling (22) and (23) and writing for large n, we obtain the result.

Note that Theorem 5 is a generalisation of Theorem 2 to the context of our dependent perturbation model. Let us make some clarifying remarks about the theorem and its proof:

1. The bound (21) in the statement of the theorem relies on all iterates being separated by a distance at least . Moreover, the bound is only useful if is large enough to ensure the terms are small.

2. Just as in the independent case of Theorem 2, the first term in the bound in (21) decays only in the case that the number of iterates

3. The remaining conditions on are required for the high-dimensional probability argument which we use to ensure that all iterates are separated by at least

4. is a perfectly reasonable condition in the context of deep learning. E.g. for a ResNet-50 with , violation of this condition would require . A typical ResNet schedule on ImageNet has total steps.

Consequently, our result points to the importance of good separation between weight iterates in IA to retain the independence benefit and variance reduction in a non-independent noise setting, hence one would expect large learning rates to play a crucial role in successful IA. At the same time, our result is particularly adapted to the deep learning limit of very many model parameters (), since this is the only regime in which we can argue probabilistically for good separation of weight iterates (otherwise one may simply have to assume such separation). Furthermore, the importance of indicates that perhaps averaging less frequently than every iteration could be beneficial to generalisation. The following corollary makes this intuition precise.

Corollary 6 Let now be a strided iterate average with stride

Then, under the same conditions as Theorem 5

where the constant O(1) coefficient of the second term in (36) is independent of

Proof Very similar to that of Theorem 5. See appendix Section A.2.

Intuitively, the first term in the covariance in (21) is an “independence term”, i.e. it is common between Theorems 2 and 5 and represents the simple variance reducing effect of averaging. The second variance term in (21) comes from dependence between the iterate gradient perturbations. We see from the corollary that an independent model for gradient perturbation would predict an unambiguous inflationary effect of strided IA on variance (the first term in (36)). However introducing dependence in the manner that we have predicts a more nuanced picture, where increased distance between weight iterates can counteract the simple “independent term” inflationary effect of striding, leaving open the possibility for striding to improve on standard IA for the purposes of generalisation. We investigate and experimentally confirm this hypothesis in Section 9.

3.3 Validation of Theory:

To better understand the effect of the large learning learning rate on generalisation, we train a VGG-16 network (Simonyan and Zisserman, 2014) with no data augmentation/batch normalisation (to isolate the overfitting effect from reducing the learning rate) with a learning rate of . Replacing the learning rate drop (performed at epoch 60 by a factor of 10 with weight decay ) with IA at the same point, we find that the test error is reduced by a greater margin (), shown in Figure 1a. We note that IA improves over the SGD learning rate equivalent for all values of weight decay, with results for 10 seeds shown in Fig 1b and hence this argument is independent of

Figure 1: (a) STEP (learning rate decay) and IA Train/Val error. Both approaches reduce Val error, but IA by a greater margin. (b) Effect of weight decay on held out test error for IA/sharp learning rate decay solutions. Greater weight decay increases the margin of IA improvement. The lower subplot shows the average symmetric KL-divergence between IA solutions.

Figure 2: Test error improvement with differing degrees of regularisation for (a) network ensembling and (b) IA.

explicit regularisation as indicated by Theorem 5. For our Deep Neural Network experiments, we find that the best IA optimiser improvement over its base optimiser is proportional to the number of parameters P as shown in Fig 4c and predicted by Theorem 2. This theorem and experimental validation thereof translates into the following advice for practitioners: In the deep learning limit (large P and small relative B) one should keep the learning rate high and use iterate averaging instead of sharply dropping the learning rate!

3.4 A Closer Look at IA and the Importance of Regularisation

Izmailov et al. (2018) argue, under a linearisation assumption, that IA can be seen as approximate model ensembling. Since averaging only improves test performance for sufficiently uncorrelated models (through a reduction in variance of the ensemble), we must ensure sufficiently diverse models at each epoch through our training procedure. We note from Fig 2b, that unlike model ensembling (shown in Fig 2a), the IA improvement is strongly dependent on the use of weight decay. We show the difference between IA and sharply decaying learning rate schedules (which mirror conventional setups) over 10 seeds as a function of weight decay coefficient in Fig 1b. The margin of improvement from IA, over sharply decaying schedules, steadily increases with regularisation - with (where denotes the amount of weight decay or the coefficient of regularisation in the loss) delivering a greater final validation accuracy at the final IA point, despite starting from a lower accuracy compared to . To explicitly show that such weight decay regularisation encourages greater diversity in the iterates we calculate the symmetrised KL-divergence, , over the entire test set between the softmax outputs of the IA iterates. We take an average for each weight decay value (normalising the 0 weight decay value to 1), as shown in the lower subfigure of Fig 1b. As expected, greater weight decay gives greater solution diversity.

Distance in weight space or relative distance? Theorem 5 relies on sufficiently large distances in weight-space between iterates to achieve variance reduction with IA. It is therefore natural to ask if weight decay encourages greater separation between iterates and if this separation in turn explains the efficacy of weight decay. For learning rate and weight decay , we move a distance in weight space. Intuitively, since random vectors in high dimensions are nearly orthogonal with high probability (Vershynin, 2018), we expect the distance in weight space to move a distance , which is larger for . Conversely, we expect to be smaller for smaller and the gradients also to be smaller (Granziol, 2020). For the experiment (with equal learning rates) shown in Fig 1b, the average distances between the IA epochs for weight decay values , respectively. We note, however, that the relative distance, when normalised by the average weight norm, increases successively as 0.11, 0.13, 0.22. This begs the question whether Theorem 5 can be extended to include a notion of relative distance.

Theorem 7 Let be defined as in Theorem 5 and let the gradient noise be given by the basic covariance structure in (17). Let the kernel function be of the form

and assume that the kernel function k is such that and decay as , and define . Assume further that . Then the result of Theorem 5 holds.

Proof A minor modification of the proof of Theorem 5 and so is relegated to the appendix Section A.2.

In Theorem 5 it is the absolute distance between iterates that determines the strength of dependence. In Theorem 7, the gradient noise covariance is sensitive instead to the the relative distance. This notion can be easily extended to products of weights in different layers, which is known to be invariant with the use of batch normalisation (Ioffe and Szegedy, 2015), used in many modern networks.

4. Adaptive Gradient Methods with Iterate Averaging

Figure 3: (a) Validation error for the PreResNet-110 on CIFAR-100 for various optimisers using IA and (b) the solution diversity given as the symmetrised KL D or total variation distance calculated on the test set and the change in validation error for the final IA point.

Naïvely combining IA with Adam is not effective, as shown in Figure 3a. Despite the same regularisation (0.0001), the error drop is significantly less than for SGD-IA. Following our intuition from Sec 3.4, we consider whether the problem could be that overly correlated solutions form the IA due to ineffective regularisation. As shown in Tab. (b) of Figure 3, both the symmetrised KL divergence D(p||q) and total variation distance (calculated between all epochs using IA at the end of training and then averaged) are lower for Adam-IA than for SGD-IA. For adaptive optimisers, regularisation is not equivalent to weight decay (Zhang et al., 2018; Loshchilov and Hutter, 2019), with weight decay generalising better - known as AdamW. For AdamW with a decoupled weight decay of 0.25, the solution diversity increases beyond that of SGD-IA. This is accompanied by a greater drop in validation error, even outperforming SGD-IA. We term this combination of AdamW + IA Gadam to denote a variant of Adam that generalises. Previous work has shown that limiting the belief in the Adam curvature matrix improves generalisation (Zhuang et al., 2020; Chen and Gu, 2018), hence we also incorporate such a partially adaptive Adam into our framework and term the resulting Algorithm GadamX.

For convolutional neural networks using batch normalisation, the effective learning rate is proportional to (Hoffer et al., 2018). With batch normalisation, the output is invariant to the channel weight norm, hence weight changes are only with respect to the direction of the vector. Since the effective weight decay depends on the (effective) learning rate, we expect this to lead to more regularised solutions and better validation error. To test this hypothesis, we train a VGG-16 network on CIFAR-100 with and without BN (see Figs 4a,4b): for an identical setup, the margin of improvement of Gadam over AdamW is much larger with BN. We note that while Gadam keeps the effective learning rate high, in scheduled AdamW it quickly vanishes once we start learning rate decay. We were unable to compensate with learning rate scheduling, underscoring the importance of appropriate weight decay.

Here we present the full Gadam/GadamX algorithm. Note that for simplicity, in Algorithm 1, we present a Polyak-style averaging of every iteration. In practice we find both practical and theoretical results suggesting that averaging less frequently is almost equally good, if not better. We discuss this in Corollary 6 and conduct experiments on the averaging frequency in Section 9.

Figure 4: (a) Val. error and (b) effective learning rate of VGG-16 on CIFAR-100 with and without BN. (c) Improvement in using IA over the base optimiser for a variety of networks, closely following the linear trend.

5. Extension of theoretical framework to weight decay and adaptive methods

To make a closer connection with the new optimisation algorithms proposed in this work we consider decoupled weight decay (strength ) and gradient preconditioning:

where is some approximation to the true loss Hessian used at iteration t. In the presence of weight decay, we move the true loss minimum away from the origin for the analysis, i.e.

. The update rule is then

We take to be diagonal in the eigenbasis of H, with eigenvalues is the standard tolerance parameter (Kingma and Ba, 2014). One could try to construct the from the Gaussian process loss model, so making them stochastic and covarying with the gradient noise, however we do not believe this is tractable. Instead, let us heuristically assume that, with high probability, is close to , say within a distance , for large enough t and all i. If we take a large enough this is true even for SGD and we expect Adam to better approximate the local curvature matrix than SGD (Granziol et al., 2020a). This results in the following theorem.

Theorem 8 Fix some and assume that for all , for some fixed , with high probability. Use the update rule (38). Assume that the are bounded away from zero and . Further assume , where c is a constant independent of and is defined in the proof. Let everything else be as in Theorem 5. Then there exist constants such that, with high probability,

Proof We begin with the equivalent of (22) for update rule (38):

To make progress, we need the following bounds valid for all

and

where the final inequality in each case can be derived from Taylor’s theorem with Lagrange’s form of the remainder (Shirali and Vasudeva, 2014). Since the are bounded away from zero, we have

established

where the constant , say. From this bound we can in turn obtain

where the second line exploits the assumption and our choice c > 1. Thus

where the second inequality follows, for large n, by summing the geometric series and again using Lagrange’s form of the remainder in Taylor’s theorem. is some constant, derived from c that we need not determine explicitly. A complementary lower bound is obtained similarly (for large n). We have thus shown that

Reusing the bound (44) then yields (39). The remaining results, (40)-(42) follow similarly using the same bounds and ideas as above, but applied to the corresponding steps from the proof of Theorem 2.

Theorem 8 demonstrates the same IA variance reduction as seen previously, but in the more general context of weight decay and adaptive optimisation. As expected, improved estimation of the true Hessian eigenvalues (i.e. smaller ) reduces the error in recovery of . Moreover, increasing the weight decay strength decreases the leading order error bounds in (39) and (41), but only up to a point, as the other error terms are valid and small only if is not too large.

6. Image Classiﬁcation on CIFAR and Down-sampled 32x32 ImageNet Datasets

Here we consider VGG-16, Preactivated ResNet (PRN) and ResNeXt (Simonyan and Zisserman, 2014; He et al., 2016b; Xie et al., 2017) on CIFAR datasets (Krizhevsky et al., 2009). We also considered the down-sampled ImageNet dataset (Russakovsky et al., 2015) on Wide Residual Networks.

Learning Rate Schedule For all experiments without IA, we use the following learning rate schedule for the learning rate at the t-th epoch, similar to Izmailov et al. (2018), which we find to perform better than the conventionally employed step scheduling (refer to the experimental details in appendix Section D.4):

where is the initial learning rate. In the motivating logistic regression experiments on MNIST, we used T = 50. T = 300 is the total number of epochs budgeted for all CIFAR experiments, whereas we used T = 200 and 50 respectively for PRN-110 and WideResNet (WRN) in ImageNet. We set r = 0.01 for all experiments. For experiments with iterate averaging, we use the following learning rate schedule instead:

where refers to the (constant) learning rate after iterate averaging activation, and in this paper we set is the epoch after which iterate averaging is activated, and the methods to determine was described in the main text. This schedule allows us to adjust learning rate smoothly in the epochs leading up to iterate averaging activation through a similar linear decay mechanism in the experiments without iterate averaging, as described above.

The only exception is the WRN experiments on ImageNet 3232, where we only run 50 epochs of training and start averaging from 30th epoch. We found that when using the schedule described above for the IA schedules (SWA/Gadam/GadamX), we start decay the learning rate too early and the final result is not satisfactory. Therefore, for this particular set of experiments, we use the same learning rate schedule for both averaged and normal optimisers. The only difference is that for IA experiments, we decay the learning rate until the 30th epoch and keep it fixed for the rest of the training.

Hyperparameter Tuning In CIFAR experiments, we tune the base optimisers (i.e. SGD, Adam(W), Padam(W)) only, and assuming that the ideal hyperparameters in base optimisers apply to IA, and apply the same hyperparameter setting for the corresponding IA optimisers (i.e. SWA, Gadam, GadamX). For SGD, we use a base learning rate of 0.1 and use a grid searched initial learning rates in the range of {0.001, 0.01, 0.1} and use the same learning rate for Padam, similar to the procedures suggested in Chen and Gu (2018). For Adam(W), we simply use the default initial learning rate of 0.001 except in VGG-16, where we use initial learning rate of 0.0005. After the best learning rate has been identified, we conduct a further search on the weight decay, which we find often leads to a trade-off between the convergence speed and final performance; again we search on the base optimisers only and use the same value for the IA optimisers. For CIFAR experiments, we search in the range of , from the suggestions of Loshchilov and Hutter (2019). For decoupled weight decay, we search the same range for the weight decay scaled by initial learning rate.

On ImageNet (Russakovsky et al., 2015) experiments, we conduct the following process. On WRN we use the settings recommended by Chrabaszcz et al. (2017), who conducted a thorough hyperparameter search: we set the learning rate at 0.03 and weight decay at 0.0001 for SGD/SWA and Padam, based on their searched optimal values. for AdamW/Gadam, we set decoupled weight decay at 0.01 and initial learning rate to be 0.001 (default Adam learning rate). For GadamX, we again use the same learning rate of 0.03, but since the weight decay in GadamX is partially decoupled, we set the decoupled weight decay to 0.0003. On PRN-110, we follow the recommendations of the authors of He et al. (2016b) to set the initial learning rate for SGD, Padam and GadamX to be 0.1. For AdamW and Gadam, we again use the default learning rate of 0.001. Following the observation by Loshchilov and Hutter (2019) that smaller weight decay should be used for longer training (in PRN-110 we train for 200 epochs), we set weight decay at and decoupled weight decay at 0.0003 (GadamX)/0.001 (others) respectively, where applicable.

Overall, we do not tune adaptive methods (Adam and Gadam) as much (most noticeably, we usually fix their learning rate to 0.001), and therefore in particular the AdamW results we obtain may or may not be at their optimal performance. Nonetheless, the rationale is that by design, one of the key advantage claimed is that adaptive optimisers should be less sensitive to hyperparameter choice, and in this paper, the key message is that Gadam performs well, even though its base optimiser’s parameters (AdamW) are rather crudely tuned.

In all experiments, the momentum parameter () for SGD and , for Adam and its variants, are left at their respective default values. For all experiments, unless otherwise stated, we average once per epoch. We also apply standard data augmentation (e.g. flip, random crops) and use a batch size of 128 for all experiments conducted.

6.1 Results

We show the results for the ResNext on CIFAR-100 in Table 1, ImageNet-32 in Table 2 and further CIFAR-100/10 results in Table 4. We also show the training curves for CIFAR-100 and ImageNet-32 in Figure 5. As AdamW always outperforms Adam in our experiments, the curves for the latter are omitted in the main text; we detail these results in the supplementary. The results show that optimisers with IA (SWA, Gadam and GadamX) invariably improve over their counterparts without, and GadamX always delivers the strongest performance. Without compromising convergence speed, Gadam outperforms tuned SGD and Padam - suggesting that solutions found by adaptive optimisers do not necessarily generalise more poorly, as suggested in the literature (Wilson et al., 2017). Indeed, any generalisation gap seems to be closed by the using IA and an appropriately implemented weight decay. We emphasise that results here are achieved without tuning the point at which we start averaging ; if we allow crude tuning of , on CIFAR-100 GadamX achieves 77.22% (VGG-16) and 79.41% 4 (PRN-110) test accuracy respectively, which to our knowledge are the best reported performances on these architectures. We show results on ImageNet 3232 (Chrabaszcz et al., 2017) in Figure 5c. While Gadam does not outperform our strong SGD baseline, it nevertheless improves upon AdamW greatly and posts a performance stronger than the SGD baseline in literature with identical (Chrabaszcz et al., 2017) and improved (McDonnell, 2018) setups. Finally, GadamX performs strongly, outperforming more than 3% compared to the baseline (Chrabaszcz et al., 2017) in Top-5 accuracy. We run each experiment three times with mean and standard deviation reported. In this section, all non-IA baselines are tuned rigorously with proper schedules for fair comparisons5, and we also include the results reported in the previous works in Table 8 of appendix Section C.

Table 1: ResNeXt on CIFAR-100.

Table 2: Test Accuracy on ImageNet 32

Figure 5: (a-b) Top 1 Test error on CIFAR-100, (c) Top-5 Test Error on ImageNet-32 and (d) IA test improvement over its base optimiser against number of parameters.

7. ImageNet Experiments

We compare against step learning rate decay (factor of 10 every 30 epochs) and linear schedule for SGD and AdamW for 90 epochs (He et al., 2016a), with respective initial learning rates and weight decays on ImageNet (Russakovsky et al., 2015). We combine LookAhead (Zhang et al., 2019b) with gradient centralisation (Yong et al., 2020) as a high performance adaptive baseline Ranger (Wright, 2019), also using step decay. We search for the best performing initial learning rates for SGD, AdamW, SWA, GadamX and Ranger by factors of 3 i.e

Table 3: ImageNet Results

Table 4: CIFAR-10/100 Results

0.001, 0.003 in either direction (increase/decrease) until we find a local maximum in performance, otherwise leaving settings as in Section 6. We show the results in Table 3.

Experimenting with Partial Adaptivity for the Best Computer Vision Results: Following Granziol et al. (2020a); Choi et al. (2019), we experiment with setting the numerical stability coefficient to for GadamX and attempt an SGD like procedure for Gadam where we train with . Note that such a large numerical stability coefficient has a similar effect to reducing the effect of the preconditioning matrix as GadamX hence also allowing for a larger global learning rate. We find that whilst the generalisation benefit of using Gadam alone is significant (without decreasing partial adaptivity) it is not competitive with SGD on this dataset, wheras leaving the numerical stability coefficient unchanged for GadamX only results in a very minor decrease in performance. We detail both of these effects in experimental finding 3.

Due to poor “out of the box” performance of SWA, we repeat the logarithmic grid search procedure on the IA learning rate for SWA. We report results in Table 3, where we see Gadam(X) strongly out-performing all baselines. We do not include the ResNet-18 as AdamW outperforms SGD with 69.92% top-1 accuracy over 69.72%, hence not a useful test-case for analysing the adaptive generalisation gap, prevalent in deeper models. Gadam nonetheless improves on this attaining 70.11%. Whilst we find that step/linear scheduling is less effective for AdamW/SGD, attaining 73.68/75.52% respectively on the ResNet-50. Since these are small difference we don’t consider scheduling to be a major factor in our outstanding results. We detail our major experimental findings from these experiments which could be of use to the community.

1. Adaptive IA makes use of huge initial learning rates: unlike SGD and SWA, which have a strong performance degradation when large initial learning rates are used, shown in Fig 7a, we find that large initial learning rates improve the generalisation performance of Gadam/GadamX, with the largest initial learning rates giving the best results.

2. Convergence speed comes at a cost: Combining a large numerical stability coefficient and large learning rates allows Gadam to give significantly superior performance to SGD. However, the price paid is in convergence speed, shown in Fig 7b. For these settings the convergence speed is often

Figure 6: Final ImageNet epochs, showing the improvement of both SGD with Iterate Averaging (SGD IA) and our proposed GadamX optimiser over the SGD step-schedule in Top-1 validation error.

Figure 7: (a) Unlike IA adaptive methods, SGD does not benefit from larger initial learning rates. (b) To attain the greatest generalisation with adaptive methods, the fast convergence is sacrificed. Gadam , has a correspondingly large learning rate 0.5

as slow or slower than GadamX. Using the same settings as in the small scale experiments (shown as Gadam in the graph) we achieve a top-1 accuracy of 75.52 for ResNet-50. Whilst this significantly improves upon the base optimiser AdamW, these results are not as strong as those of SGD. Whilst increasing the base learning rate to 0.003 increases the ResNet-50 Gadam generalisation performance to 76.53, much of the convergence speed is already lost. We note that the effective weight decay is given by so we expect higher regularisation from higher learning rates. We do not find that increasing the weight decay whilst keeping the same base learning rates produces as strong results in our experiments and hence this learning rate and weight decay interplay could form the basis for interesting future work.

3. Partially adaptive optimisation generalises best: We find that for all experiments GadamX delivers the strongest performance. We do not find a strong dependence on the choice of the IA starting point (we try epoch 61, 71, 81). We find that altering the numerical stability constant gives a small boost in Top-1 error, from 77.19 to 77.31 for the ResNet-50, but that results remain strong for the traditional setting.

Comparison to previous results: We specifically report the final (as opposed to best) validation error for all our runs. We find the best SGD ResNet-50/101 results to be 75.75/77.62%, which are slightly worse/better than the official repository results. All of these results are still significantly lower than results achieved by Gadam/GadamX. We note that iterate averaged methods seem to continually decrease error in the final epochs of training, unlike SGD, which can sometimes overfit slightly in the final epochs of training.

8. Beyond Computer Vision: PTB LSTM

We run word-level language modelling using a 3-layer Long-short Term Memory (LSTM) model (Gers et al., 1999) on PTB dataset (Marcus et al., 1993) and the results are shown in Table 5 and Figure 8. Remarkably, Gadam achieves a test perplexity of 58.77 (58.61 if we tune . See Table 6 in Section 10), better than the baseline NT-ASGD in Merity et al. (2017) that runs an additional 300 epochs on an identical network. Note that since, by default, the ASGD uses a constant learning rate, we do not schedule the learning rate except Padam which requires scheduling to converge. Also, for consistency, we use a manual trigger to start averaging at the 100th epoch for ASGD (which actually outperforms the NT-ASGD variant). We additionally conduct experiments with scheduling and NT-ASGD (appendix Section D) and Gadam still outperforms. It is worth mentioning that for state of the art results in language modelling Melis et al. (2017); Brown et al. (2020); Shoeybi et al. (2019), Adam is the typical optimiser of choice. Hence these results are both encouraging and significant for wider use in the community.

Figure 8: Validation perplexity of 3-layer LSTM on PTB word-level modelling

Table 5: LSTM Penn Treebank Experimental results.

9. Effect of Frequency of Averaging

While we derive the theoretical bounds for both Polyak-style averaging on every iteration and strided averaging, in practice we use strided averaging to save on computation. We either average once per epoch similar to Izmailov et al. (2018), or select a rather arbitrary value such as averaging once per 100 iterations. The reason is both practical and theoretical: averaging much less leads to significant computational savings, and at the same time as we argued more independent iterates the benefit from averaging is better. In this case, averaging less causes the iterates to be further apart and more independent, and thus fewer number of iterates is required to achieve the similar level of performance if less independent iterates are used. We verify this both on the language and the vision experiments using the identical setup as the main text. With reference to Figure 9(a), not only is the final perplexity very insensitive to averaging frequency (note that the y-axis scale is very small), it is also interesting that averaging less actually leads to a slightly better validation perplexity compared to schemes that, say, average every iteration. We see a similar picture emerges in Figure 9(b), where the despite of following very close trajectories, averaging every iteration gives a slightly worse testing performance compared to once an epoch and is also significantly more expensive (with a NVIDIA GeForce RTX 2080 Ti GPU, each epoch of training takes around 10s if we average once per epoch but averaging every iteration takes around 20s).

Figure 9: Effect of different averaging frequencies on validation perplexity of Gadam on representative (a) Language and (b) Image classification tasks. Freq=n suggests averaging once per n iterations. freq=350 in (b) is equivalently averaging once per epoch.

10. Effect of Average Starting Point and GadamAuto

In Gadam(X), we need to determine when to start averaging (in Algorithm 1), and here we investigate the sensitivity of Gadam(X) to this hyperparameter. We use a range of for a number of different tasks and architectures (Figure 10 and Table 6), including extreme choices such as (start averaging at the beginning). We observe that for any reasonable , Gadam(X) always outperform their base optimisers with standard learning rate decay, and tuning yields even more improvements over the heuristics employed in the main text, even if selecting any sensible already can lead to a promising performance over standard learning rate decay.

Figure 10: Effect of different on the performance of various tasks and architectures.

Here we also conduct preliminary experiments on GadamAuto, a variant of Gadam that uses a constant learning rate schedule and automatically determines the starting point of averaging and training termination - this is possible given the insensitivity of the end-results towards above, and is desirable as the optimiser both has fewer hyperparameters to tune and trains faster. We use VGG-16 network on CIFAR-100. For all experiments, we simply use a flat learning rate schedule. The results are shown in Table 7. We use a patience of 10 for both the determination of the averaging activation and early termination. We also include SWA experiments with SGD iterates.

Table 6: Best results obtained from tuning

Table 7: GadamAuto Test Performance at Termination.

It can be seen that, while automatic determination for averaging trigger and early termination work well for Gadam (GadamAuto posts a performance only marginally worse than the manually tuned Gadam), they lead to a rather significant deterioration in test in SWA (SWA-Auto performs worse than tuned SWA, and even worse than tuned SGD). This highlights the benefit of using adaptive optimiser as the base optimiser in IA, as the poor performance in SWA-Auto is likely attributed to the fact that SGD is much more hyperparameter-sensitive (to initial learning rate and learning rate schedule, for example. SWA-Auto uses a constant schedule, which is sub-optimal for SGD), and that validation performance often fluctuates more during training for SGD: SWA-Auto determines averaging point based on the number of epochs of validation accuracy stagnation. For a noisy training curve, averaging might be triggered too early; while this can be ameliorated by setting a higher patience, doing so will eventually defeat the purpose of using an automatic trigger. Both issues highlighted here are less serious in adaptive optimisation, which likely leads to the better performance of GadamAuto.

Nonetheless, the fact that scheduled Gadam still outperforms GadamAuto suggests that there is still ample room of improvement to develop a truly automatic optimiser that performs as strong as or even stronger than tuned ones. One desirable alternative we propose for the future work is the integration of Rectified Adam Liu et al. (2019), which is shown to be much more insensitive to choice of hyperparameter even compared to Adam.

11. Conclusion

We propose a Gaussian Process perturbation between the batch and true risk surfaces and derive the phenomenon of improved generalisation for large learning rates and larger weight decay when combined with iterate averaging observed in practice. We extend this formalism to include adaptive methods and show that we expect further improvement when using adaptive algorithms. Based on this theory we develop two adaptive algorithms, Gadam and GadamX, variants of Adam with iterate averaging. We extensively validate Gadam and GadamX on computer vision tasks and a natural language experiment, showing strong performance against baseline and state of the art. Another interesting consequence of our work is that in all our experiments the last iterate is the best. Unlike SGD, where the epoch of best test/validation error is typically not the last and techniques such as early stopping are often employed, we find consistent near-monotonic improvements in test/validation error using our algorithms. We also find from preliminary analysis that our algorithms require less hyper-parameter tuning than SGD and variants thereof. This may be of interest for practitioners that want to get good results fast, as opposed to state of the art slowly.

References

Milton Abramowitz, Irene A Stegun, and Robert H Romer. Handbook of mathematical functions with formulas, graphs, and mathematical tables, 1988.

Robert J Adler and Jonathan E Taylor. Random fields and geometry. Springer Science & Business Media, 2009.

George E. Andrews, Richard Askey, and Ranjan Roy. Special Functions. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1999. doi: 10.1017/CBO9781107325937.

Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regular- izations in training deep networks? In Advances in Neural Information Processing Systems, pages 4261–4271, 2018.

Nicholas P Baskerville, Jonathan P Keating, Francesco Mezzadri, and Joseph Najnudel. The loss surfaces of neural networks with general activation functions. Journal of Statistical Mechanics: Theory and Experiment, 2021(6):064001, 2021a.

Nicholas P Baskerville, Jonathan P Keating, Francesco Mezzadri, and Joseph Najnudel. A spin-glass model for the loss surfaces of generative adversarial networks. arXiv preprint arXiv:2101.02524, 2021b.

Irwan Bello, William Fedus, Xianzhi Du, Ekin D Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579, 2021.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Jinghui Chen and Quanquan Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763, 2018.

Soumith Chintala et al. Pytorch imagenet baseline, 2017. URL https://github.com/ pytorch/examples/blob/master/imagenet/main.py. "2016 (accessed September, 2020)".

Dami Choi, Christopher J Shallue, Zachary Nado, Jaehoon Lee, Chris J Maddison, and George E Dahl. On empirical comparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446, 2019.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819, 2017.

Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.

Aaron Defazio and Léon Bottou. On the ineffectiveness of variance reduced optimization for deep learning. In Advances in Neural Information Processing Systems, pages 1753–1763, 2019.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages "1019–1028". "JMLR. org", 2017.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121–2159, 2011.

John C Duchi. Introductory lectures on stochastic optimization. The Mathematics of Data, 25:99, 2018.

Elizabeth Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and general, 21(1):271, 1988.

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, pages 8789–8798, 2018.

Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.

Diego Granziol. Flatness is a false friend. arXiv preprint arXiv:2006.09091, 2020.

Diego Granziol, Xingchen Wan, and Timur Garipov. MLRG deep curvature. arXiv preprint arXiv:1912.09656, 2019.

Diego Granziol, Samuel Albanie, Xingen Wan, and Stephen Roberts. Explaining the adaptive generalisation gap. arXiv preprint arXiv, 2020a.

Diego Granziol, Timur Garipov, Dmitry Vetrov, Stefan Zohren, Stephen Roberts, and Andrew Gordon Wilson. Towards understanding the true loss surface of deep neural networks using random matrix theory and iterative spectral methods, 2020b. URL https://openreview.net/forum? id=H1gza2NtwH.

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.

Haowei He, Gao Huang, and Yang Yuan. Asymmetric valleys: Beyond sharp and flat local minima. arXiv preprint arXiv:1902.00744, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016b.

Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160–2170, 2018.

Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 304–320, 2018.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Av- eraging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752–1755. PMLR, 2019.

Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.

Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, and Krzysztof Geras. The break-even point on the optimization trajectories of deep neural networks. In International Conference on Learning Representations, 2020. URL https:// openreview.net/forum?id=r1g87C4KwB.

Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in neural information processing systems, pages 950–957, 1992.

Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.

Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.

Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, pages 11674–11685, 2019.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 2019.

Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143, 2019.

Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborova. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. arXiv preprint arXiv:1902.00139, 2019.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The penn treebank. 1993.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Mark D McDonnell. Training wide residual networks for deployment using a single bit for each weight. arXiv preprint arXiv:1802.08530, 2018.

Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

Marc Mezard, Giorgio Parisi, and Miguel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987.

Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.

Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions. Physical Review X, 9(1):011003, 2019.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Conver- gence results and optimal averaging schemes. In International conference on machine learning, pages 71–79. PMLR, 2013.

Satish Shirali and Harkrishan L Vasudeva. An Introduction to Mathematical Analysis. Alpha Science International, Limited, 2014.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan- zaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.

Phuong Thi Tran et al. On the convergence proof of AMSGrad and a new version. IEEE Access, 7: 61706–61716, 2019.

Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.

Less Wright. Ranger - a synergistic optimizer. https://github.com/lessw2020/ Ranger-Deep-Learning-Optimizer, 2019.

Lei Wu, Chao Ma, and E Weinan. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems, pages 8279–8288, 2018.

Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Self-training with noisy student improves ImageNet classification, 2019.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.

Hongwei Yong, Jianqiang Huang, Xiansheng Hua, and Lei Zhang. Gradient centralization: A new optimization technique for deep neural networks. arXiv preprint arXiv:2004.01461, 2020.

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899, 2019.

Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018.

Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, and Roger B Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In Advances in Neural Information Processing Systems, pages 8194–8205, 2019a.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pages 9593–9604, 2019b.

Juntang Zhuang, Tommy Tang, Sekhar Tatikonda, Nicha Dvornek, Yifan Ding, Xenophon Pa- pademetris, and James S Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. arXiv preprint arXiv:2010.07468, 2020.

Appendix A. Proofs<