Deep Neural Network (DNN) models achieve state of the art performance in a plethora of problems, such as speech recognition, visual object image recognition, object detection, drug discovery and genomics (LeCun et al., 2015). Of key interest is the ability of DNNs to “generalise” to unseen data, even when the parameter number greatly exceeds the dataset size (Zhang et al., 2016). DNNs are typically trained using stochastic gradient descent (SGD), in which model parameters at each optimisation step, , are updated using the gradient of the minibatch loss at the previous step,
where denotes the learning rate at iteration k. Whilst careful monitoring of the validation metrics, along with weight decay (Krogh and Hertz, 1992), layer-wise normalisation (Ioffe and Szegedy, 2015) and data-augmentation (Shorten and Khoshgoftaar, 2019; Zhang et al., 2017) help protect against over-fitting to the training data, the initial value of
and its schedule throughout training has a large impact on generalisation (Jastrzebski et al., 2017; Li et al., 2019), making it a key hyperparameter to set correctly. Theoretical results for optimal asymptotic training set convergence prescribe a learning rate proportional to the inverse square root of the number of iterations (Nesterov, 2013) or a decay at this rate (Duchi, 2018). However, such schedules often result in poor test set performance for DNNs. Curiously, Merity et al. (2017) and Izmailov et al. (2018) demonstrate that combining tail iterate averaging (i.e. taking an average of the last iterates in training) with large learning rate SGD increases DNN generalisation at the expense of training accuracy. However, quite why and how this works is still something of a mystery, limiting its widespread adoption.
One proposal in the literature to limit sensitivity to the learning rate and its schedule has been the development of adaptive gradient optimisers, which invoke a per-parameter learning rate based on the history of gradients. Popular examples include Adam (Kingma and Ba, 2014), AdaDelta (Zeiler, 2012) and RMSprop (Tieleman and Hinton, 2012). Ignoring momentum and explicit regularisation, the iteration of a general adaptive optimiser is given by:
where the preconditioning matrix B typically approximates curvature information. Crucially, however, the generalisation of solutions found using adaptive methods, as measured in terms of test and validation error, significantly underperforms SGD (Wilson et al., 2017). Due to this, state-of-the-art convolutional neural networks, especially for image classification datasets such as CIFAR (Yun et al., 2019) and ImageNet (Xie et al., 2019; Cubuk et al., 2019) are still trained using SGD with momentum (Nesterov, 2013). Furthermore, despite Iterate Averaging (IA) being mentioned as a potential amendment in the original Adam paper (Kingma and Ba, 2014) and being required in the convergence proof (Reddi et al., 2019), it is not widely used for computer vision or other complex problems.
The key contribution of this paper are: • We investigate the impact of IA on generalisation by considering high-dimensional SGD on the quadratic model of the true risk perturbed by i.i.d. Gaussian noise, showing that the iterate average
attains the global minimum, whereas the final point, despite multiple learning rate drops, or increases in batch size during training, does not.
• We extend the framework to a Gaussian process perturbation model between the true and batch gradients. We find that as long as certain technical conditions (well met in practice) are satisfied, the simplified result holds. Crucially, distance in weight space or relative weight space (depending on kernel choice) are pivotal to the effect, justifying in practice the need for large learning rates in conjunction with iterate averaging.
• We show that adaptive gradient methods have identical properties under the iterate average, but we expect them to converge faster than their non-adaptive counterparts.
• Motivated by these results, we consider why adaptive methods are not typically used in conjunction with iterate averaging? We find that ineffective regularisation, which limits the effective distance and prediction diversity between the iterates, is the main culprit and propose a simple, yet effective solution. We propose two Adam-based algorithms: Gadam and GadamX. Both outperform baselines tested for all networks and datasets we consider. GadamX achieves a Top-1 error of 22.69% on ImageNet using ResNet-50, outperforming a well-tuned SGD baseline of 23.85% (Chintala et al., 2017). To put this into perspective, the gain attributed to widely-adopted cosine schedules increases accuracy by 0.3% (Bello et al., 2021).
• Showing that adaptive methods can outperform SGD and SGD with IA provides a practical framework that can be used, even for large-scale problems.
Related Work & Motivation: To the best of our knowledge there has been no explicit theoretical work analysing the generalisation benefit of iterate averaging. Whilst Izmailov et al. (2018) propose that iterate averaging leads to “flatter minima which generalise better”, flatness metrics are known to have limitations as a proxy for generalisation (Dinh et al., 2017), in addition to which we show in the appendix Section E.1 that adaptive methods can find very sharp minima with good generalisation properties. Martens (2014) show that the IA convergence rate for both SGD and second-order methods are identical, but argue that second-order methods have an optimal pre-asymptotic convergence rate on a quadratic loss surface. Here, pre-asymptotic means before taking the number of iterations and quadratic means that the Hessian is constant at all points in weight-space. The analysis does not extend to generalisation and no connection is made to adaptive gradient methods, nor to the importance of the high parameter-space dimensionality of the problem, two major contributions of our work. Amendments to improve the generalisation of adaptive methods include switching between Adam and SGD (Keskar and Socher, 2017) and decoupled weight decay (Loshchilov and Hutter, 2019), limiting the extent of adaptivity (Chen and Gu, 2018; Zhuang et al., 2020). We incorporate these insights into our algorithms but significantly outperform them experimentally. The closest algorithmic contribution to our work is Lookahead (Zhang et al., 2019b), which combines adaptive methods with an exponentially moving average scheme. We analyse this algorithm both theoretically (see appendix Section B) and experimentally.
The iterate average (Polyak and Juditsky, 1992) is the arithmetic mean of the model parameters over the optimisation trajectory . It is a classical variance reducing technique in optimisation and offers optimal asymptotic convergence rates and greater robustness to the choice of learning rate (Kushner and Yin, 2003). Indeed, popular regret bounds that form the basis of gradientbased convergence proofs (Duchi et al., 2011; Reddi et al., 2019) often consider convergence for the iterate average (Duchi, 2018). Further, theoretical extensions have shown that the rate of convergence can be improved by a factor of log T (where T is the iteration number) by suffix averaging (Rakhlin et al., 2011), which considers a fraction of the last iterates, polynomial decay averaging (Shamir and Zhang, 2013) which decays the influence of the previous iterates, or weighted averaging (Lacoste- Julien et al., 2012) which weights the iterate by its iteration number. That the final iterate of SGD is sub-optimal in terms of its convergence rate, by this logarithmic factor, has been proved by Harvey et al. (2019). However, under an alternative decay schedule it can be shown to be equal to that of averaged schemes (Jain et al., 2019).
For networks with batch normalisation (Ioffe and Szegedy, 2015), a naïve application of IA (in which we simply average the batch normalisation statistics) is known to lead to poor results (Defazio and Bottou, 2019). However, by computing the batch normalisation statistics for the iterate average using a forward pass of the data at the IA point, Izmailov et al. (2018) show that the performance of small-scale image experiments such as CIFAR-10/100 and pretrained ImageNet can be significantly improved. Even for small experiments this computation is expensive, so they further approximate IA by taking the average at the end of each epoch instead of each iteration, referred to as stochastic weight averaging (SWA). We show experimentally in Section 10 that the two approaches produce almost identical results, with SWA slightly outperforming IA. Since SWA can be seen as IA with a lower averaging frequency, we retain the terminology IA - however, in our theoretical analysis we also include analysis for reduced frequency iterate averaging.
Notation:
• With some variable , and scalar-valued functions f, g, f(n) = o(g(n)) is shorthand for
. Similarly, f(n) = O(g(n)) is shorthand for
for some constant c > 0. In particular O(1) can be read as shorthand for any fixed non-zero constant, and o(1) for any term which decays to 0. For example f(n) = 3n + 2 + 1/n can be abbreviated as f(n) = O(n), or f(n) = 3n + O(1), or f(n) = 3n + 2 + o(1), depending on the the level of precision required. We will also employ the asymptotic equivalence notation1
• For matrices denotes the Frobenius norm and
denotes the operator norm.
• For random vectors
3.1 A High-Dimensional Geometry Perspective
We examine the variance reducing effect of IA in the context of a quadratic approximation to the true loss combined with additive perturbation models for the batch training loss.
The theory we present is high-dimensional (i.e. large number of parameters, P) and considers the small batch size (small B) regime, which we term the “deep learning limit”.
Intuitively, any given example from the training set , will contain general features, which hold over the data generating distribution and instance specific features (which are relevant only to the training sample in question). For example, for a training image of a dog, we may have that:
Under a quadratic approximation to the true loss2 , where
is the Hessian of the true loss with respect to the weights and we sample a mini-batch gradient of size B at point
. The observed gradient is perturbed by
from the true loss gradient (due to instance specific features). Under this model the component of the
’th iterate along the j’th eigenvector
of the true loss when running SGD with learning rate
can be written:
in which are the eigenvalues of H. The simplest tractable model for the gradient noise
assume samples from i.i.d. an isotropic, multivariate Normal. In particular, this assumption removes any dependence on
and precludes the existence of any distinguished directions in the gradient noise. Using this assumption, we obtain Theorem 2 below, which relies on an intermediate result, found in Vershynin (2018).
Lemma 1 (Vershynin (2018) Theorem 6.3.2) Let matrix, and let
be a random vector with independent mean-zero unit-variance sub-Gaussian coordinates. Then
where is a constant.
Theorem 2 Assume the aforementioned quadratic loss and i.i.d. Gaussian gradient noise model. Assume further that for all i and
for all i. Then there exists a constant c > 0 such that for all
where is the batch size.
Proof Let be a random sub-Gaussian vector with independent components. Let
Lemma 1 then applies, to give
We have for some constant C > 0 (Vershynin (2018), exercise 2.5.8), and
. Hence we obtain
for some new constant c > 0. The proof is then completed if we compute the means and variances of . To that end, with
Summation then gives
With all the being i.i.d.
, we need simply to compute the sums
and similarly
Now assuming , and taking
and similarly
Thus it follows that
and
where in both cases we have used . The proof is now completed by applying (6) and noting that
are finite so long as
i (as we have already assumed).
The final iterate attains exponential convergence in the mean of , but does not control the variance term. Whereas for
, although the convergence in the mean is worse (linear), the variance vanishes asymptotically – this motivates tail averaging, to get the best of both worlds. Another key implication of Theorem 2 lies in its dependence on P. With P being a rough gauge of the model complexity, the result implies that in more complex, over-parameterised models, we expect the benefit of IA over the final iterate to be larger due to the corresponding variance reduction. We show this explicitly in our experiments in Figure 4c. Note the limited extra improvement possible by simply increasing the batch size, compared to IA asymptotically.
3.2 A dependent model for the perturbation
We proceed now to propose a relaxation of the gradient perturbation independence assumption. (3) can be written equivalently as
where is a scalar field with
. Note that we have neglected an irrelevant arbitrary constant in Equation (16) and also that we have
rather than
, but this amounts to scaling the per-sample noise variance
by the inverse batch size
. We model
as a Gaussian process GP(m, k), where k is some kernel function
is some mean function3
As an example, taking
and restricting w to a hypersphere results in
taking the exact form of a spherical p-spin glass, studied previously for DNNs (Choromanska et al., 2015; Gardner and Derrida, 1988; Mezard et al., 1987; Ros et al., 2019; Mannelli et al., 2019; Baskerville et al., 2021a,b). We are not proposing to model the loss surface (batch or true) as a spin glass (or more generally, a Gaussian process), rather we are modelling the perturbation between the loss surfaces in this way. We emphasise that this model is a strict generalisation of the i.i.d. assumption above, and presents a rich, but tractable, model of isotropic Gaussian gradient perturbations in which the noise for different iterates is neither independent nor identically distributed.
Following from our Gaussian process definition, the covariance of gradient perturbations can be computed using a well-known result (see Adler and Taylor (2009) equation 5.5.4):
Further assuming a stationary kernel
Thus we have a non-trivial covariance between gradient perturbation at different points in weight-space. This covariance structure can be used to prove the upcoming variance reduction result. Its proof relies on some technical Lemmas (proved in the appendix, Section A.1) which we now state.
Lemma 3 Let be a sequence of multivariate Gaussian random variables in
for all be any deterministic element of
Define the events
Consider (note that
need not diverge with P, but they can). Then
Lemma 4 Assume the covariance structure (18). Take any
where we define
Theorem 5 Let be defined as in Theorem 2 and let the gradient perturbation be given by the covariance structure in (17). Assume that the kernel function k is such that
and
decay as
, and define
. Assume further that
. Let
are multivariate Gaussian random variables and, with probability which approaches unity as
the iterates
are all mutually at least
Proof We will prove the result in the case for the sake of clarity. The same reasoning can be repeated in the more general case; where one gets
below, one need only replace it with
, exploiting linearity of the trace. We will also vacuously replace
on notation. For weight iterates
, we have the recurrence
which leads to
and then
Now define
Next we will apply Lemma 4 and utilise Lemma 3 to bound the variance of and
. We first gather the following facts (which were also computed and used in the proof of Theorem 1:
The sum of squares for the is simple to obtain similarly
We now use the assumption that (required for the convergence of gradient descent) which gives, as
Summing explicitly is possible but unhelpfully complicated. Instead, some elementary bounds give
and
so in particular . Now define the events
as in Lemma 3 using
in place of
. Further, choose
large enough so that
and
are decreasing for
. Define
. Lemma 4 gives
where we note that we have only upper-bounded the second term in (30), so using (27) and (28) and taking large enough we obtain
Turning now to we similarly obtain
and, as before, taking large enough we can obtain
Finally recalling (22) and (23) and writing for large n, we obtain the result.
Note that Theorem 5 is a generalisation of Theorem 2 to the context of our dependent perturbation model. Let us make some clarifying remarks about the theorem and its proof:
1. The bound (21) in the statement of the theorem relies on all iterates being separated by a distance at least . Moreover, the bound is only useful if
is large enough to ensure the
terms are small.
2. Just as in the independent case of Theorem 2, the first term in the bound in (21) decays only in the case that the number of iterates
3. The remaining conditions on are required for the high-dimensional probability argument which we use to ensure that all iterates are separated by at least
4. is a perfectly reasonable condition in the context of deep learning. E.g. for a ResNet-50 with
, violation of this condition would require
. A typical ResNet schedule on ImageNet has
total steps.
Consequently, our result points to the importance of good separation between weight iterates in IA to retain the independence benefit and variance reduction in a non-independent noise setting, hence one would expect large learning rates to play a crucial role in successful IA. At the same time, our result is particularly adapted to the deep learning limit of very many model parameters (), since this is the only regime in which we can argue probabilistically for good separation of weight iterates (otherwise one may simply have to assume such separation). Furthermore, the importance of
indicates that perhaps averaging less frequently than every iteration could be beneficial to generalisation. The following corollary makes this intuition precise.
Corollary 6 Let now be a strided iterate average with stride
Then, under the same conditions as Theorem 5
where the constant O(1) coefficient of the second term in (36) is independent of
Proof Very similar to that of Theorem 5. See appendix Section A.2.
Intuitively, the first term in the covariance in (21) is an “independence term”, i.e. it is common between Theorems 2 and 5 and represents the simple variance reducing effect of averaging. The second variance term in (21) comes from dependence between the iterate gradient perturbations. We see from the corollary that an independent model for gradient perturbation would predict an unambiguous inflationary effect of strided IA on variance (the first term in (36)). However introducing dependence in the manner that we have predicts a more nuanced picture, where increased distance between weight iterates can counteract the simple “independent term” inflationary effect of striding, leaving open the possibility for striding to improve on standard IA for the purposes of generalisation. We investigate and experimentally confirm this hypothesis in Section 9.
3.3 Validation of Theory:
To better understand the effect of the large learning learning rate on generalisation, we train a VGG-16 network (Simonyan and Zisserman, 2014) with no data augmentation/batch normalisation (to isolate the overfitting effect from reducing the learning rate) with a learning rate of . Replacing the learning rate drop (performed at epoch 60 by a factor of 10 with weight decay
) with IA at the same point, we find that the test error is reduced by a greater margin (
), shown in Figure 1a. We note that IA improves over the SGD learning rate equivalent for all values of weight decay, with results for 10 seeds shown in Fig 1b and hence this argument is independent of
Figure 1: (a) STEP (learning rate decay) and IA Train/Val error. Both approaches reduce Val error, but IA by a greater margin. (b) Effect of weight decay on held out test error for IA/sharp learning rate decay solutions. Greater weight decay increases the margin of IA improvement. The lower subplot shows the average symmetric KL-divergence between IA solutions.
Figure 2: Test error improvement with differing degrees of regularisation for (a) network ensembling and (b) IA.
explicit regularisation as indicated by Theorem 5. For our Deep Neural Network experiments, we find that the best IA optimiser improvement over its base optimiser is proportional to the number of parameters P as shown in Fig 4c and predicted by Theorem 2. This theorem and experimental validation thereof translates into the following advice for practitioners: In the deep learning limit (large P and small relative B) one should keep the learning rate high and use iterate averaging instead of sharply dropping the learning rate!
3.4 A Closer Look at IA and the Importance of Regularisation
Izmailov et al. (2018) argue, under a linearisation assumption, that IA can be seen as approximate model ensembling. Since averaging only improves test performance for sufficiently uncorrelated models (through a reduction in variance of the ensemble), we must ensure sufficiently diverse models at each epoch through our training procedure. We note from Fig 2b, that unlike model ensembling (shown in Fig 2a), the IA improvement is strongly dependent on the use of weight decay. We show the difference between IA and sharply decaying learning rate schedules (which mirror conventional setups) over 10 seeds as a function of weight decay coefficient in Fig 1b. The margin of improvement from IA, over sharply decaying schedules, steadily increases with regularisation - with (where
denotes the amount of weight decay or the coefficient of
regularisation in the loss) delivering a greater final validation accuracy at the final IA point, despite starting from a lower accuracy compared to
. To explicitly show that such weight decay regularisation encourages greater diversity in the iterates we calculate the symmetrised KL-divergence,
, over the entire test set between the softmax outputs of the IA iterates. We take an average for each weight decay value (normalising the 0 weight decay value to 1), as shown in the lower subfigure of Fig 1b. As expected, greater weight decay gives greater solution diversity.
Distance in weight space or relative distance? Theorem 5 relies on sufficiently large distances in weight-space between iterates to achieve variance reduction with IA. It is therefore natural to ask if weight decay encourages greater separation between iterates and if this separation in turn explains the efficacy of weight decay. For learning rate and weight decay
, we move a distance
in weight space. Intuitively, since random vectors in high dimensions are nearly orthogonal with high probability (Vershynin, 2018), we expect the distance in weight space to move a distance
, which is larger for
. Conversely, we expect
to be smaller for smaller
and the gradients also to be smaller (Granziol, 2020). For the experiment (with equal learning rates) shown in Fig 1b, the average distances between the IA epochs for weight decay values
, respectively. We note, however, that the relative distance, when normalised by the average weight norm, increases successively as 0.11, 0.13, 0.22. This begs the question whether Theorem 5 can be extended to include a notion of relative distance.
Theorem 7 Let be defined as in Theorem 5 and let the gradient noise be given by the basic covariance structure in (17). Let the kernel function be of the form
and assume that the kernel function k is such that and
decay as
, and define
. Assume further that
. Then the result of Theorem 5 holds.
Proof A minor modification of the proof of Theorem 5 and so is relegated to the appendix Section A.2.
In Theorem 5 it is the absolute distance between iterates that determines the strength of dependence. In Theorem 7, the gradient noise covariance is sensitive instead to the the relative distance. This notion can be easily extended to products of weights in different layers, which is known to be invariant with the use of batch normalisation (Ioffe and Szegedy, 2015), used in many modern networks.
Figure 3: (a) Validation error for the PreResNet-110 on CIFAR-100 for various optimisers using IA and (b) the solution diversity given as the symmetrised KL D or total variation distance calculated on the test set and the change in validation error for the final IA point.
Naïvely combining IA with Adam is not effective, as shown in Figure 3a. Despite the same regularisation (0.0001), the error drop is significantly less than for SGD-IA. Following our intuition from Sec 3.4, we consider whether the problem could be that overly correlated solutions form the IA due to ineffective regularisation. As shown in Tab. (b) of Figure 3, both the symmetrised KL divergence D(p||q) and total variation distance
(calculated between all epochs using IA at the end of training and then averaged) are lower for Adam-IA than for SGD-IA. For adaptive optimisers,
regularisation is not equivalent to weight decay (Zhang et al., 2018; Loshchilov and Hutter, 2019), with weight decay generalising better - known as AdamW. For AdamW with a decoupled weight decay of 0.25, the solution diversity increases beyond that of SGD-IA. This is accompanied by a greater drop in validation error, even outperforming SGD-IA. We term this combination of AdamW + IA Gadam to denote a variant of Adam that generalises. Previous work has shown that limiting the belief in the Adam curvature matrix improves generalisation (Zhuang et al., 2020; Chen and Gu, 2018), hence we also incorporate such a partially adaptive Adam into our framework and term the resulting Algorithm GadamX.
For convolutional neural networks using batch normalisation, the effective learning rate is proportional to (Hoffer et al., 2018). With batch normalisation, the output is invariant to the channel weight norm, hence weight changes are only with respect to the direction of the vector. Since the effective weight decay depends on the (effective) learning rate, we expect this to lead to more regularised solutions and better validation error. To test this hypothesis, we train a VGG-16 network on CIFAR-100 with and without BN (see Figs 4a,4b): for an identical setup, the margin of improvement of Gadam over AdamW is much larger with BN. We note that while Gadam keeps the effective learning rate
high, in scheduled AdamW it quickly vanishes once we start learning rate decay. We were unable to compensate with learning rate scheduling, underscoring the importance of appropriate weight decay.
Here we present the full Gadam/GadamX algorithm. Note that for simplicity, in Algorithm 1, we present a Polyak-style averaging of every iteration. In practice we find both practical and theoretical results suggesting that averaging less frequently is almost equally good, if not better. We discuss this in Corollary 6 and conduct experiments on the averaging frequency in Section 9.
Figure 4: (a) Val. error and (b) effective learning rate of VGG-16 on CIFAR-100 with and without BN. (c) Improvement in using IA over the base optimiser for a variety of networks, closely following the linear trend.
To make a closer connection with the new optimisation algorithms proposed in this work we consider decoupled weight decay (strength ) and gradient preconditioning:
where is some approximation to the true loss Hessian used at iteration t. In the presence of weight decay, we move the true loss minimum away from the origin for the analysis, i.e.
. The update rule is then
We take to be diagonal in the eigenbasis of H, with eigenvalues
is the standard tolerance parameter (Kingma and Ba, 2014). One could try to construct the
from the Gaussian process loss model, so making them stochastic and covarying with the gradient noise, however we do not believe this is tractable. Instead, let us heuristically assume that, with high probability,
is close to
, say within a distance
, for large enough t and all i. If we take a large enough
this is true even for SGD and we expect Adam to better approximate the local curvature matrix than SGD (Granziol et al., 2020a). This results in the following theorem.
Theorem 8 Fix some and assume that
for all
, for some fixed
, with high probability. Use the update rule (38). Assume that the
are bounded away from zero and
. Further assume
, where c is a constant independent of
and is defined in the proof. Let everything else be as in Theorem 5. Then there exist constants
such that, with high probability,
Proof We begin with the equivalent of (22) for update rule (38):
To make progress, we need the following bounds valid for all
and
where the final inequality in each case can be derived from Taylor’s theorem with Lagrange’s form of the remainder (Shirali and Vasudeva, 2014). Since the are bounded away from zero, we have
established
where the constant , say. From this bound we can in turn obtain
where the second line exploits the assumption and our choice c > 1. Thus
where the second inequality follows, for large n, by summing the geometric series and again using Lagrange’s form of the remainder in Taylor’s theorem. is some constant, derived from c that we need not determine explicitly. A complementary lower bound is obtained similarly (for large n). We have thus shown that
Reusing the bound (44) then yields (39). The remaining results, (40)-(42) follow similarly using the same bounds and ideas as above, but applied to the corresponding steps from the proof of Theorem 2.
Theorem 8 demonstrates the same IA variance reduction as seen previously, but in the more general context of weight decay and adaptive optimisation. As expected, improved estimation of the true Hessian eigenvalues (i.e. smaller ) reduces the error in recovery of
. Moreover, increasing the weight decay strength
decreases the leading order error bounds in (39) and (41), but only up to a point, as the other error terms are valid and small only if
is not too large.
Here we consider VGG-16, Preactivated ResNet (PRN) and ResNeXt (Simonyan and Zisserman, 2014; He et al., 2016b; Xie et al., 2017) on CIFAR datasets (Krizhevsky et al., 2009). We also considered the down-sampled ImageNet dataset (Russakovsky et al., 2015) on Wide Residual Networks.
Learning Rate Schedule For all experiments without IA, we use the following learning rate schedule for the learning rate at the t-th epoch, similar to Izmailov et al. (2018), which we find to perform better than the conventionally employed step scheduling (refer to the experimental details in appendix Section D.4):
where is the initial learning rate. In the motivating logistic regression experiments on MNIST, we used T = 50. T = 300 is the total number of epochs budgeted for all CIFAR experiments, whereas we used T = 200 and 50 respectively for PRN-110 and WideResNet (WRN)
in ImageNet. We set r = 0.01 for all experiments. For experiments with iterate averaging, we use the following learning rate schedule instead:
where refers to the (constant) learning rate after iterate averaging activation, and in this paper we set
is the epoch after which iterate averaging is activated, and the methods to determine
was described in the main text. This schedule allows us to adjust learning rate smoothly in the epochs leading up to iterate averaging activation through a similar linear decay mechanism in the experiments without iterate averaging, as described above.
The only exception is the WRN experiments on ImageNet 3232, where we only run 50 epochs of training and start averaging from 30th epoch. We found that when using the schedule described above for the IA schedules (SWA/Gadam/GadamX), we start decay the learning rate too early and the final result is not satisfactory. Therefore, for this particular set of experiments, we use the same learning rate schedule for both averaged and normal optimisers. The only difference is that for IA experiments, we decay the learning rate until the 30th epoch and keep it fixed for the rest of the training.
Hyperparameter Tuning In CIFAR experiments, we tune the base optimisers (i.e. SGD, Adam(W), Padam(W)) only, and assuming that the ideal hyperparameters in base optimisers apply to IA, and apply the same hyperparameter setting for the corresponding IA optimisers (i.e. SWA, Gadam, GadamX). For SGD, we use a base learning rate of 0.1 and use a grid searched initial learning rates in the range of {0.001, 0.01, 0.1} and use the same learning rate for Padam, similar to the procedures suggested in Chen and Gu (2018). For Adam(W), we simply use the default initial learning rate of 0.001 except in VGG-16, where we use initial learning rate of 0.0005. After the best learning rate has been identified, we conduct a further search on the weight decay, which we find often leads to a trade-off between the convergence speed and final performance; again we search on the base optimisers only and use the same value for the IA optimisers. For CIFAR experiments, we search in the range of , from the suggestions of Loshchilov and Hutter (2019). For decoupled weight decay, we search the same range for the weight decay scaled by initial learning rate.
On ImageNet (Russakovsky et al., 2015) experiments, we conduct the following process. On WRN we use the settings recommended by Chrabaszcz et al. (2017), who conducted a thorough hyperparameter search: we set the learning rate at 0.03 and weight decay at 0.0001 for SGD/SWA and Padam, based on their searched optimal values. for AdamW/Gadam, we set decoupled weight decay at 0.01 and initial learning rate to be 0.001 (default Adam learning rate). For GadamX, we again use the same learning rate of 0.03, but since the weight decay in GadamX is partially decoupled, we set the decoupled weight decay to 0.0003. On PRN-110, we follow the recommendations of the authors of He et al. (2016b) to set the initial learning rate for SGD, Padam and GadamX to be 0.1. For AdamW and Gadam, we again use the default learning rate of 0.001. Following the observation by Loshchilov and Hutter (2019) that smaller weight decay should be used for longer training (in PRN-110 we train for 200 epochs), we set weight decay at and decoupled weight decay at 0.0003 (GadamX)/0.001 (others) respectively, where applicable.
Overall, we do not tune adaptive methods (Adam and Gadam) as much (most noticeably, we usually fix their learning rate to 0.001), and therefore in particular the AdamW results we obtain may or may not be at their optimal performance. Nonetheless, the rationale is that by design, one of the key advantage claimed is that adaptive optimisers should be less sensitive to hyperparameter choice, and in this paper, the key message is that Gadam performs well, even though its base optimiser’s parameters (AdamW) are rather crudely tuned.
In all experiments, the momentum parameter () for SGD and
,
for Adam and its variants, are left at their respective default values. For all experiments, unless otherwise stated, we average once per epoch. We also apply standard data augmentation (e.g. flip, random crops) and use a batch size of 128 for all experiments conducted.
6.1 Results
We show the results for the ResNext on CIFAR-100 in Table 1, ImageNet-32 in Table 2 and further CIFAR-100/10 results in Table 4. We also show the training curves for CIFAR-100 and ImageNet-32 in Figure 5. As AdamW always outperforms Adam in our experiments, the curves for the latter are omitted in the main text; we detail these results in the supplementary. The results show that optimisers with IA (SWA, Gadam and GadamX) invariably improve over their counterparts without, and GadamX always delivers the strongest performance. Without compromising convergence speed, Gadam outperforms tuned SGD and Padam - suggesting that solutions found by adaptive optimisers do not necessarily generalise more poorly, as suggested in the literature (Wilson et al., 2017). Indeed, any generalisation gap seems to be closed by the using IA and an appropriately implemented weight decay. We emphasise that results here are achieved without tuning the point at which we start averaging ; if we allow crude tuning of
, on CIFAR-100 GadamX achieves 77.22% (VGG-16) and 79.41% 4 (PRN-110) test accuracy respectively, which to our knowledge are the best reported performances on these architectures. We show results on ImageNet 32
32 (Chrabaszcz et al., 2017) in Figure 5c. While Gadam does not outperform our strong SGD baseline, it nevertheless improves upon AdamW greatly and posts a performance stronger than the SGD baseline in literature with identical (Chrabaszcz et al., 2017) and improved (McDonnell, 2018) setups. Finally, GadamX performs strongly, outperforming more than 3% compared to the baseline (Chrabaszcz et al., 2017) in Top-5 accuracy. We run each experiment three times with mean and standard deviation reported. In this section, all non-IA baselines are tuned rigorously with proper schedules for fair comparisons5, and we also include the results reported in the previous works in Table 8 of appendix Section C.
Table 1: ResNeXt on CIFAR-100.
Table 2: Test Accuracy on ImageNet 32
Figure 5: (a-b) Top 1 Test error on CIFAR-100, (c) Top-5 Test Error on ImageNet-32 and (d) IA test improvement over its base optimiser against number of parameters.
We compare against step learning rate decay (factor of 10 every 30 epochs) and linear schedule for SGD and AdamW for 90 epochs (He et al., 2016a), with respective initial learning rates and weight decays
on ImageNet (Russakovsky et al., 2015). We combine LookAhead (Zhang et al., 2019b) with gradient centralisation (Yong et al., 2020) as a high performance adaptive baseline Ranger (Wright, 2019), also using step decay. We search for the best performing initial learning rates for SGD, AdamW, SWA, GadamX and Ranger by factors of 3 i.e
Table 3: ImageNet Results
Table 4: CIFAR-10/100 Results
0.001, 0.003 in either direction (increase/decrease) until we find a local maximum in performance, otherwise leaving settings as in Section 6. We show the results in Table 3.
Experimenting with Partial Adaptivity for the Best Computer Vision Results: Following Granziol et al. (2020a); Choi et al. (2019), we experiment with setting the numerical stability coefficient to for GadamX and attempt an SGD like procedure for Gadam where we train with
. Note that such a large numerical stability coefficient has a similar effect to reducing the effect of the preconditioning matrix as GadamX hence also allowing for a larger global learning rate. We find that whilst the generalisation benefit of using Gadam alone is significant (without decreasing partial adaptivity) it is not competitive with SGD on this dataset, wheras leaving the numerical stability coefficient unchanged for GadamX only results in a very minor decrease in performance. We detail both of these effects in experimental finding 3.
Due to poor “out of the box” performance of SWA, we repeat the logarithmic grid search procedure on the IA learning rate for SWA. We report results in Table 3, where we see Gadam(X) strongly out-performing all baselines. We do not include the ResNet-18 as AdamW outperforms SGD with 69.92% top-1 accuracy over 69.72%, hence not a useful test-case for analysing the adaptive generalisation gap, prevalent in deeper models. Gadam nonetheless improves on this attaining 70.11%. Whilst we find that step/linear scheduling is less effective for AdamW/SGD, attaining 73.68/75.52% respectively on the ResNet-50. Since these are small difference we don’t consider scheduling to be a major factor in our outstanding results. We detail our major experimental findings from these experiments which could be of use to the community.
1. Adaptive IA makes use of huge initial learning rates: unlike SGD and SWA, which have a strong performance degradation when large initial learning rates are used, shown in Fig 7a, we find that large initial learning rates improve the generalisation performance of Gadam/GadamX, with the largest initial learning rates giving the best results.
2. Convergence speed comes at a cost: Combining a large numerical stability coefficient and large learning rates allows Gadam to give significantly superior performance to SGD. However, the price paid is in convergence speed, shown in Fig 7b. For these settings the convergence speed is often
Figure 6: Final ImageNet epochs, showing the improvement of both SGD with Iterate Averaging (SGD IA) and our proposed GadamX optimiser over the SGD step-schedule in Top-1 validation error.
Figure 7: (a) Unlike IA adaptive methods, SGD does not benefit from larger initial learning rates. (b) To attain the greatest generalisation with adaptive methods, the fast convergence is sacrificed. Gadam , has a correspondingly large learning rate 0.5
as slow or slower than GadamX. Using the same settings as in the small scale experiments (shown as Gadam in the graph) we achieve a top-1 accuracy of 75.52 for ResNet-50. Whilst this significantly improves upon the base optimiser AdamW, these results are not as strong as those of SGD. Whilst increasing the base learning rate to 0.003 increases the ResNet-50 Gadam generalisation performance to 76.53, much of the convergence speed is already lost. We note that the effective weight decay is given by so we expect higher regularisation from higher learning rates. We do not find that increasing the weight decay whilst keeping the same base learning rates produces as strong results in our experiments and hence this learning rate and weight decay interplay could form the basis for interesting future work.
3. Partially adaptive optimisation generalises best: We find that for all experiments GadamX delivers the strongest performance. We do not find a strong dependence on the choice of the IA starting point (we try epoch 61, 71, 81). We find that altering the numerical stability constant gives a small boost in Top-1 error, from 77.19 to 77.31 for the ResNet-50, but that results remain strong for the traditional setting.
Comparison to previous results: We specifically report the final (as opposed to best) validation error for all our runs. We find the best SGD ResNet-50/101 results to be 75.75/77.62%, which are slightly worse/better than the official repository results. All of these results are still significantly lower than results achieved by Gadam/GadamX. We note that iterate averaged methods seem to continually decrease error in the final epochs of training, unlike SGD, which can sometimes overfit slightly in the final epochs of training.
We run word-level language modelling using a 3-layer Long-short Term Memory (LSTM) model (Gers et al., 1999) on PTB dataset (Marcus et al., 1993) and the results are shown in Table 5 and Figure 8. Remarkably, Gadam achieves a test perplexity of 58.77 (58.61 if we tune . See Table 6 in Section 10), better than the baseline NT-ASGD in Merity et al. (2017) that runs an additional 300 epochs on an identical network. Note that since, by default, the ASGD uses a constant learning rate, we do not schedule the learning rate except Padam which requires scheduling to converge. Also, for consistency, we use a manual trigger to start averaging at the 100th epoch for ASGD (which actually outperforms the NT-ASGD variant). We additionally conduct experiments with scheduling and NT-ASGD (appendix Section D) and Gadam still outperforms. It is worth mentioning that for state of the art results in language modelling Melis et al. (2017); Brown et al. (2020); Shoeybi et al. (2019), Adam is the typical optimiser of choice. Hence these results are both encouraging and significant for wider use in the community.
Figure 8: Validation perplexity of 3-layer LSTM on PTB word-level modelling
Table 5: LSTM Penn Treebank Experimental results.
While we derive the theoretical bounds for both Polyak-style averaging on every iteration and strided averaging, in practice we use strided averaging to save on computation. We either average once per epoch similar to Izmailov et al. (2018), or select a rather arbitrary value such as averaging once per 100 iterations. The reason is both practical and theoretical: averaging much less leads to significant computational savings, and at the same time as we argued more independent iterates the benefit from averaging is better. In this case, averaging less causes the iterates to be further apart and more independent, and thus fewer number of iterates is required to achieve the similar level of performance if less independent iterates are used. We verify this both on the language and the vision experiments using the identical setup as the main text. With reference to Figure 9(a), not only is the final perplexity very insensitive to averaging frequency (note that the y-axis scale is very small), it is also interesting that averaging less actually leads to a slightly better validation perplexity compared to schemes that, say, average every iteration. We see a similar picture emerges in Figure 9(b), where the despite of following very close trajectories, averaging every iteration gives a slightly worse testing performance compared to once an epoch and is also significantly more expensive (with a NVIDIA GeForce RTX 2080 Ti GPU, each epoch of training takes around 10s if we average once per epoch but averaging every iteration takes around 20s).
Figure 9: Effect of different averaging frequencies on validation perplexity of Gadam on representative (a) Language and (b) Image classification tasks. Freq=n suggests averaging once per n iterations. freq=350 in (b) is equivalently averaging once per epoch.
In Gadam(X), we need to determine when to start averaging (in Algorithm 1), and here we investigate the sensitivity of Gadam(X) to this hyperparameter. We use a range of
for a number of different tasks and architectures (Figure 10 and Table 6), including extreme choices such as
(start averaging at the beginning). We observe that for any reasonable
, Gadam(X) always outperform their base optimisers with standard learning rate decay, and tuning
yields even more improvements over the heuristics employed in the main text, even if selecting any sensible
already can lead to a promising performance over standard learning rate decay.
Figure 10: Effect of different on the performance of various tasks and architectures.
Here we also conduct preliminary experiments on GadamAuto, a variant of Gadam that uses a constant learning rate schedule and automatically determines the starting point of averaging and training termination - this is possible given the insensitivity of the end-results towards above, and is desirable as the optimiser both has fewer hyperparameters to tune and trains faster. We use VGG-16 network on CIFAR-100. For all experiments, we simply use a flat learning rate schedule. The results are shown in Table 7. We use a patience of 10 for both the determination of the averaging activation and early termination. We also include SWA experiments with SGD iterates.
Table 6: Best results obtained from tuning
Table 7: GadamAuto Test Performance at Termination.
It can be seen that, while automatic determination for averaging trigger and early termination work well for Gadam (GadamAuto posts a performance only marginally worse than the manually tuned Gadam), they lead to a rather significant deterioration in test in SWA (SWA-Auto performs worse than tuned SWA, and even worse than tuned SGD). This highlights the benefit of using adaptive optimiser as the base optimiser in IA, as the poor performance in SWA-Auto is likely attributed to the fact that SGD is much more hyperparameter-sensitive (to initial learning rate and learning rate schedule, for example. SWA-Auto uses a constant schedule, which is sub-optimal for SGD), and that validation performance often fluctuates more during training for SGD: SWA-Auto determines averaging point based on the number of epochs of validation accuracy stagnation. For a noisy training curve, averaging might be triggered too early; while this can be ameliorated by setting a higher patience, doing so will eventually defeat the purpose of using an automatic trigger. Both issues highlighted here are less serious in adaptive optimisation, which likely leads to the better performance of GadamAuto.
Nonetheless, the fact that scheduled Gadam still outperforms GadamAuto suggests that there is still ample room of improvement to develop a truly automatic optimiser that performs as strong as or even stronger than tuned ones. One desirable alternative we propose for the future work is the integration of Rectified Adam Liu et al. (2019), which is shown to be much more insensitive to choice of hyperparameter even compared to Adam.
We propose a Gaussian Process perturbation between the batch and true risk surfaces and derive the phenomenon of improved generalisation for large learning rates and larger weight decay when combined with iterate averaging observed in practice. We extend this formalism to include adaptive methods and show that we expect further improvement when using adaptive algorithms. Based on this theory we develop two adaptive algorithms, Gadam and GadamX, variants of Adam with iterate averaging. We extensively validate Gadam and GadamX on computer vision tasks and a natural language experiment, showing strong performance against baseline and state of the art. Another interesting consequence of our work is that in all our experiments the last iterate is the best. Unlike SGD, where the epoch of best test/validation error is typically not the last and techniques such as early stopping are often employed, we find consistent near-monotonic improvements in test/validation error using our algorithms. We also find from preliminary analysis that our algorithms require less hyper-parameter tuning than SGD and variants thereof. This may be of interest for practitioners that want to get good results fast, as opposed to state of the art slowly.
Milton Abramowitz, Irene A Stegun, and Robert H Romer. Handbook of mathematical functions with formulas, graphs, and mathematical tables, 1988.
Robert J Adler and Jonathan E Taylor. Random fields and geometry. Springer Science & Business Media, 2009.
George E. Andrews, Richard Askey, and Ranjan Roy. Special Functions. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1999. doi: 10.1017/CBO9781107325937.
Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regular- izations in training deep networks? In Advances in Neural Information Processing Systems, pages 4261–4271, 2018.
Nicholas P Baskerville, Jonathan P Keating, Francesco Mezzadri, and Joseph Najnudel. The loss surfaces of neural networks with general activation functions. Journal of Statistical Mechanics: Theory and Experiment, 2021(6):064001, 2021a.
Nicholas P Baskerville, Jonathan P Keating, Francesco Mezzadri, and Joseph Najnudel. A spin-glass model for the loss surfaces of generative adversarial networks. arXiv preprint arXiv:2101.02524, 2021b.
Irwan Bello, William Fedus, Xianzhi Du, Ekin D Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579, 2021.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Jinghui Chen and Quanquan Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763, 2018.
Soumith Chintala et al. Pytorch imagenet baseline, 2017. URL https://github.com/ pytorch/examples/blob/master/imagenet/main.py. "2016 (accessed September, 2020)".
Dami Choi, Christopher J Shallue, Zachary Nado, Jaehoon Lee, Chris J Maddison, and George E Dahl. On empirical comparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446, 2019.
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819, 2017.
Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.
Aaron Defazio and Léon Bottou. On the ineffectiveness of variance reduced optimization for deep learning. In Advances in Neural Information Processing Systems, pages 1753–1763, 2019.
Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages "1019–1028". "JMLR. org", 2017.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121–2159, 2011.
John C Duchi. Introductory lectures on stochastic optimization. The Mathematics of Data, 25:99, 2018.
Elizabeth Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and general, 21(1):271, 1988.
Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, pages 8789–8798, 2018.
Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.
Diego Granziol. Flatness is a false friend. arXiv preprint arXiv:2006.09091, 2020.
Diego Granziol, Xingchen Wan, and Timur Garipov. MLRG deep curvature. arXiv preprint arXiv:1912.09656, 2019.
Diego Granziol, Samuel Albanie, Xingen Wan, and Stephen Roberts. Explaining the adaptive generalisation gap. arXiv preprint arXiv, 2020a.
Diego Granziol, Timur Garipov, Dmitry Vetrov, Stefan Zohren, Stephen Roberts, and Andrew Gordon Wilson. Towards understanding the true loss surface of deep neural networks using random matrix theory and iterative spectral methods, 2020b. URL https://openreview.net/forum? id=H1gza2NtwH.
Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.
Haowei He, Gao Huang, and Yang Yuan. Asymmetric valleys: Beyond sharp and flat local minima. arXiv preprint arXiv:1902.00744, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160–2170, 2018.
Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 304–320, 2018.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Av- eraging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752–1755. PMLR, 2019.
Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, and Krzysztof Geras. The break-even point on the optimization trajectories of deep neural networks. In International Conference on Learning Representations, 2020. URL https:// openreview.net/forum?id=r1g87C4KwB.
Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in neural information processing systems, pages 950–957, 1992.
Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.
Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, pages 11674–11685, 2019.
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 2019.
Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143, 2019.
Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborova. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. arXiv preprint arXiv:1902.00139, 2019.
Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The penn treebank. 1993.
James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
Mark D McDonnell. Training wide residual networks for deployment using a single bit for each weight. arXiv preprint arXiv:1802.08530, 2018.
Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
Marc Mezard, Giorgio Parisi, and Miguel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987.
Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.
Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions. Physical Review X, 9(1):011003, 2019.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Conver- gence results and optimal averaging schemes. In International conference on machine learning, pages 71–79. PMLR, 2013.
Satish Shirali and Harkrishan L Vasudeva. An Introduction to Mathematical Analysis. Alpha Science International, Limited, 2014.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan- zaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
Phuong Thi Tran et al. On the convergence proof of AMSGrad and a new version. IEEE Access, 7: 61706–61716, 2019.
Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
Less Wright. Ranger - a synergistic optimizer. https://github.com/lessw2020/ Ranger-Deep-Learning-Optimizer, 2019.
Lei Wu, Chao Ma, and E Weinan. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems, pages 8279–8288, 2018.
Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Self-training with noisy student improves ImageNet classification, 2019.
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
Hongwei Yong, Jianqiang Huang, Xiansheng Hua, and Lei Zhang. Gradient centralization: A new optimization technique for deep neural networks. arXiv preprint arXiv:2004.01461, 2020.
Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899, 2019.
Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018.
Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, and Roger B Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In Advances in Neural Information Processing Systems, pages 8194–8205, 2019a.
Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pages 9593–9604, 2019b.
Juntang Zhuang, Tommy Tang, Sekhar Tatikonda, Nicha Dvornek, Yifan Ding, Xenophon Pa- pademetris, and James S Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. arXiv preprint arXiv:2010.07468, 2020.
In this section we give any proofs that were omitted from the main text.
A.1 Proof of Theorem 5
The proof of Theorem 5 was given in the main text but depends on several intermediate results which we now state and prove.
Lemma 9 Take any let
. Consider
with
and let
(note that
need not diverge with P, but they can). Define
then as
and moreover as
Proof With the Euclidean volume measure, we have
where is the volume of the unit sphere in P dimensions. Therefore
where is the lower incomplete gamma function. Since
, it follows that
and so Lemma 10 can be applied to yield the result, recalling that
Lemma 10 Define the function
where is the lower incomplete gamma function. Assume that
, where x may or may not diverge with
, and more precisely
Proof We have , where
is the confluent hypergeometric function of the first kind (Andrews et al., 1999). Then
where we have used a result of Abramowitz et al. (1988). The integral in (55) can be evaluated asymptotically in the limit with
. Writing the integrand as
it is plainly seen to have no saddle points in [0, 1] given the condition
. The leading order term therefore originates at the right edge t = 1. A simple application of Laplace’s method leads to
where the penultimate line makes uses of Stirling’s approximation (Andrews et al., 1999). Since
which completes the proof.
The following two lemmas were stated in the main text. Their proofs, which we now give, depend on the preceding lemmas.
Proof of Lemma 3 Let us use the definitions of from Lemma 9 (with
the obvious way). Since
, the chain rule of probability gives
but
and so (51) gives the result.
Proof of Lemma 4 Each of the is Gaussian distributed with covariance matrix
Equation 6 and the covariance between different gradients
is similarly given by Equation 6. By standard multivariate Gaussian properties
then taking the trace
Using the covariance structure from (18) in the main text gives
from which the result follows.
A.2 Proofs of Theorem 7 and Corollary 6
Both of the proofs given here are quite similar to the proof of Theorem 5 in the main text and so we here present only the differences.
Proof of Theorem 7 The proof is much the same as that of Theorem 5, except a slightly modified version of Lemma 4 is required. The expression analogous to the result in that lemma is different in the following two ways:
1. All terms are divided by positive powers of
2. There are extra terms arising from derivatives applied to , which give rise to factors of the form
Extending the proof of Theorem 5 to this case requires the following two observations. Firstly, terms of the form can be easily bounded as
. Secondly the terms of the form
will cause no problems so long as they can be uniformly bounded away from 0 as
. This can be established with high probability as a trivial extension of Lemma 3 by introducing an extra point
, say, at the origin.
Proof of Corollary 6 The proof is just as in Theorem 2 (or Theorems 3 or 4), differing only in the values of the . Indeed, a little thought reveals that the generalisation of
to the case
Note that is just the (negative) remainder after division of
. Then for large n
and similarly
As discussed, most related works improve generalisation of adaptive methods by combining them with SGD in some form. As an example representing the recent works claiming promising performances, Chen and Gu (2018) introduce an additional hyperparameter p, to control the extent of adaptivity: for , we have fully adaptive Adam(W) or pure first-order SGD respectively and usually a p falling between the extremes is taken. In addition to empirical comparisons, since our approach is orthogonal to these approaches, as an singular example, we propose GadamX that combines Gadam with Padam, where for simplicity we follow Chen and Gu (2018) to fix
for the current work. We note that p < 1 is regularly considered a heuristic to be used for an inaccurate curvature matrix Martens (2014), although the specific choice of p = 1/2 has a principled derivation in terms of a regret bound Duchi et al. (2011). Previous works also use EMA in weight space to achieve optimisation and/or generalisation improvements: Izmailov et al. (2018) entertain EMA in SWA, although they conclude simple averaging is more competitive. Recently, Zhang et al. (2019b) proposes Lookahead (LH), a plug-in optimiser that uses EMA on the slow weights to improve convergence and generalisation. Nonetheless, having argued the dominance of noise in the high-dimensional deep learning regime, we argue that simple averaging is more theoretically desirable for generalisation. Following the identical analysis to the noisy quadratic with i.i.d noise, we consider the 1D case without loss of generality and denote
as the coefficient of decay, asymptotically the EMA point
is governed by:
Where ). An alternative analysis of EMA arriving at similar result was done in Zhang et al. (2019a), but their emphasis of comparison is between the EMA and iterates instead of EMA and the IA point in our case. From (65), while the convergence in mean is less strongly affected, the noise is reduced by a factor of
. So whilst we reduce the noise possibly by a very large factor, it does not vanish asymptotically. Hence viewing EMA or IA as noise reduction schemes, we consider IA to be far more aggressive. Secondly, EMA implicitly assumes that more recent iterates are better, or otherwise more important, than the previous iterates. While justified initially (partially explaining LH’s efficacy in accelerating optimisation), it is less so in the late stage of training. We nonetheless believe LH could be of great value. Like our proposed methods, LH features weight-space average to achieve optimisation and generalisation benefits, however LH maintains different update rules for the fast and slow weights, and uses exponentially moving average to update the parameters. In this section, we both comment on the key theoretical differences between Gadam and Lookahead and make some preliminary practical comparisons. We also offer an attempt to bring together the optimisation benefit of Lookahead and the generalisation benefit of Gadam, with promising preliminary results.
B.1 Major Differences between Gadam and Lookahead
Averaging Method Lookahead opts for a more complicated averaging scheme: they determine the ‘fast’- and ‘slow’- varying weights during optimisation, and maintains an EMA to average the weight. On the other hand, Gadam uses a more straightforward simple average. As we discussed in the main text, EMA is more theoretically justified during the initial rather than later stage of training. This can also be argued from a Bayesian viewpoint following Maddox et al. (2019), who argued that iterates are simply the draws from the posterior predictive distribution of the neural network, where as averaging leads to a rough estimation of its posterior mean. It is apparent that if the draws from this distribution are equally good (which is likely to be the case if we start averaging only if validation metrics stop improving), assigning the iterates with an exponential weight just based on when they are drawn constitutes a rather arbitrary prior in Bayesian sense.
Averaging Frequency Lookahead averages every iteration whereas in Gadam, while possible to do so as well, by default averages much less frequently. We detail our rationale for this in Section 9.
Starting Point of Averaging While Lookahead starts averaging at the beginning of the training, Gadam starts averaging either from a pre-set starting point or an automatic trigger (for GadamAuto). While authors of Lookahead Zhang et al. (2019b) argue that starting averaging eliminates the hyperparameter on when to start averaging, it is worth noting that Lookahead also introduces two additional hyperparameters , which are non-trivially determined from grid search (although the authors argue that the final result is not very sensitive to them).
We believe the difference here is caused by the different design philosophies of Gadam and Lookahead: by using EMA and starting averaging from the beginning, Lookahead benefits from faster convergence and some generalisation improvement whereas in Gadam, since the averages of iterates are not used during training to promote independece between iterates, Gadam does not additionally accelerate optimisation but, by our theory, should generalise better. As we will see in the next section, this theoretical insight is validated by the experiments and leads to combinable benefits.
Empirical Comparison
We make some empirical evaluations on CIFAR-100 data-set with different network architectures, and we use different base optimiser for Lookahead. For all experiments, we use the author-recommended default values of k = 5 (number of lookahead steps) and . We focus on the combination of Lookahead and adaptive optimisers, as this is the key focus of this paper, although we do include results with Lookahead with SGD as the base optimiser.
We first test AdamW and SGD with and without Lookahead and the results are in Figure 11. Whilst SGD + LH outperforms SGD in final test accuracy by a rather significant margin in both architectures, Lookahead does not always lead to better final test accuracy in AdamW (although it does improve the convergence speed and reduce fluctuations in test error during training, which is unsurprising as EMA shares similar characteristics with IA in reducing sensitivity to gradient noise). On the other hand, it is clear that Gadam delivers both more significant and more consistent improvements over AdamW, both here and in the rest of the paper.
Nonetheless, we believe that Lookahead, being an easy-to-use plug-in optimiser that clearly improves convergence speed, offers significant combinable potential with Gadam, which focuses on generalisation. Indeed, by using Lookahead before the 161st epoch where we start IA, and switching to IA after the starting point, we successfully combine Gadam and LH into a new optimiser which we term Gadam + LH. With reference to Figure 11, in VGG-16, Gadam + LH both converges at the fastest speed in all the optimisers tested and achieves a final test accuracy only marginally worse than Gadam (but still stronger than all others). On the other hand, in PRN-110, perhaps due to the specific architecture choice, the initial difference in convergence speed of all optimisers is minimal, but Gadam + LH clearly performs very promisingly in the end: it is not only stronger than our result without Lookahead in Figure 11(b), but also, by visual inspection, significantly stronger than the SGD + LH results on the same data-set and using the same architecture reported in the original Lookahead paper Zhang et al. (2019b).
Due to the fact that Lookahead is a very recent creation and our constraint on computational resources, we have not been able to fully test Gadam + LH on a wider range of problems. Nonetheless, we believe that the results obtained here are encouraging, and should merit more in-depth investigations in the future works.
Figure 11: Test accuracy of Lookahead in CIFAR-100 against number of epochs.
Unless otherwise stated, all experiments are run with PyTorch 1.1 on Python 3.7 Anaconda environment with GPU acceleration. We use one of the three possible GPUs for our experiment: NVIDIA GeForce GTX 1080 Ti, GeForce RTX 2080 Ti or Tesla V100. We always use a single GPU for any single run of experiment.
C.1 Validating Experiments
VGG-16 on CIFAR-100 In this expository experiment, we use the original VGG-16 without batch normalisation (batch normalisation has non-trivial impact on conventional measures of sharpness and flatness. See Li et al. (2018)). We conduct all experiments with initial learning rate 0.05. For fair comparison to previous literature, we use the linear decay schedules advocated in Izmailov et al. (2018), for both SGD and IA. For IA we run the set of terminal learning rates during averaging {0.03, 0.01, 0.003}, whereas for SGD we decay it linearly to 0.0005
C.2 Language Modelling Experiments
In language modelling experiments, we use the codebase provided by https://github.com/ salesforce/awd-lstm-lm. For ASGD, we use the hyperparameters recommended by Merity et al. (2017) and set the initial learning rate to be 30. Note that in language experiments, consistent with other findings decoupled weight decay seems to be not as effective , possibly due to LSTM could be more well-regularised already, and that batch normalisation, which we argue to be central to the efficacy of decoupled weight decay, is not used in LSTM. Thus, for this set of experiments we simply use Adam and Padam as the iterates for Gadam and GadamX. For Adam/Gadam, we tune the learning rate by searching initial learning rate in the range of {0.0003, 0.001, 0.003, 0.01} and for Padam and GadamX, we set the initial learning rate to be 1 and partially adaptive parameter
Table 8: Baseline Results from Previous Works
p = 0.2, as recommended by the authors Chen and Gu (2018). We further set the weight decay to be their recommended value of . For the learning rate schedule, we again follow Merity et al. (2017) for a piece-wise constant schedule, and decay the learning rate by a factor of 10 at the {100, 150}-th epochs for all experiments without using iterate averaging. For experiments with iterate averaging, instead of decaying the learning rate by half before averaging starts, we keep the learning rate constant throughout to make our experiment comparable with the ASGD schedule. We run all experiments for 200 (instead of 500 in Merity et al. (2017)) epochs.
Learning Rate Schedule As discussed in the main text, the experiments shown in Table 10 and Figure 8 are run with constant schedules (except for Padam). Padam runs with a step decay of factor of 10 at {100, 150}-th epochs. However, often even the adaptive methods such as Adam are scheduled with learning rate decay for enhanced performance. Therefore, we also conduct additional scheduled experiments with Adam, where we follow the same schedule of Padam. The results are shown in appendix Section D.2.
C.3 Experiment Baselines
To validate the results we obtain and to make sure that any baseline algorithms we use are properly and fairly tuned, we also survey the previous literature for baseline results where the authors use same (or similar) network architectures on the same image classification/language tasks, and the comparison of our results against theirs is presented in Table 8. It is clear that for most of the settings, our baseline results achieve similar or better performance compared to the previous work for comparable methods; this validates the rigour of our tuning process.
D.1 Testing Performance of CIFAR-10
We report the testing performance of VGG-16 and PRN-110 on CIFAR-10 in Figure 12 and Table 9. Perhaps due to the fact that CIFAR-10 poses a simpler problem compared to CIFAR-100 and ImageNet in the main text, the convergence speeds of the optimisers differ rather minimally. Nonetheless, we find that GadamX still outperforms all other optimisers by a non-trivial margin in terms of final test accuracy.
Figure 12: Test Error on CIFAR-10
Table 9: Top-1 Test Accuracy on CIFAR-10 Data-set
Table 10: Validation and Test Perplexity on Word-level Language Modelling.
D.2 Word Level Language Modelling with Learning Rate Schedules and Non-monotonic Trigger
Word-level Language Modelling on PTB Here we include additional results on word-level language modelling using scheduled Adam and NT-ASGD, where the point to start averaging is learned non-monotonically and automatically. Where scheduling further improves the Adam performance marginally, the automatically triggered ASGD actually does not perform as well as the manually triggered ASGD that starts averaging from 100th epoch onwards, as we discussed in the main text -this could be because that ASGD converges rather slowly, the 200-epoch budget is not sufficient, or the patience (we use patience = 10) requires further tuning. Otherwise, our proposed Gadam and GadamX without IA schedules still outperform the variants tested here without careful learning rate scheduling. The results are summarised in Figure 13 and Table 11.
Figure 13: Validation Perplexity of NT-ASGD and Scheduled Adam on 3-layer LSTM PTB Wordlevel Modelling.
Table 11: Validation and Test Perplexity on Word-level Language Modelling. The Gadam(X) results are lifted from Table 10.
D.3 Relation between Improvement from Averaging and Number of Parameters in Previous Work
In this section we demonstrate that our claim that there should be a dependence on number of parameters P on the margin of improvement from averaging is also present in previous works that use IA or a related ensemble method. Here we use the results from Table 1 of Izmailov et al. (2018). Since the different network architectures are trained with different budget of epochs which make the direct comparison of SWA results difficult, we instead consider their FGE (Garipov et al., 2018) results which the author argue to have the similar properties to and that is actually approximated by SWA. We plot their result along with us in Figure 14. While we do not obtain a perfect linear relationship possibly due to a wide range of possible interfering factors such as difference in architecture, use of batch normalisation, choice of optimiser and hyperparameter tuning, again we nevertheless observe that there exists a roughly positive relationship between P and the margin of test improvement.
D.4 Linear vs Step Scheduling
In this work, for the baseline methods in image classification tasks we use linear instead of the more conventionally employed step scheduling because we find linear scheduling to generally perform better in the experiments we conduct. In this section, we detail the results of these experiments, and in this section, ‘linear’ refers to the schedule introduced in appendix Section ?? and ‘step’ refers to the schedule that reduces the learning rate by a factor of 10 in {150, 250} epochs for 300-epoch experiments (CIFAR datasets), or in {25, 40} epochs for 50-epoch experiments (ImageNet dataset). The results are shown in Table 12.
Figure 14: Number of parameters P against improvement margin for both results obtained by us and in Izmailov et al. (2018) (annotated with asterisks) on CIFAR-100
Table 12: Testing performance of linear and step learning rate schedules on baseline methods.
We argue that despite of the universal practical use of the final iterate of optimisation, it is heuristically motivated and in most proofs of convergence, some form of iterative averaging is required and used implicitly to derive the theoretical bounds. For -Lipschitz, convex empirical risks, denoted the (overall) loss L. The difference between the t + 1’th iterate and the optimal solution
can be bounded. The sum of differences along the trajectory (known as the regret) telescopes, hence resulting in a convergence rate for the average regret which is an upper bound for the loss of the
average point Nesterov (2013); Duchi (2018):
where is the noisy gradient at
and
is its variance: Var
. Noting that
Using Jensen’s inequality, we have:
where , and R is the regret. Setting
in equation 67 gives us the optimal convergence rate. Similar convergence results can be given for a decreasing step size
. For adaptive optimisers, the noisy gradient is preconditioned by some non-identity matrix
Methods of proof Reddi et al. (2019); Tran et al. (2019) rely on bounding the regret and showing that the average regret
and Equation 67 explicitly demonstrates that the average regret is an upper bound on the expected loss for the average point in the trajectory. Hence existing convergence results in the literature prove convergence for the iterate average, but not the final iterate.
Setting
results can be given for a decreasing step size Duchi (2018) when the number of
iterations T is not known in advance. Given the use of both iterate averaging and learning rate schedule in the proofs, it is difficult to understand the relative importance of the two and how this compares with the typical heuristic of using the final point.
E.1 Relevance of Local Geometry Arguments
One argument as to why IA improves generalisation Izmailov et al. (2018) is about the local geometry of the solution found: Izmailov et al. (2018) discuss the better generalisation of SWA to the “flatter” minimum it finds. The same argument is used to explain the apparent worse generalisation of adaptive method: Wu et al. (2018) showed empirically that adaptive methods are not drawn to flat minima unlike SGD. From both Bayesian and minimum description length arguments Hochreiter and Schmidhuber (1997), flatter minima generalise better, as they capture more probability mass. He et al. (2019) formalise the intuition under the assumption of a shift between the training and testing loss surface and investigate the presence of “flat valleys” in loss landscape. They argue that averaging leads to a biased solution to the “flatter” valley, which has worse training but better generalisation performance due to the shift. This suggests IA has an inherent regularising effect, which contrasts with our previous claim that IA should improve both.
However, one issue in the aforementioned analysis, is that they train their SGD baseline and averaged schemes on different learning rate schedules. While this is practically justified, and even desirable, exactly because IA performs better with high learning rate as argued, for theoretical analysis on the relevance of the landscape geometry to solution quality, it introduces interfering factors. It is known that the learning rate schedule can have a significant impact on both performance and curvature Jastrzebski et al. (2020). We address this by considering IA and the iterates, for the same learning rate to specifically alleviate this issue. We use the VGG-16 without BN6 using both AdamW/Gadam and SGD/SWA. In addition to the test and training statistics, we also examine the spectral norm, Frobenius norm and trace which serve as different measures on the “sharpness” of the solutions using the spectral tool by Granziol et al. (2019); we show the results in Table 13. We
Table 13: Performance and Hessian-based sharpness metrics on CIFAR-100 using VGG-16. The numerical results for iterates are in brackets.
find a rather mixed result with respect to the local geometry argument. While averaging indeed leads to solutions with lower curvature, we find no clear correlation between flatness and generalisation. One example is that compared to SGD, the best performing Gadam run has larger spectral norm,
larger Frobenius norm and
larger Hessian trace, yet the test accuracy is only 0.09% worse. Either our metrics do not sufficiently represent sharpness, which is unlikely since we included multiple metrics commonly used, or that it is not the most relevant explanation for the generalisation gain. We hypothesise the reason here is that the critical assumption, upon which the geometry argument builds, that there exist only shifts between test and train surfaces is unsound despite a sound analysis given that. For example, recent work has shown under certain assumptions that the true risk surface is everywhere flatter than the empirical counterpart Granziol et al. (2020b). Furthermore, for any arbitrary learning rate, as predicted IA helps both optimisation and generalisation compared to iterates of the same learning rate; any trade-offs between optimisation and generalisation seem to stem from the choice of learning rates only.