b

DiscoverModelsSearch
About
Nesterov acceleration despite very noisy gradients
2023
·
arXiv
Abstract

We present a generalization of Nesterov’s accelerated gradient descent algorithm. Our algorithm (AGNES) provably achieves acceleration for smooth convex and strongly convex minimization tasks with noisy gradient estimates if the noise intensity is proportional to the magnitude of the gradient at every point. Nesterov’s method converges at an accelerated rate if the constant of proportionality is below 1, while AGNES accommodates any signal-to-noise ratio. The noise model is motivated by applications in overparametrized machine learning. AGNES requires only two parameters in convex and three in strongly convex minimization tasks, improving on existing methods. We further provide clear geometric interpretations and heuristics for the choice of parameters.

The recent success of deep learning [LeCun et al., 2015] is built on stochastic first order optimization methods such as stochastic gradient descent [LeCun et al., 1998] and ADAM [Kingma and Ba, 2014], which have enabled the large-scale training of neural networks. While such tasks are generally non-convex, accelerated first order methods for convex optimization have proved practically useful. Specifically, Nesterov [1983]’s accelerated gradient descent has become a standard training method [Sutskever et al., 2013].

Modern neural networks tend to operate in the overparametrized regime, i.e. the number of model parameters exceeds the number of data points to be fit [Belkin, 2021]. In this setting, minibatch gradient estimates are exact (namely, exactly 0) on the set of global minimizers since data can be interpolated exactly. Motivated by such applications, Vaswani et al. [2019] proved that Nesterov [2012]’s accelerated coordinate descent method (ACDM) achieves acceleration in (strongly) convex optimization with multiplicative noise, i.e. when assuming stochastic gradient estimates for which the noise intensity scales linearly with the magnitude of the gradient. Conversely, Liu and Belkin [2018] show that the original version of Nesterov [1983]’s method generally does not achieve acceleration in this setting.

Another algorithm with a similar goal is the continuized Nesterov method (CNM), which has been studied by Even et al. [2021], Berthier et al. [2021] in convex optimization (deterministic or with additive noise) and with multiplicative noise for overparametrized linear least squares regression. For a more extensive discussion of the context of our work in the literature, please see Section 2.

Vaswani et al. [2019]’s algorithm is a four parameter scheme in the strongly convex case, which reduces to a three parameter scheme in the convex case. Liu and Belkin [2018] introduce a simpler three parameter scheme, but only prove that it achieves acceleration for overparametrized linear problems. In this work, we demonstrate that it is possible to achieve the same theoretical guarantees as Vaswani et al. [2019] with a simpler scheme, which can be considered as a reparametrized version of Liu and Belkin [2018]’s Momentum-Added Stochastic Solver (MaSS) method. More precisely, we prove the following:

1. We show that Nesterov’s accelerated gradient descent achieves an accelerated convergence rate, but only with noise which is strictly smaller than the gradient in the  L2-sense. We alsoshow numerically that when the noise is larger than the gradient, the algorithm diverges for a choice of step size for which gradient descent remains convergent.

2. Motivated by this, we introduce a generalization of Nesterov’s method, which we call Accelerated Gradient descent with Noisy EStimators (AGNES), which provably achieves acceleration no matter how large the noise is relative to the gradient, both in the convex and strongly convex cases.

3. When moving from NAG to AGNES, the learning rate ‘bifurcates’ to two parameters in order to accommodate stochastic gradient estimates. The extension requires three hyperparameters in the strongly convex case and two in the convex case.

4. We provide a transparent geometric interpretation of the AGNES parameters in terms of their scaling with problem parameters (Appendix F.3) and the continuum limit models for various scaling regimes (Appendix C).

5. We build strong intuition for the choice of hyperparameters for machine learning applications and empirically demonstrate that AGNES improves the training of CNNs relative to SGD with momentum and Nesterov’s accelerated gradient descent.

Accelerated first order methods. Accelerated first order methods have been extensively studied in convex optimization. Beginning with the conjugate gradient (CG) algorithm introduced by Hestenes and Stiefel [1952], the Heavy ball method of Polyak [1964], and Nesterov [1983]’s seminal work on accelerated gradient descent, many authors have developed and analyzed accelerated first order methods for convex problems, including Beck and Teboulle [2009], Nesterov [2012, 2013], Chambolle and Dossal [2015], Kim and Fessler [2018] to name just a few.

An important line of research is to gain an understanding of how accelerated methods work. After Polyak [1964] derived the original Heavy ball method as a discretization of an ordinary differential equation, Alvarez et al. [2002], Su et al. [2014], Wibisono et al. [2016], Zhang et al. [2018], Siegel [2019], Shi et al. [2019], Muehlebach and Jordan [2019], Wilson et al. [2021], Shi et al. [2021], Suh et al. [2022], Attouch et al. [2022], Aujol et al. [2022b,a], Dambrine et al. [2022] studied accelerated first order methods from the point of view of ODEs. This perspective has facilitated the use of Lyapunov functional analysis to quantify the convergence properties. We remark that in addition to the intuition provided by differential equations, Joulani et al. [2020] and Gasnikov and Nesterov [2018] have also proposed interesting ideas for explaining and deriving accelerated first-order methods. In addition, there has been a large interest in deriving adaptive accelerated first order methods, see for instance Levy et al. [2018], Cutkosky [2019], Kavis et al. [2019].

Stochastic optimization. Robbins and Monro [1951] first introduced optimization algorithms where gradients are only estimated by a stochastic oracle. For convex optimization, Nemirovski et al. [2009], Ghadimi and Lan [2012] obtained minimax-optimal convergence rates with additive stochastic noise.

In deep learning, stochastic algorithms are ubiquitous in the training of deep neural networks, see [LeCun et al., 1998, 2015, Goodfellow et al., 2016, Bottou et al., 2018]. Here, the additive noise assumption not usually appropriate. As Wojtowytsch [2023], Wu et al. [2022a] show, the noise is of low rank and degenerates on the set of global minimizers. Stich [2019], Stich and Karimireddy [2022], Bassily et al. [2018], Gower et al. [2019], Damian et al. [2021], Wojtowytsch [2023], Zhou et al. [2020] consider various non-standard noise models and [Wojtowytsch, 2021, Zhou et al., 2020, Li et al., 2022] study the continuous time limit of stochastic gradient descent. These include noise assumptions for degenerate noise due to Bassily et al. [2018], Damian et al. [2021], Wojtowytsch [2023, 2021], low rank noise studied by Damian et al. [2021], Li et al. [2022] and noise with heavy tails explored by Zhou et al. [2020].

Acceleration with stochastic gradients. Kidambi et al. [2018] prove that there are situations in which it is impossible for any first order oracle method to improve upon SGD due to information- theoretic lower bounds. More generally, lower bounds in the stochastic first order oracle (SFO) model were presented by Nemirovski et al. [2009] (see also [Ghadimi and Lan, 2012]).

A partial improvement on the state of the art is given by Jain et al. [2018], who present an accelerated stochastic gradient method motivated by a particular low-dimensional and strongly convex problem. Laborde and Oberman [2020] obtain faster convergence of an accelerated method under an additive noise assumption by a Lyapunov function analysis. Bollapragada et al. [2022] study an accelerated gradient method for the optimization of a strongly convex quadratic objective function with minibatch noise.

Closest to our work are Liu and Belkin [2018], Vaswani et al. [2019], Even et al. [2021], Berthier et al. [2021] who study generalizations of Nesterov’s method in stochastic optimization. Liu and Belkin [2018], Even et al. [2021] obtain guarantees with noise of approximately multiplicative noise in overparametrized linear least squares problems and for general convex objective functions with additive noise and in deterministic optimization. Vaswani et al. [2019] obtain comparable guarantees for the more complicated method of Nesterov [2013].

3.1 Assumptions

In the remainder of this article, we consider the task of minimizing an objective function  f : Rm → Rusing stochastic gradient estimates g. We assume that f, g and the initial condition  x0 satisfy:

1. The initial condition  x0is a (potentially random) point such that  E[f(x0) + ∥x0∥2] < ∞.2.  f is L−smooth, i.e.  ∇f is L−Lipschitz continuous with respect to the Euclidean norm.

3. There exists a probability space  (Ω, A, P)and a gradient estimator, i.e. a measurable function g : Rm × Ω → Rm such that for all  x ∈ Rm the properties

image

A justification of the multiplicative noise scaling is given in Section 4. In the setting of machine learning, the space  Ωis given by the random subsampling of the dataset. A rigorous discussion of the probabilistic foundations is given in Appendix D.

3.2 Nesterov’s Method with Multiplicative Noise

First we analyze Nesterov [1983]’s accelerated gradient descent algorithm (NAG) in the setting of multiplicative noise. NAG is given by the initialization  x0 = x′0 and the two-step iteration

image

where  g′n = g(x′n, ωn)and the variables  ωnare iid samples from the probability space  Ω, i.e.  g′nis an unbiased estimate of  ∇f(x′n). We write ρinstead of  ρn in cases where a dependence on n is not required. We show that this scheme achieves an  O(1/n2)convergence rate for convex functions but only in the case that  σ < 1. To the best of our knowledge, this analysis is optimal.

Theorem 1 (NAG, convex case). Suppose that  xn and x′n are generated by the time-stepping scheme (1), f and g satisfy the conditions laid out in Section 3.1, f is convex, and  x∗is a point such that f(x∗) = infx∈Rm f(x). If σ < 1and the parameters are chosen such that

image

The expectation on the right hand side is over the random initialization  x0.

The proof of Theorem 1 is given in Appendix E. Note that the constant  1/ηblows up as  σ ↗ 1 andthe analysis yields no guarantees for  σ > 1. This mirrors numerical experiments in Section 5.

Theorem 2 (NAG, strongly convex case). In addition to the assumptions in Theorem 1, suppose that f is µ-strongly convex and the parameters are chosen such that

image

Just like in the convex case, the step size  ηdecreases to zero as  σ ↗ 1, and we fail to obtain convergence guarantees for  σ ≥ 1. We argue in the proof of Theorem 2, given in Appendix F, that it is not possible to modify the Lyapunov sequence analysis to obtain a better rate of convergence. This motivates our introduction of the more general AGNES method below.

Notably, there cannot be a diverging lower bound for NAG since gradient descent arises in the special case  ρ = 0, and gradient descent converges for small stepsize with multiplicative noise [Wojtowytsch, 2023]. On the other hand, Liu and Belkin [2018] show that NAG does not achieve accelerated convergence with multiplicative type noise even for quadratic strongly convex functions.

3.3 AGNES Descent algorithm

The proofs of Theorems 1 and 2 suggest that the momentum step in (1) is quite sensitive to the step size used for the gradient step, which severely restricts the step size  η. We propose the Accelerated Gradient descent with Noisy EStimators (AGNES) scheme, which addresses this problem by introducing an additional parameter  αin the momentum step:

image

where  g′n = g(x′n, ωn)as before. Equivalently, AGNES can be formulated as a three-step scheme with an auxiliary velocity variable  vn, initialized as  v0 = 0:

image

We show that the two formulations of AGNES are equivalent in Appendix B.1. However, we find (3) more intuitive (see Appendix C for a continuous time interpretation) and easier to implement as an algorithm without storing past values of  xn. The pseudocode and a set of suggested default parameters are given in Algorithm 1.

image

From (1) and (2), we note that NAG is AGNES with the special choice  α = η. Allowing  αand ηto be different helps AGNES achieve an accelerated rate of convergence for both convex and strongly convex functions, no matter how large  σis. While for gradient descent, only the product L(1 + σ2)has to be considered, this is not the case for momentum-based schemes. We consider first the convergence rate in the convex case.

Theorem 3 (AGNES, convex case). Suppose that  xnand  x′nare generated by the time-stepping scheme (3),  f and g′n = g(x′n, ωn)satisfy the conditions laid out in Section 3.1, f is convex, and  x∗is a point such that  f(x∗) = infx∈Rm f(x). If the parameters are chosen such that

image

In particular, if  η ≤ 1L(1+2σ2), we may make the universal choice  a0 = 4, i.e.  ρn = nn+5. Only the parameters  η, αdepend on the specific problem. The proof of Theorem 3 is given in Appendix E.

There, we also present an alternative version of Theorem 3 for a different choice of parameters

image

The benefit of the accelerated scheme is an improvement from a decay rate of O(1/n) to the rate O(1/n2), which is optimal under the given assumptions even in the deterministic case. While the noise can be orders of magnitude larger than the quantity we want to estimate, it only affects the constants in the convergence, not the rate. We get an analogous result for strongly convex functions.

Theorem 4 (AGNES, strongly convex case). In addition to the assumptions in Theorem 3, suppose that  f is µ-strongly convex and the parameters are chosen such that

image

Choosing  ηtoo small can be interpreted as overestimating  L or σ. Choosing αtoo small (with respect to  η) can be interpreted as overestimating  σ. Since every L-Lipschitz function is  L′-Lipschitz for L′ > L, and since the multiplicative noise bound with constant  σimplies the same bound with σ′ > σ, exponential convergence still holds at a generally slower rate.

We note that since  |∇f(x)|2 ≤ 2L(f(x)−inf f) (Lemma 12in Appendix D), Theorems 3 and 4 lead to analogous convergence results for  E[∇f(xn)]as well. Due to the summability of the sequences n−2 and rn for r < 1, we get not only convergence in expectation but also almost sure convergence. The proof is given in Appendix E.

Corollary 5. In the setting of Theorems 3 and 4,  f(xn) → inf fwith probability 1.

In the deterministic case  σ = 0, we have  α = ηin both Theorems 3 and 4. In Theorem 4, the parameters coincide with the usual choice for NAG, while we opted for a simple statement in Theorem 3 which does not exactly recover the standard choice  η = 1/Land  ρn = n/(n + 3). The proofs below easily cover these special cases as well. If  0 < σ < 1, both AGNES and NAG converge with the same rate  n−2 in the convex case, but the constant of NAG is always larger. In the strongly convex case, even the decay rate of NAG is slower than AGNES for  σ ∈ (0, 1) since 1 − σ2 < (1 + σ2)−1.We see the real power of AGNES in the stochastic setting where it converges for very high values of σwhen Nesterov’s method may diverge. For the optimal choice of parameters, we summarize the results in terms of the time-complexity of SGD and AGNES in Figure 1. For the related guarantee for SGD, see Theorems 17 and 22 in Appendices E and F respectively.

Remark 6 (Batching). Let us compare AGNES with two families of gradient estimators:

1.  g′n = g(x′n, ωn)as studied in Theorems 3 and 4.

2. A gradient estimator  g′n := 1nb�nbj=1 g(x′n, ωn,j)which averages multiple independent estimates to reduce the variance.

The second gradient estimator falls into the same framework with ˜Ω = Ωnband  ˜σ2 = σ2/nb. Assuming vector additions cost negligible time, optimizer steps are only as expensive as gradient evaluations. In this setting – which is often realistic in deep learning – it is appropriate to compare E[f(xnbn)] (nb · niterations using  g′n) and E[f(Xn)] (niterations with  g′n). For the strongly convex case, we note that�1 −� µL 11+σ2�nb ≤ 1 −� µL 11+σ2/nb if and only if

image

image

Figure 1: The minimal n for AGNES and SGD such that  E[f(xn) − inf f] < εwhen minimizing an L-smooth function with multiplicative noise intensity  σin the gradient estimates and under a convexity assumption. The SGD rate of the  µ-strongly convex case is achieved more generally under a PL condition with PL-constant  µ. While SGD requires the optimal choice of one variable to achieve the optimal rate, AGNES requires three (two in the determinstic case).

The approximation is well-justified in the important case that  µ ≪ L. In particular, the upper bound for non-batching AGNES is always favorable compared to the batching version as  nb ∈ N≥1, and thetwo only match for the optimal batch size  nb = 1. The optimal batch size for minimizing f is the largest one that can be processed in parallel without increasing the computing time for a single step. A similar argument holds for the convex case.

With a slight modification, the proof of Theorem 3 extends to the situation of convex objective functions which do not have minimizers. Such objectives arise for example in linear classification with the popular cross-entropy loss function and linearly separable data.

Theorem 7 (Convexity without minimizers). Let f be a convex objective function satisfying the assumptions in Section 3.1 and  xnbe generated by the time-stepping scheme (3). Assume that  η, αand  ρnare as in Theorem 3. Then  lim infn→∞ E[f(xn)] = infx∈Rm f(x).

The proof and more details are given in Appendix E. For completeness, we consider the case of non-convex optimization in Appendix G. As a limitation, we note that multiplicative noise is well-motivated in machine learning for global minimizers, but not at generic critical points.

3.4 Geometric Interpretation

Let us briefly discuss the parameter choices in Theorem 4. As we consider larger  σ for fixed µ and L,the decay factor  ρmoves closer to 1. This slows the ‘forgetting’ of past gradients in  vn, allowing us to better average out stochastic noise. The price we pay is computing with more outdated gradients, slowing convergence. Our choice balances these effects.

In AGNES,  ρinadvertently also governs magnitude of the momentum variable  vn, which scales as (1 − ρ)−1for objective functions with constant gradient and  n ≫ 1. To compensate, we choose αsmaller compared to  ηwhen  σ(and thus  (1 − ρ)−1) is large. Nevertheless, the effect of the momentum step does not decrease. For further details, see Appendix F.3.

For further interpretability, we obtain several ODE and SDE continuous time descriptions of AGNES in Appendix C.

In supervised learning applications, the learning task often corresponds to minimizing a risk or loss function  R(w) = 1N�Ni=1 ℓ�h(w, xi), yi�=: 1N�Ni=1 ℓi(w), where  h : Rm × Rd → Rk, (w, x) �→ h(w, x)and  ℓ : Rk × Rk → [0, ∞)are a parametrized function of weights w and data x and a loss function measuring compliance between  h(w, xi) and yirespectively.1 Safran and Shamir [2018], Chizat and Bach [2018], Du et al. [2018] show that working in the overparametrized regime m ≫ Nsimplifies the optimization process and Belkin et al. [2019, 2020] illustrate that it facilitates generalization to previously unseen data. Cooper [2019] shows that fitting N constraints with m parameters typically leads to an  m − N-dimensional submanifold M of the parameter space  Rm

image

Figure 2: To be able to quantify the gradient noise exactly, we choose relatively small models and data sets. Left: A ReLU network with four hidden layers of width 250 is trained by SGD to fit random labels  yi(drawn from a 2-dimensional standard Gaussian) at 1, 000 random data points  xi(drawn from a 500-dimensional standard Gaussian). The variance  σ2of the gradient estimators is ∼ 105times larger than the loss function and  ∼ 106times larger than the parameter gradient. This relationship is stable over approximately ten orders of magnitude. Right: A ReLU network with two hidden layers of width 50 is trained by SGD to fit the Runge function  1/(1 + x2)on equispaced data samples in the interval  [−8, 8]. Also here, the variance in the gradient estimates is proportional to both the loss function and the magnitude of the gradient.

such that all given labels  yiare fit exactly by  h(w, ·)at the data points  xi for w ∈ M, i.e. R ≡ 0 onthe smooth set of minimizers  M = R−1({0}).

If N is large, it is computationally expensive to evaluate the gradient  ∇R(w) = 1N�Ni=1 ∇ℓi of therisk function R exactly and we commonly resort to stochastic estimates

image

where  Ib ⊆ {1, . . . , N}is a subsampled collection of  nbdata points (a batch or minibatch). Minibatch gradient estimates are very different from the stochasticity we encounter e.g. in statistical mechanics:

1. The covariance matrix  Σ = 1N�Ni=1�∇ℓi −∇R)⊗(∇ℓi −∇R)of the gradient estimators ∇ℓihas low rank  N ≪ m.

2. Assume specifically that  ℓis a loss function which satisfies  ℓ(y, y) = 0 for all y ∈ Rk, suchas the popular  ℓ2-loss function  ℓ(h, y) = ∥h−y∥2. Then ∇ℓi(w) = 0 for all i ∈ {1, . . . , N}and all  w ∈ M = R−1(0). In particular, minibatch gradient estimates are exact on M.

image

Lemma 8 is proved in Appendix H. It is a modification of [Wojtowytsch, 2023, Lemma 2.14] for function models which are locally, but not globally Lipschitz-continuous in the weights w, such as deep neural networks with smooth activation function. The exponent p may scale with network depth.

Lemma 8 describes the variance of a gradient estimator which uses a random index  i ∈ {1, . . . , N}and the associated gradient  ∇ℓiis used to approximate  ∇R. If a batch  Ibof  nbindices is selected randomly with replacement, then the variance of the estimates scales in the usual way:

image

image

Figure 3: We plot  E�fd(xn)�on a loglog scale for SGD (blue), AGNES (red), NAG (green), ACDM (orange) and CNM (maroon) with d = 4 (left) and d = 16 (right) for noise levels  σ = 0(solid line),  σ = 10(dashed) and  σ = 50(dotted). The initial condition is  x0 = 1in all simulations. Means are computed over 200 runs. After an initial plateau, AGNES, CNM and ACDM significantly outperform SGD in all settings, while NAG (green) diverges if  σis large. The length of the initial plateau increases with  σ.

As noted by Wu et al. [2019, 2022b],  R and ∥∇R∥2 often behave similarly in overparametrized deep learning. We illustrate this in Figure 2 together with Lemma 8. Heuristically, we therefore replaced (4) by a more manageable assumption akin to  E[ 1N�Ni=1 ∥∇ℓi − ∇R∥2] ≤ σ2∥∇R∥2in Section 3.1. The setting where the signal-to-noise ratio (the quotient of estimate variance and true magnitude) is  Ω(1)is often referred to as ‘multiplicative noise’, as it resembles the noise generated by estimates of the form  g = (1 + σZ)∇R, where  Z ∼ N(0, 1). When the objective function is L-smooth and satisfies a PL condition (see e.g. [Karimi et al., 2016]), both scaling assumptions are equivalent.

5.1 Convex optimization

We compare the optimization algorithms for the family of objective functions

image

for  d ≥ 2with gradient estimators  g = (1 + σN)f ′(x), where N is a unit normal random variable. The functions are convex and their derivatives are Lipschitz-continuous with  L = d(d − 1). Varioustrajectories are compared for different values of d and  σin Figure 3. We run AGNES with the parameters  α = η1+σ2 , η = 1L(1+2σ2), ρn = nn+5 derived above and SGD with the optimal step size η = 1L(1+σ2) (see Lemmas 16 and 17). For NAG, we select  η = 1L(1+σ2) and ρn = nn+3. We present a similar experiment in the strongly convex case in Appendix A.

We additionally compare to two other methods of accelerated gradient descent which were recently proposed for multiplicative noise models: The ACDM method of Nesterov [2012], Vaswani et al. [2019], and the continuized Nesterov method (CNM) of Even et al. [2021], Berthier et al. [2021] with the proposed parameters. In this simple setting where all constants are known, AGNES, ACDM and CNM perform comparably in the long run and on average.

5.2 Neural network regression

We generated n = 100, 000 12-dimensional random vectors. Using a fixed, randomly initialized neural network  f ∗ (with 10 hidden layers, each with width 10, and output dimension 1), we produced labels  yi = f ∗(xi). The resulting dataset was split into 90% training and 10% testing data. We then trained identically initialized copies of a larger neural network (15 hidden layers, each with width 15) using Adam, NAG, SGD with momentum, and AGNES to minimize the mean-squared error (MSE) loss.

image

Figure 4: We report the training loss as a running average with decay rate 0.99 (top row) and test loss (bottom row) for batch sizes 100 (left column), 50 (middle column), and 10 (right column) in the setting of Section 5.2. The horizontal axis represents the number of optimizer steps. The performance gap between AGNES and other algorithms widens for smaller batch sizes, where the gradient estimates are more stochastic and the two different parameters  α, ηadd the most benefit.

We selected the learning rate  10−3for Adam as it performed poorly at higher or lower rates  10−2and  10−4. For AGNES, NAG, and SGD, based on initial exploratory experiments, we used a learning rate of  10−4, a momentum value of 0.99, and for AGNES, a correction step size  η = 10−3. The experiment was repeated 10 times each for batch sizes 100, 50, and 10, and run for 45,000 optimizer steps each time. The average loss and standard deviation for each algorithm are reported in Figure 4. The results show that AGNES performs better than SGD and NAG for all batch sizes. With large batch size, Adam performs well with default hyperparameters. The performance of AGNES relative to other algorithms especially improves as the batch size decreases.

5.3 Image classification

We trained ResNet-34 [He et al., 2016] with batch sizes 50 and 10, and ResNet-50 with batch size 50 on the CIFAR-10 image dataset [Krizhevsky et al., 2009] with standard data augmentation (normalization, random crop, and random flip) using Adam, SGD with momentum, NAG, and AGNES. The model implementations were based on [Liu, 2017]. Each algorithm was provided an identically initialized model and the experiment was repeated 5 times for 50 epochs each. The averages and standard deviations of training loss and test accuracy are reported in Figure 5. We used the same initial learning rate  10−3 for all the algorithms, which was dropped to  10−4 after 25 epochs. A momentum value of 0.99 was used for SGD, NAG, and AGNES and a constant correction step size η = 10−2 was used for AGNES.

AGNES reliably outperforms SGD and NAG both in terms of training loss and test accuracy. The gap in performance appears to increase as model size increases or batch size decreases, suggesting that AGNES primarily excels in situations where gradients are harder to estimate accurately. For the sake of completeness, we include Adam with default hyperparameters as a comparison.

In congruence with convergence guarantees from convex optimization, grid search suggests that  α isthe primary learning rate and  ηshould be chosen larger than  α. We tried NAG and Adam with higher learning rates  10−2and  10−1as well to ensure a fair comparison with AGNES, but found that they become unstable or perform worse for larger learning rates in our experiments. The AGNES default parameters  α = 10−3, η = 10−2, ρ = 0.99in Algorithm 1 give consistently strong performance on different models but can be further tuned to improve performance. While the numerical experiments we performed support our theoretical predictions, we acknowledge that our focus lies on theoretical guarantees and we did not test these predictions over a broad set of benchmark problems.

image

Figure 5: We report the training loss as a running average with decay rate 0.99 (top row) and test accuracy (bottom row) for ResNet-34 trained on CIFAR-10 with batch sizes 50 (left column) and 10 (middle column), and ResNet-50 trained with batch size 50 (right column). The performance of AGNES with the proposed hyperparameters is stable over the changes in model and batch size.

image

Figure 6: We report the average training loss after each epoch for six epochs for training LeNet-5 on MNIST with AGNES for various combinations of the hyperparameters  αand  ηto illustrate that αis the algorithm’s primary learning rate. Left: For a given  α(color coded), the difference in the trajectory for the three values of  η(line style) is marginal. On the other hand, choosing  αwell significantly affects performance. Middle: For any given  α, the largest value of  ηperforms much better than the other three values which have near-identical performance. Nevertheless, the worst performing value of  ηwith well chosen  α = 5 · 10−3 performs better than the best performing value of  η with α = 5 · 10−4. Right: When αis too large, the loss increases irrespective of the value of  η.

We present a more thorough comparison of NAG and AGNES with various parameter selections in Figure 8 in Appendix A. With default parameters or minimal parameter tuning, AGNES reliably achieves superior performance compared to NAG (training loss) and smoother curves, suggesting more stable behavior (test accuracy).

5.4 Hyperparameter comparison

We tried various combinations of AGNES hyperparameters  α and ηto train LeNet-5 on the MNIST dataset to determine which hyperparameter has a greater impact on training. With a fixed batch size of 60 and a momentum value  ρ = 0.99, we trained independent copies of the model for 6 epochs for each combination of the hyperparameters. The average training loss over the epoch was recorded after each epoch. The results are reported in Figure 6. We see that  αhas the largest impact on the rate of decay of the loss, which establishes it as the ‘primary learning rage’. If  αis too small, the algorithm converges slowly and if  αis too large, it diverges. If  αis chosen correctly, a good choice of the correction step size  η(which can be orders of magnitude larger than  α) further accelerates convergence, but  ηcannot compensate for a poor choice of  α.

Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing. This research was also supported in part by the University of Pittsburgh Center for Research Computing, RRID:SCR_022735, through the resources provided. Specifically, this work used the H2P cluster, which is supported by NSF award number OAC-2117681.

Felipe Alvarez, Hedy Attouch, Jérôme Bolte, and Patrick Redont. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: Application to optimization and mechanics. Journal de mathématiques pures et appliquées, 81(8):747–779, 2002.

Hedy Attouch, Zaki Chbani, Jalal Fadili, and Hassan Riahi. First-order optimization algorithms via inertial systems with hessian driven damping. Mathematical Programming, pages 1–43, 2022.

J. F. Aujol, Ch. Dossal, and A. Rondepierre. Convergence rates of the heavy-ball method under the łojasiewicz property. Mathematical Programming, 2022a. doi: 10.1007/s10107-022-01770-2. URL https://doi.org/10.1007/s10107-022-01770-2.

J.-F. Aujol, Ch. Dossal, and A. Rondepierre. Convergence rates of the heavy ball method for quasi- strongly convex optimization. SIAM Journal on Optimization, 32(3):1817–1842, 2022b. doi: 10.1137/21M1403990. URL https://doi.org/10.1137/21M1403990.

Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. CoRR, abs/1811.02564, 2018. URL http://arxiv.org/abs/1811. 02564.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.

Mikhail Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica, 30:203–248, 2021.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020.

Raphaël Berthier, Francis Bach, Nicolas Flammarion, Pierre Gaillard, and Adrien Taylor. A con- tinuized view on nesterov acceleration. arXiv preprint arXiv:2102.06035, 2021.

Raghu Bollapragada, Tyler Chen, and Rachel Ward. On the fast convergence of minibatch heavy ball momentum. arXiv preprint arXiv:2206.07553, 2022.

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223–311, 2018.

Antonin Chambolle and Charles H Dossal. On the convergence of the iterates of FISTA. Journal of Optimization Theory and Applications, 166(3):25, 2015.

Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/ a1afc58c6ca9540d057299ec3016d726-Paper.pdf.

Y. Cooper. The loss landscape of overparameterized neural networks, 2019. URL https:// openreview.net/forum?id=SyezvsC5tX.

Ashok Cutkosky. Anytime online-to-batch, optimism and acceleration. In International conference on machine learning, pages 1446–1454. PMLR, 2019.

Marc Dambrine, Ch Dossal, Bénédicte Puig, and Aude Rondepierre. Stochastic differential equations for modeling first order optimization methods. HAL preprint hal-03630785, 2022.

Alex Damian, Tengyu Ma, and Jason D. Lee. Label noise SGD provably prefers flat global minimizers. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/ forum?id=x2TMPhseWAW.

Jelena Diakonikolas and Michael I Jordan. Generalized momentum-based methods: A hamiltonian perspective. SIAM Journal on Optimization, 31(1):915–944, 2021.

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054 [cs.LG], 2018.

Mathieu Even, Raphaël Berthier, Francis Bach, Nicolas Flammarion, Pierre Gaillard, Hadrien Hendrikx, Laurent Massoulié, and Adrien Taylor. A continuized view on Nesterov acceleration for stochastic gradient descent and randomized gossip. arXiv preprint arXiv:2106.07644, 2021.

Alexander Vladimirovich Gasnikov and Yu E Nesterov. Universal method for stochastic composite optimization problems. Computational Mathematics and Mathematical Physics, 58:48–64, 2018.

Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. Sgd: General analysis and improved rates. In International conference on machine learning, pages 5200–5209. PMLR, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

Magnus R Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards, 49(6):409–436, 1952.

Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory, pages 545–604. PMLR, 2018.

Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker– planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998. doi: 10.1137/ S0036141096303359. URL https://doi.org/10.1137/S0036141096303359.

Pooria Joulani, Anant Raj, Andras Gyorgy, and Csaba Szepesvári. A simpler approach to accelerated optimization: iterative averaging meets optimism. In International conference on machine learning, pages 4984–4993. PMLR, 2020.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal- gradient methods under the Polyak-Lojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

Ali Kavis, Kfir Y Levy, Francis Bach, and Volkan Cevher. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.

Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.

Donghwan Kim and Jeffrey A Fessler. Another look at the fast iterative shrinkage/thresholding algorithm (fista). SIAM Journal on Optimization, 28(1):223–250, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

A. Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer London, 2013. ISBN 978-1-4471-5361-0.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Maxime Laborde and Adam Oberman. A lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In International Conference on Artificial Intelligence and Statistics, pages 602–612. PMLR, 2020.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

Kfir Y Levy, Alp Yurtsever, and Volkan Cevher. Online adaptive methods, universality and accelera- tion. Advances in neural information processing systems, 31, 2018.

Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=siCt4xZn5Ve.

Chaoyue Liu and Mikhail Belkin. Accelerating sgd with momentum for over-parameterized learning. arXiv preprint arXiv:1810.13395, 2018.

Kuang Liu. Train cifar10 with pytorch. https://github.com/kuangliu/pytorch-cifar, 2017. [Online; accessed 16-May-2024].

Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam, 2018. URL https: //openreview.net/forum?id=rk6qdGgCZ.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

Michael Muehlebach and Michael Jordan. A dynamical systems perspective on nesterov acceleration. In International Conference on Machine Learning, pages 4656–4662. PMLR, 2019.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574– 1609, 2009.

Yurii Nesterov. A method for solving the convex programming problem with convergence rate o(1/k2).Dokl. Akad. Nauk SSSR, 269:543–547, 1983.

Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical programming, 140(1):125–161, 2013.

Mark A Peletier. Variational modelling: Energies, gradient flows, and large deviations. arXiv preprint arXiv:1402.1990, 2014.

Boris T Polyak. Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural net- works. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4430–4438. PMLR, 2018. URL http://proceedings.mlr.press/ v80/safran18a.html.

Bin Shi, Simon S Du, Weijie Su, and Michael I Jordan. Acceleration via symplectic discretization of high-resolution differential equations. Advances in Neural Information Processing Systems, 32, 2019.

Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su. Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, pages 1–70, 2021.

Jonathan W Siegel. Accelerated first-order methods: Differential equations and Lyapunov functions. arXiv preprint arXiv:1903.05671, 2019.

Jonathan W Siegel and Stephan Wojtowytsch. A qualitative difference between gradient flows of con- vex functions in finite-and infinite-dimensional Hilbert spaces. arXiv preprint arXiv:2310.17610, 2023.

Sebastian U. Stich. Unified Optimal Analysis of the (Stochastic) Gradient Method. arXiv e-prints, art. arXiv:1907.04232, July 2019. doi: 10.48550/arXiv.1907.04232.

Sebastian U. Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for sgd with delayed gradients and compressed updates. J. Mach. Learn. Res., 21(1), jun 2022. ISSN 1532-4435.

Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling nesterov’s accelerated gradient method: theory and insights. Advances in neural information processing systems, 27, 2014.

Jaewook J Suh, Gyumin Roh, and Ernest K Ryu. Continuous-time analysis of accelerated gradient methods via conservation laws in dilated coordinate systems. In International Conference on Machine Learning, pages 20640–20667. PMLR, 2022.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139– 1147. PMLR, 2013.

Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over- parameterized models and an accelerated perceptron. In The 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019.

Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

Ashia C Wilson, Ben Recht, and Michael I Jordan. A lyapunov analysis of accelerated methods in optimization. The Journal of Machine Learning Research, 22(1):5040–5073, 2021.

Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis. arXiv:2106.02588 [cs.LG], 2021.

Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. Journal of Nonlinear Science, 33, 2023.

Lei Wu, Mingze Wang, and Weijie J Su. The alignment property of sgd noise and how it helps select flat minima: A stability analysis. In Advances in Neural Information Processing Systems, 2022a.

Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111, 2019.

Xiaoxia Wu, Yuege Xie, Simon Shaolei Du, and Rachel Ward. Adaloss: A computationally-efficient and provably convergent adaptive gradient method. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8691–8699, 2022b.

Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct runge-kutta discretization achieves acceleration. Advances in neural information processing systems, 31, 2018.

Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven Hoi, and E. Weinan. Towards theoretically understanding why sgd generalizes better than adam in deep learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.

We compare SGD and AGNES for the family of objective functions

image

We considered several stochastic estimators with multiplicative gradient scaling such as

• collinear noise  g = (1 + σN)∇f(x), where Nis one-dimensional standard normal.

• isotropic noise  g = ∇f(x) + σ ∥∇f(x)∥√d N, where N is a d-dimensional unit Gaussian.

• Gaussian noise with standard variation  σ∥∇f(x)∥only in the direction orthogonal to  ∇f(x).

• Gaussian noise with standard variation  σ∥∇f(x)∥only in the direction of the fixed vector v = (1, 1)/√2.

• Noise of the form  ∇f(x) +�1 + σ2 ∥∇f(x)∥2 N vwhere  v = (1, 1)/√2and a variable

N which takes values 1 or  −1with probability 12 σ2 ∥∇f(x)∥21+σ2 ∥∇f(x)∥2each; N = 0 otherwise. In this setting, the noise remains macroscopically large at the global minimum, but the probability of encountering noise becomes small.

Numerical results were found to be comparable for all settings on a long time-scale, but the geometry of trajectories may change in the early stages of optimization depending on the noise structure.

For collinear and isotropic noise, the results obtained for  fµ,Lon  R2were furthermore found comparable (albeit not identical) to simulations with a quadratic form on  Rd with d = 10 and

 (d − 1)eigenvalues  = µand one eigenvalue = L

 (d − 1)eigenvalues = L and one eigenvalue  = µ

• eigenvalues equi-spaced between  µ and L.

The evolution of  E[f(xn)]for different values of  σand  L ≥ µ ≡ 1is considered for both SGD and AGNES in Figure 7.

The objective functions are  µ = 1-convex and L-smooth. We use the optimal parameters  α, η, ρderived above for AGNES and the optimal step size  η = 1L(1+σ2) for SGD. The mean of the objective value at each iteration is computed over 1,000 samples for each optimizer.

We ran additional experiments testing a much wider range of hyperparameters for NAG for the task of classifying images from CIFAR-10. The results, presented in Figure 8 indicate that AGNES outperforms NAG with little fine-tuning of the hyperparameters.

We trained ResNet-34 using batch size of 50 for 40 epochs using NAG with learning rate in {8 · 10−5, 10−4, 2 · 10−4, 5 · 10−4, 8 · 10−4, 10−3, 2 · 10−3, 5 · 10−3, 8 · 10−3, 10−2, 2 · 10−2, 5 ·10−2, 8 · 10−2, 10−1, 2 · 10−1, 5 · 10−1} and momentum value in {0.2, 0.5, 0.8, 0.9, 0.99}. These 80 combinations of hyperparameters for NAG were compared against AGNES with the default hyperparameters suggested  α = 10−3(learning rate),  η = 10−2(correction step), and  ρ = 0.99(momentum) as well as AGNES with a slightly smaller learning rate  5 · 10−4(with the other two hyperparameters being the same).

AGNES consistently achieved a lower training loss as well as a better test accuracy faster than any combination of NAG hyperparameters tested. The same random seed was used each time to ensure a fair comparison between the optimizers. Overall, AGNES remained more stable and while other versions of NAG occasionally achieved a higher classification accuracy in certain epochs.

image

Figure 7: We compare AGNES (red) and SGD (blue) for the optimization of  fµ,Lwith  µ = 1and L = 500 (left) /  L = 104(right) for different noise levels  σ = 0(solid line),  σ = 10(dashed) and σ = 50(dotted). In all cases, AGNES improves significantly upon SGD. The noise model used is isotropic Gaussian, but comparable results are obtained for different versions of multiplicatively scaling noise.

B Multiple versions of the AGNES scheme

Lemma 9. The two formulations of the AGNES time-stepping scheme (2) and (3) produce the same sequence of points.

Proof. We consider the three-step formulation (3),

image

and use it to derive (2) by eliminating the velocity variable  vn. If  v0 = 0, then  x′0 = x0. From the definition  x′n, we get αvn = x′n − xn. Substituting this into the definition of  vn+1,

image

Then using this expression for  vn+1to compute  x′n+1,

image

Together with the definition of  xn+1and the initialization  x′0 = x0, this is exactly the two-step formulation (2) of AGNES.

After the completion of this work, we learned of Liu and Belkin [2018]’s Momentum-Added Stochastic Solver (MaSS) method, which generates sequences according to the iteration

image

where  g′nis an estimate for  ∇f(x′n). This is a version of AGNES with the choice  η = η1and the momentum step

image

i.e. MaSS coincides with AGNES for  γ = ρ and η2 = (η − α)ρ.

image

Figure 8: We trained ResNet34 on CIFAR-10 with batch size 50 for 40 epochs using NAG. Training losses are reported as a running average with decay rate 0.999 in the left column and test accuracy after every epoch is reported in the right column. Each row represents a specific value of momentum used for NAG (from top to bottom: 0.99, 0.9, 0.8, 0.5, and 0.2) with learning rates ranging from 8 · 10−5to 0.5. These hyperparameter choices for NAG were compared against AGNES with the default hyperparameters suggested  α = 10−3(learning rate),  η = 10−2(correction step), and ρ = 0.99(momentum) as well as AGNES with a slightly smaller learning rate  5·10−4 (with ρ = 0.99,η = 10−2as well). The same two training trajectories with AGNES are shown in all the plots in shades of blue. The horizontal axes represent the number of optimizer steps.

C Continuous time interpretation of AGNES

For better interpretability, we consider the continuous time limit of the AGNES algorithm. Similar ODE analyses of accelerated first order methods have been considered by many authors, including Su et al. [2014], Siegel [2019], Wilson et al. [2021], Attouch et al. [2022], Aujol et al. [2022a], Zhang et al. [2018], Dambrine et al. [2022].

Consider the time-stepping scheme

image

which reduces to AGNES as in (3) with the choice of parameters  γ1 = α, γ2 = 1. For the derivation of continuous time dynamics, we show that the same scheme arises with the choice  γ1 = γ2 = √α.Lemma 10. Let  ρ ∈ (0, 1)and  η > 0parameters. Assume that  γ1, γ2and  ˜γ1, ˜γ2are parameters such that  ˜γ1˜γ2 = γ1γ2. Consider the sequences  (˜xn, ˜x′n, ˜vn) and (xn, x′n, vn)generated by the time stepping scheme (5) with parameters  (ρ, η, ˜γ1, ˜γ2)and  (ρ, η, γ1, γ2)respectively. If  x0 = ˜x0and γ1v0 = ˜γ1 ˜v0, then xn = ˜xn, x′n = ˜x′n and ˜γ1 ˜vn = γ1 vn for all n ∈ N.

Proof. We proceed by mathematical induction on n. For n = 0, the claim holds by the hypotheses of the lemma. For the inductive hypothesis, we suppose that  xn = ˜xnand  γ1vn = ˜γ1˜vnand prove the claim for n + 1. Note that since  x′n = xn + γ1vn, it automatically follows that  x′n = ˜x′n. This implies that

image

Thus  xn+1 = ˜xn+1 and γ1vn+1 = ˜γ1˜vn+1. The induction can therefore be continued.

Consider the choice of parameters in Theorem 4 by

image

if  µ ≪ L. We denote  h := 1√L(1+σ2) and note that

image

and

image

Depending on which interpretation of  ηwe select, we obtain a different continuous time limit. First, consider the deterministic case  σ = 0. Then

image

Differentiating the first equation and using the system ODEs to subsequently eliminate v from the expression, we observe that

image

i.e. we recover the heavy ball ODE. The alternative interpretation  η = h/√Lcan be analized equivalently and leads to a system

image

which corresponds to a second order ODE

image

This continuum limit is not a simple heavy-ball ODE, but rather a system with adaptive friction modelled by Hessian damping. A related Newton/heavy ball hybrid dynamical system was studied in greater detail by Alvarez et al. [2002]. For L-smooth functions, the  ℓ2-operator norm of  D2f(x)satisfies  ∥D2f(x)∥ ≤ L, i.e. the additional friction term can be as large as√Lin directions corresponding to high eigenvalues of the Hessian. This provides significant regularization in directions which would otherwise be notably underdamped.

Following Appendix F.3, we maintain that the scaling

image

is ‘natural’ as we vary  η, α, ρ. The same fixed ratio is maintained for the scaling choice  η = h/√L as

image

Indeed, numerical experiments in Section A suggest that such regularization may be observed in practice as high eigenvalues in the quadratic map do not exhibit ‘underdamped’ behavior. We therefore believe that Hessian dampening is the potentially more instructive continuum description of AGNES. A similar analysis can be conducted in the stochastic case with the scaling

image

for large  σ. We incorporate noise as

image

under moment bounds on the noise  Nn. The precise noise type depends on the assumptions on the covariance structure of  Nn– noise can point only in gradient direction or be isotropic on the entire space. For small h, the dynamics become deterministic. Again, an alternative continuous time limit is

image

if  ηis scaled towards zero as  h/√L. The first limiting structure is recoverd in the limit  L → ∞. Notably, the noise in the first equation is expected to be non-negligible if  σ ≫√L. A similar analysis can be conducted in the convex case, noting that

image

where  (n + n0 + 3)hroughly corresponds to the time t in the continuous time setting.

D Background material and auxiliary results

In this appendix, we gather a few auxiliary results that will be used in the proofs below. We believe that these will be familiar to the experts and can be skipped by experienced readers.

Recall that if a function f is L-smooth, then

image

For convex functions, this is in fact equivalent to  ∇f being L-Lipschitz.

Lemma 11. If f is convex and differentiable and satisfies (6), then  ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥for all x and y.

Proof. Setting  y = x − 1L∇f(x)in (6) implies that  f(x) − infz f(z) ≥ 12L∥∇f(x)∥2. Applying this to the modified function  fy(x) = f(x) − ∇f(y) · (x − y), which is still convex and satifies (6), we get

image

Note that here we have used the convexity to conclude that  infz fy(z) = fy(y), i.e. that  fyis minimized at y, since by construction  ∇fy(y) = 0(this is the only place where we use convexity!). Swapping the role of x and y, adding these inequalities, and applying Cauchy-Schwartz we get

image

which implies the result.

From the first order strong convexity condition,

image

we deduce the more useful formulation  ∇f(x) · (x − y) ≥ f(x) − f(y) + µ2 ∥x − y∥2. The convex case arises as the special case  µ = 0. We note a special case of these conditions when one of the points is a minimizer of f.

Lemma 12. If f is an L-smooth function and  x∗ is a point such that  f(x∗) = infx∈Rm f(x) then for

image

Proof. This follows from the two first order conditions stated above by noting that  ∇f(x∗) = 0 if x∗is a minimizer of f.

Additionally, L-smooth functions which are bounded from below satisfy the inequality

image

Intuitively, if the gradient is large at a point, then we reduce f quickly by walking in the gradient direction. The L-smoothness condition prevents the gradient from decreasing quickly along our path. Thus if the gradient is larger than a threshold at a point where f is close to inf f, then the inequality f ≥ inf fwould be violated.

Let us record a modified gradient descent estimate, which is used only in the non-convex case. The difference to the usual estimate is that the gradient is evaluated at the terminal point of the interval rather than the initial point.

Lemma 13. For any  x, v and α: If f is L-smooth, then

image

Note that if f is convex, this follows immediately from (6) and the convexity condition  (∇f(y) −∇f(x)) · (y − x) ≥ 0.

Proof. The proof is essentially identical to the standard decay estimate. We compute

image

Now, we turn towards a very brief review of the stochastic process theory used in the analysis of gradient descent type algorithms. Recall that  (Ω, A, P)is a probablity space from which we draw elements  ωnfor gradient estimates  g(x′n, ωn)(AGNES) or  g(xn, ωn)(SGD). We consider  x0as a random variable on  Rm with law Q. Let us introduce the probability space  (�Ω, �A, �P) where

1. �Ω = Rd × �n∈N Ω,

2. �Ais the cylindrical/product  σ-algebra on �Ω, and3. �P = Q × � P.

The product  σ-algebra and product measure are objects suited to events which are defined using only finitely many variables in the product space. A more detailed introduction can be found in [Klenke, 2013, Example 1.63]. We furthermore define the filtration  {Fn}n∈Nwhere  Fnis the  σ-algebra generated by sets of the form

image

In particular, �n∈N Fn ⊆ σ��n∈N Fn� = �Aand, examining the time-stepping scheme, it is immediately apparent that  xn, x′n, vn are Fn-measurable random variables on  �Ω. In particular, they are A-measurable. Alternatively, we can consider  Fnas the  σ-algebra generated by the random variables  x1, x′1, . . . , xn, x′n, i.e. all the information that is known after intialization and taking n gradient steps. All probabilities in the main article are with respect to �P.

Recall that conditional expectations are a technical tool to capture the stochasticity in a random variable X which can be predicted from another random quantity Y . This allows us to quantify the randomness in the gradient estimators  g′nwhich comes from the fact that  xnis a random variable (not known ahead of time) and which randomness comes from the fact that on top of the inherent randomness due to e.g. initialization, we do not compute exact gradients. In particular, even at run time when  xnis known, there is additional noise in the estimators  g′nin our setting due to the selection of  ωn.

In the next Lemma, we recall two important properties of conditional expectations.

Lemma 14. [Klenke, 2013, Theorem 8.14] Let g and h be A-measurable random variables on a probability space  (Ω, A, P) and F ⊆ A be a σ−algebra. Then the conditional expectations E[g | F] and E[h | F] satisfy the following properties:

1. (linearity)  E[αg + βh | F] = α E[g] + β E[h | F] for all α, β ∈ R

2. (tower identity) E[E[g | F]] = E[g]

3. If  g is F−measurable then E[gh | F] = g E[h | F]. In particular, E[g | F] = g

For a more thorough introduction to filtrations and conditional expectations, see e.g. [Klenke, 2013, Chapter 8].  E[g′n|Fn]is the mean of  g′n if all previous steps are already known.

Lemma 15. Suppose  g′n, xn, and x′n satisfy the assumptions laid out in Section 3.1, then the following statements hold

1.  E�g′n|Fn�= ∇f(x′n)

2.  E�∥g′n − ∇f(x′n)∥2�≤ σ2 E�∥∇f(x′n)∥2�.

3.  E[∥g′n∥2] = (1 + σ2)E[∥∇f(x′n)∥2]

4.  E[∇f(x′n) · g′n] = E[∥∇f(x′n)∥2]

Proof. First and second claim. This follows from Fubini’s theorem.

Third claim. The third result then follows by an application of the tower identity with  Fn, expanding the square of the norm as a dot product, and then using the linearity of conditional expectation:

image

Fourth claim. For the fourth result, we observe that since f is a deterministic function and  x′nis Fn-measurable,  ∇f(x′n)is also measurable with respect to the  σ-algebra. Then using the tower identity followed by the third property in Lemma 14,

image

As a consequence, we note the following decrease estimate. Lemma 16. Suppose that  f, x′n, and  g′n = g(x′n, ωn)satisfy the conditions laid out in Section 3.1, then

image

Proof. Using L-smoothness of f,

image

Then taking the expectation and using the results of the previous lemma,

image

In particular, if  η ≤ 1L(1+σ2), then

image

E Convergence proofs: convex case

We first present a convergence result for stochastic gradient descent for convex functions with multiplicative noise scaling. To the best of our knowledge, convergence proofs for this type of noise which degenerates at the global minimum have been given by Bassily et al. [2018], Wojtowytsch [2023] under a Polyak-Lojasiewicz (or PL) condition (which holds automatically in the strongly convex case), but not for functions which are merely convex. We note that, much like AGNES, SGD achieves the same rate of convergence in stochastic convex optimization with multiplicative noise as in the deterministic case (albeit with a generally much larger constant). In particular, SGD with multiplicative noise is more similar to deterministic gradient descent than to SGD with additive noise in this way.

Analyses of SGD with non-standard noise under various conditions are given by Stich and Karim- ireddy [2022], Stich [2019].

Theorem 17 (GD, convex case). Assume that f is a convex function and that the assumptions laid out in Section 3.1 are satisfied. If the sequence  xnis generated by the gradient descent scheme

image

then for any  x∗ ∈ Rm and any n0 ≥ 1 + σ2,

image

In particular, if  η = 1L(1+σ2), n0 = 1 + σ2, and x∗is a point such that  f(x∗) = infx∈Rm f(x), then

image

Proof. Let  n0 ≥ 0and consider the Lyapunov sequence

image

We find that

image

by the convexity of f. The result therefore holds if  n0is chosen large since

image

If  x∗is a minimizer of f then the last claim in the theorem follows by using the upper bound f(x0) − f(x∗) ≤ L2 ∥x0 − x∗∥2from Lemma 12 and substituting  η = 1L(1+σ)2 , n0 = 1 + σ2.

The proofs of Theorems 1 and 3 in this section are constructed in analogy to the simplest setting of deterministic continuous-time optimization. As noted by Su et al. [2014], Nesterov’s time-stepping

scheme can be seen as a non-standard time discretization of the heavy ball ODE 

image

with a decaying friction coefficient. The same is true for AGNES, which reduces to Nesterov’s method in the determinstic case. Taking the derivative and exploiting the first-order convexity condition, we see that the Lyapunov function

image

is decreasing in time along the heavy ball ODE, see e.g. [Su et al., 2014, Theorem 3]. Here  x∗is a minimizer of the convex function f. In particular

image

To prove Theorems 1 and 3, we construct an analogue to L in (7). Note that  αvn = x′n − xnis a discrete analogue of the velocity  ˙xin the continuous setting. Both the proofs follow the same outline. Since Nesterov’s algorithm is a special case of AGNES, we first prove Theorem 3. We present the Lyapunov sequence in a fairly general form, which allows us to reuse calculations for both proofs and suggests the optimality of our approach for Nesterov’s original algorithm.

For details on the probalistic set-up and useful properties of gradient estimators, see Appendix D.2. Let us recall the two-step formulation of AGNES, which we use for the proof,

image

We first prove the alternative version mentioned after Theorem 3 in the main text. Both proofs proceed initially identically and only diverge in Step 3. The reader interested mainly in Theorem 3 is invited to read the first two steps of the proof of Theorem 18 and then skip ahead to the proof of Theorem 3 below.

Theorem 18 (AGNES, convex case,  n0version). Suppose that  xnand  x′nare generated by the time-stepping scheme (3), f and  g′n = g(x′n, ωn)satisfy the conditions laid out in Section 3.1, f is convex, and  x∗ is a point such that  f(x∗) = infx∈Rm f(x). If the parameters are chosen such that

image

then

image

In particular, if  α ≤ η1+2σ2then it suffices to choose  n0 ≥ 2η/α ≥ 2(1 + 2σ2).

Proof. Set-up. Mimicking the continuous time model in (7), we consider the Lyapunov sequence given by

image

where P(n) some function of  n, a(n) = a0 + a1n, and  b(n) = b0 + b1nfor some coefficients a0, a1, b0, b1. Our goal is to choose these in such a way that  Lnis a decreasing sequence.

Step 1. If we denote the first half of the Lyapunov sequence as  L 1n = P(n)E [f(xn) − f(x∗)], then

image

where k is a positive constant that can be chosen later to balance out other terms. Using Lemma 15,

image

where  cη,σ,L = η�1 − L(1+σ2)η2 �. Using convexity,

image

such that the expression becomes

image

We want the terms in this expression to balance the terms in  L 1n+1 − L 1n, so we choose  a1 = 0, i.e.a(n) = a0is a constant. This implies,

image

Step 3. Combining the estimates (8) and (9) from the last two steps,

image

Since  ∥∇f(x′n)∥2 ≥ 0and  ∇f(x′n) · (x′n − x∗) ≥ f(x′n) − f(x∗) ≥ 0, we require the coefficients of these two terms to be non-positive and the coefficient of  ∇f(x′n) · (x′n − xn)to be zero. That gives us the following system of inequalities,

image

Step 4. Now we can choose values that will satisfy the above system of inequalities. We substitute

image

Then (11) holds because

image

We now choose  ηto satisfy  η ≤ 1L(1+σ2), which ensures that  η2 ≤ η�1 − L(1+σ2)η2 �. Consequently, for (12), it suffices to ensure that

image

which is equivalent to showing that the polynomial,

image

is non-negative for all  z ≥ n0. q(z)simplifies to

image

To guarantee that q is non-negative for  z = n + n0 ≥ n0, we require that

image

image

Since q(0) < 0 and q is quadratic, this suffices to guarantee that q is increasing on  [n0, ∞). The first condition reduces to the fact that

image

We can find the minimal admissible value of  n0by the quadratic formula. We first consider the term outside the square root:

image

and thus

n0 ≥ − ηα

image

In particular, in the deterministic case  σ = 0, the choice  n0 = 0is admissible. Notably, we require n0 ≥ 2σ2 ηη = 2σ2. Furthermore, if  α ≤ η1+2σ2 , then

image

so it suffices to choose  n0 ≥ 2η/αin this case.

Step 5. We have shown that the Lyapunov sequence,

image

is monotone decreasing. It follows that

image

If  2η ≤ αn0, we get

image

Finally, if  η = 1L(1+σ2), α = 1L(1+σ2)(1+2σ2),and  n0 = 2(1 + 2σ2), then using Lemma 12, the expression above simplifies to

image

Remark 19. Note that SGD arises as a special case of this analysis if we consider  α = 0, n0 ≥ 2σ2since P(n) is a linear polynomial in this case.

Remark 20. Note that the proof of Theorem 18 implies more generally that

image

even if  n0is not chosen such that  q(n+n0) ≥ 0 for all n. However, q(n+n0) ≥ 0 for allsufficiently large  n ∈ N, i.e. Lndecreases eventually (assuming that  Ln < ∞for all finite n) . More precisely, for given  η, α if

image

Thus a poor choice of  n0will not prevent convergence, but it may delay it.

We now prove the version of this result stated in the main text, for which  ρn = nn+a0+1 with a0 > 2,i.e. with slightly more friction.

Theorem 3 (AGNES, convex case). Suppose that  xnand  x′nare generated by the time-stepping scheme (3),  f and g′n = g(x′n, ωn)satisfy the conditions laid out in Section 3.1, f is convex, and  x∗is a point such that  f(x∗) = infx∈Rm f(x). If the parameters are chosen such that

image

Proof. The proof for this version of Theorem 3 is identical to the proof of Theorem 18 until Step 3, after which we take an alternate approach. Let us recall the expression we got in the beginning of Step 3.

Step 3. We want to show that the bound on the right hand side of the inequality

image

is non-positive. Using convexity and L-smoothness in the form of [Wojtowytsch, 2023, Lemma B.1], we get the inequality

image

which allows us to combine the second and third line in (13), assuming that the coefficient in the second line is non-positive. If this is the case, then the entire right hand side of (13) is bounded from above by

image

As  E [∇f(x′n) · (x′n − xn)]does not have a sign, we choose to set its coefficient to zero, and we require both the coefficient in the second line of (13) and the coefficient of  E[∥∇f(x′n)∥2]in the combined version to be non-positive. Noting that  cη,σ,L ≥ η/2 if η ≤ 1/L(1 + σ2), this leads to the system of inequalities

image

In principle, this approach is more general than that of Theorem 3 as we do not require two terms to be individually non-positive, but only one of them and their weighted sum. In the proof of Theorem 3, a similar role was played by the parameter k, which allowed to shift a small positive term between expressions.

Step 4. Now we can choose the parameters and variables so as to satisfy the inequalities above. We begin by setting  b1 = 1, b0 = 0, k = 0, and choosing  α, η, a0as in the theorem statement. Using (14) as the definition of P(n), we note that

image

where the last equality holds since  α = η/(1 + σ2). Thus for (16) to hold, it suffices that

2αn + α + ηa0 − a0(αn + ηa0) ≤ L��η(2α + ηa0) − 2ηa0α(1 + σ2)�n − η2a20(1 + σ2) + η(α + ηa0)�,

which is equivalent to

image

A linear polynomial is non-negative for all  n ≥ 0if and only if both of its coefficients are. The leading order coefficient in (17) is

image

which is non-positive if and only if  a0 ≥ 2(1−Lη)1−Lη(1+σ2). We remark that it is this part of the computation that forces us to choose  ηstrictly smaller than 1L(1+σ2). In the deterministic case  σ = 0, we would encounter no such limitation as the term would be automatically zero for  η = 1/L. Finally, we consider the constant term in (17) and use the fact that  1 < a0 and α ≤ η

α + ηa0 − a20η + Lη2a20(1 + σ2) − Lη(α + ηa0) = (α + ηa0)(1 − Lη) + a20η(Lη(1 + σ2) − 1)

image

Step 5. The conclusion again follows as in the proof of Theorem 18.

In addition to convergence in expectation, we get almost sure convergence as well. Corollary 5. In the setting of Theorems 3 and 4,  f(xn) → inf fwith probability 1.

The same is of course true for Theorem 18.

Proof. The conclusion follows by standard arguments from the fact that the sequence of expectations E[f(xn) − inf f]is summable: By the previous argument, the estimate

image

holds for some C > 0. Since

image

it suffices to show that  P (lim supn→∞ |f(xn) − inf f| > ε) = 0 for any ε > 0. We further note that for any  N ∈ N we have

image

by Markov’s inequality. As the series over  n−2converges, the expression on the right can be made arbitrarily small by choosing N sufficiently large. Thus the quantity on the left must be zero, which concludes the proof. In the strongly convex case, the series �∞n=1�1 −� µL 11+σ2�nconverges and thus the same argument applies there as well.

Next we turn to NAG. Let us recall the statement of Theorem 1.

Theorem 1 (NAG, convex case). Suppose that  xn and x′n are generated by the time-stepping scheme (1), f and g satisfy the conditions laid out in Section 3.1, f is convex, and  x∗is a point such that f(x∗) = infx∈Rm f(x). If σ < 1and the parameters are chosen such that

image

The expectation on the right hand side is over the random initialization  x0.

Proof. We consider a Lyapunov sequence of the same form as before,

image

where P(n) is some function of  n, a(n) = a0 + a1n, and b(n) = b0 + b1n.

Since Nesterov’s algorithm is a special case of AGNES, after substituting  α = η, the analysis in steps 1, 2, and 3 of the proof of Theorem 3 remains valid. With that substitution, we get the following system of inequalities corresponding to step 3,

image

Using the definition of P(n) from (18), (20) is equivalent to

image

which should still hold in limit as  n → ∞,

image

This implies

image

We can choose  a0 = 2, b(n) = n, and k = η. Then (18) implies that  P(n) = ηn(n + 2). (19) holdsbecause

image

and (20) holds because

image

We have shown that the Lyapunov sequence

image

We emphasize again that this analysis works only if  σ < 1. The condition that  η ≤ 1−σ2L(1+σ2)is imposed by (20) and does not depend on any specific choice of  a0, b0, or b1. On the other hand, (18) forces the rate of convergence to be inversely proportional to  η. This means that as  σapproaches 1, the step size  ηdecreases to zero, and the rate of convergence blows up to infinity. On the other hand, as the proof of Theorem 3 shows, AGNES does not suffer from this problem. Having an additional parameter enables AGNES to converge even if the noise  σis arbitrarily large.

Let us point out how the same techniques used in Theorem 3 can be adapted to prove convergence f(xn) → inf f, even if a global minimizer does not exist. We recall the main statement.

Theorem 7 (Convexity without minimizers). Let f be a convex objective function satisfying the assumptions in Section 3.1 and  xnbe generated by the time-stepping scheme (3). Assume that  η, αand  ρnare as in Theorem 3. Then  lim infn→∞ E[f(xn)] = infx∈Rm f(x).

Proof. The first step follows along the same lines as the proof of Theorem 18 with minor modifications. Note that we did not use the minimizing property of  x∗except for Step 5.2. Assume for the moment that  inf f > −∞.

Assume first that  ε := lim infn→∞ E[f(xn)] − inf f > 0. Select  x∗such that  f(x∗) < inf f + ε/4and define the Lyapunov sequence  Lnjust as in the proof of Theorem 3 with the selected point  x∗.

We distinguish between two situations. First, assume that  n satisfies E�f(x′n)�≥ f(x∗). In this case we find that also  E�f(xn+1) ≤ E�f(x′n)�≤ f(x∗).

On the other hand, assume that  E�f(x′n)�≥ f(x∗) for n = 0, . . . , N. In that case, the proof of Theorem 3 still applies, meaning that  E[f(xN)]cannot remain larger than  f(x∗)+ε/2indefinitely. In either case, we find that there exists  N ∈ N such that E�f(xN)�≤ f(x∗) + ε/2 < lim infn→∞ E[f(xn)].

Note that the proof of Theorem 3 applies with  n′ ≥ n0as a starting point and a non-zero initial velocity  vn. The argument therefore shows that, for every  n′ ∈ Nthere exists  N ∈ Nsuch that E[f(xN)] ≤ lim infn→∞ E[f(xn)]. Inserting the definition of the lower limit, we have reached a contradiction.

We conjecture that the statement holds with the limit in place of the lower limit, but that it is impossible to guarantee a rate of convergence  O(n−β) for any β > 0in this setting. When following this strategy, the key question is how far away the point  x∗must be chosen. For very flat functions such as

image

x∗ may be very far away from the initial point  x0, and the rate of decay can be excrutiatingly slow if minimizers do not exist. For an easy example, we turn to the continuous time model. The solution to the heavy ball ODE

image

is given by

image

for  β = 22+α�4 (3+α)α(2+α)2� 22+α > 0. Ignoring the complicated constant factor, we see that

image

the decay rate can be as close to zero as desired for  αclose to zero, and indeed Siegel and Wojtowytsch [2023] show that no rate of decay can be guaranteed even beyond the situation of algebraic rates. For comparison, the solution of the gradient flow equation

image

Thus, while both the heavy ball ODE and the gradient flow can be made arbitrarily slow in this setting, the heavy ball remains much faster in comparison.

F Convergence proofs: strongly convex case

Bassily et al. [2018], Wojtowytsch [2023] analyze stochastic gradient descent under the PL condition

image

and the noise scaling assumption

image

motivated by Lemma 8. The assumption is equivalent to multiplicative noise scaling within a constant since every L-smooth function which satisfies a PL condition satisfies

image

For completeness, we provide a statement and proof directly in the multiplicative noise scaling regime which attains the optimal constant.

Additionally, we note that strong convexity implies the PL condition. The PL condition holds in many cases where convexity is false, e.g.

image

The set of minimizers {(x, y) : y = sin x} is non-convex, so f cannot be convex. While this result is well-known to the experts, we have been unable to locate a reference and hence provide a proof.

Lemma 21. Assume that  f : Rm → Ris  µ-strongly convex and  C2-smooth. Then f satisfies the PL-condition with constant  µ > 0.

Proof. Let  x, y ∈ Rd. Strong convexity combined with the Cauchy-Schwartz inequality means that

image

Since this is true for  y = x∗, the result follows.

Several results in this vein are also collected in [Karimi et al., 2016, Theorem 2] together with additional generalizations of convexity, but with a suboptimal implication (µ-strongly convex & L-smooth) ⇒ µ/L-PL. The additional implication (convexity & PL)  ⇒strong convexity can also be found there. Theorem 22 (GD, PL condition). Assume that f satisfies the PL-condition (21) and that the assumptions laid out in Section 3.1 are satisfied. Let  xnbe the sequence generated by the gradient descent scheme

image

where  ω1, ω2, . . .are elements of  Ωwhich are drawn independently of each other and the initial condition  x0. Then the estimate

image

holds for any  n ∈ N. Additionally, the sequence  xnconverges to a limiting random variable  x∞almost surely and in  L2 such that f(x∞) ≡ inf falmost surely.

image

and compute by Lemma 16 that

image

The proof of almost sure convergence is identical to the corresponding argument in [Wojtowytsch, 2023, Theorem 2.2] and similar in spirit to that of Corollary 5.

As usual, the optimal step-size is  η = 1L(1+σ2) as used in Figure 1.

Just like the convex case, we first prove Theorem 4 and set up the Lyapunov sequence with variable coefficients that can be chosen as per the time-stepping scheme. The continuous time analogue in this case is the heavy-ball ODE

image

For  µ-strongly convex f, a simple calculation shows that the Lyapunov function

image

satisfies  L ′(t) ≤ −√µ L (t) and thus

image

See for instance [Siegel, 2019, Theorem 1] for details.

Here, we state and prove a slightly generalized version of Theorem 4 in the main text. While we assumed an optimal choice of parameters in the main text, we allow for a suboptimal selection here. Theorem 4 (AGNES, strongly convex case – general version). In addition to the assumptions in Theorem 3, suppose that  f is µ-strongly convex and that

image

Note that  E�f(x0) − f(x∗) + µ2 ∥x0 − x∗∥2�≤ 2E [f(x0) − f(x∗)]due to Lemma 12. A discussion about the set of admissible parameters is provided after the proof. We note several special cases here.

µη1+σ2, strongly resembling Theorem 3.

2. If additionally  η = 1/(L(1 + σ2))is chosen optimally, then we recover the decay rate 1 −�µ/L /(1 + σ2)claimed in the main text.

3. We recover the gradient descent algorithm with the choice  α = 0which is achieved for ψ = η√µ. This selection is admissible in our analysis since

image

As expected, the constant of decay is  1 − √µ ψ = 1 − µη, as achieved in Theorem 22. In this sense, our analysis of AGNES interpolates fully between the optimal AGNES scheme (a NAG-type scheme in the deterministic case) and (stochastic) gradient descent. However, this proof only applies in the strongly convex setting, but not under a mere PL assumption.

4. If  µ < L– i.e. if  f(x) ̸≡ A + µ∥x − x∗∥2for some  A ∈ Rand  x0 ∈ Rm– then we canchoose  0 < ψ < √µ η, corresponding to  α < 0. In this case, the gradient step is sufficiently strong to compensate for momentum taking us in the wrong direction. Needless to say, this is a terrible idea and the rate of convergence is worse than that of gradient descent.

Proof. Set-up. Consider the Lyapunov sequence

image

for constants b, a to be chosen later. We want to show that there exists some decay factor  0 < δ < 1such that  Ln+1 ≤ δLn.

Step 1. Let us consider the first term. Note that

image

Step 2. We now turn to the second term and use the definition of  x′n+1 from (2),

image

To simplify notation, we introduce two new dependent variables:

image

With these variables, we have

image

Taking expectation of the square, we find that

image

Step 3. We now use strong convexity to deduce that

image

= (c2 − cψµ) E�∥x′n − xn∥2�+ 2ac E�(x′n − xn) · (x′n − x∗)�+�a2 − aψµ�E�∥x′n − x∗∥2�− 2cψE [f(x′n) − f(xn)] − 2aψ E [f(x′n) − f(x∗)] + ψ2(1 + σ2) E�∥∇f(x′n)∥2�.

Step 4. We now add the estimates of Steps 1 and 3:

image

The smallest decay factor we can get at this point is the coefficient of  E[f(xn) − f(x∗)]. So we hope to show that  Ln+1 ≤ cψLn, which leads to the following system of inequalities on comparing it

with the coefficients in the upper bound that we obtained in the previous step,

image

Step 5. Now we try to choose constants such that the system of inequalities holds. We assume that η ≤ 1L(1+σ2). Then since  η2 ≤ η�1 − L(1+σ2)η2 �, for (29) it suffices that  (1 + σ2)ψ2 ≤ η, i.e.

image

Note that (27) implies  ψ = 1/band substituting that into (25), we get  c = b − a. Using this, (28) is equivalent to

image

if  µ, ψ > 0. Finally (23) implies

image

and (24) implies

image

where we have used Lemma 12 for strong convexity in the last step. When the parameters are chosen optimally, i.e.  η = 1L(1+σ2)and  ψ =�

η1+σ2 = 1√L(1+σ2), we get  ρ, αand the convergence rate as stated in the theorem.

We focus on the meaningful case in which √µψ > 0. As discussed in Section 3, for given f, g we can replace  L, σby larger values  L′, σ′ and µby a smaller value  µ′. Let us briefly explore the effect of these substitutions. The parameter range described in this version of Theorem 4 can be understood as a three parameter family of AGNES parameters  η, α, ρparametrized by  η, ψ, µ′and constraints given by  L, µ, σ as

image

The parameter map is given by

image

We can converesely obtain √µ′ ψ from ρsince the function  z �→ (1 − z)/(1 + z)is its own inverse and thus √µ′ ψ = 1−ρ1+ρ. In particular, in terms of the algorithms parameters, the decay rate is

image

Furthermore, we see that

image

since  ψ > 0. Thus, at the cost of a more complicated representation, we could work directly in the parameter variables rather than using the auxiliary quantities  ψ, µ′. In particular, both the parameter map and its inverse are continuous on D and its image respectively. Hence,despite the rigid appearance of the parameter selection in Theorem 4, there exists an open set of admissible parameters  η, α, ρfor which we obtain exponentially fast convergence.

We provide a more general version of Theorem 2 as well. Just as in the convex case, as  σ ↗ 1, the step size  ηdecreases to zero and the theorem fails to guarantee convergence for  σ > 1.

Theorem 2 (NAG, strongly convex case). In addition to the assumptions in Theorem 1, suppose that f is µ-strongly convex and the parameters are chosen such that

image

Proof. Consider the Lyapunov sequence

image

where a and b are to be determined later. Since NAG is a special case of AGNES with  α = η, the first four steps are identical to the proof of Theorem 4. We get the following system of inequalities,

image

Substituting (31) into (32), we get  (a + c)2 = 1η and ψ = η/√η = √η. Thus (35) simplifies to

image

which is equivalent to

image

From (35),  b = 1/ψ = 1/√η. The rest of the inequalities can be verified to work with  a = √µ, c =b − a, ρ = b−ab+a. This shows that  Ln+1 ≤ cψLn = (1 − √µη)Ln. Finally, we get

image

Two different AGNES parameters are associated with momentum:  αand  ρ. In this section, we disentangle their respective contributions to keeping AGNES stable for highly stochastic noise.

For simplicity, first consider the case  f : R → R, f(x) = x and g(x) = (1 + σN) f ′(x) where N isa standard normal random variable. Then

image

since  v0 = 0. In particular, we note that

image

and

image

due to the independence of different gradient estimators between time steps. In particular, we see that

1. as  ρbecomes closer to 1, the eventual magnitude of the velocity variable increases as limn→∞ E∥vn∥ = ρ1−ρ.

2. as  ρbecomes closer to 1, the eventual variance of the velocity variable increases as limn→∞ E�∥vn − E[vn]∥2�= ρ21−ρ2 .

3. the noise in the normalized velocity estimate asymptotically satisfies

image

Thus, if  ρis closer to 1, both the magnitude and the variance of the velocity variable increase, but the the relative importance of noise approaches zero as  ρ → 1. This is not surprising – if  ρis close to 1, the sequence  ρn decays much slower than if  ρis small. Gradient estimates from different times enter at a similar scale and cancellations can occur easily. As the influence of past gradients remains large, we say that the momentum variable has a ‘long memory’.

Of course, when minimizing a non-linear function f, the gradient is not constant, and we face a trade-off:

1. A long memory allows us to cancel random oscillations in the gradient estimates more easily.

2. A long memory also means we compute with more out-of-date gradient estimates from points much further in the past along the trajectory.

Naturally, the relative importance of the first point increases with the stochasticity  σof the gradient estimates. Even if the gradient evaluations are deterministic, we benefit from integrating historic information gained throughout the optimization process, but the rate at which we ‘forget’ outdated information is much higher.

Thus the parameter  ρcorresponds to the rate at which we forget old information. It also impacts the magnitude of the velocity variable. The parameter  αcompensates for the scaling of  vnwith 1/(1 − ρ). We can think of  ρas governing the rate at which we forget past gradients, and  αas a measure of the confidence with which we integrate past gradient information into time-steps for x. Let us explore this relationship in strongly convex optimization. In Theorem 4, the optimal choice of hyper-parameters is given by  η = 1L(1+σ2) and

image

Let us consider the simplified regime  µ ≪ L in which

image

In particular, we note: The larger  σ, the closer  ρis to 1, i.e. the longer the memory we keep. The relative importance of the momentum step compared to the gradient step, on the other hand, remains constant, depending only on the ‘condition number’  L/µ.

We note that also in the convex case, high stochasticity forces  n0to be large, meaning that  ρnis always close to 1. Notably for generic non-convex objective functions, it is unclear that past gradients along the trajectory would carry useful information, as there is no discernible geometric relationship between gradients at different points. This mirrors an observation of Appendix G, just after Theorem 23.

G AGNES in non-convex optimization

We consider the case of non-convex optimization. In the deterministic setting, momentum methods for non-convex optimization have recently been studied by Diakonikolas and Jordan [2021]. We note that the algorithm may perform worse than stochastic gradient descent, but that for suitable parameters, the performance is comparable to that of SGD within a constant factor.

Theorem 23 (Non-convex case). Assume that f satisfies the assumptions laid out in Section 3.1. Let η, α, ρbe such that

image

Then

image

If  v0 = 0, the bound is minimal for gradient descent (i.e.  α = 0) since the decay factor  ε =η − α(1 + σ2)is maximal.

Proof. Consider

image

for a parameter  λ > 0to be fixed later. We have

image

by Lemmas 13 and 16. We deduce that

image

image

The first condition implies that  λ = (αρ2)−1, so the second one reduces to

image

Finally, we consider the last equation. If

image

then we find that

image

and hence

image

H Proof of Lemma 8: Scaling intensity of minibatch noise

In this appendix, we provide theoretical justification for the multiplicative noise scaling regime considered in this article. Recall our main statement:

Lemma 8 (Noise intensity). Assume that  ℓ(h, y) = ∥h − y∥2and  h : Rm × Rd → Rksatisfies ∥∇wh(w, xi)∥2 ≤ C�1 + ∥w∥�pfor some C, p > 0 and all  w ∈ Rmand i = 1, . . . , N. Then for all  w ∈ Rm:

image

as the average of a quantity is the unique value which minimizes the mean square discrepancy: EX = argmina∈R E�|X − a|2�. We further find by Hölder’s inequality that

image

I Implementation aspects

We discuss some implementation in this section. All the code used for the experiments in the paper has been provided in the supplementary materials. The experiments in section Section 5 and Appendix A were run on Google Colab for compute time less than an hour. The experiments in Section 5.2 were run on a laptop CPU with compute time less than an hour. The experiments in Sections 5.3 and 5.4 were run on a single current generation GPU in a local cluster for up to 50 hours. An additional compute of no more than 200 hours on a single GPU was used for experiments which were ultimately not used in the submitted version.

All neural-network based experiments were performed using the PyTorch library. Gradient-based optimizers in PyTorch and TensorFlow are implemented in such a way that gradients are computed outside of the optimizer and the point returned by an optimizer step is the point for the next gradient evaluation. This strategy facilitates the manual manipulation of gradients by scaling, clipping or masking to train only a subset of the network parameters.

The approach is theoretically justified for SGD. Guarantees for NAG and AGNES, on the other hand, are given for  f(xn)rather than  f(x′n), i.e. not at the point where the gradient is evaluated. A discrepancy arises between theory and practice.3 In Algorithm 1, this discrepancy is resolved by taking a final gradient descent step in the last time step and returning the sequence  x′n at intermediate steps. In our numerical experiments, we did not include the final gradient descent step. Skipping the gradient step in particular allows for an easier continuation of simulations beyond the initially specified stopping time, if so desired. We do not anticipate major differences under realistic circumstances. This can be justified analytically in convex and strongly convex optimization, at least for a low learning rate.

image

Proof. By essentially the same proof as Lemma 16, we have

image

since the correction term to linear approximation is bounded by the L-Lipschitz continuity of  ∇fboth from above and below. Recall furthermore that

image

for all L-smooth functions. Thus

image

In particular, if  1 − 3Lη > 0, then

image

The condition  η < 1/(3L)is guaranteed if the stochastic noise scaling satisfies  σ >√2since then 1 − 3Lη ≥ 1 − 31+σ2 . For η = 1/((1 + σ2)L, we than find that

image

Weight decay is a machine learning tool which controls the magnitude of the coefficients of a neural network. In the simplest SGD setting, weight decay takes the form of a modified update step

image

for  λ > 0. A gradient flow is governed by (1) an energy to be minimized and (2) an energy dissipation mechanism [Peletier, 2014]. It is known that different energy/dissipation pairings may induce the same dynamics – for instance, Jordan et al. [1998] show that the heat equation is both the  L2-gradientflow of the Dirichlet energy and the Wasserstein gradient flow of the entropy function.

In this language, weight decay can be interpreted in two different ways:

1. We minimize a modified objective function  x �→ f(x) + λ2 ∥x∥2which includes a Tikhonovregularizer. The gradient estimates are stochastic for f and deterministic for the regularizer. This perspective corresponds to including weight decay as part of the energy.

2. We dynamically include a confinement into the optimizer which pushes back against large values of  xn. This perspective corresponds to including weight decay as part of the dissipation.

In GD, both perspectives lead to the same optimization algorithm. In advanced minimizers, the two perspectives no longer coincide. For Adam, Loshchilov and Hutter [2018, 2019] initiated a debate on the superior strategy of including weight decay. We note that the two strategies do not coincide for AGNES, but do not comment on the superiority of one over the other:

image

2. Treating weight decay as a component of the objective function to be minimized leads to the update rule

image

In our numerical experiments, we choose the second approach, viewing weight decay as a property of the objective function rather than the dissipation. This coincides with the approach taken by the SGD (and SGD with momentum) optimizer as well as Adam (but not AdamW).

NeurIPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: We wrote the abstract and introduction with the goal to summarize our main contributions accurately and precisely. Guidelines:

• The answer NA means that the abstract and introduction do not include the claims made in the paper.

• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We compare the algorithm proposed to commonly used methods both in convex optimization and deep learning. We dedicate Section 3 to the derivation of the noise modelling assumption and illustrate the heuristics which are being made. Guidelines:

• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

• The authors are encouraged to create a separate "Limitations" section in their paper.

• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes] Justification: All assumptions are summarized in Section 3.1. Wherever additional assumptions are made, they are stated clearly in the proof. Complete and correct proofs for all the lemmas and theorems are provided in the appendices. Guidelines:

• The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced.

• All assumptions should be clearly stated or referenced in the statement of any theorems.

• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

• Theorems and Lemmas that the proof relies upon should be properly referenced.

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: All code from experiments is provided in the supplementary materials. Guidelines:

image

• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

image

Justification: All code as well as the synthetically generated data used for the regression experiments are provided in the supplementary materials.

image

• The answer NA means that paper does not include experiments requiring code. • Please see the NeurIPS code and data submission guidelines (https://nips.cc/ public/guides/CodeSubmissionPolicy) for more details.

• While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

• The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https: //nips.cc/public/guides/CodeSubmissionPolicy) for more details.

• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

Question: Does the paper specify all the training and test details (e.g., data splits, hyper-parameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

image

Justification: All experimental settings are described in the article and its supplementary materials. They can also be inferred in the code provided.

image

• The answer NA means that the paper does not include experiments. • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

• The full details can be provided either with the code, in appendix, or as supplemental material.

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

image

Justification: All experiments were repeated multiple times. We provide means and standard deviations over all runs.

image

• The answer NA means that the paper does not include experiments.

• The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

• The assumptions made should be given (e.g., Normally distributed errors). • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

• It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

• If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

image

• The answer NA means that the paper does not include experiments. • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

image

Justification: The work presented here is primarily theoretical. No human subjects were involved. The datasets used are standard benchmark datasets (MNIST, CIFAR-10) or purely synthetic. No direct social consequences are anticipated.

image

• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

image

Justification: The main contribution of the work is an algorithm for smooth convex optimization. Foundational as the topic at large may be in various fields, it is impossible to link directly to societal impact.

image

• The answer NA means that there is no societal impact of the work performed. • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

image

Justification: The main contribution of the work is theoretical and no data or models with a high risk for misuse are produced.

image

• The answer NA means that the paper poses no such risks.

• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

• Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

image

Justification: We use the MNIST and CIFAR-10 datasets, which are cited accurately. We also use an implementation of ResNets, for which we cite the GitHub repository and reproduce the license terms in the code provided.

image

• The answer NA means that the paper does not use existing assets. • The authors should cite the original paper that produced the code package or dataset.

• The authors should state which version of the asset is used and, if possible, include a URL.

• The name of the license (e.g., CC-BY 4.0) should be included for each asset. • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

• If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

image

• The answer NA means that the paper does not release new assets.

• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

• The paper should discuss whether and how consent was obtained from people whose asset is used.

• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

image

• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

image

• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

Designed for Accessibility and to further Open Science

Thank you Gupta Kanan, Siegel Jonathan W., Wojtowytsch Stephan, who authored Nesterov acceleration despite very noisy gradients 🙏 This page is the html of their arXiv pdf, with no changes made other than format. Please cite their work