Deep network architectures are becoming increasingly complex, often containing parameters that can be grouped according to multiple functionalities, such as gating, attention, convolution, and generation. Such parameter groups should arguably be treated differently during training, as their gradient statistics might be highly distinct. Adaptive gradient methods designate parameter-wise learning rates based on gradient histories, treating such parameters groups differently and, in principle, promise to be better suited for training complex neural network architectures.
Nonetheless, advances in neural architectures have not been matched by progress in adaptive gradient descent algorithms. SGD is still prevalent, in spite of the development of seemingly more sophisticated adaptive alternatives, such as RMSProp (Dauphin et al., 2015) and Adam (Kingma & Ba, 2015). Such adaptive methods have been observed to yield poor generalization compared to SGD in classification tasks (Wilson et al., 2017), and hence have been mostly adopted for training complex models (Vaswani et al., 2017; Arjovsky et al., 2017). For relatively simple architectures, such as ResNets (He et al., 2016a) and DenseNets (Huang et al., 2017), SGD is still the dominant choice.
At a theoretical level, concerns have also emerged about the current crop of adaptive methods. Recently, Reddi et al. (2018) has identified cases, even in the stochastic convex setting, where Adam (Kingma & Ba, 2015) fails to converge. Modifications to Adam that provide convergence guarantees have been formulated, but have shortcomings. AMSGrad (Reddi et al., 2018) requires non-increasing learning rates, while AdamNC (Reddi et al., 2018) and AdaBound (Luo et al., 2019) require that adaptivity be gradually eliminated during training. Moreover, while most of the recently proposed variants do not provide formal guarantees for non-convex problems, the few current convergence rate analyses in the literature (Zaheer et al., 2018; Chen et al., 2019) do not match SGD’s. Section 3 fully details the convergence rates of the most popular Adam variants, along with their shortcomings.
Our contribution is marked improvements to adaptive optimizers, from both theoretical and practical perspectives. At the theoretical level, we focus on convergence guarantees, deriving new algorithms:
• Delayed Adam. Inspired by Zaheer et al. (2018)’s analysis of Adam, Section 4 proposes a simple modification for adaptive gradient methods which yields a provable convergence rate of in the stochastic non-convex setting – the same as SGD. Our modification can be implemented by swapping two lines of code and preserves adaptivity without incurring extra memory costs. To illustrate these results, we present a non-convex problem where Adam fails to converge to a stationary point, while Delayed Adam – Adam with our proposed modifi-cation – provably converges with a rate of
• AvaGrad. Inspecting the convergence rate of Delayed Adam, we show that it would improve with an adaptive global learning rate, which self-regulates based on global statistics of the gradient second moments. Following this insight, Section 5 proposes a new adaptive method, AvaGrad, whose hyperparameters decouple learning rate and adaptability.
Through extensive experiments, Section 6 demonstrates that AvaGrad is not merely a theoretical exercise. AvaGrad performs as well as both SGD and Adam in their respectively favored usage scenarios. Along this experimental journey, we happen to disprove some conventional wisdom, finding adaptive optimizers, including Adam, to be superior to SGD for training CNNs. The caveat is that, excepting AvaGrad, these methods need extensive grid search to outperform SGD, often requiring unconventional hyperparameter values to even yield competitive performance.
AvaGrad is a uniquely attractive adaptive optimizer, as it decouples the learning rate and its adaptability parameter, making hyperparameter search significantly faster. In particular, given a computational budget similar to SGD’s, AvaGrad yields near best results over a wide range of tasks.
2.1. Notation
For vectors , we use the following notation:
for element-wise divi- sion (
for element-wise square root (
for element-wise addition (
for element-wise multiplication (
). Moreover,
is used to denote the
-norm: other norms will be specified whenever used (e.g.,
For subscripts and vector indexing, we adopt the following convention: the subscript t is used to denote an object related to the t-th iteration of an algorithm (e.g., denotes the iterate at time step t); the subscript i is used for indexing:
denotes the i-th coordinate of
. When used together,
denotes the i-th coordinate of
2.2. Stochastic Non-Convex Optimization
In the stochastic non-convex setting, we are concerned with the optimization problem:
where D is a probability distribution over a set S of “data points”. We also assume that f is M-smooth in w, as is typically done in non-convex optimization:
Methods for stochastic non-convex optimization are evaluated in terms of number of iterations or gradient evaluations required to achieve small loss gradients. This differs from the stochastic convex setting where convergence is measured w.r.t. suboptimality . We assume that the algorithm takes a sequence of data points
from which it deterministically computes a sequence of parameter settings
together with a distribution P over {1, . . . , T}. We say an algorithm has a
O(g(T)) where, as defined above,
We also assume that the functions have bounded gradients: there exists some
such that
for all
and
. Throughout the paper, we also let
denote an upper bound on
Here we present a brief overview of optimization methods commonly used for training neural networks, along with their convergence rate guarantees for stochastic smooth non-convex problems. We consider methods which, at each iteration t, receive or compute a gradient estimate:
and perform an update of the form:
where is the global learning rate,
are the parameter-wise learning rates, and
is the update direction, typically defined as:
Non-momentum methods such as SGD, AdaGrad, and RMSProp (Dauphin et al., 2015; Duchi et al., 2011) have ), while momentum SGD and Adam (Kingma & Ba, 2015) have
. Note that while
can always be absorbed into
, representing the update in this form will be convenient throughout the paper.
SGD uses the same learning rate for all parameters, i.e., . Although SGD is simple and offers no adaptation, it has a convergence rate of
with either constant, increasing, or decreasing learning rates (Ghadimi & Lan, 2013), and is widely used when training deep networks, especially CNNs (He et al., 2016a; Huang et al., 2017). At the heart of its convergence proof is the fact that
Popular adaptive methods such as RMSProp (Dauphin et al., 2015), AdaGrad (Duchi et al., 2011), and Adam (Kingma & Ba, 2015) have is given by:
As is an estimate of the second moments of the gradients, the optimizer designates smaller learning rates for parameters with larger uncertainty in their stochastic gradients. However, in this setting
are no longer independent, hence
. This “bias” can cause RMSProp and Adam to present convergence issues, even in the stochastic convex setting (Reddi et al., 2018).
Recently, Zaheer et al. (2018) showed that, with a constant learning rate, RMSProp and Adam have a convergence rate of , where
, hence their result does not generally guarantee convergence.
Chen et al. (2019) showed that AdaGrad and AMSGrad enjoy a convergence rate of when a decaying learning rate is used. Note that both methods constrain
in some form, the former with
(adaptability diminishes with t), and the latter explicitly enforces
for all
is point-wise non-increasing). In both cases, the method is less adaptive than Adam, and the rates above are worse than SGD’s
We first take a step back to note the following: to show that Adam might not converge in the stochastic convex setting, Reddi et al. (2018) provide a stochastic linear problem where Adam fails to converge w.r.t. suboptimality. Since non-convex optimization is evaluated w.r.t. norm of the gradients, a different instance is required to characterize Adam’s behavior in this setting.
The following result shows that even for a quadratic problem, Adam indeed does not converge to a stationary point:
Theorem 1. For any and constant
there is a stochastic convex optimization problem for which Adam does not converge to a stationary point.
Proof. The full proof is given in Appendix A 1. The argument follows closely from Reddi et al. (2018), where we explicitly present a stochastic optimization problem:
We show that, for large enough C (as a function of Adam will move towards
the constraint
does not make w = 1 a stationary point.
This result, like the one in Reddi et al. (2018), relies on the fact that and
are correlated: upon a draw of the rare sample
, the learning rate
decreases significantly and Adam takes a small step in the correct direction. On the other hand, a sequence of common samples increases
and Adam moves faster towards w = 1.
Instead of enforcing to be point-wise non-increasing in t (Reddi et al., 2018), which forces the optimizer to take small steps even for a long sequence of common samples, we propose to simply have
be independent of
. As an extra motivation for this approach, note that successful proof strategies (Zaheer et al., 2018) to analyzing adaptive methods include the following step:
where bounding , seen as a form of bias, is a key part of recent convergence analyses. Replacing
by
in the update equation of Adam removes this bias and can be implemented by simply swapping lines of code (updating
after w), yielding a simple convergence analysis without hindering the adaptability of the method in any way.
Algorithm 1 provides pseudo-code when applying this mod-ification to Adam, yielding Delayed Adam. The following Theorem shows that this modification is enough to guarantee a SGD-like convergence rate of in the stochastic non-convex setting for general adaptive gradient methods.
Theorem 2. Consider any optimization method which updates parameters as follows:
where , and
are independent of
. Assume that
is M-smooth, and
for all
. Moreover, let
For α
then:
where P assigns probabilities
Proof. The full proof is given in Appendix B, along with analysis for the case with momentum in Appendix B.1, and in particular
, which yields a similar rate.
The convergence rate depends on are random variables for Adam-like algorithms. However, if there are constants H and L such that
for all i and t, then a rate of is guaranteed.
This is the case for Delayed Adam, where for all t and i. Theorem 2 also requires that
and
are independent of
, which can be assured to hold by applying a “delay” to their respective computations, if necessary (i.e., replacing
, as in Delayed Adam).
Additionally, the assumption that that a single sample should not affect the distribution of
, is required since P is conditioned on the samples S (unlike in standard analysis, where Z =
and
is deterministic), and is expected to hold as
Practitioners typically use the last iterate or perform early-stopping: in this case, whether the assumption holds or not does not affect the behavior of the algorithm. Nonetheless, we also show in Appendix B.2 a similar rate that does not require this assumption to hold, which also yields a
convergence rate taken that the parameter-wise learning rates are bounded from above and below.
Now, we consider the implications of Theorem 2 for De-
layed Adam, where , and hence
For a fixed , chosen a-priori (that is, without knowledge of the realization of
), we can optimize
to minimize the worst-case rate using
and
. This yields
, and a convergence rate linear in
, suggesting that, at least in the worst case,
should be chosen to be as large as possible, and the learning rate
should scale linearly with
What if we allow to vary in each time step? For example, choosing
yields a convergence rate of
Using we see that in the worst-case this is also linear in
. However, this dependence differs from the one with fixed
in a few aspects. Most
notably, if we consider different scalings of (e.g., small
and scaling
), the convergence rate with fixed
can get arbitrarily worse: in (10), we get that the numerator grows quadratically with the scaling, while the denominator only grows linearly.
On the other hand, for the convergence rate remains unchanged since in (12), both
and
grow linearly with the scaling and hence its effect is cancelled out.
For the particular case yielding the exact same convergence rate as SGD including constant factors. However, a constant
results in a dependence on the scale of
, which again can be large if the second moment estimate is either large or small. Lastly, normalizing the learning rate by
removes its dependence on
in the worst-case setting, making the two hyperparameters more separable.
Motivated by the above observations, the choice of , where
is added for normalization effects, yields a method which we name AvaGrad – Adaptive VAriance Gradients, presented as pseudo-code in Algorithm 2. We call it an “adaptive variance” method since, if we scale up or down the variance of the gradients, the convergence guarantee in Theorem 2 does not change, while for global learning rates that are independent of
(as in Adam and other adaptive methods), it can be arbitrarily bad.
Figure 1: Plots of Adam, AMSGrad, and Delayed Adam trained on the synthetic example in Equation 13, with a stationary point at The expected iterate sampled uniformly from
, for each iteration t. As predicted by our theoretical results, Adam moves towards
, while Delayed Adam converges to
expected norm squared of the gradient, for w randomly sampled from
. Delayed Adam converges significantly faster than AMSGrad, while Adam fails to converge.
6.1. Synthetic Data
To illustrate empirically the implications of Theorem 1 and Theorem 2, we set up a synthetic stochastic optimization problem with the same form as the one used in the proof of Theorem 1:
This function has a stationary point , and it satisfies Theorem 1 for
We proceed to perform stochastic optimization with Adam, AMSGrad, and Delayed Adam, with constant learning rate
. For simplicity, we let P be uniform over
is constant.
for each iteration t: as expected,Adam fails to converge to the stationary point
, while both AMSGrad and Delayed Adam converge. Note that Delayed Adam converges significantly faster, likely because it has no constraint on the learning rates.
6.2. Image Classification on CIFAR
Our theory suggests that, in the worst case, should be chosen as large as possible, at which point the learning rate
should scale linearly with it. As a first experiment to assess this hypothesis, we analyze the interaction between
and
when training a Wide ResNet 28-4 (Zagoruyko & Komodakis, 2016) on the CIFAR dataset (Krizhevsky, 2009). The CIFAR-10 and CIFAR-100 datasets consist of 60,000 RGB images with
pixels and comes with a standard train/test split of 50,000 and 10,000 images, respectively.
Following Zagoruyko & Komodakis (2016), we pre-process the dataset by performing channel-wise normalization using statistics computed from the training set. We also flip each image horizontally with 50% probability and perform random cropping by first padding 4 black pixels to each image and then extracting a random
We use a validation set of 5,000 images to evaluate the performance of SGD and different adaptive gradient methods: Adam, AMSGrad, AdaBound (Luo et al., 2019; Savarese, 2019), AdaShift (Zhou et al., 2019), and our proposed algorithm, AvaGrad. We also assess whether performing weight decay as proposed in Loshchilov & Hutter (2019) instead of standard regularization positively impacts the performance of adaptive methods: we do this by evaluating AdamW and AvaGradW.
The learning rate is decayed by a factor of 5 at epochs 60, 120 and 160, and the model is trained for a total of 200 epochs with a weight decay of 0.0005. We use a mini-batch size of 128, and each model is trained on a single GPU. For SGD, we use a momentum of 0.9, while for adaptive methods we use the default and
. For AdaBound, we use the default final learning rate
and
for the bound functions. Finally, for AdaShift, we use the default stack size n = 10.
We run each adaptive method with different powers of 10 for , multiplied by
, a value large enough such that adaptability can be effectively ignored. We
Figure 2: Validation error of a Wide ResNet 28-4 trained on the CIFAR-10 dataset with Adam (left) and AvaGrad (right), for different values of the learning rate and parameter
, where larger
yields less adaptability. Best performance is achieved with small adaptability (
also vary the learning rate of each method with different powers of 10, multiplied by 1 and 5, from
up to 5000. In total, we evaluate 441 different hyperparameter settings for each adaptive method.
Figure 2 shows the results for Adam and AvaGrad. Our main findings are twofold:
• The optimal for every adaptive method is considerably larger than the values typically used in practice, ranging from 0.1 (Adam, AMSGrad, AvaGradW) to 10.0 (AvaGrad, AdamW). For Adam and AMSGrad, the optimal learning rate is
, a value 100 times larger than the default.
• All adaptive methods, except for AdaBound, outperform SGD in terms of validation performance. Note that for SGD the optimal learning rate is , matching the value used in work such as He et al. (2016a); Zagoruyko & Komodakis (2016); Xie et al. (2017), which presented state-of-the-art results at time of publication.
However, the fact that adaptive methods outperform SGD in this setting is not conclusive, since they are executed with more hyperparameter settings (varying as well as
Moreover, the main motivation for adaptive methods is to be less sensitive to hyperparameter values; performing an extensive grid search defeats their purpose.
Aiming for a fair comparison between SGD and adaptive methods, we also train a Wide ResNet 28-10 on both CIFAR-10 and CIFAR-100, evaluating the test performance of each adaptive method with its optimal values for and
found in the previous experiment. For SGD, we confirmed that the learning rate
still yielded the best validation performance with the new architecture, hence the fact that we transfer hyperparameters from the Wide ResNet 28-4 runs does not unfairly advantage adaptive methods in the comparison with SGD. With a larger network and a different task (CIFAR-100), this experiment should also capture how hyperparameters of adaptive methods transfer between tasks and models.
On CIFAR-10, SGD achieves 3.86% test error (reported as 4% in Zagoruyko & Komodakis (2016)) and is outperformed by both Adam (3.64%) and AvaGrad (3.80%). On CIFAR-100, SGD (19.05%) is outperformed by Adam (18.96%), AMSGrad (18.97%), AdaShift (18.88%), AvaGrad (18.76%), and AvaGradW (19.04%). We believe these results are surprising, as they show that adaptive methods can yield state-of-the-art performance when training CNNs as long as their adaptability is correctly controlled with
6.3. Image Classification on ImageNet
As a final evaluation of the role of adaptability when training convolutional networks, we repeat the previous experiment on the ImageNet dataset (Russakovsky et al., 2015), a challenging benchmark composed of 1.2M training and 50,000 validation RGB images sampled from a total of 1,000 classes. We follow Gross & Wilber (2016) for data augmentation (scale and color transformations) and use 224x224 single-crops to compute the top-1 accuracy on the validation set.
We train a ResNet-50 (He et al., 2016b) with SGD and different adaptive methods, transferring the hyperparameters from our original CIFAR-10 results. The network is trained
Figure 3: Validation bits-per-character (lower is better) of a 3-layer LSTM with 300 hidden units, trained on the Penn Treebank dataset with Adam (left) and AvaGrad (right), for different values of the learning rate and parameter
larger
yields less adaptability. Best performance is achieved with high adaptability (
Table 1: Test performance of SGD and popular adaptive methods in benchmark tasks. Red indicates results with the recommended optimizer, following the paper that proposed each model, and any improved performance is given in blue. The best result for each task is in bold, and numbers in parentheses present standard deviations of 3 runs for CIFAR.
for 100 epochs with a batch size of 256 on 4 GPUs (batch size of 64 per GPU), the learning rate is decayed by a factor of 10 at epochs 30, 60 and 90, and a weight decay of 0.0001 is applied.
SGD yields 24.01% top-1 validation error, underperforming Adam (23.45%), AMSGrad (23.46%), AvaGrad (23.58%) and AvaGradW (23.49%) – a total of 4 out of the 6 adaptive methods evaluated on the dataset. Table 1 summarizes the results.
In contrast to numerous papers that surpassed the state-of-the-art on ImageNet by training networks with SGD (Si- monyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016a;b; Zagoruyko & Komodakis, 2016; Xie et al., 2017), our results show that adaptive methods can yield superior results in terms of generalization performance.
Most strikingly, we observed that Adam outperformed more sophisticated methods such as AMSGrad, AdaBound, and AdamW. Note that the hyperparameter values we used for SGD match the ones in He et al. (2016a), He et al. (2016b) and Gross & Wilber (2016): an initial learning rate of 0.1 with a momentum of 0.9.
6.4. Language Modelling with RNNs
It is perhaps not very surprising that to perform optimally in the image classification tasks studied previously, adaptive gradient methods required large values of , and hence were barely adaptive.
Here, we consider a task where state-of-the-art results are not typically achieved by SGD, but by adaptive methods with low values for : language modelling with recurrent networks. In particular, we perform character-level language modelling on the Penn Treebank dataset (Mar- cus et al., 1994; Mikolov et al., 2010), which consists of 5.01M/393k/442k training/validation/test tokens, respectively, and a vocabulary size of 10,000.
Following Merity et al. (2018), we train 3-layer LSTMs (Hochreiter & Schmidhuber, 1997) with a character embedding size of 200 and varying size for the LSTM cells. The model is trained for a total of 500 epochs, and the learning rate is decayed by 10 at epochs 300 and 400. We use a batch size of 128, a BPTT length of 150, and weight decay of
As in Merity et al. (2018), we apply weight dropout with p = 0.5 to the LSTM’s hidden-to-hidden matrix, variational dropout (Gal & Ghahramani, 2016) with p = 0.1 for the input/output layers, p = 0.25 for the LSTM layers, and p = 0.1 to the columns of the embedding matrix (embedding dropout).
We first evaluate the validation performance of SGD, Adam, AMSGrad, AdaShift, AdaBound, AdamW, AvaGrad and AvaGradW with varying learning rate and adaptability parameter
, when training a 3-layer LSTM with 300 hidden units in each layer. We vary the learning rate
in powers of 10 multiplied by 2: from 0.0002 up to 20; for the adaptability parameter
, we vary the values in powers of 10, multiplied by
Figure 3 shows that, in this task, smaller values for are indeed optimal: Adam, AMSGrad and AvaGrad performed best with
. The optimal learning rates for both Adam and AMSGrad,
, agree with the value used in Merity et al. (2018). Both AvaGrad and AvaGradW performed best with
: the former with
, the latter with
Next, we train a larger model: a 3-layer LSTM with 1000 hidden units per layer (the same model used in Merity et al. (2018), where it was trained with Adam), choosing values for which yielded the best validation performance in the previous experiment. For SGD, we again confirmed that a learning rate of 20 performed best on the validation set.
Table 1 (right column) reports all results. In this setting, AvaGrad and AvaGradW outperform Adam, achieving bit-per-characters of 1.179 and 1.175 compared to 1.182. The poor performance of AdaBound could be caused by convergence issues or due to the default values for its hyperparameters: Savarese (2019) showed that that the bound functions strongly affect the optimizer’s behavior and might require careful tuning.
When combined with the previous results, we see that adaptive methods actually dominate SGD across tasks of different domains. In particular, both Adam and AvaGrad outperformed SGD in all 4 considered tasks.
6.5. Hyperparameter Separability and Domain-Independence
We observed that, given enough budget for hyperparameter tuning, Adam can actually outperform SGD in tasks such as image classification with CNNs, where adaptive methods have traditionally found little success. But can we decrease the cost of hyperparameter search?
One of the main motivations behind AvaGrad is that it removes the dependence between the learning rate adaptability parameter
, at least in the worst-case rate of Theorem 2. Observing the heatmaps in Figure 2 and 3, we can see that indeed AvaGrad offers great separability between
, unlike Adam.
In particular, for values larger than has little to no interaction with the learning rate
, as opposed to Adam where the optimal
increases linearly with
. For language modelling on Penn Treebank, the optimal learning rate for AvaGrad was
for every choice of
, while for image classification on CIFAR-10, we had
for all except two values of
. This shows that AvaGrad enables a grid search over
(naively, with quadratic complexity) to be broken into two line searches over
and
separately (linear complexity). In the context of Section 6.2, this leads to a decrease from
to
total trials for hyperparameter search, only twice as many as SGD’s budget. The cost of optimizing both SGD and Adam with a fixed
is the same as fully optimizing AvaGrad. Therefore, unless
is chosen optimally a-priori, AvaGrad dominates both SGD and Adam given the same budget and coarseness for hyperparameter tuning.
As neural architectures become more complex, with parameters having highly heterogeneous roles, parameter-wise learning rates are often necessary for training. However, adaptive methods have both theoretical and empirical gaps, with SGD outperforming them in some tasks and having stronger theoretical convergence guarantees. In this paper, we close this gap, by first providing a convergence rate guarantee that matches SGD’s, and by showing that, with proper hyperparameter tuning, adaptive methods can dominate in both computer vision and natural language processing tasks. Key to our finding is AvaGrad, our proposed optimizer whose adaptability is decoupled from its learning rate.
Our experimental results show that proper tuning of the learning rate together with the adaptability of the method is necessary to achieve optimal results in different domains, where distinct neural network architectures are used across tasks. By enabling this tuning to be performed in linear time, AvaGrad takes a leap towards efficient domain-agnostic training of general neural architectures.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv:1701.07875, 2017.
Chen, X., Liu, S., Sun, R., and Hong, M. On the conver- gence of a class of adam-type algorithms for non-convex optimization. ICLR, 2019.
Dauphin, Y., de Vries, H., Chung, J., and Bengio, Y. Rm- sprop and equilibrated adaptive learning rates for non-convex optimization. corrL, 2015.
Duchi, J., Hazan, E., , and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. ICML, 2011.
Gal, Y. and Ghahramani, Z. A theoretically grounded ap- plication of dropout in recurrent neural networks. NIPS, 2016.
Ghadimi, S. and Lan, G. Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming. SIAM, 2013.
Gross, S. and Wilber, M. Training and investigating resid- ual nets. https://github.com/facebook/fb. resnet.torch, 2016.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CVPR, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. ECCV, 2016b.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 1997.
Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. CVPR, 2017.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ICLR, 2015.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
Loshchilov, I. and Hutter, F. Decoupled weight decay regu- larization. ICLR, 2019.
Luo, L., Xiong, Yuanhao, Liu, Yan, and Sun, X. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.
Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, 1994.
Merity, S., Shirish Keskar, N., and Socher, R. An Anal- ysis of Neural Language Modeling at Multiple Scales. arXiv:1803.08240, 2018.
Mikolov, T., Karafiát, M., Burget, L., Cernocký, J. H., and Khudanpur, S. Recurrent neural network based language model. INTERSPEECH, 2010.
Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. ICLR, 2018.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F. Imagenet large scale visual recognition challenge. IJCV, 115(3), 2015.
Savarese, P. On the Convergence of AdaBound and its Connection to SGD. arXiv:1908.04457, 2019.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CVPR, 2015.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. NIPS, 2017.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The Marginal Value of Adaptive Gradient Methods in Machine Learning. NIPS, 2017.
Xie, S., Girshick, R. B., Dollár, P., Tu, Z., and He, K. Aggre- gated residual transformations for deep neural networks. CVPR, 2017.
Zagoruyko, S. and Komodakis, N. Wide residual networks. BMVC, 2016.
Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. NIPS, 2018.
Zhou, Z., Zhang, Q., Lu, G., Wang, H., Zhang, W., and Yu, Y. Adashift: Decorrelation and convergence of adaptive learning rate methods. ICLR, 2019.
A. Proof of Theorem 1
Proof. Consider the following stochastic optimization problem:
The proof follows closely from Reddi et al. (2018)’s linear example for convergence in suboptimality. We assume w.l.o.g. that . Consider:
where the expectation is over all the randomness in the algorithm up to time t, as all expectations to follow in the proof. Note that as follows:
As for , we have, from Jensen’s inequality:
Now, remember that
and thus:
Plugging in the bounds for in Equation 16:
Hence, for large enough while the above quantity becomes non-negative, and hence
In other words, Adam will, in expectation, drift away from the stationary point, towards w = 1, at which point
For example,
implies that
. To see that w = 1 is not a stationary point due to
the feasibility constraints, check that : that is, the negative gradient points towards the feasible region.
B. Proof of Theorem 2
Proof. Throughout the proof we use the following notation for clarity:
We start from the fact that f is M-smooth:
and use the update
where in the first step we used the fact that , in the second we used
, in the third we used Cauchy-Schwarz, and in the fourth we used
, along with
Now, taking the expectation over , and using the fact that
are both independent of
where in the second step we used and
Re-arranging, we get:
Now, taking the conditional expectation over all samples S given Z:
where in the second step we used which follows from the assumption that
p(Z), and the third step follows from a telescoping sum, along with the fact that
. Now, using
Then, taking the expectation over Z:
Now, let
where we used the fact that
Now, recall that
Setting and checking that
Recalling that proves the claim.
B.1. The case with first-order momentum
For the case , assume that
in Equation 32:
where in the last step we used
Similarly to the guarantee in Equation 33, we can show a convergence rate if we further assume that there exist constants H and L such that
for all i and t (i.e., the parameter-wise learning rates are bounded away from zero and also from above), and that
is bounded similarly. For example, having
B.2. The case with unconditional distribution over iterates
To show a similar bound without the assumption that , we can alternatively bound
and
using the worst case over possible samples S. From (26) we have, with
Now, define with
instead. As long as
does not depend on S, Z is no longer a random variable. Following the same steps as above leads to the following:
In particular, if there are constants , can bound
and
, yielding:
Hence a follows as long as
can be upper and lower bounded accordingly by constants.