Stagewise Enlargement of Batch Size for SGD-based Learning

2020·Arxiv

Abstract

Abstract

Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in less parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can make a generalization gap more easily. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods about setting batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between initialization and optimum of the model parameter. Then based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set proper batch size for SGD. More specifically, SEBS adopts a multi-stage scheme, and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD which decreases learning rate by stage, SEBS can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for SGD, momentum SGD and AdaGrad. Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.

1. Introduction

Many machine learning models can be formulated as the following empirical risk minimization (ERM) problem:

where w denotes the model parameter, denotes the set of training instances sampled from distribution ) denotes the loss on the i-th training instance.

With the rapid growth of data, stochastic gradient descent (SGD) and mini-batch SGD (Robbins and Monro, 1951; Bottou, 1998) have become the most popular methods for solving the ERM problem in (1), and many variants of SGD have been proposed. Among these algorithms, the classical and most widely used one is the stagewise SGD which has

been adopted in (Krizhevsky et al., 2012; He et al., 2016). Stagewise SGD is based on a multi-stage learning scheme. At the s-th stage, it runs the following iterations:

where is the initialization, is a mini-batch of instances randomly sampled from is the learning rate which is a constant at each stage and decreases geometrically by stage. After the s-th stage is completed, the algorithm randomly picks a parameter from or the last one the initialization of the next stage. For stagewise SGD with S stages, the computation complexity (total number of gradient computation) is ity (total number of parameter updates) is Recently, some work (Yuan et al., 2019) theoretically proves that the stagewise SGD is better than the original SGD which adopts the polynomially decreased learning rate under the weakly quasi-convex and PolyakLojasiewicz (PL) condition. Classical stagewise SGD methods (Krizhevsky et al., 2012; He et al., 2016) mainly focus on how to set the learning rate for a given constant batch size which is typically not too large.

From (2), we can find that given a fixed computation complexity, a larger batch size will result in less parameter updates. In distributed training, each parameter update typically needs one time of communication, and hence a larger batch size will result in less frequent communication. Furthermore, a larger batch size can typically better utilize the computing power of current multi-core systems like GPU to reduce computation time, as long as the mini-batch does not exceed the memory or computing limit of the system. Figure 1 gives an example to show that enlarging batch size can reduce computation time. Hence, we need to choose a larger batch size for SGD to reduce computation time if we do not take generalization error into consideration. However, a larger batch size can make a generalization gap more easily (??). Some work (?) points out that we need to train longer (with higher computation complexity) for larger batch training to achieve a similar generalization error as that of smaller batch training. This is contrary to the original intention of large batch training. Hence, how to set a proper batch size for SGD has become an interesting but challenging topic.

There have appeared some works proposing heuristic methods for large batch training (Goyal et al., 2017; You et al., 2017; McCandlish et al., 2018). Compared to classical stagewise SGD methods with a small constant batch size and stagewisely decreased learning rate, these large batch methods need more tricks, which should be carefully tuned on different models and data sets. Furthermore, theoretical guarantee about the iteration complexity and generalization error of these methods is missing. In addition, in our experiments we find that these methods might increase generalization error if a large batch size is adopted from the initialization.

There have also appeared some other methods proposing to dynamically set the batch size. (Friedlander and Schmidt, 2012; Byrd et al., 2012; De et al., 2017; Yin et al., 2018) relate the batch size with the noise of stochastic gradients. These methods need to determine the batch size in each iteration, which will bring much extra cost. (Smith et al., 2018) increases the batch size by relating SGD with a stochastic differential equation. However, the theoretical guarantee about the iteration complexity and generalization error is missing.

Figure 1: Using an NVIDIA V100 GPU to train models on CIFAR10. The y-axis denotes computation time per epoch. In ResNet20, volatile gpu-util achieves 100% when b = 4096. In ResNet56, volatile gpu-util achieves 100% when b = 512.

Furthermore, some work (Yu and Jin, 2019) uses the stagewise training strategy. At each stage, the batch size starts from a small constant and is geometrically increased by iteration. However, the scaling ratio for the batch size cannot be large for convergence guarantee. Furthermore, in our experiments we also find that it might increase generalization error.

In this paper, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set proper batch size for SGD. The main contributions of this paper are outlined as follows:

• We first provide theoryto show that a proper batch size is related to the gap between initialization and optimum of the model parameter. Then based on this theory, we propose SEBS which adopts a multi-stage scheme and enlarges the batch size geometrically by stage.

• We theoretically prove that decreasing learning rate and enlarging batch size have the same effect on the performance of stagewise SGD.

• We theoretically prove that, compared to classical stagewise SGD which decreases learning rate by stage, SEBS can reduce the number of parameter updates (iteration complexity) without increasing generalization error when the total number of gradient computation (computation complexity) is fixed.

• Besides SGD, SEBS is also suitable for momentum SGD and adaptive gradient descent (AdaGrad) (Duchi et al., 2010). We also provide theoretical results about the number of parameter updates (iteration complexity) for momentum SGD and AdaGrad. To the best of our knowledge, this is the first work that analyzes the effect of batch size on the convergence of AdaGrad

• Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.

2. Preliminaries

First, we give the following notations. denotes the denotes the norm. denotes the optimal solution (optimum) of (1). denotes the stochastic gradient of the mini-batch to denote the j-th element of a.

We also make the following assumptions.

Assumption 1 The variance of stochastic gradient is bounded:

Assumption 2

-weakly quasi-convex (

-Polyak Lojasiewicz () condition:

Recently, both weak quasi-convexity and PL condition have been observed for many machine learning models, including deep neural networks (Charles and Papailiopoulos, 2018; Yuan et al., 2019). The -PL condition also implies a quadratic growth (Karimi et al., 2016), i.e., 2. Another inequality (Nesterov, 2004) used in this paper is ). Please note that these two inequalities do not need the convex assumption. We call the conditional number of F(w) under PL condition.

3. SEBS

In this section, we present the details of SEBS for SGD, including the theory about the relationship between batch size and model initialization, SEBS algorithm, theoretical analysis about the training error and generalization error.

3.1 Relationship between Batch Size and Model Initialization

We start from the vanilla SGD with a constant batch size and learning rate, which can be written as follows:

where . The computation complexity (total number of gradient computation) is denote a value randomly sampled from

We aim to find how large the batch size can be without loss of performance. First, we can obtain the following property about (3):

Lemma 1 By setting

Remark 2 Another common upper bound for

which uses the bounded gradient assumption . Comparing to Assumption 1, we can see that the bounded gradient assumption in (?) omits the effect of batch size.

Based on (4), we can get a learning rate ) which minimizes the right term of (4). In fact, (4) also implies a proper batch size. Using C = Mb, we rewrite the right term of (4) as follows:

Then, we have:

To make ) get the minimum, the corresponding batch size and learning rate should satisfy:

where ) is from Lemma 1.

From (5), we can find that given a fixed computation complexity C, a proper batch size is related to the gap between the initialization and optimum of the model parameter. More specifically, the smaller the gap between the initialization and optimum of the model parameter is, the larger the batch size can be.

The theory of this subsection provides theoretical foundation for designing the SEBS algorithm in the following subsection.

3.2 SEBS Algorithm

In classical stagewise SGD (Krizhevsky et al., 2012; He et al., 2016), we can see that at each stage it actually runs the vanilla SGD with a constant batch size and learning rate. After each stage, it decreases the learning rate geometrically. In (Yuan et al., 2019), both theoretical and empirical results show that after each stage there is a geometric decrease in the training loss. This means that the gap between the current model parameter and the optimal solution (optimum) is smaller than that of previous stages. Based on the theory about the relationship between the batch size and model initialization from Section 3.1, we

can actually enlarge the batch size in the next stage. Inspired by this, we propose our algorithm called stagewise enlargement of batch size (SEBS) for SGD-based learning.

SEBS adopts a multi-stage scheme, and enlarges the batch size geometrically by stage. The detail of SEBS is presented in Algorithm 1. We can find that SEBS divides the whole learning procedure into S stages. At the s-th stage, SEBS runs the penalty SGD in Algorithm 2, denoted as denotes the loss function in (1), I denotes the training set, is the coefficient of a quadratic penalty, ˜initialization of the model parameter at the is the batch size at the s-th stage, is a constant learning rate, and is the computation complexity at the s-th stage. The output of pSGD, denoted as ˜, will be used as the model parameter initialization for the next stage.

The penalty SGD is a variant of vanilla SGD. Compared to the vanilla SGD, there is an additional quadratic penalty in penalty SGD. If SGD degenerates to the vanilla SGD. The quadratic penalty has been widely used in many recent variants of SGD (Allen-Zhu, 2018; Yu and Jin, 2019; Chen et al., 2019b,a; Yuan et al., 2019). Although it may slow down the convergence rate, it can improve the generalization ability.

3.3 Theoretical Analysis about Training Error

First, we have the following one-stage training error for SEBS:

we have:

where is the output of pSGD and M = C/b.

We can find that the one-stage training error for SEBS is similar to that in (4). Hence, we can set the batch size of each stage in SEBS according to the gap . Particularly, we can get the following convergence result:

be the sequence produced by

where

In SEBS, if we set ) which is a constant, and set the batch size as

which means , according to Theorem 4, we can obtain the computation complexity of SEBS:

This result is consistent with that in (Yuan et al., 2019) which sets Hence, by setting the batch size according to (8), SEBS achieves the same performance as classical stagewise SGD on computation complexity. Please note that when the loss function F(w) is strongly convex, which means 1, the proved computation complexity above is optimal (?).

The iteration complexity of SEBS is as follows:

Then we can get the following conclusions:

• Compared to classical stagewise SGD which decreases learning rate by stage and adopts a constant batch size, SEBS reduces the iteration complexity from is the upper bound for

• We can also observe that the iteration complexity of SEBS is independent of the variance , and hence is independent of the dimension d;

• According to (7) in Theorem 4, in order to get the convergence result, we need to keep the relation between the batch size and learning rate in each stage as follows:

This relation implies that the following two strategies for adjusting batch size and learning rate:

are equivalent in terms of training error. Both of them will not affect the computation complexity. Please note that strategy (a) has been widely adopted in classical stagewise training methods, and strategy (b) is proposed in SEBS.

3.4 Theoretical Analysis about Generalization Error

In this section, we will analyze the generalization error of SEBS. The main tool we used for the generalization error is the uniform stability (Hardt et al., 2016), which is defined as follows:

Definition 5 A randomized algorithm -uniformly stable if for all data sets such that differ in at most one instance, we have

where ˜is the output of A on data set

It has been proved (Hardt et al., 2016) that if -uniformly stable, then

where ˜w is the output of A on data set I. Hence, in the following content, we consider the two data sets differing in only a single instance which is indexed by be the output of SEBS on data set be the sequences produced by SEBS at the last stage, be the corresponding randomly selected mini-batch of instances, i = 1, 2. We omit the subscript s and use to denote the learning rate, batch size and computation complexity in the last stage. We also define Following (Hardt et al., 2016), we assume that . Then we have the following property about

Lemma 6 For one specific , then we get

If , then we get

Using the recursive relation of , we get the following uniform stability of SEBS:

defined in Theorem 4, we obtain

where

According to Theorem 7, we obtain the following two conclusions:

• This uniform stability is consistent with (Yuan et al., 2019). The stability error only depends on the computation complexity and has nothing to do with the batch size of each stage.

• Compared to the classical non-penalty SGD in (2) which actually corresponds to the penalty SGD with 1, penalty SGD with a finite can improve the stability, and hence improve the generalization error.

Since , by setting ), we obtain a generalization error ) for SEBS.

4. SEBS for Momentum SGD and AdaGrad

Momentum SGD (mSGD) (Polyak, 1964; Tseng, 1998; Ghadimi and Lan, 2013, 2016) and adaptive gradient descent (AdaGrad) (McMahan and Streeter, 2010; Duchi et al., 2010) have been two of the most important and popular variants of SGD. In the following content, we will show that SEBS is also suitable for momentum SGD and AdaGrad. To the best of our knowledge, existing research on AdaGrad only analyzes the convergence property with b = 1. This is the first work that analyzes the effect of batch size on the convergence of AdaGrad.

4.1 SEBS for Momentum SGD

Here, we propose to adapt SEBS for momentum SGD. The resulting algorithm is called mSEBS, which is presented in Algorithm 3. mSEBS divides the whole learning procedure into S stages. At each stage, mSEBS runs the Polyak’s momentum SGD (Polyak, 1964) which is presented in Algorithm 4. Please note that mSEBS will reset the momentum to zero after each stage for the convenience of convergence proof. This is different from some

mSGD implementations like that on PyTorch which does not reset the momentum to zero. In our experiments, we find that this difference does not have significant influence.

Similar to SEBS for SGD, mSEBS can also achieve ) computation complexity and )) iteration complexity which is independent of . Due to space limitation, we move the related theorems to the supplementary material.

4.2 SEBS for AdaGrad

Here, we propose to adapt SEBS for AdaGrad. The resulting algorithm is called AdaSEBS, which is presented in Algorithm 5. AdaSEBS also divides the whole learning procedure into S stages. At the s-th stage, AdaSEBS runs ) which is presented in Algorithm 6. In particular, AdaGrad runs the following iterations:

where is a diagonal matrix, in which the diagonal ele- ment is defined as existing research, is typically set to 0.5 for convex loss functions (McMahan and Streeter, 2010; Duchi et al., 2010) and is typically set to 1 for strongly convex loss functions (Duchi et al., 2010; ?).

Let be the sequence produced by AdaGrad in Algorithm 6. According to (Duchi et al., 2010), we obtain

then

e.g., , the square root operation will make the upper bound of bad. Hence in this work, although ) is not necessarily strongly convex, we still set = 1 and get the following one-stage training error for AdaSEBS:

Lemma 8 (One-stage training error for AdaSEBS) Let be the sequence produced by AdaGrad(. Then we have

where is the output of AdaGrad, . Furthermore, if we choose , then we have

According to Lemma 8, we actually prove a ) error while (Duchi et al., 2010) proves a ) error, where C = Mb. Furthermore, our one-stage training error is independent of the input . Since the exact upper bound of is usually unknown, we can set a large without loss of training error in our presented AdaGrad. Hence, the error bound in (10) is better than that in (Duchi et al., 2010) in which a large to a large error.

Then we have the following convergence result for AdaSEBS:

be the sequence produced by Alorithm 5, setting

we obtain

Similar to SEBS for vanilla SGD, we also obtain ) computation complexity and )) iteration complexity.

Recently, there is another work about stagewise AdaGrad, called SADAGrad (Chen et al., 2018), which mainly focuses on the case that is sparse. SADAGrad adopts a constant batch size and decreases the learning rate geometrically by stage. Under convex and quadratic growth condition, SADAGrad achieves an iteration complexity of which is dependent on dimension d. Hence, AdaSEBS is better than SADAGrad.

5. Experiments

First, we verify the theory about the relationship between batch size and model initialization. We consider a synthetic problem:

where , each data is sampled from the gaussian distribution N(0, I), and D is a diagonal matrix with D(i, i) = i. The corresponding 100, respectively, and the optimal solution of (11) is . We run vanilla SGD in (3) with a fixed computation complexity C = n to solve (11). We set the model parameter initialization 100]. For each x, we aim to find the optimal batch size that can achieve the smallest value of is the output of vanilla SGD algorithm in (3). We repeat 50 times and the average result about the optimal batch size is presented in Figure 2. We can find that the optimal batch size is almost proportional to 1, and a larger learning rate implies a larger optimal batch size. These phenomenons are consistent with our theory in (5) where for a fixed computation complexity C.

Figure 2: Relationship between the optimal batch size and model initialization in vanilla SGD for solving problem (11).

Next, we consider a real problem which trains ResNet20 with 0.0001 weight decay on CIFAR10. The experiments are conducted on the PyTorch platform with an NVIDIA V100 GPU (32G GPU memory). For classical stagewise methods, we follow (He et al., 2016) which divides the learning rate by 10 at the 80, 120 epochs. According to our theory that ), in SEBS, mSEBS and AdaSEBS, the learning rate is constant and the batch size is scaled by at the 80, 120 epochs. We set 12 for illustration. In the experiments about vanilla SGD, we also compare SEBS with DB-SGD (Yu and Jin, 2019) in which the scaling ratio for batch size is 1.02. The initial batch size of these methods is 128. In the experiments about momentum SGD, we also compare mSEBS with the large batch training method LARS (You et al., 2017). The poly power and warm-up of LARS are the same as that in (You et al., 2017). We set the batch size, based learning rate, scaling factor of LARS as 4096, 3.2, 0.01. The results are presented in Figure 3. We can find that SEBS, mSEBS and AdaSEBS can achieve similar performance, measured based on computation complexity (epochs), as classical stagewise counterparts respectively, especially when is large. When measured based on iteration complexity which is directly related to computation time or wall-clock time, SEBS, mSEBS and AdaSEBS are more efficient than their classical stagewise counterparts respectively. In particular, classical stagewise counterparts expend 62.5k parameter updates, while SEBS with = 12 only expends 32.6k parameter updates. Since DB-SGD increases the batch size in every epoch, it falls into a local minimum and the accuracy is worse than SEBS. We also try some other scaling ratios for DB-SGD and DB-SGD still cannot achieve performance as good as classical stagewise SGD and SEBS on either training loss or test accuracy. Different from DB-SGD, SEBS increases the batch size after a stage which contains several epochs, and hence it achieves better performance than DB-SGD. Although LARS expends fewer parameter updates than mSEBS, it only achieves test accuracy of 90.97%, while mSGD and mSEBS with

Figure 3: Learning curves of training ResNet20 on CIFAR10 with different methods. The learning rates for SGD, mSGD and AdaGrad experiments are 0.5, 0.1, 1, respectively. The in SGD experiment is 10in AdaGrad experiments is 1.

Table 1: Empirical results on ImageNet. mSGD* is the momentum SGD implemented on PyTorch, in which the momentum is not reset to zero at the 30, 60 epochs.

achieve test accuracy of 91.74%. We also try to set the scaling factor of LARS as that in (You et al., 2017), but the test accuracy further drops 5%.

We also compare mSEBS with momentum SGD (mSGD) by training ResNet18 and ResNet50 with 0.0001 weight decay on ImageNet. Data augmentation and initialization of w (including the parameters of batch normalization layers) follow the code of PyTorch mSGD and mSEBS, the initial batch size is 256 and the learning rate is 0.1. Following (He et al., 2016), we divide the learning rates of mSGD by 10 and scale the batch size of mSEBS by 12, at the 30, 60 epochs. The results are presented in Table 1. We can see that mSEBS achieves the same performance as momentum SGD on test accuracy. mSEBS scales the batch size to 36k after 60 epochs and saves about 64% parameter updates in total. We also run the large batch training method in (Smith et al., 2018) to train ResNet50: the initial batch size and learning rate are 8192 and 3.2 respectively, the batch size is scaled by 10 at the 30 epoch, the learning rate is divided by 10 at the 60, 80 epochs. Although the method in (Smith et al., 2018) expends fewer parameter updates, its accuracy drops 2.4%. Hence, SEBS is better than classical stagewise methods and more universal than large batch training methods.

6. Conclusion

In this paper, we propose a novel method called SEBS to set proper batch size for SGDbased machine learning. Both theoretical and empirical results show that SEBS can reduce the number of parameter updates without loss of training error and test accuracy, compared to classical stagewise SGD methods.

References

Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In Advances in Neural Information Processing Systems, 2018.

L´eon Bottou. Online learning in neural networks. chapter Online Learning and Stochastic Approximations, pages 9–42. 1998.

Richard H. Byrd, Gillian M. Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.

Zachary B. Charles and Dimitris S. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In Proceedings of the International Conference on Machine Learning, 2018.

Zaiyi Chen, Yi Xu, Enhong Chen, and Tianbao Yang. SADAGRAD: strongly adaptive stochastic gradient methods. In Proceedings of the International Conference on Machine Learning, pages 912–920, 2018.

Zaiyi Chen, Yi Xu, Haoyuan Hu, and Tianbao Yang. Katalyst: Boosting convex katayusha for non-convex problems with a large condition number. In Proceedings of the International Conference on Machine Learning, 2019a.

Zaiyi Chen, Zhuoning Yuan, Jinfeng Yi, Bowen Zhou, Enhong Chen, and Tianbao Yang. Universal stagewise learning for non-convex problems with convergence on averaged solutions. In Proceedings of the International Conference on Learning Representations, 2019b.

Soham De, Abhay Kumar Yadav, David W. Jacobs, and Tom Goldstein. Automated infer- ence with adaptive batches. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2017.

John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for on- line learning and stochastic optimization. In Proceedings of the Conference on Learning Theory, pages 257–269, 2010.

Michael P. Friedlander and Mark W. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3), 2012.

Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.

Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.

Priya Goyal, Piotr Doll´ar, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33nd International Conference on Machine Learning, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of Conference on Computer Vision and Pattern Recognition, 2016.

Hamed Karimi, Julie Nutini, and Mark W. Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-lojasiewicz condition. In Proceedings of Machine Learning and Knowledge Discovery in Databases, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. CoRR, abs/1812.06162, 2018.

H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Conference on Learning Theory, pages 244– 256, 2010.

Yurii E. Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, volume 87 of Applied Optimization. Springer, 2004.

Boris Polyak. Some methods of speeding up the convergence of iteration methods. Ussr Computational Mathematics and Mathematical Physics, 4:1–17, 12 1964.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. In Proceedings of the International Conference on Learning Representations, 2018.

Paul Tseng. An incremental gradient(-projection) method with momentum term and adap- tive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.

Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. A unified analysis of stochastic momentum methods for deep learning. In Proceedings of the International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 2018.

Dong Yin, Ashwin Pananjady, Maximilian Lam, Dimitris S. Papailiopoulos, Kannan Ram- chandran, and Peter L. Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2018.

Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. CoRR, abs/1708.03888, 2017.

Hao Yu and Rong Jin. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. In Proceedings of the International Conference on Machine Learning, 2019.

Zhuoning Yuan, Yan Yan, Rong Jin, and Tianbao Yang. Stagewise training accelerates convergence of testing error over sgd. In Advances in Neural Information Processing Systems, 2019.

Appendix A. SEBS

A.1 Proof of Lemma 3

According to the updates ), we get that

Using the fact that ) is convex, we obtain

Taking expectation on both sides, we obtain

Summing up from m = 1 to M, we obtain

which implies

Since , we obtain

A.2 Proof of Theorem 4

Since , we use the induction to prove the result. Assuming

, and using the PL condition, we obtain

Since

By setting , we obtain

Finally, we obtain that when

A.3 Proof of Lemma 6

If , then we have

If , then we have

where the last inequality uses the fact differ in at most one instance.

A.4 Proof of Theorem 7

Let E be the event that SEBS picking for the first time happens at the -iteration of

the last stage, then we have

Let j be the total iterations that mb-SGD using for the first time. Then we obtain

in which we use the inequality ). Using Lemma 6, we obtain

which implies that

Then we obtain

By setting

Appendix B. Momentum SEBS

Similar to the convergence analysis for SEBS, we first establish the one-stage training error

for for mSEBS:

Lemma 10 (One-stage training error for for mSEBS) Let be the sequence produced

by mSGD(. Then we have

where is the output of MSGD and M = C/b.

Then we have the following convergence result:

be the sequence produced by

where

To prove the above lemma and theorem, we give the following lemma:

be the sequence produced by

Then we have

Specifically, if Ω = (Yan et al., 2018), then we have

2, we obtain

B.1 Proof of Lemma 10

Let . Then we have

Using the Young’s inequality that (and the fact

, we obtain

We denote ) for short. Then we have

m=1

and hence

i.e.,

B.2 Proof of Theorem 3

Since ), using PL condition, we obtain

Since , we use the induction to prove the result. Assuming

, then we have

Appendix C. AdaSEBS

First, we have the following inequality:

C.1 Proof of Lemma 8

Let . First, we have

α

where ∆

Since 0, we obtain

Combining the above inequalities, we obtain

Since

we define

and

where )]. On the other hand, we have

Using the inequality , we obtain

Since ), we obtain

By choosing , we obtain

The conditions for the batch size and learning rate are

i.e.,

C.2 Proof of Theorem 9

Since

Then we can use Lemma 8 and PL condition to obtain

Since , we use the induction to proof the result. Assuming

, then we have

Since

Since , we obtain

we have