With the expansion of data and model scale, the training of machine learning models, especially deep learning models has become increasingly time-consuming. To accelerate the training process, distributed parallel optimization has attracted widespread interests recently, which encourages multiple workers to cooperatively optimize the model.
For large-scale machine learning problems, stochastic gradient descent (SGD) is a fundamental tool. It can be easily parallelized by collecting stochastic gradient from different workers and hence it is widely adopted. Previous studies [Dekel et al., 2012, Ghadimi and Lan, 2013] justify that synchronous stochastic gradient descent (S-SGD) has a linear iteration speedup for both general convex and non-convex objectives, which means that the total number of iterations is reduced by N times with N workers. However, S-SGD suffers from a major drawback: the communication cost among workers is expensive when the number of workers is large, which prevents S-SGD from achieving a linear time speedup. Therefore, it is crucial to overcome the communication bottleneck.
To reduce communication cost, several studies [Wang and Joshi, 2018, Zhou and Cong, 2018, Stich, 2019, Yu et al., 2019b, Shen et al., 2019] have managed to lower the communication frequency. Among them, Local SGD [Stich, 2019] is a representative distributed algorithm, where workers can conduct SGD locally and average model with each other every k iterations. Compared with S-SGD, the algorithms based on Local SGD reduce the communication rounds from O(T) to O(T/k). To deal with the gradient variance among workers, previous studies require at least one of the following extra assumptions: (1) the bounded gradient variance among workers; (2) an upper bound for gradients; (3) identical data on all workers. When the data distribution on workers is identical, which is the so-called identical case, the algorithms based on Local SGD can exhibit superior performance. Nevertheless, the identical data assumption is not always valid in real cases. When the data distribution on workers is non-identical, which is the so-called non-identical case, these algorithms would encounter a significant degradation in the convergence rate due to the gradient variance among workers. We seek to eliminate the gradient variance among workers, which may make the algorithm converge much faster than the vanilla Local SGD.
In this paper, we propose Variance Reduced Local SGD (VRL-SGD), a novel distributed optimization algorithm to further reduce the communication complexity. Benefiting from an additional variance reduction component, VRL-SGD eliminates the extra assumption about bounded gradient variance among workers in previous studies based on Local SGD [Yu et al., 2019a,b, Shen et al., 2019]. Thus the communication complexity can be reduced from to in VRL-SGD for the non-identical case, which is crucial for achieving a better time speedup. Therefore, VRL-SGD is more suitable than Local SGD for real cases, such as federated learning[Koneˇcn`y et al., 2016, Li et al., 2019, Kairouz et al., 2019], where the non-identical data has become a fundamentally challenging problem.
Contributions of this paper are summarized as follows:
• We propose VRL-SGD, a novel distributed optimization algorithm with a better communication complexity. Specifically, the communication complexity is reduced from to for the non-identical case. To the best of our knowledge, this is the first time that an algorithm based on Local SGD possesses such a communication complexity for non-convex objective in the non-identical case. Meanwhile, VRL-SGD also achieves the optimal communication complexity in the identical case.
• We provide a theoretical analysis and a more intuitive explanation for improving the convergence rate of existing algorithms. Besides, we prove that VRL-SGD has a linear iteration speedup with respect to the number of workers. Our method does not require the extra assumptions, e.g. the gradient variance across workers is bounded.
• We validate the effectiveness of VRL-SGD on three standard machine learning tasks. And experimental results show that the proposed algorithm performs significantly better than Local SGD if data distribution in workers is different, while maintains the same convergence rate as Local SGD if all workers access identical datasets.
Synchronous stochastic gradient descent (S-SGD) is a parallelized version of mini-batch SGD and is theoretically proved to achieve a linear iteration speedup with respect to the number of workers [Dekel et al., 2012, Ghadimi and Lan, 2013]. Nevertheless, due to the communication bottleneck, it is difficult to obtain the property of linear time speedup. To eliminate communication bottlenecks, many distributed SGD-based methods are proposed, such as lossy compression methods [Alistarh et al., 2017, Aji and Heafield, 2017, Bernstein et al., 2019, Lin et al., 2018b, Karimireddy et al., 2019, Tang et al., 2019], which use inexact approximations or partial data to represent the gradients, and methods [Stich, 2019, Yu et al., 2019b] based on the lower communication frequency.
Among them, Local SGD [Stich, 2019], a representative method to lower the communication frequency, has been widely used in the training of large-scale machine learning models, and its superior performance is verified in several tasks [Povey et al., 2014, Su and Chen, 2015, Lin et al., 2018a]. In Local SGD, each worker conducts SGD updates locally and averages its model with others periodically. Previous studies have proven that Local SGD can attain a linear iteration speedup for both strongly convex [Stich, 2019] and non-convex [Yu et al., 2019b] problems. To fully utilize hardware resources, a variant of Local SGD, called CoCoD-SGD [Shen et al., 2019], is proposed with the decoupling of computation and communication. Furthermore, Yu et al. [2019a] provide a clear linear speedup analysis for Local SGD with momentum. However, most of the above algorithms assume that the gradient variance among workers is bounded, and some of them even depend on a stronger assumption, e.g., the data distribution on workers is identical. Dependence on these assumptions may lead to a slow convergence rate for the non-identical case, which limits the further reduction of communication frequency and avoids a better time speedup. Haddadpour et al. [2019] verify that the use of redundant data can lead to lower communication complexity and hence faster convergence. The redundant data can help reduce the gradient variance among workers, thus it avoids the slow convergence rate. Nevertheless, this method may be constrained in some cases. For instance, it could not be widely applied in federated learning [Koneˇcn`y et al., 2016] as data cannot be exchanged between workers for privacy-preserving.
Although there are many studies proposed to reduce the variance in SGD, e.g., SVRG [Johnson and Zhang, 2013], SAGA [Defazio et al., 2014], and SARAH [Nguyen et al., 2017], they could not directly deal with the gradient variance among workers in distributed optimization. In recent years, several studies [Shi et al., 2015, Mokhtari and Ribeiro, 2016,
Table 1: Comparisons of the communication complexity for different algorithms. The second column and the third column show communication complexity for identical and non-identical datasets respectively. Here, we regard the following assumptions as extra assumptions: (1) an upper bound for gradients; (2) the bounded gradient variance among workers.
Tang et al., 2018] have proposed to eliminate the gradient variance among workers in the decentralized setting. Among them, Shi et al. [2015] propose a novel decentralized algorithm, EXTRA, which provides an ergodic convergence rate for convex problems and a linear convergence rate for strongly convex problems benefiting from eliminating the variance among workers. The [Tang et al., 2018] algorithm further applies the variance reduction on non-convex stochastic decentralized optimization problems and removes the impact of the gradient variance among workers on the convergence rate. To eliminate the gradient variance among workers and accelerate the training, we incorporate the variance reduction technique into Local SGD, and hence reduce the extra assumptions in the theoretical analysis. For a better comparison with related algorithms in terms of communication complexity and assumptions, we summarize the results in Table 1. It presents that our algorithm achieves better communication complexity compared with the previous algorithms for the non-identical case and does not need extra assumptions.
3.1 Problem definition
We focus on data-parallel distributed training, where N workers collaboratively train a machine learning model, and each worker may have its data with different distributions, which is the non-identical case. We use to denote the local data distribution in the i-th worker. Specifically, we consider the following finite-sum optimization:
where is the local loss function of the i-th worker.
3.2 Notations
First of all, we summarize the key notations of this paper as follows.
denotes the norm of a vector. • is the optimal value of equation (58). • E denotes that the expectation is taken with respect to all random indexes sampled to calculate stochastic
• denotes the average of local models over all N workers, and that is
is a stochastic gradient of the i-th worker at the t-th iteration. • represents the iteration of the last communication, and that is • represents the iteration of the penultimate communication, and that is
3.3 Assumptions
Throughout this paper, we make the following assumptions, which are commonly used in the theoretical analysis of distributed algorithms [Stich, 2019, Yu et al., 2019a, Shen et al., 2019].
Assumption 1
(3) Dependence of random variables: ’s are independent random variables, where
Previous studies based on Local SGD assume that the gradient variance among workers is bounded, or even depend on a stronger assumption, e.g., an upper bound for gradients or identical data distribution on workers, while we do not require these assumptions.
In this section, we first introduce the proposed algorithm and then give an intuitive explanation.
4.1 Variance Reduced Local SGD
We propose VRL-SGD, a variant of Local SGD. VRL-SGD allows locally updating in each worker to reduce the communication cost. But there are a few more steps in VRL-SGD to eliminate the gradient variance among workers. And in VRL-SGD, a worker:
1. Communicates with other workers to get the average of all local models
2. Calculates , which denotes the average deviation of gradient between the local gradients and the global gradients in the previous period. And it is defined as
The complete procedure of VRL-SGD is summarized in Algorithm 1. VRL-SGD allows each worker to maintain its local model and gets the average of all local models every k steps. Note that VRL-SGD with k = 1 is equivalent to S-SGD. While VRL-SGD with k > 1 reduces the number of communication rounds by k times compared with S-SGD. And VRL-SGD is equivalent to Local SGD if we set be 0 in line 5 of Algorithm 1 all the time.
To achieve a linear iteration speedup, Local SGD requires that T is more than . In other words, the communication period k in Local SGD is bounded by , which reduces the communication complexity to . Notice that a better communication period bound can be attained in the identical case in the previous studies [Shen et al., 2019, Yu et al., 2019a]. Nevertheless, the proposed algorithm can attain the communication period bound in both the identical case and the non-identical case.
One might wonder why VRL-SGD can improve the convergence rate of Local SGD. VRL-SGD uses an inexact variance reduction technique to reduce the variance among workers. To better understand the intuition of VRL-SGD, let us see the update of in equation (4). By summing up all and using the fact that
By summing up the above equality over , we obtain
It can be noticed that the update of in equation (8) is in the form of the generalized stochastic gradient descent. In addition, we can obtain a new representation of
Substituting equation (9) into equation (6), we have
The representation of in equation (10) can be regarded as the form of the generalized variance reduction, which is similar to SVRG [Johnson and Zhang, 2013] and SAGA [Defazio et al., 2014]. To observe that the variance among workers is reduced, we assume that the gradient variance within each worker is zero, which means that we calculate in line 8 of Algorithm 1. When all local model and the average model converge to the local minimum
, it holds that
Therefore, can converge to zero when the variance within each worker is zero, which helps VRL-SGD converge faster. On the other hand, the gradient in Local SGD cannot converge to zero, which prevents the local model from converging to the local minimum , so it is hard to converge for Local SGD. In summary, that is why VRL-SGD performs better than Local SGD for the non-identical case, where the gradient variance among workers is not zero.
In this section, we provide a theoretical analysis of VRL-SGD. We bound the expected squared gradient norm of the average model, which is the commonly used metric to prove the convergence rate for non-convex problems [Ghadimi and Lan, 2013, Tang et al., 2018, Yu et al., 2019a].
Theorem 5.1 Under Assumption 1, if the learning rate satisfies and , we have the following convergence result for VRL-SGD in Algorithm 1:
where C is defined as
The proof of Theorem 5.1 is given in Appendix C. Note that C will be 0 if k = 1 according to equation (12). It is consistent with the fact that when k = 1 VRL-SGD is equivalent to S-SGD, where the convergence of S-SGD is not related to the variance among workers.
By setting a suitable learning rate , we have the following corollary.
Corollary 5.2 Under Assumption 1, when the learning rate is set as
and the total number of iterations satisfies , we have the following convergence result for Algorithm 1:
where C is defined in Theorem 5.1.
The detailed proof of Corollary 5.2 is given in Appendix D.
Remark 5.3 Warm-up. We can set the first communication period k to 1 in VRL-SGD, which is VRL-SGD with a warm-up (VRL-SGD-W), then the variable C in Theorem 5.1 and Corollary 5.2 will be 0. Essentially, this is equivalent to conduct one S-SGD update and initialize . Therefore, the convergence result is not related to the extent of non-iid. We conduct additional experiments to verify this conclusion in Appendix E.
Remark 5.4 Consistent with . In [Tang et al., 2018], the convergence rate is , where represents the extent of non-iid in the first iteration. While the convergence rate in VRL-SGD is , where C is similar to . However, we can reduce the dependence on C by a warm-up, which leads to a tighter convergence rate.
Remark 5.5 Linear Speedup. For non-convex optimization, if there are N workers training a model collaboratively, according to Corollary 5.2, VRL-SGD converges at the rate , which is consistent with S-SGD and Local SGD. To achieve -optimal solutuioin, iterations are needed. Thus, VRL-SGD has a linear iteration speedup with respect to the number of workers.
Remark 5.6 Communication Complexity. By Corollary 5.2, to achieve the convergence rate , the number of iterations T needs to satisfy , which requires the communication period . Consequently, by setting , VRL-SGD can reduce communication complexity by a factor k. However, for the non-identical case, previous algorithms based on Local SGD can only reduce communication complexity by a factor
Remark 5.7 Mini-batch VRL-SGD. Although we consider only a single stochastic gradient in each worker so far, VRL-SGD can calculate mini-batch gradients with size b in line 8 of Algorithm 1. It reduces the variance within each worker by a factor b, thus VRL-SGD can converge at the rate by setting the learning rate
6.1 Experimental Settings
Experimental Environment We implement algorithms with Pytorch 1.1 [Paszke et al., 2017]. And we use a machine with 8 Nvidia Geforce GTX 1080Ti GPUs, 2 Xeon(R) E5-2620 cores and 256 GB RAM Memory. Each GPU is regarded as one worker in experiments.
Baselines We compare the proposed algorithm VRL-SGD1 with Local SGD [Stich, 2019], EASGD [Zhang et al., 2015] and S-SGD [Ghadimi and Lan, 2013].
Data Partitioning To validate the effectiveness of VRL-SGD in various scenarios, we consider two cases: the non-identical case and the identical case. In the non-identical case, each worker can only access a subset of data. For example, when 5 workers are used to train a model on 10 classes of data, each worker can only access to two classes of data. In the identical case, we allow each worker to access all data.
Datasets and Models We consider three typical tasks: (1) LeNet [El-Sawy et al., 2016] on MNIST [LeCun, 1998]; (2) TextCNN [Kim, 2014] on DBPedia [Lehmann et al., 2015]; (3) transfer learning on tiny ImageNet 2, which is a subset of the ImageNet dataset [Deng et al., 2009]. When training TextCNN on DBPedia, we retain the first 50 words and use a GloVe [Pennington et al., 2014] pre-trained model to extract 50 features for word representation. In transfer learning, we use an Inception V3 [Szegedy et al., 2016] pre-trained model as the feature extractor to extract 2,048 features for each image. Then we train a multilayer perceptron with one fully-connected hidden layer of 1,024 nodes, 200 output nodes, and relu activation. All datasets are summarized in Table 2. A lot of deep learning models use batch normalization [Ioffe and Szegedy, 2015], which assumes that the mini-batches are sampled from the same distribution. Applying batch normalization directly to the non-identical case may lead to some other issues, which is beyond the scope of this paper.
Table 2: Parameters used in experiments and a summary of datasets. N denotes the number of workers, b denotes batch size on each worker, is the learning rate, k is the communication period, n represents the number of data samples and m represents the number of data categories.
Hyper-parameters For the above three different tasks, we set the weight decay to be . And we initialize model weights by performing 2 epoch SGD iterations in all experiments. Other detailed hyper-parameters can be found in Table 2.
Metrics In this paper, we mainly focus on the convergence rate of different algorithms. Local SGD has a more superior training speed performance than S-SGD, which has been empirically observed in various machine learning tasks [Povey et al., 2014, Su and Chen, 2015]. Besides, VRL-SGD has only a minor change over Local SGD. So VRL-SGD and Local SGD have the same training time in one epoch and both of them have a faster training speed compared with S-SGD. VRL-SGD and EASGD would have the same communication complexity under the same period k. Therefore, we compare only the convergence rate (the training loss with regard to epochs) of different algorithms.
Figure 1: Epoch loss for the non-identical case. VRL-SGD converges as fast as S-SGD, and Local SGD, EASGD converge slowly or even cannot converge.
Figure 2: Epoch loss for the identical case. All of the algorithms have a similar convergence rate.
6.2 Non-identical case
This paper seeks to address the problem of poor convergence for Local SGD when the variance among workers is high. Therefore, we focus on comparing the convergence rate of all algorithms in the non-identical case, where the data variance among workers is maximized.
We choose three classical tasks: image classification, text classification, and transfer learning. Figure 1 shows the training loss with regard to epochs on the three tasks. The results are indicative of the strength of VRL-SGD in the non-identical case. Local SGD converges slowly compared with S-SGD when the communication period k is relatively large, while VRL-SGD enjoys the same convergence rate as that of S-SGD. This is consistent with theoretical analysis that VRL-SGD has a better communication period bound compared to Local SGD. When the variance among workers is not zero, Local SGD requires that T is greater than to achieve a linear iteration speedup. Thus Local SGD losses this property if k is larger than . However, benefiting from eliminating the dependency on the gradient variance among workers, VRL-SGD can attain a better communication period bound than Local SGD as shown in Corollary 5.2. Therefore, under the same communication period, VRL-SGD can achieve a linear iteration speedup and converges much faster than Local SGD. To maintain the same convergence rate, Local SGD needs to set a smaller communication period, which will result in higher communication cost. EASGD converges the worst under the same communication period in the non-identical case.
There are more experimental results to analyze the influence of parameter k in Appendix F.
6.3 Identical case
In addition to the above extreme case, we also validate the effectiveness of VRL-SGD in the identical case. As shown in Figure 2, all algorithms have a similar convergence rate. VRL-SGD, EASGD and Local SGD converge as fast as S-SGD when workers can observe unbiased stochastic gradients.
In this paper, we propose a novel distributed algorithm VRL-SGD for accelerating the training of machine learning models. VRL-SGD incorporates the variance reduction technique into Local SGD to further reduce the communication complexity. We theoretically prove that VRL-SGD can achieve a linear iteration speedup for nonconvex functions with the optimal communication complexity whether each worker accesses identical data or not. Experimental results verify the effectiveness of VRL-SGD, where VRL-SGD is significantly better than traditional Local SGD for the non-identical case and enjoys the same convergence rate as that of Local SGD.
In the future, we will consider the deep learning models with batch normalization layers, which may lead to an unstable convergence in the non-identical case.
A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, 2017.
D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. In International Conference on Learning Representations, 2019. URL https://openreview. net/forum?id=BJxhijAcY7.
A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
A. El-Sawy, E.-B. Hazem, and M. Loey. Cnn for handwritten arabic digits recognition based on lenet-5. In International Conference on Advanced Intelligent Systems and Informatics, pages 566–575. Springer, 2016.
S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe. Trading redundancy for communication: Speeding up distributed sgd for non-convex optimization. In ICML, pages 2545–2554, 2019.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In ICML, pages 3252–3261, 2019.
Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
J. Koneˇcn`y, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
Y. LeCun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.
J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al. Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web, 6(2): 167–195, 2015.
T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.
T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217, 2018a.
Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018b. URL https://openreview. net/forum?id=SkhQHMW0W.
A. Mokhtari and A. Ribeiro. Dsa: Decentralized double stochastic averaging gradient algorithm. The Journal of Machine Learning Research, 17(1):2165–2199, 2016.
L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáˇc. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In ICML, pages 2613–2621, 2017.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/ D14-1162.
D. Povey, X. Zhang, and S. Khudanpur. Parallel training of dnns with natural gradient and parameter averaging. arXiv preprint arXiv:1410.7455, 2014.
S. Shen, L. Xu, J. Liu, X. Liang, and Y. Cheng. Faster distributed deep net training: Computation and communication decoupled stochastic gradient descent. In IJCAI, 2019.
W. Shi, Q. Ling, G. Wu, and W. Yin. Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
S. U. Stich. Local sgd converges fast and communicates little. In ICLR 2019 ICLR 2019 International Conference on Learning Representations, number CONF, 2019.
H. Su and H. Chen. Experiments on parallel training of deep neural network using model averaging. arXiv preprint arXiv:1507.01239, 2015.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. D2: Decentralized training over decentralized data. In ICML, pages 4855–4863, 2018.
H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu. Doublesqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In ICML, pages 6155–6165, 2019.
J. Wang and G. Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
H. Yu, R. Jin, and S. Yang. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. In ICML, pages 7184–7193, 2019a.
H. Yu, S. Yang, and S. Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5693–5700, 2019b.
S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
F. Zhou and G. Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3219–3227. AAAI Press, 2018.
In this section, we present Lemma 1 and Lemma 2 to bound the partially accumulated local gradients, which are defined as
Lemma 1 Under Assumption 1, we have the following inequality for
Proof. By the definition of in (13), we have
where the inequality follows from Cauchy’s inequality. We next bound
Because ’s are independent at different time and workers, and the variance of stochastic gradient in each worker is bounded by , we can bound
Substituting (17), (18) and (19) into (16), we have
We next bound
where the first three inequalities follow from Cauchy’s inequality, and the fourth inequality follows from the Lipschitz gradient assumption. According to (21), we have
Substituting (20 ), (22) into (15), we obtain Lemma 1.
Lemma 2 Under Assumption 1, we have the following inequality for t < k,
Proof. By the definition of in (13), we have
where the second inequalities can be obtained by using (17) again. Rerrangeing the inequality, we obtain Lemma 2.
In this section, we introduce Lemma 3, which bounds the difference between the local model and the average model
Lemma 3 Under Lemma 1 and Lemma 2 , when the learning rate and the communication period k satisfy that , we have the following inequality
Proof. According to the updating scheme in Algorithms 1, can be represented as
On the other hand, by the definition of , we can represent it as
Substituting (27) and (28) into the left hand side of (25) , we have
According to the result in Lemma 1 and Lemma 2, for
and for t < k, we have
Summing up (30) and (31) from , we obtain
where the second and the third inequalities can be obtained by using a simple counting argument. Denote C =
Dividing on both sides completes the proof.
In this section, we give the proof of Theorem 5.1.
Theorem 5.1 Under Assumption 1, if the learning rate satisfies and , we have the following convergence result for Algorithm 1:
Proof. Since -smooth, it is easy to verify that -smooth. We have
By applying expectation with respect to all the random variables at step t and conditional on the past (denote by we have
Note that
where the last equality holds because
where the second equality holds because the random variables on different workers are independent. Substituting (37) into (36) and applying expectation with respect to all the random variables, we obtain
We then bound the difference of
where the two inequalities follow from Cauchy’s inequality and Lipschitz gradient assumption, respectively. Substituting (40) into (39) yields
Rearranging the inequality and summing up both sides from
Substituting Lemma 3 into (42) and combing , we obtain
Next, we bound
Substituting (45) into (44), we have
where the inequality holds since . Substituting (46) into (43), we obtain
Rearranging this inequality and dividing both sides by
Then we prove . If the learnign rate , then we have
Since , then we have , and thus . Rearranging (48) and dividing both sides by
where the inequalities hold because
In this section, we evaluate the effectiveness of our algorithm on different variance among workers. Specifically, we consider the following finite-sum optimization
where respectively denote the local loss function of the first and the second worker.
We can set a large variance among workers by adjusting b. Therefore, we can compare the convergence rate of algorithms in different variance, where the variance among workers is large with a large b. VRL-SGD-W denotes VRL-SGD with a warm-up, where the first communication period is set to 1. Figure 3 shows the gap with regard to iteration on different k and b. We can see that Local SGD converges slowly compared with VRL-SGD-W and VRL-SGD when the communication period k is relatively large. And VRL-SGD without warm-up is related to b while VRL-SGD-W is not sensitive to b. Figure 4 shows that the variance of in VRL-SGD and VRL-SGD-W converges to 0, while the variance of in Local SGD is a constant related to b. The experimental results verify our conclusion that VRL-SGD has a better convergence rate compared with Local SGD in the non-identical case, and VRL-SGD with a warm-up is more robustness to the variance among workers.
p
Figure 3: Logarithm of distance to the global minimum for different b and communication period k.
Figure 4: Logarithm of variance among workers for different b and communication period k.
In this section, we evaluate all algorithms with different communication period k.
As shown in Figure 5, VRL-SGD converges as fast as S-SGD, while Local SGD, EASGD converge slowly even if we set the period k to half of it in Figure 1. The results show that k in Local SGD should be smaller, such as k = 2 or k = 5
in transfer learning, which is in line with
VRL-SGD. Figure 6 compares the convergence of different algorithms with a larger k. We observe that the convergence of VRL-SGD will be affected with much large k, but VRL-SGD is still faster than Local SGD and EASGD, which is consistent with our theoretical analysis.
Figure 5: Epoch loss for the non-identical case. We set k = 10 for LeNet, k = 25 for TextCNN and k = 10 for Transfer Learning.
Figure 6: Epoch loss for the non-identical case. We set k = 40 for LeNet, k = 100 for TextCNN and k = 40 for Transfer Learning.