Faster On-Device Training Using New Federated Momentum Algorithm

2020·Arxiv

ABSTRACT

ABSTRACT

Mobile crowdsensing has gained significant attention in recent years and has become a critical paradigm for emerging Internet of Things applications. The sensing devices continuously generate a significant quantity of data, which provide tremendous opportunities to develop innovative intelligent applications. To utilize these data to train machine learning models while not compromising user privacy, federated learning has become a promising solution. However, there is little understanding of whether federated learning algorithms are guaranteed to converge. We reconsider model averaging in federated learning and formulate it as a gradient-based method with biased gradients. This novel perspective assists analysis of its convergence rate and provides a new direction for more acceleration. We prove for the first time that the federated averaging algorithm is guaranteed to converge for non-convex problems, without imposing additional assumptions. We further propose a novel accelerated federated learning algorithm and provide a convergence guarantee. Simulated federated learning experiments are conducted to train deep neural networks on benchmark datasets, and experimental results show that our proposed method converges faster than previous approaches.

ACM Reference Format:

Zhouyuan Huo1, Qian Yang2, Bin Gu1, Lawrence Carin2, Heng Huang1. 2020. Faster On-Device Training Using New Federated Momentum Algorithm. In ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Mobile crowdsensing is a new paradigm of sensing by taking advantage of the power of various mobile devices, which are penetrating most aspects of modern life and also continuously generate a large amount of data. As new high-speed 5G networks arrive to handle their traffic, the number of connected smart devices is expected to grow further over the next five years [42]. It is desirable to utilize these data to improve model performance and maximize the user experience. Traditional distributed optimization methods are able to train models when all datasets are stored in the cluster [8]. However, the increasing awareness of user privacy and data security issues prevents storing and training a model on a centralized server [35]. It therefore becomes a major challenge to train a model with massive distributed and heterogeneous datasets without compromising user privacy.

Federated learning [12, 24, 35] is a promising solution for machine learning model training using crowdsensing data without

Figure 1: Federated learning procedure: (1) Server selects a set of Active clients and broadcasts model . (2) After receiving model from Server, Active clients conduct the update locally and send the updated model (the figure) back to Server. 3) Server updates the model Steps 1-3 are repeated until convergence.

compromising user privacy. It has been widely applied for mobile keyboard prediction [9], private language models [25], and financial-client classification [7]. As shown in Figure 1, the data from each client are never uploaded to the server during the optimization period. Instead, each client conducts updates locally for several iterations and sends the updated model back to the server. At each iteration, the server only has access to a small fraction of clients, and updates the server model after receiving from these active clients. The Google Gboard team trains a recurrent neural network (RNN) with 1.4 million parameters for the next-word prediction task [9], where a round of training takes 2 to 3 minutes. It requires roughly 3000 rounds to converge, with a total running time of over 5 days – taking much more wall-clock time than centralized training on the same task [8]. It is nontrivial to accelerate the training of federated learning. Furthermore, [2] reports a large difference in the number of participating devices over a 24-hour period for a US-centric client population, which consequently has an impact on the round-completion rate. Because the communication between server and clients is unstable and expensive, [4, 11, 15] proposed new techniques to reduce communication and accelerate federated-learning training. Nonetheless, few studies provide a convergence guarantee of federated learning algorithms, especially for non-convex problems. There are two difficulties that make this a challenging problem: the data are non-independent identical distribution (non-IID) and there is limited communication between server and clients.

Synchronous and asynchronous gradient-based methods have already been proven to converge to critical points for non-convex problems [18, 20, 29]. To reduce communication overhead, local stochastic gradient descent (local SGD) is studied in the field of distributed optimization for clusters [21, 33, 38]. In [21], the authors showed that local SGD converges faster than mini-batch gradient descent with fewer communications. [33, 38, 41] investigated the convergence of local SGD and proved that it is guaranteed to converge for strongly convex or non-convex problems. Another line of communication-efficient distributed methods [17, 23, 40] performed model averaging after solving local subproblems and are guaranteed to converge. However, the above distributed methods either assume that the dataset is distributed IID or requires the server to collect updates from all workers at each iteration. None of these approaches is applicable to federated learning.

We investigate the model averaging in federated learning from a new point of view and formulate it as a gradient-based method with biased gradients. This novel perspective helps the analysis of its convergence rate and motivates a new accelerated method. Our main contributions in this paper are summarized as follows:

• We investigate the model averaging step of the FedAvg algorithm [24] and derive the first convergence proof for non-convex problems;

• A novel algorithm is proposed to accelerate federated optimization, and it also provides a convergence analysis for non-convex problems;

• We perform simulated federated learning experiments with training deep neural networks, and the empirical results show that the proposed method converges faster than previous approaches.

2 RELATED WORKS

Distributed Learning. When a dataset P of n samples is centrally stored and partitioned across K machines or clients P = , we use distributed optimization methods to train machine learning models. represents the set of data indices on worker k, and we let and . The target of distributed learning is to minimize a weighted finite-sum loss of all clients as follows:

where f (w) and are non-convex problems. denotes a subset of loss on client k and , where is a sampled data from . If the dataset is randomly partitioned and can also be represented as 1. Distributed mini-batch gradient methods have been used widely to train deep neural networks [6, 8, 37]. At each iteration, gradients are computed on clients and aggregated on the server. However, this method suffers severely from network delays and bandwidth limits. To overcome the communication bottleneck, [21, 39] allowed workers to perform local updates for a while and average local models on the server periodically. In [33], the authors analyzed the convergence of local SGD for convex problems. Recently, [34, 38, 41] applied local SGD to non-convex problems and proved that it guarantees convergence as well. However, local SGD requires

Table 1: Comparisons of settings between federated learning and normal distributed learning. IID denotes independent and identically distributed.

averaging local models from all clients, which is not realistic for federated learning.

Federated Learning. Instead of training machine learning models with centrally stored data, as with distributed learning, federated learning seeks to perform large-scale learning on a massive number of distributed mobile devices with heterogeneous data [24, 32]. We summarize the differences between distributed and federated learning in Table 1. There are two challenges in federated learning: (i) the datasets are massive, distributed, and heterogeneous, so that K is extremely large and is highly unbalanced in problem (1); (ii) communications between server and clients are highly unstable, so that only a small fraction of clients are active within one iteration. To improve federated learning empirically, [15] investigated reducing the uplink (clients server) communication cost through message compression. Later, [4] proposed to reduce the downlink (server clients) communication cost by training smaller sub-models. However, few studies have analyzed the convergence of federated learning methods. [19] analyzed the convergence of federated averaging algorithm for convex problems. Recently, [31] proved that federated learning is guaranteed to converge for non-convex problems. However, the authors considered a different federated learning algorithm, imposing a new regularization and considered unrealistic assumptions, such as strongly convex subproblems and bounded dissimilarity of local data.

3 FEDERATED AVERAGING ALGORITHM

We first briefly introduce the Federated Averaging algorithm (FedAvg) proposed in [24]. We then reformulate the model averaging in FedAvg as a gradient-based method and prove that it is guaranteed to converge to critical solutions for non-convex problems.

3.1 Algorithm Description

According to the setting of federated learning, we assume there is one server and K clients, where K is a large value. Algorithm 1 summarizes the procedures of FedAvg on the server. At iteration t, the server is connected with a set of active clients with the size M, where . After broadcasting model to active clients , the server waits until receiving updated model through Algorithm 2 from . Finally, model is updated on the server via model averaging:

Note that is averaging local updated models from all clients in (2) by setting, which is consistent with the

FedAvg algorithm in the original paper [24]. The right term in (2) enforces that stay close to the current server model implicitly if ], the authors updated the server model using local models from only active clients through . Because of this step, they have to impose new regularization on local subproblems and bring in an additional assumption about data distributions for convergence analysis.

Algorithm 2 describes the local solver on clients. After receiving model from the server, client k applies SGD locally and updates iteratively for iterations. Finally, it sends the updated local model back to the server. The local solver can also be any gradient-based method, such as Momentum method [28, 36], RMSProp [10], Adam [14] or AdamW [22]. We only consider SGD in this paper, for simplicity.

3.2 Model Averaging Is a Gradient-Based Method with Biased Gradients

Clients in federated learning are mobile devices, and the server is a cluster and able to do more computations not limited to model averaging. In this paper, we reconsider the model averaging at Line 8 in Algorithm 1, and formulate it as an update of gradient-based methods as follows:

w

denotes a biased gradient on the server at iteration t, because . Equations (2) and (3) are equivalent. Given that , we can also rewritein (3) using gradients computed on clients in Algorithm 2. To generalize (3), we set a constant learning rate and obtain

the following update function in the server:

If 1, it is equivalent to (3). By rewriting model averaging as a gradient-based method, we can easily utilize existing theoretical analysis and many improved algorithms of gradient-based methods. In the following context, we provide the convergence analysis of FedAvg for non-convex problems, and propose a novel accelerated federated learning algorithm.

Difficulty of the Convergence Analysis: It is difficult to analyze the convergence of FedAvg, due to its biased gradient. Conventional analysis for gradient-based methods requires However, this is not satisfied in FedAvg. Apart from the unbiased gradients, limited communication in federated learning also increases the difficulty of analysis.

3.3 Convergence Analysis

To prove the convergence rate of FedAvg, we assume that two widely used assumptions [3], Bounded Variance and Lipschitz Continuous Gradient, are satisfied throughout this paper.

Assumption 1 (Bounded Variance). We assume that the variance of stochastic gradient on local clients is upper bounded, so that for any , it is satisfied that:

Assumption 2 (Lipschitz Continuous Gradient)). The gradients of f and are Lipschitz continuous with a constant L > 0, so that for any , it is satisfied that:

Under Assumptions 1 and 2, we can prove the upper bound of FedAvg at each iteration.

Lemma 3.1. Under Assumptions 1 and 2, the update of on the server at each iteration is upper bounded as follows:

Proof. According to Assumption 2, it holds that:

Taking an expectation over samples on the inequality above, we have:

where the inequality follows from . Because all the clients are selected randomly with a uniform distribution, we take expectation over clients and have:

equality follows from Jensen’s inequality, and the last equality follows from . We prove the

upper bound of as follows:

where the first inequality follows from Assumption 2 andthe second inequality follows from

for any random variable with mean 0. The last inequality follows from Assumption 1. Summing the inequality above from 1, we know that:

To get the upper bound of

where the second equality follows from and the inequality is from Assumption 1. Inputting inequality (6), we complete the proof.

From Lemma 3.1, we can ensure the convergence of f at each iteration as long asis properly selected and 10. According to Lemma 3.1, we can prove the convergence of FedAvg for non-convex problems as follows.

Theorem 3.2. Assume that Assumptions 1 and 2 hold. We let local iteration , and stepsize sequence satisfies

assume loss f has a lower bound and let . Then, the output of Algorithm 1 satisfies that:

Proof. From Lemma 3.1, if we let have:

Taking expectation of inequality (4), it holds that:

Supposing , rearranging the inequality above, and summing it up from 1, we have:

We complete the proof.

Corollary 3.3. Following Theorem 3.2, we can prove that Algorithm 1 is guaranteed to converge to critical points for the non-convex problem lim

satisfies:

The results above follow from the seminal work in [30] and the two requirements in (8) can be easily satisfied if we let . We can also obtain the convergence rate of FedAvg if we let constant.

for all

it is guaranteed that Algorithm 1 converges as follows:

We have proven that FedAvg is guaranteed to converge to critical points for non-convex problems at O

This is the first work which confirms the convergence of FedAvg for non-convex problems. It is worthwhile to highlight the generalities of our analysis compared to [31] as follows: i) no data distribution assumption, so that it is satisfied for clients with any data distributions; ii) no constraints on local subproblem, so subproblems can also be non-convex.

4 NEW FEDERATED MOMENTUM ALGORITHM

By understanding model averaging as a gradient-based method with biased gradients, we propose a novel accelerated federated momentum algorithm on the server end. We also prove that the proposed method is guaranteed to converge to critical points for non-convex problems.

4.1 Algorithm Description

The origin of momentum methods dates back to the 1960’s [27]. Since then, it has achieved the optimal convergence rate for strongly convex smooth optimization [1, 26]. Although admitting a similar convergence rate as SGD for non-convex problems, momentum methods exhibit impressive performance in training deep neural networks [36]. In Section 3, we reformulate model averaging in Algorithm 1 as an update of gradient-based methods. To accelerate the training of federated learning, we propose Federated Momentum algorithm (FedMom) by using Nesterov’s accelerated gradient on the server.

We describe the procedures of FedMom in Algorithm 3. FedMom is similar to FedAvg in selecting clients and receiving updated models at each iteration. However, instead of computing the average of collected models, the server stores a momentum variable and updates the server model following steps 8-9 in Algorithm 3:

In other words, FedMom performs a simple step like SGD from to at first. After that, the model moves a little bit further in the direction of the previous point . Parameter is selected from [0, 1). In the experiment, we set 9 all the time. In the following context, we provide convergence guarantees of FedMom for non-convex problems.

4.2 Convergence Analysis

Under Assumptions 1 and 2, we show that FedMom is guaranteed to converge to critical points for non-convex problems.

Theorem 4.1. Assume that Assumptions 1 and 2 hold, for all . In addition, we assume that loss f has a lower bound

Algorithm 3 satisfies that:

Proof. The procedures of FedMom is as follows:

according to (9), we have:

We also define as follows:

where 0. It also holds that:

Following [36], we can prove that:

Let , according to Assumption 2 and taking expectation over , we know that:

where the inequalities follow from Cauchy’s inequality and techniques in the proof of inequalities (5) and (6). We readily get the upper bound of from inequality (7):

The upper bound of is as follows:

where the first inequality follows from Assumption 2 and the equality follows from . Because of (10) and 0, it is satisfied that:

where the inequality is from Jensen’s inequality. We can also obtain the upper bound of as follows:

where the first inequality follows from Jensen’s inequality, the sec-

Because

from 1, we have:

To obtain the upper bound of

Q4

where the first inequality follows from Assumption 2, the second inequality follows from Assumption 1 and for any random variable , ... , with mean 0. Summing up 1, it holds that:

Inputting into inequality (12) and summing it up from 1, we have:

As long as the following inequalities are satisfied:

Figure 2: Visualization of non-IID and unbalanced data on all clients. For FEMNIST dataset, the number of samples on clients is from 0 to 480; for Shakespeare dataset, the number of samples on clients is from 0 to 18000.

Table 2: Statistics of FEMNIST and Shakespeare datasets used in our experiment.

we have:

Mηγ2K

Thus, it follows that:

Rearranging inequality (14) and letting

After inputting the upper bound of in the above inequality, we complete the proof.

Above we also prove that FedMom is guaranteed to converge to critical points for non-convex problems at O

FedMom shares the similar convergence rate to FedAvg, we will show in the following context that FedMom works better than FedAvg empirically.

5 EXPERIMENTS

We validate our analysis with simulated federated learning experiments on training deep neural networks. There are two targets: (i) we verify that the stochastic gradient in (3) is a right direction towards target solution although it is biased; (ii) we demonstrate that our proposed method converges faster. All experiments are performed on a machine with Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz and 4 TITAN Xp GPUs.

5.1 Implementation Details

Our implementations are based on the LEAF project [5], which is a benchmark for federated learning algorithms. As in Table 2, there are two tasks in the experiment: the digit recognition task on FEMNIST dataset [5] and the character prediction task on Shakespeare dataset [24]. For the digit recognition task, we use LeNet in the experiment [16]; for the task of character prediction, we train a character-level LSTM language model, which is 1-layer LSTM with 128 nodes [13]. To simulate the setting of federated learning, we set M = 2 in all experiments, such that only two clients communicate with the server at each iteration. We let for two datasets; |B| represents the number of mini-batches in each epoch with batch size B = 10. For FedMom algorithm, we let 9 in all experiments.

5.2 Direction of Biased Gradient

We train neural networks using the FedAvg algorithm and visualize in Figure 3 the variations of during the course of optimization. Positive values denote that is heading towards the

Figure 3: Variation of the expectation of Inner product in the course of optimization. We set the model after 2000 communication rounds as

Figure 4: Verifying why FedAvg converges faster than FedSGD. The shaded region denotes the gap of performance between two methods. “Inner Product” represents

target solution. We approximate the expectation of by taking the average of every 100 communication rounds. , which is the model after 2000 communication rounds. Taking the left figure on FEMNIST as an example, we have two observations. First, the values are large at the beginning of optimization, which means the model is far from the target point at first and it moves towards the target point at a fast speed. After a number of rounds, the model is close to the target point and the value of becomes small. Secondly, it is clear that the values of are larger than 0 most of the time. We can also draw similar conclusions according to the result on Shakespeare dataset. Therefore, in FedAvg algorithm is an appropriate direction towards the target point, although it is biased.

5.3 Investigating FedAvg and FedSGD

In this section, we investigate why FedAvg converges faster than FedSGD empirically. We compare these two methods by training digit recognition task on FEMNIST dataset. In Figure 4, we visualize the difference of

during the course of optimization. In the leftmost figure, we can observe that the “inner product” of FedAvg is larger than FedSGD all the time. At the same time, FedAvg converges faster than FedSGD regarding the training loss and testing accuracy. Experimental results indicate that FedAvg is moving towards a better direction to the target point than FedSGD.

5.4 Convergence Comparison

We compare the convergence of FedSGD, FedAvg and FedMom, with results visualized in Figure 5. There are two observations: (i) we know that FedAvg always converges faster than FedSGD by a large margin; (ii) FedMom converges faster than FedAvg given similar step size in all experiments.

In Figure 6, we evaluate the proposed method by varying the step size and local iterations H on FEMNIST dataset. In the left figure, FedMom is always works better than FedAvg when we select a similar step size . Besides, it is clear that FedMom is more robust to the selection of step size . However, the performance of FedAvg with smaller drops severely. When varying H, we observe similar results that FedMom performs more robust than FedAvg. Thus,

Figure 5: Performance of compared methods on FEMNIST and Shakespeare dataset. 10th percentile denotes that there are 10% of the data values below it.

Figure 6: Training loss of FedMom and FedAvg methods when we vary the value of step sizeand local iterationsH on FEMNIST dataset.

FedMom is a more practical method because it is easier to tune step size and iterations H than the compared methods.

6 CONCLUSIONS

We have investigated model averaging in the federated averaging algorithm, and have reformulated it as a gradient-based method with biased gradients. As a result, we derived the first convergence proof of the federated averaging algorithm for nonconvex problems. Based on our new perspective, we propose a novel federated momentum algorithm (FedMom) and prove that it is guaranteed to converge to critical solutions for non-convex problems. In the experiments, we compare FedMom with FedAvg and FedSGD by conducting simulated federated learning experiments on the digit recognition task and the character prediction task. Experimental results demonstrate that the proposed FedMom converges faster than the compared methods on both tasks and is easier to tune parameters as well. More important, our research results open up new research directions for federated learning.

REFERENCES

[1] Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

[2] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.

[3] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

[4] S. Caldas, J. Konečny, H. B. McMahan, and A. Talwalkar. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210, 2018.

[5] S. Caldas, P. Wu, T. Li, J. Konečn`y, H. B. McMahan, V. Smith, and A. Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.

[6] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981, 2016.

[7] K. Cheng, T. Fan, Y. Jin, Y. Liu, T. Chen, and Q. Yang. Secureboost: A lossless federated learning framework. arXiv preprint arXiv:1901.08755, 2019.

[8] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.

[9] A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.

[10] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, page 14, 2012.

[11] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S.-L. Kim. Communicationefficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479, 2018.

[12] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[13] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-aware neural language models. In AAAI, pages 2741–2749, 2016.

[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] J. Konečn`y, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[17] J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.

[18] M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.

[19] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.

[20] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

[21] T. Lin, S. U. Stich, and M. Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217, 2018.

[22] I. Loshchilov and F. Hutter. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 2017.

[23] C. Ma, V. Smith, M. Jaggi, M. I. Jordan, P. Richtárik, and M. Takáč. Adding vs. averaging in distributed primal-dual optimization. arXiv preprint arXiv:1502.03508, 2015.

[24] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. Communicationefficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

[25] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.

[26] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence o (1/kˆ 2). In Doklady AN USSR, volume 269, pages 543–547, 1983.

[27] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[28] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.

[29] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2647–2655, 2015.

[30] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

[31] A. K. Sahu, T. Li, M. Sanjabi, M. Zaheer, A. Talwalkar, and V. Smith. On the convergence of federated optimization in heterogeneous networks. arXiv preprint

arXiv:1812.06127, 2018.

[32] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4424–4434, 2017.

[33] S. U. Stich. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

[34] J. Wang and G. Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.

[35] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):12, 2019.

[36] T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.

[37] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

[38] H. Yu, S. Yang, and S. Zhu. Parallel restarted sgd for non-convex optimization with faster convergence and less communication. arXiv preprint arXiv:1807.06629, 2018.

[39] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pages 685–693, 2015.

[40] Y. Zhang and X. Lin. Disco: Distributed optimization for self-concordant empirical loss. In International conference on machine learning, pages 362–370, 2015.

[41] F. Zhou and G. Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.

[42] M. Zhou. 5g will mean big boom for smart devices, ericsson says. 2018.