We consider a cloud server and a set of M mobile devices (workers) collected in M := {1, . . . , M}. Each device m has its local dataset , which defines the loss function of device m as
where is the sought vector (e.g., parameters of a prediction model) and
is a data sample. For example, in linear regression,
) is the square loss; and, in deep learning,
) is the loss function of a neural network, and
concatenates the weights. The goal is to solve
Problem (2) also arises in a number of areas, such as multi-agent optimization (Nedic and Ozdaglar, 2009), distributed signal processing (Msechu and Giannakis, 2011), and distributed machine learning (Dean et al., 2012). While our algorithms can be applied to other settings, we focus on the federated learning setting. In this case, for bandwidth and privacy concerns, local data at each worker m are not uploaded to the server, and collaboration is needed through communication between the server and workers.
To solve (2), we can in principle apply the distributed version of stochastic gradient descent (SGD) method. In this case, at iteration k, the server broadcasts the current model to all the workers; each worker m computes
) using a randomly selected sample or a minibatch of samples
, and then uploads it to the server; and once receiving stochastic gradients from
Figure 1: Generic LASG implementation.
all workers, the server updates the model parameters via
SGD
where 0 is the (possibly) time-varying stepsize used at iteration k. When
) is an unbiased gradient estimator of
), the convergence of SGD update (3) is guaranteed (Bottou et al., 2016). To implement (3), however, the server has to communicate with all workers to obtain fresh
. This prevents the efficient implementation of SGD in scenarios where communication between the server and the workers is costly (McMahan et al., 2017), since latency will degrade the overall performance. Therefore, our goal is to find the parameter
that minimizes (2) with minimal communication overhead.
1.1 Our approach
This paper puts forward a class of new stochastic optimization methods that can considerably reduce the redundant or less informative communication of SGD. The key motivation is that during the distributed learning process, not all communication rounds between the server and the worker are equally important. So a natural solution is to use a condition that decides whether to communicate or not. In this case, if some workers are not communicating, the server uses their stale gradients so that stale gradients replace skipped fresh gradients.
Analogous to the distributed implementation of SGD (3), our new approaches also aggregate the stochastic gradients from all workers, but in a fairly lazy manner. Hence, we term our algorithms as Lazily Aggregated Stochastic Gradient (LASG). LASG has the following generic update
or equivalently (see also Figure 1)
where :=
) is the stochastic gradient innovation,
0 is the staleness of the gradient from worker m used by k-iteration, and
is the subset of workers uploading
at iteration k. The stalenesses
are controlled by the selection of subset
: at iteration k, if worker
, the server increases the staleness by
+ 1; otherwise, worker m uploads the stochastic gradient, and the server resets
Clearly, selection of subset is critical in LASG. However, the challenges are 1) the importance of each communication round is dynamic, thus a fixed or nonadaptive condition is ineffective; and 2) the condition needs to be checked either at server or at worker locally and efficiently.
To overcome these challenges, we develop two types of adaptive conditions based on message innovations. They can be chosen under different communication, computation and memory requirements. The first type is adopted by each worker (WK), and the second one by the server (PS). LASG-WK: At iteration k, the server broadcasts to all workers; every worker computes
and checks if it belongs to
; only the workers in
; the server updates via (5). LASG-PS: At iteration k, the server decides
to workers in
; each worker
computes
) and uploads
; the rest of workers do nothing; the server updates via (5);
With detailed description of LASG rules deferred to Section 3, the contributions of this paper are listed as follows.
1) We introduce a class of novel (quantized) stochastic optimization approaches that reuses stale stochastic gradients to reduce redundant communication.
2) We establish convergence of our proposed algorithms in strongly convex and nonconvex settings even when the datasets are non-i.i.d. across workers. The convergence rates match those of SGD in the respective settings.
3) We confirm performance gains of our novel distributed algorithms over some alternatives using extensive numerical tests on logistic regression and neural network training.
1.2 Related work
Communication-efficient distributed learning methods have gained popularity recently (Nedić et al., 2018; Jordan et al., 2018). Most popular methods belong to two categories: c1) reduce the number of bits per communication round; and, c2) save the number of communication rounds. For c1), methods are centered around the ideas of quantization and sparsification. Quantization has been successfully applied to several engineering tasks employing wireless sensor networks (Msechu and Giannakis, 2011). In the context of distributed machine learning, a 1-bit and multi-bits quantization methods have been developed in (Seide et al., 2014; Bernstein et al., 2018; Alistarh et al., 2017; Magnússon et al., 2019). Other variants of quantized gradient schemes include error compensation (Wu et al., 2018), variance-reduced quantization (Zhang et al., 2017), and quantization to a ternary vector (Wen et al., 2017). Sparsification amounts to transmitting only gradient coordinates with large enough magnitudes exceeding a certain threshold (Strom, 2015; Aji and Heafield, 2017). To avoid losing information of skipping communication, small gradient components will be accumulated and then transmitted when they are large enough (Lin et al., 2018; Stich et al., 2018; Alistarh et al., 2018). Recently, randomized sparsification approaches have also been developed in (Wangni et al., 2018; Wang et al., 2018).
However, both quantization and sparsification aim to resolve c1). For exchanging messages, e.g., the p-dimensional or its gradient, other latencies (initiating communication links, queueing, and propagating the message) are at least comparable to the message size-dependent transmission latency (Peterson and Davie, 2007). This motivates c2) reducing the number of communication rounds. Periodic communication. In contrast to the gradient compression schemes, schemes that reduce the number of communication rounds have also been developed, including the periodic averaging techniques, e.g., local stochastic gradient descent (a.k.a. local SGD) (Lin et al., 2019; Stich, 2019; Wang and Joshi, 2018; Yu and Jin, 2019; Yu et al., 2019). In local SGD, workers are allowed to perform local model updates independently and the resultant models are averaged periodically. In this way communication frequency is reduced. The caveat is that most local SGD methods have performance guarantee in the homogeneous settings, where the data are independent and identically distributed over all workers. However, this assumption rarely holds in federated learning (McMahan et al., 2017). Intermittent communication. Different from periodic communication used in local SGD, adaptive uploading techniques have been studied in e.g., lazily aggregated gradient (LAG) approaches (Chen et al., 2018; Sun et al., 2019). LAG is tailored for the heterogeneous learning settings, and has provable performance gain when the data distributions vary across workers. Models in LAG are updated at the server, and workers only adaptively upload information that is determined to be informative enough.
Figure 2: Number of uploads in per epoch (10 iterations) under stochastic LAG-WK and LASG-WK2.
Unfortunately, while the original LAG has good performance in the deterministic settings (e.g., with full gradient), its performance is significantly degraded in the stochastic settings. Recent efforts have been made towards adaptive uploading in stochastic settings (Li et al., 2019), but the proposed scheme therein requires an exponentially increasing batch size, which is not favorable in practice. In contrast, the LASG approaches can be viewed as the stochastic counterparts of the LAG, and our adaptive communication rules are new and tailored for SGD, which do not require the increasing batch size.
Our LASG approaches are closely related to the recently developed LAG method (Chen et al., 2018). In this section, we revisit LAG and provide insights why it does not work well in stochastic settings.
Because not every communication round is equally important during the learning process, LAG only admits useful communication, and otherwise, reuses stale information. Instead of communicating with all M workers as SGD in (3), the direct (or “naive") stochastic version of LAG (specifically LAG-WK) selects the subset of workers to obtain their fresh stochastic gradients
,
. The direct stochastic LAG also follows the generic update (4), but it selects
as: if worker m finds the innovation of the fresh stochastic gradient
) is small such that it satisfies (with pre-defined
then we reuse the old gradient, , and increase the staleness by
+ 1; otherwise, worker m uploads the stochastic gradient, and resets
In the deterministic setting, LAG condition (6) is motivated by the elegant “larger descent per upload" rationale, and has proved to be effective (Chen et al., 2018). Nevertheless, the observation here is that the two stochastic gradients (6) are evaluated on two different iterates (
and
)
and two different samples () thus two different loss functions. This is in contrast to the original LAG condition in (Chen et al., 2018) where the gradient innovation is evaluated on the same function. This subtle difference leads to significant degradation in performance.
Figure 2 compares the stochastic LAG and one of our new algorithms LASG-WK2 (introduced later), and demonstrates that the stochastic LAG is not effective in saving communication — when is set to be small (e.g., 0.4), (6) almost never satisfies; and when
is set to be large (e.g., 4), (6) satisfies initially, but stops satisfying later. This can be explained by expanding the left-hand-side
Table 1: A comparison of communication, computation and memory requirements. PS denotes the parameter server, WK denotes the worker, PSis the download from the server to the worker
is the upload from the worker m to the server.
(LHS) of (6) by (see the supplemental material for the deduction)
Even if the iterate converges, e.g.,
, and thus the right-hand-side (RHS) of (6)
0, the LHS of (6) does not, because the gradient variance appearing in (7a) and (7b) does not vanish yet the gradient difference at the same function (7c) diminishes.
Therefore, the key insight here is that the non-diminishing variance of stochastic gradients makes the direct implementation of the LAG rule (6) ineffective eventually.
In this section, we formally develop our LASG method, and present the intuition behind its design.
While the updates of stochastic LAG and LASG (4) look identical, the choice of in them is very different. To overcome the limitations of LAG in stochastic settings, the key of the LASG design is to reduce the variance of the innovation measure appeared in the adaptive condition.
Towards this goal, we develop two types of LASG rules to select . The first type is adopted by each worker that uses the gradient difference as the innovation measure but a variance-reduced gradient difference; and the second one by the parameter server that uses the model difference as the innovation measure, but uses a sequence of diminishing stepsizes to control variance.
3.1 Worker LASG: save communication uploads
We first introduce two LASG variants that use variance-reduced rules to check gradient innovation at the worker side. The first one that we term LASG-WK1 will reuse the old gradient of worker m at iteration k if it satisfies
where ˜:=
(˜
) is the stochastic gradient difference at a common sample
, ˜
(˜
) is the stochastic gradient difference at a common sample
, and ˜
is a snapshot of the previous iterate that will be updated every D iterations. If (8) is
Table 2: A comparison of LASG-WK1 and LASG-WK2.
satisfied, the staleness increases by + 1; otherwise, worker m uploads the fresh stochastic gradient, and resets staleness as
The rationale of (8) follows next. In contrast to the non-vanishing variance in the LAG-WK rule (see (7)), the LASG-WK1 rule (8) reduces its inherent variance. To see this, we can decompose the LHS of (8) as the difference of two variance reduced stochastic gradients at iteration k and . Using the stochastic gradient in SVRG as an example (Johnson and Zhang, 2013), the innovation can be written as
Define the minimizer of (2) as and assume that
) is ¯L-Lipschitz continuous for any
. The expectation of the LHS of (8) can be upper-bounded by
If the iterate converges, e.g.,
, the RHS of (9) diminishes, and thus the LHS of (8) diminishes. This is in contrast to the stochastic LAG-WK rule in (7) that is lower-bounded by a non-diminishing value.
if it satisfies
If (10) is satisfied, the server will use the stale stochastic gradient ) for worker m, and the staleness increases by
+ 1; otherwise, worker m uploads the fresh stochastic gradient, and resets the staleness as
= 1. Notice that different from the naive LAG-WK (6), the LASG condition (10) is evaluated at two different iterates but on the same sample
Similar to LASG-WK1, the LASG-WK2 rule (10) also reduces its inherent variance, since the LHS of (10) can be written as the difference between a variance reduced stochastic gradient and a deterministic gradient, that is
With derivations deferred to the supplementary document, similar to (9) we can also conclude that 0 as the iterate
3.2 Server LASG: save up/downloads and calculations
Besides the worker-side conditions, we next introduce two LASG variants that use variance-reduced rules to check model innovation at the server side, both of which do not even need to broadcast the current models. The rationale is that if the model difference is small, the gradient difference used in Section 3.1 is likely to be small.
The first one that we term LASG-PS will reuse the old gradient of worker m, given that the old parameter that worker m used for computing the last stochastic gradient satisfies
where is the smoothness constant of
). Condition (12) can be checked at the server side without computing new gradients if the server stores
that all workers used for computing the most recent stochastic gradients.
The LHS of (12) can be upper-bounded in expectation by
If the iterate does not diverge so that
is bounded, then the diminishing stepsizes
ensure that the second and third terms in the RHS of (13) vanish. Using mathematical induction, the LHS of (12) also diminishes. Therefore, similar to the variance-reduced gradient difference used in LASG-WK, the diminishing stepsizes can also make the LASG-PS condition effective asymptotically.
In many problems, however, is not always available, or, hard to compute. To resolve this issue, we develop LASG-PSE, a variation of LASG-PS that estimates
“on-the-fly.” With ˆ
denoting the estimate of
will reuse the old gradient of worker m if it satisfies
where the estimated constant ˆis updated iteratively via
We summarize LASG-PS and LASG-PSE in Algorithms 3 and 4, and compare our four LASG variants in Table 1.
Table 3: A comparison of LASG-PS and LASG-PSE.
Comparison of all LASG variants. Comparing WK conditions with PS conditions in Table 1, LASG-PS and LASG-PSE need extra memory at the server side but save both local computation and download communication, while LASG-WK1 and LASG-WK2 save only upload communication. Between the two WK conditions, LASG-WK1 is more conservative as LASG-WK1 measures the change of gradients at two model states for both new and old data samples but LASG-WK2 measures only the change of gradient at the new sample. Between the two PS conditions, LASG-PSE is more flexible since it does not require the knowledge of local smoothness . Given specific communication, computation, and memory requirements, we can flexibly choose different LASG.
3.3 Quantized LASG: Further save communication bits
The four LASG variants save the number of communication rounds. To further reduce communication bits per round, we combine LASG with various quantization mechanisms. With the stochastic gradient ), we define the gradient under a quantization operator Q as
We adopt the stochastic quantization scheme in (Alistarh et al., 2017) and develop the quantized LASG as
where is determined by one out of four rules (8)-(14). We term the quantized LASG as LAQSG.
In this section we present the convergence results of LASG-WK1, LASG-WK2 and LASG-PS in both the nonconvex and strongly convex cases, and the convergence results of LAQSG in the nonconvex case. Due to the technical reasons, we leave the analysis of LASG-PSE for future work, but it empirically has very impressive performance.
First, we make some basic assumptions, which are standard in analyzing SGD and its variants (Ghadimi and Lan, 2013; Johnson and Zhang, 2013; Alistarh et al., 2017).
Assumption 1 The loss function L is smooth with L > 0.
Assumption 2 The samples are independent, and the stochastic gradient
For LASG-PS, we require an extra smoothness assumption.
Assumption 3 The local gradient -Lipschitz continuous, i.e. for any
With these assumptions, LASG will yield descent of
Lemma 1 Under Assumptions 1, 2 and 3, generated by Algorithms 1, 2 and 3 satisfy
Note that all the terms on the right hand side of the inequality (20) show up in SGD analysis except , which exists due to stale information. To deal with this term, we introduce the following Lyapunov function:
where are constants to be determined later. The following lemma is a direct application of Lemma 1.
Lemma 2 Under Assumptions 1 and 2, there exist constants
The constants and
depend on stepsize
and
. Their expressions are specified in the supplementary materials. By choosing proper
and
, we are able to ensure the convergence of LASG.
We first present the convergence in nonconvex case.
Theorem 3 (nonconvex) Under Assumptions 1, 2 (for Algorithm 3 also Assumption 3), if =
generated by Algorithms 1-3 satisfy
Next we present the convergence results under the following strong convexity assumption on
Assumption 4 The overall loss -strongly convex.
Figure 3: Logistic regression on covtype dataset in the heterogeneous setting
Figure 4: Training Neural network on mnist dataset in the heterogeneous setting.
Parallel to the sublinear convergence of SGD in the strongly convex case, e.g., (Rakhlin et al., 2011), LASG algorithms achieve the O(1/K) order of convergence.
Theorem 4 (strongly convex) Under Assumption 1,2,4 (for Algorithm 3 only, also Assumption 3), if for a given constant
generated by Algorithms 1, 2 and 3 satisfies
For the convergence of LAQSG algorithms, we make the following additional assumption that guarantees the bounded variance of the quantized stochastic gradient.
Assumption 5 The gradient is bounded as
Based on this assumption, we have the following result.
Theorem 5 (LAQSG) Under Assumptions 1, 2, 5 (also Assumption 3 for Algorithm 3), if =
),
where
0 is a constant, then
generated by quantized Algorithms 1 - 3 satisfy
Numerical tests have been conducted on both logistic regression and neural network models. We benchmark LA(Q)SG with SGD, LAG-WK, local SGD and QSGD. For local SGD (Lin et al., 2019), workers perform SGD independently to update local then are averaged
Figure 5: Simulations on mnist dataset averaged over 30 trials.
Figure 6: Simulations on tiny imagenet dataset.
over all workers every H iterations. In simulations, we did a grid search for SGD learning rates. We consider the heterogeneous setting where data with same labels are unevenly assigned to M workers. Logistic regression on ijcnn1, MNIST and covtype. The data are distributed across M = 10 workers for ijcnn1, MNIST (with digits 3, 5) and M = 20 for Covtype. For each worker, the batch size is selected to be 0.01 of the local data size for ijcnn1, MNIST and 0.001 for Covtype. The -regularization parameter is set to be 10
. We choose stepsize
= 0.1. For all LASG algorithms, D = 100 and
= 0
for d = 1, 2, . . . 10 and
= 0 for d = 11, . . . 100. For local-SGD, the communication period is H = 50, 10, 20 iterations for ijcnn1, MNIST, Covtype respectively. This is optimized to save communication as much as possible without largely affecting the convergence speed. For quantization methods, we perform 4-bit stochastic quantization (Alistarh et al., 2017). Numerical results are reported in Figure 3 and in Tables 4, 5. Performance averaged over multiple trails has also been reported in Figures 5a and 5b. Supplementary materials have additional tests in Figures 8-10 and Tables 6 and 7.
Neural network. We train a convolutional neural network with two convolution-ELU-maxpooling layers (ELU is a smoothed ReLU) followed by two fully-connected layers for 10 classes classification
Table 4: Loss after 10communication rounds for logistic regression (LR) and neural network (NN) in heterogeneous setting.
Table 5: Loss after 10, 10
, 10
bits of uploads for LR on ijcnn1, mnist, covtype, and 10
for NN in heterogeneous setting.
on MNIST. The data are distributed on M = 10 workers. We choose stepsize = 0.05. Since the objective function is nonsmooth (
is not available), LASG-PS is not considered in this test. For all LASG algorithms, we set D = 50,
= 0
for d = 1, 2, . . . 10 and
= 0 for d = 11, . . . 50. For local-SGD, we set the communication period to be 4. For all quantization methods, we perform 8-bit stochastic quantization. Numerical results are reported in Figure 4 and listed in Tables 4 and 5. Additional results can be found in Figures 11-15 and Tables 6, 7 in the supplementary materials.
All algorithms have been tested on the popular tiny imagenet dataset using the Resnet18 model initialized by weights pretrained on ImageNet1000; see the accuracy versus total time (communication and computation) in Figures 6a and 6b. For training loss, LASG-WK1 and -WK2 require much less total time than SGD and local SGD with H = 2, but slightly more than local SGD with H = 4 and 6. However, as shown in Figure 6b, local SGD with larger communication period sacrifices the testing accuracy by 3-4%.
Figure 7: Logistic regression on covtype dataset in the heterogeneous setting.
In our numerical tests, all LASG algorithms achieve the same iteration complexity as SGD and outperform local-SGD in most cases. Compared with SGD, LASG-WK2 and LASG-PSE reduce the number of communication rounds by around one order of magnitude for neural network training and even more for logistic regression. LASG-WK1 also reduce the communication by more than one order of magnitude for logistic regression. Based on the results of LAG-WK and QSGD, it is evident that the selection rules (8), (10) and (14) of LASG-WK1, LASG-WK2 and LASG-PSE achieve more significant improvement in terms of saving communication and bits than the selection rule (6) of LAG-WK and stochastic quantization strategy of QSGD.
Although the performance of LASG-PS is not as impressive as other LASG algorithms in saving communication, it considerably saves local computation compared with other algorithms except LASGPSE as shown in Figure 7. Moreover, LASG-PSE has performance gains in both saving communication and local computation. The performance of LAQSG validates that the LASG algorithms can be easily equipped with stochastic quantization with additional benefits from quantization.
In this paper, we developed a class of LASG methods as communication-efficient variants of SGD. LASG methods leverage a set of adaptive communication rules to detect and then skip less informative or redundant communication rounds between the server and workers during distributed learning. To further reduce communication bandwidth, the quantized version of LASG is also presented. Both LASG and their quantized version are simple to implement, and have convergence rate comparable to the original SGD. Extensions to nonsmooth and decentralized settings are also in our research agenda.
Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proc. Conf. Empirical Methods Natural Language Process., pages 440–445, Copenhagen, Denmark, Sep 2017.
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication- efficient SGD via gradient quantization and encoding. In Proc. Advances in Neural Info. Process. Syst., pages 1709–1720, Long Beach, CA, Dec 2017.
Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Proc. Advances in Neural Info. Process. Syst., pages 5973–5983, Montreal, Canada, Dec 2018.
Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SignSGD: Compressed optimisation for non-convex problems. In Proc. Intl. Conf. Machine Learn., pages 559–568, Stockholm, Sweden, Jul 2018.
Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint:1606.04838, June 2016.
Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Proc. Advances in Neural Info. Process. Syst., pages 5050–5060, Montreal, Canada, Dec 2018.
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Proc. Advances in Neural Info. Process. Syst., pages 1223–1231, Lake Tahoe, NV, 2012.
Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. J. American Statistical Association, to appear, 2018.
Weiyu Li, Tianyi Chen, Liping Li, and Qing Ling. Communication-censored distributed stochastic gradient descent. arXiv preprint:1909.03631, September 2019.
Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint:1808.07217v5, Jun 2019.
Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint:1712.01887v2, Feb 2018.
Sindri Magnússon, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributed learning and optimization under limited communication. arXiv preprint arXiv:1902.11163, 2019.
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. Intl. Conf. Artificial Intell. and Stat., pages 1273–1282, Fort Lauderdale, FL, April 2017.
Eric J Msechu and Georgios B Giannakis. Sensor-centric data reduction for estimation with WSNs via censoring and quantization. IEEE Trans. Sig. Proc., 60(1):400–414, Jan 2011.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Trans. Automat. Control, 54(1):48–61, January 2009.
Angelia Nedić, Alex Olshevsky, and Michael Rabbat. Network topology and communicationcomputation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, May 2018.
Larry L Peterson and Bruce S Davie. Computer Networks: A Systems Approach. Morgan Kaufman, Burlington, MA, 2007.
Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Proc. Conf. Intl. Speech Comm. Assoc., Singapore, Sept 2014.
Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proc. Advances in Neural Info. Process. Syst., pages 4447–4458, Montreal, Canada, Dec 2018.
Sebastian Urban Stich. Local sgd converges fast and communicates little. In ICLR 2019 International Conference on Learning Representations, number CONF, New Orleans, May 2019.
Nikko Strom. Scalable distributed DNN training using commodity gpu cloud computing. In Proc. Conf. Intl. Speech Comm. Assoc., Dresden, Germany, Sept 2015.
Jun Sun, Tianyi Chen, Georgios Giannakis, and Zaiyue Yang. Communication-efficient distributed learning via lazily aggregated quantized gradients. In Proc. Advances in Neural Info. Process. Syst., page to appear, Vancouver, Canada, Dec 2019.
Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. Atomo: Communication-efficient learning via atomic sparsification. In Proc. Advances in Neural Info. Process. Syst., pages 9850–9861, Montreal, Canada, Dec 2018.
Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint:1808.07576, August 2018.
Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication- efficient distributed optimization. In Proc. Advances in Neural Info. Process. Syst., pages 1299–1309, Montreal, Canada, Dec 2018.
Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Proc. Advances in Neural Info. Process. Syst., pages 1509–1519, Long Beach, CA, Dec 2017.
Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.
Hao Yu and Rong Jin. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. In Proc. Intl. Conf. Machine Learn., Long Beach, CA, June 2019.
Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proc. AAAI Conf. Artificial Intell., volume 33, pages 5693–5700, 2019.
Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proc. Intl. Conf. Machine Learn., pages 4035–4043, Sydney, Australia, Aug 2017.
In this supplementary document, we first present some basic inequalities that will be used frequently in this document, and then present the missing derivations of some claims, as well as the proofs of all the lemmas and theorems in the paper, which is followed by details on our experiments. The content of this supplementary document is summarized as follows.
Before starting the proof, we introduce our notation. First let ˆNote that ˆ
, LASG’s update can be simplified to
Define the -algebra Θ
=
:
. We also let
=
, which allows us to express our algorithms conveniently. Some basic facts used in the proof are reviewed as follows.
Fact 1. Assume that are independent random variables, and
=
Fact 2. (Young’s inequality) For any
As a consequence, we have
Fact 3. (Cauchy-Schwarz inequality) For any
Fact 4. For
where (1a) holds due to
(1b) is a direct application of the Young’s inequality (28), and (1c) is a result of applying the Cauchy-Schwarz inequality (30) to , (27), and Assumption 2 to
Similar to (31), it can be verified that
The analysis in this part is analogous to that in (Ghadimi and Lan, 2013). Before the proof, we define an auxiliary function,
where is a global minimizer of L. And we assume that
-Lipschitz continuous for any
Take expectation with respect to and we can obtain
Note that -Lipschitz continuous and thus
By (30), we can derive that
As a consequence,
≥12
=12
where we used the fact that = 0 to obtain (7) since
Recall that
And by (30),
=4¯
By nonnegativity of
Similarly, we can prove 4¯
)) + 4¯L(EL(˜
)). Therefore, it follows that
Similar to the proof of (9), we can obtain
Combined with
we have
From the LASG update, we have
Then the LASG-PS condition (12) implies that
Due to the smoothness of L in Assumption 1, we have
We decompose as follows,
By taking expectation and substituting in (33), we obtain
We analyze separately for different rules. First for LASG-WK1’s rule (8),
where (2a) is due to the definition of and (2b) is obtained by (8), (28) with
, and (31) with
= ˜
. Note that the definition of ˜
in Algorithm 1 implies
. Similarly, for
LASG-WK2’s rule (10), we apply (28) with and (31) with
For LASG-PS’s rule (12), apply
Now we deal with . For LASG-WK1,
For LASG-WK2,
For LASG-PS,
Plug back into (34) and get
Solve the linear equations above and get
Select
Specially, if we choose constant stepsize
where 0 is a constant, then
By strong convexity of
Then (36) can be rewritten as
We will choose
Solve the linear equations above and get
and
In this section we prove the convergence in Theorem 5. Let to denote the expectation with respect to the stochastic quantization Q. As a results of [Lemma 3.1, Alistarh et al. (2017)] and Assumption 5,
b-bit quantized gradients (1 bit for sign) have the following properties ,
Following the proof of Lemma 1, we can get
where
where (3a) is obtained by an approach similar to (31), and
Therefore,
Similar to Lemma 2,
Let ¯and choose
Solve the linear equations above and get
If we choose constant stepsize
then
The additional numerical results in this section include both homogeneous and heterogeneous setting. Homogeneous: Data samples are shuffled and uniformly partitioned to M workers. Heterogeneous: Data samples with same labels are unevenly partitioned and assigned to M workers.
Table 6: Objective value after 1000 and 10000 communication rounds for logistic regression (LR) and neural network (NN) respectively in the homogeneous setting.
Table 7: Objective value after 1e5, 1e6, 1e6 bits of uploads for logistic regression on ijcnn1, mnist, covtype, and 1e8 for neural network in the homogeneous setting.
Figure 8: Logistic regression on ijcnn1 in the homogeneous setting.
Figure 9: Logistic regression on mnist digits 3 and 5 in the homogeneous setting
Figure 10: Logistic regression on dataset covtype in the homogeneous setting
Figure 11: Neural network on dataset mnist in the homogeneous setting
Figure 12: Neural network on dataset mnist in the homogeneous setting
Figure 13: Logistic regression on ijcnn1 in the heterogeneous setting
Figure 14: Logistic regression on mnist digits 3 and 5 in the heterogeneous setting
Figure 15: Neural network on dataset mnist in the heterogeneous setting