Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization

2020·Arxiv

Abstract

Abstract

Due to the high communication cost in distributed and federated learning problems, methods relying on compression of communicated messages are becoming increasingly popular. While in other contexts the best performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of iterations, there are no methods which combine the benefits of both gradient compression and acceleration. In this paper, we remedy this situation and propose the first accelerated compressed gradient descent (ACGD) methods. In the single machine regime, we prove that ACGD enjoys the rate for -strongly con-

vex problems and for convex problems, respectively, where is the compression parameter. Our results improve upon the existing non-accelerated rates

and , respectively, and recover the optimal rates of accelerated gradient descent as a special case when no compression () is applied. We further propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate

where n is the number of devices/workers and hides the logarithmic factor . This improves upon the previous best result

achieved by the DIANA method of Mishchenko et al. (2019). Finally, we conduct several experiments on real-world datasets which corroborate our theoretical results and confirm the practical superiority of our accelerated methods.

1. Introduction

With the proliferation of edge devices such as mobile phones, wearables and smart home devices comes an increase in the amount of data rich in potential information which can be mined for the benefit of the users. One of the approaches of turning the raw data into information is via federated learning (Koneˇcn´y et al., 2016; McMahan et al., 2017), where typically a single global supervised model is trained in a massively distributed manner over a network of heterogeneous devices.

Training supervised federated learning models is typically performed by solving an optimization problem of the form

where is smooth loss associated with data stored on device is a relatively simple but possibly nonsmooth regularizer.

In distributed learning in general, and federated learning in particular, communication of messages across a network forms the bottleneck of the training system. It is thus very important to devise novel strategies for reducing the number of communication rounds. Two of the most common strategies are i) local computations (Ma et al., 2017; Stich, 2019; Khaled et al., 2020) and ii) communication compression (Seide et al., 2014; Alistarh et al., 2017; Wangni et al., 2018; Horv´ath et al., 2019a). The former is used to perform more local computations on each device before communication and subsequent model averaging, hoping that this will reduce the total number of communications. The latter is used to reduce the size of communicated messages, saving precious time spent in each communication round, and hoping that this will not increase the total number of communications.

1.1. Theoretical inefficiency of local methods

Despite their practical success, local methods are poorly understood and there is much to be discovered. For instance, there exist no theoretical results which would suggest that any local method (e.g., local gradient descent (GD) or local SGD) can achieve better communication complexity than its standard non-local variant (e.g., GD, SGD). In fact, until recently, no complexity results existed for local SGD in environments with heterogeneous data (Khaled et al., 2019; 2020), a key regime in federated learning settings (Li et al., 2019). In the important regime when all participating devices compute full gradients based on their local data, the recently proposed stochastic controlled averaging (SCAFFOLD) method (Karimireddy et al., 2019) offers no improvement on the number of communication as the number of local steps grows despite the fact that this is a rather elaborate method combining local stochastic gradient descent with control variates for reducing the model drift among clients.

1.2. Methods with compressed communication

However, the situation is much brighter with methods employing communication compression. Indeed, several recent theoretical results suggest that by combining an appropriate (typically randomized) compression operator with a suitably designed gradient-type method, one can obtain improvement in the the total communication complexity over comparable baselines not performing any compression. For instance, this is the case for distributed compressed gradient descent (CGD) (Alistarh et al., 2017; Khirirat et al., 2018; Horv´ath et al., 2019a; Li & Richt´arik, 2020) and distributed CGD methods which employ variance reduction to tame the variance introduced by compression (Hanzely et al., 2018; Mishchenko et al., 2019; Horv´ath et al., 2019b; Hanzely & Richt´arik, 2019b; Li & Richt´arik, 2020).

While in the case of CGD compression leads to a decrease in the size of communicated messages per communication round, it leads to an increase in the number of communications. Yet, certain compression operators, such as natural dithering (Horv´ath et al., 2019a), were shown to be better than no compression in terms of the overall communication complexity.

The variance-reduced CGD method DIANA (Mishchenko et al., 2019; Horv´ath et al., 2019b) enjoys even better behavior: the number of communication rounds for this method is unaffected up to a certain level of compression when the variance induced by compression is smaller than a certain threshold. This threshold can be very large in practice, which means that massive reduction is often possible in the number of communicated bits without this having any adverse effect on the number of communication rounds.

Recall that a function is L-smooth or has L-Lipschitz continuous gradient (for L > 0) if

and -strongly convex (for

for all . The case corresponds to the standard convexity.

In particular, for L-smooth and -strongly convex f with n machines, DIANA enjoys the iteration bound is the condition num- ber and is the compression parameter (see Definition 1). If , which corresponds to no compression, DIANA recovers the rate of gradient descent. On the

other hand, as long as , the rate is still

, which shows that DIANA is able to retain the same number of communication rounds as gradient descent and yet save on bit transmission in each round. The higher is allowed to be, the more compression can be applied.

2. Contributions

Discouraged by the lack of theoretical results suggesting that local methods indeed help to reduce the number of communications, and encouraged by the theoretical success of CGD methods, in this paper we seek to enhance CGD methods with a mechanism which, unlike local updating, can provably lead to a decrease of the number of communication rounds.

What mechanism could achieve further improvements?

In the world of deterministic gradient methods, one technique for such a reduction is well known: Nesterov acceleration / momentum (Nesterov, 1983; 2004). In case of stochastic gradient methods, the accelerated method Katyusha (Allen-Zhu, 2017) achieves the optimal rate for strongly convex problems, and the unified accelerated method Varag (Lan et al., 2019) achieves the optimal rates for convex problems regardless of the strong convexity. See also (Kovalev et al., 2020; Qian et al., 2019) for some enhancements. Essentially all state-of-the-art methods for training deep learning models, including Adam (Kingma & Ba, 2014), rely on the use of momentum/acceleration in one form or another, albeit lacking in theoretical support.

However, the successful combination of gradient compression and acceleration/momentum has so far remained elusive, and to the best of our knowledge, no algorithms supported with theoretical results exist in this space. Given the omnipresence of momentum in modern machine learning, this is surprising.

We now summarize our key contributions:

2.1. First combination of gradient compression and acceleration

We develop the first gradient-type optimization methods provably combining the benefits of gradient compression

Table 1. Convergence results for the special case with n = 1 device (i.e., problem (4))

Table 2. Convergence results for the general case with n devices (i.e., problem (1)). Our results are always better than previous results.

and acceleration: i) ACGD (Algorithm 1) in the single device case, and ii) ADIANA (Algorithm 2) in the distributed case.

2.2. Single device setting

We first study the single-device setting, and design an accelerated CGD method (ACGD - Algorithm 1) for solving the unconstrained smooth minimization problem

in the regimes when f is L-smooth and i) -strongly convex, and ii) convex. Our theoretical results are summarized in Table 1. In the strongly convex case, we improve the complexity of CGD (Khirirat et al., 2018) from

ment is from where is the compression parameter (see Definition 1).

2.3. Distributed setting

We further study the distributed setting with n devices/nodes and focus on problem (1) in its full generality, i.e.,

The presence of multiple nodes (n > 1) and of the regularizer poses additional challenges. In order to address them, we need to not only combine acceleration and compression, but also introduce a DIANA-like variance reduction mechanism to remove the variance introduced by the compression operators.

In particular, we have developed an accelerated variant of the DIANA method for solving the general problem (1), which we call ADIANA (Algorithm 2). The comparison of complexity results between ADIANA and DIANA is summarized in Table 2.

Note that our results always improve upon the non-accelerated DIANA method. Indeed, in the regime when the compression parameter is larger than the number of nodes n, we improve the DIANA rate

On the other hand, in the regime when , we improve the DIANA rate to

since

Note that if , which is more often true in federated learning as the number of devices in federated learning is typically very large, our ADIANA result reduces to In particular, if

O

; the same as that of non-compressed accelerated gradient descent (AGD) (Nesterov, 2004). It means that ADIANA benefits from cheaper communication due to compression for free without hurting the convergence rate (i.e., the communication rounds are the same), and is therefore better suited for federated optimization.

3. Randomized Compression Operators

We now introduce the notion of a randomized compression operator which is used to compress the gradients.

Definition 1 (Compression operator) A randomized map -compression operator if

In particular, no compression (

Note that the conditions (5) require the compression operator to be unbiased and its variance uniformly bounded by a relative magnitude of the vector which we are compressing.

3.1. Examples

We now give a few examples of randomized compression operators without attempting to be exhaustive.

where denotes the Hadamard (element-wise) product and is a uniformly random binary vector with k nonzero entries (). This random-k sparsifica-tion operator C satisfies (5) with . Indeed, no compression

Example 2 (Quantization): Given , the (p, s)-quantization operator is defined by

where is a random vector with i-th element

where the level l satisfies . The probability

is chosen so that )-quantization operator C satisfies (5) with . In particular, QSGD (Alistarh et al., 2017) used p = 2 (i.e., (2, s)-quantization) and proved that the expected sparsity of C(x) is

4. Accelerated CGD: Single Machine

In this section, we study the special case of problem (1) with a single machine (n = 1) and no regularizer (),

4.1. The CGD algorithm

First, we recall the update step in compressed gradient descent (CGD) method, i.e.,

where C is a kind of -compression operator defined in Definition 1.

As mentioned earlier, convergence results of CGD are for strongly convex problems and

convergence proof for strongly convex problems, i.e., , can be found in (Khirirat et al., 2018). For completeness, we now establish a convergence result for convex problems, i.e., since we did not find it in the literature.

is at most

4.2. The ACGD algorithm

Note that in the non-compressed case (i.e., CGD is reduced to standard GD), there exists methods for obtaining accelerated convergence rates of and

for strongly convex and convex problems, respectively. However, no accelerated convergence results exist for CGD methods. Inspired by Nesterov’s accelerated gradient descent (AGD) method (Nesterov, 2004) and FISTA (Beck & Teboulle, 2009), we propose the first accelerated compressed gradient descent (ACGD) method, described in Algorithm 1.

4.3. Convergence theory

Our accelerated convergence results for ACGD (Algorithm 1) are stated in Theorems 2 and 3, formulated next.

Theorem 2 (ACGD: convex case) Let f be convex with L-Lipschitz continuous gradient and let the compression operator C satisfy (5). Choose the parameters in ACGD (Algorithm 1) as follows:

Then the number of iterations performed by ACGD to find an -solution such that

is at most

Theorem 3 (ACGD: strongly convex case) Let f be -strongly convex with L-Lipschitz continuous gradient and let the compression operator C satisfy (5).Choose the parameters in ACGD (Algorithm 1) as follows:

Then the number of iterations performed by ACGD to find an -solution such that

(or ) is at most

In the non-compressed case (i.e., ), our results recover the standard optimal rates of accelerated gradient descent. Further, if we consider the random-k spar-sification compression operator, ACGD can be seen as a variant of accelerated randomized coordinate descent (Nes- terov, 2012). Our results recover the optimal results of accelerated randomized coordinate descent method (Allen- Zhu et al., 2016; Hanzely & Richt´arik, 2019a) under the same standard smoothness assumptions.

4.4. Proof sketch

The following lemma which demonstrates improvement in one iteration plays a key role in our analysis.

Lemma 1 If parameters and the compression operator , then we have for any iteration k of ACGD, and for all

where the expectation is with respect to the randomness of compression operator sampled at iteration k.

The proof of Theorems 2 and 3 can be derived (i.e., plug into the specified parameters (and p) and collect all iterations) from Lemma 1. The detailed proofs can be found in the appendix.

5. Accelerated CGD: Distributed Setting

We now turn our attention to the general distributed case, i.e., problem (1):

The presence of multiple nodes (n > 1) and of the regularizer poses additional challenges.

5.1. The ADIANA algorithm

We now propose an accelerated algorithm for solving problem (1). Our method combines both acceleration and variance reduction, and hence can be seen as an accelerated version of DIANA (Mishchenko et al., 2019; Horv´ath et al., 2019b). Therefore, we call our method ADIANA (Algorithm 2). In this case, each machine/agent computes its local gradient (e.g., ) and a shifted version thereof is compressed and sent to the server. The server subsequently aggregates all received messages, to form a stochastic gradient estimator of , and then performs a proximal step. The shift terms are adaptively chang- ing throughout the iterative process, and have the role of reducing the variance introduced by compression. If no compression is used, we may simply set the shift terms to be

Our method was inspired by Mishchenko et al. (2019), who first studied variance reduction for CGD methods for a specific ternary compression operator, and Horv´ath et al. (2019b) who studied the general class of -compression operators we also study here. However, we had to make certain modifications to make variance-reduced compression work in the accelerated case since both of them were studied in the non-accelerated case. Besides, our method adopts a randomized update rule for the auxiliary vectors which simplifies the algorithm and analysis, resembling the workings of the loopless SVRG method proposed by Kovalev et al. (2020).

5.2. Convergence theory

Our main convergence result for ADIANA (Algorithm 2) is formulated in Theorem 4. We focus on the strongly convex setting.

Theorem 4 Suppose f is -strongly convex and that the functions -Lipschitz continuous gradient for all i. Further, let the compression operator C satisfy (5). Choose the ADIANA (Algorithm 2) parameters as follows:

Then the number of iterations performed by ADIANA to find an -solution such that

is at most

As we have explained in the introduction, the above rate is vastly superior to that of non-accelerated distributed CGD methods, including that of DIANA.

5.3. Proof sketch

In the proof, we use the following notation:

We first present a key technical lemma which plays a similar role to that of Lemma 1.

Lemma 2 If the parameters satisfy

and , then we have for any iteration k,

Theorem 4 can be proved by combing the above lemma with three additional Lemmas: Lemma 3, 4 and 5, which we present next. In view of the presence of following result is useful as it allows us to add into the Lyapunov function.

Lemma 3 According to Line 11 of Algorithm 2 and Defini-tion (7)–(8), we have

To cancel the term in (10), we use the defining property of compression operator (i.e., (5)) :

Lemma 4 If the compression operator C satisfies (5), we have

Note that the bound on variance obtained above introduces an additional term ). We will therefore add the terms into the Lyapunov function as well.

Lemma 5 If

Note that the terms and in Lemma 5 and (13) can be cancelled by (11) and (12) by choosing the parameters appropriately.

Finally, it is not hard to obtain the following key inequality for the Lyapunov function by plugging Lemmas 3-5 into our key Lemma 2:

Above, the constants are related to the algorithm parameters and p. Finally, the proof of Theorem 4 can be derived (i.e., plug into the specified parameters) from inequality (14). The detailed proof can be found in the appendix.

6. Experiments

In this section, we demonstrate the performance of our accelerated method ADIANA (Algorithm 2) and previous methods with different compression operators on the regularized logistic regression problem,

where are data samples.

Data sets. In our experiments we use four standard datasets, namely, a5a, mushrooms, a9a and w6a from the LIBSVM library. Some of the experiments are provided in the appendix.

Compression operators. We use three different compression operators: random sparsification (see e.g. (Stich et al., 2018)), random dithering (see e.g. (Alistarh et al., 2017)), and natural compression (see e.g. (Horv´ath et al., 2019a)). For random-r sparsification, the number of communicated bits per iteration is 32r, and we choose r = d/4. For random dithering, we choose , which means the number of communicated bits per iteration is 2.8d + 32 (Alistarh et al., 2017). For natural compression, the number of communicated bits per iteration is 2019a).

Parameter setting. In our experiments, we use the theoretical stepsize and parameters for all the three algorithms: vanilla distributed compressed gradient descent (DCGD), DIANA (Mishchenko et al., 2019), and our ADIANA (Algorithm 2). The default number of nodes/machines is 20 and the regularization parameter . The numerical results for different number of nodes can be found in the appendix. For the figures, we plot the relation of the optimality loss gap and the number of accumulated transmitted bits. The optimal value for each case is obtained by getting the minimum of three uncompressed versions of ADIANA, DIANA, and DCGD for 100000 iterations.

6.1. Comparison with DIANA and DCGD

In this subsection, we compare our ADIANA with DIANA and DCGD with three compression operators: random spar-sification, random dithering, and natural compression in Figures 1 and 2.

The experimental results indeed show that our ADIANA converges fastest for all three compressors, and natural compression uses the fewest communication bits than random dithering and random sparsification. Moreover, because the compression error of vanilla DCGD is nonzero in general, DCGD can only converge to the neighborhood of the optimal solution while DIANA and ADIANA can converge to the optimal solution.

6.2. Communication efficiency

Now, we compare our ADIANA and DIANA, with and without compression to show the communication efficiency of our accelerated method ADIANA in Figures 3 and 4.

According to the left top and left bottom of Figure 4, DIANA is better than its uncompressed version if the compression operator is random sparsification. However, ADIANA behaves worse than its uncompressed version. For random dithering (middle figures) and natural compression (right fig-ures), ADIANA is about twice faster than its uncompressed version, and is much faster than DIANA with/without compression. These numerical results indicate that ADIANA (which enjoys both acceleration and compression) could be a more practical communication efficiency method, i.e., acceleration (better than non-accelerated DIANA) and compression (better than the uncompressed version), especially for random dithering and natural compression.

Figure 1. The communication complexity of different methods for three different compressors (random sparsification, random dithering and natural compression) on the a5a dataset.

Figure 2. The communication complexity of different methods for three different compressors (random sparsification, random dithering and natural compression) on the mushrooms dataset.

Figure 3. The communication complexity of DIANA and ADIANA with and without compression on the a5a dataset.

Figure 4. The communication complexity of DIANA and ADIANA with and without compression on the mushrooms dataset.

Acknowledgements

The authors would like to acknowledge support from the KAUST Baseline Research Fund and the KAUST Visual Computing Center. Zhize Li and Xun Qian thank for support from the KAUST Extreme Computing Research Center.

References

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.

Allen-Zhu, Z., Qu, Z., Richt´arik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. In The 33rd International Conference on Machine Learning, pp. 1110–1119, 2016.

Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Hanzely, F. and Richt´arik, P. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019a.

Hanzely, F. and Richt´arik, P. One method to rule them all: variance reduction for data, parameters and many new methods. arXiv preprint arXiv:1905.11266, 2019b.

Hanzely, F., Mishchenko, K., and Richt´arik, P. SEGA: variance reduction via gradient sketching. In Advances in Neural Information Processing Systems 31, pp. 2082– 2093, 2018.

Horv´ath, S., Ho, C.-Y., ˇLudov´ıt Horv´ath, Sahu, A. N., Canini, M., and Richt´arik, P. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019a.

Horv´ath, S., Kovalev, D., Mishchenko, K., Stich, S., and Richt´arik, P. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019b.

Karimireddy, S., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378, 2019.

Khaled, A., Mishchenko, K., and Richt´arik, P. First analysis of local GD on heterogeneous data. In NeurIPS Workshop on Federated Learning for Data Privacy and Confidentiality, pp. 1–11, 2019.

Khaled, A., Mishchenko, K., and Richt´arik, P. Tighter theory for local SGD on identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.

Khirirat, S., Feyzmahdavian, H. R., and Johansson, M. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.

Kingma, D. P. and Ba, J. Adam: a method for stochastic optimization. In The 3rd International Conference on Learning Representations, 2014.

Koneˇcn´y, J., McMahan, H. B., Yu, F., Richt´arik, P., Suresh, A. T., and Bacon, D. Federated learning: strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.

Kovalev, D., Horv´ath, S., and Richt´arik, P. Dont jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.

Lan, G., Li, Z., and Zhou, Y. A unified variance-reduced accelerated gradient method for convex optimization. In Advances in Neural Information Processing Systems, pp. 10462–10472, 2019.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Feder- ated learning: challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.

Li, Z. and Richt´arik, P. A unified analysis of stochastic gradient methods for nonconvex federated optimization. arXiv preprint arXiv:2006.07013, 2020.

Ma, C., Koneˇcn´y, J., Jaggi, M., Smith, V., Jordan, M. I., Richt´arik, P., and Tak´aˇc, M. Distributed optimization with arbitrary local solvers. Optimization Methods and Software, 32(4):813–848, 2017.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Ag¨uera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Mishchenko, K., Gorbunov, E., Tak´aˇc, M., and Richt´arik, P. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

Nesterov, Y. A method for unconstrained convex minimiza- tion problem with the rate of convergence . In Doklady AN USSR, volume 269, pp. 543–547, 1983.

Nesterov, Y. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Qian, X., Qu, Z., and Richt´arik, P. L-SVRG and LKatyusha with arbitrary sampling. arXiv preprint arXiv:1906.01481, 2019.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochas- tic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Stich, S. U. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458, 2018.

Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient spar- sification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1306–1316, 2018.

Appendix

A. Missing Proofs

In this appendix, we provide the detailed proofs for our Theorems 1–4.

A.1. Proof of Theorem 1

We first restate our Theorem 1 here and then provide the detailed proof.

Theorem 1 Suppose f(x) is convex with L-Lipschitz continuous gradient and the compression operator Let step size , then the number of iterations performed by CGD to find an -solution such that

is at most

Proof: According to CGD update

where the last inequality holds due to convexity of f. Besides, according to L-smoothness of f (see (2)), we have

Adding (15) and ) to cancel the term

Summing up the above inequality from iteration 0 to k, we have

where the last inequality uses the L-smoothness of f. Finally, noting that i = 0, . . . k according to (16), then (18) turns to be

where the last equality uses the choice of step size . Now the proof of Theorem 1 is finished by letting the number of iteration , i.e., obtain the iterations.

A.2. Proof of Theorems 2 and 3

First, we restate our key Lemma 1 here. Then we use it to prove Theorems 2 and 3. Finally, we provide the proof for this key Lemma 1.

Lemma 1 If parameters and p satisfy and the compression operator , then we have for any iteration

where the expectation is with respect to the randomness of compression at iteration k.

Now, we recall our Theorem 2 here and are ready to prove it using Lemma 1.

Theorem 2 Suppose f(x) is convex with L-Lipschitz continuous gradient and the compression operator Let parameters and , then the number of iterations performed by ACGD (Algorithm 1) to find an -solution such that is at most

Proof of Theorem 2. First, we know in this general convex case. By choosing step size turns to be

Summing up the above inequality from iteration and letting

where the second inequality uses . Now the proof of Theorem 2 is finished, i.e., we obtain the

Similarly, we recall our Theorem 3 here and also prove it using Lemma 1.

Theorem 3 Suppose f(x) is -strongly convex with L-Lipschitz continuous gradient and the compression operator

satisfies (5). Let parameters

performed by ACGD (Algorithm 1) to find an -solution such that ) is at most

Proof of Theorem 3. By choosing step size and , Lemma 1

turns to be

Telescoping the above inequality from iteration 0 to k and letting

where the first inequality uses -strongly convex of f (see (3)) and . Now the proof of Theorem 3 is finished, i.e., we obtain the -solution (or ) such that (or ) within

Key Lemma: Now, the only remaining thing is to prove the key Lemma 1.

Proof of Lemma 1. First, we get the following equality (its proof is deferred to the end):

Then, to cancel the last inner product in (23), we use the property (smoothness and/or strong convexity) of f:

where (24) and (25) uses L-smoothness (see (2)) and -strong convexity (see (3)), respectively. By adding

Now, this key lemma (i.e., (20)) is proved as follows by adding

condition . Now, the only remaining thing is to prove (23). For any

where (31) and (32) use equalities (36) and (40), respectively. Further,

where (33) and (35) hold due to the condition holds due to the relation (Line 3 in Algorithm 1). Now, we finish the proof for the inner product term.

where (37) and (39) hold due to the condition holds due to the relation (Line 3 in Algorithm 1).

A.3. Proof of Theorem 4

In this section, we provide the detailed proof for accelerated result in the distributed case. Similar to previous Section A.2, we first restate our Lemmas 2–5 here. Then we use them to prove Theorem 4. Finally, we provide the proof for these Lemmas. Before restating lemmas, we recall the following notation:

Lemma 2 If parameters satisfy , then we have for any iteration k,

Lemma 3 According to Line 11 of Algorithm 2 and Definition (42)–(43), we have

Lemma 4 If the compression operator

Lemma 5 If

Now, we recall our Theorem 4 here and are ready to prove it using Lemmas 2–5.

Theorem 4 Suppose f(x) is -strongly convex and all have L-Lipschitz continuous gradients, and the compression operator satisfies (5). Let parameters

, and , then the number of iterations performed by ADIANA (Algorithm 2) to find an -solution such that is at most

Proof of Theorem 4. We define the following Lyapunov function and induce it as follows:

where (49) uses Lemma 2, (50) uses Lemma 3, (51) uses uses Lemma 4, (53) uses Lemma 5, (54) uses

Telescoping the above inequality (56) from iteration . To obtain an

, the number of iterations is at most

By letting , it is not hard to verify that the number of iterations performed by ADIANA (Algorithm 2) to find an -solution such that is at most

where n is the number of parallel machines.

Key Lemmas: Now, the remaining thing is to prove Lemmas 2–5. We first prove the relatively simple Lemmas 3–5 and then prove the key Lemma 2.

Proof of Lemma 3. According to Line 11 of Algorithm 2, i.e.,

with probability p with probability , and definitions , this lemma is directly obtained, i.e.,

Proof of Lemma 4. This lemma is proved as follows:

where first inequality uses the property of -compression operator (i.e., (5)) and the last inequality uses Cauchy-Schwarz inequality and definition

Proof of Lemma 5. This lemma is proved as follows:

where (59) uses the definition of (see Line 11 of Algorithm 2), (60) uses Cauchy-Schwarz inequality, (61) uses , the property of -compression operator (i.e., (5)) and , and (63) uses Cauchy-Schwarz inequality.

Proof of Lemma 2. Similar to the proof of key Lemma 1 in the last section, we first get the following equality:

where (64) uses in Algorithm 2). To cancel the inner products in (64), we first use the property of f:

where the inequalities hold uses the L-smoothness and -strong convexity of

Then, according the definition of of Algorithm 2), we have

where . Besides, according to convexity of

where (68) uses (66), (69) uses Young’s inequality, and (70) uses the condition

Now, we are ready to prove this lemma by canceling the inner products in (64) using (70). By plugging

B. Extra Experiments

In this section, we conduct more experiments for these methods with different compression operators for the same regularized logistic regression problem on other two datasets a9a and w6a. Similar to Section 6, for all methods, we use the theoretical stepsize and parameters.

B.1. Comparison with DIANA and DCGD

Similar to Section 6.1, we compare our ADIANA with DIANA and DCGD with three compression operators: random sparsification, random dithering, and natural compression on different datasets in Figures 5 and 6. Similar to Figures 1 and 2, here Figures 5 and 6 also show that our ADIANA converges fastest for all three compressors, and natural compression uses the fewest communication bits than random dithering and random sparsification. Furthermore, because the compression error of vanilla DCGD is nonzero in general, DCGD can only converge to the neighborhood of the optimal solution while DIANA and ADIANA can converge to the optimal solution.

Figure 5. The communication complexity of different methods for three different compressors (random sparsification, random dithering and natural compression) on the a9a dataset.

Figure 6. The communication complexity of different methods for three different compressors (random sparsification, random dithering and natural compression) on the w6a dataset.

B.2. Communication efficiency

Similar to Section 6.2, we compare our ADIANA and DIANA, with and without compression on different datasets to show the communication efficiency of our accelerated method ADIANA in Figures 7 and 8. Similarly in Figures 5 and 6, DIANA is better than its uncompressed version if the compression operator is random sparsification. However, ADIANA behaves worse than its uncompressed version. For random dithering (middle figures) and natural compression (right figures), ADIANA is about twice faster than its uncompressed version, and is much faster than DIANA with/without compression. These numerical results indicate that ADIANA (which enjoys both acceleration and compression) could be a more practical communication efficiency method, i.e., acceleration (better than non-accelerated DIANA) and compression (better than the uncompressed version), especially for random dithering and natural compression.

Figure 7. The communication complexity of DIANA and ADIANA with and without compression on the a9a dataset.

Figure 8. The communication complexity of DIANA and ADIANA with and without compression on the w6a dataset.

B.3. Different number of nodes

In this subsection, we demonstrate the performance of our ADIANA for different number of nodes/machines with random dithering and natural compression on a9a and w6a datasets in Figure 9. In the previous numerical results, the number of communication bits is not multiplied by the number of nodes, since the number of nodes is the same for all methods. However, here the number of communication bits is multiplied by the number of nodes since they are different now.

Figure 9. The communication complexity of ADIANA with different number of nodes on the a9a (left) and w6a (right) datasets.