b

DiscoverSearch
About
My stuff
Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization
2020·arXiv
Abstract
Abstract

Due to the high communication cost in distributed and federated learning problems, methods relying on compression of communicated messages are becoming increasingly popular. While in other contexts the best performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of iterations, there are no methods which combine the benefits of both gradient compression and acceleration. In this paper, we remedy this situation and propose the first accelerated compressed gradient descent (ACGD) methods. In the single machine regime, we prove that ACGD enjoys the rate  O�(1 + ω)�Lµ log 1ϵ�for  µ-strongly con-

vex problems and  O�(1 + ω)�Lϵ�for convex problems, respectively, where  ωis the compression parameter. Our results improve upon the existing non-accelerated rates  O�(1 + ω) Lµ log 1ϵ�

and  O�(1 + ω) Lϵ�, respectively, and recover the optimal rates of accelerated gradient descent as a special case when no compression (ω = 0) is applied. We further propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate �O�ω +�

where n is the number of devices/workers and �Ohides the logarithmic factor  log 1ϵ . This improves upon the previous best result �O�ω + Lµ + ωLnµ�

achieved by the DIANA method of Mishchenko et al. (2019). Finally, we conduct several experiments on real-world datasets which corroborate our theoretical results and confirm the practical superiority of our accelerated methods.

With the proliferation of edge devices such as mobile phones, wearables and smart home devices comes an increase in the amount of data rich in potential information which can be mined for the benefit of the users. One of the approaches of turning the raw data into information is via federated learning (Koneˇcn´y et al., 2016; McMahan et al., 2017), where typically a single global supervised model is trained in a massively distributed manner over a network of heterogeneous devices.

Training supervised federated learning models is typically performed by solving an optimization problem of the form

image

where  fi : Rd → Ris smooth loss associated with data stored on device  i and ψ : Rd → R ∪ {+∞}is a relatively simple but possibly nonsmooth regularizer.

In distributed learning in general, and federated learning in particular, communication of messages across a network forms the bottleneck of the training system. It is thus very important to devise novel strategies for reducing the number of communication rounds. Two of the most common strategies are i) local computations (Ma et al., 2017; Stich, 2019; Khaled et al., 2020) and ii) communication compression (Seide et al., 2014; Alistarh et al., 2017; Wangni et al., 2018; Horv´ath et al., 2019a). The former is used to perform more local computations on each device before communication and subsequent model averaging, hoping that this will reduce the total number of communications. The latter is used to reduce the size of communicated messages, saving precious time spent in each communication round, and hoping that this will not increase the total number of communications.

1.1. Theoretical inefficiency of local methods

Despite their practical success, local methods are poorly understood and there is much to be discovered. For instance, there exist no theoretical results which would suggest that any local method (e.g., local gradient descent (GD) or local SGD) can achieve better communication complexity than its standard non-local variant (e.g., GD, SGD). In fact, until recently, no complexity results existed for local SGD in environments with heterogeneous data (Khaled et al., 2019; 2020), a key regime in federated learning settings (Li et al., 2019). In the important regime when all participating devices compute full gradients based on their local data, the recently proposed stochastic controlled averaging (SCAFFOLD) method (Karimireddy et al., 2019) offers no improvement on the number of communication as the number of local steps grows despite the fact that this is a rather elaborate method combining local stochastic gradient descent with control variates for reducing the model drift among clients.

1.2. Methods with compressed communication

However, the situation is much brighter with methods employing communication compression. Indeed, several recent theoretical results suggest that by combining an appropriate (typically randomized) compression operator with a suitably designed gradient-type method, one can obtain improvement in the the total communication complexity over comparable baselines not performing any compression. For instance, this is the case for distributed compressed gradient descent (CGD) (Alistarh et al., 2017; Khirirat et al., 2018; Horv´ath et al., 2019a; Li & Richt´arik, 2020) and distributed CGD methods which employ variance reduction to tame the variance introduced by compression (Hanzely et al., 2018; Mishchenko et al., 2019; Horv´ath et al., 2019b; Hanzely & Richt´arik, 2019b; Li & Richt´arik, 2020).

While in the case of CGD compression leads to a decrease in the size of communicated messages per communication round, it leads to an increase in the number of communications. Yet, certain compression operators, such as natural dithering (Horv´ath et al., 2019a), were shown to be better than no compression in terms of the overall communication complexity.

The variance-reduced CGD method DIANA (Mishchenko et al., 2019; Horv´ath et al., 2019b) enjoys even better behavior: the number of communication rounds for this method is unaffected up to a certain level of compression when the variance induced by compression is smaller than a certain threshold. This threshold can be very large in practice, which means that massive reduction is often possible in the number of communicated bits without this having any adverse effect on the number of communication rounds.

Recall that a function  f : Rd → Ris L-smooth or has L-Lipschitz continuous gradient (for L > 0) if

image

and  µ-strongly convex (for  µ ≥ 0) if

image

for all  x, y ∈ Rd. The  µ = 0case corresponds to the standard convexity.

In particular, for L-smooth and  µ-strongly convex f with n machines, DIANA enjoys the iteration bound O��ω + Lµ + ωnLµ�log 1ϵ�, where Lµ is the condition num- ber and  ωis the compression parameter (see Definition 1). If  ω = 0, which corresponds to no compression, DIANA recovers the  O�Lµ log 1ϵ�rate of gradient descent. On the

other hand, as long as  ω = O�min� Lµ , n��, the rate is still

�Lµ log 1ϵ�, which shows that DIANA is able to retain the same number of communication rounds as gradient descent and yet save on bit transmission in each round. The higher ωis allowed to be, the more compression can be applied.

Discouraged by the lack of theoretical results suggesting that local methods indeed help to reduce the number of communications, and encouraged by the theoretical success of CGD methods, in this paper we seek to enhance CGD methods with a mechanism which, unlike local updating, can provably lead to a decrease of the number of communication rounds.

What mechanism could achieve further improvements?

In the world of deterministic gradient methods, one technique for such a reduction is well known: Nesterov acceleration / momentum (Nesterov, 1983; 2004). In case of stochastic gradient methods, the accelerated method Katyusha (Allen-Zhu, 2017) achieves the optimal rate for strongly convex problems, and the unified accelerated method Varag (Lan et al., 2019) achieves the optimal rates for convex problems regardless of the strong convexity. See also (Kovalev et al., 2020; Qian et al., 2019) for some enhancements. Essentially all state-of-the-art methods for training deep learning models, including Adam (Kingma & Ba, 2014), rely on the use of momentum/acceleration in one form or another, albeit lacking in theoretical support.

However, the successful combination of gradient compression and acceleration/momentum has so far remained elusive, and to the best of our knowledge, no algorithms supported with theoretical results exist in this space. Given the omnipresence of momentum in modern machine learning, this is surprising.

We now summarize our key contributions:

2.1. First combination of gradient compression and acceleration

We develop the first gradient-type optimization methods provably combining the benefits of gradient compression

Table 1. Convergence results for the special case with n = 1 device (i.e., problem (4))

image

Table 2. Convergence results for the general case with n devices (i.e., problem (1)). Our results are always better than previous results.

image

and acceleration: i) ACGD (Algorithm 1) in the single device case, and ii) ADIANA (Algorithm 2) in the distributed case.

2.2. Single device setting

We first study the single-device setting, and design an accelerated CGD method (ACGD - Algorithm 1) for solving the unconstrained smooth minimization problem

image

in the regimes when f is L-smooth and i)  µ-strongly convex, and ii) convex. Our theoretical results are summarized in Table 1. In the strongly convex case, we improve the complexity of CGD (Khirirat et al., 2018) from  O�(1 + ω) Lµ log 1ϵ�

image

ment is from  O�(1 + ω) Lϵ�to O�(1 + ω)�Lϵ�,where ω ≥ 0is the compression parameter (see Definition 1).

2.3. Distributed setting

We further study the distributed setting with n devices/nodes and focus on problem (1) in its full generality, i.e.,

image

The presence of multiple nodes (n > 1) and of the regularizer  ψposes additional challenges. In order to address them, we need to not only combine acceleration and compression, but also introduce a DIANA-like variance reduction mechanism to remove the variance introduced by the compression operators.

In particular, we have developed an accelerated variant of the DIANA method for solving the general problem (1), which we call ADIANA (Algorithm 2). The comparison of complexity results between ADIANA and DIANA is summarized in Table 2.

Note that our results always improve upon the non-accelerated DIANA method. Indeed, in the regime when the compression parameter  ωis larger than the number of nodes n, we improve the DIANA rate  O�ω�1 + Lnµ�log 1ϵ�to O�ω�1 +�

On the other hand, in the regime when  ω < n, we improve the DIANA rate  O��ω + Lµ�log 1ϵ�to

image

since  ω + Lµ ≥ 2�

Note that if  ω ≤ n1/3, which is more often true in federated learning as the number of devices in federated learning is typically very large, our ADIANA result reduces to  O��ω +�Lµ�log 1ϵ�.In particular, if  ω =

O�min�n1/3,�

µ log 1ϵ�; the same as that of non-compressed accelerated gradient descent (AGD) (Nesterov, 2004). It means that ADIANA benefits from cheaper communication due to compression for free without hurting the convergence rate (i.e., the communication rounds are the same), and is therefore better suited for federated optimization.

We now introduce the notion of a randomized compression operator which is used to compress the gradients.

Definition 1 (Compression operator) A randomized map C : Rd �→ Rd is an ω-compression operator if

image

In particular, no compression (C(x) ≡ x) implies ω = 0.

Note that the conditions (5) require the compression operator to be unbiased and its variance uniformly bounded by a relative magnitude of the vector which we are compressing.

3.1. Examples

We now give a few examples of randomized compression operators without attempting to be exhaustive.

image

where  ⊙denotes the Hadamard (element-wise) product and ξk ∈ {0, 1}dis a uniformly random binary vector with k nonzero entries (∥ξk∥0 = k). This random-k sparsifica-tion operator C satisfies (5) with  ω = dk − 1. Indeed, no compression  k = d implies ω = 0.

Example 2 (Quantization): Given  x ∈ Rd, the (p, s)-quantization operator is defined by

image

where  ξs ∈ Rd is a random vector with i-th element

image

where the level l satisfies |xi|∥x∥p ∈ [ ls, l+1s ]. The probability

is chosen so that  E[ξs(i)] = |xi|∥x∥p s. This (p, s)-quantization operator C satisfies (5) with  ω = 2 + d1/p+d1/2s. In particular, QSGD (Alistarh et al., 2017) used p = 2 (i.e., (2, s)-quantization) and proved that the expected sparsity of C(x) is  E[∥C(x)∥0] = O�s(s +√d)�.

In this section, we study the special case of problem (1) with a single machine (n = 1) and no regularizer (ψ(x) ≡ 0),

image

4.1. The CGD algorithm

First, we recall the update step in compressed gradient descent (CGD) method, i.e.,

image

where C is a kind of  ω-compression operator defined in Definition 1.

As mentioned earlier, convergence results of CGD are O�(1 + ω) L ϵ�for strongly convex problems and

image

convergence proof for strongly convex problems, i.e., O�(1 + ω) L ϵ�, can be found in (Khirirat et al., 2018). For completeness, we now establish a convergence result for convex problems, i.e.,  O�(1 + ω) Lϵ�since we did not find it in the literature.

image

is at most

image

4.2. The ACGD algorithm

Note that in the non-compressed case  ω = 0(i.e., CGD is reduced to standard GD), there exists methods for obtaining accelerated convergence rates of  O��Lµ log 1ϵ�and

O��Lϵ�for strongly convex and convex problems, respectively. However, no accelerated convergence results exist for CGD methods. Inspired by Nesterov’s accelerated gradient descent (AGD) method (Nesterov, 2004) and FISTA (Beck & Teboulle, 2009), we propose the first accelerated compressed gradient descent (ACGD) method, described in Algorithm 1.

image

4.3. Convergence theory

Our accelerated convergence results for ACGD (Algorithm 1) are stated in Theorems 2 and 3, formulated next.

Theorem 2 (ACGD: convex case) Let f be convex with L-Lipschitz continuous gradient and let the compression operator C satisfy (5). Choose the parameters in ACGD (Algorithm 1) as follows:

image

Then the number of iterations performed by ACGD to find an  ϵ-solution such that

image

is at most

image

Theorem 3 (ACGD: strongly convex case) Let f be  µ-strongly convex with L-Lipschitz continuous gradient and let the compression operator C satisfy (5).Choose the parameters in ACGD (Algorithm 1) as follows:

image

Then the number of iterations performed by ACGD to find an  ϵ-solution such that

image

(or  E[∥xk − x∗∥2] ≤ ϵ) is at most

image

In the non-compressed case  ω = 0(i.e.,  C(x) ≡ x), our results recover the standard optimal rates of accelerated gradient descent. Further, if we consider the random-k spar-sification compression operator, ACGD can be seen as a variant of accelerated randomized coordinate descent (Nes- terov, 2012). Our results recover the optimal results of accelerated randomized coordinate descent method (Allen- Zhu et al., 2016; Hanzely & Richt´arik, 2019a) under the same standard smoothness assumptions.

4.4. Proof sketch

The following lemma which demonstrates improvement in one iteration plays a key role in our analysis.

Lemma 1 If parameters  {ηk}, {θk}, {βk}, {γk} and p sat-and the compression operator  Ck satisfies (5), then we have for any iteration k of ACGD, and for all  x ∈ Rd,

image

where the expectation is with respect to the randomness of compression operator sampled at iteration k.

The proof of Theorems 2 and 3 can be derived (i.e., plug into the specified parameters ({ηk}, {θk}, {βk}, {γk}and p) and collect all iterations) from Lemma 1. The detailed proofs can be found in the appendix.

We now turn our attention to the general distributed case, i.e., problem (1):

image

The presence of multiple nodes (n > 1) and of the regularizer  ψposes additional challenges.

5.1. The ADIANA algorithm

We now propose an accelerated algorithm for solving problem (1). Our method combines both acceleration and variance reduction, and hence can be seen as an accelerated version of DIANA (Mishchenko et al., 2019; Horv´ath et al., 2019b). Therefore, we call our method ADIANA (Algorithm 2). In this case, each machine/agent computes its local gradient (e.g.,  ∇fi(xk)) and a shifted version thereof is compressed and sent to the server. The server subsequently aggregates all received messages, to form a stochastic gradient estimator  gkof 1n�i ∇fi(xk), and then performs a proximal step. The shift terms  hkiare adaptively chang- ing throughout the iterative process, and have the role of reducing the variance introduced by compression. If no compression is used, we may simply set the shift terms to be  hki = 0 for all i, k.

Our method was inspired by Mishchenko et al. (2019), who first studied variance reduction for CGD methods for a specific ternary compression operator, and Horv´ath et al. (2019b) who studied the general class of  ω-compression operators we also study here. However, we had to make certain modifications to make variance-reduced compression work in the accelerated case since both of them were studied in the non-accelerated case. Besides, our method adopts a randomized update rule for the auxiliary vectors wk which simplifies the algorithm and analysis, resembling the workings of the loopless SVRG method proposed by Kovalev et al. (2020).

image

5.2. Convergence theory

Our main convergence result for ADIANA (Algorithm 2) is formulated in Theorem 4. We focus on the strongly convex setting.

Theorem 4 Suppose f is  µ-strongly convex and that the functions  fi have L-Lipschitz continuous gradient for all i. Further, let the compression operator C satisfy (5). Choose the ADIANA (Algorithm 2) parameters as follows:

image

Then the number of iterations performed by ADIANA to find an  ϵ-solution such that

image

is at most

image

As we have explained in the introduction, the above rate is vastly superior to that of non-accelerated distributed CGD methods, including that of DIANA.

5.3. Proof sketch

In the proof, we use the following notation:

image

We first present a key technical lemma which plays a similar role to that of Lemma 1.

Lemma 2 If the parameters satisfy  η ≤ 12L, θ1 ≤ 14, θ2 =

2, γ = η2(θ1+ηµ)and  β = 1 − γµ, then we have for any iteration k,

image

Theorem 4 can be proved by combing the above lemma with three additional Lemmas: Lemma 3, 4 and 5, which we present next. In view of the presence of  Wk in (10), thefollowing result is useful as it allows us to add  Wk+1into the Lyapunov function.

Lemma 3 According to Line 11 of Algorithm 2 and Defini-tion (7)(8), we have

image

To cancel the term  E���gk − ∇f(xk)��2�in (10), we use the defining property of compression operator (i.e., (5)) :

Lemma 4 If the compression operator C satisfies (5), we have

image

Note that the bound on variance obtained above introduces an additional term  Hk (see (13)). We will therefore add the terms  Hk+1 into the Lyapunov function as well.

Lemma 5 If  α ≤ 1ω+1, we have

E�Hk+1�≤�1 − α2

image

Note that the terms �ni=1��∇fi(wk) − ∇fi(xk)��2and �ni=1��∇fi(yk) − ∇fi(xk)��2in Lemma 5 and (13) can be cancelled by (11) and (12) by choosing the parameters appropriately.

Finally, it is not hard to obtain the following key inequality for the Lyapunov function by plugging Lemmas 3-5 into our key Lemma 2:

image

Above, the constants  c1, . . . , c5are related to the algorithm parameters  η, θ1, θ2, α, β, γand p. Finally, the proof of Theorem 4 can be derived (i.e., plug into the specified parameters) from inequality (14). The detailed proof can be found in the appendix.

In this section, we demonstrate the performance of our accelerated method ADIANA (Algorithm 2) and previous methods with different compression operators on the regularized logistic regression problem,

image

where  {ai, bi}i∈[n]are data samples.

Data sets. In our experiments we use four standard datasets, namely, a5a, mushrooms, a9a and w6a from the LIBSVM library. Some of the experiments are provided in the appendix.

Compression operators. We use three different compression operators: random sparsification (see e.g. (Stich et al., 2018)), random dithering (see e.g. (Alistarh et al., 2017)), and natural compression (see e.g. (Horv´ath et al., 2019a)). For random-r sparsification, the number of communicated bits per iteration is 32r, and we choose r = d/4. For random dithering, we choose  s =√d, which means the number of communicated bits per iteration is 2.8d + 32 (Alistarh et al., 2017). For natural compression, the number of communicated bits per iteration is  9d bits (Horv´ath et al.,2019a).

Parameter setting. In our experiments, we use the theoretical stepsize and parameters for all the three algorithms: vanilla distributed compressed gradient descent (DCGD), DIANA (Mishchenko et al., 2019), and our ADIANA (Algorithm 2). The default number of nodes/machines is 20 and the regularization parameter  λ = 10−3. The numerical results for different number of nodes can be found in the appendix. For the figures, we plot the relation of the optimality loss gap  f(xk) − f(x∗)and the number of accumulated transmitted bits. The optimal value  f(x∗)for each case is obtained by getting the minimum of three uncompressed versions of ADIANA, DIANA, and DCGD for 100000 iterations.

6.1. Comparison with DIANA and DCGD

In this subsection, we compare our ADIANA with DIANA and DCGD with three compression operators: random spar-sification, random dithering, and natural compression in Figures 1 and 2.

The experimental results indeed show that our ADIANA converges fastest for all three compressors, and natural compression uses the fewest communication bits than random dithering and random sparsification. Moreover, because the compression error of vanilla DCGD is nonzero in general, DCGD can only converge to the neighborhood of the optimal solution while DIANA and ADIANA can converge to the optimal solution.

6.2. Communication efficiency

Now, we compare our ADIANA and DIANA, with and without compression to show the communication efficiency of our accelerated method ADIANA in Figures 3 and 4.

According to the left top and left bottom of Figure 4, DIANA is better than its uncompressed version if the compression operator is random sparsification. However, ADIANA behaves worse than its uncompressed version. For random dithering (middle figures) and natural compression (right fig-ures), ADIANA is about twice faster than its uncompressed version, and is much faster than DIANA with/without compression. These numerical results indicate that ADIANA (which enjoys both acceleration and compression) could be a more practical communication efficiency method, i.e., acceleration (better than non-accelerated DIANA) and compression (better than the uncompressed version), especially for random dithering and natural compression.

image

Figure 1. The communication complexity of different methods for three different compressors (random sparsification, random dithering and natural compression) on the a5a dataset.

image

Figure 2. The communication complexity of different methods for three different compressors (random sparsification, random dithering and natural compression) on the mushrooms dataset.

image

Figure 3. The communication complexity of DIANA and ADIANA with and without compression on the a5a dataset.

image

Figure 4. The communication complexity of DIANA and ADIANA with and without compression on the mushrooms dataset.

The authors would like to acknowledge support from the KAUST Baseline Research Fund and the KAUST Visual Computing Center. Zhize Li and Xun Qian thank for support from the KAUST Extreme Computing Research Center.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.

Allen-Zhu, Z., Qu, Z., Richt´arik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. In The 33rd International Conference on Machine Learning, pp. 1110–1119, 2016.

Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Hanzely, F. and Richt´arik, P. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019a.

Hanzely, F. and Richt´arik, P. One method to rule them all: variance reduction for data, parameters and many new methods. arXiv preprint arXiv:1905.11266, 2019b.

Hanzely, F., Mishchenko, K., and Richt´arik, P. SEGA: variance reduction via gradient sketching. In Advances in Neural Information Processing Systems 31, pp. 2082– 2093, 2018.

Horv´ath, S., Ho, C.-Y., ˇLudov´ıt Horv´ath, Sahu, A. N., Canini, M., and Richt´arik, P. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019a.

Horv´ath, S., Kovalev, D., Mishchenko, K., Stich, S., and Richt´arik, P. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019b.

Karimireddy, S., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378, 2019.

Khaled, A., Mishchenko, K., and Richt´arik, P. First analysis of local GD on heterogeneous data. In NeurIPS Workshop on Federated Learning for Data Privacy and Confidentiality, pp. 1–11, 2019.

Khaled, A., Mishchenko, K., and Richt´arik, P. Tighter theory for local SGD on identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.

Khirirat, S., Feyzmahdavian, H. R., and Johansson, M. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.

Kingma, D. P. and Ba, J. Adam: a method for stochastic optimization. In The 3rd International Conference on Learning Representations, 2014.

Koneˇcn´y, J., McMahan, H. B., Yu, F., Richt´arik, P., Suresh, A. T., and Bacon, D. Federated learning: strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.

Kovalev, D., Horv´ath, S., and Richt´arik, P. Dont jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.

Lan, G., Li, Z., and Zhou, Y. A unified variance-reduced accelerated gradient method for convex optimization. In Advances in Neural Information Processing Systems, pp. 10462–10472, 2019.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Feder- ated learning: challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.

Li, Z. and Richt´arik, P. A unified analysis of stochastic gradient methods for nonconvex federated optimization. arXiv preprint arXiv:2006.07013, 2020.

Ma, C., Koneˇcn´y, J., Jaggi, M., Smith, V., Jordan, M. I., Richt´arik, P., and Tak´aˇc, M. Distributed optimization with arbitrary local solvers. Optimization Methods and Software, 32(4):813–848, 2017.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Ag¨uera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Mishchenko, K., Gorbunov, E., Tak´aˇc, M., and Richt´arik, P. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

Nesterov, Y. A method for unconstrained convex minimiza- tion problem with the rate of convergence  o(1/k2). In Doklady AN USSR, volume 269, pp. 543–547, 1983.

Nesterov, Y. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Qian, X., Qu, Z., and Richt´arik, P. L-SVRG and LKatyusha with arbitrary sampling. arXiv preprint arXiv:1906.01481, 2019.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochas- tic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Stich, S. U. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458, 2018.

Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient spar- sification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1306–1316, 2018.

A. Missing Proofs

In this appendix, we provide the detailed proofs for our Theorems 14.

A.1. Proof of Theorem 1

We first restate our Theorem 1 here and then provide the detailed proof.

Theorem 1 Suppose f(x) is convex with L-Lipschitz continuous gradient and the compression operator  C(·) satisfies (5).Let step size  η = 1(1+ω)L, then the number of iterations performed by CGD to find an  ϵ-solution such that  E[f(xk)−f(x∗)] ≤

ϵis at most  k = O( (1+ω)Lϵ ).

Proof: According to CGD update  xk+1 = xk − ηgk, we have

where the last inequality holds due to convexity of f. Besides, according to L-smoothness of f (see (2)), we have

image

Adding (15) and η(1+ω)1− Lη(1+ω)2 times (16) to cancel the term  ∥∇f(xk)∥2, we have

image

Summing up the above inequality from iteration 0 to k, we have

image

where the last inequality uses the L-smoothness of f. Finally, noting that  E[f(xi+1) − f(x∗)] ≤ E[f(xi) − f(x∗)] for alli = 0, . . . k according to (16), then (18) turns to be

image

where the last equality uses the choice of step size  η = 1(1+ω)L. Now the proof of Theorem 1 is finished by letting the number of iteration  k = (1+ω)L∥x0−x∗∥2ϵ, i.e., obtain the  ϵ-solution xk such that E[f(xk) − f(x∗)] ≤ ϵ within O( (1+ω)Lϵ )iterations. □

A.2. Proof of Theorems 2 and 3

First, we restate our key Lemma 1 here. Then we use it to prove Theorems 2 and 3. Finally, we provide the proof for this key Lemma 1.

Lemma 1 If parameters  {ηk}, {θk}, {βk}, {γk}and p satisfy  θk = 1−γk/p1−βkγk/p βk ≤ min{ µηkγkp , 1}, p ≥ (1+Lηk)(1+ω)2and the compression operator  C(·) satisfies (5), then we have for any iteration  k, ∀x ∈ Rd,

image

where the expectation is with respect to the randomness of compression  C(·)at iteration k.

Now, we recall our Theorem 2 here and are ready to prove it using Lemma 1.

Theorem 2 Suppose f(x) is convex with L-Lipschitz continuous gradient and the compression operator  C(·) satisfies (5).Let parameters  ηk ≡ 1L, θk = 1 − 2k+2, βk ≡ 0, γk = 2pk+2and  p = 1 + ω, then the number of iterations performed by ACGD (Algorithm 1) to find an  ϵ-solution such that  E[f(xk) − f(x∗)] ≤ ϵis at most  k = O�(1 + ω)�

Proof of Theorem 2. First, we know  µ = 0in this general convex case. By choosing step size  ηk ≡ 1L, p = 1 + ω, βk ≡ 0,γk = 2pk+2 and θk = 1 − 2k+2, Lemma 1turns to be

image

Summing up the above inequality from iteration  0 to k − 1and letting  x = x∗, we have

image

where the second inequality uses  f(yi) − f(x∗) ≥ 0 and y0 = x0. Now the proof of Theorem 2 is finished, i.e., we obtain the  ϵ-solution yk such that E[f(yk) − f(x∗)] ≤ ϵ within k = O�(1 + ω)�

Similarly, we recall our Theorem 3 here and also prove it using Lemma 1.

Theorem 3 Suppose f(x) is  µ-strongly convex with L-Lipschitz continuous gradient and the compression operator  C(·)

satisfies (5). Let parameters  ηk ≡ 1L, θk ≡ pp+√µ/L, βk ≡

performed by ACGD (Algorithm 1) to find an  ϵ-solution such that  E[f(xk) − f(x∗)] ≤ ϵ (or E[∥xk − x∗∥2] ≤ ϵ) is at most k = O�(1 + ω)�

image

Proof of Theorem 3. By choosing step size  ηk ≡ 1L, p = 1 + ω, γk ≡� µL, βk ≡√µ/Lpand  θk ≡ 11+√µ/p2L, Lemma 1

turns to be

Telescoping the above inequality from iteration 0 to k and letting  x = x∗, we have

image

where the first inequality uses  µ-strongly convex of f (see (3)) and  y0 = z0 = x0. Now the proof of Theorem 3 is finished, i.e., we obtain the  ϵ-solution  yk(or  zk) such that  E[f(yk) − f(x∗)] ≤ ϵ(or  E[∥zk − x∗∥2] ≤ ϵ) within k = O�(1 + ω)�

image

Key Lemma: Now, the only remaining thing is to prove the key Lemma 1.

Proof of Lemma 1. First, we get the following equality (its proof is deferred to the end):

image

Then, to cancel the last inner product in (23), we use the property (smoothness and/or strong convexity) of f:

image

where (24) and (25) uses L-smoothness (see (2)) and  µ-strong convexity (see (3)), respectively. By adding�1 − γkp�times

image

Now, this key lemma (i.e., (20)) is proved as follows by adding 2ηkγ2k times (26) and (23):

image

condition  p ≥ (1+Lηk)(1+ω)2. Now, the only remaining thing is to prove (23). For any  x ∈ Rd, we have

image

where (31) and (32) use equalities (36) and (40), respectively. Further,

image

where (33) and (35) hold due to the condition  θk = 1−γk/p1−βkγk/p, and (34)holds due to the relation  xk = θkyk + (1 − θk)zk(Line 3 in Algorithm 1). Now, we finish the proof for the inner product term.

image

where (37) and (39) hold due to the condition  θk = 1−γk/p1−βkγk/p, and (38)holds due to the relation  xk = θkyk + (1 − θk)zk(Line 3 in Algorithm 1). □

A.3. Proof of Theorem 4

In this section, we provide the detailed proof for accelerated result in the distributed case. Similar to previous Section A.2, we first restate our Lemmas 25 here. Then we use them to prove Theorem 4. Finally, we provide the proof for these Lemmas. Before restating lemmas, we recall the following notation:

image

Lemma 2 If parameters satisfy  η ≤ 12L, θ1 ≤ 14, θ2 = 12, γ = η2(θ1+ηµ) and β = 1 − γµ, then we have for any iteration k,

image

Lemma 3 According to Line 11 of Algorithm 2 and Definition (42)(43), we have

image

Lemma 4 If the compression operator  C(·) satisfies (5), we have

image

Lemma 5 If  α ≤ 1/(1 + ω), we have

Now, we recall our Theorem 4 here and are ready to prove it using Lemmas 25.

Theorem 4 Suppose f(x) is  µ-strongly convex and all  fishave L-Lipschitz continuous gradients, and the compression operator  C(·)satisfies (5). Let parameters  α = 1ω+1, η = min{ 12L, n64ω(2p(ω+1)+1)2L}, θ1 = min�14,�

γ = η2(θ1+ηµ), β = 1 − γµ, and  p = min�1,max{1,√n/32ω−1}2(1+ω) �, then the number of iterations performed by ADIANA (Algorithm 2) to find an  ϵ-solution such that  E[∥zk − x∗∥2] ≤ ϵis at most

image

Proof of Theorem 4. We define the following Lyapunov function  Ψand induce it as follows:

image

image

where (49) uses Lemma 2, (50) uses Lemma 3, (51) uses  θ1 ≤ 1/4 and θ2 = 1/2, (52)uses Lemma 4, (53) uses Lemma 5, (54) uses  η = min{ 12L, n64ω(2p(ω+1)+1)2L}, (55) uses γ = η2(θ1+ηµ), β = 1 − γµ ≤ 1 − ηµ4θ1 due to ηµ ≤ θ1, and (56) uses

image

Telescoping the above inequality (56) from iteration  0 to k, we have E[Ψk] ≤�1 − min� α4 , p8,√ηµp4 ��kΨ0. To obtain an

ϵ-solution zk such that E[∥zk − x∗∥2] ≤ ϵ, the number of iterations is at most

image

By letting  p = min�1,max{1,√n/32ω−1}2(1+ω) �, it is not hard to verify that the number of iterations performed by ADIANA (Algorithm 2) to find an  ϵ-solution such that  E[∥zk − x∗∥2] ≤ ϵis at most

image

where n is the number of parallel machines. □

Key Lemmas: Now, the remaining thing is to prove Lemmas 25. We first prove the relatively simple Lemmas 35 and then prove the key Lemma 2.

image

Proof of Lemma 3. According to Line 11 of Algorithm 2, i.e.,  wk+1 =

�yk,with probability p wk,with probability  1 − p, and definitions Yk := P(yk) − P(x∗) and Wk := P(wk) − P(x∗), this lemma is directly obtained, i.e.,

image

Proof of Lemma 4. This lemma is proved as follows:

where first inequality uses the property of  ω-compression operator (i.e., (5)) and the last inequality uses Cauchy-Schwarz inequality and definition  Hk := 1n�ni=1��hki − ∇fi(wk)��2. □

Proof of Lemma 5. This lemma is proved as follows:

where (59) uses the definition of  wk+1(see Line 11 of Algorithm 2), (60) uses Cauchy-Schwarz inequality, (61) uses hk+1i = hki + αC(∇fi(wk) − hki ), the property of  ω-compression operator (i.e., (5)) and  α ≤ 1/(1 + ω), and (63) uses Cauchy-Schwarz inequality. □

Proof of Lemma 2. Similar to the proof of key Lemma 1 in the last section, we first get the following equality:

image

where (64) uses  xk = θ1zk + θ2wk + (1 − θ1 − θ2)yk (see Line 3in Algorithm 2). To cancel the inner products in (64), we first use the property of f:

image

where the inequalities hold uses the L-smoothness and  µ-strong convexity of  f and f(x) := 1n�ni=1 fi(x).

Then, according the definition of  yk+1 (see Line 9of Algorithm 2), we have

image

where  ∆ ∈ ∂ψ(yk+1). Besides, according to convexity of  ψ, we have

image

Adding (65) and (67), we have

where (68) uses (66), (69) uses Young’s inequality, and (70) uses the condition  η ≤ 1/2L.

Now, we are ready to prove this lemma by canceling the inner products in (64) using (70). By plugging  2γ times (70) (whereu = x∗), 2γβθ2θ1 times (70) (where u = wk), and 2γβ(1−θ1−θ2)θ1 times (70) (where u = yk) into (64), we have

��∇f(xk) − gk��2 − 14η

image

image

B. Extra Experiments

In this section, we conduct more experiments for these methods with different compression operators for the same regularized logistic regression problem on other two datasets a9a and w6a. Similar to Section 6, for all methods, we use the theoretical stepsize and parameters.

B.1. Comparison with DIANA and DCGD

Similar to Section 6.1, we compare our ADIANA with DIANA and DCGD with three compression operators: random sparsification, random dithering, and natural compression on different datasets in Figures 5 and 6. Similar to Figures 1 and 2, here Figures 5 and 6 also show that our ADIANA converges fastest for all three compressors, and natural compression uses the fewest communication bits than random dithering and random sparsification. Furthermore, because the compression error of vanilla DCGD is nonzero in general, DCGD can only converge to the neighborhood of the optimal solution while DIANA and ADIANA can converge to the optimal solution.

image

Figure 5. The communication complexity of different methods for three different compressors (random sparsification, random dithering and natural compression) on the a9a dataset.

image

Figure 6. The communication complexity of different methods for three different compressors (random sparsification, random dithering and natural compression) on the w6a dataset.

B.2. Communication efficiency

Similar to Section 6.2, we compare our ADIANA and DIANA, with and without compression on different datasets to show the communication efficiency of our accelerated method ADIANA in Figures 7 and 8. Similarly in Figures 5 and 6, DIANA is better than its uncompressed version if the compression operator is random sparsification. However, ADIANA behaves worse than its uncompressed version. For random dithering (middle figures) and natural compression (right figures), ADIANA is about twice faster than its uncompressed version, and is much faster than DIANA with/without compression. These numerical results indicate that ADIANA (which enjoys both acceleration and compression) could be a more practical communication efficiency method, i.e., acceleration (better than non-accelerated DIANA) and compression (better than the uncompressed version), especially for random dithering and natural compression.

image

Figure 7. The communication complexity of DIANA and ADIANA with and without compression on the a9a dataset.

image

Figure 8. The communication complexity of DIANA and ADIANA with and without compression on the w6a dataset.

B.3. Different number of nodes

In this subsection, we demonstrate the performance of our ADIANA for different number of nodes/machines with random dithering and natural compression on a9a and w6a datasets in Figure 9. In the previous numerical results, the number of communication bits is not multiplied by the number of nodes, since the number of nodes is the same for all methods. However, here the number of communication bits is multiplied by the number of nodes since they are different now.

image

Figure 9. The communication complexity of ADIANA with different number of nodes on the a9a (left) and w6a (right) datasets.


Designed for Accessibility and to further Open Science