b

DiscoverSearch
About
My stuff
Decentralized SGD with Over-the-Air Computation
2020·arXiv
Abstract
Abstract

We study the performance of decentralized stochastic gradient descent (DSGD) in a wireless network, where the nodes collaboratively optimize an objective function using their local datasets. Unlike the conventional setting, where the nodes communicate over error-free orthogonal communication links, we assume that transmissions are prone to additive noise and interference. We first consider a point-to-point (P2P) transmission strategy, termed the OAC-P2P scheme, in which the node pairs are scheduled in an orthogonal fashion to minimize interference. Since in the DSGD framework, each node requires a linear combination of the neighboring models at the consensus step, we then propose the OAC-MAC scheme, which utilizes the signal superposition property of the wireless medium to achieve over-the-air computation (OAC). For both schemes, we cast the scheduling problem as a graph coloring problem. We numerically evaluate the performance of these two schemes for the MNIST image classification task under various network conditions. We show that the OAC-MAC scheme attains better convergence performance with a fewer communication rounds.

Over the last decade, modern machine learning (ML) techniques, particularly deep learning (DL) framework, have demonstrated tremendous advances in many complex problems such as machine vision, speech recognition, and natural language processing. From the computational aspect, the main task in DL framework is computing the gradients through the layers in the back-propagation step, which is later used to update the network parameters. Since it is practically not possible to train the network over a huge data set, stochastic gradient descent (SGD) is used instead, in which a random subset of data, called a mini-batch, is used to compute an estimate of the gradient at each step. To further reduce the training time, stochastic gradient computations can be distributed to multiple workers, each of which executes the back-propagation step over a different subset of data, in parallel, under orchestration of the parameter server that is responsible for the model update [1]. This approach is often referred to as parallel SGD (PSGD) in the literature. PSGD is also employed when data is already distributed at multiple devices, which would like to train a model collaboratively without offloading their data to a central server due to privacy concerns. This latter framework is known as federated learning [2], [3].

Although the PSGD framework can successfully reduce the computation time and address the privacy concerns, it requires large amount of data transfer between the parameter server and the workers at each iteration, thus the communication latency

This work has been funded by the European Research Council (ERC) through project BEACON (No. 725731) and by by H2020-MSCA-ITN-2015 project SCAVENGE under grant number 677854.

becomes the new bottleneck. Hence, the recent studies have focused on the communication latency of PSGD, and introduced different strategies to mitigate it, such as sparsification [4], where gradients are transformed into sparse vectors by keeping only the significant values, and quantization [5], where each gradient is represented by only a limited number of bits.

While federated learning has been extremely popular over the last few years, the conventional parameter server architecture, in which a central server orchestrates the training across all the participating nodes has its limitations. First of all, such a parameter server may not always be available. Moreover, the centralized structure creates congestion at the parameter server, and leads to a single-point-of-failure, limiting the robustness of the system, and increasing the security risks.

An alternative framework is decentralized SGD (DSGD), where each worker communicates with only a certain number of other workers without assistance of a parameter server [6][9]. Such a decentralized architecture can address the congestion at the parameter server, and reduce the communication latency by relaxing the requirement to communicate with a single node from every other collaborating node in the network.

Although there is a broadening literature on communication efficient DSGD framework, in none of these papers wireless nature of the communication medium is taken into account. However, when learning takes place at the wireless network edge [10], it is indispensable to take into account the wireless nature of the communication medium, which creates additional challenges due to the inherent noise and interference among nodes. In this paper, we study DSGD over wireless networks taking into account the physical channel characteristics into account, and propose communication efficient transmission and scheduling schemes to reduce the communication delay of the learning algorithm.

Contributions: In this paper, we first highlight how the communication strategy between the nodes affects the convergence performance of the DSGD scheme when implemented in a wireless network. We first introduce a point-to-point(P2P) strategy, in which pairs of nodes are scheduled with the goal of avoiding or minimizing interference. We then propose an alternative multiple access (MAC) strategy, where multiple nodes transmit simultaneously to a common receiver, such that their transmissions are aligned for over-the-air computation (OAC). This MAC strategy harnesses interference from other nodes, rather than trying to mitigate it. It has previously been used in the federated learning setting in [10][12]. Note that the MAC strategy improves the efficiency of resource usage by scheduling all the transmitters at a single time slot, and provides better noise mitigation. Such simultaneous transmission does not lead to any additional interference since the receiver is not interested in the individual estimates of the other nodes, but only in their weighted average. We will achieve different weights as required by the consensus algorithm by allowing scaling at the transmitters. Note that, while multiple receivers can be scheduled simultaneously in the P2P case, this is more challenging in the MAC case.

We evaluate the performance of these two strategies for the MNIST digit classification task under various network conditions. We observe that the MAC approach with OAC provides significant gains in terms of the wall-clock convergence speed. It also results in higher accuracy (lower test error) due to its increased robustness against channel noise.

A. System Model

We consider the following distributed stochastic optimization problem

image

where  fi(θ)is the loss function at node  i ∈ [n] ≜ {1, . . . , n}based on the local sample distribution  Di, and  F(θ, ζ)is parameterized loss function imposed by the learning task and ζis variable referring the data sample. When the samples are from a finite data set, as in many DL problem,  fi(θ) =

image

smooth,  ∀i ∈ [n]. We also assume that the solution of (1) is finite, and denoted by  θ∗. The goal of each node is to obtain an estimate of  θ∗. In particular, we wish to minimize the maximum disagreement defined as

image

where �θi(t) ≜ 1/t �l∈[t] θi(l), and  θi(t)is the estimate of θ∗at node i at time t.

We implement DSGD in three steps to solve (1) in a distributed manner. First, at the beginning of each iteration, node i computes the local stochastic gradient, i.e.,

image

where  ζi,τis randomly sampled data at iteration t by the ith worker. And it is often assumed that stochastic gradient in (3) have a bounded variance. Then, in the intermediate step, each node exchanges its local model with the other nodes to seek a consensus, i.e.,

image

and, finally updates the local model, i.e.,  θi(t + 1) = θi(t +1/2) + gi(t). We note that the coefficients  mi,jdepend on the connectivity pattern between the nodes; specifically, if the network is defined by a connected graph G = (V, E) with V = {V1, . . . , Vn}and  E ⊆ V ×V, the coefficients are chosen according to the adjacency matrix A of G, so that they form a doubly stochastic matrix, which is the necessary condition for the convergence of DSGD.

B. DSGD over a Wireless Network

Here, we will briefly explain how the wireless communication model and DSGD framework are linked to each other. Without any topology constraint on the network, the nodes are allowed to communicate with each other over a fading additive white Gaussian noise (AWGN) channel. The communication across nodes is described as follows: at time  t ∈ N, node  Vi, i ∈ [n], transmits an input signal  xi(t) ∈ Rd, and receives the output signal  yi(t) ∈ Rdobtained as

image

where  hij = hjiis the channel gain between nodes i and j, and  nj(t) ∼ N(0d, Ids)represents the additive noise vector. We assume that the channel gains  hijare independent and identically distributed (i.i.d.) according to distribution  Ph, and remain fixed until the convergence is attained. We further assume that each node knows the coefficients of its incoming/ outgoing channels. We consider full-duplex nodes; that is, they can simultaneously transmit and receive signals.

The channel input at each node i is subject to the per-symbol power constraint

image

where  ∥·∥is the 2-norm. Note that the expected value in (6) is with respect to the additive noise and the stochastic gradient.

Assume that the communication phase for the consensus step given in (4) lasts T time instants per computation round, referred to as a communication block. T denotes the resources required for the communication phase, and has a direct impact on the wall-clock convergence speed. At time t = (m + 1)T, node i possesses the m channel outputs obtained in the previous communication block  Yi(mT) =[yi(mT + 1), . . . , yi((m + 1)T)],where

image

where  θj(m)is the local estimate of  θ∗at  Vjat communication round m, and  wjkis the coefficient with which user j scales its estimate for the kth symbol of the  mth

communication round. Using  Yi(mT), node  Viupdates its estimate as

image

where  P ∈ Rd×mare the combining weights, while  αi(m)is the learning rate in block m, and  gi(t)is the stochastic gradient in (3). The update in (8) can be rewritten as

image

where

image

If the matrix �P ∈ Rd×dwith entries  �pikis doubly stochastic, then convergence for DSGD can be shown [9]. We note that, in (9),  �ni(m) = �k∈[T ] piknik(mT +K). In the communication phase, by choosing the elements of matrix  W ∈ Rn×T, denoted by  wjk, j ∈ [n], and the elements of P, denoted by  pik, k ∈ [T], a different weight matrix �Pand a different noise matrix �N ∈ Rn×din (9) are obtained. Convergence of DSGD in the wireless setting can be analyzed using the existing techniques for the convergence of DSGD framework [9], which we omit here due to space limitation. Nevertheless, regarding the communication phase of the proposed approach, the fundamental design question is how to choose W, P, and T to achieve the fastest convergence possible, while guaranteeing that �Pis a doubly stochastic matrix.

C. Problem Definition

In the scope of this paper, we assume that, based on the channel realization H, certain links between the nodes will not be used for communications, particularly those having channel gains below a predetermined threshold  hth. These links will only act as interference, and their impact will be included in the noise term.

Under this assumption, network topology can be considered as a random graph G with adjacency matrix A, and we can decouple the problem into two sub-problems. Accordingly, we first construct a symmetric doubly-stochastic �Pfrom the matrix Laplacian as

image

where D is a diagonal matrix whose entry in position  Diiis the degree of  Vi, and  dmaxis the maximum node degree.

Once �Pis fixed, we can search for an efficient communication strategy, in terms of T and �N, satisfying predetermined weight matrix �P. We particularly focus on drawing a connection between the duration of the communication block T and the graph coloring problem. From a high-level perspective, communication strategies answering this question can be classified as follows: P2P scheme: Only pair-wise transmissions are allowed. At each instant, two pairs of users transmit simultaneously

image

Fig. 1: The example of network topology considered in Sec. III.

only if their transmitters do not interfere. Here we consider as interfering only those links with gains above  hth. MAC-OAC scheme: Multiple nodes concurrently transmit their estimates in an uncoded ‘analog’ manner, and the receiver receives a linear combination of the transmitted model (together with some noise) utilizing the superposition property of the wireless channel. Two nodes can receive simultaneously only if none of their transmitters interfere. Broadcast (BC)-OAC scheme: A transmitter sends its signal to multiple receivers. Two nodes can broadcast simultaneously if they do not interfere at any of their receivers. Interference Channel (IFC)-OAC scheme: Two nodes transmit or receive simultaneously even if they interfere.

Due to space limitations, we will consider only P2P and MAC communication schemes in this paper.

Instead, we focus on drawing a connection between graph coloring and the duration of the communication block T.

A. P2P-SGD Algorithm

For the P2P scheme, transmissions are constrained as follows: in each communication slot (i) the neighborhood of an active receiver node (a.k.a. RX node) contains only one active transmitter node (a.k.a. TX node), and (ii) the neighborhood of an active TX node contains only one RX node. Here the neighborhood is defined on graph G. We can formulate the scheduling problem as a graph coloring problem as follows: we construct an augmented graph �G ≜ G(�V , �E)with �V = E, where the vertex (i, j) in �Grepresents node i transmitting to node j on the initial graph G. Two vertices  (i, j), (l, m) ∈ �Gare connected in �Gif i = l or j = m. It can be shown that a solution of the P2P scheduling problem is then obtained by the vertex coloring of �Gby letting T be the number of colors and by having nodes colored with the same color correspond to TX/RX pairs communicating over the same slot.

B. MAC-OAC-SGD Algorithm

In MAC-OAC-SGD, for each transmission slot two conditions must hold: (i) all neighbors of a RX node must be TX nodes, and (ii) if two nodes are RX nodes, then they cannot have a common neighbor. To cast these constraints as a graph theoretical problem, we define the graph �G(V, �E)as

image

Fig. 2: An illustration of the schedules for the P2P-SGD and MAC-OAC-SGD schemes for the example in Sec. III-C.

the augmented version of the graph G(V, E), which have the same vertex set with G while �Eis obtained as

image

Again the edge coloring of this graph yields a feasible schedule for the MAC-SDG scheme; this time the nodes colored with the same color correspond to RX nodes active in the same time slot.

C. An Illustrative Example

To clarify the scheduling strategies for the P2P and MAC schemes, we illustrate them on a simple five-node network in Fig. 1, which is obtained by removing the links whose quality is below the threshold  hth. For this network, the graph �Gin Sec. III-A is illustrated in Fig. 2(a). In Fig. 2(b) we plot the scheduling corresponding to the coloring in Fig. 2(a). Note that the TX/RX pairs (1, 3), (3, 1) and (5, 4) have the same color (green) which implies that transmission from nodes 1, 3 and 5 can be take place is the same time slot. This is represented in Fig. 2(b) by the green arrows leaving this set of nodes. These transmissions are received at all neighbors, but utilized only at nodes 3, 1 and 4, respectively. The same representation is used in Fig. 2(c) and Fig 2(d) for the MAC-OAC-SGD algorthim.

In our numerical analysis, we consider the image classifi-cation problem over the MNIST dataset [13] with 10 different image classes, and train a neural network architecture with two convolutional layers followed by two fully connected layers according to the decentralized SGD framework. We consider a topology with n = 20 nodes where the training dataset is divided among them in an i.i.d. manner. For the training, we use the learning rate  α = 0.1and measure the test accuracy over 250 iterations.

The network among the nodes is generated as follows: Matrix M of  n × ni.i.d. Rayleigh random variables with scaling parameter  σ ∈ {2, 5}is first generated. From this, matrix A is obtained by setting to zero all the entries of M below threshold  τ ∈ {0.8, 1.2, 1.6}σ. This truncation induces the probability of an edge as  p ∈ {0.7, 0.5, 0.25}. The matrix A is the matrix of the elements of M above the given threshold. Among the matrix so generated, only the connected matrices A are considered. The logic behind this choice of models is as follows: we assume that the p is inversely proportional to E[hij]as the threshold  τindicates the quality of the wireless links used in transmission.

For each sub-scenario, the training/test procedure is repeated 50 times by generating a new channel realization at the beginning of each time training. The performance of the algorithm is measured in terms of the test accuracy and the average value of T, i.e. T. The results of the simulations are presented in Table I.

Form Table I we observe that the MAC-OAC-SGD scheme consistently requires smaller values of T compared to P2P-

image

TABLE I: Computation and Communication performance of MAC and P2P schemes over 50 trials

image

Fig. 3: Test error for P2P and MAC schemes for  σ = 5and the random topology based on  τ = 0.8σ.

image

Fig. 4: Test error for P2P and MAC schemes for  σ = 5and the random topology based on  τ = 1.2σ.

SGD. This is expected as in each time slot multiple nodes can communicate. For example, when  τ = 0.8σ, P2P scheme requires almost twice the communication resources as the MAC scheme. On the other, as the connection probability decreases, the advantages of the MAC strategy also diminish. We also observe that, in all the scenarios, the MAC-OAC-SGD scheme achieves a higher average test accuracy compared to the P2P scheme since it is more robust to additive noise. This is the case since the P2P accumulates the additive noise over multiple transmission instances, each time the model is exchanged, while the MAC is more robust to additive noise.

From the simulation results given Table I, we can conclude that advantages of the MAC scheme over the P2P scheme is two-folded; first due to over-the-air aggregation of the signals MAC scheme better suppress the noise, hence can achieve lower test error. Second, compared to P2P scheme, on the average, it requires less number time slots for the communication. The joint impact of aforementioned advantages of MAC scheme are illustrated in Fig. 3 and Fig. 4.

In this work, we studied decentralized SGD with over-the-air aggregation where the nodes transmit their local model in a synchronized fashion that all the transmitted models naturally add up on-the-air at the receiver side. We show that in the scope of decentralized SGD framework implemented in wireless network topology, MAC scheme outperforms the conventional P2P scheme by reducing the per iteration communication time as well as suppressing the noise term by aggregating the signals on-the-air.

[1] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimal dis- tributed online prediction using mini-batches,” J. Mach. Learn. Res., vol. 13, no. 1, pp. 165–202, Jan. 2012.

[2] J. Konecny, B. McMahan, and D. Ramage, “Federated optimization: Dis- tributed optimization beyond the datacenter,” arXiv:1511.03575 [cs.LG], Nov. 2015.

[3] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv:1610.05492v2 [cs.LG], Oct. 2017.

[4] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in International Conference on Learning Representations, 2018.

[5] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” in Advances in Neural Information Processing Systems 30, 2017, pp. 1709–1720.

[6] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems 30, 2017, pp. 5330–5340.

[7] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.

[8] J. Wang, A. K. Sahu, Z. Yang, G. Joshi, and S. Kar, “MATCHA: speeding up decentralized SGD via matching decomposition sampling,” CoRR, vol. abs/1905.09435, 2019.

[9] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems 32, 2019, pp. 8388– 8399.

[10] D. G¨und¨uz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. R. Murthy, and M. van der Schaar, “Machine learning in the air,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2184–2199, Oct 2019.

[11] M. M. Amiri and D. G¨und¨uz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., to appear, Feb. 2020.

[12] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., 2019.

[13] L. Deng, “The MNIST database of handwritten digit images for machine learning research [best of the web],” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, Nov 2012.


Designed for Accessibility and to further Open Science