Fast and Robust Distributed Learning in High Dimension

2019·Arxiv

Abstract

Abstract

Modern machine learning is distributed and the work of several machines is typically aggregated by averaging which is the optimal rule in terms of speed, offering a speedup of n (with respect to using a single machine) when n processes are learning together.

Distributing data and models poses however fundamental vulnerabilities, be they to software bugs, asynchrony, or worse, to malicious attackers controlling some machines or injecting misleading data in the network. Such behavior is best modeled as Byzantine failures, and averaging does not tolerate a single one from a worker.

Krum, the first provably Byzantine resilient aggregation rule for SGD only uses one worker per step, which hampers its speed of convergence, especially in best case conditions when none of the workers is actually Byzantine. An idea, coined multi-Krum, of using m different workers per step was mentioned, without however any proof neither on its Byzantine resilience nor on its slowdown. More recently, it was shown that in high dimensional machine learning, guaranteeing convergence is not a sufficient condition for strong Byzantine resilience. A improvement on Krum, coined Bulyan, was proposed and proved to guarantee stronger resilience. However, Bulyan suffers from the same weakness of Krum: using only one worker per step. This adds up to the aforementioned open problem and leaves the crucial need for both fast and strong Byzantine resilience unfulfilled.

The present paper tackles both open problems and proposes using Bulyan over Multi-Krum (we call it MULTI-BULYAN), a combination for which we provide proofs of strong Byzantine resilience, as well as an slowdown, compared to averaging, the fastest (but non Byzantine resilient) rule for distributed machine learning.

Finally, modern machine learning involves data of unprecedentedly high dimension: some models are nowadays vectors of dimension . In order to deliver results within a reasonable time, learning algorithms should be at most linear in d and avoid using classic security mechanisms, most of which are at least quadratic in d and hence impractical. A strength of MULTI-BULYAN is that it inherits the O(d) merits of both multi-Krum and Bulyan.

1 Introduction

The ongoing data deluge has been both a blessing and burden for machine learning system designers. A blessing since machine learning provably performs better with more training datan [10], and a burden since the numbers are beyond previous orders of magnitude. For instance, machine learning set of parameters are now in the gigabyte [5], training data is several orders of magnitude beyond that [5]. For the latter reason, distributed machine learning is not an option, it the only way to deliver results in a reasonable time for the user. For instance, Stochastic Gradient Descent (SGD), an algorithm which is the workhorse of today’s machine learning.

With the amounts of workload involved in today’s machine learning, distributed systems are the only option to deliver results in a reasonable time.

This constraint is even more crucial since ML, given its workload, relies on large scale distributed systems for which communications costs are additional constraints to local computation costs.

We prove the similar slowdown of MULTI-BULYAN and its (strong) Byzantine resilience. We deduce that MULTI-BULYAN ensures strong Byzantine resilience and the very fact that it is times as fast as the optimal algorithm (averaging) in the absence of Byzantine workers.

MULTI-BULYAN can be viewed as generalization (also using m different workers per step to leverage the fact that f, possibly less than a minority can be faulty) of Bulyan, the defense mechanism we present in [6]. Before presenting in Section 3, our proofs of convergence and slow down of MULTI-KRUM and in Section 4 our proofs of convergence and slow down of BULYAN and hence MULTI-BULYAN, we introduce in Section 2 a toolbox of formal definitions: weak, strong, and –Byzantine resilience. We also present a necessary context on non-convex optimization, as well as its interplay with the high dimensionality of machine learning together with theleeway it provides to strong attackers

2 Model

2.1 Stochastic Gradient Descent

The learning task consists in making accurate predictions for the labels of each data instance using a high dimentional model (for example, a neural network); we denote the d parameters of that model by the vector x. Each data instance has a set of features (image pixels), and a set of labels (e.g., {cat, person}). The CNN is trained with the popular backpropagation algorithm based on SGD. Specifically, SGD addresses the following optimization problem.

where is a random variable representing a total of B data instances and is the loss function. The function Q(x) is smooth but not convex.

SGD computes the gradient () and then updates the model parameters (x) in a direction opposite to that of the gradient (descent). The vanilla SGD update rule given a sequence of learning rates at any given step1 is the following:

The popularity of SGD stems from its ability to employ noisy approximations of the actual gradient. In a distributed setup, SGD employs a mini-batch of b < B training instances for the gradient computation:

Figure 1: Correct workers (black dashed arrows) estimating the real gradient (blue full arrow) while a Byzantine worker (red dotted)

The size of the mini-batch (b) induces a trade-off between the robustness of a given update (noise in the gradient approximation) and the time required to compute this update. The mini-batch also affects the amount of parallelism (Equation 3) that modern computing clusters (multi-GPU etc.) largely benefit from. Scaling the mini-batch size to exploit additional parallelism requires however a non-trivial selection of the sequence of learning rates [7]. A very important assumption for the convergence properties of SGD is that each gradient is an unbiased estimation of the actual gradient, which is typically ensured through uniform random sampling, i.e., gradients that are on expectation equal to the actual gradient (Figure 1).

2.2 Algorithms

MULTI-BULYAN relies on two algorithmic components: MULTI-KRUM [1] and BULYAN [6]. The former rule requires that and the second requires that

Intuitively, the goal of MULTI-KRUM is to select the gradients that deviate less from the “majority” based on their relative distances. Given gradients proposed by workers 1 to n respectively, MULTI-KRUM selects the m gradients with the smallest sum of scores (i.e., L2 norm from the other gradients) as follows:

where given a function X(i), (m) arg min(X(i)) denotes the indexes i with the m smallest X(i) values, and means that is among the closest gradients to in turn takes the aforementioned m vectors, computes their coordinate-wise median and produces a gradient which coordinates are the average of the closest values to the median.

2.3 Byzantine Resilience

Intuitively, weak Byzantine resilience requires a GAR to guarantee convergence despite the presence of f Byzantine workers. It can be formally stated as follows.

Definition 1 (Weak Byzantine resilience). We say that a GAR ensures weak f-Byzantine resilience if the sequence (Equation 2 in the main paper) converges almost surely to some despite the presence of f Byzantine workers.

On the other hand, strong Byzantine resilience requires that this convergence does not lead to ”bad” optimums, and is related to more intricate problem of non-convex optimization, which, in the presence of Byzantine workers, is highly aggravated by the dimension of the problem as explained in what follows.

Figure 2: In a non-convex situation, two correct vectors (black arrows) are pointing towards the deep optimum located in area B, both vectors belong to the plane formed by lines L1 and L2. A Byzantine worker (magenta) is taking benefit from the third dimension, and the non-convex landscape, to place a vector that is heading towards one of the bad local optimums of area A. This Byzantine vector is located in the plane (L1,L3). Due to the variance of the correct workers on the plane (L1,L2), the Byzantine one has a budget of abouttimes the disagreement of the correct workers, to put as a deviation towards A, on the line (L3), while still being selected by a weak Byzantine resilient GAR, since its projection on the plane (L1,L2) lies exactly on the line (L1), unlike that of the correct workers. In very high dimensions, the situation is amplified by

Specificity of non-convex optimization. Non-convex optimization is one of the earliest established NPhard problems [8]. In fact, many interesting but hard questions in machine learning boil down to one answer: ”because the cost function is not necessarily convex”.

In distributed machine learning, the non-convexity of the cost function creates two non-intuitive behaviours that are important to highlight.

(1) A ”mild” Byzantine worker can make the system converge faster. For instance, it has been reported several times in the literature that noise accelerates learning [2, 8]. This can be understood from the ”S” (stochasticity) of SGD: as (correct) workers cannot have a full picture of the surrounding landscape of the loss, they can only draw a sample at random and estimate the best direction based on that sample, which can be, and is probably biased compared to the true gradient. Moreover, due to non-convexity, even the true gradient might be leading to the local minima where the parameter server is. By providing a wrong direction (i.e. not the true gradient, or a correct stochastic estimation), a Byzantine worker whose resources cannot face the high-dimensional landscape of the loss, might end up providing a direction to get out of that local minima.

(2) Combined with high dimensional issues, non-convexity explains the need for strong Byzantine resilience. Unlike the ”mild” Byzantine worker, a strong adversary with more resources than the workers and the server, can see a larger picture and provide an attack that requires a stronger requirement. Namely, a requirement that would cut theleeway offered to an attacker in each dimension. Figure 2 provides an illustration.

This motivates the following formalization of strong Byzantine resilience.

Definition 2 (Strong Byzantine resilience). We say that a GAR ensures strong f-Byzantine resilient if for every , there exists a correct gradient G (i.e., computed by a non-Byzantine worker) s.t. . The the expectation is taken over the random samples (in Equation 1)and denotes the

coordinate of a vector v.

Weak vs. strong Byzantine resilience. To attack non-Byzantine resilient GARs such as averaging, it only takes the computation of an estimate of the gradient, which can be done in O(n.d) operations per round by a Byzantine worker. This attack is reasonably cheap: within the usual cost of the workload of other workers, O(d), and the server, O(n.d).

To attack weakly Byzantine-resilient GARs however, one needs to find the ’most legitimate but harmful vector possible’, i.e one that will (1) be selected by a weakly Byzantine-resilient GAR, and (2) be misleading convergence (red arrow in Figure 1). To find this vector, an attacker has to first collect every correct worker’s vector (before they reach the server), and solve an optimization problem (by linear regression) to approximate this harmful but legitimate vector [6]. If the desired quality of the approximation is Byzantine worker would need at least operation to reach it with regression. This is a tight lower bound for a regression problem in d dimensions with n vectors [8]. In practice, if the required precision is of order workers and a neural network model of dimension , the cost of the attack becomes quickly prohibitive (operations to be done in each step by the attacker).

To summarize, weak Byzantine resilience can be enough as a practical solution against attackers whose resources are comparable to the server’s. However, strong Byzantine resilience remains the only provable solution against attackers with significant resources.

For the sake of our theoretical analysis, we also recall the definition of –Byzantine resilience [1] (Definition 3). This definition is a sufficient condition (as proved in [1] based on [2]) for weak Byzantine resilience.Even-though the property of –Byzantine resilience is a sufficient, but not a necessary condition for (weak) Byzantine resilience, it has been so far used as the defacto standard [1,4,11] to guarantee (weak) Byzantine resilience for SGD. We will therefore follow this standard and require resilience from any GAR that is plugged into MULTI-BULYAN, in particular, we will require it from MULTI-KRUM. The theoretical analysis done in [6] guarantees that BULYAN inherits it.

Intuitively, Definition 3 states that the gradient aggregation rule GAR produces an output vector that lives, on average (over random samples used by SGD), in the cone of angle around the true gradient. We simply call this the ”correct cone”.

Definition 3 (–Byzantine resilience (as in [1])). Let be any angular value, and any integer be any independent identically distributed random vectors in be any random vectors in , possibly dependent on the aggregation rule GAR is said to be -Byzantine resilient if, for any

satisfies (i) is bounded above by a linear combination of terms

Choice of f. The properties of the existing Byzantine-resilient SGD algorithms all depend on one important parameter, i.e., the number of potentially Byzantine nodes f. It is important to notice that f denotes a contract between the designer of the fault-tolerant solution and the user of the solution (who implements a service on top of the solution and deploys it in a specific setting). As long as the number of Byzantine workers is less than f, the solution is safe. Fixing an optimal value for f is an orthogonal problem. For example, if daily failures in a data center are about 1%, f = 0.01.n would be a suggested choice to tune the algorithm, and suffer from only a 99% slowdown.

The performance (convergence time) of certain existing Byzantine-resilient SGD algorithms in a nonByzantine environment is independent of the choice of f. These algorithms do not exploit the full potential of the choice of f. Modern large-scale systems are versatile and often undergo important structural changes while providing online services (e.g., addition or maintenance of certain worker nodes). Intuitively, there should be a fine granularity between the level of pessimism (i.e., value of f) and the performance of the SGD algorithm in the setting with no Byzantine failures.

3 MULTI-KRUM: Weak Byzantine Resilience and Slowdown

Let n be any integer greater than 2, f any integer s.t an integer s.t

We first prove the –Byzantine resilience of MULTI-KRUM (Lemma 1), then prove its almost sure convergence (Lemma 2) based on that, which proves the weak Byzantine resilience of MULTI-KRUM (Theorem 1).

In all what follows, expectations are taken over random samples used by correct workers to estimate the gradient, i.e the ”S” (stochasticity) that is inherent to SGD. It is worth noting that this analysis in expectation is not an average case analysis from the point of view of Byzantine fault tolerance. For instance, the Byzantine worker is always assumed to follow arbitrarily bad policies and the analysis is a worst-case one.

The Byzantine resilience proof (Lemma 1) relies on the following observation: given and in particular -Krum averages m gradients that are all in the ”correct cone”, and a cone is a convex set, thus stable by averaging. The resulting vectors therefore also live in that cone. The angle of the cone will depend on a variable as in [1], the value of itself depends on m. This is what enables us to use multi-Krum as the basis of our MULTI-KRUM, unlike [1] where a restriction is made on m = 1.

The proof of Lemma 2 is the same as the one in [1] which itself draws on the rather classic analysis of SGD made by L.Bottou [2]. The key concepts are (1) a global confinement of the sequence of parameter vectors and (2) a bound on the statistical moments of the random sequence of estimators built by the GAR of MULTI-KRUM. As in [1,2], reasonable assumptions are made on the cost function Q, those assumption are not restrictive and are common in practical machine learning.

Theorem 1 (Byzantine resilience and slowdown of MULTI-KRUM). Let m be any integer s.t. (i) MULTI-KRUM has weak Byzantine resilience against f failures. (ii) In the absence of Byzantine workers, MULTI-KRUM has a slowdown (expressed in ratio with averaging) of

Proof. Proof of (i). To prove (i), we will require Lemma 1 and Lemma 2, then conclude by construction of MULTI-KRUM as a multi-Krum algorithm with

Lemma 1. Let be any independent and identically distributed random d-dimensional vectors s.t random vectors, possibly dependent on the

then the GAR function of MULTI-KRUM is -Byzantine resilient where is defined by

Proof. Without loss of generality, we assume that the Byzantine vectors occupy the last f positions in the list of arguments of MULTI-KRUM, i.e., MULTI-KRUM An index is correct if it refers to a vector among . An index is Byzantine if it refers to a vector among . For each index (correct or Byzantine) i, we denote by ) the number of correct (resp. Byzantine) indices (the notation we introduced in Section 3 when defining MULTI-KRUM), i.e the number of workers, among the m neighbors of i that are correct (resp. Byzantine). We have

We focus first on the condition (i) of -Byzantine resilience. We determine an upper bound on the squared distance . Note that, for any correct . We denote by of the worst scoring among the m vectors chosen by the MULTI-KRUM function, i.e one that ranks with the smallest score in Equation 5 of the main paper (Section 3).

where I denotes the indicator function4. We examine the case for some correct index i.

We now examine the case for some Byzantine index k. The fact that k minimizes the score implies that for all correct indices i

We focus on the term . Each correct process i has m neighbors, and f + 1 non-neighbors. Thus there exists a correct worker which is farther from i than any of the neighbors of i. In particular, for each Byzantine index

Putting everything back together, we obtain

By assumption, belongs to a ball centered at g with radius . This implies

To sum up, condition (i) of the -Byzantine resilience property holds. We now focus on condition (ii).

The second inequality comes from the equivalence of norms in finite dimension. Now

Since the ’s are independent, we finally obtain that is bounded above by a linear combination of terms of the form . This completes the proof of condition (ii).

Lemma 2. Assume that (i) the cost function Q is three times differentiable with continuous derivatives, and is non-negative, the learning rates satisfy the gradient estimator satisfies constants there exists a constant such that for all x

(v) finally, beyond a certain horizon, , there exist

Then the sequence of gradients converges almost surely to zero.

Proof. For the sake of simplicity, we write MULTI-KRUM. Before proving the main claim of the proposition, we first show that the sequence is almost surely globally confined within the region

(Global confinement). Let

This becomes an equality when . Applying this inequality to

Let denote the -algebra encoding all the information up to round t. Taking the conditional expectation with respect to

Thanks to condition (ii) of -Byzantine resilience, and the assumption on the first four moments of G, there exist positive constants

Thus, there exist positive constant A, B such that

When , the first term of the right hand side is null because this first term is negative because (see Figure 3)

We define two auxiliary sequences

Note that the sequence converges because

Consider the indicator of the positive variations of the left-hand side

The right-hand side of the previous inequality is the summand of a convergent series. By the quasi-martingale convergence theorem [9], this shows that the sequence converges almost surely, which in turn shows that the sequence converges almost surely, Let us assume that is large enough, this implies that are greater than D. Inequality 5 becomes an equality, which implies that the following infinite sum converges almost surely

Note that the sequence converges to a positive value. In the region

This contradicts the fact that . Therefore, the sequence converges to zero. This con- vergence implies that the sequence is bounded, i.e., the vector is confined in a bounded region containing the origin. As a consequence, any continuous function of is also bounded, such as, e.g., and all the derivatives of the cost function . In the sequel, positive constants etc. . . are introduced whenever such a bound is used.

(Convergence). We proceed to show that the gradient converges almost surely to zero. We define

Using a first-order Taylor expansion and bounding the second derivative with , we obtain

Therefore

By the properties of -Byzantine resiliency, this implies

which in turn implies that the positive variations of are also bounded

The right-hand side is the summand of a convergent infinite sum. By the quasi-martingale convergence theorem, the sequence converges almost surely,

that

Figure 3: Condition on the angles between and the the , in the region

Using a Taylor expansion, as demonstrated for the variations of , we obtain

Taking the conditional expectation, and bounding the second derivatives by

The positive expected variations of are bounded

The two terms on the right-hand side are the summands of convergent infinite series. By the quasi-martingale convergence theorem, this shows that converges almost surely. We have

This implies that the following infinite series converge almost surely

Since converges almost surely, and the series diverges, we conclude that the sequence converges almost surely to zero.

We conclude the proof of (i) by recalling the definition of MULTI-KRUM, as the instance of with

Proof of (ii). (ii) is a consequence of the fact that m-Krum is the average of m estimators of the gradient. In the absence of Byzantine workers, all those estimators will not only be from the ”correct cone”, but from correct workers (Byzantine workers can also be in the correct cone, but in this case there are none). As SGD converges in is the number of used estimators of the gradient, the slowdown result follows.

4 MULTI-BULYAN: Strong Byzantine Resilience and Slowdown

Let n be any integer greater than 2, f any integer s.t an integer s.t

Theorem 2 (Byzantine resilience and slowdown of MULTI-BULYAN). (i) MULTI-BULYAN provides strong Byzantine resilience against f failures. (ii) In the absence of Byzantine workers, MULTI-BULYAN has a slowdown (expressed in ratio with averaging) of

Proof. If the number of iterations over MULTI-KRUM is , then the leeway, defined by the coordinate-wise distance between the output of BULYAN and a correct gradient is upper bounded by . This is due to the fact that BULYAN relies on a component-wise median, that, as proven in [6] guarantees this bound. The proof is then a direct consequence of Theorem 1 and the properties of Bulyan [6].

References

[1] BLANCHARD, P., EL MHAMDI, E. M., GUERRAOUI, R., AND STAINER, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS (2017), pp. 118–128.

[2] BOTTOU, L. Online learning and stochastic approximations. Online learning in neural networks 17, 9 (1998), 142.

[3] DAMASKINOS, G., EL-MHAMDI, E.-M., GUERRAOUI, R., GUIRGUIS, A., AND ROUAULT, S. Aggregathor: : Byzantine machine learning via robust gradient aggregation. In Proceedings of the SysML Conference (2019).

[4] DAMASKINOS, G., EL MHAMDI, E. M., GUERRAOUI, R., PATRA, R., TAZIKI, M., ET AL. Asynchronous Byzantine machine learning (the case of sgd). In Proceedings of the 35th International Conference on Machine Learning (Stockholmsm¨assan, Stockholm Sweden, 10–15 Jul 2018), vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 1153–1162.

[5] DEAN, J., CORRADO, G., MONGA, R., CHEN, K., DEVIN, M., MAO, M., SENIOR, A., TUCKER, P., YANG, K., LE, Q. V., ET AL. Large scale distributed deep networks. In NIPS (2012), pp. 1223– 1231.

[6] EL MHAMDI, E. M., GUERRAOUI, R., AND ROUAULT, S. The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning (Stockholmsm¨assan, Stockholm Sweden, 10–15 Jul 2018), vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 3521–3530.

[7] GOYAL, P., DOLL ´AR, P., GIRSHICK, R., NOORDHUIS, P., WESOLOWSKI, L., KYROLA, A., TUL- LOCH, A., JIA, Y., AND HE, K. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).

[8] HAYKIN, S. S. Neural networks and learning machines, vol. 3. Pearson Upper Saddle River, NJ, USA:, 2009.

[9] M´ETIVIER, M. Semi-Martingales. Walter de Gruyter, 1983.

[10] SHALEV-SHWARTZ, S., AND BEN-DAVID, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

[11] XIE, C., KOYEJO, O., AND GUPTA, I. Generalized Byzantine-tolerant sgd. arXiv preprint arXiv:1802.10116 (2018).

designed for accessibility and to further open science