b

DiscoverSearch
About
My stuff
Fast and Robust Distributed Learning in High Dimension
2019·arXiv
Abstract
Abstract

Modern machine learning is distributed and the work of several machines is typically aggregated by averaging which is the optimal rule in terms of speed, offering a speedup of n (with respect to using a single machine) when n processes are learning together.

Distributing data and models poses however fundamental vulnerabilities, be they to software bugs, asynchrony, or worse, to malicious attackers controlling some machines or injecting misleading data in the network. Such behavior is best modeled as Byzantine failures, and averaging does not tolerate a single one from a worker.

Krum, the first provably Byzantine resilient aggregation rule for SGD only uses one worker per step, which hampers its speed of convergence, especially in best case conditions when none of the workers is actually Byzantine. An idea, coined multi-Krum, of using m different workers per step was mentioned, without however any proof neither on its Byzantine resilience nor on its slowdown. More recently, it was shown that in high dimensional machine learning, guaranteeing convergence is not a sufficient condition for strong Byzantine resilience. A improvement on Krum, coined Bulyan, was proposed and proved to guarantee stronger resilience. However, Bulyan suffers from the same weakness of Krum: using only one worker per step. This adds up to the aforementioned open problem and leaves the crucial need for both fast and strong Byzantine resilience unfulfilled.

The present paper tackles both open problems and proposes using Bulyan over Multi-Krum (we call it MULTI-BULYAN), a combination for which we provide proofs of strong Byzantine resilience, as well as an mn slowdown, compared to averaging, the fastest (but non Byzantine resilient) rule for distributed machine learning.

Finally, modern machine learning involves data of unprecedentedly high dimension: some models are nowadays vectors of dimension  d = 109. In order to deliver results within a reasonable time, learning algorithms should be at most linear in d and avoid using classic security mechanisms, most of which are at least quadratic in d and hence impractical. A strength of MULTI-BULYAN is that it inherits the O(d) merits of both multi-Krum and Bulyan.

The ongoing data deluge has been both a blessing and burden for machine learning system designers. A blessing since machine learning provably performs better with more training datan [10], and a burden since the numbers are beyond previous orders of magnitude. For instance, machine learning set of parameters are now in the gigabyte [5], training data is several orders of magnitude beyond that [5]. For the latter reason, distributed machine learning is not an option, it the only way to deliver results in a reasonable time for the user. For instance, Stochastic Gradient Descent (SGD), an algorithm which is the workhorse of today’s machine learning.

With the amounts of workload involved in today’s machine learning, distributed systems are the only option to deliver results in a reasonable time.

This constraint is even more crucial since ML, given its workload, relies on large scale distributed systems for which communications costs are additional constraints to local computation costs.

We prove the similar mn slowdown of MULTI-BULYAN and its (strong) Byzantine resilience. We deduce that MULTI-BULYAN ensures strong Byzantine resilience and the very fact that it is mn times as fast as the optimal algorithm (averaging) in the absence of Byzantine workers.

MULTI-BULYAN can be viewed as generalization (also using m different workers per step to leverage the fact that f, possibly less than a minority can be faulty) of Bulyan, the defense mechanism we present in [6]. Before presenting in Section 3, our proofs of convergence and slow down of MULTI-KRUM and in Section 4 our proofs of convergence and slow down of BULYAN and hence MULTI-BULYAN, we introduce in Section 2 a toolbox of formal definitions: weak, strong, and  (α, f)–Byzantine resilience. We also present a necessary context on non-convex optimization, as well as its interplay with the high dimensionality of machine learning together with the√dleeway it provides to strong attackers

2.1 Stochastic Gradient Descent

The learning task consists in making accurate predictions for the labels of each data instance  ξiusing a high dimentional model (for example, a neural network); we denote the d parameters of that model by the vector x. Each data instance has a set of features (image pixels), and a set of labels (e.g., {cat, person}). The CNN is trained with the popular backpropagation algorithm based on SGD. Specifically, SGD addresses the following optimization problem.

image

where  ξis a random variable representing a total of B data instances and  F(x; ξ)is the loss function. The function Q(x) is smooth but not convex.

SGD computes the gradient (G(x, ξ) ≜ ∇F(x; ξ)) and then updates the model parameters (x) in a direction opposite to that of the gradient (descent). The vanilla SGD update rule given a sequence of learning rates  {γk}at any given step1 is the following:

image

The popularity of SGD stems from its ability to employ noisy approximations of the actual gradient. In a distributed setup, SGD employs a mini-batch of b < B training instances for the gradient computation:

image

image

Figure 1: Correct workers (black dashed arrows) estimating the real gradient (blue full arrow) while a Byzantine worker (red dotted)

The size of the mini-batch (b) induces a trade-off between the robustness of a given update (noise in the gradient approximation) and the time required to compute this update. The mini-batch also affects the amount of parallelism (Equation 3) that modern computing clusters (multi-GPU etc.) largely benefit from. Scaling the mini-batch size to exploit additional parallelism requires however a non-trivial selection of the sequence of learning rates [7]. A very important assumption for the convergence properties of SGD is that each gradient is an unbiased estimation of the actual gradient, which is typically ensured through uniform random sampling, i.e., gradients that are on expectation equal to the actual gradient (Figure 1).

2.2 Algorithms

MULTI-BULYAN relies on two algorithmic components: MULTI-KRUM [1] and BULYAN [6]. The former rule requires that  n ≥ 2f + 3and the second requires that  n ≥ 4f + 3.

Intuitively, the goal of MULTI-KRUM is to select the gradients that deviate less from the “majority” based on their relative distances. Given gradients  G1 . . . Gnproposed by workers 1 to n respectively, MULTI-KRUM selects the m gradients with the smallest sum of scores (i.e., L2 norm from the other gradients) as follows:

image

where given a function X(i), (m) arg min(X(i)) denotes the indexes i with the m smallest X(i) values, and  i → jmeans that  Gjis among the  n − f − 2closest gradients to  Gi. BULYANin turn takes the aforementioned m vectors, computes their coordinate-wise median and produces a gradient which coordinates are the average of the  m − 2fclosest values to the median.

2.3 Byzantine Resilience

Intuitively, weak Byzantine resilience requires a GAR to guarantee convergence despite the presence of f Byzantine workers. It can be formally stated as follows.

Definition 1 (Weak Byzantine resilience). We say that a GAR ensures weak f-Byzantine resilience if the sequence  x(k) (Equation 2 in the main paper) converges almost surely to some  x∗ where ∇Q(x∗) = 0,despite the presence of f Byzantine workers.

On the other hand, strong Byzantine resilience requires that this convergence does not lead to ”bad” optimums, and is related to more intricate problem of non-convex optimization, which, in the presence of Byzantine workers, is highly aggravated by the dimension of the problem as explained in what follows.

image

Figure 2: In a non-convex situation, two correct vectors (black arrows) are pointing towards the deep optimum located in area B, both vectors belong to the plane formed by lines L1 and L2. A Byzantine worker (magenta) is taking benefit from the third dimension, and the non-convex landscape, to place a vector that is heading towards one of the bad local optimums of area A. This Byzantine vector is located in the plane (L1,L3). Due to the variance of the correct workers on the plane (L1,L2), the Byzantine one has a budget of about√3times the disagreement of the correct workers, to put as a deviation towards A, on the line (L3), while still being selected by a weak Byzantine resilient GAR, since its projection on the plane (L1,L2) lies exactly on the line (L1), unlike that of the correct workers. In very high dimensions, the situation is amplified by√d.

Specificity of non-convex optimization. Non-convex optimization is one of the earliest established NPhard problems [8]. In fact, many interesting but hard questions in machine learning boil down to one answer: ”because the cost function is not necessarily convex”.

In distributed machine learning, the non-convexity of the cost function creates two non-intuitive behaviours that are important to highlight.

(1) A ”mild” Byzantine worker can make the system converge faster. For instance, it has been reported several times in the literature that noise accelerates learning [2, 8]. This can be understood from the ”S” (stochasticity) of SGD: as (correct) workers cannot have a full picture of the surrounding landscape of the loss, they can only draw a sample at random and estimate the best direction based on that sample, which can be, and is probably biased compared to the true gradient. Moreover, due to non-convexity, even the true gradient might be leading to the local minima where the parameter server is. By providing a wrong direction (i.e. not the true gradient, or a correct stochastic estimation), a Byzantine worker whose resources cannot face the high-dimensional landscape of the loss, might end up providing a direction to get out of that local minima.

(2) Combined with high dimensional issues, non-convexity explains the need for strong Byzantine resilience. Unlike the ”mild” Byzantine worker, a strong adversary with more resources than the workers and the server, can see a larger picture and provide an attack that requires a stronger requirement. Namely, a requirement that would cut the√dleeway offered to an attacker in each dimension. Figure 2 provides an illustration.

This motivates the following formalization of strong Byzantine resilience.

Definition 2 (Strong Byzantine resilience). We say that a GAR ensures strong f-Byzantine resilient if for every  i ∈ [1, d], there exists a correct gradient G (i.e., computed by a non-Byzantine worker) s.t.  E|GARi−Gi| = O( 1√d). The the expectation is taken over the random samples (ξin Equation 1)and  videnotes the

ith coordinate of a vector v.

Weak vs. strong Byzantine resilience. To attack non-Byzantine resilient GARs such as averaging, it only takes the computation of an estimate of the gradient, which can be done in O(n.d) operations per round by a Byzantine worker. This attack is reasonably cheap: within the usual cost of the workload of other workers, O(d), and the server, O(n.d).

To attack weakly Byzantine-resilient GARs however, one needs to find the ’most legitimate but harmful vector possible’, i.e one that will (1) be selected by a weakly Byzantine-resilient GAR, and (2) be misleading convergence (red arrow in Figure 1). To find this vector, an attacker has to first collect every correct worker’s vector (before they reach the server), and solve an optimization problem (by linear regression) to approximate this harmful but legitimate vector [6]. If the desired quality of the approximation is  ϵ, theByzantine worker would need at least  Ω( n.dϵ )operation to reach it with regression. This is a tight lower bound for a regression problem in d dimensions with n vectors [8]. In practice, if the required precision is of order  10−9, with 100workers and a neural network model of dimension  109, the cost of the attack becomes quickly prohibitive (≈ 1020 operations to be done in each step by the attacker).

To summarize, weak Byzantine resilience can be enough as a practical solution against attackers whose resources are comparable to the server’s. However, strong Byzantine resilience remains the only provable solution against attackers with significant resources.

For the sake of our theoretical analysis, we also recall the definition of  (α, f)–Byzantine resilience [1] (Definition 3). This definition is a sufficient condition (as proved in [1] based on [2]) for weak Byzantine resilience.Even-though the property of  (α, f)–Byzantine resilience is a sufficient, but not a necessary condition for (weak) Byzantine resilience, it has been so far used as the defacto standard [1,4,11] to guarantee (weak) Byzantine resilience for SGD. We will therefore follow this standard and require  (α, f)–Byzantineresilience from any GAR that is plugged into MULTI-BULYAN, in particular, we will require it from MULTI-KRUM. The theoretical analysis done in [6] guarantees that BULYAN inherits it.

Intuitively, Definition 3 states that the gradient aggregation rule GAR produces an output vector that lives, on average (over random samples used by SGD), in the cone of angle  αaround the true gradient. We simply call this the ”correct cone”.

Definition 3 ((α, f)–Byzantine resilience (as in [1])). Let  0 ≤ α < π/2be any angular value, and any integer  0 ≤ f ≤ n. Let V1, . . . , Vnbe any independent identically distributed random vectors in  Rd,Vi ∼ G, with EG = g. Let B1, . . . , Bfbe any random vectors in  Rd, possibly dependent on the  Vi’s. Anaggregation rule GAR is said to be  (α, f)-Byzantine resilient if, for any  1 ≤ j1 < · · · < jf ≤ n, vector

image

satisfies (i)  ⟨EGAR, g⟩ ≥ (1 − sin α) · ∥g∥2 > 0 2 and (ii) for r = 2, 3, 4, E ∥GAR∥r is bounded above by a linear combination of terms  E ∥G∥r1 . . . E ∥G∥rn−1 with r1 + · · · + rn−1 = r.

Choice of f. The properties of the existing Byzantine-resilient SGD algorithms all depend on one important parameter, i.e., the number of potentially Byzantine nodes f. It is important to notice that f denotes a contract between the designer of the fault-tolerant solution and the user of the solution (who implements a service on top of the solution and deploys it in a specific setting). As long as the number of Byzantine workers is less than f, the solution is safe. Fixing an optimal value for f is an orthogonal problem. For example, if daily failures in a data center are about 1%, f = 0.01.n would be a suggested choice to tune the algorithm, and suffer from only a 99% slowdown.

The performance (convergence time) of certain existing Byzantine-resilient SGD algorithms in a nonByzantine environment is independent of the choice of f. These algorithms do not exploit the full potential of the choice of f. Modern large-scale systems are versatile and often undergo important structural changes while providing online services (e.g., addition or maintenance of certain worker nodes). Intuitively, there should be a fine granularity between the level of pessimism (i.e., value of f) and the performance of the SGD algorithm in the setting with no Byzantine failures.

Let n be any integer greater than 2, f any integer s.t  f ≤ n−22 and man integer s.t  m ≤ n − f − 2. Let˜m = n − f − 2.

We first prove the  (α, f)–Byzantine resilience of MULTI-KRUM (Lemma 1), then prove its almost sure convergence (Lemma 2) based on that, which proves the weak Byzantine resilience of MULTI-KRUM (Theorem 1).

In all what follows, expectations are taken over random samples used by correct workers to estimate the gradient, i.e the ”S” (stochasticity) that is inherent to SGD. It is worth noting that this analysis in expectation is not an average case analysis from the point of view of Byzantine fault tolerance. For instance, the Byzantine worker is always assumed to follow arbitrarily bad policies and the analysis is a worst-case one.

The Byzantine resilience proof (Lemma 1) relies on the following observation: given  m ≤ n − f − 2,and in particular  m = n − f − 2 3, m-Krum averages m gradients that are all in the ”correct cone”, and a cone is a convex set, thus stable by averaging. The resulting vectors therefore also live in that cone. The angle of the cone will depend on a variable  η(n.f)as in [1], the value of  η(n.f)itself depends on m. This is what enables us to use multi-Krum as the basis of our MULTI-KRUM, unlike [1] where a restriction is made on m = 1.

The proof of Lemma 2 is the same as the one in [1] which itself draws on the rather classic analysis of SGD made by L.Bottou [2]. The key concepts are (1) a global confinement of the sequence of parameter vectors and (2) a bound on the statistical moments of the random sequence of estimators built by the GAR of MULTI-KRUM. As in [1,2], reasonable assumptions are made on the cost function Q, those assumption are not restrictive and are common in practical machine learning.

Theorem 1 (Byzantine resilience and slowdown of MULTI-KRUM). Let m be any integer s.t.  m ≤ n−f −2.(i) MULTI-KRUM has weak Byzantine resilience against f failures. (ii) In the absence of Byzantine workers, MULTI-KRUM has a slowdown (expressed in ratio with averaging) of  Ω( ˜mn ).

Proof. Proof of (i). To prove (i), we will require Lemma 1 and Lemma 2, then conclude by construction of MULTI-KRUM as a multi-Krum algorithm with  m = n − f − 2.

Lemma 1. Let  V1, . . . , Vnbe any independent and identically distributed random d-dimensional vectors s.t  Vi ∼ G, with EG = g and E ∥G − g∥2 = dσ2. Let B1, . . . , Bf be any frandom vectors, possibly dependent on the  Vi’s. If 2f + 2 < n and η(n, f)√d · σ < ∥g∥, where

image

then the GAR function of MULTI-KRUM is  (α, f)-Byzantine resilient where  0 ≤ α < π/2is defined by

image

Proof. Without loss of generality, we assume that the Byzantine vectors  B1, . . . , Bfoccupy the last f positions in the list of arguments of MULTI-KRUM, i.e., MULTI-KRUM  = MULTI-KRUM(V1, . . . , Vn−f, B1, . . . , Bf).An index is correct if it refers to a vector among  V1, . . . , Vn−f. An index is Byzantine if it refers to a vector among  B1, . . . , Bf. For each index (correct or Byzantine) i, we denote by  δc(i) (resp. δb(i)) the number of correct (resp. Byzantine) indices  j such that i → j(the notation we introduced in Section 3 when defining MULTI-KRUM), i.e the number of workers, among the m neighbors of i that are correct (resp. Byzantine). We have

image

We focus first on the condition (i) of  (α, f)-Byzantine resilience. We determine an upper bound on the squared distance  ∥EMULTI-KRUM − g∥2. Note that, for any correct  j, EVj = g. We denote by  i∗ the indexof the worst scoring among the m vectors chosen by the MULTI-KRUM function, i.e one that ranks with the mth smallest score in Equation 5 of the main paper (Section 3).

image

where I denotes the indicator function4. We examine the case  i∗ = ifor some correct index i. ������Vi − 1δc(i)

image

image

We now examine the case  i∗ = kfor some Byzantine index k. The fact that k minimizes the score implies that for all correct indices i

image

We focus on the term  D2(i). Each correct process i has m neighbors, and f + 1 non-neighbors. Thus there exists a correct worker  ζ(i)which is farther from i than any of the neighbors of i. In particular, for each Byzantine index  l such that i → l, ∥Vi − Bl∥2 ≤��Vi − Vζ(i)��2. Whence

image

Putting everything back together, we obtain

image

By assumption,  η(n, f)√dσ < ∥g∥, i.e., EMULTI-KRUMbelongs to a ball centered at g with radius  η(n, f)·√d · σ. This implies

image

To sum up, condition (i) of the  (α, f)-Byzantine resilience property holds. We now focus on condition (ii).

image

image

The second inequality comes from the equivalence of norms in finite dimension. Now

image

Since the  Vi’s are independent, we finally obtain that  E ∥MULTI-KRUM∥r is bounded above by a linear combination of terms of the form  E ∥V1∥r1 · · · E ∥Vn−f∥rn−f = E ∥G∥r1 · · · E ∥G∥rn−f with r1 + · · · +rn−f = r. This completes the proof of condition (ii).

Lemma 2. Assume that (i) the cost function Q is three times differentiable with continuous derivatives, and is non-negative,  Q(x) ≥ 0; (ii)the learning rates satisfy �t γt = ∞ and �t γ2t < ∞; (iii)the gradient estimator satisfies  EG(x, ξ) = ∇Q(x) and ∀r ∈ {2, . . . , 4}, E∥G(x, ξ)∥r ≤ Ar + Br∥x∥r for someconstants  Ar, Br; (iv)there exists a constant  0 ≤ α < π/2such that for all x

image

(v) finally, beyond a certain horizon,  ∥x∥2 ≥ D, there exist  ϵ > 0 and 0 ≤ β < π/2 − α such that

image

Then the sequence of gradients  ∇Q(xt)converges almost surely to zero.

Proof. For the sake of simplicity, we write MULTI-KRUMt = MULTI-KRUM(V t1 , . . . , V tn). Before proving the main claim of the proposition, we first show that the sequence  xtis almost surely globally confined within the region  ∥x∥2 ≤ D.

(Global confinement). Let  ut = φ(∥xt∥2) where

image

This becomes an equality when  a, b ≥ D. Applying this inequality to  ut+1 − ut yields

image

Let  Ptdenote the  σ-algebra encoding all the information up to round t. Taking the conditional expectation with respect to  Pt yields

image

Thanks to condition (ii) of  (α, f)-Byzantine resilience, and the assumption on the first four moments of G, there exist positive constants  A0, B0 such that

image

Thus, there exist positive constant A, B such that

image

When  ∥xt∥2 < D, the first term of the right hand side is null because  φ′(∥xt∥2) = 0. When ∥xt∥2 ≥ D,this first term is negative because (see Figure 3)

image

We define two auxiliary sequences

image

Note that the sequence  µtconverges because �t γ2t < ∞. Then

image

Consider the indicator of the positive variations of the left-hand side

image

image

The right-hand side of the previous inequality is the summand of a convergent series. By the quasi-martingale convergence theorem [9], this shows that the sequence  u′t converges almost surely, which in turn shows that the sequence  utconverges almost surely,  ut → u∞ ≥ 0.Let us assume that  u∞ > 0. When tis large enough, this implies that  ∥xt∥2 and ∥xt+1∥2 are greater than D. Inequality 5 becomes an equality, which implies that the following infinite sum converges almost surely

image

Note that the sequence  φ′(∥xt∥2)converges to a positive value. In the region  ∥xt∥2 > D, we have

image

This contradicts the fact that �∞t=1 γt = ∞. Therefore, the sequence  utconverges to zero. This con- vergence implies that the sequence  ∥xt∥2 is bounded, i.e., the vector  xtis confined in a bounded region containing the origin. As a consequence, any continuous function of  xtis also bounded, such as, e.g.,  ∥xt∥2,E ∥G(xt, ξ)∥2 and all the derivatives of the cost function  Q(xt). In the sequel, positive constants  K1, K2,etc. . . are introduced whenever such a bound is used.

(Convergence). We proceed to show that the gradient  ∇Q(xt)converges almost surely to zero. We define

image

Using a first-order Taylor expansion and bounding the second derivative with  K1, we obtain

image

Therefore

image

By the properties of  (α, f)-Byzantine resiliency, this implies

image

which in turn implies that the positive variations of  htare also bounded

image

The right-hand side is the summand of a convergent infinite sum. By the quasi-martingale convergence theorem, the sequence  htconverges almost surely,  Q(xt) → Q∞.

image

that

image

image

Figure 3: Condition on the angles between  xt, ∇Q(xt)and the the  GAR of MULTI-KRUM vectorEMULTI-KRUMt, in the region  ∥xt∥2 > D.

image

Using a Taylor expansion, as demonstrated for the variations of  ht, we obtain

image

Taking the conditional expectation, and bounding the second derivatives by  K4,

image

The positive expected variations of  ρtare bounded

image

The two terms on the right-hand side are the summands of convergent infinite series. By the quasi-martingale convergence theorem, this shows that  ρtconverges almost surely. We have

image

This implies that the following infinite series converge almost surely

image

Since  ρtconverges almost surely, and the series �∞t=1 γt = ∞diverges, we conclude that the sequence ∥∇Q(xt)∥converges almost surely to zero.

We conclude the proof of (i) by recalling the definition of MULTI-KRUM, as the instance of  m − Krumwith  m = n − f − 2.

Proof of (ii). (ii) is a consequence of the fact that m-Krum is the average of m estimators of the gradient. In the absence of Byzantine workers, all those estimators will not only be from the ”correct cone”, but from correct workers (Byzantine workers can also be in the correct cone, but in this case there are none). As SGD converges in  O( 1m), where mis the number of used estimators of the gradient, the slowdown result follows.

Let n be any integer greater than 2, f any integer s.t  f ≤ n−34 and man integer s.t  m ≤ n − 2f − 2. Let˜m = n − 2f − 2.

Theorem 2 (Byzantine resilience and slowdown of MULTI-BULYAN). (i) MULTI-BULYAN provides strong Byzantine resilience against f failures. (ii) In the absence of Byzantine workers, MULTI-BULYAN has a slowdown (expressed in ratio with averaging) of  Ω( ˜mn ).

Proof. If the number of iterations over MULTI-KRUM is  n − 2f, then the leeway, defined by the coordinate-wise distance between the output of BULYAN and a correct gradient is upper bounded by  O( 1√d). This is due to the fact that BULYAN relies on a component-wise median, that, as proven in [6] guarantees this bound. The proof is then a direct consequence of Theorem 1 and the properties of Bulyan [6].

[1] BLANCHARD, P., EL MHAMDI, E. M., GUERRAOUI, R., AND STAINER, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS (2017), pp. 118–128.

[2] BOTTOU, L. Online learning and stochastic approximations. Online learning in neural networks 17, 9 (1998), 142.

[3] DAMASKINOS, G., EL-MHAMDI, E.-M., GUERRAOUI, R., GUIRGUIS, A., AND ROUAULT, S. Aggregathor: : Byzantine machine learning via robust gradient aggregation. In Proceedings of the  1stSysML Conference (2019).

[4] DAMASKINOS, G., EL MHAMDI, E. M., GUERRAOUI, R., PATRA, R., TAZIKI, M., ET AL. Asynchronous Byzantine machine learning (the case of sgd). In Proceedings of the 35th International Conference on Machine Learning (Stockholmsm¨assan, Stockholm Sweden, 10–15 Jul 2018), vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 1153–1162.

[5] DEAN, J., CORRADO, G., MONGA, R., CHEN, K., DEVIN, M., MAO, M., SENIOR, A., TUCKER, P., YANG, K., LE, Q. V., ET AL. Large scale distributed deep networks. In NIPS (2012), pp. 1223– 1231.

[6] EL MHAMDI, E. M., GUERRAOUI, R., AND ROUAULT, S. The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning (Stockholmsm¨assan, Stockholm Sweden, 10–15 Jul 2018), vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 3521–3530.

[7] GOYAL, P., DOLL ´AR, P., GIRSHICK, R., NOORDHUIS, P., WESOLOWSKI, L., KYROLA, A., TUL- LOCH, A., JIA, Y., AND HE, K. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).

[8] HAYKIN, S. S. Neural networks and learning machines, vol. 3. Pearson Upper Saddle River, NJ, USA:, 2009.

[9] M´ETIVIER, M. Semi-Martingales. Walter de Gruyter, 1983.

[10] SHALEV-SHWARTZ, S., AND BEN-DAVID, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

[11] XIE, C., KOYEJO, O., AND GUPTA, I. Generalized Byzantine-tolerant sgd. arXiv preprint arXiv:1802.10116 (2018).


Designed for Accessibility and to further Open Science