b

DiscoverSearch
About
My stuff
An Optimal Multistage Stochastic Gradient Method for Minimax Problems
2020·arXiv
Abstract
Abstract

In this paper, we study the minimax optimization problem in the smooth and strongly convex-strongly concave setting when we have access to noisy estimates of gradients. In particular, we first analyze the stochastic Gradient Descent Ascent (GDA) method with constant stepsize, and show that it converges to a neighborhood of the solution of the minimax problem. We further provide tight bounds on the convergence rate and the size of this neighborhood. Next, we propose a multistage variant of stochastic GDA (M-GDA) that runs in multiple stages with a particular learning rate decay schedule and converges to the exact solution of the minimax problem. We show M-GDA achieves the lower bounds in terms of noise dependence without any assumptions on the knowledge of noise characteristics. We also show that M-GDA obtains a linear decay rate with respect to the error’s dependence on the initial error, although the dependence on condition number is suboptimal. In order to improve this dependence, we apply the multistage machinery to the stochastic Optimistic Gradient Descent Ascent (OGDA) algorithm and propose the M-OGDA algorithm which also achieves the optimal linear decay rate with respect to the initial error. To the best of our knowledge, this method is the first to simultaneously achieve the best dependence on noise characteristic as well as the initial error and condition number.

The minimax optimization problem has recently gained tremendous attention as the canonical problem formulation for robust training of machine learning models and Generative Adversarial Networks (GANs) (see Madry et al. (2018); Goodfellow et al. (2014); Arjovsky et al. (2017)). While many papers have studied the convergence of a broad range of algorithms in the deterministic setting, i.e., when the gradient information is exact, many aspects of this problem in the stochastic setting are yet to be explored. This is the main goal of our manuscript as we provide a framework for analyzing minimax optimization algorithms which can be used for both the deterministic and stochastic settings.

We consider the minimax problem

image

where  f : Rm × Rn → Ris  L−smooth and  µ-strongly convex-strongly concave (See Section 2 for the precise statement of our assumptions). The condition number of the problem is defined as κ := L/µ. Due to the strong convexity-strong concavity of the function f, this problem has a unique saddle point which we denote by  (x∗, y∗), i.e.,

image

In this paper, our main focus is on the case when the exact gradient information is not available and we only have access to an unbiased estimate through a stochastic oracle. More formally, we assume1that at a point  (xk, yk), we have access to noisy estimates of the gradients ˜∇xf(xk, yk)and ˜∇yf(xk, yk), which are unbiased estimates of  ∇xf(xk, yk)and  ∇yf(xk, yk), respectively, and their variances are bounded by  σ2.

This setting arises in many applications, including the problem of training GANs where the generator and the discriminator approximate the gradient by taking a batch of data D and computing 1|D|�θ∈D ˜∇if(x, y, ; θ)where  i ∈ {x, y}and  ˜∇if(x, y, ; θ)is the gradient computed over a single data point  θ. It is worth noting that the inexact gradient issue also appears in other scenarios such as privacy-related applications where the noise is added intentionally to prevent the model from remembering possibly sensitive data and preserve privacy Xie et al. (2018) or when the presence of noise is inevitable due to imperfections in communication and sensing.

In solving the minimization problem in the stochastic setting, it is well-known that, for many algorithms, the squared distance of the iterates to the solution of the minimization problem can be bounded by the sum of two terms: bias and variance Bach and Moulines (2013); Ghadimi and Lan (2012); Aybat et al. (2019). The bias term captures the effect of the initialization expressed in terms of the distance of the initial point to the solution, and is independent of the noise parameters. The variance term depends on noise characteristics (σ2in our case) and is independent of the initialization error. For the minimization problem with strongly convex objective function, and in the noiseless case (with only the bias term), Nemirovsky and Yudin (1983) have shown the lower bound of  Θ (exp(−Θ(1)n/√κ))for the distance of the n-th iterate to the optimal solution. With noise Raginsky and Rakhlin (2011) have shown the lower bound increases to  Θ(σ2/n). Several papers have highlighted the trade-off between bias and variance which arises in design of optimization algorithms Aybat et al. (2018) and tried to achieve both lower bounds simultaneously Ghadimi and Lan (2013); Aybat et al. (2019).

In this paper, we highlight this bias-variance decomposition in evaluating the performance of algorithms that solve the minimax problem. For the bias term, i.e., the deterministic case, Ibrahim et al. (2019) have recently shown the lower bound  O(1) exp(−Θ(1)n/κ)highlighting that the dependence on condition number increases from √κin minimization problems to  κfor minimax problems. For the variance term, since the minimax problem is a special case of the minimization problem, the lower bound  O(σ2/n)of the minimization problem is also valid for the minimax problem. While this lower bound for variance term has been obtained Hsieh et al. (2019); Rosasco et al. (2014) at the cost of making the bias term sublinear, the question of whether a linear rate  O(1) exp(−Θ(1)n/κ)in bias and  O(σ2/n)in variance could be achieved simultaneously has not been addressed prior to this work.

In what follows, we first provide a summary of related works and then discuss the main contributions of our paper.

1.1 Related Work

1.1.1 Deterministic Case

Many papers have studied the minimax problem when the exact gradient information is available. In the case of Gradient Descent Ascent (GDA) method, Du and Hu (2019) analyzes its performance for the special case of bilinear coupling, i.e., when  f(x, y) = g(x) + y⊤Ax − h(y)where g is smooth and convex, h is smooth and strongly convex, and the matrix A has full column rank. They show that running the GDA algorithm for n steps on this problem reaches a point which is O(1)�1/n2�close to the saddle point. In addition, when the function  g(·)is assumed to be strongly convex, GDA reaches a point which is O(1)exp�−Θ(1)n/κ2�close to the saddle point after n steps. Liang and Stokes (2019) extend this result to a general function f(x, y) which is strongly convex in x and strongly concave in y (achieving the same rate of convergence as Du and Hu (2019)). Several other gradient based algorithms like the Optimistic Gradient Descent Ascent (OGDA) method (see Daskalakis et al. (2018)) and the Extragradient method Korpelevich (1976) have been analyzed in recent papers including Mokhtari et al. (2019b); Liang and Stokes (2019); Gidel et al. (2019); Mokhtari et al. (2019a); Hsieh et al. (2019). These papers analyze these algorithms in several settings including bilinear, strongly convex-strongly concave and convex-concave. More specifically, Gidel et al. (2019); Mokhtari et al. (2019b) show that when the objective function is strongly convex-strongly concave, running the OGDA and Extragradient algorithms for n steps reaches a point which is  O(1) exp(−Θ(1)n/κ)close to the saddle point.

1.1.2 Stochastic Case

The papers which are closest to our results are Rosasco et al. (2014) and Hsieh et al. (2019). Rosasco et al. (2014) propose a forward-backward splitting algorithm to solve the stochastic mini-

Table 1: Summary of results (up to O(1) constants)

image

max problem (they solve the more general problem of monotone inclusions). When the function is strongly convex-strongly concave, they show convergence at a rate of  O(∥z0 − z∗∥2/np + σ2cp/n)to the saddle point, where p is any constant greater than 0 and c is a constant larger than 1. Hsieh et al. (2019) show that the stochastic version of OGDA converges to the saddle point at a rate of  O( 1n)for both bias and variance when the objective function is strongly convex-strongly concave.

There are several papers which analyze the stochastic minimax problem when the objective function is convex-concave. Juditsky et al. (2011) propose the stochastic mirror-prox algorithm (a special case of which is the stochastic extragradient method) to solve the convex-concave saddle point problem with noisy gradients. They assume the constraint set is compact and show a convergence rate of  O(1/√n)(the result in this paper improves on the robust stochastic approximation algorithm proposed in Nemirovski et al. (2009)). Chen et al. (2014) proposes an accelerated pri- mal dual algorithm which achieves a convergence rate of  O(1/√n). Recently, Mertikopoulos et al.

(2018) analyzed the stochastic extragradient algorithm for coherent minimax problems (a condition slightly weaker than convex-concave assumption) and they show asymptotic convergence to a saddle point. Gidel et al. (2019) analyzed a single call version of extragradient (which corresponds to OGDA) when the function is convex-concave and they showed that in the stochastic setting, this algorithm converges to the saddle point at a rate of  O(1/√n).

Another line of work is the case where the objective function has a finite sum structure and the gradient of the entire function cannot be computed at each step. Several papers including Bot et al. (2019); Palaniappan and Bach (2016); Chavdarova et al. (2019); Iusem et al. (2017) analyze this setting and apply variance reduction techniques (like SVRG and SAGA) to improve convergence rates to the saddle point.

1.2 Our Contribution

We first analyze GDA with constant stepsize (learning rate) where we build our analysis by casting it as a dynamical system, an approach that has gained attention in the optimization and machine learning literature recently Lessard et al. (2016); Hu and Lessard (2017); Aybat et al. (2018, 2019). In particular, we show that GDA with any stepsize  α ≤ µ/(4L2)converges to an  O(α)neighborhood of the optimal solution at a linear rate  exp(−αµk). Next, we propose a novel MultistageStochastic Gradient Descent Ascent scheme (inspired from Aybat et al. (2019)) which achieves a rate of  O(σ2/n)for the variance term (which is optimal in terms of n dependence) and a rate of O(1) exp(−Θ(1)n/κ2)for the bias term, and we show that the n and  κdependence of the latter cannot be improved for GDA dynamics.

Next, we focus on the OGDA method which has gained widespread attention for solving minimax problems. We first highlight that OGDA also converges to an  O(α)neighborhood of the optimal solution with linear rate  exp(−αµk), but allows for a broader range of  α ≤ 1/(8L)for the stepsize. Then, we introduce the Multistage version of Stochastic Optimistic Gradient Descent Ascent (M-OGDA) which achieves the rate of  O(σ2/n)for the variance term and a rate of O(1) exp(−Θ(1)n/κ)for the bias term which improves on the  O(1) exp(−Θ(1)n/κ2)decay of the bias term of GDA and matches the lower bound shown in Ibrahim et al. (2019).

1.3 Notation

We denote  d × didentity and zero matrices by  Idand  0d, respectively. Throughout this paper, all vectors are represented as column vectors. The superscript ⊤represents the transpose of a vector or a matrix. For two matrices  A ∈ Rm×nand  B ∈ Rp×q, their Kronecker product is represented by  A ⊗ B. Also,  A ⪰ Bimplies that  A − Bis a symmetric and positive semidefinite matrix.

We first state formally the strong convexity(concavity) and smoothness properties of a function.

Definition 2.1. A convex function  φ : Rn → Ris L-smooth and  µ-strongly convex if it satisfies the following two conditions for all  x, ˆx ∈ Rn:

image

For an  L−smooth convex function  φ(·), we have the following characterization (see Theorem 2.1.5 in Nesterov (2004)):

image

Throughout the paper, we assume the following:

Assumption 2.2. We assume at iterate (x, y), we have access to ˜∇fx(x, y, ζ), ˜∇fy(x, y, ξ)which are unbiased estimates of  ∇fx(x, y)and  ∇fy(x, y), respectively, i.e.,

image

In addition, we assume  ζand  ξare independent from each other and previous iterates. Moreover, we assume

image

Assumption 2.3. The function f(x, y) is continuously differentiable in x and y. For any  ˆy ∈ Rn, f(x, ˆy)is  Lx-smooth and  µx-strongly convex as a function of x. Similarly, for any  ˆx ∈ Rm, f(ˆx, y)is  Ly-smooth and  µy-strongly concave as a function of y.

image

Note that this assumption leads to the saddle point  (x∗, y∗)being unique and, in addition, we have  ∇xf(x∗, y∗) = 0and  ∇yf(x∗, y∗) = 0.

Remark 2.4. Under Assumptions 2.3, we call the function  f(·, ·)as L-smooth and  µ-strongly convex- strongly concave where  µ = min{µx, µy}and  L = max{Lx, Ly, Lxy, Lyx}. We define the condition number of the problem as  κ ≜ L/µ.

We next present some key properties of smooth strongly convex-strongly concave functions that will be used in our analysis. Define:

image

where  z = (x⊤, y⊤)⊤ ∈ Rm+n. For  z = (x⊤, y⊤)⊤, ˆz = (ˆx⊤, ˆy⊤)⊤, we define

image

Also, we define  z∗ ≜ (x∗⊤, y∗⊤)⊤as the unique saddle point. The following lemma follows from the strong convexity and smoothness properties of f.

Lemma 2.5. Let  z, ˆz ∈ Rm+n. Recall the definition of  Φfrom (7) and  µand L from Assumption 2.3 and Remark 2.4. Then,

image

Proof. Check Appendix A.

Using Lemma 2.5, we can prove the following result

Lemma 2.6. Let  z, ˆz ∈ Rm+n. Recall the definition of  Φfrom (7) and  µand L from Assumption 2.3 and Remark 2.4. Then,

image

Proof. Check Appendix B.

Using Lemmas 2.5 and 2.6, we immediately obtain the following result which we state in the form of a matrix inequality since this form is more convenient for subsequent analysis.

Corollary 2.7. Let  z ∈ Rm+n. Recall the definition of  Φfrom (7) and  µand L from Assumption 2.3 and Remark 2.4. Then,

image

In this section, we study the Stochastic Gradient Descent Ascent (GDA) algorithm, which is given by:

image

This can be succinctly written as:

image

where  zk = (x⊤k , y⊤k )⊤and

image

Using this notation, we can represent GDA as a dynamical system as follows:

image

where  A = Im+n, B = −αIm+n. We study the convergence properties of the sequence  {zk}kthrough the evolution of the Lyapunov function  Vp(z) = (z − z∗)⊤P(z − z∗)where  P = p ⊗ Im+nwith  p ≥ 0an arbitrary constant. In particular, in the following lemma, we first bound the difference of  E[Vp(zk+1)] − ρ2E[Vp(zk)]for any  ρ ≥ 0and  k ≥ 0. We skip the proof as it is very similar to the proof of Lemma B.1 in Aybat et al. (2019).

Lemma 3.1. Let  P = p ⊗ Im+nwith  p ≥ 0and consider the function  Vp(z) = (z − z∗)⊤P(z − z∗). Then we have

image

Next, using this lemma, we characterize the convergence of GDA.

Theorem 3.2. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let  {zk}be the iterates generated by GDA (12) with  0 < α ≤ µ4L2. Then, for any  k ≥ 1, we have

image

Proof. See Appendix C.

Remark 3.3. It is worth noting that the range for the stepsize  αin Theorem 3.2 is upper bounded by  µ/4L2(as opposed to just a function of the Lipschitz parameter L, as is the case for Gradient Descent in minimization problems). This is consistent with the fact that GDA may diverge when the strong convexity parameter  µis 0, i.e., the function is convex-concave (see the Bilinear example in Daskalakis et al. (2018)).

3.1 Tightness of the Results

In this subsection, we give an example of a function where after running GDA for n iterations reaches a point which is O(1)exp�−Θ(1)n/κ2�close to the saddle point. Consider the function

image

where  x, y ∈ Rand  0 < µ ≤ L. The condition number of this function is  κ = Lµand the saddle point of this function is (x, y) = (0, 0).

Example 3.4. Let  {xk, yk}be the iterates generated by GDA for the objective function given in Equation (18). Then,

(i) if the gradient at each step is exactly available (i.e. the updates reduce to the deterministic GDA updates), we have:

image

(ii) if at each step the gradients are corrupted by additive i.i.d. noise with a distribution  N(0, σ2), we have

image

Proof. See Appendix D.

Example 3.4(i) shows that in order to find the saddle point of the function f defined in equation (18), we need to run at least  O(κ2 log(1/ǫ))steps of GDA (i.e. the deterministic case) to reach a point which is  ǫ-close to the solution, showing that this dependence on  κcannot be improved. Example 3.4(ii) shows that when the gradients are corrupted by noise with variance  σ2, GDA reaches an  O(α)neighborhood of the saddle point and this dependence on  αcannot be improved.

image

image

Our result in Theorem 3.2 shows that for GDA with constant stepsize  α, the iterates converge to an  O(α)neighborhood of the saddle point. In this section, we introduce a new method which is a variant of GDA with progressively decreasing stepsize that converges to the exact unique saddle point of problem 1. Our proposed algorithm, Multistage Stochastic Gradient Descent Ascent (MGDA), which is presented in Algorithm 1, runs in several stages where each stage is the GDA method with constant stepsize. In what follows, we show our multistage method with a carefully chosen learning rate and step length evolution achieves linear decay in the bias term as well as optimal variance dependence without any knowledge of the noise properties.

Theorem 4.1. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let {(xkm, ykm)nkm=0}Kk=1be the iterates generated by M-GDA (Algorithm 1) with the following parameters

image

Proof. We prove the result by induction on k. To simplify the notation, we define errk :=E�∥zknk − z∗∥2�.First, and for k = 1, note that, by using Theorem 3.2 along with the fact that  1 − αµ ≤exp(−αµ), we have:

image

where we plugged in  α1 = µ4L2 , n1 ≥ 1to obtain the last equality. Hence, the result holds for k = 1. Now, assume the result holds for k, and we show it for k + 1. Note that, Theorem 3.2 for stage k + 1 yields:

image

where we used  αk+1 = µ/(L22k+3)and  nk+1 ≥ p2k+3κ2 log(2)to derive the last inequality. Now, note that, by induction hypothesis, we have

image

Substituting this bound in (23), we obtain

image

where the last bound follows from  p ≥ 2. This completes the proof.

The above theorem provides an upper bound on the distance of the last iterate of each stage to the saddle point of problem (1). Using this result, and in the following corollary, we provide an upper bound on the distance of any iterate from the saddle point. Before stating this corollary, let {zn}nbe the sequence which is obtained by concatenating the  {(xkm, ykm)nkm=0}Kk=1sequences, i.e.,

image

Corollary 4.2. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Consider running M-GDA (Algorithm 1) with the parameters given in Theorem 4.1. Also, recall the definition of the concatenated sequence  {zn}from (28). Then, for any  n > n1, we have

image

Proof. Check Appendix E

We interpret this result in two different regimes. First, we consider the case where we are given a fixed budget of n iterations. In this case, the following corollary shows how we can tune the parameters to obtain linear decay in the bias term as well as O(1/n) reduction in the variance term. We omit the proof as it is an immediate application of Corollary 4.2.

Corollary 4.3. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Consider running M-GDA (Algorithm 1) with the parameters given in Theorem 4.1 and  p = 2, n1 = nCwith C ≥ 2. Also, recall the definition of the concatenated sequence  {zn}from (28). Then, for any n ≥ 2κ2, we have

image

Finally, in the following corollary, we illustrate how our results can be applied to the case where we do not know the number of iterations in advance.

Corollary 4.4. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Consider running M-GDA (Algorithm 1) with the parameters given in Theorem 4.1 and  n1 = ⌈4pκ2 log(pκ2)⌉for an arbitrary  p ≥ 2. Also, recall the definition of the concatenated sequence  {zn}from (28). Then, for any  n ≥ 2n1, we have

image

It is worth noting that the results in Corollaries 4.3 and 4.4 are presented in terms of the  nthiterate  znwhich is obtained by concatenating the iterates of all stages, including inner iterations (as given in (28)). In fact, while it is true that the number of inner stage iterations increases, the bounds in Table 1 and Corollaries 4.3 and 4.4, are all based on the total number of iterations, and therefore, they take into account the inner stage iterations.

image

As we showed in previous Section, M-GDA achieves the optimal variance rate  O(σ2/n)as well as linear decay  O(1) exp(−Θ(1)n/κ2)in the bias term. However, the dependence of the latter to condition number  κis suboptimal compared to the lower bound presented in Ibrahim et al. (2019). Therefore, a natural question is whether we can design an algorithm which matches the lower bound for the bias term while simultaneously enjoying the optimal variance decay. In this section, we show that this is possible, and we do so by applying the multistage machinery to the stochastic Optimistic Gradient Descent Ascent (OGDA) algorithm. In this section, we first revisit the existing results on convergence of stochastic OGDA method, and next, show how its multistage version (M-OGDA) can matches both lower bounds simultaneously.

The stochastic OGDA method is given by:

image

which can also be written as:

image

where recall that  zk = (x⊤k , y⊤k )⊤, αis the stepsize, and  ˜Φ(·)are the stochastic gradients (unbiased estimates of the true gradients) (13). The OGDA updates have been observed to permorm well empirically for training GANs (see Daskalakis et al. (2018)) and have been proved to converge for convex-concave problems (see Mokhtari et al. (2019b); Hsieh et al. (2019)) which is not true for GDA. As shown in Gidel et al. (2019); Hsieh et al. (2019), the OGDA updates can also be thought of as a single call version of the Extragradient method (since it reuses a gradient from the past) and using this interpretation, the OGDA algorithm can also be written as follows2:

image

Note that the difference from Extragradient (EG) is that in EG, the update for  zk+1involves the gradient at  wkwhereas here we use the gradient at  zkinstead. We will use this form of the Stochastic OGDA updates for our analysis. From the analysis 3of Theorem 5 in Hsieh et al. (2019), we have the following result for Stochastic OGDA:

Theorem 5.1. (Hsieh et al. (2019)) Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let  {zk, wk}kbe the iterates generated by Stochastic OGDA with  0 < α ≤ 18L. Then, for any  k ≥ 1, we have

image

This is similar to Theorem 3.2 for GDA. However, we can see that for OGDA, the range of permissible stepsizes goes all the way up to O(1/L) whereas for GDA, the stepsizes are upper bounded by  O(µ/L2).

The result in Theorem 5.1 shows that for OGDA with constant stepsize  α, the iterates converge to an  O(α)neighborhood of the saddle point. Next, we analyze a multistage version of OGDA (M-OGDA) similar to the analysis of M-GDA in Section 4. We show that the iterates of M-OGDA converge to the unique saddle point at a rate where the variance decays as  O(σ2/n), which is optimal (and also achieved by M-GDA), but the bias term decays as  O(1) exp(−n/Θ(κ)), which "accelerates" GDA in terms of its dependence on  κ. More formally, we state the following theorem which is analogous to Theorem 4.1 for M-GDA and present the convergence rate of M-OGDA (we omit the proof as it is very similar to that of Theorem 4.1):

Theorem 5.2. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let {(wkm, zkm)nkm=0}Kk=1be the iterates generated by M-OGDA (Algorithm 2) with the following pa- rameters

image

image

Similar to the discussion in Section 4, we next state how our result leads to bounds on distance of each iterate to the saddle point of Problem 1. In addition, we propose proper choice of parameters in general as well as in the case that the iteration budget is known in advance, the results corresponding to Corollaries 4.2 and 4.4 for M-GDA. Before stating this corollary, we define  {wn}nto be the sequence which is obtained by concatenating the  {(wkm)nkm=0}Kk=1sequences, i.e.,

image

Corollary 5.3. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let {(wkm, zkm)nkm=0}Kk=1be the iterates generated by M-OGDA (Algorithm 2) with the parameters given in Theorem 4.1. Also, recall the definition of the concatenated sequence  {wn}from (35). Then, for any  n > n1, we have

image

Also, when the number of iterations n is known in advance, choosing  p = 2, n1 = nCwith  C ≥ 2, implies

image

Once again, we would like to highlight that the results in Corollary 5.3 are presented in terms of the  nthiterate  wnwhich is obtained by concatenating the iterates of all stages, including inner iterations (as given in (35)). As a result, in comparing our results to other methods in Table 1, we take into account the inner stage iterations.

In this paper, we propose multistage versions of Gradient Descent Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA) algorithms to solve the stochastic minimax problems. In particular, these algorithms are the first to achieve linear rate in bias and optimal  O(σ2/n)rate in variance, simultaneously. We also show that Multistage OGDA improves the bias rate of Multistage GDA from  O(exp(−Θ(1)n/κ2))to  O(exp(−Θ(1)n/κ))which is the best known rate in deterministic minimax optimization.

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 214–223.

Aybat, N. S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar, A. (2019). A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022.

Aybat, S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar, A. (2018). Robust accelerated gradient methods for smooth strongly convex functions. arXiv preprint arXiv:1805.10579.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate o(1/n). In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 26, pages 773– 781. Curran Associates, Inc.

Bot, R. I., Mertikopoulos, P., Staudigl, M., and Vuong, P. T. (2019). Forward-backward-forward methods with variance reduction for stochastic variational inequalities. arXiv preprint arXiv:1902.03355.

Chavdarova, T., Gidel, G., Fleuret, F., and Lacoste-Julien, S. (2019). Reducing noise in gan training with variance reduced extragradient. arXiv preprint arXiv:1904.08598.

Chen, Y., Lan, G., and Ouyang, Y. (2014). Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. (2018). Training gans with optimism. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Du, S. S. and Hu, W. (2019). Linear convergence of the primal-dual gradient method for convex- concave saddle point problems without strong convexity. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 196–205.

Ghadimi, S. and Lan, G. (2012). Optimal stochastic approximation algorithms for strongly con- vex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492.

Ghadimi, S. and Lan, G. (2013). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089.

Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-Julien, S. (2019). A variational inequality perspective on generative adversarial networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.

Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. (2019). On the convergence of single- call stochastic extra-gradient methods. arXiv preprint arXiv:1908.08465.

Hu, B. and Lessard, L. (2017). Dissipativity theory for Nesterov’s accelerated method. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1549–1557, International Convention Centre, Sydney, Australia. PMLR.

Ibrahim, A., Azizian, W., Gidel, G., and Mitliagkas, I. (2019). Lower bounds and conditioning of differentiable games. arXiv preprint arXiv:1906.07300.

Iusem, A. N., Jofré, A., Oliveira, R. I., and Thompson, P. (2017). Extragradient method with vari- ance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 27(2):686– 724.

Juditsky, A., Nemirovski, A., and Tauvel, C. (2011). Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58.

Korpelevich, G. (1976). The extragradient method for finding saddle points and other problems. Matecon, 12:747–756.

Lessard, L., Recht, B., and Packard, A. (2016). Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95.

Liang, T. and Stokes, J. (2019). Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 907–915.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. (2018). Mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. (2019a). Proximal point approximations achieving a convergence rate of O(1/k) for smooth convex-concave saddle point problems: Optimistic gradient and extra-gradient methods. arXiv preprint arXiv:1906.01115.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. (2019b). A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. Wiley.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer.

Palaniappan, B. and Bach, F. R. (2016). Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1408–1416.

Raginsky, M. and Rakhlin, A. (2011). Information-based complexity, feedback and dynamics in convex programming. IEEE Transactions on Information Theory, 57(10):7036–7056.

Rosasco, L., Villa, S., and V˜u, B. C. (2014). A stochastic forward-backward splitting method for solving monotone inclusions in hilbert spaces. arXiv preprint arXiv:1403.7999.

Xie, L., Lin, K., Wang, S., Wang, F., and Zhou, J. (2018). Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739.

We have:

image

where (39) follows from (3). Similarly

image

Adding equations (40) and (41), and noting that

image

we obtain the right hand side of (9). Similarly, we have:

image

where (43) follows from (4). Similarly

image

Adding equations (44) and (45) we obtain the left hand side of (9).

From Lemma 2.5, we have:

image

and from Assumption 2.1, we have:

image

Combining Equation (46) and (46), we have:

image

First, note that for ρ2 = 1 −αµ and P  = p ⊗ Im+nwith p = 1, we have

image

where the last inequality follows from the fact that  α ≤ µ/(4L2). This result implies

image

Substituting left and right hand side by using Corollary 2.7 and Lemma 3.1, respectively, yields

image

Finally, dividing both sides by p completes the proof (16). To show the second result, note that by using (16) sequentially, we have

image

which completes the proof.

We have the function:

image

The gradient at step k is corrupted by noise  ξxkand  ξykwhich we assume to be iid  ∼ N(0, σ). The GDA method when applied to this problems leads to:

image

This gives:

image

which proves the first part of the lemma. Note that when the gradients are not corrupted by noise (i.e. when  σ = 0, we have)

image

The coefficient on the right side is minimized for  α = µµ2+L2. Substituting this in equation (52), we get:

image

Now, making the substitution  κ = Lµ, we have:

image

Therefore, for any other stepsize  α > 0, we have:

image

image

As a consequence, and using �ki=2 2i+2 = 16(2k−1 − 1), we have

image

and as a result, we have

image

Also, by Theorem 3.2 for stage k + 1, we have

image

where the last inequality follows from (61). Now, plugging in (60) in this bound completes the proof.


Designed for Accessibility and to further Open Science