An Optimal Multistage Stochastic Gradient Method for Minimax Problems

2020·Arxiv

Abstract

Abstract

In this paper, we study the minimax optimization problem in the smooth and strongly convex-strongly concave setting when we have access to noisy estimates of gradients. In particular, we first analyze the stochastic Gradient Descent Ascent (GDA) method with constant stepsize, and show that it converges to a neighborhood of the solution of the minimax problem. We further provide tight bounds on the convergence rate and the size of this neighborhood. Next, we propose a multistage variant of stochastic GDA (M-GDA) that runs in multiple stages with a particular learning rate decay schedule and converges to the exact solution of the minimax problem. We show M-GDA achieves the lower bounds in terms of noise dependence without any assumptions on the knowledge of noise characteristics. We also show that M-GDA obtains a linear decay rate with respect to the error’s dependence on the initial error, although the dependence on condition number is suboptimal. In order to improve this dependence, we apply the multistage machinery to the stochastic Optimistic Gradient Descent Ascent (OGDA) algorithm and propose the M-OGDA algorithm which also achieves the optimal linear decay rate with respect to the initial error. To the best of our knowledge, this method is the first to simultaneously achieve the best dependence on noise characteristic as well as the initial error and condition number.

1 Introduction

The minimax optimization problem has recently gained tremendous attention as the canonical problem formulation for robust training of machine learning models and Generative Adversarial Networks (GANs) (see Madry et al. (2018); Goodfellow et al. (2014); Arjovsky et al. (2017)). While many papers have studied the convergence of a broad range of algorithms in the deterministic setting, i.e., when the gradient information is exact, many aspects of this problem in the stochastic setting are yet to be explored. This is the main goal of our manuscript as we provide a framework for analyzing minimax optimization algorithms which can be used for both the deterministic and stochastic settings.

We consider the minimax problem

where is smooth and -strongly convex-strongly concave (See Section 2 for the precise statement of our assumptions). The condition number of the problem is defined as . Due to the strong convexity-strong concavity of the function f, this problem has a unique saddle point which we denote by , i.e.,

In this paper, our main focus is on the case when the exact gradient information is not available and we only have access to an unbiased estimate through a stochastic oracle. More formally, we assumethat at a point , we have access to noisy estimates of the gradients and , which are unbiased estimates of and , respectively, and their variances are bounded by .

This setting arises in many applications, including the problem of training GANs where the generator and the discriminator approximate the gradient by taking a batch of data D and computing where and is the gradient computed over a single data point . It is worth noting that the inexact gradient issue also appears in other scenarios such as privacy-related applications where the noise is added intentionally to prevent the model from remembering possibly sensitive data and preserve privacy Xie et al. (2018) or when the presence of noise is inevitable due to imperfections in communication and sensing.

In solving the minimization problem in the stochastic setting, it is well-known that, for many algorithms, the squared distance of the iterates to the solution of the minimization problem can be bounded by the sum of two terms: bias and variance Bach and Moulines (2013); Ghadimi and Lan (2012); Aybat et al. (2019). The bias term captures the effect of the initialization expressed in terms of the distance of the initial point to the solution, and is independent of the noise parameters. The variance term depends on noise characteristics (in our case) and is independent of the initialization error. For the minimization problem with strongly convex objective function, and in the noiseless case (with only the bias term), Nemirovsky and Yudin (1983) have shown the lower bound of for the distance of the n-th iterate to the optimal solution. With noise Raginsky and Rakhlin (2011) have shown the lower bound increases to . Several papers have highlighted the trade-off between bias and variance which arises in design of optimization algorithms Aybat et al. (2018) and tried to achieve both lower bounds simultaneously Ghadimi and Lan (2013); Aybat et al. (2019).

In this paper, we highlight this bias-variance decomposition in evaluating the performance of algorithms that solve the minimax problem. For the bias term, i.e., the deterministic case, Ibrahim et al. (2019) have recently shown the lower bound highlighting that the dependence on condition number increases from in minimization problems to for minimax problems. For the variance term, since the minimax problem is a special case of the minimization problem, the lower bound of the minimization problem is also valid for the minimax problem. While this lower bound for variance term has been obtained Hsieh et al. (2019); Rosasco et al. (2014) at the cost of making the bias term sublinear, the question of whether a linear rate in bias and in variance could be achieved simultaneously has not been addressed prior to this work.

In what follows, we first provide a summary of related works and then discuss the main contributions of our paper.

1.1 Related Work

1.1.1 Deterministic Case

Many papers have studied the minimax problem when the exact gradient information is available. In the case of Gradient Descent Ascent (GDA) method, Du and Hu (2019) analyzes its performance for the special case of bilinear coupling, i.e., when where g is smooth and convex, h is smooth and strongly convex, and the matrix A has full column rank. They show that running the GDA algorithm for n steps on this problem reaches a point which is close to the saddle point. In addition, when the function is assumed to be strongly convex, GDA reaches a point which is O(1)expclose to the saddle point after n steps. Liang and Stokes (2019) extend this result to a general function f(x, y) which is strongly convex in x and strongly concave in y (achieving the same rate of convergence as Du and Hu (2019)). Several other gradient based algorithms like the Optimistic Gradient Descent Ascent (OGDA) method (see Daskalakis et al. (2018)) and the Extragradient method Korpelevich (1976) have been analyzed in recent papers including Mokhtari et al. (2019b); Liang and Stokes (2019); Gidel et al. (2019); Mokhtari et al. (2019a); Hsieh et al. (2019). These papers analyze these algorithms in several settings including bilinear, strongly convex-strongly concave and convex-concave. More specifically, Gidel et al. (2019); Mokhtari et al. (2019b) show that when the objective function is strongly convex-strongly concave, running the OGDA and Extragradient algorithms for n steps reaches a point which is close to the saddle point.

1.1.2 Stochastic Case

The papers which are closest to our results are Rosasco et al. (2014) and Hsieh et al. (2019). Rosasco et al. (2014) propose a forward-backward splitting algorithm to solve the stochastic mini-

Table 1: Summary of results (up to O(1) constants)

max problem (they solve the more general problem of monotone inclusions). When the function is strongly convex-strongly concave, they show convergence at a rate of to the saddle point, where p is any constant greater than 0 and c is a constant larger than 1. Hsieh et al. (2019) show that the stochastic version of OGDA converges to the saddle point at a rate of for both bias and variance when the objective function is strongly convex-strongly concave.

There are several papers which analyze the stochastic minimax problem when the objective function is convex-concave. Juditsky et al. (2011) propose the stochastic mirror-prox algorithm (a special case of which is the stochastic extragradient method) to solve the convex-concave saddle point problem with noisy gradients. They assume the constraint set is compact and show a convergence rate of (the result in this paper improves on the robust stochastic approximation algorithm proposed in Nemirovski et al. (2009)). Chen et al. (2014) proposes an accelerated pri- mal dual algorithm which achieves a convergence rate of . Recently, Mertikopoulos et al.

(2018) analyzed the stochastic extragradient algorithm for coherent minimax problems (a condition slightly weaker than convex-concave assumption) and they show asymptotic convergence to a saddle point. Gidel et al. (2019) analyzed a single call version of extragradient (which corresponds to OGDA) when the function is convex-concave and they showed that in the stochastic setting, this algorithm converges to the saddle point at a rate of .

Another line of work is the case where the objective function has a finite sum structure and the gradient of the entire function cannot be computed at each step. Several papers including Bot et al. (2019); Palaniappan and Bach (2016); Chavdarova et al. (2019); Iusem et al. (2017) analyze this setting and apply variance reduction techniques (like SVRG and SAGA) to improve convergence rates to the saddle point.

1.2 Our Contribution

We first analyze GDA with constant stepsize (learning rate) where we build our analysis by casting it as a dynamical system, an approach that has gained attention in the optimization and machine learning literature recently Lessard et al. (2016); Hu and Lessard (2017); Aybat et al. (2018, 2019). In particular, we show that GDA with any stepsize converges to an neighborhood of the optimal solution at a linear rate . Next, we propose a novel MultistageStochastic Gradient Descent Ascent scheme (inspired from Aybat et al. (2019)) which achieves a rate of for the variance term (which is optimal in terms of n dependence) and a rate of for the bias term, and we show that the n and dependence of the latter cannot be improved for GDA dynamics.

Next, we focus on the OGDA method which has gained widespread attention for solving minimax problems. We first highlight that OGDA also converges to an neighborhood of the optimal solution with linear rate , but allows for a broader range of for the stepsize. Then, we introduce the Multistage version of Stochastic Optimistic Gradient Descent Ascent (M-OGDA) which achieves the rate of for the variance term and a rate of for the bias term which improves on the decay of the bias term of GDA and matches the lower bound shown in Ibrahim et al. (2019).

1.3 Notation

We denote identity and zero matrices by and , respectively. Throughout this paper, all vectors are represented as column vectors. The superscript represents the transpose of a vector or a matrix. For two matrices and , their Kronecker product is represented by . Also, implies that is a symmetric and positive semidefinite matrix.

2 Preliminaries

We first state formally the strong convexity(concavity) and smoothness properties of a function.

Definition 2.1. A convex function is L-smooth and -strongly convex if it satisfies the following two conditions for all :

For an smooth convex function , we have the following characterization (see Theorem 2.1.5 in Nesterov (2004)):

Throughout the paper, we assume the following:

Assumption 2.2. We assume at iterate (x, y), we have access to which are unbiased estimates of and , respectively, i.e.,

In addition, we assume and are independent from each other and previous iterates. Moreover, we assume

Assumption 2.3. The function f(x, y) is continuously differentiable in x and y. For any , is -smooth and -strongly convex as a function of x. Similarly, for any is -smooth and -strongly concave as a function of y.

Note that this assumption leads to the saddle point being unique and, in addition, we have and .

Remark 2.4. Under Assumptions 2.3, we call the function as L-smooth and -strongly convex- strongly concave where and . We define the condition number of the problem as .

We next present some key properties of smooth strongly convex-strongly concave functions that will be used in our analysis. Define:

where . For , we define

Also, we define as the unique saddle point. The following lemma follows from the strong convexity and smoothness properties of f.

Lemma 2.5. Let . Recall the definition of from (7) and and L from Assumption 2.3 and Remark 2.4. Then,

Proof. Check Appendix A.

Using Lemma 2.5, we can prove the following result

Lemma 2.6. Let . Recall the definition of from (7) and and L from Assumption 2.3 and Remark 2.4. Then,

Proof. Check Appendix B.

Using Lemmas 2.5 and 2.6, we immediately obtain the following result which we state in the form of a matrix inequality since this form is more convenient for subsequent analysis.

Corollary 2.7. Let . Recall the definition of from (7) and and L from Assumption 2.3 and Remark 2.4. Then,

3 Analysis of Stochastic Gradient Descent Ascent Method

In this section, we study the Stochastic Gradient Descent Ascent (GDA) algorithm, which is given by:

This can be succinctly written as:

where and

Using this notation, we can represent GDA as a dynamical system as follows:

where . We study the convergence properties of the sequence through the evolution of the Lyapunov function where with an arbitrary constant. In particular, in the following lemma, we first bound the difference of for any and . We skip the proof as it is very similar to the proof of Lemma B.1 in Aybat et al. (2019).

Lemma 3.1. Let with and consider the function . Then we have

Next, using this lemma, we characterize the convergence of GDA.

Theorem 3.2. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let be the iterates generated by GDA (12) with . Then, for any , we have

Proof. See Appendix C.

Remark 3.3. It is worth noting that the range for the stepsize in Theorem 3.2 is upper bounded by (as opposed to just a function of the Lipschitz parameter L, as is the case for Gradient Descent in minimization problems). This is consistent with the fact that GDA may diverge when the strong convexity parameter is 0, i.e., the function is convex-concave (see the Bilinear example in Daskalakis et al. (2018)).

3.1 Tightness of the Results

In this subsection, we give an example of a function where after running GDA for n iterations reaches a point which is O(1)expclose to the saddle point. Consider the function

where and . The condition number of this function is and the saddle point of this function is (x, y) = (0, 0).

Example 3.4. Let be the iterates generated by GDA for the objective function given in Equation (18). Then,

(i) if the gradient at each step is exactly available (i.e. the updates reduce to the deterministic GDA updates), we have:

(ii) if at each step the gradients are corrupted by additive i.i.d. noise with a distribution , we have

Proof. See Appendix D.

Example 3.4(i) shows that in order to find the saddle point of the function f defined in equation (18), we need to run at least steps of GDA (i.e. the deterministic case) to reach a point which is -close to the solution, showing that this dependence on cannot be improved. Example 3.4(ii) shows that when the gradients are corrupted by noise with variance , GDA reaches an neighborhood of the saddle point and this dependence on cannot be improved.

4 A Multistage Stochastic Gradient Descent Ascent Method

Our result in Theorem 3.2 shows that for GDA with constant stepsize , the iterates converge to an neighborhood of the saddle point. In this section, we introduce a new method which is a variant of GDA with progressively decreasing stepsize that converges to the exact unique saddle point of problem 1. Our proposed algorithm, Multistage Stochastic Gradient Descent Ascent (MGDA), which is presented in Algorithm 1, runs in several stages where each stage is the GDA method with constant stepsize. In what follows, we show our multistage method with a carefully chosen learning rate and step length evolution achieves linear decay in the bias term as well as optimal variance dependence without any knowledge of the noise properties.

Theorem 4.1. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let be the iterates generated by M-GDA (Algorithm 1) with the following parameters

Proof. We prove the result by induction on k. To simplify the notation, we define err.First, and for k = 1, note that, by using Theorem 3.2 along with the fact that , we have:

where we plugged in to obtain the last equality. Hence, the result holds for k = 1. Now, assume the result holds for k, and we show it for k + 1. Note that, Theorem 3.2 for stage k + 1 yields:

where we used and to derive the last inequality. Now, note that, by induction hypothesis, we have

Substituting this bound in (23), we obtain

where the last bound follows from . This completes the proof.

The above theorem provides an upper bound on the distance of the last iterate of each stage to the saddle point of problem (1). Using this result, and in the following corollary, we provide an upper bound on the distance of any iterate from the saddle point. Before stating this corollary, let be the sequence which is obtained by concatenating the sequences, i.e.,

Corollary 4.2. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Consider running M-GDA (Algorithm 1) with the parameters given in Theorem 4.1. Also, recall the definition of the concatenated sequence from (28). Then, for any , we have

Proof. Check Appendix E

We interpret this result in two different regimes. First, we consider the case where we are given a fixed budget of n iterations. In this case, the following corollary shows how we can tune the parameters to obtain linear decay in the bias term as well as O(1/n) reduction in the variance term. We omit the proof as it is an immediate application of Corollary 4.2.

Corollary 4.3. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Consider running M-GDA (Algorithm 1) with the parameters given in Theorem 4.1 and with . Also, recall the definition of the concatenated sequence from (28). Then, for any , we have

Finally, in the following corollary, we illustrate how our results can be applied to the case where we do not know the number of iterations in advance.

Corollary 4.4. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Consider running M-GDA (Algorithm 1) with the parameters given in Theorem 4.1 and for an arbitrary . Also, recall the definition of the concatenated sequence from (28). Then, for any , we have

It is worth noting that the results in Corollaries 4.3 and 4.4 are presented in terms of the iterate which is obtained by concatenating the iterates of all stages, including inner iterations (as given in (28)). In fact, while it is true that the number of inner stage iterations increases, the bounds in Table 1 and Corollaries 4.3 and 4.4, are all based on the total number of iterations, and therefore, they take into account the inner stage iterations.

5 A Multistage Stochastic Optimistic Gradient Descent As-

As we showed in previous Section, M-GDA achieves the optimal variance rate as well as linear decay in the bias term. However, the dependence of the latter to condition number is suboptimal compared to the lower bound presented in Ibrahim et al. (2019). Therefore, a natural question is whether we can design an algorithm which matches the lower bound for the bias term while simultaneously enjoying the optimal variance decay. In this section, we show that this is possible, and we do so by applying the multistage machinery to the stochastic Optimistic Gradient Descent Ascent (OGDA) algorithm. In this section, we first revisit the existing results on convergence of stochastic OGDA method, and next, show how its multistage version (M-OGDA) can matches both lower bounds simultaneously.

The stochastic OGDA method is given by:

which can also be written as:

where recall that is the stepsize, and are the stochastic gradients (unbiased estimates of the true gradients) (13). The OGDA updates have been observed to permorm well empirically for training GANs (see Daskalakis et al. (2018)) and have been proved to converge for convex-concave problems (see Mokhtari et al. (2019b); Hsieh et al. (2019)) which is not true for GDA. As shown in Gidel et al. (2019); Hsieh et al. (2019), the OGDA updates can also be thought of as a single call version of the Extragradient method (since it reuses a gradient from the past) and using this interpretation, the OGDA algorithm can also be written as follows:

Note that the difference from Extragradient (EG) is that in EG, the update for involves the gradient at whereas here we use the gradient at instead. We will use this form of the Stochastic OGDA updates for our analysis. From the analysis of Theorem 5 in Hsieh et al. (2019), we have the following result for Stochastic OGDA:

Theorem 5.1. (Hsieh et al. (2019)) Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let be the iterates generated by Stochastic OGDA with . Then, for any , we have

This is similar to Theorem 3.2 for GDA. However, we can see that for OGDA, the range of permissible stepsizes goes all the way up to O(1/L) whereas for GDA, the stepsizes are upper bounded by .

The result in Theorem 5.1 shows that for OGDA with constant stepsize , the iterates converge to an neighborhood of the saddle point. Next, we analyze a multistage version of OGDA (M-OGDA) similar to the analysis of M-GDA in Section 4. We show that the iterates of M-OGDA converge to the unique saddle point at a rate where the variance decays as , which is optimal (and also achieved by M-GDA), but the bias term decays as , which "accelerates" GDA in terms of its dependence on . More formally, we state the following theorem which is analogous to Theorem 4.1 for M-GDA and present the convergence rate of M-OGDA (we omit the proof as it is very similar to that of Theorem 4.1):

Theorem 5.2. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let be the iterates generated by M-OGDA (Algorithm 2) with the following pa- rameters

Similar to the discussion in Section 4, we next state how our result leads to bounds on distance of each iterate to the saddle point of Problem 1. In addition, we propose proper choice of parameters in general as well as in the case that the iteration budget is known in advance, the results corresponding to Corollaries 4.2 and 4.4 for M-GDA. Before stating this corollary, we define to be the sequence which is obtained by concatenating the sequences, i.e.,

Corollary 5.3. Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let be the iterates generated by M-OGDA (Algorithm 2) with the parameters given in Theorem 4.1. Also, recall the definition of the concatenated sequence from (35). Then, for any , we have

Also, when the number of iterations n is known in advance, choosing with , implies

Once again, we would like to highlight that the results in Corollary 5.3 are presented in terms of the iterate which is obtained by concatenating the iterates of all stages, including inner iterations (as given in (35)). As a result, in comparing our results to other methods in Table 1, we take into account the inner stage iterations.

6 Conclusion

In this paper, we propose multistage versions of Gradient Descent Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA) algorithms to solve the stochastic minimax problems. In particular, these algorithms are the first to achieve linear rate in bias and optimal rate in variance, simultaneously. We also show that Multistage OGDA improves the bias rate of Multistage GDA from to which is the best known rate in deterministic minimax optimization.

References

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 214–223.

Aybat, N. S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar, A. (2019). A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022.

Aybat, S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar, A. (2018). Robust accelerated gradient methods for smooth strongly convex functions. arXiv preprint arXiv:1805.10579.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate o(1/n). In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 26, pages 773– 781. Curran Associates, Inc.

Bot, R. I., Mertikopoulos, P., Staudigl, M., and Vuong, P. T. (2019). Forward-backward-forward methods with variance reduction for stochastic variational inequalities. arXiv preprint arXiv:1902.03355.

Chavdarova, T., Gidel, G., Fleuret, F., and Lacoste-Julien, S. (2019). Reducing noise in gan training with variance reduced extragradient. arXiv preprint arXiv:1904.08598.

Chen, Y., Lan, G., and Ouyang, Y. (2014). Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. (2018). Training gans with optimism. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Du, S. S. and Hu, W. (2019). Linear convergence of the primal-dual gradient method for convex- concave saddle point problems without strong convexity. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 196–205.

Ghadimi, S. and Lan, G. (2012). Optimal stochastic approximation algorithms for strongly con- vex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492.

Ghadimi, S. and Lan, G. (2013). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089.

Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-Julien, S. (2019). A variational inequality perspective on generative adversarial networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.

Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. (2019). On the convergence of single- call stochastic extra-gradient methods. arXiv preprint arXiv:1908.08465.

Hu, B. and Lessard, L. (2017). Dissipativity theory for Nesterov’s accelerated method. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1549–1557, International Convention Centre, Sydney, Australia. PMLR.

Ibrahim, A., Azizian, W., Gidel, G., and Mitliagkas, I. (2019). Lower bounds and conditioning of differentiable games. arXiv preprint arXiv:1906.07300.

Iusem, A. N., Jofré, A., Oliveira, R. I., and Thompson, P. (2017). Extragradient method with vari- ance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 27(2):686– 724.

Juditsky, A., Nemirovski, A., and Tauvel, C. (2011). Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58.

Korpelevich, G. (1976). The extragradient method for finding saddle points and other problems. Matecon, 12:747–756.

Lessard, L., Recht, B., and Packard, A. (2016). Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95.

Liang, T. and Stokes, J. (2019). Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 907–915.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. (2018). Mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. (2019a). Proximal point approximations achieving a convergence rate of O(1/k) for smooth convex-concave saddle point problems: Optimistic gradient and extra-gradient methods. arXiv preprint arXiv:1906.01115.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. (2019b). A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. Wiley.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer.

Palaniappan, B. and Bach, F. R. (2016). Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1408–1416.

Raginsky, M. and Rakhlin, A. (2011). Information-based complexity, feedback and dynamics in convex programming. IEEE Transactions on Information Theory, 57(10):7036–7056.

Rosasco, L., Villa, S., and V˜u, B. C. (2014). A stochastic forward-backward splitting method for solving monotone inclusions in hilbert spaces. arXiv preprint arXiv:1403.7999.

Xie, L., Lin, K., Wang, S., Wang, F., and Zhou, J. (2018). Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739.

A Proof of Lemma 2.5

B Proof of Lemma 2.6

C Proof of Theorem 3.2

First, note that for ραµ and P with p = 1/α, we have

where the last inequality follows from the fact that . This result implies

Substituting left and right hand side by using Corollary 2.7 and Lemma 3.1, respectively, yields

Finally, dividing both sides by p completes the proof (16). To show the second result, note that by using (16) sequentially, we have

which completes the proof.

D Proof of Example 3.4

We have the function:

The gradient at step k is corrupted by noise and which we assume to be iid . The GDA method when applied to this problems leads to:

This gives:

which proves the first part of the lemma. Note that when the gradients are not corrupted by noise (i.e. when , we have)

The coefficient on the right side is minimized for . Substituting this in equation (52), we get:

Now, making the substitution , we have:

Therefore, for any other stepsize , we have:

E Proof of Corollary 4.2

designed for accessibility and to further open science