In this paper, we consider stochastic algorithms for solving the following min-max saddle-point problem with a general objective function f without smoothness or any other special structure:
where are closed convex sets and
is continuous. It is of great interest to find a saddle-point solution to the above problem, which is defined as
a number of applications in machine learning, including distributionally robust optimization (DRO) Namkoong and Duchi (2017, 2016), learning with non-decomposable loss functions Liu et al. (2018); Fan et al. (2017); Ying et al. (2016); Liu et al. (2019), and generative adversarial networks Goodfellow et al. (2014); Arjovsky et al. (2017).
In this work, we focus on two classes of the min-max problems: (i) strongly-convex strongly-concave (SCSC) problem where f is strongly convex in terms of and is strongly concave in terms of
; (ii) weakly-convex strongly-concave (WCSC) problem, where there exists
is strongly convex in terms of
and is strongly concave in terms of
. Both classes have applications in machine learning Yan et al. (2019); Rafique et al. (2018).
Although stochastic algorithms for convex-concave min-max problems have been studied extensively in the literature, their research is still far behind its counterpart for stochastic convex minimization problems. Below, we highlight some of these gaps to motivate the present work. For the sake of presentation, we first introduce some terminologies. The duality gap at (x, y) is defined as Gapand
. If we denote by
is the primal objective gap, where
When f is convex in x and concave in y, many studies have designed and analyzed stochastic primal-dual algorithms for solving the min-max problems under different conditions of the problem (see references in next section). A standard result is provided by Nemirovski et al. (2009), which proves that primal-dual SGD suffers from a convergence rate of for the duality gap without imposing any additional assumptions about the objective function. This is analogous to that for stochastic convex minimization Nemirovski et al. (2009). However, the research of stochastic algorithms for SCSC problems lacks behind that for strongly convex minimization problems. A well-known result for stochastic strongly convex minimization is given by Hazan and Kale (2011), which presents the first fast convergence rate O(1/T) for stochastic strongly convex minimization by the Epoch-GD algorithm, which runs standard SGD in an epoch-wise manner by decreasing the step size geometrically. However, a fast rate of O(1/T) for the duality gap of a stochastic algorithm is still unknown for general SCSC problems. We notice that there are extensive studies about stochastic algorithms with faster convergence rates than
for solving convex-concave min-max problems Zhang and Xiao (2017); Tan et al. (2018); Shalev-Shwartz and Zhang (2013); Du and Hu (2018); Dang and Lan (2014); Chen et al. (2014); Palaniappan and Bach (2016); Hsieh et al. (2019); Yan et al. (2019); Hien et al. (2017); Zhao (2019). However, these works usually require additional assumptions about the objective function (e.g., smoothness, bilinear structure) or only prove the convergence in weaker measures (e.g., the primal objective gap, the distance of a solution to the saddle point).
We aim to bridge this gap by presenting the first optimal rate O(1/T) of the duality gap for solving general SCSC problems. In particular, we propose an epoch-wise stochastic gradient descent ascent (Epoch-GDA) algorithm - a primal-dual variant of Epoch-GD that runs stochastic gradient descent update for the primal variable and stochastic gradient ascent update for the dual variable for solving (1). Although the algorithmic generalization is straightforward, the proof of convergence in terms of the duality gap for Epoch-GDA is not straightforward at all. We note that the key difference in the analysis of Epoch-GDA is that to upper bound the duality gap of a solution we need to deal with the distance of an initial solution
to the reference solutions
and
, respectively. In contrast, in the analysis of the objective gap for Epoch-GD, one only needs to deal with the distance from an initial solution
to the optimal solution
, which by strong convexity can easily connects to the objective gap
, leading to the telescoping sum on the objective gap. Towards addressing the challenge caused by dealing with the duality gap, we present a key lemma that connects the distance measure
to the duality gap of
. In addition, since we use the same technique as Epoch-GD for handling the variance of stochastic gradient by projecting onto a bounded ball with shrinking radius, we have to carefully prove that such restriction does not affect the duality gap for the original problem, which also needs to deal with bounding
Moreover, we notice that the aforementioned key lemma and the telescoping technique based on the duality gap can also be used for proving the convergence of Epoch-GDA for finding an approximate stationary solution of general WCSC problems. The algorithmic framework is similar to that proposed by Rafique et al. (2018), i.e., by solving SCSC problems successively, but with a subtle difference in handling the dual variable. In particular, we do not need additional condition on the structure of the objective function and extra care for dealing with the dual variable for restart as done in Rafique et al. (2018). This key difference is caused by our sharper analysis, i.e., we use the telescoping sum based on the duality gap instead of the primal objective gap as in Rafique et al. (2018). As a result, our algorithm and analysis lead to a nearly optimal complexity for solving WCSC problems without the smoothness assumption on the objective Arjevani et al. (2019) . Finally, we summarize our results and the comparison with existing results in Table 1.
Below, we provide an overview of related results in this area and the review is not necessarily exhaustive. In addition, we focus on the stochastic algorithms, and leave deterministic algorithms Chambolle and Pock (2011); Nesterov (2005); Yang et al. (2015); Gidel et al. (2016); Nouiehed et al. (2019); Hong (2016); Hajinezhad and Hong (2019); Hong et al. (2018); Lu et al. (2019); Hamedani and Aybat (2018) out of our discussion.
Nemirovski et al. (2009) is one of the early works that studies stochastic primal-dual gradient methods for convex-concave min-max problems, which establishes a convergence rate of for the duality gap of general convex-concave problems. Following this work, many studies have tried to improve the algorithm and the analysis for a certain class of problems by exploring the smoothness condition of some component functions Juditsky et al. (2011); Zhao (2019); Hsieh et al. (2019) or bilinear structure of the objective function Chen et al. (2014); Dang and Lan (2014). For example, Zhao (2019) considers a family of min-max problems whose objective is
, where the smoothness condition
Table 1: Summary of complexity results of this work and previous works for finding an -duality-gap solution for SCSC or an
-stationary solution for WCSC min-max problems. We focus on comparison of existing results without assuming smoothness of the objective function. Restriction means whether an additional condition about the objective function’s structure is imposed.
is imposed on and strong convexity is imposed on f if necessary, and establishes optimal or nearly optimal complexity of a stochastic primal-dual hybrid algorithm. Although the dependence on each problem parameter of interest is made (nearly) optimal, the worst case complexity is still
Hsieh et al. (2019) considers single-call stochastic extra-gradient and establishes O(1/T) rate for smooth and strongly monotone variational inequalities in terms of the square distance from the returned solution to the saddle point. The present work is complementary to these developments by making no assumption on smoothness or the structure of the objective but considers strong (weak) convexity and strong concavity of the objective function. It has applications in robust learning with non-smooth loss functions Yan et al. (2019); Rafique et al. (2018).
In the machine learning community, many works have considered stochastic primal-dual algorithms for solving regularized loss minimization problems, whose min-max formulation usually exhibits bi-linear structure Zhang and Xiao (2017); Shalev-Shwartz and Zhang (2013); Wang and Xiao (2017); Du and Hu (2018); Palaniappan and Bach (2016). For example, Zhang and Xiao (2017) designs a stochastic primal-dual coordinate (SPDC) method for SCSC problems with bilinear structure, which enjoys a linear convergence for the duality gap. Similarly, in Yu et al. (2015); Tan et al. (2018), different variants of SPDC are proposed and analyzed for problems with the bilinear structure. Palaniappan and Bach (2016) proposes stochastic variance reduction methods for a family of saddle-point problems with special structure that yields a linear convergence rate. An exception that makes no smoothness assumption and imposes no bilinear structure is a recent work Yan et al. (2019). It considers a family of functions and proposes a stochastic primal-dual algorithm similar to Epoch-GDA. The key difference is that Yan et al. (2019) designs a particular scheme that computes a restarting dual solution based on
where
is a restarting primal solution in order to derive a fast rate of O(1/T) under strong convexity and strong concavity. Additionally, their fast rate O(1/T) is in terms of the primal objective gap, which is weaker than our convergence result in terms of the duality gap.
There is also increasing interest in stochastic primal-dual algorithms for solving WCSC min-max problems. To the best of our knowledge, Rafique et al. (2018) is probably the first work that comprehensively studies stochastic algorithms for solving WCSC min-max problems. To find a nearly -stationary point, their algorithms suffer from an
iteration complexity without strong concavity and an
complexity with strong concavity and a special structure of the objective function that is similar to that imposed in Yan et al. (2019). Some recent works are trying to improve the complexity for solving WCSC min-max problems by exploring other conditions (e.g., smoothness) Lin et al. (2019); Luo Luo (2020). For example, Lin et al. (2019) establishes an
complexity for a single-loop stochastic gradient descent ascent method. However, their analysis requires the smoothness condition and their algorithm needs to use a large mini-batch size in the order
. In contrast, we impose neither assumption about smoothness nor special structure of the objective function. The complexity of our algorithm is
for finding a nearly
-stationary point, which is the state of the art result for the considered non-smooth WCSC problem.
This section provides some notations and assumptions used in the paper. We let denote the Euclidean norm of a vector. Given a function
, we denote the Fréchet subgradients and limiting Fréchet gradients by
, respectively, i.e., at x,
represents
function
-strongly convex on X if for any
A function
-weakly convex on
denote a stochastic subgradient of f at
is used to denote the random variable. Similarly, let
denote a stochastic sugradient of
denote the projection onto the set
denote an Euclidean ball centered at x with a radius R. Denote by dist(x, X) the distance between x and the set
hide some logarithmic factors.
For a WCSC min-max problem, it is generally a hard problem to find a saddle point. Hence, we use as the measure of convergence for solving WCSC problems Rafique et al. (2018), which is defined as follows.
Definition 1 A solution x is a nearly -stationary point of
if there exist z and a constant
For a -weakly convex function
is a reference point. Due to the strong convexity of the above problem, z is unique and
, which results in
We can find a nearly
-stationary point
as long as
Before ending this section, we present some assumptions that will be imposed in our analysis.
Assumption 1 X and Y are closed convex sets. There exist initial solutions and
such that Gap
-strongly convex in
-strongly concave in
. (2) There exist
-weakly convex in
-strongly concave in
Remark: When f(x, y) is smooth in x and y, the second condition in the above assumption can be replaced by the bounded variance condition.
4.1. Strongly-Convex Strongly-Concave Min-Max Problems
In this subsection, we present the main result for solving SCSC problems. The proposed Epoch-GDA algorithm for SCSC min-max problems is shown in Algorithm 1. As illustrated, our algorithm consists of a series of epochs. In each epoch (Line 3 to 7), standard primal-dual updates are performed. After an epoch ends, in Line 8, the solutions over the epoch are returned as the initialization for the next epoch. In Line 9, step sizes
, the radius
and the number of iterations
are also adjusted for the next epoch. The ball constraints
at each iteration are used for the convergence analysis in high probability as in Hazan and Kale (2011, 2014). It is clear that Epoch-GDA can be considered as a primal-dual variant of Epoch-GD Hazan and Kale (2011, 2014).
The following theorem shows that the iteration complexity of Algorithm 1 to achieve an -duality gap for a general SCSC problem (1) is
Theorem 1 Suppose Assumption 1 and Assumption 2 hold and let be a failing probability and
be the target accuracy level for the duality gap. Let
Then the total number of iterations of Algorithm 1 to achieve an -duality gap, i.e., Gap
, with probability
Remark 1 To the best of our knowledge, this is the first study that achieves a fast rate of O(1/T) for the duality gap of a general SCSC min-max problem without any special structure assumption or smoothness of the objective function and an additional computational cost. In contrast, even if the algorithm in Yan et al. (2019) attains the O(1/T) rate of convergence, it i) only guarantees the convergence of the primal objective gap, rather than the duality gap, ii) additionally requires a special structure of the objective function, and iii) needs an extra O(n) computational cost of the deterministic update at each outer loop to handle the maximization over y. In contrast, Algorithm 1 has stronger theoretical results with less restrictions of the problem structures and computational cost.
Remark 2 A lower bound of O(1/T) for stochastic strongly convex minimization problems has been proven in Agarwal et al. (2009); Hazan and Kale (2014). Due to Gap, bounding the duality gap is more difficult than bounding the primal gap. This means that our convergence rate matches the lower bound and is therefore the best possible convergence rate without adding more assumptions.
4.2. Weakly-Convex Strongly-Concave Problems
In this subsection, we present the convergence results for solving WCSC problems, where the objective function -weakly convex in
-strongly concave in y. The proposed Epoch-GDA algorithm for WCSC min-max problems is summarized in Algorithm 2. As our Algorithm 1, Algorithm 2 consists of a number of epochs. As shown in Line 4 to Line 8, each epoch performs primal-dual updates on x and y. When updating x at the
k-th stage, an additional regularizer is added, where the value
added term is used to handle the weak convexity condition. After an epoch ends, average solutions of both x and y are restarted as the initial ones for the next epoch. The step sizes for updating x and y are set to
-th epoch, respectively. If we define
, we can see that
-strongly convex in x and
-strongly concave in
-weakly convex and
. Indeed, for each inner loop of Algorithm 2, we actually work on the SCSC problem
It is worth mentioning the key difference between our algorithm and the recently proposed stochastic algorithm PG-SMD Rafique et al. (2018) for WCSC problems with a special structural objective function. PG-SMD also consists of two loops. For each inner loop, it runs the same updates with the added regularizer on x as Algorithm 2. It restarts x by averaging the solutions over the inner loop, like our
, but restarts y by taking the deterministic maximization of (1) over
, leading to an additional O(n) computational complexity per epoch. In addition, PG-SMD sets
. Although Algorithm 2 shares similar updates to PG-SMD, our analysis yields stronger results under weaker assumptions — the same iteration complexity
without deterministic updates for y and special structure in the objective function. This is due to our sharper analysis that makes use of the telescoping sum based on the duality gap of
instead of the primal objective gap. The convergence result of Algorithm 2 that achieves a nearly
-stationary point with
iteration complexity is summarized below.
where is randomly sampled from
is the saddle-point of
total number of iteration is
Remark 3 Theorem 2 shows that the iteration complexity of Algorithm 2 to attain an nearly stationary point is
. It improves the result of Rafique et al. (2018) for WCSC problems in terms of two aspects. First, Rafique et al. (2018) requires a stronger condition on the structure of the objective function, while our analysis simply assumes a general objective function f(x, y). Second, Rafique et al. (2018) requires to solve the maximization over y at each epoch, which may introduce an O(n) computational complexity for
contrast, our algorithm restarts both the primal variable x and dual variable y at each epoch, which does not need an additional cost.
Finally, we note that when f(x, y) is smooth in x and y, we can use stochastic Mirror Prox algorithm Juditsky et al. (2011) to replace the stochastic gradient descent ascent updates (Step 6 and Step 7) such that we can use a bounded variance assumption of the stochastic gradients instead of bounded second-order moments. It is a simple exercise to finish the proof by following our analysis of Theorem 2.
In this section, we present the proof of Theorem 1 and a proof sketch of Theorem 2. As we mentioned at the introduction, the key challenge in the analysis of Epoch-GDA lies in handling the variable distance measure and its connection to the duality gaps, where
and
. Hence, we first introduce the following key lemma that is useful in the analysis of Epoch-GDA for both SCSC and WCSC problems. It connects the variable distance measure
to the duality gaps at
and
Lemma 1 Consider the following -strongly convex in
-strongly concave problem
denote the saddle point solution to this problem. Suppose we have two solutions
. Then the following
5.1. Proof of Theorem 1 for the SCSC setting
The key idea is to first show the convergence of the duality gap with respect to the ball constraints in an epoch (Lemma 2). Then we investigate the condition to make
given the average solution
, which allows us to derive the duality gap Gap
for the original problem. Finally, under such conditions, we show how the duality gap between two consecutive outer loops can be halved (Theorem 3), which implies the total iteration complexity (Theorem 1). Below, we omit superscript k when it applies to all epochs.
Lemma 2 Suppose Assumption 2 holds. Let Line 3 to 7 of Algorithm 1 run for T iterations (omitting the k-index) by fixed step sizes . Then with the probability at least
where
Remark 4 Lemma 2 is a standard analysis for an epoch of Algorithm 1. The difficulty arises when attempting to plug x and y into (3). In order to derive the duality gap on the LHS of (3), we have to plug in . Nevertheless, it is unclear whether
, which is the requirement for x and y to be plugged into (3). In the following lemma, we investigate the condition to make
based on Lemma 1.
Remark 5 Lemma 3 shows that if we properly set the values of and
are the interior points of
with high probability. Therefore, we conclude that
with probability
under the conditions of Lemma 3, which allows us to derive the duality gap in LHS of (3) of Lemma 2.
The following theorem gives the relation of duality gaps between two consecutive epochs of Algorithm 1 by using Lemma 2 and the conditions proven by Lemma 3.
Theorem 3 Consider the k-th epoch of Algorithm 1 with an initial solution the ending averaged solution
. Suppose Assumption 2 holds and Gap
Let
Remark 6 Theorem 3 shows that after running iterations at the k-th stage, the upper bound of the duality gap would be halved with high probability, i.e., from
. Then, in order to make the duality gap of each outer loop of Algorithm 1 halved from the last epoch, we can simply set
, and accordingly,
and
Proof (of Theorem 3) For any and
, we have with probability
where inequality . Inequality (b) is due to the values of
. Recall the definitions
probability
. Then from (4) we have
Given the condition Gapwe then conclude that running
iterations in an epoch of Algorithm 1 would halve the duality gap with high probability. As indicated in Theorem 3, the duality gap Gap
can be halved as long as the condition of Theorem 3 holds. Then Theorem 1 is implied (the detailed proof is in Supplementary Materials).
5.2. Proof Sketch of Theorem 2 for the WCSC setting
Due to limit of space, we only present a sketch here and present the full proof in the Supplement. Recall Let us denote its duality gap by
Gap, where we define
given
. Its saddle point solution is denoted by
key idea of our analysis is to connect the duality gap
then by making
, we can show that
is a nearly
-stationary point. To this end we first establish a bound of the duality gap for the regularized problem
for the k-th epoch (Lemma 4). Then we connect it to
). Finally, we bound
by a telescoping sum of
In this paper, we filled the gaps between stochastic min-max and minimization optimization problems. We proposed Epoch-GDA algorithms for general SCSC and general WCSC problems, which do not impose any additional assumptions on the smoothness or the structure of the objective function. Our key lemma provides sharp analysis of Epoch-GDA for both problems. For SCSC min-max problems, to the best of our knowledge, our result is the first one to show that Epoch-GDA achieves the optimal rate of O(1/T) for the duality gap of general SCSC min-max problems. For WCSC min-max problems, our analysis allows us to derive the best complexity of Epoch-GDA to reach a nearly
-stationary point, which does not require smoothness, large mini-batch sizes or other structural conditions.
Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 214–223, International Convention Centre, Sydney, Australia, 2017.
Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1): 120–145, May 2011. ISSN 0924-9907. doi: 10.1007/s10851-010-0251-1. URL http://dx.doi.org/10.1007/s10851-010-0251-1.
Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014.
Cong Dang and Guanghui Lan. Randomized first-order methods for saddle point optimiza- tion. arXiv preprint arXiv:1409.8625, 2014.
Simon S Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex- concave saddle point problems without strong convexity. arXiv preprint arXiv:1802.01504, 2018.
Yanbo Fan, Siwei Lyu, Yiming Ying, and Baogang Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems 30, pages 497–505. 2017.
Gauthier Gidel, Tony Jebara, and Simon Lacoste-Julien. Frank-wolfe algorithms for saddle point problems. arXiv preprint arXiv:1610.07797, 2016.
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sher- jil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969125.
Davood Hajinezhad and Mingyi Hong. Perturbed proximal primal–dual algorithm for non- convex nonsmooth optimization. Mathematical Programming, 176(1-2):207–245, 2019.
Erfan Yazdandoost Hamedani and Necdet Serhat Aybat. A primal-dual algorithm for general convex-concave saddle point problems. arXiv preprint arXiv:1803.01401, 2018.
Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pages 421–436, 2011.
Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
Le Thi Khanh Hien, Renbo Zhao, and William B Haskell. An inexact primal-dual smoothing framework for large-scale non-bilinear saddle point problems. arXiv preprint arXiv:1711.03669, 2017.
Mingyi Hong. Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications. arXiv preprint arXiv:1604.00543, 2016.
Mingyi Hong, Jason D Lee, and Meisam Razaviyayn. Gradient primal-dual algorithm con- verges to second-order stationary solutions for nonconvex distributed optimization. arXiv preprint arXiv:1802.08941, 2018.
Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the con- vergence of single-call stochastic extra-gradient methods. arXiv preprint arXiv:1908.08465, 2019.
Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
Guanghui Lan, Arkadi Nemirovski, and Alexander Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical programming, 134(2):425–458, 2012.
Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvexconcave minimax problems. CoRR, abs/1906.00331, 2019.
Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, and Tianbao Yang. Fast stochastic auc maximization with o(1/n)-convergence rate. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
Mingrui Liu, Zhuoning Yuan, Yiming Ying, and Tianbao Yang. Stochastic auc maximization with deep neural networks. arXiv preprint arXiv:1908.10831, 2019.
Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. arXiv preprint arXiv:1902.08294, 2019.
Tong Zhang Luo Luo, Haishan Ye. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. CoRR, abs/2001.03724, 2020.
Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems (NIPS), pages 2208–2216, 2016.
Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objec- tives. In Advances in Neural Information Processing Systems (NIPS), pages 2975–2984, 2017.
Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009. URL http://dx.doi.org/10.1137/070704277.
Yu Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16:235–249, 01 2005. doi: 10.1137/S1052623403422285.
Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.
Balamurugan Palaniappan and Francis R. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems (NIPS), pages 1408–1416, 2016.
Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. CoRR, abs/1810.02060, 2018.
Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. Journal of Machine Learning Research (JMLR), 2013.
Conghui Tan, Tong Zhang, Shiqian Ma, and Ji Liu. Stochastic primal-dual method for empirical risk minimization with o (1) per-iteration complexity. In Advances in Neural Information Processing Systems, pages 8366–8375, 2018.
Jialei Wang and Lin Xiao. Exploiting strong convexity from data with primal-dual first-order algorithms. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3694–3702. JMLR. org, 2017.
Yan Yan, Yi Xu, Qihang Lin, Lijun Zhang, and Tianbao Yang. Stochastic primal-dual algorithms with faster convergence than for problems without bilinear structure. arXiv preprint arXiv:1904.10112, 2019.
Tianbao Yang, Mehrdad Mahdavi, Rong Jin, and Shenghuo Zhu. An efficient primal dual prox method for non-smooth optimization. Machine Learning, 98(3):369–406, 2015.
Yiming Ying, Longyin Wen, and Siwei Lyu. Stochastic online auc maximization. In Advances in neural information processing systems, pages 451–459, 2016.
Adams Wei Yu, Qihang Lin, and Tianbao Yang. Doubly stochastic primal-dual coordinate method for regularized empirical risk minimization with factorized data. CoRR, abs/1508.03390, 2015. URL http://arxiv.org/abs/1508.03390.
Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. The Journal of Machine Learning Research, 18(1):2939–2980, 2017.
Renbo Zhao. Optimal algorithms for stochastic three-composite convex-concave saddle point problems. arXiv preprint arXiv:1903.01687, 2019.
Recall Let us denote its duality gap by
, where we define
. Its saddle point solution is denoted by
i.e.,
. The key idea of our analysis is to connect the duality gap
, and then by making
, we can show that
is a nearly
-stationary point. To this end we first establish a bound of the duality gap for the regularized problem
(Lemma 4). Then we connect it to
). Finally, we bound
by a telescoping sum of
Lemma 4 Suppose Assumption 3 holds and of Algorithm 2 guarantee
the RHS of (5), we could have the following upper bound by the key lemma (Lemma 1) 3
On the other hand, the following lemma lower bounds LHS of (5) to construct telescoping sums.
Lemma 5 lower bounds in LHS of (5) by three parts. The first part constructs telescoping sum of
together with (6). The second part itself is an element of telescoping sums over the primal gap. The third part
can be used as the measure of nearly
-stationary point, which is further explored in Theorem 2.
Proof (of Theorem 2) Consider the k-th stage. Let us start from (5) in Lemma 4 as follows
where (a) is due to settings . Re-organizing the above inequality, we have
Summing from k = 1 to k = K, we have
where the last inequality is due to the upper bounds the three terms A, B and C as follows. For the term A, we have
where the second equality is due to the setting of
inequality is due to For the term B, we have
For the term C, we have
where the first inequality is due to . By plugging the above upper bounds of the three terms A, B and C into (11), we have (12). Then by randomly sampling
which leads to . To compute the total number of iterations, we have
where inequality -strong convexity of
(with optimality at
-strong convexity of
(with optimality at
equality
. Inequality
where inequality -strong concavity of
(optimality at
(optimality at
). Inequality
. Inequality
Proof Before the proof, we first present the following two lemmas as follows.
be independent random variables and
for any
. Then we have with probability at least
Lemma 7 (Lemma 2 of Lan et al. (2012)) Let be a martingale difference sequence, i.e.,
. Suppose that for some values
have
. Then with probability at least
For simplicity of presentation, we use the notations . To prove Lemma 2, we would leverage the following
two update approaches:
where . The first two updates are identical to Line 4 and Line 5 in Algorithm 1. This can be verified easily. Take the first one as example:
Let , which includes the four update approaches in (15) as special cases. By using the strong convexity of
and the first order optimality condition (
which implies
Then
Applying the above result to the updates in (15), we have for any
Adding the above four inequalities together, we have
where inequality (a) above is due to the convexity of and concavity of
in y. Inequality
Then we combine the LHS and RHS by summing up
In the following, we show how to bound the above A to E terms. To bound the above term A in (19), we apply Lemma 6 as follows, which holds with probability
Similarly, term C in (19) can be bounded with probability as follows
To bound term B of (19), which contains only the full subgradients
where the first inequality is due to Jensen’s inequality and the second inequality is due to
Therefore, we have
Similarly, for term D in (19), we have
To bound term for
, which are Martingale difference sequences. We thus would like to use Lemma 7 to handle these terms. To this end, we can first upper bound
follows
Then the above two inequalities implies that
where inequality , inequality (b) is due to the concavity of
and Jensen’s inequality. Inequality (c) is due to the assumption. In a similar way, we have
Next, applying Lemma 7 with (24) and (25), we have with probability at least
For LHS of (19), by Jensen’s inequality, we have
where
Suppose . By plugging (27), (20), (21), (22), (23) and (26) back into (19), with probability at least
Proof First, we start from
where the first inequality is due to the assumption. Markov inequality indicates that for a random variable X, which, by additionally introducing
Therefore, with probability at least
where inequality -weakly convexity of f in x. Inequality (b) is due to Young’s inequality, i.e.,
Similarly, due to the -strong convexity of
and the optimality condition of
we have
where inequality (a) is due to concavity of f in y. Inequality (b) is due to Young’s inequality,
Now we do not take expectation, since we aim to eliminate the randomness of x and y in , respectively. To achieve this, we use the following updates
Summing the above two inequalities, we have
Summing the above inequality from and using Jensen’s inequality, we have
Taking expectation over both sides and recalling that
Proof Before we prove the three results, we first state two results of Young’s inequality as follows
where inequality (a) is due to (39) (). Inequality
-strong convexity of
and optimality at
. Inequality
Re-organizing the above inequality, we have
where inequality (a) is due to (41) and . Inequality
-storng convexity of
. Inequality