b

DiscoverSearch
About
My stuff
Sharp Analysis of Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization
2020·arXiv
Abstract
Abstract

Epoch gradient descent method (a.k.a. Epoch-GD) proposed by Hazan and Kale (2011) was deemed a breakthrough for stochastic strongly convex minimization, which achieves the optimal convergence rate of O(1/T ) with T iterative updates for the objective gap. However, its extension to solving stochastic min-max problems with strong convexity and strong concavity still remains open, and it is still unclear whether a fast rate of O(1/T ) for the duality gap is achievable for stochastic min-max optimization under strong convexity and strong concavity. Although some recent studies have proposed stochastic algorithms with fast convergence rates for min-max problems, they require additional assumptions about the problem, e.g., smoothness, bi-linear structure, etc. In this paper, we bridge this gap by providing a sharp analysis of epoch-wise stochastic gradient descent ascent method (referred to as Epoch-GDA) for solving strongly convex strongly concave (SCSC) min-max problems, without imposing any additional assumption about smoothness or the function’s structure. To the best of our knowledge, our result is the first one that shows EpochGDA can achieve the optimal rate of O(1/T ) for the duality gap of general SCSC min-max problems. We emphasize that such generalization of Epoch-GD for strongly convex minimization problems to Epoch-GDA for SCSC min-max problems is non-trivial and requires novel technical analysis. Moreover, we notice that the key lemma can also be used for proving the convergence of Epoch-GDA for weakly-convex strongly-concave min-max problems, leading to a nearly optimal complexity without resorting to smoothness or other structural conditions.

In this paper, we consider stochastic algorithms for solving the following min-max saddle-point problem with a general objective function f without smoothness or any other special structure:

image

where  X ⊆ Rd and Y ⊆ Rn are closed convex sets and  f : X × Y → Ris continuous. It is of great interest to find a saddle-point solution to the above problem, which is defined as  (x∗, y∗) such thatf(x∗, y) ≤ f(x∗, y∗) ≤ f(x, y∗), ∀x ∈ X, y ∈ Y. Problem (1) coversa number of applications in machine learning, including distributionally robust optimization (DRO) Namkoong and Duchi (2017, 2016), learning with non-decomposable loss functions Liu et al. (2018); Fan et al. (2017); Ying et al. (2016); Liu et al. (2019), and generative adversarial networks Goodfellow et al. (2014); Arjovsky et al. (2017).

In this work, we focus on two classes of the min-max problems: (i) strongly-convex strongly-concave (SCSC) problem where f is strongly convex in terms of  x for any y ∈ Yand is strongly concave in terms of  y for any x ∈ X; (ii) weakly-convex strongly-concave (WCSC) problem, where there exists  ρ > 0 such that f(x, y) + ρ2∥x∥2is strongly convex in terms of  x for any y ∈ Yand is strongly concave in terms of  y for any x ∈ X. Both classes have applications in machine learning Yan et al. (2019); Rafique et al. (2018).

Although stochastic algorithms for convex-concave min-max problems have been studied extensively in the literature, their research is still far behind its counterpart for stochastic convex minimization problems. Below, we highlight some of these gaps to motivate the present work. For the sake of presentation, we first introduce some terminologies. The duality gap at (x, y) is defined as Gap(x, y) := f(x, ˆy(x))−f(ˆx(y), y), where ˆx(y) := arg minx′∈X f(x′, y)and  ˆy(x) := arg maxy′∈Y f(x, y′). If we denote by  P(x) := maxy′∈Y f(x, y′), then P(x) −P(x∗)is the primal objective gap, where  x∗ = arg minx∈X P(x).

When f is convex in x and concave in y, many studies have designed and analyzed stochastic primal-dual algorithms for solving the min-max problems under different conditions of the problem (see references in next section). A standard result is provided by Nemirovski et al. (2009), which proves that primal-dual SGD suffers from a convergence rate of  O(1/√T)for the duality gap without imposing any additional assumptions about the objective function. This is analogous to that for stochastic convex minimization Nemirovski et al. (2009). However, the research of stochastic algorithms for SCSC problems lacks behind that for strongly convex minimization problems. A well-known result for stochastic strongly convex minimization is given by Hazan and Kale (2011), which presents the first fast convergence rate O(1/T) for stochastic strongly convex minimization by the Epoch-GD algorithm, which runs standard SGD in an epoch-wise manner by decreasing the step size geometrically. However, a fast rate of O(1/T) for the duality gap of a stochastic algorithm is still unknown for general SCSC problems. We notice that there are extensive studies about stochastic algorithms with faster convergence rates than  O(1/√T)for solving convex-concave min-max problems Zhang and Xiao (2017); Tan et al. (2018); Shalev-Shwartz and Zhang (2013); Du and Hu (2018); Dang and Lan (2014); Chen et al. (2014); Palaniappan and Bach (2016); Hsieh et al. (2019); Yan et al. (2019); Hien et al. (2017); Zhao (2019). However, these works usually require additional assumptions about the objective function (e.g., smoothness, bilinear structure) or only prove the convergence in weaker measures (e.g., the primal objective gap, the distance of a solution to the saddle point).

We aim to bridge this gap by presenting the first optimal rate O(1/T) of the duality gap for solving general SCSC problems. In particular, we propose an epoch-wise stochastic gradient descent ascent (Epoch-GDA) algorithm - a primal-dual variant of Epoch-GD that runs stochastic gradient descent update for the primal variable and stochastic gradient ascent update for the dual variable for solving (1). Although the algorithmic generalization is straightforward, the proof of convergence in terms of the duality gap for Epoch-GDA is not straightforward at all. We note that the key difference in the analysis of Epoch-GDA is that to upper bound the duality gap of a solution  (¯x, ¯y)we need to deal with the distance of an initial solution  (x0, y0)to the reference solutions  (ˆx(¯y), ˆy(¯x)), where ˆx(¯y) = arg minx′∈X f(x′, ¯y)and  ˆy(¯x) = arg maxy′∈Y f(¯x, y′) depend on ¯y and ¯x, respectively. In contrast, in the analysis of the objective gap for Epoch-GD, one only needs to deal with the distance from an initial solution  x0to the optimal solution  x∗, i.e., ∥x0 − x∗∥22, which by strong convexity can easily connects to the objective gap  P(x0)−P(x∗), leading to the telescoping sum on the objective gap. Towards addressing the challenge caused by dealing with the duality gap, we present a key lemma that connects the distance measure  ∥x0 − ˆx(¯y)∥22 + ∥y0 − ˆy(¯x)∥22 to the duality gap of  (x0, y0) and (¯x, ¯y). In addition, since we use the same technique as Epoch-GD for handling the variance of stochastic gradient by projecting onto a bounded ball with shrinking radius, we have to carefully prove that such restriction does not affect the duality gap for the original problem, which also needs to deal with bounding  ∥x0 − ˆx(¯y)∥22 and ∥y0 − ˆy(¯x)∥22.

Moreover, we notice that the aforementioned key lemma and the telescoping technique based on the duality gap can also be used for proving the convergence of Epoch-GDA for finding an approximate stationary solution of general WCSC problems. The algorithmic framework is similar to that proposed by Rafique et al. (2018), i.e., by solving SCSC problems successively, but with a subtle difference in handling the dual variable. In particular, we do not need additional condition on the structure of the objective function and extra care for dealing with the dual variable for restart as done in Rafique et al. (2018). This key difference is caused by our sharper analysis, i.e., we use the telescoping sum based on the duality gap instead of the primal objective gap as in Rafique et al. (2018). As a result, our algorithm and analysis lead to a nearly optimal complexity for solving WCSC problems without the smoothness assumption on the objective Arjevani et al. (2019) 1. Finally, we summarize our results and the comparison with existing results in Table 1.

Below, we provide an overview of related results in this area and the review is not necessarily exhaustive. In addition, we focus on the stochastic algorithms, and leave deterministic algorithms Chambolle and Pock (2011); Nesterov (2005); Yang et al. (2015); Gidel et al. (2016); Nouiehed et al. (2019); Hong (2016); Hajinezhad and Hong (2019); Hong et al. (2018); Lu et al. (2019); Hamedani and Aybat (2018) out of our discussion.

Nemirovski et al. (2009) is one of the early works that studies stochastic primal-dual gradient methods for convex-concave min-max problems, which establishes a convergence rate of O(1/√T)for the duality gap of general convex-concave problems. Following this work, many studies have tried to improve the algorithm and the analysis for a certain class of problems by exploring the smoothness condition of some component functions Juditsky et al. (2011); Zhao (2019); Hsieh et al. (2019) or bilinear structure of the objective function Chen et al. (2014); Dang and Lan (2014). For example, Zhao (2019) considers a family of min-max problems whose objective is  f(x) + g(x) + φ(x, y) − J(y), where the smoothness condition

Table 1: Summary of complexity results of this work and previous works for finding an ǫ-duality-gap solution for SCSC or an  ǫ-stationary solution for WCSC min-max problems. We focus on comparison of existing results without assuming smoothness of the objective function. Restriction means whether an additional condition about the objective function’s structure is imposed.

image

is imposed on  f and φand strong convexity is imposed on f if necessary, and establishes optimal or nearly optimal complexity of a stochastic primal-dual hybrid algorithm. Although the dependence on each problem parameter of interest is made (nearly) optimal, the worst case complexity is still  O(1/√T).Hsieh et al. (2019) considers single-call stochastic extra-gradient and establishes O(1/T) rate for smooth and strongly monotone variational inequalities in terms of the square distance from the returned solution to the saddle point. The present work is complementary to these developments by making no assumption on smoothness or the structure of the objective but considers strong (weak) convexity and strong concavity of the objective function. It has applications in robust learning with non-smooth loss functions Yan et al. (2019); Rafique et al. (2018).

In the machine learning community, many works have considered stochastic primal-dual algorithms for solving regularized loss minimization problems, whose min-max formulation usually exhibits bi-linear structure Zhang and Xiao (2017); Shalev-Shwartz and Zhang (2013); Wang and Xiao (2017); Du and Hu (2018); Palaniappan and Bach (2016). For example, Zhang and Xiao (2017) designs a stochastic primal-dual coordinate (SPDC) method for SCSC problems with bilinear structure, which enjoys a linear convergence for the duality gap. Similarly, in Yu et al. (2015); Tan et al. (2018), different variants of SPDC are proposed and analyzed for problems with the bilinear structure. Palaniappan and Bach (2016) proposes stochastic variance reduction methods for a family of saddle-point problems with special structure that yields a linear convergence rate. An exception that makes no smoothness assumption and imposes no bilinear structure is a recent work Yan et al. (2019). It considers a family of functions  f(x, y) = y⊤ℓ(x) − φ∗(y) + g(x)and proposes a stochastic primal-dual algorithm similar to Epoch-GDA. The key difference is that Yan et al. (2019) designs a particular scheme that computes a restarting dual solution based on  ∇φ(ℓ(¯x)),where  ¯xis a restarting primal solution in order to derive a fast rate of O(1/T) under strong convexity and strong concavity. Additionally, their fast rate O(1/T) is in terms of the primal objective gap, which is weaker than our convergence result in terms of the duality gap.

There is also increasing interest in stochastic primal-dual algorithms for solving WCSC min-max problems. To the best of our knowledge, Rafique et al. (2018) is probably the first work that comprehensively studies stochastic algorithms for solving WCSC min-max problems. To find a nearly  ǫ-stationary point, their algorithms suffer from an  O(1/ǫ6)iteration complexity without strong concavity and an  O(1/ǫ4 +n/ǫ2)complexity with strong concavity and a special structure of the objective function that is similar to that imposed in Yan et al. (2019). Some recent works are trying to improve the complexity for solving WCSC min-max problems by exploring other conditions (e.g., smoothness) Lin et al. (2019); Luo Luo (2020). For example, Lin et al. (2019) establishes an  O(1/ǫ4)complexity for a single-loop stochastic gradient descent ascent method. However, their analysis requires the smoothness condition and their algorithm needs to use a large mini-batch size in the order O(1/ǫ2). In contrast, we impose neither assumption about smoothness nor special structure of the objective function. The complexity of our algorithm is �O(1/ǫ4)for finding a nearly ǫ-stationary point, which is the state of the art result for the considered non-smooth WCSC problem.

This section provides some notations and assumptions used in the paper. We let  ∥ · ∥denote the Euclidean norm of a vector. Given a function  f : Rd → R, we denote the Fréchet subgradients and limiting Fréchet gradients by ˆ∂f and ∂f, respectively, i.e., at x, ˆ∂f(x) = {v ∈ Rd : limx→x′ inf f(x)−f(x′)−v⊤(x−x′)∥x−x′∥ ≥ 0}, and ∂f(x) = {vk ∈ Rd : ∃xk f→x, vk ∈ ˆ∂f(xk), vk → v, v ∈ ˆ∂f(x)}. Here xk f→ xrepresents  xk → x with f(xk) → f(x). Afunction  f(x) is µ-strongly convex on X if for any  x, x′ ∈ X, ∂f(x′)⊤(x−x′)+ µ2∥x−x′∥2 ≤f(x) − f(x′).A function  f(x) is ρ-weakly convex on  X for any x, x′ ∈ X ∂f(x′)⊤(x − x′) −

2∥x − x′∥2 ≤ f(x) − f(x′). Let Gx = ∂xf(x, y; ξ)denote a stochastic subgradient of f at x given y, where ξis used to denote the random variable. Similarly, let  Gy = ∂yf(x, y; ξ)denote a stochastic sugradient of  f at y given x. Let ΠΩ[·]denote the projection onto the set  Ω, and let B(x, R)denote an Euclidean ball centered at x with a radius R. Denote by dist(x, X) the distance between x and the set  X, i.e., dist(x, X) = minv∈X ∥x−v∥. Let ˜O(·)hide some logarithmic factors.

For a WCSC min-max problem, it is generally a hard problem to find a saddle point. Hence, we use  nearly ǫ-stationarityas the measure of convergence for solving WCSC problems Rafique et al. (2018), which is defined as follows.

Definition 1 A solution x is a nearly  ǫ-stationary point of  minx ψ(x)if there exist z and a constant  c > 0 such that ∥z − x∥ ≤ cǫ and dist(0, ∂ψ(z)) ≤ ǫ.

For a  ρ-weakly convex function  ψ(x), let z = arg minx∈Rd ψ(x)+ γ2∥x− ˜x∥2 where γ > ρ and˜x ∈ Rd is a reference point. Due to the strong convexity of the above problem, z is unique and  0 ∈ ∂ψ(z)+γ(z−˜x), which results in  γ(˜x−z) ∈ ∂ψ(z), so that dist(0, ∂ψ(z)) ≤ γ∥˜x−z∥.We can find a nearly  ǫ-stationary point  ˜xas long as  γ∥˜x − z∥ ≤ ǫ.

Before ending this section, we present some assumptions that will be imposed in our analysis.

Assumption 1 X and Y are closed convex sets. There exist initial solutions  x0 ∈ X, y0 ∈ Yand  ǫ0 > 0such that Gap(x0, y0) ≤ ǫ0.

image

Assumption 2 (1) f(x, y) is µ-strongly convex in  x for any y ∈ Y and λ-strongly concave in  y for any x ∈ X. (2) There exist  B1, B2 > 0 such that E[exp(∥Gx∥2B21 )] ≤ exp(1) and

image

Assumption 3 (1) f(x, y) is ρ-weakly convex in  x for any y ∈ Y and is λ-strongly concave in  y for any x ∈ X. (2) E[∥Gx∥2] ≤ M21 and E[∥Gy∥2] ≤ M22 .

Remark: When f(x, y) is smooth in x and y, the second condition in the above assumption can be replaced by the bounded variance condition.

4.1. Strongly-Convex Strongly-Concave Min-Max Problems

In this subsection, we present the main result for solving SCSC problems. The proposed Epoch-GDA algorithm for SCSC min-max problems is shown in Algorithm 1. As illustrated, our algorithm consists of a series of epochs. In each epoch (Line 3 to 7), standard primal-dual updates are performed. After an epoch ends, in Line 8, the solutions  ¯xk and ¯yk averagedover the epoch are returned as the initialization for the next epoch. In Line 9, step sizes ηx,k+1 and ηy,k+1, the radius  Rk+1and the number of iterations  Tk+1are also adjusted for the next epoch. The ball constraints  B(xk0, Rk) and B(yk0, Rk)at each iteration are used for the convergence analysis in high probability as in Hazan and Kale (2011, 2014). It is clear that Epoch-GDA can be considered as a primal-dual variant of Epoch-GD Hazan and Kale (2011, 2014).

The following theorem shows that the iteration complexity of Algorithm 1 to achieve an ǫ-duality gap for a general SCSC problem (1) is  O(1/ǫ).

Theorem 1 Suppose Assumption 1 and Assumption 2 hold and let  δ ∈ (0, 1)be a failing probability and  ǫ ∈ (0, 1)be the target accuracy level for the duality gap. Let  K = ⌈log(ǫ0ǫ )⌉

image

Then the total number of iterations of Algorithm 1 to achieve an  ǫ-duality gap, i.e., Gap(¯xK, ¯yK) ≤ǫ, with probability  1 − δ is

image

Remark 1 To the best of our knowledge, this is the first study that achieves a fast rate of O(1/T) for the duality gap of a general SCSC min-max problem without any special structure assumption or smoothness of the objective function and an additional computational cost. In contrast, even if the algorithm in Yan et al. (2019) attains the O(1/T) rate of convergence, it i) only guarantees the convergence of the primal objective gap, rather than the duality gap, ii) additionally requires a special structure of the objective function, and iii) needs an extra O(n) computational cost of the deterministic update at each outer loop to handle the maximization over y. In contrast, Algorithm 1 has stronger theoretical results with less restrictions of the problem structures and computational cost.

Remark 2 A lower bound of O(1/T) for stochastic strongly convex minimization problems has been proven in Agarwal et al. (2009); Hazan and Kale (2014). Due to Gap(x, y) ≥P(x) − P(x∗), bounding the duality gap is more difficult than bounding the primal gap. This means that our convergence rate matches the lower bound and is therefore the best possible convergence rate without adding more assumptions.

4.2. Weakly-Convex Strongly-Concave Problems

In this subsection, we present the convergence results for solving WCSC problems, where the objective function  f(x, y) in (1) is ρ-weakly convex in  x and λ-strongly concave in y. The proposed Epoch-GDA algorithm for WCSC min-max problems is summarized in Algorithm 2. As our Algorithm 1, Algorithm 2 consists of a number of epochs. As shown in Line 4 to Line 8, each epoch performs primal-dual updates on x and y. When updating x at the

k-th stage, an additional regularizer γ2∥x − xk0∥2is added, where the value  γ = 2ρ. Theadded term is used to handle the weak convexity condition. After an epoch ends, average solutions of both x and y are restarted as the initial ones for the next epoch. The step sizes for updating x and y are set to  O(1/(ρk)) and O(1/(λk)) at the k-th epoch, respectively. If we define ˆfk(x, y) = f(x, y) + γ2 ∥x − xk0∥2, we can see that  ˆfk(x, y) is ρ-strongly convex in x and  λ-strongly concave in  y, since f(x, y) is ρ-weakly convex and  γ = 2ρ. Indeed, for each inner loop of Algorithm 2, we actually work on the SCSC problem  minx∈X maxy∈Y ˆfk(x, y).It is worth mentioning the key difference between our algorithm and the recently proposed stochastic algorithm PG-SMD Rafique et al. (2018) for WCSC problems with a special structural objective function. PG-SMD also consists of two loops. For each inner loop, it runs the same updates with the added regularizer on x as Algorithm 2. It restarts x by averaging the solutions over the inner loop, like our  ¯xk, but restarts y by taking the deterministic maximization of (1) over  y given ¯xk, leading to an additional O(n) computational complexity per epoch. In addition, PG-SMD sets  ηky = O(1/(γλ2k)). Although Algorithm 2 shares similar updates to PG-SMD, our analysis yields stronger results under weaker assumptions — the same iteration complexity ˜O(1/ǫ4)without deterministic updates for y and special structure in the objective function. This is due to our sharper analysis that makes use of the telescoping sum based on the duality gap of ˆfkinstead of the primal objective gap. The convergence result of Algorithm 2 that achieves a nearly  ǫ-stationary point with ˜O(1/ǫ4)iteration complexity is summarized below.

image

where  τis randomly sampled from  {1, ..., K} and (ˆx∗k, ˆy∗k)is the saddle-point of  fk(x, y). Thetotal number of iteration is �Kk=1 Tk = ˜O( 1ǫ4 ).

Remark 3 Theorem 2 shows that the iteration complexity of Algorithm 2 to attain an  ǫ-nearly stationary point is ˜O(1/ǫ4). It improves the result of Rafique et al. (2018) for WCSC problems in terms of two aspects. First, Rafique et al. (2018) requires a stronger condition on the structure of the objective function, while our analysis simply assumes a general objective function f(x, y). Second, Rafique et al. (2018) requires to solve the maximization over y at each epoch, which may introduce an O(n) computational complexity for  y ∈ Rn 2. Incontrast, our algorithm restarts both the primal variable x and dual variable y at each epoch, which does not need an additional cost.

Finally, we note that when f(x, y) is smooth in x and y, we can use stochastic Mirror Prox algorithm Juditsky et al. (2011) to replace the stochastic gradient descent ascent updates (Step 6 and Step 7) such that we can use a bounded variance assumption of the stochastic gradients instead of bounded second-order moments. It is a simple exercise to finish the proof by following our analysis of Theorem 2.

In this section, we present the proof of Theorem 1 and a proof sketch of Theorem 2. As we mentioned at the introduction, the key challenge in the analysis of Epoch-GDA lies in handling the variable distance measure  ∥ˆx(y1)−x0∥2+∥ˆy(x1)−y0∥2 for any (x0, y0) ∈ X×Y and(x1, y1) ∈ X × Yand its connection to the duality gaps, where  ˆx(y1) = arg minx′∈X f(x′, y1)and  ˆy(x1) = arg maxy′∈Y f(x1, y′). Hence, we first introduce the following key lemma that is useful in the analysis of Epoch-GDA for both SCSC and WCSC problems. It connects the variable distance measure  ∥ˆx(y1) − x0∥2 + ∥ˆy(x1) − y0∥2 to the duality gaps at  (x0, y0)and  (x1, y1).

Lemma 1 Consider the following  µ-strongly convex in  x and λ-strongly concave problem minx∈Ω1 maxy∈Ω2 f(x, y). Let (x∗, y∗)denote the saddle point solution to this problem. Suppose we have two solutions  (x0, y0) ∈ Ω1 × Ω2 and (x1, y1) ∈ Ω1 × Ω2. Then the following

image

5.1. Proof of Theorem 1 for the SCSC setting

The key idea is to first show the convergence of the duality gap with respect to the ball constraints  B(xk0, Rk) and B(yk0, Rk)in an epoch (Lemma 2). Then we investigate the condition to make  ˆx(¯yk) ∈ B(xk0, Rk) and ˆy(¯xk) ∈ B(yk0, Rk)given the average solution (¯xk, ¯yk), which allows us to derive the duality gap Gap(¯xk, ¯yk)for the original problem. Finally, under such conditions, we show how the duality gap between two consecutive outer loops can be halved (Theorem 3), which implies the total iteration complexity (Theorem 1). Below, we omit superscript k when it applies to all epochs.

Lemma 2 Suppose Assumption 2 holds. Let Line 3 to 7 of Algorithm 1 run for T iterations (omitting the k-index) by fixed step sizes  ηx and ηy. Then with the probability at least  1 − ˜δwhere  0 < ˜δ < 1, for any x ∈ X ∩ B(x0, R) and y ∈ Y ∩ B(y0, R), ¯x = �T−1t=0 xt/T,

image

Remark 4 Lemma 2 is a standard analysis for an epoch of Algorithm 1. The difficulty arises when attempting to plug x and y into (3). In order to derive the duality gap on the LHS of (3), we have to plug in  x ← ˆx(¯y) and y ← ˆy(¯x). Nevertheless, it is unclear whether ˆx(¯y) ∈ B(x0, R) and ˆy(¯x) ∈ B(y0, R), which is the requirement for x and y to be plugged into (3). In the following lemma, we investigate the condition to make  ˆx(¯y) ∈ B(x0, R) andˆy(¯x) ∈ B(y0, R)based on Lemma 1.

image

Remark 5 Lemma 3 shows that if we properly set the values of  R, ηx, ηy and T, then ˆxR(¯y)and  ˆyR(¯x)are the interior points of  B(x0, R) and B(y0, R)with high probability. Therefore, we conclude that  ˆx(¯y) = ˆxR(¯y) and ˆy(¯x) = ˆyR(¯x)with probability  1 − ˜δunder the conditions of Lemma 3, which allows us to derive the duality gap in LHS of (3) of Lemma 2.

The following theorem gives the relation of duality gaps between two consecutive epochs of Algorithm 1 by using Lemma 2 and the conditions proven by Lemma 3.

Theorem 3 Consider the k-th epoch of Algorithm 1 with an initial solution  (xk0, yk0) andthe ending averaged solution  (¯xk, ¯yk). Suppose Assumption 2 holds and Gap(xk0, yk0) ≤ ǫk−1.Let  Rk ≥ 2�

image

Remark 6 Theorem 3 shows that after running  Tkiterations at the k-th stage, the upper bound of the duality gap would be halved with high probability, i.e., from min{µ,λ}R2k8 to

16. Then, in order to make the duality gap of each outer loop of Algorithm 1 halved from the last epoch, we can simply set  R2k+1 = R2k2 , and accordingly,  ηx,k+1 = ηkx2 , ηy,k+1 =ηky2and  Tk+1 = 2Tk.

Proof (of Theorem 3) For any  x ∈ B(xk0, Rk) and y ∈ B(yk0, Rk), we have ∥x − xk0∥ ≤ Rand  ∥y − yk0∥ ≤ R, so by (3) of Lemma 2, we have with probability  1 − ˜δ

image

where inequality  (a) is due to x ∈ B(xk0, Rk) and y ∈ B(yk0, Rk). Inequality (b) is due to the values of  ηkx, ηky and Tk. Recall the definitions  ˆx(¯yk) = arg minx∈X f(x, ¯yk) and ˆy(¯xk) =

arg maxy∈Y f(¯xk, y). By Lemma 3, we have ˆx(¯yk) ∈ B(xk0, Rk) and ˆy(¯xk) ∈ B(yk0, Rk) withprobability  1 − ˜δ. Then from (4) we have

image

Given the condition Gap(xk0, yk0) ≤ ǫk−1 ≤ min{µ,λ}R2k8 ,we then conclude that running  Tkiterations in an epoch of Algorithm 1 would halve the duality gap with high probability. As indicated in Theorem 3, the duality gap Gap(¯xk, ¯yk)can be halved as long as the condition of Theorem 3 holds. Then Theorem 1 is implied (the detailed proof is in Supplementary Materials).

5.2. Proof Sketch of Theorem 2 for the WCSC setting

Due to limit of space, we only present a sketch here and present the full proof in the Supplement. Recall ˆfk(x, y) = f(x, y) + γ2∥x − xk0∥2.Let us denote its duality gap by

Gapk(x, y) = ˆfk(x, ˆyk(x)) − ˆfk(ˆxk(y), y), where we define  ˆyk(x) := arg maxy′∈Y ˆfk(x, y′)given  x ∈ X and ˆxk(y) := arg minx′∈X ˆfk(x′, y) given y ∈ Y. Its saddle point solution is denoted by  (ˆx∗k, ˆy∗k), i.e., ˆfk(ˆx∗k, y) ≤ ˆfk(ˆx∗k, ˆy∗k) ≤ ˆfk(x, ˆy∗k) for any x ∈ X and y ∈ Y . Thekey idea of our analysis is to connect the duality gap �Gapk(xk0, yk0) to γ2∥ˆx∗k − xk0∥2, andthen by making  γ2∥ˆx∗k − xk0∥2 ≤ ǫ2, we can show that  xk0 is a nearly  ǫ-stationary point. To this end we first establish a bound of the duality gap for the regularized problem ˆfk(x, y)for the k-th epoch (Lemma 4). Then we connect it to  γ∥ˆx∗k − xk0∥2 (Lemma 5). Finally, we bound  γ∥ˆx∗k − xk0∥2by a telescoping sum of  E[ �Gapk(xk0, yk0)] − E[ �Gapk+1(xk+10 , yk+10 )] andE[P(xk0) − P(xk+10 )].

In this paper, we filled the gaps between stochastic min-max and minimization optimization problems. We proposed Epoch-GDA algorithms for general SCSC and general WCSC problems, which do not impose any additional assumptions on the smoothness or the structure of the objective function. Our key lemma provides sharp analysis of Epoch-GDA for both problems. For SCSC min-max problems, to the best of our knowledge, our result is the first one to show that Epoch-GDA achieves the optimal rate of O(1/T) for the duality gap of general SCSC min-max problems. For WCSC min-max problems, our analysis allows us to derive the best complexity ˜O(1/ǫ4)of Epoch-GDA to reach a nearly  ǫ-stationary point, which does not require smoothness, large mini-batch sizes or other structural conditions.

Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 214–223, International Convention Centre, Sydney, Australia, 2017.

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1): 120–145, May 2011. ISSN 0924-9907. doi: 10.1007/s10851-010-0251-1. URL http://dx.doi.org/10.1007/s10851-010-0251-1.

Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014.

Cong Dang and Guanghui Lan. Randomized first-order methods for saddle point optimiza- tion. arXiv preprint arXiv:1409.8625, 2014.

Simon S Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex- concave saddle point problems without strong convexity. arXiv preprint arXiv:1802.01504, 2018.

Yanbo Fan, Siwei Lyu, Yiming Ying, and Baogang Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems 30, pages 497–505. 2017.

Gauthier Gidel, Tony Jebara, and Simon Lacoste-Julien. Frank-wolfe algorithms for saddle point problems. arXiv preprint arXiv:1610.07797, 2016.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sher- jil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969125.

Davood Hajinezhad and Mingyi Hong. Perturbed proximal primal–dual algorithm for non- convex nonsmooth optimization. Mathematical Programming, 176(1-2):207–245, 2019.

Erfan Yazdandoost Hamedani and Necdet Serhat Aybat. A primal-dual algorithm for general convex-concave saddle point problems. arXiv preprint arXiv:1803.01401, 2018.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pages 421–436, 2011.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.

Le Thi Khanh Hien, Renbo Zhao, and William B Haskell. An inexact primal-dual smoothing framework for large-scale non-bilinear saddle point problems. arXiv preprint arXiv:1711.03669, 2017.

Mingyi Hong. Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications. arXiv preprint arXiv:1604.00543, 2016.

Mingyi Hong, Jason D Lee, and Meisam Razaviyayn. Gradient primal-dual algorithm con- verges to second-order stationary solutions for nonconvex distributed optimization. arXiv preprint arXiv:1802.08941, 2018.

Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the con- vergence of single-call stochastic extra-gradient methods. arXiv preprint arXiv:1908.08465, 2019.

Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.

Guanghui Lan, Arkadi Nemirovski, and Alexander Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical programming, 134(2):425–458, 2012.

Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvexconcave minimax problems. CoRR, abs/1906.00331, 2019.

Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, and Tianbao Yang. Fast stochastic auc maximization with o(1/n)-convergence rate. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Mingrui Liu, Zhuoning Yuan, Yiming Ying, and Tianbao Yang. Stochastic auc maximization with deep neural networks. arXiv preprint arXiv:1908.10831, 2019.

Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. arXiv preprint arXiv:1902.08294, 2019.

Tong Zhang Luo Luo, Haishan Ye. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. CoRR, abs/2001.03724, 2020.

Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems (NIPS), pages 2208–2216, 2016.

Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objec- tives. In Advances in Neural Information Processing Systems (NIPS), pages 2975–2984, 2017.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009. URL http://dx.doi.org/10.1137/070704277.

Yu Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16:235–249, 01 2005. doi: 10.1137/S1052623403422285.

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.

Balamurugan Palaniappan and Francis R. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems (NIPS), pages 1408–1416, 2016.

Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. CoRR, abs/1810.02060, 2018.

Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. Journal of Machine Learning Research (JMLR), 2013.

Conghui Tan, Tong Zhang, Shiqian Ma, and Ji Liu. Stochastic primal-dual method for empirical risk minimization with o (1) per-iteration complexity. In Advances in Neural Information Processing Systems, pages 8366–8375, 2018.

Jialei Wang and Lin Xiao. Exploiting strong convexity from data with primal-dual first-order algorithms. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3694–3702. JMLR. org, 2017.

Yan Yan, Yi Xu, Qihang Lin, Lijun Zhang, and Tianbao Yang. Stochastic primal-dual algorithms with faster convergence than  o(1/√T)for problems without bilinear structure. arXiv preprint arXiv:1904.10112, 2019.

Tianbao Yang, Mehrdad Mahdavi, Rong Jin, and Shenghuo Zhu. An efficient primal dual prox method for non-smooth optimization. Machine Learning, 98(3):369–406, 2015.

Yiming Ying, Longyin Wen, and Siwei Lyu. Stochastic online auc maximization. In Advances in neural information processing systems, pages 451–459, 2016.

Adams Wei Yu, Qihang Lin, and Tianbao Yang. Doubly stochastic primal-dual coordinate method for regularized empirical risk minimization with factorized data. CoRR, abs/1508.03390, 2015. URL http://arxiv.org/abs/1508.03390.

Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. The Journal of Machine Learning Research, 18(1):2939–2980, 2017.

Renbo Zhao. Optimal algorithms for stochastic three-composite convex-concave saddle point problems. arXiv preprint arXiv:1903.01687, 2019.

Recall ˆfk(x, y) = f(x, y) + γ2∥x − xk0∥2.Let us denote its duality gap by �Gapk(x, y) =ˆfk(x, ˆyk(x)) − ˆfk(ˆxk(y), y), where we define  ˆyk(x) := arg maxy′∈Y ˆfk(x, y′) given x ∈ X andˆxk(y) := arg minx′∈X ˆfk(x′, y) given y ∈ Y. Its saddle point solution is denoted by  (ˆx∗k, ˆy∗k),i.e., ˆfk(ˆx∗k, y) ≤ ˆfk(ˆx∗k, ˆy∗k) ≤ ˆfk(x, ˆy∗k) for any x ∈ X and y ∈ Y. The key idea of our analysis is to connect the duality gap �Gapk(xk0, yk0) to γ2∥ˆx∗k − xk0∥2, and then by making γ2∥ˆx∗k − xk0∥2 ≤ ǫ2, we can show that  xk0 is a nearly  ǫ-stationary point. To this end we first establish a bound of the duality gap for the regularized problem ˆfk(x, y) for the k-th epoch(Lemma 4). Then we connect it to  γ∥ˆx∗k − xk0∥2 (Lemma 5). Finally, we bound  γ∥ˆx∗k − xk0∥2by a telescoping sum of  E[ �Gapk(xk0, yk0)] − E[ �Gapk+1(xk+10 , yk+10 )] and E[P(xk0) − P(xk+10 )].

Lemma 4 Suppose Assumption 3 holds and  γ = 2ρ. For k ≥ 1, Lines 4 to 8of Algorithm 2 guarantee

image

the RHS of (5), we could have the following upper bound by the key lemma (Lemma 1) 3

image

On the other hand, the following lemma lower bounds LHS of (5) to construct telescoping sums.

image

Lemma 5 lower bounds �Gapk(¯xk, ¯yk)in LHS of (5) by three parts. The first part constructs telescoping sum of �Gapk+1(xk+10 , yk+10 )− �Gapk(xk0, yk0)together with (6). The second part itself is an element of telescoping sums over the primal gap. The third part  ∥xk0 − ˆx∗k∥2can be used as the measure of nearly  ǫ-stationary point, which is further explored in Theorem 2.

Proof (of Theorem 2) Consider the k-th stage. Let us start from (5) in Lemma 4 as follows

image

where (a) is due to settings  Tk = 106(k+1)3 , ηkx = 4ρ(k+1), and ηky = 2λ(k+1). Re-organizing the above inequality, we have

image

Summing from k = 1 to k = K, we have

image

where the last inequality is due to the upper bounds the three terms A, B and C as follows. For the term A, we have

image

where the second equality is due to the setting of  ηkx = 4ρ(k+1) and ηky = 2λ(k+1). The last

inequality is due to �K+1k=1 1k ≤ ln(K + 1) + 1.For the term B, we have

image

For the term C, we have

image

where the first inequality is due to �GapK+1(xK+10 , yK+10 ) ≥ 0. By plugging the above upper bounds of the three terms A, B and C into (11), we have (12). Then by randomly sampling  τ from {1, ..., K}, we have

image

which leads to  γ2E[∥x∗τ − xτ0∥2] ≤ ǫ2. Recall Tk = 106(k+1)3. To compute the total number of iterations, we have

image

image

where inequality  (a) is due to µ-strong convexity of  f(x, y1) in x with fixed y1(with optimality at  ˆx(y1)) and µ-strong convexity of  f(x, y∗) in x with fixed y∗ (with optimality at  x∗). In-equality  (b) is due to f(x∗, y1) ≤ f(x∗, y∗). Inequality  (c) is due to f(x0, y∗) ≤ f(x0, ˆy(x0)).

image

where inequality  (a) is due to λ-strong concavity of  f(x1, y) in y with fixed x1(optimality at ˆy(x1)) and f(x∗, y) in y with fixed x∗ (optimality at  ˆy∗). Inequality  (b) is due to f(x1, y∗) ≥f(x∗, y∗). Inequality  (c) is due to f(x∗, y0) ≥ f(ˆx(y0), y0).

image

Proof Before the proof, we first present the following two lemmas as follows.

Lemma 6 Let X1, X2, ..., XTbe independent random variables and  Et[exp(X2tB2 )] ≤ exp(1)for any  t ∈ {1, ..., T}. Then we have with probability at least  1 − ˜δ

image

Lemma 7 (Lemma 2 of Lan et al. (2012)) Let  X1, ..., XTbe a martingale difference sequence, i.e.,  Et[Xt] = 0 for all t. Suppose that for some values  σt, for t = 1, 2, ..., T, wehave  Et[exp(X2tσ2t )] ≤ exp(1). Then with probability at least  1 − δ, we have

image

For simplicity of presentation, we use the notations  ∆tx = ∂xf(xt, yt; ξt), ∆ty = ∂yf(xt, yt, ; ξt),∂tx = ∂xf(xt, yt) and ∂ty = ∂yf(xt, yt). To prove Lemma 2, we would leverage the following

two update approaches:

image

where  x0 = ˜x0 and y0 = ˜y0. The first two updates are identical to Line 4 and Line 5 in Algorithm 1. This can be verified easily. Take the first one as example:

image

Let  ψ(x) = x⊤u + 12γ ||x − v||2 with x′ = arg minx∈X′ ψ(x), which includes the four update approaches in (15) as special cases. By using the strong convexity of  ψ(x)and the first order optimality condition (∂ψ(x′)⊤(x − x′) ≥ 0), for any x ∈ X′, we have

image

which implies

image

Then

image

Applying the above result to the updates in (15), we have for any  x ∈ X ∩ B(x0, R) andy ∈ Y ∩ B(y0, R),

image

Adding the above four inequalities together, we have

image

where inequality (a) above is due to the convexity of  f(x, yt) in xand concavity of  f(xt, y)in y. Inequality  (b) is due to (a + b)2 ≤ 2a2 + 2b2.

Then we combine the LHS and RHS by summing up  t = 0, ..., T − 1:

image

In the following, we show how to bound the above A to E terms. To bound the above term A in (19), we apply Lemma 6 as follows, which holds with probability  1 − ˜δ,

image

Similarly, term C in (19) can be bounded with probability  1 − ˜δas follows

image

To bound term B of (19), which contains only the full subgradients  ∂tx, we have∥∂tx∥2 = ∥E[∆tx]∥2 ≤ E[∥∆tx∥2] ≤ B21,

where the first inequality is due to Jensen’s inequality and the second inequality is due to

image

Therefore, we have

image

Similarly, for term D in (19), we have

image

To bound term  E of (19), let Ut = (xt − ˜xt)⊤(∂tx − ∆tx) and Vt = (yt − ˜yt)⊤(∂ty − ∆ty)for  t ∈ {0, ..., T − 1}, which are Martingale difference sequences. We thus would like to use Lemma 7 to handle these terms. To this end, we can first upper bound  |Ut| and |Vt| asfollows

image

Then the above two inequalities implies that

image

where inequality  (a) is due to (a + b)2 ≤ 2a2 + 2b2, inequality (b) is due to the concavity of √·and Jensen’s inequality. Inequality (c) is due to the assumption. In a similar way, we have

image

Next, applying Lemma 7 with (24) and (25), we have with probability at least  1 − ˜δ

image

For LHS of (19), by Jensen’s inequality, we have

image

where  ¯x = 1T�T−1t=0 xt and ¯y = 1T�T−1t=0 .

Suppose  T ≥ 1. By plugging (27), (20), (21), (22), (23) and (26) back into (19), with probability at least  1 − ˜δ, we have

image

Proof First, we start from

image

where the first inequality is due to the assumption. Markov inequality indicates that  P(X ≥ a) ≤ E[X]afor a random variable X, which, by additionally introducing ˜δ, leads to

image

Therefore, with probability at least  1 − ˜δ, we have

image

image

We can bound the above term A as follows

image

To Bound term B in (29), we apply Lemma 2 as follows

image

where the last inequality holds with probability at least  1 − ˜δwith the setting of  ηx, ηy andT as follows

image

which shows  ˆxR(¯y) and ˆyR(¯y)are interior points of  B(x0, R) and B(y0, R), respectively, so that  ˆxR(¯y) = ˆx(¯y) and ˆyR(¯x) = ˆy(¯x).

image

image

Given  Tk+1 = 2Tkin Algorithm 1 and  K = ⌈log(ǫ0ǫ )⌉, the total number of iterations can be computed by

image

image

where inequality  (a) is due to ρ-weakly convexity of f in x. Inequality (b) is due to Young’s inequality, i.e.,  (xt − xt+1)⊤∆tx − 12ηx ∥xt+1 − xt∥2 ≤ ηx2 ∥∆tx∥2.

Similarly, due to the 1ηy -strong convexity of  ψty in yand the optimality condition of  yt+1,we have

image

where inequality (a) is due to concavity of f in y. Inequality (b) is due to Young’s inequality,

image

Now we do not take expectation, since we aim to eliminate the randomness of x and y in  (x − xt) and (yt − y), respectively. To achieve this, we use the following updates

image

image

Summing the above two inequalities, we have

image

Summing the above inequality from  t = 0 to T − 1and using Jensen’s inequality, we have

image

Taking expectation over both sides and recalling that  E[∥∂xf(x, y; ξ)∥2] ≤ M21 and

image

Before proving Lemma 5, we first state the following lemma, whose proof is in the next section.

image

2) �Gapk(¯xk, ¯yk) ≥ P(xk+10 ) − P(xk0) + γ2∥¯xk − xk0∥2, where P(x) = maxy∈Y f(x, y),

3) �Gapk(¯xk, ¯yk) ≥ ρ(1 − β)2( 1β − 1)∥xk0 − ˆx∗k∥2 − ρ2( 1β − 1)∥¯xk − xk0∥2, (37)

image

where inequality (a) is due to Lemma 8, inequality (b) is due to the setting of  γ = 2ρ, α = 56,β = 12. Inequality  (c) is due to ∥¯xk − xk0∥2 ≥ 0.

Proof Before we prove the three results, we first state two results of Young’s inequality as follows

image

where inequality (a) is due to (39) (0 < α ≤ 1). Inequality  (b) is due to ρ-strong convexity of ˆfk+1(x, yk+10 ) in xand optimality at  ˆxk+1(yk+10 ). Inequality  (c) is due to ˆfk(ˆxk+1(¯yk), ¯yk) ≥ˆfk(ˆxk(¯yk), ¯yk) and ˆfk+1(xk+10 , yk+10 ) ≤ ˆfk+1(xk+10 , ˆyk+1(xk+10 )).

Re-organizing the above inequality, we have

image

where inequality (a) is due to (41) and  0 < β ≤ 1. Inequality  (b) is due to ρ-storng convexity of ˆfk in x. Inequality  (c) is due to 0 < β ≤ 1 and

image


Designed for Accessibility and to further Open Science