Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes

2020·Arxiv

Abstract

Abstract

In a recent series of papers it has been established that variants of Gradient Descent/Ascent and Mirror Descent exhibit last iterate convergence in convex-concave zero-sum games. Specifi-cally, [6, 10] show last iterate convergence of the so called “Optimistic Gradient Descent/Ascent” for the case of unconstrained min-max optimization. Moreover, in [11] the authors show that Mirror Descent with an extra gradient step displays last iterate convergence for convex-concave problems (both constrained and unconstrained), though their algorithm does not follow the online learning framework; it uses extra information rather than only the history to compute the next iteration. In this work, we show that ”Optimistic Multiplicative-Weights Update (OMWU)” which follows the no-regret online learning framework, exhibits last iterate convergence locally for convex-concave games, generalizing the results of [8] where last iterate convergence of OMWU was shown only for the bilinear case. We complement our results with experiments that indicate fast convergence of the method.

1 Introduction

In classic (normal form) zero-sum games, one has to compute two probability vectors ∆that consist an equilibrium of the following problem

where A is real matrix (called payoff matrix). Here represents the payment of the x player to the y player under choices of strategies by the two players and is a bilinear function.

Arguably, one of the most celebrated theorems and a founding stone in Game Theory, is the minimax theorem by Von Neumann [16]. It states that

where f : ∆is convex in x, concave in y. The aforementioned result holds for any convex compact sets . The min-max theorem reassures us that an equilibrium always exists in the bilinear game (1) or its convex-concave analogue (again f(x, y) is interpreted as the payment of the x player to the y player). An equilibrium is a pair of randomized strategies () such that neither player can improve their payoff by unilaterally changing their distribution.

Soon after the appearance of the minimax theorem, research was focused on dynamics for solving min-max optimization problems by having the min and max players of (1) run a simple online learning procedure. In the online learning framework, at time t, each player chooses a probability distribution (respectively) simultaneously depending only on the past choices of both players (i.e., ) and experiences payoff that depends on choices

An early method, proposed by Brown [4] and analyzed by Robinson [14], was fictitious play. Later on, researchers discover several learning robust algorithms converging to minimax equilibrium at faster rates, see [5]. This class of learning algorithms, are the so-called “no-regret” and include Multiplicative Weights Update method [2] and Follow the regularized leader.

1.1 Average Iterate Convergence

Despite the rich literature on no-regret learning, most of the known results have the feature that min-max equilibrium is shown to be attained only by the time average. This means that the trajectory of a no-regret learning method () has the property that to the equilibrium of (1), as . Unfortunately that does not mean that the last iterate (converges to an equilibrium, it commonly diverges or cycles. One such example is the well-known Multiplicative Weights Update Algorithm, the time average of which is known to converge to an equilibrium, but the actual trajectory cycles towards the boundary of the simplex ([3]). This is even true for the vanilla Gradient Descent/Ascent, where one can show for even bilinear landscapes (unconstrained case) last iterate fails to converge [6].

Motivated by the training of Generative Adversarial Networks (GANs), the last couple of years researchers have focused on designing and analyzing procedures that exhibit last iterate convergence (or pointwise convergence) for zero-sum games. This is crucial for training GANs, the landscapes of which are typically non-convex non-concave and averaging now as before does not give much guarantees (e.g., note that Jensen’s inequality is not applicable anymore). In [6, 10] the authors show that a variant of Gradient Descent/Ascent, called Optimistic Gradient Descent/Ascent has last iterate convergence for the case of bilinear functions where and (this is called the unconstrained case, since there are no restrictions on the vectors). Later on, [8] generalized the above result with simplex constraints, where the online method that the authors analyzed was Optimistic Multiplicative Weights Update. In [11], it is shown that Mirror Descent with extra gradient computation converges pointwise for a class of zero-sum games that includes the convex-concave setting (with arbitrary constraints), though their algorithm does not fit in the online no-regret framework since it uses information twice about the payoffs before it iterates. Last but not least there have appeared other works that show pointwise convergence for other settings (see [13, 7] and [1] and references therein) to stationary points (but not local equilibrium solutions).

1.2 Main Results

In this paper, we focus on the min-max optimization problem

where f is a convex-concave function (convex in x, concave in y). We analyze the no-regret online algorithm Optimistic Multiplicative Weights Update (OMWU). OMWU is an instantiation of the Optimistic Follow the Regularized Leader (OFTRL) method with entropy as a regularizer (for both players, see Preliminaries section for the definition of OMWU).

We prove that OMWU exhibits local last iterate convergence, generalizing the result of [8] and proving an open question of [15] (for convex-concave games). Formally, our main theorem is stated below:

Theorem 1.1 (Last iterate convergence of OMWU)be a twice differentiable function f(x, y) that is convex in x and concave in y. Assume that there exists an equilibrium (that satisfies the KKT conditions with strict inequalities (see (4)). It holds that for sufficiently small stepsize, there exists a neighborhood such that for all for all initial conditions (, OMWU exhibits last iterate (pointwise) convergence, i.e.,

where (denotes the t-th iterate of OMWU.

Moreover, we complement our theoretical findings with experimental analysis of the procedure. The experiments on KL-divergence indicate that the results should hold globally.

1.3 Structure and Technical Overview

We present the structure of the paper and a brief technical overview.

Section 2 provides necessary definitions, the explicit form of OMWU derived from OFTRL with entropy regularizer, and some existing results on dynamical systems.

Section 3 is the main technical part, i.e, the computation and spectral analysis of the Jacobian matrix of OMWU dynamics. The stability analysis, the understanding of the local behavior and the local convergence guarantees of OMWU rely on the spectral analysis of the computed Jacobian matrix. The techniques for bilinear games (as in [8]) are no longer valid in convex-concave games. Allow us to explain the differences from [8]. In general, one cannot expect a trivial generalization from linear to non-linear scenarios. The properties of bilinear games are fundamentally different from that of convex-concave games, and this makes the analysis much more challenging in the latter. The key result of spectral analysis in [8] is in a lemma (Lemma B.6) which states that a skew symmetrichas imaginary eigenvalues. Skew symmetric matrices appear since in bilinear cases there are terms that are linear in x and linear in y but no higher order terms in x or y. However, the skew symmetry has no place in the case of convex-concave landscapes and the Jacobian matrix of OMWU is far more complicated. One key technique to overcome the lack of skew symmetry is the use of Ky Fan inequality [12] which states that the sequence of the eigenvalues of majorizes the real part of the sequence of the eigenvalues of W for any square matrix W (see Lemma 3.1).

Section 4 focuses on numerical experiments to understand how the problem size and the choice of learning rate affect the performance of our algorithm. We observe that our algorithm is able to achieve global convergence invariant to the choice of learning rate, random initialization or problem size. As comparison, the latest popularized (projected) optimistic gradient descent ascent is much more sensitivity to the choice of hyperparameter. Due to space constraint, the detailed calculation of the Jacobian matrix (general form and at fixed point) of OMWU are left in Appendix.

Notation The boldface x and y denote the vectors in ∆and ∆denotes the t-th iterate of the dynamical system. The letter J denote the Jacobian matrix. I, 0 and 1 are preserved for the identity, zero matrix and the vector with all the entries equal to 1. The support of x is the set of indices of such that = 0, denoted by Supp(x). () denotes the optimal solution for minimax problem. [n] denote the set of integers {1, ..., n}.

2 Preliminaries

In this section, we present some background that will be used later.

2.1 Equilibria for Constrained Minimax

From Von Neumann’s minimax theorem, one can conclude that the problem minhas always an equilibrium () with ) be unique. Moreover from KKT conditions (as long as f is twice differentiable), such an equilibrium must satisfy the following (is a local minimum for fixed is a local maximum for fixed

Definition 2.1 (KKT conditions). Formally, it holds

Remark 2.2 (No degeneracies). For the rest of the paper we assume no degeneracies, i.e., the last inequalities hold strictly (in the case a strategy is played with zero probability for each player). Moreover, it is easy to see that since f is convex concave and twice differentiable, then of the Hessian that involves x variables) is positive semi-definite and (part of the Hessian that involves y variables) is negative semi-definite.

2.2 Optimistic Multiplicative Weights Update

The equations of Optimistic Follow-the-Regularized-Leader (OFTRL) applied to a problem min) with regularizers (strongly convex functions) ) (for player x, y respectively) and is given below (see [6]):

is called the stepsize of the online algorithm. OFTRL is uniquely defined if f is convex-concave and domains X and Y are convex. For simplex constraints and entropy regularizers, i.e., ) = , we can solve for the explicit form of OFTRL using KKT conditions, the update rule is the Optimistic Multiplicative Weights Update (OMWU) and is described as

follows:

2.3 Fundamentals of Dynamical Systems

We conclude Preliminaries section with some basic facts from dynamical systems.

Definition 2.3. A recurrence relation of the form ) is a discrete time dynamical system, with update rule is a subset of for some positive integer k. The point is called a fixed point if w(z) = z.

Remark 2.4. Using KKT conditions (4), it is not hard to observe that an equilibrium point (must be a fixed point of the OMWU algorithm, i.e., if () = () = () then (

Proposition 2.5 ([9]). Assume that w is a differentiable function and the Jacobian of the update rule w at a fixed point has spectral radius less than one. It holds that there exists a neighborhood U around such that for all , the dynamics converges to is called a contraction mapping in U.

Note that we will make use of Proposition 2.5 to prove our Theorem 1.1 (by proving that the Jacobian of the update rule of OMWU has spectral radius less than one).

3 Last iterate convergence of OMWU

In this section, we prove that OMWU converges pointwise (exhibits last iterate convergence) if the initializations () belong in a neighborhood U of the equilibrium (

3.1 Dynamical System of OMWU

We first express OMWU algorithm as a dynamical system so that we can use Proposition 2.5. The idea (similar to [8]) is to lift the space to consist of four components (x, y, z, w, in such a way we can include the history (current and previous step, see Section 2.2 for the equations). First, we provide the update rule g : ∆of the lifted dynamical system and is given by

where [4] are defined as follows:

Then the dynamical system of OMWU can be written in compact form as

In what follows, we will perform spectral analysis on the Jacobian of the function g, computed at the fixed point (). Since g has been lifted, the fixed point we analyze is () (see Remark 2.4). By showing that the spectral radius is less than one, our Theorem 1.1 follows by Proposition 2.5. The computations of the Jacobian of g are deferred to the supplementary material.

3.2 Spectral Analysis

Let () be the equilibrium of min-max problem (2). Assume ), i.e., = 0 then (see equations at the supplementary material, section A)

Jacobian computed at (). This is true because the row of the Jacobian that corresponds to has zeros everywhere but the diagonal entry. Moreover because of the degeneracy assumption of KKT conditions (see Remark 2.2), it holds that

Similarly, it holds for

(again by Remark 2.2) and all other partial derivatives of are zero, therefore

is an eigenvalue of the Jacobian computed at (We focus on the submatrix of the Jacobian of g computed at () that corresponds to the non-zero probabilities of and . We denote to be the diagonal matrix of size

that has on the diagonal the nonzero entries of and similarly we define of size . For convenience, let us denote := and . The Jacobian submatrix is the following

where

We note that I, 0 capture the identity matrix and the all zeros matrix respectively (the appropriate size is indicated as a subscript). The vectors () and () are left eigenvectors with eigenvalue zero for the above matrix. Hence, any right eigenvector (should satisfy the conditions = 0. Thus, every non-zero eigenvalue of the above matrix is also a non-zero eigenvalue of the matrix below:

The characteristic polynomial of is obtained by finding det(). One can perform row/column operations on to calculate this determinant, which gives us the following relation:

where ) is the characteristic polynomial of the following matrix

and are the aforementioned sub-matrices. Notice that can be written as

where,

Notice here that H is the Hessian matrix evaluated at the fixed point (), and is the appropriate sub-matrix restricted to the support of . Although, the Hessian matrix is symmetric, we would like to work with the following representation of

where,

Let us denote any non-zero eigenvalue of which may be a complex number. Thus where ) vanishes and hence the eigenvalue of must satisfy the relation

We are to now show that the magnitude of any eigenvalue of is strictly less than 1, i.e, 1. Trivially, satisfies the above condition. Thus we need to show that the magnitude of where ) vanishes is strictly less than 1. The remainder of the proof proceeds by showing the following two lemmas:

Lemma 3.1 (Real part non-positive). Let be an eigenvalue of matrix . It holds that Re(

Figure 1: Convergence of OMWU vs different sizes of the problem. For Figure (a), x-axis is n and y-axis is the number of iterations to reach convergence for Eqn. (14). In Figure (b) we choose four cases of n to illustrate how error of the problem decreases with the number of iterations.

Proof. Assume that = 0. All the non-zero eigenvalues of matrix coincide with the eigenvalues of the matrix

is positive semi-definite. Moreover, we use KyFan inequalities which state that the sequence (in decreasing order) of the eigenvalues of ) majorizes the real part of the sequence of the eigenvalues of W for any square matrix W (see [12], page 4). We conclude that for any eigenvalue of R, it holds that Re() is at most the maximum eigenvalue of ). Observe now that

Since

by the convex-concave assumption on f it follows that the matrix above is negative semi-definite (see Remark 2.2) and so is . We conclude that the maximum eigenvalue of is non-positive. Therefore any eigenvalue of R has real part non-positive and the same is true for

Lemma 3.2. If is a non-zero eigenvalue of then, Re(0 and 0 as the stepsize

We first can see that which is the learning rate multiplies any eigenvalue and we may assume that whilst is positive, it may be chosen to be sufficiently small and hence the magnitude of any eigenvalue

Remark 3.3. The equation = determines two complex roots for each fixed , say and . The relation between is illustrated in Figure 2, where the x-axis is taken to be ). Specifically we choose that satisfies (The x-axis of Figure 2 takes x from 3 to 103).

Figure 2: less than 1 as

. The relation gives two equations based on the equality of real and imaginary parts as follows,

Notice that the above equations can be transformed to the following forms:

For each , there exist two pairs of points () that are the intersections of the above two hyperbola, illustrated in Figure 4. Recall the condition that hyperbola can be obtained from the translation by () of the hyperbola

where the translated symmetric center is close to ( 0) since (a, b) is close to (0, 0). So the two intersections of the above hyperbola, (), satisfy the property that and since the two intersections are on two sides of the axis , as showed in Figure 3. On the other hand, we have

Figure 3: The intersections of the four branches of hyperbola are the two solutions of the equations (10) or (12). The intersections are on two sides of the line defined by and a < 0. This occurs in the case either ab > 0 or ab < 0.

and then the condition a < 0 gives the inequality

that is equivalent to

where only the case is considered since if the intersection whose x-component satisfying has the property that + is small and then less than 1, Figure 4. Thus to prove that 1, it suffices to assume . It is obvious that = (0 implies that 1. The proof completes.

Figure 4:

4 Experiments

In this section, we conduct empirical studies to verify the theoretical results of our paper. We primarily target to understand two factors that influence the convergence speed of OMWU: the problem size and the learning rate. We also compare our algorithm with Optimistic Gradient Descent Ascent (OGDA) with projection, and demonstrate our superiority against it.

We start with a simple bilinear min-max game:

We first vary the value of n to study how the learning speed scales with the size of the problem. The learning rate is fixed at 1.0, and we run OMWU with 50, 75250} and matrix is generated with i.i.d random Gaussian entries. We output the number of iterations for OMWU to reach convergence, i.e., with error to the optimal solution to be less or equal to 10. The results are averaged from 10 runs with different randomly initializations. As reported in Figure 1, generally a larger problem size requires more iterations to reach convergence. We also provide four specific cases of n to show the convergence in distance in Figure 1(b). The shaded area demonstrates the standard deviation from the 50 runs.

To understand how learning rate affects the speed of convergence, we conduct similar experiments on Eqn. (14) and plot the error with different step sizes in Figure 5(a)-(c). For this experiment the matrix size is fixed as n = 100. We also include a comparison with the Optimistic Gradient Descent Ascent[7]. Notice the original proposal was for unconstrained problems and we use projection in each step in order to constrain the iterates to stay inside the simplex. For the setting we considered, we observe a larger learning rate effectively speeds up our learning process, and our algorithm is relatively more stable to the choice of step-size. In comparison, OGDA is quite sensitive to the choice of step-size. As shown in Figure 5(b), a larger step-size makes the algorithm diverge, while a smaller step-size will make very little progress. Furthermore, we also choose to perform our algorithm over a convex-concave but not bilinear function are the first coefficients of x and y. With this low dimensional function, we could visually show

Figure 5: Time comparisons of OMWU and projected OGDA vs different choices of learning rate. For Figure (a)(b)(c), x-axis is iterations and y-axis is the error to the stationary point for Eqn. (14) with n = 100. We observe that OMWU (as in (a)) always converges while projected OGDA (as in (b)) will diverge for large learning rate. In figure (c) we remove the divergent case and compare the efficiency of the two algorithm measured in CPU time. In Figure (d) we visually present the trajectories for the min-max game of minwith learning rate 0.1, 1.0 and 10. Here x-axis is the value of and y-axis is the value of respectively. The equilibrium point the algorithm converges to is x = [0, 1], y = [0, 1].

the convergence procedure as in Figure 5(b), where each arrow indicates an OMWU step. This figure demonstrates that at least in this case, a larger step size usually makes sure a bigger progress towards the optimal solution.

Finally we show how the KL divergence )) decreases under different circumstances. Figure 6 again considers the bilinear problem (Eqn.(14)) with multiple dimensions n and a simple convex-concave function f(x, y) = + 2with different learning rate. We note that in all circumstances we consider, we observe that OMWU is very stable, and achieves global convergence invariant to the problem size, random initialization, and learning rate.

5 Conclusion

In this paper we analyze the last iterate behavior of a no-regret learning algorithm called Optimistic Multiplicative Weights Update for convex-concave landscapes. We prove that OMWU exhibits last iterate convergence in a neighborhood of the fixed point of OMWU algorithm, generalizing previous

Figure 6: KL divergence decreases with #iterations under different settings. For both images, x-axis is the number of iterations, and y-axis is KL divergence. Figure (a) is OMWU on bilinear function Eqn.(14) with n = {25, 100, 175, 250}. Figure (b) is OMWU on the quadratic function with different learning rate . Shaded area indicates standard deviation from 10 runs with random initializations. OMWU with smaller learning rate tends to have higher variance.

results that showed last iterate convergence for bilinear functions. The experiments explores how the problem size and the choice of learning rate affect the performance of our algorithm. We find that OMWU achieves global convergence and less sensitive to the choice of hyperparameter, compared to projected optimistic gradient descent ascent.

References

[1] Jacob D. Abernethy, Kevin A. Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. CoRR, abs/1906.02027, 2019.

[2] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[3] James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18-22, 2018, pages 321–338, 2018.

[4] G.W Brown. Iterative solutions of games by fictitious play. In Activity Analysis of Production and Allocation, 1951.

[5] Nikolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[6] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with Optimism. In Proceedings of ICLR, 2018.

[7] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr´eal, Canada., pages 9256–9266, 2018.

[8] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, pages 27:1–27:18, 2019.

[9] Oded Galor. Discrete Dynamical Systems. Springer, 2007.

[10] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint:1802.06132, 2018.

[11] Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. CoRR, abs/1807.02629, 2018.

[12] Mohammad Sal Moslehian. Ky fan inequalities. CoRR, abs/1108.1467, 2011.

[13] Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5874–5884, 2017.

[14] J. Robinson. An iterative method of solving a game. In Annals of Mathematics, pages 296–301, 1951.

[15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Annual Conference on Neural Information Processing Systems 2015, pages 2989–2997, 2015.

[16] J Von Neumann. Zur theorie der gesellschaftsspiele. In Math. Ann., pages 295–320, 1928.

A Equations of the Jacobian of OMWU

B Equations of the Jacobian of OMWU at the ﬁxed point (x∗, y∗, z∗, w∗)

In this section, we compute the equations of the Jacobian at the fixed point (). The fact that () takes the position of (x, y) in computing partial derivatives gives the following equations.

C Jacobian matrix at (x∗, y∗, z∗, w∗)

This section serves for the ”Spectral Analysis” of Section 3. The Jacobian matrix of g at the fixed point is obtained based on the calculations above. We refer the main article for the subscript indicating the size of each block matrix.

By acting on the tangent space of each simplex, we observe that = 0 for = 0,

so each eigenvalue of matrix J is an eigenvalue of the following matrix

The characteristic polynomial of is det() that can be computed as the determinant of the following matrix: