b

DiscoverSearch
About
My stuff
Learning in Non-convex Games with an Optimization Oracle
2018·arXiv
Abstract
Abstract

We consider online learning in an adversarial, non-convex setting under the assumption that the learner has an access to an offline optimization oracle. In the general setting of prediction with expert advice, [11] established that in the optimization-oracle model, online learning requires exponentially more computation than statistical learning. In this paper we show that by slightly strengthening the oracle model, the online and the statistical learning models become computationally equivalent. Our result holds for any Lipschitz and bounded (but not necessarily convex) function. As an application we demonstrate how the offline oracle enables efficient computation of an equilibrium in non-convex games, that include GAN (generative adversarial networks) as a special case.

The setting of online learning in games is a fundamental paradigm which allows formulation of tasks such as spam detection, online routing, online recommendation systems, and more [3, 9, 16]. A key feature of this model is the ability of the environments to evolve over time, possibly in an adversarial manner. Consequently, this framework can be used to produce more robust learners compared to the classic stationary and statistical learning framework. A fundamental question investigated in recent literature is whether this robustness comes with a computational price. While it is well-known that any efficient online learner can be transformed into an efficient statistical (or batch) learner [2], it is important to understand to what extent is the online model harder.

To enable a systematic comparison between the two models we must allow a reduction in the opposite direction. To this end we adopt the offline optimization oracle model suggested in [11], where the online learner submits a sequence of loss functions and the oracle returns any minimizer of the cumulative loss. For the well-established setting of learning with expert advice, [11] demonstrated an exponential gap between the oracle complexity in the online and the statistical settings.

In this paper we study the same question in the more general non-convex setting.1Deviating from [11], we allow the learner to linearly perturb the objective submitted to the oracle. Arguably, adding a linear term to a non-convex function should not increase the overall complexity of the oracle. Perhaps surprisingly, we show that this moderate modification renders the online adversarial setting computationally equivalent to the statistical setting. We show this by extending the powerful Follow-the-Perturbed-Leader (FTPL) meta-algorithm to the non-convex setting and derive a polynomial bound on its oracle complexity.

1.1 Setting and Main Result

1.1.1 Basic definitions and assumptions

Let  W Ď Rdbe the decision set (a.k.a. hypothesis space in the statistical setting) with  ℓ8-diameter at most D, and let  L Ď RWbe the set of all G-Lipschitz functions w.r.t. the  ℓ1-norm. We assume that both G and

D are polynomial in the ambient dimension  d.2

Consider the setting of online learning, where an online algorithm predicts a point  wt P Win iterative fashion and receives a feedback according to an adversarially chosen loss function  ℓt P L. The goal of the learner is to minimize the average regret, which is defined as the difference between the average loss of the learner and that of the best fixed point  w˚ P Win hindsight. We define the sample complexity as the number of rounds required for attaining expected average regret at most  ϵ.

The statistical setting differs from the online setting in two important aspects. a) We assume that the loss functions are drawn according to some unknown fixed distribution. b) The learner receives a sample of loss functions drawn according to the same distribution. Then it has to output a single predictor ˆw. The goal of the learner is minimize the expected excess risk, which is defined as  Eℓrℓp ˆwqs ´infwPW Eℓrℓpwqs. The sample complexity in this model is the size of a sample (of loss functions) that is required for attaining expected excess risk at most  ϵ.

1.1.2 The offline oracle model

In order to compare between the online and the statistical models, we assume an access to two types of oracles:

1. Value oracle whose input is a pair  pw, ℓq P W ˆ Land its output is  ℓpwq.

2. Offline optimization oracle whose input consists of a sequence of loss functions  pℓ1, . . . , ℓkq P Lkand a d-dimensional vector  σ, and its output is output has the form

image

We define the oracle complexity as

image

1.1.3 Main result

Our online Algorithm 1 applies the offline oracle with a random linear perturbation  σwhose coordinates are i.i.d. exponential random variables with parameter  η. Our main result can be stated as follows.

Theorem 1. The oracle complexity of Algorithm 1 is polypd, 1{ϵq.

Notably, both the loss functions and the domain W are not assumed to be convex. The oracle complexity in the statistical setting (under the same assumptions) is also polypd, 1{ϵq.3We thus conclude that both statistical and the online oracle complexities for non-convex learning setting are polynomially equivalent. We deduce the following game theoretic result:

Corollary 1. (informal) Convergence to equilibrium in two player zero-sum non-convex games is as hard as the corresponding offline best-response optimization problem.

We elaborate on this implication and specify it to GANs in Section 4.

1.2 Related Work

Follow-the-perturbed-leader. The ubiquitous Follow-the-Perturbed-Leader (FTPL) algorithm [8, 13] is the canonical example of using an optimization oracle: the algorithm returns the result of a single optimization oracle call per iteration. Since its introduction, an extensive study of FTPL has yielded new insights and efficient variants in various different settings (e.g. [10, 5, 17, 4]).

Online Convex Optimization. If the problem admits a convex structure, then the oracle complexity is polynomial in the dimension via bandit convex optimization [3, 9, 1]. If one considers the number of oracle calls to the optimization oracle only, and does not have access to a value oracle, then it is still possible to obtain a polynomial bound on the oracle complexity. This is due to the fact that online convex optimization reduces to online linear optimization [18], and this enables extension of FTPL to the convex case. However, this extension requires access to the gradient, which does not fall into our oracle model. We are not aware of any analysis of direct application of FTPL to a convex loss (i.e., without access to the gradients). In a sense, our treatment of the non-convex case gives the first direct analysis for FTPL to the convex case.

The experts setting: Overcoming the lower bound It is instructive to revisit the experts setting and understand why our result does not contradict the exponential lower bound of [11]. After all, one can easily embed the general experts problem in the d-dimensional hypercube for  d “ rlog Ns using the following standard technique:

1. Associate each vertex  z P t0, 1udwith some expert ipzq.

2. Associate each  x P r0, 1sdwith a random expert according to  ppzq “ śdi“1pzixi ` p1 ´ ziqp1 ´ xiqq. 3. Perform optimization over  r0, 1sd, where the loss of each  x P r0, 1sdis řzPt0,1ud ppzqℓpipzqq.

It can be verified that the parameters G and D are polynomial in d, as required. Consequently, our main result applies to this setting as well.

Crucially, unlike our oracle model, [11] does not allow a linear perturbation of the cumulative loss in this low-dimensional presentation. As it seems, this arguably moderate modification of the model rendered the offline-to-online reduction tractable.

Experts with low-dimensional structure. In the context of contextual bandits, [6] formulate abstract conditions under which the randomness can be shared between the experts, and allow efficient regret minimization in the oracle complexity model.

[7] study stability in non-convex settings, and bound the stability rate of ERM for strict saddle problems. In this paper we derive stable algorithms under much more moderate assumptions.

Generative adversarial networks. Several works have studied GANs in the regret minimization framework (e.g. [15, 14, 12]). We provide the first evidence that achieving equilibrium in GANs can be reduced to the offline problems associated with the players.

1.3 Overview and Techniques

1.3.1 Why standard approaches do not work?

A common approach which works well in the convex setting is to apply the Follow-the-Regularized-Leader (FTRL) with  ℓ2-regularization:

image

In the convex case  ℓ2-regularization stabilizes the solution by pushing it towards zero. However, we argue that in the non-convex setting, this approach does not help. To demonstrate this claim, consider a 1-dimensional setting, where the loss functions have the form  w ÞÑ pσpwxq ´ yq2, where  σpxq “maxtx, 0u is the ReLU function and  x P r´1, 1s, y P r0, 1s. Due to the ReLU term, the magnitude of the loss incurred by classifying x negatively is not important (i.e., there is no difference between  wx “ ´10´6and  wx “ ´1). Informally, if all x’s are bounded away from zero, we mostly care for the ratio between positive and negative examples. Therefore, adding  ℓ2-regularization does not make solutions near zero more appealing. It is not hard to formalize this argument and show that FTRL with  ℓ2(or  ℓ1) regularization can not yield sublinear regret.

1.3.2 Extending FTPL to the non-convex case

Our result is proved by extending the Follow-The-Perturbed-Leader algorithm to the non-convex setting. As we detail in the preliminaries section, online learnability requires algorithmic stability between consecutive rounds. For linear loss functions, [13] proved that linear perturbation of the loss stabilized the loss function itself, and consequently the minimizer is stable as well. The proof relies heavily on the fact that the perturbation and the loss function are of the same type.

In the non-convex case, we can not hope to stabilize the loss itself using a linear perturbation. Nevertheless, our main contribution is to establish that the randomness injected by FTPL does stabilize the predictions of the learner. We prove this result by investigating how the outputs of FTPL change as we vary the the noise vector  σ P Rdě0. In the 1-dimensional case, this investigation yields a useful monotonicity property which helps us bounding the expected distance between consecutive minimizers. While the general d-dimensional introduces some challenges, we are able to effectively reduce the analysis to the 1-dimensional setting by varying each coordinate of the noise separately.

2.1 Online to batch conversion

The following well-known result due to [2] tells us that the online sample complexity dominates the batch sample complexity. The intuition that online learning is at least as hard as batch learning is formalized by the following online-to-batch theorem.

Theorem 2. [2] Suppose that A is an online learner with ErRegretT sT ď ϵpTqfor any T. Consider the following algorithm for the batch setting: given a sample  pℓ1, . . . , ℓnq „ Pn, the algorithm applies A to the sample in an online manner. Thereafter, it draws a random round j P rns uniformly at random and returns ˆw “ wj. Then the expected excess risk of the algorithm,  ErLp ˆwqs ´ Lpw‹q, is at most  ϵpTq.

2.2 Online learning via stability

The main challenge in online learning stems from the fact that the learner has to make a decision before observing the adversarial action. Intuitively, we expect that the performance after shifting the actions of the learner by one step (i.e. considering the loss  ℓtpwt`1qrather than  ℓtpwtq) to be optimal. This view suggests that online learning is all about balancing between optimal performance w.r.t. previous rounds and ensuring stability between consecutive rounds. Similarly to the statistical setting, the most common algorithmic tool for achieving stability is regularization. In particular, the well-established Follow-the-Regularized-Leader is a meta-algorithm whose instances are determined by choosing a concrete regularization function. Precisely, given a regularizer  R : Rd Ñ R, the t-iterate of the algorithm is

image

The next well-known lemma provides a systematic approach for analyzing Follow-the-Regularized-Leader-type algorithms.

Lemma 1. (FTL-BTL [13]) The regret of Follow-the-Regularized-Leader is at most

image

2.3 The exponential distribution

We use the following properties of the exponential distribution.

Lemma 2. Let X be an exponential random variable with parameter  η.4The following properties hold: a) for any  s P R, PpX ě sq “expp´ηsq. b) Memorylessness: for any  s, q P R, PpX ě q ` s|X ě qq “ PpX ě sq. c) if  X1, . . . , Xdare i.i.d. with  Xi „ Exppηq, then  Er}pX1, . . . , Xdq}8s ď η´1plogpdq ` 1q.

In this section we present and analyze the non-convex FTPL method presented in Algorithm 1. Our analysis completes the proof of our main theorem (Theorem 1). Along the proof we distinguish between the one-dimensional and the general d-dimensional case. For the former case we obtain better regret bound in terms of the dependence on the horizon parameter T. Omitted proofs are provided in the Appendix.

image

3.1 Reduction to oblivious setting

To simplify the presentation we make the following standard modification:

1. The adversary is oblivious in the sense that the sequence  pℓtqTt“1is chosen in advance.

2. This allows us to analyze a slightly different algorithm which draws only a single noise vector  σ „Exppηqdrather than drawing a fresh noise vector on every round.

It follows from [3][Lemma 4.1] that proving regret bounds for this variant translates into asymptotically equivalent (expected) regret bounds for non-oblivious adversaries using Algorithm 1.

3.2 Main Lemma

Throughout this section we use the notation  wtpσqto emphasize that  wtas defined in (1), is determined by the noise vector  σ. Following Lemma 1 we would like to to establish a bound on the expected instability at time t, i.e.  Erℓtpwtpσqq ´ ℓtpwt`1pσqqs. This is bounded above by  G ¨ E}wtpσq ´ wt`1pσq}1. Note that the distance between  wtand  wt`1is ill-defined since both  wtand  wt`1are not unique. However, as we show below, we will be able to derive a uniform bound on the distance between any consecutive minimizers for every choice of minimizers. Note that this not really needed. As we are primarily interested in stability with respect to the function value, we can make any assumptions on the tie-breaking mechanism. However, we found it both interesting and surprisingly easier to prove the stronger result.

Lemma 3. Fix an iteration t and let  δ ą0 be a margin parameter. There is a tie-breaking rule for choosing minimizers such that  Er}wtpσq ´ wt`1pσq}1s “ O´polypdqηδ ` dδ¯. In the one-dimensional case we obtain the improved bound  Er|wt ´ wt`1|s “ Opηq.

Proof. (of Theorem 1) We start with the multidimensional case. Applying the FTL-BTL lemma (Lemma 1) with the regularizer  Rpwq “ ´σJwand using H¨older inequality, we obtain

image

For the multidimensional case, we use that  Er}σ}8s ď η´1plog d ` 1q, D, G P polypdq, and apply Lemma 3

to obtain

image

By setting  η “ T ´2{3and  δ “ T ´1{3, we obtain the regret bound ErRegretT s ď OpT 2{3polypdqq. Online-to-

batch conversion yields a sample complexity bound of  O´polypdqϵ3 ¯.

In the 1-dimensional case we simply set  η “ T ´1{2to obtain ErRegretT s “ OpT 1{2q. This translates into a sample complexity bound of  O´polypdqϵ2 ¯.

3.3 Proof of Lemma 3

We begin with the following lemma which provides a bound on the gap between minimizers with respect to the change in noise parameter  σ.

Lemma 4. For any two functions  f1, f2 : W Ñ Rand vectors  σ1, σ2 P Rd, let

image

Letting  f “ f1 ´ f2and  σ “ σ1 ´ σ2, we have that

image

Proof. Using optimality conditions for  wipσiq, we have that

image

Adding the above two inequalities and rearranging finishes the proof.

We now provide the proof of Lemma 3 in the two considered cases.

Proof. (of Lemma 3: one-dimensional case) We wish to use Lemma 4. For any time t, consider substituting  f1pwq “ řiăt lipwq, f2pwq “ řiăt`1 lipwq, σ2 “ σand  σ1 “ σ1 fi σ ` 2G. We immediately get that

image

Using the fact that  ltis G-lipschitz, we get that

image

which immediately implies that  wtpσ1q ě wt`1pσq. Similar calculations show that  wt`1pσ1q ě wtpσqand wtpσ1q ě wtpσq.

For the rest of the proof we will omit the dependence on t as it will be clear from context. We denote by wminpσq “mintwtpσq, wt`1pσuu, wmaxpσq “maxtwtpσq, wt`1pσqu. First we observe that

image

image

Figure 1: Illustration of the monotonicity property used in the Proof of Lemma 3: The unperturbed minimizer of  Lt(solid line), denoted  wt`1p0q, can be significantly smaller than the unperturbed minimizer of  Lt´1(dashed line),  wtp0q. This can be balanced by increasing the noise parameter corresponding to  wt`1.

Secondly the computation above implies that

image

This powerful monotonicity property (see Figure 1) is now used to lower bound  Erwminpσqsin terms Erwmaxpσqs. Letting  σ1 “ σ ` 2G, we have

image

where the second inequality uses Equation 3 and the last inequality uses the inequality exppxq ě 1 ` x.

Proof. (of Lemma 3:multiple dimensions) Once again we wish to use Lemma 4. For any time t and any coordinate k, consider substituting  f1pwq “ řiăt lipwq, f2pwq “ řiăt`1 lipwq, σ2 “ σand  σ1 “ σ1 fiσ ` 3Bδ´1 ¨ ek(where  ekis the  kthvector in the canonical basis). We immediately get that

image

where  wt,kis the k-th coordinate of  wt. Using the fact that the range of  ltis  r´B, Bs, we get that

image

which immediately implies that  wt,kpσ1q ě wt`1,kpσq´δ. A similar calculation also derives that  wt`1,kpσ1q ěwt,kpσq ´ δ.

Now for any k P rds, let  wk,minpσq “mintwt,kpσq, wt`1,kpσqu, wmaxpσq “maxtwt,kpσq, wt`1,kpσqu. First we observe that

image

Secondly the calculation above implies that for all k

image

Now fix a coordinate k P rds along with all noise coordinates  σjfor  j ‰ k. Denote by  E´kthe corresponding conditional expectation. Up to the additional margin term  δ, lower bounding  E´krwk,minsin terms of  E´krwk,maxsreduces to the one-dimensional case; letting  q “ 3Bδ´1and  µpxq “ ηexpp´ηxq, we have

image

The second inequality uses Equation 4 and the last inequality follows by substituting  q “ 3Bδ´1and using the inequality exppxq ě 1 ` x. Since the above holds for any fixed  σ´k “ pσjqj‰k, the unconditioned expectations also satisfy

image

Summing over all coordinates we conclude the bound.

Consider the following formulation of a non-convex zero-sum game. Let  F : X ˆ Y Ñ R, where  X, Y Ď Rdare compact with diameter at most D. The x-th player wishes to minimize F and whereas the y-th player wishes to maximize F. We assume that for all x P X and y P Y, both  Fp¨, yqand  ´Fpx, ¨qare G-Lipschitz and B-bounded. A known approach for achieving equilibrium is to apply (for each of the players) an online method with vanishing average regret. Precisely, on each round t both players choose a pair  pxt, ytqwhich induces the losses  Fpxt, ytqand  ´Fpxt, ytq, respectively. Finally, we draw a random index rjs P rTs and output the pair  pˆx, ˆyq fi pxj, yjq. By endowing the players with access to an offline oracle and playing according to non-convex FTPL we can reach approximate equilibrium.

Theorem 3. Suppose that both the x-player and the y-player have an access to an offline oracle and play according to non-convex FTPL (Algorithm 1). Given  ϵ ą0, let T P polypdq{ϵ3such that the expected average regret of non-convex FTPL is at most  ϵ. Then,  pˆx, ˆyqforms an  ϵ-approximated equilibrium, i.e., for any x P X and y P Y,

image

Note that the players can use their offline oracle to amplify their confidence and achieve an equilibrium with high probability. The proof is provided in the appendix.

4.1 Implication to GANs

In particular, we consider the case where the x-th player is a generator, who produces synthetic samples (e.g. images), whereas the y-th player acts as a discriminator by assigning scores to samples reflecting the probability of being generated from the true distribution. Formally, by choosing a parameter x P X and drawing a random noise z, the x-th player produces a sample denote  Gxpzq. Conversely, the y-th player chooses a parameter y P Y and assign the score  DypGxpzqq P r0, 1sto the sample  Gxpzq. The function F usually corresponds to the log-likelihood of mistakenly assigning an high score to a synthetic example and vice versa. It is reasonable to assume that F is Lipschitz and bounded w.r.t. the network parameters. As a result, efficient convergence to GANs is established by assuming an access to an offline oracle.

Our work establishes a computational equivalence between online and statistical learning in the non-convex setting. We shed light on the hardness result of [11] by demonstrating that online learning is significantly more difficult than statistical learning only when no structure is assumed.

One interesting direction for further investigation is to refine the comparison model and study the polynomial dependencies more carefully. One obvious question is to understand the gap in terms of the horizon parameter T between the regret bounds for the one-dimensional and the multidimensional settings.

We thank Karan Singh for recognizing a bug in our original proof and several discussions. We also thank Alon Cohen and Roi Livni for fruitful discussions. Elad Hazan acknowledges funding from NSF award Number 1704860.

[1] S´ebastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi- armed bandit problems.  Foundations and Trends R⃝ in Machine Learning, 5(1):1–122, 2012.

[2] Nicol`o Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions. IEEE Transactions on Information Theory, 50:2050— -2057, 2004.

[3] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge university press, 2006.

[4] Alon Cohen and Tamir Hazan. Following the perturbed leader for online structured learning. In International Conference on Machine Learning, pages 1034–1042, 2015.

[5] Luc Devroye, G´abor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In Conference on Learning Theory, pages 460–473, 2013.

[6] Miroslav Dudik, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jen- nifer Wortman Vaughan. Oracle-efficient online learning and auction design. In Annual Symposium on Foundations of Computer Science - Proceedings, 2017.

[7] Alon Gonen and Shai Shalev-Shwartz. Fast Rates for Empirical Risk Minimization of Strict Saddle Problems. In Ohad Shamir and Satyen Kale, editors, Proceedings of the 2017 Conference on Learning Theory, pages 1043—-1063. PMLR, 2017.

[8] James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[9] Elad Hazan. Introduction to Online Convex Optimization.  Foundations and Trends R⃝ in Optimization, 2(3-4):157–325, 2016.

[10] Elad Hazan and Satyen Kale. Online submodular minimization. Journal of Machine Learning Research, 13(Oct):2903–2922, 2012.

[11] Elad Hazan and Tomer Koren. The Computational Power of Optimization in Online Learning. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 128–141. ACM, 2016.

[12] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In International Conference on Machine Learning, pages 1433–1441, 2017.

[13] Adam Kalai and Santosh Vempala. Efficient Algorithms for Online Decision Problems. Journal of Computer and System Sciences, 2004.

[14] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On Convergence and Stability of GANs. arXiv preprint arXiv:1705.07215, 2017.

[15] Dale Schuurmans and Martin A Zinkevich. Deep learning games. In Advances in Neural Information Processing Systems, pages 1678–1686, 2016.

[16] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization.  Foundations and Trends R⃝ inMachine Learning, 4(2):107–194, 2011.

[17] Tim Van Erven, Wojciech Kot�lowski, and Manfred K Warmuth. Follow the leader with dropout per- turbations. In Conference on Learning Theory, pages 949–974, 2014.

[18] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

Proof. (of Theorem 3) For all y P Y,

image

Similarly, for all x P X,

image


Designed for Accessibility and to further Open Science