The setting of online learning in games is a fundamental paradigm which allows formulation of tasks such as spam detection, online routing, online recommendation systems, and more [3, 9, 16]. A key feature of this model is the ability of the environments to evolve over time, possibly in an adversarial manner. Consequently, this framework can be used to produce more robust learners compared to the classic stationary and statistical learning framework. A fundamental question investigated in recent literature is whether this robustness comes with a computational price. While it is well-known that any efficient online learner can be transformed into an efficient statistical (or batch) learner [2], it is important to understand to what extent is the online model harder.
To enable a systematic comparison between the two models we must allow a reduction in the opposite direction. To this end we adopt the offline optimization oracle model suggested in [11], where the online learner submits a sequence of loss functions and the oracle returns any minimizer of the cumulative loss. For the well-established setting of learning with expert advice, [11] demonstrated an exponential gap between the oracle complexity in the online and the statistical settings.
In this paper we study the same question in the more general non-convex setting.Deviating from [11], we allow the learner to linearly perturb the objective submitted to the oracle. Arguably, adding a linear term to a non-convex function should not increase the overall complexity of the oracle. Perhaps surprisingly, we show that this moderate modification renders the online adversarial setting computationally equivalent to the statistical setting. We show this by extending the powerful Follow-the-Perturbed-Leader (FTPL) meta-algorithm to the non-convex setting and derive a polynomial bound on its oracle complexity.
1.1 Setting and Main Result
1.1.1 Basic definitions and assumptions
Let be the decision set (a.k.a. hypothesis space in the statistical setting) with
-diameter at most D, and let
be the set of all G-Lipschitz functions w.r.t. the
-norm. We assume that both G and
D are polynomial in the ambient dimension
Consider the setting of online learning, where an online algorithm predicts a point in iterative fashion and receives a feedback according to an adversarially chosen loss function
. The goal of the learner is to minimize the average regret, which is defined as the difference between the average loss of the learner and that of the best fixed point
in hindsight. We define the sample complexity as the number of rounds required for attaining expected average regret at most
.
The statistical setting differs from the online setting in two important aspects. a) We assume that the loss functions are drawn according to some unknown fixed distribution. b) The learner receives a sample of loss functions drawn according to the same distribution. Then it has to output a single predictor ˆw. The goal of the learner is minimize the expected excess risk, which is defined as inf
. The sample complexity in this model is the size of a sample (of loss functions) that is required for attaining expected excess risk at most
.
1.1.2 The offline oracle model
In order to compare between the online and the statistical models, we assume an access to two types of oracles:
1. Value oracle whose input is a pair and its output is
.
2. Offline optimization oracle whose input consists of a sequence of loss functions and a d-dimensional vector
, and its output is output has the form
We define the oracle complexity as
1.1.3 Main result
Our online Algorithm 1 applies the offline oracle with a random linear perturbation whose coordinates are i.i.d. exponential random variables with parameter
. Our main result can be stated as follows.
Theorem 1. The oracle complexity of Algorithm 1 is poly.
Notably, both the loss functions and the domain W are not assumed to be convex. The oracle complexity in the statistical setting (under the same assumptions) is also polyWe thus conclude that both statistical and the online oracle complexities for non-convex learning setting are polynomially equivalent. We deduce the following game theoretic result:
Corollary 1. (informal) Convergence to equilibrium in two player zero-sum non-convex games is as hard as the corresponding offline best-response optimization problem.
We elaborate on this implication and specify it to GANs in Section 4.
1.2 Related Work
Follow-the-perturbed-leader. The ubiquitous Follow-the-Perturbed-Leader (FTPL) algorithm [8, 13] is the canonical example of using an optimization oracle: the algorithm returns the result of a single optimization oracle call per iteration. Since its introduction, an extensive study of FTPL has yielded new insights and efficient variants in various different settings (e.g. [10, 5, 17, 4]).
Online Convex Optimization. If the problem admits a convex structure, then the oracle complexity is polynomial in the dimension via bandit convex optimization [3, 9, 1]. If one considers the number of oracle calls to the optimization oracle only, and does not have access to a value oracle, then it is still possible to obtain a polynomial bound on the oracle complexity. This is due to the fact that online convex optimization reduces to online linear optimization [18], and this enables extension of FTPL to the convex case. However, this extension requires access to the gradient, which does not fall into our oracle model. We are not aware of any analysis of direct application of FTPL to a convex loss (i.e., without access to the gradients). In a sense, our treatment of the non-convex case gives the first direct analysis for FTPL to the convex case.
The experts setting: Overcoming the lower bound It is instructive to revisit the experts setting and understand why our result does not contradict the exponential lower bound of [11]. After all, one can easily embed the general experts problem in the d-dimensional hypercube for log Ns using the following standard technique:
1. Associate each vertex with some expert ipzq.
2. Associate each with a random expert according to
. 3. Perform optimization over
, where the loss of each
is
.
It can be verified that the parameters G and D are polynomial in d, as required. Consequently, our main result applies to this setting as well.
Crucially, unlike our oracle model, [11] does not allow a linear perturbation of the cumulative loss in this low-dimensional presentation. As it seems, this arguably moderate modification of the model rendered the offline-to-online reduction tractable.
Experts with low-dimensional structure. In the context of contextual bandits, [6] formulate abstract conditions under which the randomness can be shared between the experts, and allow efficient regret minimization in the oracle complexity model.
[7] study stability in non-convex settings, and bound the stability rate of ERM for strict saddle problems. In this paper we derive stable algorithms under much more moderate assumptions.
Generative adversarial networks. Several works have studied GANs in the regret minimization framework (e.g. [15, 14, 12]). We provide the first evidence that achieving equilibrium in GANs can be reduced to the offline problems associated with the players.
1.3 Overview and Techniques
1.3.1 Why standard approaches do not work?
A common approach which works well in the convex setting is to apply the Follow-the-Regularized-Leader (FTRL) with -regularization:
In the convex case -regularization stabilizes the solution by pushing it towards zero. However, we argue that in the non-convex setting, this approach does not help. To demonstrate this claim, consider a 1-dimensional setting, where the loss functions have the form
, where
maxtx, 0u is the ReLU function and
. Due to the ReLU term, the magnitude of the loss incurred by classifying x negatively is not important (i.e., there is no difference between
and
1). Informally, if all x’s are bounded away from zero, we mostly care for the ratio between positive and negative examples. Therefore, adding
-regularization does not make solutions near zero more appealing. It is not hard to formalize this argument and show that FTRL with
(or
) regularization can not yield sublinear regret.
1.3.2 Extending FTPL to the non-convex case
Our result is proved by extending the Follow-The-Perturbed-Leader algorithm to the non-convex setting. As we detail in the preliminaries section, online learnability requires algorithmic stability between consecutive rounds. For linear loss functions, [13] proved that linear perturbation of the loss stabilized the loss function itself, and consequently the minimizer is stable as well. The proof relies heavily on the fact that the perturbation and the loss function are of the same type.
In the non-convex case, we can not hope to stabilize the loss itself using a linear perturbation. Nevertheless, our main contribution is to establish that the randomness injected by FTPL does stabilize the predictions of the learner. We prove this result by investigating how the outputs of FTPL change as we vary the the noise vector . In the 1-dimensional case, this investigation yields a useful monotonicity property which helps us bounding the expected distance between consecutive minimizers. While the general d-dimensional introduces some challenges, we are able to effectively reduce the analysis to the 1-dimensional setting by varying each coordinate of the noise separately.
2.1 Online to batch conversion
The following well-known result due to [2] tells us that the online sample complexity dominates the batch sample complexity. The intuition that online learning is at least as hard as batch learning is formalized by the following online-to-batch theorem.
Theorem 2. [2] Suppose that A is an online learner with for any T. Consider the following algorithm for the batch setting: given a sample
, the algorithm applies A to the sample in an online manner. Thereafter, it draws a random round j P rns uniformly at random and returns ˆ
. Then the expected excess risk of the algorithm,
, is at most
.
2.2 Online learning via stability
The main challenge in online learning stems from the fact that the learner has to make a decision before observing the adversarial action. Intuitively, we expect that the performance after shifting the actions of the learner by one step (i.e. considering the loss rather than
) to be optimal. This view suggests that online learning is all about balancing between optimal performance w.r.t. previous rounds and ensuring stability between consecutive rounds. Similarly to the statistical setting, the most common algorithmic tool for achieving stability is regularization. In particular, the well-established Follow-the-Regularized-Leader is a meta-algorithm whose instances are determined by choosing a concrete regularization function. Precisely, given a regularizer
, the t-iterate of the algorithm is
The next well-known lemma provides a systematic approach for analyzing Follow-the-Regularized-Leader-type algorithms.
Lemma 1. (FTL-BTL [13]) The regret of Follow-the-Regularized-Leader is at most
2.3 The exponential distribution
We use the following properties of the exponential distribution.
Lemma 2. Let X be an exponential random variable with parameter The following properties hold: a) for any
exp
. b) Memorylessness: for any
. c) if
are i.i.d. with
, then
logpdq ` 1q.
In this section we present and analyze the non-convex FTPL method presented in Algorithm 1. Our analysis completes the proof of our main theorem (Theorem 1). Along the proof we distinguish between the one-dimensional and the general d-dimensional case. For the former case we obtain better regret bound in terms of the dependence on the horizon parameter T. Omitted proofs are provided in the Appendix.
3.1 Reduction to oblivious setting
To simplify the presentation we make the following standard modification:
1. The adversary is oblivious in the sense that the sequence is chosen in advance.
2. This allows us to analyze a slightly different algorithm which draws only a single noise vector rather than drawing a fresh noise vector on every round.
It follows from [3][Lemma 4.1] that proving regret bounds for this variant translates into asymptotically equivalent (expected) regret bounds for non-oblivious adversaries using Algorithm 1.
3.2 Main Lemma
Throughout this section we use the notation to emphasize that
as defined in (1), is determined by the noise vector
. Following Lemma 1 we would like to to establish a bound on the expected instability at time t, i.e.
. This is bounded above by
. Note that the distance between
and
is ill-defined since both
and
are not unique. However, as we show below, we will be able to derive a uniform bound on the distance between any consecutive minimizers for every choice of minimizers. Note that this not really needed. As we are primarily interested in stability with respect to the function value, we can make any assumptions on the tie-breaking mechanism. However, we found it both interesting and surprisingly easier to prove the stronger result.
Lemma 3. Fix an iteration t and let 0 be a margin parameter. There is a tie-breaking rule for choosing minimizers such that
. In the one-dimensional case we obtain the improved bound
.
Proof. (of Theorem 1) We start with the multidimensional case. Applying the FTL-BTL lemma (Lemma 1) with the regularizer and using H¨older inequality, we obtain
For the multidimensional case, we use that log d ` 1q, D, G P polypdq, and apply Lemma 3
to obtain
By setting and
, we obtain the regret bound ErRegret
polypdqq. Online-to-
batch conversion yields a sample complexity bound of .
In the 1-dimensional case we simply set to obtain ErRegret
. This translates into a sample complexity bound of
.
3.3 Proof of Lemma 3
We begin with the following lemma which provides a bound on the gap between minimizers with respect to the change in noise parameter .
Lemma 4. For any two functions and vectors
, let
Letting and
, we have that
Proof. Using optimality conditions for , we have that
Adding the above two inequalities and rearranging finishes the proof.
We now provide the proof of Lemma 3 in the two considered cases.
Proof. (of Lemma 3: one-dimensional case) We wish to use Lemma 4. For any time t, consider substituting and
. We immediately get that
Using the fact that is G-lipschitz, we get that
which immediately implies that . Similar calculations show that
and
.
For the rest of the proof we will omit the dependence on t as it will be clear from context. We denote by min
max
. First we observe that
Figure 1: Illustration of the monotonicity property used in the Proof of Lemma 3: The unperturbed minimizer of (solid line), denoted
, can be significantly smaller than the unperturbed minimizer of
(dashed line),
. This can be balanced by increasing the noise parameter corresponding to
.
Secondly the computation above implies that
This powerful monotonicity property (see Figure 1) is now used to lower bound in terms
. Letting
, we have
where the second inequality uses Equation 3 and the last inequality uses the inequality exp.
Proof. (of Lemma 3:multiple dimensions) Once again we wish to use Lemma 4. For any time t and any coordinate k, consider substituting and
(where
is the
vector in the canonical basis). We immediately get that
where is the k-th coordinate of
. Using the fact that the range of
is
, we get that
which immediately implies that . A similar calculation also derives that
.
Now for any k P rds, let min
max
. First we observe that
Secondly the calculation above implies that for all k
Now fix a coordinate k P rds along with all noise coordinates for
. Denote by
the corresponding conditional expectation. Up to the additional margin term
, lower bounding
in terms of
reduces to the one-dimensional case; letting
and
exp
, we have
The second inequality uses Equation 4 and the last inequality follows by substituting and using the inequality exp
. Since the above holds for any fixed
, the unconditioned expectations also satisfy
Summing over all coordinates we conclude the bound.
Consider the following formulation of a non-convex zero-sum game. Let , where
are compact with diameter at most D. The x-th player wishes to minimize F and whereas the y-th player wishes to maximize F. We assume that for all x P X and y P Y, both
and
are G-Lipschitz and B-bounded. A known approach for achieving equilibrium is to apply (for each of the players) an online method with vanishing average regret. Precisely, on each round t both players choose a pair
which induces the losses
and
, respectively. Finally, we draw a random index rjs P rTs and output the pair
. By endowing the players with access to an offline oracle and playing according to non-convex FTPL we can reach approximate equilibrium.
Theorem 3. Suppose that both the x-player and the y-player have an access to an offline oracle and play according to non-convex FTPL (Algorithm 1). Given 0, let T P poly
such that the expected average regret of non-convex FTPL is at most
. Then,
forms an
-approximated equilibrium, i.e., for any x P X and y P Y,
Note that the players can use their offline oracle to amplify their confidence and achieve an equilibrium with high probability. The proof is provided in the appendix.
4.1 Implication to GANs
In particular, we consider the case where the x-th player is a generator, who produces synthetic samples (e.g. images), whereas the y-th player acts as a discriminator by assigning scores to samples reflecting the probability of being generated from the true distribution. Formally, by choosing a parameter x P X and drawing a random noise z, the x-th player produces a sample denote . Conversely, the y-th player chooses a parameter y P Y and assign the score
to the sample
. The function F usually corresponds to the log-likelihood of mistakenly assigning an high score to a synthetic example and vice versa. It is reasonable to assume that F is Lipschitz and bounded w.r.t. the network parameters. As a result, efficient convergence to GANs is established by assuming an access to an offline oracle.
Our work establishes a computational equivalence between online and statistical learning in the non-convex setting. We shed light on the hardness result of [11] by demonstrating that online learning is significantly more difficult than statistical learning only when no structure is assumed.
One interesting direction for further investigation is to refine the comparison model and study the polynomial dependencies more carefully. One obvious question is to understand the gap in terms of the horizon parameter T between the regret bounds for the one-dimensional and the multidimensional settings.
We thank Karan Singh for recognizing a bug in our original proof and several discussions. We also thank Alon Cohen and Roi Livni for fruitful discussions. Elad Hazan acknowledges funding from NSF award Number 1704860.
[1] S´ebastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi- armed bandit problems. , 5(1):1–122, 2012.
[2] Nicol`o Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions. IEEE Transactions on Information Theory, 50:2050— -2057, 2004.
[3] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge university press, 2006.
[4] Alon Cohen and Tamir Hazan. Following the perturbed leader for online structured learning. In International Conference on Machine Learning, pages 1034–1042, 2015.
[5] Luc Devroye, G´abor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In Conference on Learning Theory, pages 460–473, 2013.
[6] Miroslav Dudik, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jen- nifer Wortman Vaughan. Oracle-efficient online learning and auction design. In Annual Symposium on Foundations of Computer Science - Proceedings, 2017.
[7] Alon Gonen and Shai Shalev-Shwartz. Fast Rates for Empirical Risk Minimization of Strict Saddle Problems. In Ohad Shamir and Satyen Kale, editors, Proceedings of the 2017 Conference on Learning Theory, pages 1043—-1063. PMLR, 2017.
[8] James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
[9] Elad Hazan. Introduction to Online Convex Optimization. , 2(3-4):157–325, 2016.
[10] Elad Hazan and Satyen Kale. Online submodular minimization. Journal of Machine Learning Research, 13(Oct):2903–2922, 2012.
[11] Elad Hazan and Tomer Koren. The Computational Power of Optimization in Online Learning. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 128–141. ACM, 2016.
[12] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In International Conference on Machine Learning, pages 1433–1441, 2017.
[13] Adam Kalai and Santosh Vempala. Efficient Algorithms for Online Decision Problems. Journal of Computer and System Sciences, 2004.
[14] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On Convergence and Stability of GANs. arXiv preprint arXiv:1705.07215, 2017.
[15] Dale Schuurmans and Martin A Zinkevich. Deep learning games. In Advances in Neural Information Processing Systems, pages 1678–1686, 2016.
[16] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Machine Learning, 4(2):107–194, 2011.
[17] Tim Van Erven, Wojciech Kot�lowski, and Manfred K Warmuth. Follow the leader with dropout per- turbations. In Conference on Learning Theory, pages 949–974, 2014.
[18] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.