Lipschitz and Comparator-Norm Adaptivity in Online Learning

2020·Arxiv

Abstract

Abstract

We study Online Convex Optimization in the unbounded setting where neither predictions nor gradient are constrained. The goal is to simultaneously adapt to both the sequence of gradients and the comparator. We first develop parameter-free and scale-free algorithms for a simplified setting with hints. We present two versions: the first adapts to the squared norms of both comparator and gradients separately using Opdq time per round, the second adapts to their squared inner products (which measure variance only in the comparator direction) in time per round. We then generalize two prior reductions to the unbounded setting; one to not need hints, and a second to deal with the range ratio problem (which already arises in prior work). We discuss their optimality in light of prior and new lower bounds. We apply our methods to obtain sharper regret bounds for scale-invariant online prediction with linear models.

Keywords: Online Convex Optimization, Parameter-Free Online Learning, Scale-Invariant Online Algorithms

1. Introduction

We consider the setting of online convex optimization where the goal is to make sequential predictions to minimize a certain notion of regret. Specifically, at the beginning of each round , a learner predicts in some convex set in dimension d P N. The environment then reveals a convex loss function , and the learner suffers loss . The goal of the learner is to minimize the regret after rounds against any “comparator” prediction w P W. Typically, an online learning algorithm outputs a vector , based only on a sequence of observed sub-gradients , where . In this paper, we are interested in online algorithms which can guarantee a good regret bound (by a measure which we will make precise below) against any comparator vector w P W, even when W is unbounded, and without prior knowledge of the maximum norm of the observed sub-gradients. In what follows, we refer to L as the Lipschitz constant.

By assuming an upper-bound on the norm of the desired comparator vector w in hindsight, there exist Lipschitz-adaptive algorithms that can achieve a sub-linear regret of order LD?T, without knowing L in advance. A Lipschitz-adaptive algorithm is also called scale-free (or scale-invariant) if its predictions do not change when the loss functions are multiplied by a factor ; in this case, its regret bound is expected to scale by the same factor c. When L is known in advance and , there exists another type of algorithms, so-called parameter-free, which can achieve an rOp}w}L?Tq regret bound, where w is the desired comparator vector in hindsight (the notation rO hides log-factors). Up to an additive lower-order term, this type of regret bound is also achievable for bounded W via the unconstrained-to-constrained reduction (Cutkosky, 2019).

The question of whether an algorithm can simultaneously be scale-free and parameter-free was posed as an open problem by Orabona and P´al (2016b). It was latter answered in the negative by Cutkosky and Boahen (2017). Nevertheless, Cutkosky (2019) recently presented algorithms which achieve an regret bound, without knowing either L or . This does not violate the earlier lower bound of Cutkosky and Boahen (2017), which insists on norm dependence

Though Cutkosky (2019) designs algorithms that can to some extent adapt to both L and , their algorithms are still not scale-free. Multiplying , and as a result , by a positive factor changes the outputs of their algorithms, and their regret bounds scale by a factor necessarily equal to c. Their algorithms depend on a parameter which has to be specified in advance. This parameter appears in their regret bounds as an additive term and also in a logarithmic term of the form . As a result of this type of dependence on and the fact that , there is no prior choice of which can make their regret bounds scale-invariant. What is more, without knowing L, there is also no “safe” choice of which can prevent the term from becoming arbitrarily large relative to L (it suffices for to be small enough relative to the “unknown” L).

Contributions. Our main contribution is a new scale-free, parameter-free learning algorithm for OCO with regret at most , for any comparator w P W in a bounded set W, where . When the set W is unbounded, the algorithm achieves the same guarantee up to an additive for all t P rTs. In the latter case, we also show a matching lower bound; when W is unbounded and without knowing L, any online learning algorithm which insists on an rOp?Tq bound, has regret at least . We also provide a second scale-invariant algorithm which replaces the leading term in the regret bound of our first algorithm by . Our starting point for designing our algorithms is a known potential function which we show to be controlled for a unique choice of output sequence

As our main application, we show how our algorithms can be applied to learn linear models. The result is an online algorithm for learning linear models whose label predictions are invariant to coordinate-wise scaling of the input feature vectors. The regret bound of the algorithm is naturally also scale-invariant and improves on the bounds of existing state-of-the-art algorithms in this setting (Kotłowski, 2017; Kempka et al., 2019).

Related Work For an overview of Online Convex Optimization in the bounded setting, we refer to the textbook (Hazan, 2016). The unconstrained case was first studied by McMahan and Streeter (2010). A powerful methodology for the unbounded case is Coin Betting by Orabona and P´al (2016a). Even though not always visible, our potential functions are inspired by this style of thinking. We build our unbounded OCO learner by targeting a specific other constrained problem. We further employ several general reductions from the literature, including gradient clipping Cutkosky (2019), the constrained-to-unconstrained reduction Cutkosky and Orabona (2018), and the restart wrapper to pacify the final-vs-initial scale ratio appearing inside logarithms by Mhammedi et al. (2019). Our analysis is, at its core, proving a certain minimax result about sufficient-statistic-based potentials reminiscent of the Burkholder approach pioneered by Foster et al. (2017, 2018). Applications for scale-invariant learning in linear models were studied by Kempka et al. (2019). For our multidimensional learner we took inspiration from the Gaussian Exp-concavity step in the analysis of the MetaGrad algorithm by Van Erven and Koolen (2016).

Outline In Section 2, we present the setting and notation, and formulate our goal. In Section 3, we present our main algorithms. In Section 4, we present new lower-bounds for algorithms which adapt to both the Lipschitz constant and the norm of the comparator. In Section 5, we apply our algorithms to online prediction with linear models.

2. Preliminaries

Our goal is to design scale-free algorithms that adapt to the Lipschitz constant L and comparator norm . We will first introduce the setting, then discuss existing reductions, and finally state what needs to be done to achieve our goal.

2.1. Setting and Notation

Let be a convex set, and assume without loss of generality that 0 P W. We allow the set W to be unbounded, and we define its (possibly infinite) diameter r0, `8s. We consider the setting of Online Convex Optimization (OCO) where at the beginning of each round , the learner outputs a prediction , before observing a convex loss function , or an element of its sub-gradient at . The goal of the learner is to minimize the regret after rounds, which is given by

for any comparator vector w P W. In this paper, we do not assume that T is known to the learner, and so we are after algorithms with so called any-time guarantees. By convexity, we have

and thus for the purpose of minimizing the regret, typical OCO algorithms minimize the RHS of (1), which is known as the linearized regret, by generating outputs based on the sequence of observed sub-gradients . Likewise, we focus our attention exclusively on linear optimization.

Given a sequence of sub-gradients , it will be useful to define the running maximum gradient norm and the clipped sub-gradients

for , with the convention that . We also drop the subscript t from when , i.e. we write

We denote by Athe output in round of an algorithm A, which uses the observed sub-gradients so far and a hint on the upcoming sub-gradient . As per Section 1, we say that an algorithm is scale-free (or scale-invariant) if its predictions are invariant to any common positive scaling of the loss functions and, if applicable, the hints.

Additional Notation. Given a closed convex set , we denote by the Euclidean projection of a point

2.2. Helpful Reductions

The difficulty behind designing scale-free algorithms lies partially in the fact that is unknown at the start of round t; before outputting . The following result due to Cutkosky (2019) quantifies the additional cost of proceeding with the plug-in estimate

Lemma 1 Let A be an online algorithm which at the start of each round , has access to a hint , and outputs A, before observing . Suppose that A guarantees an upper-bound on its linearized regret for the sequence and for all algorithm B which at the start of each round , guarantees

First, we note that Lemma 1 is only really useful when W is bounded; otherwise, depending on algorithm A, the term on the RHS of (2) could in principle be arbitrarily large even for fixed w, L, and T. The moral of Lemma 1 is that as long as the set W is bounded, one does not really need to know before outputting to guarantee a “good” regret bound against any w P W. For example, suppose that W has a bounded diameter D and algorithm A in Lemma 1 is such that , for all w P W. Then, from (2) and the fact that (since

), it is clear that algorithm B in Lemma 1 also guarantees the same regret bound to an additive 2DL, despite not having had the hints

It is possible to extend the result of Lemma 1 so that the regret bound of algorithm B remains useful even in the case where W is unbounded. An approach suggested by Cutkosky (2019) is to restrict the outputs of algorithm B to be in a non-decreasing sequence subsets of W. In this case, the diameters of need to be carefully chosen to achieve a desired regret bound. This approach, which essentially combines the idea of Lemma 1 and the unconstrained-to-constrained reduction due to Cutkosky and Orabona (2018), is formalized in the next lemma (essentially due to Cutkosky (2019)):

Lemma 2 Let algorithm A be as in Lemma 1, and let be a sequence of non-decreasing closed convex subsets of W with diameters . Then, algorithm B which at the start of round

We see that compared to Lemma 1, the additional penalty that algorithm B incurs for restricting its predictions to the sets is the sum . The challenge is now in choosing the diameters to control the trade-off between this sum and the term the RHS of (3). If T is known in advance, one could set , in which case the RHS of (3) is at most

We now instantiate the bound of Lemma 2 for another choice of is unknown:

Corollary 3 In the setting of Lemma 2, let be the ball of diameter , where . Then the RHS of (3) is bounded from above by

We see that by the more careful choice of in Corollary 3, one can replace the L?T term in (4) by the smaller quantity ; whether this can be improved further to , where , was raised as an open question by Cutkosky (2019). We will answer this in the negative in Theorem 14. We will also show in Theorem 15 below that, if one insists on a regret of order rOp?Tq, it is essentially not possible to improve on the penalty

2.3. Outlook

The conclusion that should be drawn from Lemmas 1 and 2 is the following; if one seeks an algorithm B with a regret bound of the form rOp}w}L?Tq up to some lower-order terms in T, without knowledge of L and regardless of whether W is bounded or not, it suffices to find an algorithm A which guarantees the sought type of regret whenever it has access to a sequence of hints satisfying (as in Lemmas 1 and 2), . Thus, our first goal in the next section is to design a scale-free algorithm A which accesses such a sequence of hints and ensures that its linearized regret is bounded from above by:

for all , and . We show an analogous “full-matrix” upgrade of order . We note that if Algorithm A in Lemmas 1 and 2 is scale-free, then so is the corresponding Algorithm B.

If the desired set W has bounded diameter , then using the unconstrained-to-constrained reduction due to Cutkosky and Orabona (2018), it is straightforward to design a new algorithm based on A with regret also bounded by (6) up to an additive LD, for w P W (this is useful for Lemma 1).

Finally, we also note that algorithms which can access hints are of independent interest; in fact, it is the same algorithm A that we will use in Section 5 as a scale-invariant algorithm for learning linear models.

3. Scale-Free, Parameter-Free Algorithms for OCO

In light of the conclusions of Section 2, we will design new unconstrained scale-free algorithms which can access a sequence of hints (as in Lemma 1) and guarantee a regret bound of the form given in (6). In this section, we will make the following assumption on the hints

Assumption 1 We assume that (i) is a non-decreasing sequence; (ii) , for all ; and (iii) if the sub-gradients are multiplied by a factor , then the hints are multiplied by the same factor c.

The third item of the assumption ensures that our algorithms are scale-free. We note that Assumption 1 is satisfied by the sequence of hints that Algorithm B constructs when invoking Algorithm A in Lemmas 1 and 2. To avoid an uninteresting case distinction, we will also make the following assumption, which is without loss of generality, since the regret is zero while

Assumption 2 We assume that

3.1. FREEGRAD: An Adaptive Scale-Free Algorithm

In this subsection, we design a new algorithm based on a time-varying potential function, where the outputs of the algorithm are uniquely determined by the gradients of the potential function at its iterates—an approach used in the design of many existing algorithms (Cesa-Bianchi et al., 1997).

Let be a sequence of sub-gradients satisfying Assumption 2, and let be a sequence of hints satisfying Assumption 1. Consider the following potential function:

This potential function has appeared as a by-product in the analyses of previous algorithms such as the ones in (Cutkosky and Orabona, 2018; Cutkosky, 2019). The expression of is interesting to us since it can be shown via the regret-reward duality (McMahan and Orabona, 2014) (as we do in the proof of Theorem 6 below) that any algorithm which outputs vectors such that is non-increasing for any sequence of sub-gradients , also guarantees a regret bound of the form (6). We will now design such an algorithm.

Definition 4 (FREEGRAD) In round t, based on the sequence of past sub-gradients available hint algorithm outputs the unconstrained iterate

where

The prediction (9) is forced by our design goal of decreasing potential . To see why, observe that at zero we have . The weights cancel the derivative at , ensuring there is no direction of linear increase (which would present a violation for tiny

Our main technical contribution in this subsection is to show that, in fact, with the choice of , the potential functions are non-increasing for any sequence of sub-gradients

Theorem 5 For , under Assumptions 1 and 2, we have:

The proof of the theorem is postponed to Appendix A. Theorem 5 and the regret-reward duality (McMahan and Orabona, 2014) yield a regret bound for FREEGRAD. In fact, if , then by the definition of

Now by Fenchel’s inequality, we have , for all , where , is the Fenchel dual of (Hiriart-Urruty and Lemar´echal, 2004). Combining this with (10), we obtain:

Rearranging (11) for a given leads to a regret bound of . Further bounding this quantity using existing results due to Cutkosky and Orabona (2018); Cutkosky (2019); McMahan and Orabona (2014), leads to the following regret bound (the proof is in Appendix B.1):

Theorem 6 Under Assumptions 1 and 2, for , we have, with

for all

Range-Ratio Problem. While the outputs in (9) of FREEGRAD are scale-free for the sequence of hints satisfying Assumption 1, there remains one serious issue; the fractions and inside the log-terms in the regret bound of Theorem 6 could in principle be arbitrarily large if is small enough relative to . Such a problematic ratio has appeared in the regret bounds of many previous algorithms which attempt to adapt to the Lipschitz constant L (Ross et al., 2013; Wintenberger, 2017; Kotłowski, 2017; Mhammedi et al., 2019; Kempka et al., 2019).

When the output set W is bounded with diameter , this ratio can be dispensed of using a recently proposed restart trick due to Mhammedi et al. (2019), which restarts the algorithm whenever . The price to pay for this is merely an additive OpLDq in the regret bound. However, this trick does not directly apply to our setting since in our case W may be unbounded. Fortunately, we are able to extend the analysis of the restart trick to the unbounded setting where a sequence of hints satisfying Assumption 1 is available; the cost we incur in the regret bound is an additive lower-order rOp}w}Lq term. Algorithm 1 displays our restart “wrapper”, FREERANGE, which uses the outputs of FREEGRAD to guarantee the following regret bound (the proof is in Appendix B):

Theorem 7 Let be the outputs of FREERANGE (Algorithm 1). Then,

3.2. MATRIX-FREEGRAD: Adapting to Directional Variance

Reflecting on the previous subsection, we see that the potential function that we ideally would like to use is . However, as we saw, this is a little too greedy even in one dimension, and we need to introduce some slack to make the potential controllable. In the previous subsection we did this by increasing the scalar denominator from , which acts as a barrier function restricting the norm of . In this section, we will instead employ a hard norm constraint. We will further need to include a fudge factor multiplying V to turn the above shape into a bona fide potential. To describe its effect, we define

The increasing function

The potential function of this section is parameterized by a prod factor (which we will set to some universal constant). We define

where . We introduce the following algorithm to control

Definition 8 (MATRIX-FREEGRAD) In round t, given past sub-gradients and a hint prediction is obtained from the gradient of in the first argument by

where

We can compute in time per round by first computing an eigendecomposition of , followed by a one-dimensional binary search for the which achieves the inf in (13) with . Then the output is given by

Our heavy-lifting step in the analysis is the following, which we prove in Appendix C:

Lemma 9 For any vector satisfying , the vector

From here, we obtain our main result using telescoping and regret-reward duality:

Theorem 10 Let

Note in particular that the result is scale-free. Expanding the main case of the theorem (modest we find regret bounded by

This bound looks almost like an ideal upgrade of that in Theorem 6, though technically, the bounds are not really comparable since the ln det Q can be as large as d ln T, potentially canceling the advantage of having instead of inside the square-root. The matrix Q and hence any directional variance is scale-invariant. The only fudge factor in the answer is the . We currently cannot tolerate so the lower-order term would explode. We note that a bound of the form given in the previous display, with the ln det Q replaced by the larger term d ln tr Q, was achieved by a previous (not scale-free) algorithm due to Cutkosky and Orabona (2018).

Remark 11 As Theorem 7 did in the previous subsection, our restarts method allows us to get rid of problematic scale ratios in the regret bound of Theorem 10; this can be achieved using FREERANGE with set to be as in (14) instead of (9). The key idea behind the proof of Theorem 7 is to show that the regrets from all but the last two epochs add up to a lower-order term in the final regret bound. This still holds when are the outputs of MATRIX-FREEGRAD instead of FREEGRAD, since by Theorem 10, the regret bound of MATRIX-FREEGRAD is of order at most d times the regret of FREEGRAD within any given epoch.

As a final note about the algorithm, we may also develop a “one-dimensional” variant by replacing matrix inverse and determinant by their scalar analogues applied to . One effect of this is that the minimization in can be computed in closed form. The resulting potential and corresponding algorithm and regret bound are very close to those of Section 3.1.

Conclusion The algorithms designed in this section can now be used in the role of algorithm A in the reductions presented in Section 2.2. This will yield algorithms which achieve our goal; they adapt to the norm of the comparator and the Lipschitz constant and are completely scale-free, for both bounded and unbounded sets, without requiring hints. We now show that the penalties incurred by these reductions are not improvable.

4. Lower Bounds

As we saw in Corollary 3, given a base algorithm A, which takes a sequence of hints for all , and which suffers regret against comparator w P W, there exists an algorithm B for the setting without hints which suffers the same regret against w up to an additive penalty . In this section, we show that the penalty is not improvable if one insists on a regret bound of order rOp?Tq. We also show that it is not possible to replace the penalty by the typically smaller quantity where . Our starting point is the following lemma:

Lemma 12 For all , past sub-gradients and past and current outputs

Proof We want to find such that , where . By restricting to be aligned with , the problem reduces to finding

The LHS of (15) is a piece-wise linear function in x which goes to infinity as . Therefore, there exists a large enough which satisfies (15).

Observe that if , for , then by Lemma 12, there exists a sub-gradient which makes the regret against at round t at least . This essentially means that if the sub-gradients are unbounded, then the outputs must be in a bounded set whose diameter will depend on the desired regret bound; if one insists on a regret of order rOp?Tq, then the outputs

must be in a ball of radius at most rOp?Tq.

Cutkosky (2019) posed the question of whether there exists an algorithm which can guarantee a regret bound of order , with , while adapting to both L and }w} (which essentially means replacing in Corollary 3 by ). Here, we ask the question whether is possible for any . If such an algorithm exists, then by Lemma 12, there exists a constant its outputs . The next lemma, when instantiated with , gives us a regret lower-bound on such algorithms (the proof is in Appendix D):

Lemma 13 For all , and , there exists , and , such that for any sequence satisfying , for all t P rTs, where

By combining the results of Lemma 12 and 13, we have the following regret lower bound for algorithms with can adapt to both L and }w}:

Theorem 14 For any and , there exists no algorithm that guarantees, up to multiplicative log-factors in }w} and T, a regret bound of the form

Proof By Lemma 12, the only candidate algorithms are those whose outputs satisfy , for all , for some constant . By Lemma 13, no such algorithms can achieve the desired regret bound.

The regret lower bound in Theorem 14 does not apply to the case where . In fact, thanks to Corollary 3 and our main algorithm in Section 3 (which can play the role of Algorithm A in Corollary 3), we know that there exists an algorithm B which guarantees a regret bound of order . Next we show that if one insists on a regret bound of order , or even?T (up to log-factors), the exponent in is unimprovable (the proof of Theorem 15 is in Appendix D.2).

Theorem 15 For any and , there exists no algorithm that guarantees, up to multiplicative log-factors in }w} and T, a regret bound of the form

5. Application to Learning Linear Models with Online Algorithms

In this section, we consider the setting of online learning of linear models which is a special case of OCO. At the start of each round , a learner receives a feature vector , then issues a prediction in the form of an inner product between and a vector The environment then reveals a label and the learner suffers loss is a fixed loss function which is convex and 1-Lipschitz in its second argument; this covers popular losses such as the logistic, hinge, absolute and Huberized squared loss. (Technically, the machinery developed so far and the reductions in Section 2.2 allow us to handle the non-Lipschitz case).

In the current setting, the regret is measured against the best fixed “linear model”

where the last inequality holds for any sub-gradients , due to the convexity of in its second argument, which in turn makes the function convex for all . Here, denotes the sub-differential of with respect to its second argument. Thus, minimizing the regret in (16) fits into the OCO framework described in Section 2. In fact, we will show how our algorithms from Section 3 can be applied in this setting to yield scale-free, and even rotation-free, (all with respect to the feature vectors ) algorithms for learning linear models. These algorithms can, without any prior knowledge on , achieve regret bounds against any matching (up to log-factors) that of OGD with optimally tuned learning rate.

As in Section 3, we focus on algorithms which make predictions based on observed sub-gradients (); in this case, , where . Since the loss is 1-Lipschitz, we have , for all and , and so is revealed at the beginning of round

is available ahead of outputting , and so our algorithms from Section 3 are well suited for this setting.

Improvement over Current Algorithms. We improve on current state-of-the-art algorithms in two ways; First, we provide a (coordinate-wise) scale-invariant algorithm which guarantees regret bound, against any

where , which improves the regret bound of the current state-of-the-art scale-invariant algorithm SCLNOLKempka et al., 2019) by aalnp}w}Tq factor. Second, we provide an algorithm that is both scale and rotation invariant with respect to the input feature vectors with a state-of-the-art regret bound; by scale and rotation invariance we mean that, if the sequence of feature vectors is multiplied by cO, where and O is any special orthogonal matrix in , the outputs () of the algorithm remain unchanged. Arguably the closest algorithm to ours in the latter case is that of Kotłowski (2017) whose regret bound is essentially of order for any comparator , where . However, in our case, instead of the matrix , we have , where , which can yield a much smaller bound for small (this typically happens when the algorithm starts to “converge”).

A Scale-Invariant Algorithm. To design our first scale-invariant algorithm, we will use the outputs ( ) of FREEGRAD in (9) with as in (17), and a slight modification of FREERANGE (see Algorithm 2). This modification consists of first scaling the outputs initial hint of the current epoch to make the predictions scale-invariant. By Theorem 20 below, the regret bound corresponding to such scaled outputs will have a lower-order term which, unlike in the regret bound of Theorem 6, does not depend on the initial hint. This breaks our current analysis of FREERANGE in the proof of Theorem 7 which we used to overcome the range-ratio problem. To solve this issue, we further scale the output at round by the sum , where denotes the first index of the current epoch (see Algorithm 2). Due to this change, the proof of the next theorem differs slightly from that of Theorem 7.

First, we study the regret bound of Algorithm 2 in the case where

Theorem 16 Let and be as in (17). If are the outputs of Algorithm 2, then for all

The proof of Theorem 16 is in Appendix E. If are the outputs of Algorithm 2 in the one-dimensional case, then by Theorem 16 and (16), the algorithm which, at each round

has regret bounded from above by the RHS of (19). Note also that the outputs are scale-invariant.

Now consider an algorithm A which at round rds, are the outputs of Algorithm 2 when applied to coordinate i; in this case, we will have a sequence of hints for each coordinate i satisfying , for all . Algorithm A is coordinate-wise scale-invariant, and due to (16) and Theorem 16, it guarantees a regret bound of the form (18). We note, however, that a factor d will appear multiplying the lower-order term in (19) (since the regret bounds for the different coordinates are added together). To avoid this, at the cost of a factor d appearing inside the logarithms in (18), it suffices to divide the outputs of algorithm A by d. To see why this works, see Theorem 20 in the appendix.

A Rotation-Invariant Algorithm. To obtain a rotation and scale-invariant online algorithm for learning linear models we will make use of the outputs of MATRIX-FREEGRAD instead of FREEGRAD. Let be the sequence of predictions defined by

with as in (17) and where are the predictions of a variant of MATRIX-FREEGRAD, where the leading in the potential (13) is replaced by 1 (we analyze this variant in Appendix C.1).

Theorem 17 Let

and

Proof It suffices to use (16) and instantiate the regret bound in Theorem 21 with

The range-ratio problem manifests itself again in Theorem 17 through the term can be solved using the outputs of Algorithm 2, where in Line 7, is taken to be as in (20) (see Remark 11).

Acknowledgments

This work was supported by the Australian Research Council and Data61.

References

Nicol`o Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. J. ACM, 44(3):427–485, 1997.

Ashok Cutkosky. Artificial constraints and hints for unbounded online learning. In Conference on Learning Theory, pages 874–894, 2019.

Ashok Cutkosky and Kwabena Boahen. Online learning without prior information. In Conference on Learning Theory, pages 643–677, 2017.

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018., pages 1493–1529, 2018.

Tim van Erven and Wouter M. Koolen. Metagrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3666–3674, 2016.

Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Zigzag: A new approach to adaptive online learning. In Proc. of the 2017 Annual Conference on Learning Theory (COLT), pages 876–924, 2017.

Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Online learning: Sufficient statistics and the Burkholder method. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018., pages 3028–3064, 2018.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2 (3–4):157–325, 2016.

Jean-Baptiste Hiriart-Urruty and Claude Lemar´echal. Fundamentals of Convex Analysis. Springer Science & Business Media, 2004.

Michal Kempka, Wojciech Kotłowski, and Manfred K. Warmuth. Adaptive scale-invariant online algorithms for learning linear models. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 3321–3330, 2019.

Wojciech Kotłowski. Scale-invariant unconstrained online learning. In International Conference on Algorithmic Learning Theory, ALT 2017, 15-17 October 2017, Kyoto University, Kyoto, Japan, pages 412–433, 2017.

H. Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pages 1020–1039, 2014.

H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pages 244–256, 2010.

Zakaria Mhammedi, Wouter M. Koolen, and Tim van Erven. Lipschitz adaptivity with multiple learning rates in online learning. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 2490–2511, 2019.

Francesco Orabona and D´avid P´al. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, pages 577–585, 2016a.

Francesco Orabona and D´avid P´al. Open problem: Parameter-free and scale-free online algorithms. In Conference on Learning Theory, pages 1659–1664, 2016b.

Stephane Ross, Paul Mineiro, and John Langford. Normalized online learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 537–545, 2013.

Olivier Wintenberger. Optimal learning with Bernstein online aggregation. Machine Learning, 106 (1):119–141, 2017.

Appendix A. Proof of Theorem 5

The proof of Theorem 5 relies on the following key lemma:

Lemma 18 For

It holds that

Proof For notational simplicity we assume . Let us look at

Since ln is increasing, we have that , if and only if, , and so we want to show for all . Our approach will be to show that is increasing in V . The result then follows from . It remains to study the derivative

Factoring this as a ratio of polynomials, we obtain:

where , are polynomials in g and G whose explicit (yet gruesome) expressions are:

α

α

Under our assumptions and , the denominator of above is positive. Furthermore, its numerator, regarded as a polynomial in V , has exclusively positive

We want to show that the gap is ≤ 0. Our approach will be to show that gap is increasing, so that we can then bound it by the limit

We will show that the numerator is positive by showing that it is a polynomial with only positive coefficients. Here are the coefficients on the monomials Vi for i=0,1,...

Figure 1: Mathematica notebook in support of Theorem 5.

coefficients , as can be verified using computer algebra software (we used Mathematica’s FullSimplify—see Figure 1). This implies that , for all , and

We are ourselves a bit disgruntled about the opacity of the above proof. On the one hand, it is just a tedious verification of an analytic statement about a function of three scalar variables, and one might expect that tighter statements require more sophisticated techniques (c.f. Kotłowski, 2017, Appendix F). It is quite plausible that positivity may be established in a somewhat more streamlined fashion using Sum-of-Squares techniques. Yet on the other hand, we were hoping to gain, from the proof, a deeper insight into the design of potential functions. Unfortunately this did not materialize. In particular, we still do not know how to address the multi-dimensional case of our Section 3.2 with

a similar potential. Controlling the intuitive upgrade of (7) where the exponential is replaced by

is impossible, as witnessed by numerical counterexamples returned by random search, already in dimension 2. We need one more result before we prove Theorem 5:

Lemma 19 Let . Then, the function

is non-increasing on

Proof It suffices to show that the function

is non-increasing on . Evaluating the derivative of , we find that

where . The derivative in (21) is non-positive for all that

Proof of Theorem 5. We will proceed by induction. By the fact that and the definition of the potential in (7), we have . Now let , and . We will show that . First, note that for any

Thus, for any

Let be the vector which maximizes the RHS of (22), where is the ball in Suppose that . Note that within the hyperplane H, only the first term on the RHS of (22) varies. Since is the maximizer of the RHS of (22) within instantiating Lemma 19 with and , implies that . Adding this to the fact that H is a hyperplane orthogonal to G implies that must be aligned, i.e. there exists a . Therefore, we have

Further, note that . Thus, the RHS of (22) is bounded from above by

Note that ) can be written in terms of the function

Since , Lemma 18 implies that , and so due to (23), we also have . Since is an upper-bound on the RHS of (22), it follows that as desired.

Appendix B. Proofs of Section 3.1

B.1. Proof of Theorem 6

The proof of Theorem 6 follows from the next theorem by setting . Theorem 20 essentially gives the regret bound of FREEGRAD if its outputs are scaled by a constant . This will be useful to us later.

Theorem 20 Let . Then, under Assumptions 1 and 2:

for all

Proof Since the assumptions of Theorem 5 are satisfied, we have

Dividing both sides of (24) by and rearranging yields

By duality, we further have that

Since , for all , where is the function defined in (10), we have by the properties of the Fenchel dual (Hiriart-Urruty and Lemar´echal, 2004, Prop. 1.3.1) that

Thus, according to (McMahan and Orabona, 2014, Lemma 3) and the properties of duality (Hiriart- Urruty and Lemar´echal, 2004, Prop. E.1.3.1), we have

On the other hand, (Cutkosky and Orabona, 2018, Lemma 18, 19) and (Orabona and P´al, 2016a, Lemma 18) provides the following upper-bound on , using the Lambert function W (where Wpxq is defined as the principal solution to

and . Using the fact that the Lambert function satisfies (see Lemma 22), together with (28) and (27) implies that

for all . Combining this with (25) and (26) leads to the desired regret bound.

Proof of Theorem 6. Invoke Theorem 20 with

B.2. Proof of Theorem 7

Proof Fix be the total number of epochs. We denote by the start index of epoch i P rks. Further, for , we define and (note how the upper index is exclusive). Recall that at epoch i P rks, the restart condition in Algorithm 1 is triggered at

where the last inequality follows by Assumption 1. We note that (29) also implies that

On the other hand, within epoch

Summing this inequality over

Now using (29) at , we have for all i P rks,

where the inequality follows by Assumption 1. Now by (30), we also have

Thus, substituting (35) and (34) into (33), and using the fact that

where in the last inequality, we used the fact that. Now, summing (32) over the last two epochs, yields

Adding (36) and (37) together leads to

This concludes the proof.

Appendix C. Proofs for Section 3.2

In this section we work on a version of the potential function that does not have the tuning for Section 3.2 substituted in yet, so that we can prove the result necessary for Section 5 in one go. The potential is parameterized by a prior variance initial wealth and, as before prod factor . It is defined by

C.1. Proof of Lemma 9

We prove the claim in Lemma 9 for the more general potential (38). Let be the minimizer in the problem . With that notation, we see that . To prove the lemma, it suffices to prove the stronger statement obtained by picking the sub-optimal choice for the problem , and dividing by

Let us abbreviate . The matrix determinant lemma and monotonicity of matrix inverse give

Then Sherman-Morrison gives

and splitting off the last round

All in all, it suffices to show

Being a square, is positive. In addition, optimality of ensures that ; this follows from the fact that . In combination with . The above requirement may hence be further reorganized to

The convex right hand side is minimized subject to

so it remains to show

The function ) is designed to satisfy the hardest case, where , with equality.

C.2. Proof of Theorem 10

We restate the claim for the potential (38) before tuning:

Theorem 21 (Theorem 10 rephrased) Let

Using that is decreasing in h, we can telescope to obtain

Using the definition reveals

To transform this into a regret bound, it remains to compute the convex conjugate of the RHS of (39) in . To this end, let

The Fenchel dual of this function is

where the model complexity is measured for through the function

(where Wpxq is defined as the principal solution to ). We will further use that

is increasing, and that it satisfies (see Lemma 22). Zero derivative of the above

objective for occurs at the pleasantly explicit

and hence the optimum for is either at that point or at zero, whichever is higher, with the crossover point at . Plugging that in, we find that

where . Using that is increasing, we may drop the exponential in its argument in the first case, and obtain

Note that this is a curious maximum between (the larger for modest w), and the average between that very same term and another quantity that grows super-linearly with (so this is the winner for extreme w).

Okay, now let’s collect everything for the