We consider the setting of online convex optimization where the goal is to make sequential predictions to minimize a certain notion of regret. Specifically, at the beginning of each round , a learner predicts
in some convex set
in dimension d P N. The environment then reveals a convex loss function
, and the learner suffers loss
. The goal of the learner is to minimize the regret
after
rounds against any “comparator” prediction w P W. Typically, an online learning algorithm outputs a vector
, based only on a sequence of observed sub-gradients
, where
. In this paper, we are interested in online algorithms which can guarantee a good regret bound (by a measure which we will make precise below) against any comparator vector w P W, even when W is unbounded, and without prior knowledge of the maximum norm
of the observed sub-gradients. In what follows, we refer to L as the Lipschitz constant.
By assuming an upper-bound on the norm of the desired comparator vector w in hindsight, there exist Lipschitz-adaptive algorithms that can achieve a sub-linear regret of order LD?T, without knowing L in advance. A Lipschitz-adaptive algorithm is also called scale-free (or scale-invariant) if its predictions do not change when the loss functions
are multiplied by a factor
; in this case, its regret bound is expected to scale by the same factor c. When L is known in advance and
, there exists another type of algorithms, so-called parameter-free, which can achieve an rOp}w}L?Tq regret bound, where w is the desired comparator vector in hindsight (the notation rO hides log-factors). Up to an additive lower-order term, this type of regret bound is also achievable for bounded W via the unconstrained-to-constrained reduction (Cutkosky, 2019).
The question of whether an algorithm can simultaneously be scale-free and parameter-free was posed as an open problem by Orabona and P´al (2016b). It was latter answered in the negative by Cutkosky and Boahen (2017). Nevertheless, Cutkosky (2019) recently presented algorithms which achieve an regret bound, without knowing either L or
. This does not violate the earlier lower bound of Cutkosky and Boahen (2017), which insists on norm dependence
Though Cutkosky (2019) designs algorithms that can to some extent adapt to both L and , their algorithms are still not scale-free. Multiplying
, and as a result
, by a positive factor
changes the outputs
of their algorithms, and their regret bounds scale by a factor
necessarily equal to c. Their algorithms depend on a parameter
which has to be specified in advance. This parameter appears in their regret bounds as an additive term and also in a logarithmic term of the form
. As a result of this type of dependence on
and the fact that
, there is no prior choice of
which can make their regret bounds scale-invariant. What is more, without knowing L, there is also no “safe” choice of
which can prevent the
term from becoming arbitrarily large relative to L (it suffices for
to be small enough relative to the “unknown” L).
Contributions. Our main contribution is a new scale-free, parameter-free learning algorithm for OCO with regret at most , for any comparator w P W in a bounded set W, where
. When the set W is unbounded, the algorithm achieves the same guarantee up to an additive
for all t P rTs. In the latter case, we also show a matching lower bound; when W is unbounded and without knowing L, any online learning algorithm which insists on an rOp?Tq bound, has regret at least
. We also provide a second scale-invariant algorithm which replaces the leading
term in the regret bound of our first algorithm by
. Our starting point for designing our algorithms is a known potential function which we show to be controlled for a unique choice of output sequence
As our main application, we show how our algorithms can be applied to learn linear models. The result is an online algorithm for learning linear models whose label predictions are invariant to coordinate-wise scaling of the input feature vectors. The regret bound of the algorithm is naturally also scale-invariant and improves on the bounds of existing state-of-the-art algorithms in this setting (Kotłowski, 2017; Kempka et al., 2019).
Related Work For an overview of Online Convex Optimization in the bounded setting, we refer to the textbook (Hazan, 2016). The unconstrained case was first studied by McMahan and Streeter (2010). A powerful methodology for the unbounded case is Coin Betting by Orabona and P´al (2016a). Even though not always visible, our potential functions are inspired by this style of thinking. We build our unbounded OCO learner by targeting a specific other constrained problem. We further employ several general reductions from the literature, including gradient clipping Cutkosky (2019), the constrained-to-unconstrained reduction Cutkosky and Orabona (2018), and the restart wrapper to pacify the final-vs-initial scale ratio appearing inside logarithms by Mhammedi et al. (2019). Our analysis is, at its core, proving a certain minimax result about sufficient-statistic-based potentials reminiscent of the Burkholder approach pioneered by Foster et al. (2017, 2018). Applications for scale-invariant learning in linear models were studied by Kempka et al. (2019). For our multidimensional learner we took inspiration from the Gaussian Exp-concavity step in the analysis of the MetaGrad algorithm by Van Erven and Koolen (2016).
Outline In Section 2, we present the setting and notation, and formulate our goal. In Section 3, we present our main algorithms. In Section 4, we present new lower-bounds for algorithms which adapt to both the Lipschitz constant and the norm of the comparator. In Section 5, we apply our algorithms to online prediction with linear models.
Our goal is to design scale-free algorithms that adapt to the Lipschitz constant L and comparator norm . We will first introduce the setting, then discuss existing reductions, and finally state what needs to be done to achieve our goal.
2.1. Setting and Notation
Let be a convex set, and assume without loss of generality that 0 P W. We allow the set W to be unbounded, and we define its (possibly infinite) diameter
r0, `8s. We consider the setting of Online Convex Optimization (OCO) where at the beginning of each round
, the learner outputs a prediction
, before observing a convex loss function
, or an element of its sub-gradient
at
. The goal of the learner is to minimize the regret after
rounds, which is given by
for any comparator vector w P W. In this paper, we do not assume that T is known to the learner, and so we are after algorithms with so called any-time guarantees. By convexity, we have
and thus for the purpose of minimizing the regret, typical OCO algorithms minimize the RHS of (1), which is known as the linearized regret, by generating outputs based on the sequence of observed sub-gradients
. Likewise, we focus our attention exclusively on linear optimization.
Given a sequence of sub-gradients , it will be useful to define the running maximum gradient norm and the clipped sub-gradients
for , with the convention that
. We also drop the subscript t from
when
, i.e. we write
We denote by Athe output in round
of an algorithm A, which uses the observed sub-gradients so far and a hint
on the upcoming sub-gradient
. As per Section 1, we say that an algorithm is scale-free (or scale-invariant) if its predictions are invariant to any common positive scaling of the loss functions
and, if applicable, the hints.
Additional Notation. Given a closed convex set , we denote by
the Euclidean projection of a point
2.2. Helpful Reductions
The difficulty behind designing scale-free algorithms lies partially in the fact that is unknown at the start of round t; before outputting
. The following result due to Cutkosky (2019) quantifies the additional cost of proceeding with the plug-in estimate
Lemma 1 Let A be an online algorithm which at the start of each round , has access to a hint
, and outputs A
, before observing
. Suppose that A guarantees an upper-bound
on its linearized regret for the sequence
and for all
algorithm B which at the start of each round
, guarantees
First, we note that Lemma 1 is only really useful when W is bounded; otherwise, depending on algorithm A, the term on the RHS of (2) could in principle be arbitrarily large even for fixed w, L, and T. The moral of Lemma 1 is that as long as the set W is bounded, one does not really need to know
before outputting
to guarantee a “good” regret bound against any w P W. For example, suppose that W has a bounded diameter D and algorithm A in Lemma 1 is such that
, for all w P W. Then, from (2) and the fact that
(since
), it is clear that algorithm B in Lemma 1 also guarantees the same regret bound
to an additive 2DL, despite not having had the hints
It is possible to extend the result of Lemma 1 so that the regret bound of algorithm B remains useful even in the case where W is unbounded. An approach suggested by Cutkosky (2019) is to restrict the outputs of algorithm B to be in a non-decreasing sequence
subsets of W. In this case, the diameters
of
need to be carefully chosen to achieve a desired regret bound. This approach, which essentially combines the idea of Lemma 1 and the unconstrained-to-constrained reduction due to Cutkosky and Orabona (2018), is formalized in the next lemma (essentially due to Cutkosky (2019)):
Lemma 2 Let algorithm A be as in Lemma 1, and let be a sequence of non-decreasing closed convex subsets of W with diameters
. Then, algorithm B which at the start of round
We see that compared to Lemma 1, the additional penalty that algorithm B incurs for restricting its predictions to the sets is the sum
. The challenge is now in choosing the diameters
to control the trade-off between this sum and the term
the RHS of (3). If T is known in advance, one could set
, in which case the RHS of (3) is at most
We now instantiate the bound of Lemma 2 for another choice of is unknown:
Corollary 3 In the setting of Lemma 2, let be the ball of diameter
, where
. Then the RHS of (3) is bounded from above by
We see that by the more careful choice of in Corollary 3, one can replace the L?T term in (4) by the smaller quantity
; whether this can be improved further to
, where
, was raised as an open question by Cutkosky (2019). We will answer this in the negative in Theorem 14. We will also show in Theorem 15 below that, if one insists on a regret of order rOp?Tq, it is essentially not possible to improve on the penalty
2.3. Outlook
The conclusion that should be drawn from Lemmas 1 and 2 is the following; if one seeks an algorithm B with a regret bound of the form rOp}w}L?Tq up to some lower-order terms in T, without knowledge of L and regardless of whether W is bounded or not, it suffices to find an algorithm A which guarantees the sought type of regret whenever it has access to a sequence of hints satisfying (as in Lemmas 1 and 2),
. Thus, our first goal in the next section is to design a scale-free algorithm A which accesses such a sequence of hints and ensures that its linearized regret is bounded from above by:
for all , and
. We show an analogous “full-matrix” upgrade of order
. We note that if Algorithm A in Lemmas 1 and 2 is scale-free, then so is the corresponding Algorithm B.
If the desired set W has bounded diameter , then using the unconstrained-to-constrained reduction due to Cutkosky and Orabona (2018), it is straightforward to design a new algorithm based on A with regret also bounded by (6) up to an additive LD, for w P W (this is useful for Lemma 1).
Finally, we also note that algorithms which can access hints are of independent interest; in fact, it is the same algorithm A that we will use in Section 5 as a scale-invariant algorithm for learning linear models.
In light of the conclusions of Section 2, we will design new unconstrained scale-free algorithms which can access a sequence of hints (as in Lemma 1) and guarantee a regret bound of the form given in (6). In this section, we will make the following assumption on the hints
Assumption 1 We assume that (i) is a non-decreasing sequence; (ii)
, for all
; and (iii) if the sub-gradients
are multiplied by a factor
, then the hints
are multiplied by the same factor c.
The third item of the assumption ensures that our algorithms are scale-free. We note that Assumption 1 is satisfied by the sequence of hints that Algorithm B constructs when invoking Algorithm A in Lemmas 1 and 2. To avoid an uninteresting case distinction, we will also make the following assumption, which is without loss of generality, since the regret is zero while
Assumption 2 We assume that
3.1. FREEGRAD: An Adaptive Scale-Free Algorithm
In this subsection, we design a new algorithm based on a time-varying potential function, where the outputs of the algorithm are uniquely determined by the gradients of the potential function at its iterates—an approach used in the design of many existing algorithms (Cesa-Bianchi et al., 1997).
Let be a sequence of sub-gradients satisfying Assumption 2, and let
be a sequence of hints satisfying Assumption 1. Consider the following potential function:
This potential function has appeared as a by-product in the analyses of previous algorithms such as the ones in (Cutkosky and Orabona, 2018; Cutkosky, 2019). The expression of is interesting to us since it can be shown via the regret-reward duality (McMahan and Orabona, 2014) (as we do in the proof of Theorem 6 below) that any algorithm which outputs vectors
such that
is non-increasing for any sequence of sub-gradients
, also guarantees a regret bound of the form (6). We will now design such an algorithm.
Definition 4 (FREEGRAD) In round t, based on the sequence of past sub-gradients available hint
algorithm outputs the unconstrained iterate
where
The prediction (9) is forced by our design goal of decreasing potential . To see why, observe that at zero
we have
. The weights
cancel the derivative
at
, ensuring there is no direction of linear increase (which would present a violation for tiny
Our main technical contribution in this subsection is to show that, in fact, with the choice of , the potential functions
are non-increasing for any sequence of sub-gradients
Theorem 5 For , under Assumptions 1 and 2, we have:
The proof of the theorem is postponed to Appendix A. Theorem 5 and the regret-reward duality (McMahan and Orabona, 2014) yield a regret bound for FREEGRAD. In fact, if , then by the definition of
Now by Fenchel’s inequality, we have , for all
, where
, is the Fenchel dual of
(Hiriart-Urruty and Lemar´echal, 2004). Combining this with (10), we obtain:
Rearranging (11) for a given leads to a regret bound of
. Further bounding this quantity using existing results due to Cutkosky and Orabona (2018); Cutkosky (2019); McMahan and Orabona (2014), leads to the following regret bound (the proof is in Appendix B.1):
Theorem 6 Under Assumptions 1 and 2, for , we have, with
for all
Range-Ratio Problem. While the outputs in (9) of FREEGRAD are scale-free for the sequence of hints
satisfying Assumption 1, there remains one serious issue; the fractions
and
inside the log-terms in the regret bound of Theorem 6 could in principle be arbitrarily large if
is small enough relative to
. Such a problematic ratio has appeared in the regret bounds of many previous algorithms which attempt to adapt to the Lipschitz constant L (Ross et al., 2013; Wintenberger, 2017; Kotłowski, 2017; Mhammedi et al., 2019; Kempka et al., 2019).
When the output set W is bounded with diameter , this ratio can be dispensed of using a recently proposed restart trick due to Mhammedi et al. (2019), which restarts the algorithm whenever
. The price to pay for this is merely an additive OpLDq in the regret bound. However, this trick does not directly apply to our setting since in our case W may be unbounded. Fortunately, we are able to extend the analysis of the restart trick to the unbounded setting where a sequence of hints
satisfying Assumption 1 is available; the cost we incur in the regret bound is an additive lower-order rOp}w}Lq term. Algorithm 1 displays our restart “wrapper”, FREERANGE, which uses the outputs of FREEGRAD to guarantee the following regret bound (the proof is in Appendix B):
Theorem 7 Let be the outputs of FREERANGE (Algorithm 1). Then,
3.2. MATRIX-FREEGRAD: Adapting to Directional Variance
Reflecting on the previous subsection, we see that the potential function that we ideally would like to use is . However, as we saw, this is a little too greedy even in one dimension, and we need to introduce some slack to make the potential controllable. In the previous subsection we did this by increasing the scalar denominator from
, which acts as a barrier function restricting the norm of
. In this section, we will instead employ a hard norm constraint. We will further need to include a fudge factor
multiplying V to turn the above shape into a bona fide potential. To describe its effect, we define
The increasing function
The potential function of this section is parameterized by a prod factor (which we will set to some universal constant). We define
where . We introduce the following algorithm to control
Definition 8 (MATRIX-FREEGRAD) In round t, given past sub-gradients and a hint
prediction is obtained from the gradient of
in the first argument by
where
We can compute in
time per round by first computing an eigendecomposition of
, followed by a one-dimensional binary search for the
which achieves the inf in (13) with
. Then the output is given by
Our heavy-lifting step in the analysis is the following, which we prove in Appendix C:
Lemma 9 For any vector satisfying
, the vector
From here, we obtain our main result using telescoping and regret-reward duality:
Theorem 10 Let
Note in particular that the result is scale-free. Expanding the main case of the theorem (modest we find regret bounded by
This bound looks almost like an ideal upgrade of that in Theorem 6, though technically, the bounds are not really comparable since the ln det Q can be as large as d ln T, potentially canceling the advantage of having instead of
inside the square-root. The matrix Q and hence any directional variance
is scale-invariant. The only fudge factor in the answer is the
. We currently cannot tolerate
so the lower-order term would explode. We note that a bound of the form given in the previous display, with the ln det Q replaced by the larger term d ln tr Q, was achieved by a previous (not scale-free) algorithm due to Cutkosky and Orabona (2018).
Remark 11 As Theorem 7 did in the previous subsection, our restarts method allows us to get rid of problematic scale ratios in the regret bound of Theorem 10; this can be achieved using FREERANGE with set to be as in (14) instead of (9). The key idea behind the proof of Theorem 7 is to show that the regrets from all but the last two epochs add up to a lower-order term in the final regret bound. This still holds when
are the outputs of MATRIX-FREEGRAD instead of FREEGRAD, since by Theorem 10, the regret bound of MATRIX-FREEGRAD is of order at most d times the regret of FREEGRAD within any given epoch.
As a final note about the algorithm, we may also develop a “one-dimensional” variant by replacing matrix inverse and determinant by their scalar analogues applied to . One effect of this is that the minimization in
can be computed in closed form. The resulting potential and corresponding algorithm and regret bound are very close to those of Section 3.1.
Conclusion The algorithms designed in this section can now be used in the role of algorithm A in the reductions presented in Section 2.2. This will yield algorithms which achieve our goal; they adapt to the norm of the comparator and the Lipschitz constant and are completely scale-free, for both bounded and unbounded sets, without requiring hints. We now show that the penalties incurred by these reductions are not improvable.
As we saw in Corollary 3, given a base algorithm A, which takes a sequence of hints for all
, and which suffers regret
against comparator w P W, there exists an algorithm B for the setting without hints which suffers the same regret against w up to an additive penalty
. In this section, we show that the penalty
is not improvable if one insists on a regret bound of order rOp?Tq. We also show that it is not possible to replace the penalty
by the typically smaller quantity
where
. Our starting point is the following lemma:
Lemma 12 For all , past sub-gradients
and past and current outputs
Proof We want to find such that
, where
. By restricting
to be aligned with
, the problem reduces to finding
The LHS of (15) is a piece-wise linear function in x which goes to infinity as . Therefore, there exists a large enough
which satisfies (15).
Observe that if , for
, then by Lemma 12, there exists a sub-gradient
which makes the regret against
at round t at least
. This essentially means that if the sub-gradients
are unbounded, then the outputs
must be in a bounded set whose diameter will depend on the desired regret bound; if one insists on a regret of order rOp?Tq, then the outputs
must be in a ball of radius at most rOp?Tq.
Cutkosky (2019) posed the question of whether there exists an algorithm which can guarantee a regret bound of order , with
, while adapting to both L and }w} (which essentially means replacing
in Corollary 3 by
). Here, we ask the question whether
is possible for any
. If such an algorithm exists, then by Lemma 12, there exists a constant
its outputs
. The next lemma, when instantiated with
, gives us a regret lower-bound on such algorithms (the proof is in Appendix D):
Lemma 13 For all , and
, there exists
, and
, such that for any sequence
satisfying
, for all t P rTs, where
By combining the results of Lemma 12 and 13, we have the following regret lower bound for algorithms with can adapt to both L and }w}:
Theorem 14 For any and
, there exists no algorithm that guarantees, up to multiplicative log-factors in }w} and T, a regret bound of the form
Proof By Lemma 12, the only candidate algorithms are those whose outputs satisfy
, for all
, for some constant
. By Lemma 13, no such algorithms can achieve the desired regret bound.
The regret lower bound in Theorem 14 does not apply to the case where . In fact, thanks to Corollary 3 and our main algorithm in Section 3 (which can play the role of Algorithm A in Corollary 3), we know that there exists an algorithm B which guarantees a regret bound of order
. Next we show that if one insists on a regret bound of order
, or even?T (up to log-factors), the exponent in
is unimprovable (the proof of Theorem 15 is in Appendix D.2).
Theorem 15 For any and
, there exists no algorithm that guarantees, up to multiplicative log-factors in }w} and T, a regret bound of the form
In this section, we consider the setting of online learning of linear models which is a special case of OCO. At the start of each round , a learner receives a feature vector
, then issues a prediction
in the form of an inner product between
and a vector
The environment then reveals a label
and the learner suffers loss
is a fixed loss function which is convex and 1-Lipschitz in its second argument; this covers popular losses such as the logistic, hinge, absolute and Huberized squared loss. (Technically, the machinery developed so far and the reductions in Section 2.2 allow us to handle the non-Lipschitz case).
In the current setting, the regret is measured against the best fixed “linear model”
where the last inequality holds for any sub-gradients , due to the convexity of
in its second argument, which in turn makes the function
convex for all
. Here,
denotes the sub-differential of
with respect to its second argument. Thus, minimizing the regret in (16) fits into the OCO framework described in Section 2. In fact, we will show how our algorithms from Section 3 can be applied in this setting to yield scale-free, and even rotation-free, (all with respect to the feature vectors
) algorithms for learning linear models. These algorithms can, without any prior knowledge on
, achieve regret bounds against any
matching (up to log-factors) that of OGD with optimally tuned learning rate.
As in Section 3, we focus on algorithms which make predictions based on observed sub-gradients (); in this case,
, where
. Since the loss
is 1-Lipschitz, we have
, for all
and
, and so
is revealed at the beginning of round
is available ahead of outputting , and so our algorithms from Section 3 are well suited for this setting.
Improvement over Current Algorithms. We improve on current state-of-the-art algorithms in two ways; First, we provide a (coordinate-wise) scale-invariant algorithm which guarantees regret bound, against any
where , which improves the regret bound of the current state-of-the-art scale-invariant algorithm SCLNOL
Kempka et al., 2019) by aalnp}w}Tq factor. Second, we provide an algorithm that is both scale and rotation invariant with respect to the input feature vectors
with a state-of-the-art regret bound; by scale and rotation invariance we mean that, if the sequence of feature vectors
is multiplied by cO, where
and O is any special orthogonal matrix in
, the outputs (
) of the algorithm remain unchanged. Arguably the closest algorithm to ours in the latter case is that of Kotłowski (2017) whose regret bound is essentially of order
for any comparator
, where
. However, in our case, instead of the matrix
, we have
, where
, which can yield a much smaller bound for small
(this typically happens when the algorithm starts to “converge”).
A Scale-Invariant Algorithm. To design our first scale-invariant algorithm, we will use the outputs ( ) of FREEGRAD in (9) with
as in (17), and a slight modification of FREERANGE (see Algorithm 2). This modification consists of first scaling the outputs
initial hint of the current epoch to make the predictions
scale-invariant. By Theorem 20 below, the regret bound corresponding to such scaled outputs will have a lower-order term which, unlike in the regret bound of Theorem 6, does not depend on the initial hint. This breaks our current analysis of FREERANGE in the proof of Theorem 7 which we used to overcome the range-ratio problem. To solve this issue, we further scale the output
at round
by the sum
, where
denotes the first index of the current epoch (see Algorithm 2). Due to this change, the proof of the next theorem differs slightly from that of Theorem 7.
First, we study the regret bound of Algorithm 2 in the case where
Theorem 16 Let and
be as in (17). If
are the outputs of Algorithm 2, then for all
The proof of Theorem 16 is in Appendix E. If are the outputs of Algorithm 2 in the one-dimensional case, then by Theorem 16 and (16), the algorithm which, at each round
has regret bounded from above by the RHS of (19). Note also that the outputs
are scale-invariant.
Now consider an algorithm A which at round rds, are the outputs of Algorithm 2 when applied to coordinate i; in this case, we will have a sequence of hints
for each coordinate i satisfying
, for all
. Algorithm A is coordinate-wise scale-invariant, and due to (16) and Theorem 16, it guarantees a regret bound of the form (18). We note, however, that a factor d will appear multiplying the lower-order term
in (19) (since the regret bounds for the different coordinates are added together). To avoid this, at the cost of a factor d appearing inside the logarithms in (18), it suffices to divide the outputs of algorithm A by d. To see why this works, see Theorem 20 in the appendix.
A Rotation-Invariant Algorithm. To obtain a rotation and scale-invariant online algorithm for learning linear models we will make use of the outputs of MATRIX-FREEGRAD instead of FREEGRAD. Let be the sequence of predictions defined by
with as in (17) and where
are the predictions of a variant of MATRIX-FREEGRAD, where the leading
in the potential (13) is replaced by 1 (we analyze this variant in Appendix C.1).
Theorem 17 Let
and
Proof It suffices to use (16) and instantiate the regret bound in Theorem 21 with
The range-ratio problem manifests itself again in Theorem 17 through the term can be solved using the outputs of Algorithm 2, where in Line 7,
is taken to be as in (20) (see Remark 11).
This work was supported by the Australian Research Council and Data61.
Nicol`o Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. J. ACM, 44(3):427–485, 1997.
Ashok Cutkosky. Artificial constraints and hints for unbounded online learning. In Conference on Learning Theory, pages 874–894, 2019.
Ashok Cutkosky and Kwabena Boahen. Online learning without prior information. In Conference on Learning Theory, pages 643–677, 2017.
Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018., pages 1493–1529, 2018.
Tim van Erven and Wouter M. Koolen. Metagrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3666–3674, 2016.
Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Zigzag: A new approach to adaptive online learning. In Proc. of the 2017 Annual Conference on Learning Theory (COLT), pages 876–924, 2017.
Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Online learning: Sufficient statistics and the Burkholder method. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018., pages 3028–3064, 2018.
Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2 (3–4):157–325, 2016.
Jean-Baptiste Hiriart-Urruty and Claude Lemar´echal. Fundamentals of Convex Analysis. Springer Science & Business Media, 2004.
Michal Kempka, Wojciech Kotłowski, and Manfred K. Warmuth. Adaptive scale-invariant online algorithms for learning linear models. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 3321–3330, 2019.
Wojciech Kotłowski. Scale-invariant unconstrained online learning. In International Conference on Algorithmic Learning Theory, ALT 2017, 15-17 October 2017, Kyoto University, Kyoto, Japan, pages 412–433, 2017.
H. Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pages 1020–1039, 2014.
H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pages 244–256, 2010.
Zakaria Mhammedi, Wouter M. Koolen, and Tim van Erven. Lipschitz adaptivity with multiple learning rates in online learning. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 2490–2511, 2019.
Francesco Orabona and D´avid P´al. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, pages 577–585, 2016a.
Francesco Orabona and D´avid P´al. Open problem: Parameter-free and scale-free online algorithms. In Conference on Learning Theory, pages 1659–1664, 2016b.
Stephane Ross, Paul Mineiro, and John Langford. Normalized online learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 537–545, 2013.
Olivier Wintenberger. Optimal learning with Bernstein online aggregation. Machine Learning, 106 (1):119–141, 2017.
The proof of Theorem 5 relies on the following key lemma:
Lemma 18 For
It holds that
Proof For notational simplicity we assume . Let us look at
Since ln is increasing, we have that , if and only if,
, and so we want to show
for all
. Our approach will be to show that
is increasing in V . The result then follows from
. It remains to study the derivative
Factoring this as a ratio of polynomials, we obtain:
where , are polynomials in g and G whose explicit (yet gruesome) expressions are:
α
α
Under our assumptions and
, the denominator of
above is positive. Furthermore, its numerator, regarded as a polynomial in V , has exclusively positive
We want to show that the gap is ≤ 0. Our approach will be to show that gap is increasing, so that we can then bound it by the limit
We will show that the numerator is positive by showing that it is a polynomial with only positive coefficients. Here are the coefficients on the monomials Vi for i=0,1,...
Figure 1: Mathematica notebook in support of Theorem 5.
coefficients , as can be verified using computer algebra software (we used Mathematica’s FullSimplify—see Figure 1). This implies that
, for all
, and
We are ourselves a bit disgruntled about the opacity of the above proof. On the one hand, it is just a tedious verification of an analytic statement about a function of three scalar variables, and one might expect that tighter statements require more sophisticated techniques (c.f. Kotłowski, 2017, Appendix F). It is quite plausible that positivity may be established in a somewhat more streamlined fashion using Sum-of-Squares techniques. Yet on the other hand, we were hoping to gain, from the proof, a deeper insight into the design of potential functions. Unfortunately this did not materialize. In particular, we still do not know how to address the multi-dimensional case of our Section 3.2 with
a similar potential. Controlling the intuitive upgrade of (7) where the exponential is replaced by
is impossible, as witnessed by numerical counterexamples returned by random search, already in dimension 2. We need one more result before we prove Theorem 5:
Lemma 19 Let . Then, the function
is non-increasing on
Proof It suffices to show that the function
is non-increasing on . Evaluating the derivative of
, we find that
where . The derivative in (21) is non-positive for all
that
Proof of Theorem 5. We will proceed by induction. By the fact that and the definition of the potential in (7), we have
. Now let
, and
. We will show that
. First, note that for any
Thus, for any
Let be the vector
which maximizes the RHS of (22), where
is the ball in
Suppose that
. Note that within the hyperplane H, only the first term on the RHS of (22) varies. Since
is the maximizer of the RHS of (22) within
instantiating Lemma 19 with
and
, implies that
. Adding this to the fact that H is a hyperplane orthogonal to G implies that
must be aligned, i.e. there exists a
. Therefore, we have
Further, note that . Thus, the RHS of (22) is bounded from above by
Note that ) can be written in terms of the function
Since , Lemma 18 implies that
, and so due to (23), we also have
. Since
is an upper-bound on the RHS of (22), it follows that
as desired.
B.1. Proof of Theorem 6
The proof of Theorem 6 follows from the next theorem by setting . Theorem 20 essentially gives the regret bound of FREEGRAD if its outputs
are scaled by a constant
. This will be useful to us later.
Theorem 20 Let . Then, under Assumptions 1 and 2:
for all
Proof Since the assumptions of Theorem 5 are satisfied, we have
Dividing both sides of (24) by and rearranging yields
By duality, we further have that
Since , for all
, where
is the function defined in (10), we have by the properties of the Fenchel dual (Hiriart-Urruty and Lemar´echal, 2004, Prop. 1.3.1) that
Thus, according to (McMahan and Orabona, 2014, Lemma 3) and the properties of duality (Hiriart- Urruty and Lemar´echal, 2004, Prop. E.1.3.1), we have
On the other hand, (Cutkosky and Orabona, 2018, Lemma 18, 19) and (Orabona and P´al, 2016a, Lemma 18) provides the following upper-bound on , using the Lambert function W (where Wpxq is defined as the principal solution to
and . Using the fact that the Lambert function satisfies
(see Lemma 22), together with (28) and (27) implies that
for all . Combining this with (25) and (26) leads to the desired regret bound.
Proof of Theorem 6. Invoke Theorem 20 with
B.2. Proof of Theorem 7
Proof Fix be the total number of epochs. We denote by
the start index of epoch i P rks. Further, for
, we define
and
(note how the upper index is exclusive). Recall that at epoch i P rks, the restart condition in Algorithm 1 is triggered at
where the last inequality follows by Assumption 1. We note that (29) also implies that
On the other hand, within epoch
Summing this inequality over
Now using (29) at , we have for all i P rks,
where the inequality follows by Assumption 1. Now by (30), we also have
Thus, substituting (35) and (34) into (33), and using the fact that
where in the last inequality, we used the fact that. Now, summing (32) over the last two epochs, yields
Adding (36) and (37) together leads to
This concludes the proof.
In this section we work on a version of the potential function that does not have the tuning for Section 3.2 substituted in yet, so that we can prove the result necessary for Section 5 in one go. The potential is parameterized by a prior variance initial wealth
and, as before prod factor
. It is defined by
C.1. Proof of Lemma 9
We prove the claim in Lemma 9 for the more general potential (38). Let be the minimizer in the problem
. With that notation, we see that
. To prove the lemma, it suffices to prove the stronger statement obtained by picking the sub-optimal choice
for the problem
, and dividing by
Let us abbreviate . The matrix determinant lemma and monotonicity of matrix inverse give
Then Sherman-Morrison gives
and splitting off the last round
All in all, it suffices to show
Being a square, is positive. In addition, optimality of
ensures that
; this follows from the fact that
. In combination with
. The above requirement may hence be further reorganized to
The convex right hand side is minimized subject to
so it remains to show
The function ) is designed to satisfy the hardest case, where
, with equality.
C.2. Proof of Theorem 10
We restate the claim for the potential (38) before tuning:
Theorem 21 (Theorem 10 rephrased) Let
Using that is decreasing in h, we can telescope to obtain
Using the definition reveals
To transform this into a regret bound, it remains to compute the convex conjugate of the RHS of (39) in . To this end, let
The Fenchel dual of this function is
where the model complexity is measured for through the function
(where Wpxq is defined as the principal solution to ). We will further use that
is increasing, and that it satisfies (see Lemma 22). Zero derivative of the above
objective for occurs at the pleasantly explicit
and hence the optimum for is either at that point or at zero, whichever is higher, with the crossover point at
. Plugging that in, we find that
where . Using that
is increasing, we may drop the exponential in its argument in the first case, and obtain
Note that this is a curious maximum between (the larger for modest w), and the average between that very same term and another quantity that grows super-linearly with
(so this is the winner for extreme w).
Okay, now let’s collect everything for the final result and undo the abbreviations. We have
where
To complete the proof of Theorem 21, it remains to prove the following result.
Lemma 22 For , define
Proof The fact that follows from (Orabona and P´al, 2016a, Lemma 18). Recall that
Hence
where we plugged in the sub-optimal choices (this requires
stick in
Note that . This is less than 2, the value of
becomes positive.
D.1. Proof of Lemma 13
Let , and
. We consider the 1-dimensional case (i.e.
) and set
, for all
. Since
, we have
, for all
, and so the sequence
is increasing. Further, there exists
such that,
Thus, given any sequence satisfying
we have, for
Now by the choice of , and so by (41),
Using again the fact that and (40), we have
, and so due to (41), we have
Since , the exists
and so for such a choice of T, (42) and (43) imply the desired result.
D.2. Proof of Theorem 15
We need the following lemma in the proof of Theorem 15:
Lemma 23 For all , there exists
, such that for any sequence
satisfying
Proof Let . We consider the 1-dimensional case (i.e.
set
, for all
. In this case, we have
, for all
. Given any sequence
satisfying
we have, for
Now since , we have, by (45),
Using again the fact that ?T ln T, and so due to (45),
Since , the exists
and so for such a choice of T, (46) and (47) imply the desired result.
Proof of Theorem 15. By Lemma 12, the only candidate algorithms are the ones whose outputs satisfy
, for all
, for some constant
. By Lemma 23, no such algorithms can achieve the desired regret bound.
Proof of Theorem 16. The proof is similar to that of Theorem 7 expect for some changes to account for the fact that the modified FREERANGE wrapper scales the outputs of FREEGRAD.
First, let us review some notation. Let be the total number of epochs and denote by
the start index of epoch i P rks. Further, for
, we define
. In what follows, let w P R be fixed.
Let be the outputs of algorithm 2 and i P rks. In this case, we have
, for all
, and by Theorem 20, with
Recall that at epoch i P rks, the restart condition in Algorithm 2 is triggered at
where the equality follows by the fact that when is satisfied for the first time, it must hold that
(recall that the hints
)); in fact, we have,
From (49), we get that
Summing this inequality over
Using (49) again, we get that
where the inequality follows by the fact that
). We also have
Thus, substituting (56), (55), and (54) into (53), and using the fact that
where in the last inequality, we used the fact that. Now, summing (52) over the last two epochs, yields