b

DiscoverSearch
About
My stuff
Lipschitz and Comparator-Norm Adaptivity in Online Learning
2020·arXiv
Abstract
Abstract

We study Online Convex Optimization in the unbounded setting where neither predictions nor gradient are constrained. The goal is to simultaneously adapt to both the sequence of gradients and the comparator. We first develop parameter-free and scale-free algorithms for a simplified setting with hints. We present two versions: the first adapts to the squared norms of both comparator and gradients separately using Opdq time per round, the second adapts to their squared inner products (which measure variance only in the comparator direction) in time  Opd3qper round. We then generalize two prior reductions to the unbounded setting; one to not need hints, and a second to deal with the range ratio problem (which already arises in prior work). We discuss their optimality in light of prior and new lower bounds. We apply our methods to obtain sharper regret bounds for scale-invariant online prediction with linear models.

Keywords: Online Convex Optimization, Parameter-Free Online Learning, Scale-Invariant Online Algorithms

We consider the setting of online convex optimization where the goal is to make sequential predictions to minimize a certain notion of regret. Specifically, at the beginning of each round  t ě 1, a learner predicts  pwtin some convex set  W Ď Rdin dimension d P N. The environment then reveals a convex loss function  ft : W Ñ R, and the learner suffers loss  ftp pwtq. The goal of the learner is to minimize the regret řTt“1 ftp pwtq ´ řTt“1 ftpwqafter  T ě 1rounds against any “comparator” prediction w P W. Typically, an online learning algorithm outputs a vector  pwt, t ě 1, based only on a sequence of observed sub-gradients  pgsqsăt, where  gs P Bfsp pwsq, s ă t. In this paper, we are interested in online algorithms which can guarantee a good regret bound (by a measure which we will make precise below) against any comparator vector w P W, even when W is unbounded, and without prior knowledge of the maximum norm  L :“ maxtďT }gt}of the observed sub-gradients. In what follows, we refer to L as the Lipschitz constant.

By assuming an upper-bound  D ą 0on the norm of the desired comparator vector w in hindsight, there exist Lipschitz-adaptive algorithms that can achieve a sub-linear regret of order LD?T, without knowing L in advance. A Lipschitz-adaptive algorithm is also called scale-free (or scale-invariant) if its predictions do not change when the loss functions  pftqare multiplied by a factor  c ą 0; in this case, its regret bound is expected to scale by the same factor c. When L is known in advance and W “ Rd, there exists another type of algorithms, so-called parameter-free, which can achieve an rOp}w}L?Tq regret bound, where w is the desired comparator vector in hindsight (the notation rO hides log-factors). Up to an additive lower-order term, this type of regret bound is also achievable for bounded W via the unconstrained-to-constrained reduction (Cutkosky, 2019).

The question of whether an algorithm can simultaneously be scale-free and parameter-free was posed as an open problem by Orabona and P´al (2016b). It was latter answered in the negative by Cutkosky and Boahen (2017). Nevertheless, Cutkosky (2019) recently presented algorithms which achieve an rOp}w}L?T ` L}w}3qregret bound, without knowing either L or  ∥w∥. This does not violate the earlier lower bound of Cutkosky and Boahen (2017), which insists on norm dependence rOp∥w∥q.

Though Cutkosky (2019) designs algorithms that can to some extent adapt to both L and  ∥w∥, their algorithms are still not scale-free. Multiplying  pftq, and as a result  pgtq, by a positive factor c ą 0changes the outputs  p pwtqof their algorithms, and their regret bounds scale by a factor  c1, notnecessarily equal to c. Their algorithms depend on a parameter  ϵ ą 0which has to be specified in advance. This parameter appears in their regret bounds as an additive term and also in a logarithmic term of the form  logpLα{ϵq, for some α ą 1. As a result of this type of dependence on  ϵand the fact that  α ą 1, there is no prior choice of  ϵwhich can make their regret bounds scale-invariant. What is more, without knowing L, there is also no “safe” choice of  ϵwhich can prevent the  logpLα{ϵqterm from becoming arbitrarily large relative to L (it suffices for  ϵto be small enough relative to the “unknown” L).

Contributions. Our main contribution is a new scale-free, parameter-free learning algorithm for OCO with regret at most  Op}w}aVT logp}w}Tqq, for any comparator w P W in a bounded set W, where  VT – řTt“1 }gt}2. When the set W is unbounded, the algorithm achieves the same guarantee up to an additive  OpLamaxtďT Bt ` L}w}3q, where Bt :“ řts“1 }gs}{Lt and Lt :“ maxsďt∥gs∥,for all t P rTs. In the latter case, we also show a matching lower bound; when W is unbounded and without knowing L, any online learning algorithm which insists on an rOp?Tq bound, has regret at least  ΩpL?BT ` L}w}3q. We also provide a second scale-invariant algorithm which replaces the leading  }w}?VTterm in the regret bound of our first algorithm by ?w⊺VT w ln det VT , whereVT :“ řTt“1 gtg⊺t. Our starting point for designing our algorithms is a known potential function which we show to be controlled for a unique choice of output sequence  p pwtq.

As our main application, we show how our algorithms can be applied to learn linear models. The result is an online algorithm for learning linear models whose label predictions are invariant to coordinate-wise scaling of the input feature vectors. The regret bound of the algorithm is naturally also scale-invariant and improves on the bounds of existing state-of-the-art algorithms in this setting (Kotłowski, 2017; Kempka et al., 2019).

Related Work For an overview of Online Convex Optimization in the bounded setting, we refer to the textbook (Hazan, 2016). The unconstrained case was first studied by McMahan and Streeter (2010). A powerful methodology for the unbounded case is Coin Betting by Orabona and P´al (2016a). Even though not always visible, our potential functions are inspired by this style of thinking. We build our unbounded OCO learner by targeting a specific other constrained problem. We further employ several general reductions from the literature, including gradient clipping Cutkosky (2019), the constrained-to-unconstrained reduction Cutkosky and Orabona (2018), and the restart wrapper to pacify the final-vs-initial scale ratio appearing inside logarithms by Mhammedi et al. (2019). Our analysis is, at its core, proving a certain minimax result about sufficient-statistic-based potentials reminiscent of the Burkholder approach pioneered by Foster et al. (2017, 2018). Applications for scale-invariant learning in linear models were studied by Kempka et al. (2019). For our multidimensional learner we took inspiration from the Gaussian Exp-concavity step in the analysis of the MetaGrad algorithm by Van Erven and Koolen (2016).

Outline In Section 2, we present the setting and notation, and formulate our goal. In Section 3, we present our main algorithms. In Section 4, we present new lower-bounds for algorithms which adapt to both the Lipschitz constant and the norm of the comparator. In Section 5, we apply our algorithms to online prediction with linear models.

Our goal is to design scale-free algorithms that adapt to the Lipschitz constant L and comparator norm  ∥w∥. We will first introduce the setting, then discuss existing reductions, and finally state what needs to be done to achieve our goal.

2.1. Setting and Notation

Let  W Ď Rd, d P N,be a convex set, and assume without loss of generality that 0 P W. We allow the set W to be unbounded, and we define its (possibly infinite) diameter  D :“ supw,w1PW }w ´ w1} Pr0, `8s. We consider the setting of Online Convex Optimization (OCO) where at the beginning of each round  t ě 1, the learner outputs a prediction  pwt P W, before observing a convex loss function ft : W Ñ R, or an element of its sub-gradient  gt P Bftp pwtqat  pwt. The goal of the learner is to minimize the regret after  T ě 0rounds, which is given by

image

for any comparator vector w P W. In this paper, we do not assume that T is known to the learner, and so we are after algorithms with so called any-time guarantees. By convexity, we have

image

and thus for the purpose of minimizing the regret, typical OCO algorithms minimize the RHS of (1), which is known as the linearized regret, by generating outputs  p pwtqbased on the sequence of observed sub-gradients  pgtq. Likewise, we focus our attention exclusively on linear optimization.

Given a sequence of sub-gradients  pgtq, it will be useful to define the running maximum gradient norm and the clipped sub-gradients

image

for  t ě 1, with the convention that  L0 “ 0. We also drop the subscript t from  Ltwhen  t “ T, i.e. we write  L for LT .

We denote by Apg1, . . . , gt´1; htqthe output in round  t ě 1of an algorithm A, which uses the observed sub-gradients so far and a hint  ht ě Lton the upcoming sub-gradient  gt. As per Section 1, we say that an algorithm is scale-free (or scale-invariant) if its predictions are invariant to any common positive scaling of the loss functions  pftqand, if applicable, the hints.

Additional Notation. Given a closed convex set  X Ď Rd, we denote by  ΠX pxqthe Euclidean projection of a point  x P Rd on the set X; that is, ΠX pxq P argmin˜xPX }x ´ ˜x}.

2.2. Helpful Reductions

The difficulty behind designing scale-free algorithms lies partially in the fact that  Ltis unknown at the start of round t; before outputting  pwt. The following result due to Cutkosky (2019) quantifies the additional cost of proceeding with the plug-in estimate  Lt´1 for Lt:

Lemma 1 Let A be an online algorithm which at the start of each round  t ě 1, has access to a hint ht ě Lt, and outputs Apg1, . . . , gt´1; htq P W, before observing  gt. Suppose that A guarantees an upper-bound  RAT pwqon its linearized regret for the sequence  pgtqand for all  w P W, T ě 1. Then,algorithm B which at the start of each round  t ě 1 outputs pwt “ Ap¯g1, . . . , ¯gt´1; Lt´1q, guarantees

image

First, we note that Lemma 1 is only really useful when W is bounded; otherwise, depending on algorithm A, the term  maxtPrTs Lt} pwt}on the RHS of (2) could in principle be arbitrarily large even for fixed w, L, and T. The moral of Lemma 1 is that as long as the set W is bounded, one does not really need to know  Ltbefore outputting  pwtto guarantee a “good” regret bound against any w P W. For example, suppose that W has a bounded diameter D and algorithm A in Lemma 1 is such that RAT pwq “ rOp}w}L?T ` DLq, for all w P W. Then, from (2) and the fact that  } pwt} ď D(since

wt P W), it is clear that algorithm B in Lemma 1 also guarantees the same regret bound  RAT pwq upto an additive 2DL, despite not having had the hints  phtq.

It is possible to extend the result of Lemma 1 so that the regret bound of algorithm B remains useful even in the case where W is unbounded. An approach suggested by Cutkosky (2019) is to restrict the outputs  p pwtqof algorithm B to be in a non-decreasing sequence  pWtq of bounded convexsubsets of W. In this case, the diameters  pDtq Ă Rof  pWtqneed to be carefully chosen to achieve a desired regret bound. This approach, which essentially combines the idea of Lemma 1 and the unconstrained-to-constrained reduction due to Cutkosky and Orabona (2018), is formalized in the next lemma (essentially due to Cutkosky (2019)):

Lemma 2 Let algorithm A be as in Lemma 1, and let  pWtqbe a sequence of non-decreasing closed convex subsets of W with diameters  pDtq Ă Rą0. Then, algorithm B which at the start of round

image

We see that compared to Lemma 1, the additional penalty that algorithm B incurs for restricting its predictions to the sets  W1, . . . , WT Ď Wis the sum řTt“1 }gt} ¨ }w ´ ΠWtpwq}. The challenge is now in choosing the diameters  pDtqto control the trade-off between this sum and the term  LDT onthe RHS of (3). If T is known in advance, one could set  D1 “ ¨ ¨ ¨ “ DT “?T, in which case the RHS of (3) is at most

image

We now instantiate the bound of Lemma 2 for another choice of  pDtq when Tis unknown:

Corollary 3 In the setting of Lemma 2, let  Wtbe the ball of diameter  Dt :“ ?maxsďt Bs, t ě 1, where  Bt :“ řts“1 }gs}{Lt, and let W “ Rd. Then the RHS of (3) is bounded from above by

image

We see that by the more careful choice of  pDtqin Corollary 3, one can replace the L?T term in (4) by the smaller quantity  LamaxtPrTs Bt; whether this can be improved further to ?VT, where VT “ řTt“1 }gt}2, was raised as an open question by Cutkosky (2019). We will answer this in the negative in Theorem 14. We will also show in Theorem 15 below that, if one insists on a regret of order rOp?Tq, it is essentially not possible to improve on the penalty  L}w}3 in (5).

2.3. Outlook

The conclusion that should be drawn from Lemmas 1 and 2 is the following; if one seeks an algorithm B with a regret bound of the form rOp}w}L?Tq up to some lower-order terms in T, without knowledge of L and regardless of whether W is bounded or not, it suffices to find an algorithm A which guarantees the sought type of regret whenever it has access to a sequence of hints phtqsatisfying (as in Lemmas 1 and 2),  ht ě Lt, for all t ě 1. Thus, our first goal in the next section is to design a scale-free algorithm A which accesses such a sequence of hints and ensures that its linearized regret is bounded from above by:

image

for all  w P Rd, T ě 0, and  pgtq Ă Rd. We show an analogous “full-matrix” upgrade of order aw⊺V w ln pw⊺V w det V q, with V “ řTt“1 gtg⊺t . We note that if Algorithm A in Lemmas 1 and 2 is scale-free, then so is the corresponding Algorithm B.

If the desired set W has bounded diameter  D ą 0, then using the unconstrained-to-constrained reduction due to Cutkosky and Orabona (2018), it is straightforward to design a new algorithm based on A with regret also bounded by (6) up to an additive LD, for w P W (this is useful for Lemma 1).

Finally, we also note that algorithms which can access hints  phtq such that ht ě Lt, for all t ě 1,are of independent interest; in fact, it is the same algorithm A that we will use in Section 5 as a scale-invariant algorithm for learning linear models.

In light of the conclusions of Section 2, we will design new unconstrained scale-free algorithms which can access a sequence of hints  phtq(as in Lemma 1) and guarantee a regret bound of the form given in (6). In this section, we will make the following assumption on the hints  phtq:

Assumption 1 We assume that (i)  phtqis a non-decreasing sequence; (ii)  ht ě Lt, for all  t ě 1; and (iii) if the sub-gradients  pgsqare multiplied by a factor  c ą 0, then the hints  phtqare multiplied by the same factor c.

The third item of the assumption ensures that our algorithms are scale-free. We note that Assumption 1 is satisfied by the sequence of hints that Algorithm B constructs when invoking Algorithm A in Lemmas 1 and 2. To avoid an uninteresting case distinction, we will also make the following assumption, which is without loss of generality, since the regret is zero while  gt “ 0.

Assumption 2 We assume that  L1 “ ∥g1∥ ą 0.

3.1. FREEGRAD: An Adaptive Scale-Free Algorithm

In this subsection, we design a new algorithm based on a time-varying potential function, where the outputs of the algorithm are uniquely determined by the gradients of the potential function at its iterates—an approach used in the design of many existing algorithms (Cesa-Bianchi et al., 1997).

Let  t ě 1, let pgsqsďt Ă Rd be a sequence of sub-gradients satisfying Assumption 2, and let  phtqbe a sequence of hints satisfying Assumption 1. Consider the following potential function:

image

This potential function has appeared as a by-product in the analyses of previous algorithms such as the ones in (Cutkosky and Orabona, 2018; Cutkosky, 2019). The expression of  Φt in (7)is interesting to us since it can be shown via the regret-reward duality (McMahan and Orabona, 2014) (as we do in the proof of Theorem 6 below) that any algorithm which outputs vectors  p pwtqsuch that  pΦtqis non-increasing for any sequence of sub-gradients  pgtq, also guarantees a regret bound of the form (6). We will now design such an algorithm.

Definition 4 (FREEGRAD) In round t, based on the sequence of past sub-gradients  pgsqsăt and theavailable hint  ht ě Lt, the FREEGRADalgorithm outputs the unconstrained iterate

image

where  pGtq and pVtq are as in (8).

The prediction (9) is forced by our design goal of decreasing potential  Φt ď Φt´1. To see why, observe that at zero  gt “ 0we have  Φt “ Φt´1. The weights  pwtcancel the derivative  ∇gtΦtat gt “ 0, ensuring there is no direction of linear increase (which would present a violation for tiny  gt).

Our main technical contribution in this subsection is to show that, in fact, with the choice of p pwtqtě1 as in (9), the potential functions  pΦtqare non-increasing for any sequence of sub-gradients

image

Theorem 5 For  p pwtq, and pΦtq as in (9), and (7), under Assumptions 1 and 2, we have:

image

The proof of the theorem is postponed to Appendix A. Theorem 5 and the regret-reward duality (McMahan and Orabona, 2014) yield a regret bound for FREEGRAD. In fact, if  ΦT ď Φ0, then by the definition of  ΦT in (7), we have

image

Now by Fenchel’s inequality, we have  ´ΨT pGT q ď xw, GT y ` Ψ‹T p´wq, for all  w P Rd, where Ψ‹T pwq :“ supzPRdtxw, zy ´ ΨT pzqu, w P Rd, is the Fenchel dual of  ΨT(Hiriart-Urruty and Lemar´echal, 2004). Combining this with (10), we obtain:

image

Rearranging (11) for a given  w P Rdleads to a regret bound of  Ψ‹T p´wq ` Φ0. Further bounding this quantity using existing results due to Cutkosky and Orabona (2018); Cutkosky (2019); McMahan and Orabona (2014), leads to the following regret bound (the proof is in Appendix B.1):

Theorem 6 Under Assumptions 1 and 2, for  p pwtq as in (9), we have, with  ln`p¨q :“ 0 _ lnp¨q,

image

for all  w P W “ Rd, T ě 1.

Range-Ratio Problem. While the outputs  p pwtqin (9) of FREEGRAD are scale-free for the sequence of hints  phtqsatisfying Assumption 1, there remains one serious issue; the fractions  VT {h21and  hT {h1inside the log-terms in the regret bound of Theorem 6 could in principle be arbitrarily large if  h1is small enough relative to  hT. Such a problematic ratio has appeared in the regret bounds of many previous algorithms which attempt to adapt to the Lipschitz constant L (Ross et al., 2013; Wintenberger, 2017; Kotłowski, 2017; Mhammedi et al., 2019; Kempka et al., 2019).

When the output set W is bounded with diameter  D ą 0, this ratio can be dispensed of using a recently proposed restart trick due to Mhammedi et al. (2019), which restarts the algorithm whenever Lt{L1 ą řts“1 }gs}{Ls. The price to pay for this is merely an additive OpLDq in the regret bound. However, this trick does not directly apply to our setting since in our case W may be unbounded. Fortunately, we are able to extend the analysis of the restart trick to the unbounded setting where a sequence of hints  phtqsatisfying Assumption 1 is available; the cost we incur in the regret bound is an additive lower-order rOp}w}Lq term. Algorithm 1 displays our restart “wrapper”, FREERANGE, which uses the outputs of FREEGRAD to guarantee the following regret bound (the proof is in Appendix B):

Theorem 7 Let  p pwtqbe the outputs of FREERANGE (Algorithm 1). Then,

image

image

3.2. MATRIX-FREEGRAD: Adapting to Directional Variance

Reflecting on the previous subsection, we see that the potential function that we ideally would like to use is  St ` h1 exp` 12G⊺t V ´1t Gt ´ 12 ln det Vt˘, t ě 1, where Vt “ řts“1 gsg⊺s. However, as we saw, this is a little too greedy even in one dimension, and we need to introduce some slack to make the potential controllable. In the previous subsection we did this by increasing the scalar denominator from  V to V ` ∥G∥, which acts as a barrier function restricting the norm of  pwt. In this section, we will instead employ a hard norm constraint. We will further need to include a fudge factor  γ ą 1multiplying V to turn the above shape into a bona fide potential. To describe its effect, we define

image

The increasing function  ρ satisfies limγÑ1 ρpγq “ 0, limγÑ8 ρpγq “ 1, and ρp2q “ 0.358649.

The potential function of this section is parameterized by a prod factor  γ ą 1(which we will set to some universal constant). We define

image

where  G P Rd, V P Rdˆd, and h ą 0. We introduce the following algorithm to control  Ψ.

Definition 8 (MATRIX-FREEGRAD) In round t, given past sub-gradients  pgsqsătand a hint  ht ěLT , the MATRIX-FREEGRADprediction is obtained from the gradient of  Ψin the first argument by

image

where  Gt´1 “ řt´1s“1 gs and Vt´1 :“ řt´1s“1 gsg⊺s.

We can compute  pwtin  Opd3qtime per round by first computing an eigendecomposition of  Vt´1, followed by a one-dimensional binary search for the  λ‹which achieves the inf in (13) with pG, V , hq “ pGt´1, Vt´1, htq. Then the output is given by

image

Our heavy-lifting step in the analysis is the following, which we prove in Appendix C:

Lemma 9 For any vector  gt P Rd and ht ą 0satisfying  ∥gt∥ ď ht, the vector  pwt in (14) ensures

image

From here, we obtain our main result using telescoping and regret-reward duality:

Theorem 10 Let  Σ´1T :“ γh21I ` γVT . For p pwtq as in (14), we have

image

Note in particular that the result is scale-free. Expanding the main case of the theorem (modest  ∥w∥),we find regret bounded by

image

This bound looks almost like an ideal upgrade of that in Theorem 6, though technically, the bounds are not really comparable since the ln det Q can be as large as d ln T, potentially canceling the advantage of having  w⊺Qwinstead of  }w}2 řTt“1 }gt}2inside the square-root. The matrix Q and hence any directional variance  w⊺Qwis scale-invariant. The only fudge factor in the answer is the γ ą 1. We currently cannot tolerate  γ “ 1, for then ρpγq “ 0so the lower-order term would explode. We note that a bound of the form given in the previous display, with the ln det Q replaced by the larger term d ln tr Q, was achieved by a previous (not scale-free) algorithm due to Cutkosky and Orabona (2018).

Remark 11 As Theorem 7 did in the previous subsection, our restarts method allows us to get rid of problematic scale ratios in the regret bound of Theorem 10; this can be achieved using FREERANGE with  p pwtqset to be as in (14) instead of (9). The key idea behind the proof of Theorem 7 is to show that the regrets from all but the last two epochs add up to a lower-order term in the final regret bound. This still holds when  p pwtqare the outputs of MATRIX-FREEGRAD instead of FREEGRAD, since by Theorem 10, the regret bound of MATRIX-FREEGRAD is of order at most d times the regret of FREEGRAD within any given epoch.

As a final note about the algorithm, we may also develop a “one-dimensional” variant by replacing matrix inverse and determinant by their scalar analogues applied to  VT “ řTt“1∥gt∥2. One effect of this is that the minimization in  λcan be computed in closed form. The resulting potential and corresponding algorithm and regret bound are very close to those of Section 3.1.

Conclusion The algorithms designed in this section can now be used in the role of algorithm A in the reductions presented in Section 2.2. This will yield algorithms which achieve our goal; they adapt to the norm of the comparator and the Lipschitz constant and are completely scale-free, for both bounded and unbounded sets, without requiring hints. We now show that the penalties incurred by these reductions are not improvable.

As we saw in Corollary 3, given a base algorithm A, which takes a sequence of hints  phtq such thatht ě Ltfor all  t ě 1, and which suffers regret  RAT pwqagainst comparator w P W, there exists an algorithm B for the setting without hints which suffers the same regret against w up to an additive penalty  LT }w}3 ` LTamaxtPrTs Bt, where Bt “ řts“1 }gs}{Lt. In this section, we show that the penalty  LT }w}3is not improvable if one insists on a regret bound of order rOp?Tq. We also show that it is not possible to replace the penalty  LTamaxtPrTs Btby the typically smaller quantity  ?VT ,where  VT “ řTt“1 }gt}2. Our starting point is the following lemma:

Lemma 12 For all  t ě 1, past sub-gradients  pgsqsătand past and current outputs  p pwsqsďt P Rd,

image

Proof We want to find  gtsuch that  xgt, pwty ě }wt}Lt{2 ´ St´1, where  St´1 :“ řt´1s“1xgs, pwsy. By restricting  gtto be aligned with  pwt, the problem reduces to finding  x “ }gt} such that

image

The LHS of (15) is a piece-wise linear function in x which goes to infinity as  x Ñ 8. Therefore, there exists a large enough  x ě 0which satisfies (15).

Observe that if  } pwt} ě Dt ą 0, for  t ě 1, then by Lemma 12, there exists a sub-gradient  gtwhich makes the regret against  w “ 0at round t at least  DtL{2. This essentially means that if the sub-gradients  pgtqare unbounded, then the outputs  p pwtqmust be in a bounded set whose diameter will depend on the desired regret bound; if one insists on a regret of order rOp?Tq, then the outputs

wt, t ě 1,must be in a ball of radius at most rOp?Tq.

Cutkosky (2019) posed the question of whether there exists an algorithm which can guarantee a regret bound of order  }w}aVT lnp}w}Tq ` ?VT ln T ` L}w}3, with  VT “ řTt“1 }gt}2, while adapting to both L and }w} (which essentially means replacing  LamaxtPrTs Btin Corollary 3 by ?VT ln T). Here, we ask the question whether  }w}aVT lnp}w}Tq`?VT ln T `L}w}ν is possible for any  ν ě 1. If such an algorithm exists, then by Lemma 12, there exists a constant  b ą 0 such thatits outputs  p pwtq satisfy } pwt} ď b?Vt ln t{Lt, for all t ě 1. The next lemma, when instantiated with α “ 2, gives us a regret lower-bound on such algorithms (the proof is in Appendix D):

Lemma 13 For all  b, c, β ě 0, ν ě 1, and  α Ps1, 2s, there exists  pgtq P Rd, T ě 1, and  w P Rd, such that for any sequence  p pwtqsatisfying  } pwt} ď b ¨aVα,t lnptq{Lαt, for all t P rTs, where

image

By combining the results of Lemma 12 and 13, we have the following regret lower bound for algorithms with can adapt to both L and }w}:

Theorem 14 For any  α Ps1, 2s, c ą 0and  ν ě 1, there exists no algorithm that guarantees, up to multiplicative log-factors in }w} and T, a regret bound of the form  c ¨ pLT }w}ν ` L1´α{2T p}w} `1qaVα,T ln Tq, for all T ě 1, w P Rd, and pgtq Ă Rd, where Vα,T :“ řTt“1 }gt}α.

Proof By Lemma 12, the only candidate algorithms are those whose outputs  p pwtqsatisfy  } pwt} ďbaVα,t lnptq{Lαt, for all  t ě 1, for some constant  b ą 0. By Lemma 13, no such algorithms can achieve the desired regret bound.

The regret lower bound in Theorem 14 does not apply to the case where  α “ 1. In fact, thanks to Corollary 3 and our main algorithm in Section 3 (which can play the role of Algorithm A in Corollary 3), we know that there exists an algorithm B which guarantees a regret bound of order rOpLT }w}3 `}w}aVT lnp}w}Tq`LTamaxtPrTs Btq, where Bt “ řts“1 }gs}{Lt. Next we show that if one insists on a regret bound of order ?BT, or even?T (up to log-factors), the exponent in }w}3 is unimprovable (the proof of Theorem 15 is in Appendix D.2).

Theorem 15 For any  ν P r1, 3rand  c ą 0, there exists no algorithm that guarantees, up to multiplicative log-factors in }w} and T, a regret bound of the form  c ¨ pLT }w}ν ` LT p}w} `1q?T ln Tq, for all T ě 1, w P Rd, and pgtq Ă Rd.

In this section, we consider the setting of online learning of linear models which is a special case of OCO. At the start of each round  t ě 1, a learner receives a feature vector  xt P W “ Rd, then issues a prediction  pyt P Rin the form of an inner product between  xtand a vector  put P Rd, i.e. pyt “ pu⊺t xt.The environment then reveals a label  yt P Rand the learner suffers loss  ℓpyt, pytq, where ℓ: R2 Ñ Ris a fixed loss function which is convex and 1-Lipschitz in its second argument; this covers popular losses such as the logistic, hinge, absolute and Huberized squared loss. (Technically, the machinery developed so far and the reductions in Section 2.2 allow us to handle the non-Lipschitz case).

In the current setting, the regret is measured against the best fixed “linear model”  w P Rd as

image

where the last inequality holds for any sub-gradients  δt P Bp0,1qℓpyt, pytq, t ě 1, due to the convexity of  ℓin its second argument, which in turn makes the function  ftpwq :“ ℓpyt, w⊺xtqconvex for all w P W “ Rd. Here,  Bp0,1qℓdenotes the sub-differential of  ℓwith respect to its second argument. Thus, minimizing the regret in (16) fits into the OCO framework described in Section 2. In fact, we will show how our algorithms from Section 3 can be applied in this setting to yield scale-free, and even rotation-free, (all with respect to the feature vectors  pxtq) algorithms for learning linear models. These algorithms can, without any prior knowledge on  w or pw⊺xtq, achieve regret bounds against any  w P Rd matching (up to log-factors) that of OGD with optimally tuned learning rate.

As in Section 3, we focus on algorithms which make predictions based on observed sub-gradients (gt); in this case,  gt “ xtδt P xt ¨ Bp0,1qℓpyt, pytq “ Bftpputq, t ě 1, where  ftpwq “ ℓpyt, w⊺xtq. Since the loss  ℓis 1-Lipschitz, we have  |δt| ď 1, for all  δt P Bp0,1qℓpyt, pytqand  t ě 1, and so }gt} ď }xt}. Since xtis revealed at the beginning of round  t ě 1, the hint

image

is available ahead of outputting  put, and so our algorithms from Section 3 are well suited for this setting.

Improvement over Current Algorithms. We improve on current state-of-the-art algorithms in two ways; First, we provide a (coordinate-wise) scale-invariant algorithm which guarantees regret bound, against any  w P Rd, of order

image

where  VT,i :“ |x1,i|2 ` řTt“1 δ2t |xt,i|2, i P rds, which improves the regret bound of the current state-of-the-art scale-invariant algorithm SCLNOL1 (Kempka et al., 2019) by aalnp}w}Tq factor. Second, we provide an algorithm that is both scale and rotation invariant with respect to the input feature vectors  pxtqwith a state-of-the-art regret bound; by scale and rotation invariance we mean that, if the sequence of feature vectors  pxtqis multiplied by cO, where  c ą 0and O is any special orthogonal matrix in  Rdˆd, the outputs (pyt) of the algorithm remain unchanged. Arguably the closest algorithm to ours in the latter case is that of Kotłowski (2017) whose regret bound is essentially of order rOp?w⊺ST wqfor any comparator  w P Rd, where  ST “ řTt“1 xtx⊺t. However, in our case, instead of the matrix  ST, we have  VT :“ }x1}2I ` řTt“1 xtx⊺t δ2t, where  δt P Bp0,1qℓpyt, pytq, t ě 1, which can yield a much smaller bound for small  pδtq(this typically happens when the algorithm starts to “converge”).

A Scale-Invariant Algorithm. To design our first scale-invariant algorithm, we will use the outputs (  pwt) of FREEGRAD in (9) with  phtqas in (17), and a slight modification of FREERANGE (see Algorithm 2). This modification consists of first scaling the outputs  p pwtq of FREEGRAD by theinitial hint of the current epoch to make the predictions  ppytqscale-invariant. By Theorem 20 below, the regret bound corresponding to such scaled outputs will have a lower-order term which, unlike in the regret bound of Theorem 6, does not depend on the initial hint. This breaks our current analysis of FREERANGE in the proof of Theorem 7 which we used to overcome the range-ratio problem. To solve this issue, we further scale the output  pwtat round  t ě 1by the sum řτs“1 }xs}{hs, where  τdenotes the first index of the current epoch (see Algorithm 2). Due to this change, the proof of the next theorem differs slightly from that of Theorem 7.

First, we study the regret bound of Algorithm 2 in the case where  W “ R.

Theorem 16 Let  d “ 1and  phtqbe as in (17). If  pputqare the outputs of Algorithm 2, then for all

image

The proof of Theorem 16 is in Appendix E. If  pputqare the outputs of Algorithm 2 in the one-dimensional case, then by Theorem 16 and (16), the algorithm which, at each round  t ě 1, predicts

image

pyt “ xtputhas regret bounded from above by the RHS of (19). Note also that the outputs  ppytqare scale-invariant.

Now consider an algorithm A which at round  t ě 1 predicts pyt “ řdi“1 xt,iput,i, where pput,iq, i Prds, are the outputs of Algorithm 2 when applied to coordinate i; in this case, we will have a sequence of hints  pht,iqfor each coordinate i satisfying  ht,i “ maxsďt |xt,i|, for all  t ě 1. Algorithm A is coordinate-wise scale-invariant, and due to (16) and Theorem 16, it guarantees a regret bound of the form (18). We note, however, that a factor d will appear multiplying the lower-order term  p2`ln BT qin (19) (since the regret bounds for the different coordinates are added together). To avoid this, at the cost of a factor d appearing inside the logarithms in (18), it suffices to divide the outputs of algorithm A by d. To see why this works, see Theorem 20 in the appendix.

A Rotation-Invariant Algorithm. To obtain a rotation and scale-invariant online algorithm for learning linear models we will make use of the outputs of MATRIX-FREEGRAD instead of FREEGRAD. Let  ppytqbe the sequence of predictions defined by

image

with  phtqas in (17) and where  pwtare the predictions of a variant of MATRIX-FREEGRAD, where the leading  h1in the potential (13) is replaced by 1 (we analyze this variant in Appendix C.1).

Theorem 17 Let  γ ą 0 and phtq be as in (17). If ppytq are as in (20), then

image

and  Σ´1T :“ γh21I ` γ řTt“1 gtg⊺t .

Proof It suffices to use (16) and instantiate the regret bound in Theorem 21 with  pϵ, σ´2q “ p1, γh21q.

image

The range-ratio problem manifests itself again in Theorem 17 through the term  detpγh21ΣT q´1. Thiscan be solved using the outputs of Algorithm 2, where in Line 7,  pwtis taken to be as in (20) (see Remark 11).

This work was supported by the Australian Research Council and Data61.

Nicol`o Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. J. ACM, 44(3):427–485, 1997.

Ashok Cutkosky. Artificial constraints and hints for unbounded online learning. In Conference on Learning Theory, pages 874–894, 2019.

Ashok Cutkosky and Kwabena Boahen. Online learning without prior information. In Conference on Learning Theory, pages 643–677, 2017.

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018., pages 1493–1529, 2018.

Tim van Erven and Wouter M. Koolen. Metagrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3666–3674, 2016.

Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Zigzag: A new approach to adaptive online learning. In Proc. of the 2017 Annual Conference on Learning Theory (COLT), pages 876–924, 2017.

Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Online learning: Sufficient statistics and the Burkholder method. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018., pages 3028–3064, 2018.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2 (3–4):157–325, 2016.

Jean-Baptiste Hiriart-Urruty and Claude Lemar´echal. Fundamentals of Convex Analysis. Springer Science & Business Media, 2004.

Michal Kempka, Wojciech Kotłowski, and Manfred K. Warmuth. Adaptive scale-invariant online algorithms for learning linear models. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 3321–3330, 2019.

Wojciech Kotłowski. Scale-invariant unconstrained online learning. In International Conference on Algorithmic Learning Theory, ALT 2017, 15-17 October 2017, Kyoto University, Kyoto, Japan, pages 412–433, 2017.

H. Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pages 1020–1039, 2014.

H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pages 244–256, 2010.

Zakaria Mhammedi, Wouter M. Koolen, and Tim van Erven. Lipschitz adaptivity with multiple learning rates in online learning. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 2490–2511, 2019.

Francesco Orabona and D´avid P´al. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, pages 577–585, 2016a.

Francesco Orabona and D´avid P´al. Open problem: Parameter-free and scale-free online algorithms. In Conference on Learning Theory, pages 1659–1664, 2016b.

Stephane Ross, Paul Mineiro, and John Langford. Normalized online learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 537–545, 2013.

Olivier Wintenberger. Optimal learning with Bernstein online aggregation. Machine Learning, 106 (1):119–141, 2017.

The proof of Theorem 5 relies on the following key lemma:

Lemma 18 For  G, g P R and, V ą 0, define

image

It holds that  ΘpG, V, gq ď 0, for all G P R, V ą 0, and g P r´1, 1s.

Proof For notational simplicity we assume  G ě 0. Let us look at

image

Since ln is increasing, we have that  Θ ď 0, if and only if,  Γ ď 0, and so we want to show  Γ ď 0for all  V ą 0, G ě 0, and g P r´1, 1s. Our approach will be to show that  Γis increasing in V . The result then follows from  limV Ñ8 Γ “ 0. It remains to study the derivative

image

Factoring this as a ratio of polynomials, we obtain:

image

where  αi, i P r5s, are polynomials in g and G whose explicit (yet gruesome) expressions are:

image

α2 “ g2G2

image

α3 “ G

image

Under our assumptions  V ą 0, G ą 0and  g P r´1, 1s, the denominator of  BΓ{BVabove is positive. Furthermore, its numerator, regarded as a polynomial in V , has exclusively positive

image

We want to show that the gap is ≤ 0. Our approach will be to show that gap is increasing, so that we can then bound it by the limit

image

We will show that the numerator is positive by showing that it is a polynomial with only positive coefficients. Here are the coefficients on the monomials Vi for i=0,1,...

image

Figure 1: Mathematica notebook in support of Theorem 5.

coefficients  αi ě 0, as can be verified using computer algebra software (we used Mathematica’s FullSimplify—see Figure 1). This implies that  BΓ{BV ě 0, for all  G P R, V ą 0, and g P r´1, 1s, and so Γ ď limV Ñ8 Γ “ 0.

We are ourselves a bit disgruntled about the opacity of the above proof. On the one hand, it is just a tedious verification of an analytic statement about a function of three scalar variables, and one might expect that tighter statements require more sophisticated techniques (c.f. Kotłowski, 2017, Appendix F). It is quite plausible that positivity may be established in a somewhat more streamlined fashion using Sum-of-Squares techniques. Yet on the other hand, we were hoping to gain, from the proof, a deeper insight into the design of potential functions. Unfortunately this did not materialize. In particular, we still do not know how to address the multi-dimensional case of our Section 3.2 with

a similar potential. Controlling the intuitive upgrade of (7) where the exponential is replaced by

image

is impossible, as witnessed by numerical counterexamples returned by random search, already in dimension 2. We need one more result before we prove Theorem 5:

Lemma 19 Let  G, s P R and V, h ě 0. Then, the function

image

is non-increasing on  tg ě 0 | g2 ` 2s ` G2 ě 0u.

Proof It suffices to show that the function

image

is non-increasing on  tg ě 0 | g2 ` 2s ` G2 ě 0u. Evaluating the derivative of  Ξ, we find that

image

where  N :“ag2 ` G2 ` 2s. The derivative in (21) is non-positive for all  V, h ě 0 and g ě 0 suchthat  g2 ` 2s ` G2 ě 0.

Proof of Theorem 5. We will proceed by induction. By the fact that  }G0} “ 0and the definition of the potential in (7), we have  Φ0 “ h1. Now let  t ě 0, h1 ą 0, and  pSt, Vt, ht, Gtq P R ˆ Rě0 ˆRą0 ˆ Rd. We will show that  Φt`1 ´ Φt ď 0. First, note that for any  ht`1 ě ht, we have

image

Thus, for any  gt`1 P Rd such that }gt`1} ď ht`1, and pS, V, h, G, gq :“ pSt, Vt, ht`1, Gt, gt`1q,

image

Let  g‹be the vector  g P Bhwhich maximizes the RHS of (22), where  Bhis the ball in  Rd of radius h.Suppose that  G ‰ 0, and let H :“ tg P Rd | xg, Gy “ xg‹, Gyu. Note that within the hyperplane H, only the first term on the RHS of (22) varies. Since  g‹is the maximizer of the RHS of (22) within  Bh,instantiating Lemma 19 with  s :“ xg‹, Gyand  G :“ }G}, implies that  g‹ P argmint}g} | g P Hu. Adding this to the fact that H is a hyperplane orthogonal to G implies that  g‹ and Gmust be aligned, i.e. there exists a  c‹ P R such that g‹ “ c‹G{}G}. Therefore, we have  }g‹ ` G} “ |g‹ ` G|, where

image

Further, note that  |g‹| ď h. Thus, the RHS of (22) is bounded from above by

image

Note that  ∆ in (23) can be written in terms of the function  Θ in Lemma 18 as:

image

Since  pG{h, V {h2, g‹{hq P R ˆ Rą0 ˆ r´1, 1s, Lemma 18 implies that  ΘpG{h, V {h2, g‹{hq ď 0, and so due to (23), we also have  ∆ ď 0. Since  ∆is an upper-bound on the RHS of (22), it follows that  Φt`1 ´ Φt ď 0as desired.

B.1. Proof of Theorem 6

The proof of Theorem 6 follows from the next theorem by setting  ϵ “ 1. Theorem 20 essentially gives the regret bound of FREEGRAD if its outputs  p pwtqare scaled by a constant  ϵ ą 0. This will be useful to us later.

Theorem 20 Let  ϵ ą 0, and put :“ pwt{ϵ, for p pwtq as in (9). Then, under Assumptions 1 and 2:

image

for all  w P W “ Rd, T ě 1.

Proof Since the assumptions of Theorem 5 are satisfied, we have

image

Dividing both sides of (24) by  ϵ ą 0and rearranging yields

image

By duality, we further have that

image

Since  ΘT pGq “ ΨT pGq{ϵ, for all  G P Rd, where  ΨTis the function defined in (10), we have by the properties of the Fenchel dual (Hiriart-Urruty and Lemar´echal, 2004, Prop. 1.3.1) that

image

Thus, according to (McMahan and Orabona, 2014, Lemma 3) and the properties of duality (Hiriart- Urruty and Lemar´echal, 2004, Prop. E.1.3.1), we have

image

On the other hand, (Cutkosky and Orabona, 2018, Lemma 18, 19) and (Orabona and P´al, 2016a, Lemma 18) provides the following upper-bound on  ψ‹T puq, u P R, using the Lambert function W (where Wpxq is defined as the principal solution to  WpxqeWpxq “ x):

image

and  cT :“?2VT {phT h21q. Using the fact that the Lambert function satisfies  pWpxqq1{2´pWpxqq´1{2 ďaln` x, for all x ě 0(see Lemma 22), together with (28) and (27) implies that

image

for all  w P Rd. Combining this with (25) and (26) leads to the desired regret bound.

Proof of Theorem 6. Invoke Theorem 20 with  ϵ “ 1.

B.2. Proof of Theorem 7

Proof Fix  w P Rd and let k ě 1be the total number of epochs. We denote by  τi ě 1the start index of epoch i P rks. Further, for  τ, τ 1 P N, we define  ˜τ :“ τ ´ 1and  Vτ:τ 1 :“ h2τ ` ř˜τ 1s“τ }gs}2(note how the upper index is exclusive). Recall that at epoch i P rks, the restart condition in Algorithm 1 is triggered at  t “ τi`1 ą τi only if

image

where the last inequality follows by Assumption 1. We note that (29) also implies that

image

On the other hand, within epoch  i P rks, hthτi ď řt´1s“1}gs}hs ` 2, for all τi ď t ď ˜τi`1, and thus

image

Summing this inequality over  i “ 1, . . . , k ´ 2, we get:

image

Now using (29) at  t “ τi`2, we have for all i P rks,

image

where the inequality  p˚qfollows by Assumption 1. Now by (30), we also have

image

Thus, substituting (35) and (34) into (33), and using the fact that  hτk ď hT , we get:

image

where in the last inequality, we used the fact that?2x ď x ` 1{2, for all x ě 0. Now, summing (32) over the last two epochs, yields

image

Adding (36) and (37) together leads to

image

This concludes the proof.

In this section we work on a version of the potential function that does not have the tuning for Section 3.2 substituted in yet, so that we can prove the result necessary for Section 5 in one go. The potential is parameterized by a prior variance  σ2 ą 0,initial wealth  ϵ ą 0and, as before prod factor γ ą 1. It is defined by

image

C.1. Proof of Lemma 9

We prove the claim in Lemma 9 for the more general potential (38). Let  λ‹ ě 0be the minimizer in the problem  ΨpGt´1, Vt´1, htq. With that notation, we see that  pwt “ ´ΨpGt´1, Vt´1, htq ¨`σ´2I ` γV ` λ‹I˘´1 Gt´1. To prove the lemma, it suffices to prove the stronger statement obtained by picking the sub-optimal choice  λ “ λ‹for the problem  ΨpGt, Vt, htq, and dividing by ΨpGt´1, Vt´1, htq ą 0, i.e.

image

Let us abbreviate  Σ´1 “ σ´2I ` γVt´1 ` λ‹I. The matrix determinant lemma and monotonicity of matrix inverse give

image

Then Sherman-Morrison gives

image

and splitting off the last round  Gt “ Gt´1 ` gt gives

image

All in all, it suffices to show

image

Being a square,  z ě 0is positive. In addition, optimality of  λ‹ensures that  ∥ΣGt´1∥ “ ρpγqht; this follows from the fact that ddλ G⊺t´1pσ´2I ` γV ` λIq´1Gt´1ˇˇλ“λ‹ “ }ΣGt´1}2. In combination with  ∥gt∥ ď ht, we find |r| ď ρpγq ď 1. The above requirement may hence be further reorganized to

image

The convex right hand side is minimized subject to  z ě 0 at

image

so it remains to show

image

The function  ρ in (12) is designed to satisfy the hardest case, where  r “ ´ρpγq, with equality.

C.2. Proof of Theorem 10

We restate the claim for the potential (38) before tuning:

Theorem 21 (Theorem 10 rephrased) Let  Σ´1T :“ σ´2I ` γVT . For p pwtq as in (14), we have

image

Using that  ΨpG, V , hqis decreasing in h, we can telescope to obtain

image

Using the definition reveals  Ψp0, 0, h1q “ ϵ, yielding

image

To transform this into a regret bound, it remains to compute the convex conjugate of the RHS of (39) in  GT. To this end, let

image

The Fenchel dual of this function is

image

where the model complexity is measured for  θ ě 0through the function  Xpθq :“ supα α ´

image

(where Wpxq is defined as the principal solution to  WpxqeWpxq “ x). We will further use that  Xpθq

is increasing, and that it satisfies  Xpθq ďaln` θ(see Lemma 22). Zero derivative of the above

objective for  λoccurs at the pleasantly explicit

image

and hence the optimum for  λis either at that point or at zero, whichever is higher, with the crossover point at ∥w∥2Z ln ∥w∥2Z “ w⊺Qw. Plugging that in, we find that

image

where  C :“ ∥w∥2Z ln ∥w∥2Z. Using that  Xpθqis increasing, we may drop the exponential in its argument in the first case, and obtain

image

Note that this is a curious maximum between  w⊺Qw(the larger for modest w), and the average between that very same term and another quantity that grows super-linearly with  ∥w∥2 (so this is the winner for extreme w).

Okay, now let’s collect everything for the final result and undo the abbreviations. We have

image

where

image

To complete the proof of Theorem 21, it remains to prove the following result.

image

Lemma 22 For  θ ě 0, define  Xpθq :“ supα α ´ e

image

Proof The fact that  Xpθq “ pWpθqq1{2 ´ pWpθqq´1{2follows from (Orabona and P´al, 2016a, Lemma 18). Recall that

image

Hence

image

where we plugged in the sub-optimal choices  η “ 1?ln θ (this requires  θ ě 1) and η “ 1?θ. When we

stick in  η “ 1?lnpee´2`θq we find

image

Note that  ee´2 “ 1.14492. This is less than 2, the value of  θ where?θ{2 ´ 1{?θbecomes positive.

image

D.1. Proof of Lemma 13

Let  c, b, β ě 0, ν ě 1, α Ps1, 2s, and  γ Ps ´ 1, ´α´1r. We consider the 1-dimensional case (i.e. d “ 1) and set  gt “ tγ, for all  t ě 1. Since  ´1 ă γ ă ´1{α, we have  Lt “ L1 “ 1, for all  t ě 1, and so the sequence  paVα,t lnptq{Lαt qis increasing. Further, there exists  p, q ą 0such that,

image

Thus, given any sequence  p pwtq P Rsatisfying

image

we have, for  T ě 1 and w “ ´2baVα,T lnpTq{LαT ,

image

Now by the choice of  w and (40), we have |w| ď 2bq?ln T, and so by (41),

image

Using again the fact that  |w| ď 2bq?ln Tand (40), we have  L1´α{2T p|w| ` 1qaVα,T ln T ď2bq2 ln T ` q?ln T, and so due to (41), we have

image

Since  γ ą ´1, the exists  T ě 1 such that

image

and so for such a choice of T, (42) and (43) imply the desired result.

D.2. Proof of Theorem 15

We need the following lemma in the proof of Theorem 15:

Lemma 23 For all  b, c, β ě 0 and ν P r1, 3r, there exists  pgtq P Rd, T ě 1, and w P Rd, such that for any sequence  p pwtqsatisfying  } pwt} ď b ¨?t ln t, for all t ě 1, we have

image

Proof Let  c, b, β, ě 0, ν P r1, 3r, and α Ps1, 2s. We consider the 1-dimensional case (i.e.  d “ 1) andset  gt “ 1, for all  t ě 1. In this case, we have  Lt “ 1, for all  t ě 1. Given any sequence  p pwtq P Rsatisfying

image

we have, for  T ě 1 and w “ ´2b?T ln T,

image

Now since  |w| “ 2b?T ln T, we have, by (45),

image

Using again the fact that  |w| “ 2b?T ln T and LT “ 1, we have LT p|w|`1q?T ln T “ 2bT ln T `?T ln T, and so due to (45),

image

Since  ν P r1, 3r, the exists  T ě 1 such that

image

and so for such a choice of T, (46) and (47) imply the desired result.

Proof of Theorem 15. By Lemma 12, the only candidate algorithms are the ones whose outputs p pwtqsatisfy  } pwt} ď b?t ln t, for all  t ě 1, for some constant  b ą 0. By Lemma 23, no such algorithms can achieve the desired regret bound.

Proof of Theorem 16. The proof is similar to that of Theorem 7 expect for some changes to account for the fact that the modified FREERANGE wrapper scales the outputs of FREEGRAD.

First, let us review some notation. Let  k ě 1be the total number of epochs and denote by τi ě 1the start index of epoch i P rks. Further, for  τ, τ 1 P N, we define  ˜τ :“ τ ´ 1, Vτ:τ 1 :“|xτ|2 ` ř˜τ 1s“τ |gs|2, and Bτ :“ řτs“1 |xs|{hs. In what follows, let w P R be fixed.

Let  pputqbe the outputs of algorithm 2 and i P rks. In this case, we have  put “ pwt{phτiBτiq, for all  t P tτi, . . . , ˜τi`1u, and by Theorem 20, with  d “ 1, ϵ “ hτiBτi, and gt P xt ¨ Bp0,1qℓpyt, xtputq:

image

Recall that at epoch i P rks, the restart condition in Algorithm 2 is triggered at  t “ τi`1 ě τi only if

image

where the equality follows by the fact that when  p˚qis satisfied for the first time, it must hold that |xt| “ ht(recall that the hints  phtq satisfy (17)); in fact, we have,

image

From (49), we get that

image

Summing this inequality over  i “ 1, . . . , k ´ 2, we get:

image

Using (49) again, we get that

image

where the inequality  p˚qfollows by the fact that řτks“1 |xs|{hs ě k due to (50). We also have

image

Thus, substituting (56), (55), and (54) into (53), and using the fact that  hτk ď hT , we get:

image

where in the last inequality, we used the fact that?2x ď x ` 1{2, for all x ě 0. Now, summing (52) over the last two epochs, yields

image

Adding (57) and (58) together implies the desired result.


Designed for Accessibility and to further Open Science