Fast Rates for Online Prediction with Abstention

2020·Arxiv

Abstract

Abstract

In the setting of sequential prediction of individual {0, 1}-sequences with expert advice, we show that by allowing the learner to abstain from the prediction by paying a cost marginally smaller than ), it is possible to achieve expected regret bounds that are independent of the time horizon T . We exactly characterize the dependence on the abstention cost c and the number of experts N by providing matching upper and lower bounds of order , which is to be contrasted with the best possible rate of that is available without the option to abstain. We also discuss various extensions of our model, including a setting where the sequence of abstention costs can change arbitrarily over time, where we show regret bounds interpolating between the slow and the fast rates mentioned above, under some natural assumptions on the sequence of abstention costs.

1. Introduction

Consider the problem of online prediction of individual sequences which is one of the first and well-studied online learning models (Cesa-Bianchi and Lugosi, 2006). In this setup, a learner and an environment interact in a sequence of rounds t = 1, . . . , T as follows. In round t, the learner observes the {0, 1}-valued predictions of N experts, denoted as . Based on these observations and possibly some use of randomness, the learner predicts and then the environment reveals . In particular, the environment can be aware of the strategy of the learner but not the random bits used for producing . Having made its prediction, the learner suffers the real-valued loss , which in this paper is chosen as the binary loss The aim of the learner is to minimize its regret which is the difference between the total loss of the prediction strategy and the total loss of the best fixed expert chosen in full knowledge of the sequence of outcomes.

Since the naive upper bound O(T) always holds for the regret, we are aiming for better dependencies. It is well known that it is not possible to guarantee non-trivial bounds on the regret when the learner’s predictions are deterministic. Therefore, we allow the learner to randomize its decisions and measure its performance in terms of the expected regret defined formally as

where the expectation is taken with respect to the randomness injected by the learner and the loss suffered by expert i in round t. As shown in the seminal work of Littlestone and Warmuth (1994), it is possible to construct an algorithm that achieves regret of order setting, and this is essentially the best one we hope for as long as the binary loss together with

randomization is considered—see also the classic works of Hannan (1957), Cover (1965) and Vovk (1990) and the excellent monograph by Cesa-Bianchi and Lugosi (2006). Notably, the result holds without any assumptions on the behaviour of the environment which can be completely adversarial.

One natural direction of research is to understand when it is possible to improve on the worst-case regret guarantees mentioned above. Of particular interest are bounds that are independent on the time horizon T. Bounds of this type are sometimes refereed to as the fast rates in online learning (to be contrasted with the worst-case slow rate mentioned above). There are several conditions on the loss function or the sequence of outcomes that are known to imply fast rates. A classic result due to Haussler, Kivinen, and Warmuth (1998) and Vovk (1990) claims that if the learner is allowed to output and the loss function is convex with sufficient curvature, then there exists a deterministic prediction strategy guaranteeing . A second type of assumptions that can lead to fast rates even in the case of the binary loss is that the outcomes are generated in an i.i.d. manner, rather than being chosen by an adversarial environment. In this favourable setup, assuming that the set of experts and the distribution of satisfy the so-called Bernstein assumption (introduced originally in the statistical learning literature by Bartlett and Mendelson, 2006), Koolen et al. (2016) show that it is possible to obtain intermediate rates between O(log N) and depending on the parameters in the Bernstein assumption. Similar results are well known in the statistical learning setup (see, e.g., Tsybakov, 2004). For an extensive survey on fast rates in online learning, we refer the interested reader to van Erven et al. (2015).

In this paper we show that a simple variation on the basic protocol of online binary prediction also permits regret of O(log N): instead of forcing the learner to output a prediction in {0, 1}, we also allow the learner to abstain from prediction. This model is sometimes referred to as Chow’s reject option model, introduced in the seminal work of Chow (1970). This setup was intensively studied in statistical learning where the learner can output one of three values and the value corresponds to an abstention (see, e.g., the works of Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Bousquet and Zhivotovskiy, 2019). In this model the loss is assumed to be binary on {0, 1} outputs but the price for abstention is assumed to be equal to means that the learner may profit from using the option to abstain whenever they believe that there is no way to predict better than by flipping an unbiased coin. We only assume that c is marginally smaller than (say, 0.49), so that there is no reason for the learner to abstain too often if they want to have a small regret.

We stress that even when deciding to abstain, the learner still gets to observe the true outcome. The first thing to notice regarding the hardness of our setting is that the learner still needs to randomize its decisions in order to achieve sublinear regret. Specifically, one can easily adapt the classic counterexample of Cover (1965) to show that any deterministic strategy will result in a regret of at least for the learner—we provide the details in Appendix A.

We are not the first to explore the effects of allowing abstentions in the context of online learning. The most notable contribution is the work of Cortes, DeSalvo, Gentile, Mohri, and Yang (2018), who extended the model of Herbei and Wegkamp (2006); Bartlett and Wegkamp (2008) from the setting of statistical learning to that of online learning and provided a range of results for both stochastic and adversarial environments. A crucial feature of their setup is that abstention is an action available to the experts: each expert produces a prediction in , which is then used by the learner for producing its their own randomized action. Furthermore, when deciding to abstain, the learner does not get to observe the outcome produced by the environment. In contrast, our setting treats abstention as an action only available to the learner, and the true outcome is always revealed to the learner at the end of the round independently of the learner’s decision. Thus, the setting of Cortes et al. (2018) is in many senses more complicated than ours, and as a result their regret guarantees are weaker: they only match the worst-case rate of . It is unclear if it is possible to refine their techniques to obtain fast rates comparable to the ones we prove. Another work related to ours is that of Zhang and Chaudhuri (2016), who considered an online prediction problem with an abstention option under a realizability assumption and noted the possibility of adapting Chow’s reject option model to the sequential prediction setting as an interesting direction of future work—which is precisely what we address in this paper.

Our pursuit of fast rates is motivated by the recent results of Bousquet and Zhivotovskiy (2019) who have shown for the first time that allowing the learner to abstain from prediction enables fast rates in statistical learning. Notably, their results were obtained without any further assumptions on the data distribution (i.e., no Bernstein condition) or on the curvature of the loss function (i.e., binary loss). Their approach is based on an application of the empirical risk minimization principle on an extended set of classifiers, resulting in a deterministic algorithm guaranteeing an excess risk of at most when the set of classifiers is finite. While our results are of similar flavor, the approach that we put forward in this paper is different in almost every respect from theirs.

Our contributions. Our main contribution the following: For a fixed abstention cost show in Section 2 that there is an algorithm with the regret bound

We also prove that our bound is tight up to multiplicative constant factors if c is bounded away from zero (Section 4). Assuming that the abstention cost changes in time, we prove intermediate rates between . Finally, we provide two natural extensions of our main result in Section 5.

Notation. We define . We also use the standard notation. The indicator of the event A is denoted by . Given integer K we define [K] = {1, . . . , K}.

2. Algorithm and main result

Our algorithm is based on the classic exponentially weighted forecaster of Littlestone and Warmuth (1994) (see also Vovk, 1990; Freund and Schapire, 1997), and is described as follows. In each round t, the algorithm computes the set of weights

for some learning rate , the corresponding probability distribution over all experts

The crucial step in the algorithm design is generating a randomized -valued prediction based on . The key intuition is that the learner should abstain whenever and output a deterministic prediction whenever . We choose to linearly interpolate between these extreme cases by producing a randomized output supported on otherwise. Precisely, our algorithm will base its decision on the probability of the most likely label : it will abstain with probability , and output the most likely label with probability . The procedure is compactly presented as Algorithm 1.

Note that the abstention probability due to the fact that . To gain some intuition about this rule, observe that when assigning the numerical value to abstention, the resulting prediction has expectation . Our main result regarding the performance of our algorithm is the following:

satisfies

Notably, this theorem shows that whenever the abstention cost c is bounded away from 1/2, we can set and our algorithm achieves a regret bound that is independent of the time horizon T. However, when c is very close to 1/2, one may favor to fall back to the standard worst-case regret guarantee of order . This is, however, easily achieved by choosing a conservative value of for this unfavorable case. The following simple corollary of Theorem 1 summarizes the rates achieved by our algorithm in all regimes.

Corollary 2 Setting

We provide the simple proofs of both Theorem 1 and Corollary 2 below. A central object in our analysis is the quantity

often called the mix loss in the literature (Vovk, 1990, 1998; De Rooij et al., 2014). The following classic result highlights the key role of the mix loss in the analysis of the exponentially weighted forecaster:

Lemma 3 For any the cumulative mix loss of the exponentially weighted forecaster satisfies

We include the proof in Appendix B for the sake of completeness. The heart of our analysis is the following lemma that establishes a connection between the mix loss and the loss suffered by our algorithm:

Lemma 4 Suppose that

Proof Let us define the misclassification probability of the plain exponentially weighted forecaster as . With this notation, the mix loss can be written as

On the other hand, the expected loss of our algorithm is

To finish the proof, we define the functions

and note that . We will show that will imply the statement of the theorem. To this end, note that both functions are convex and equal to each other at r = 0 and r = 1, so the desired inequality will hold if the respective derivatives at these two points satisfy . To verify this, first observe that

On the other hand, we have

Thus, the condition on the derivatives we seek is satisfied when

Also observing that both hold for , we can verify that the conditions above are satisfied whenever , as required in the statement of the theorem. This concludes the proof.

Putting Lemmas 3 and 4 together proves Theorem 1. Corollary 2 is proved by first observing that

holds as long as , and then using Hoeffding’s lemma (Lemma 2.2 in Cesa-Bianchi and Lugosi, 2006) that guarantees

Putting this inequality together with Lemma 3, we retrieve the standard bound of the exponentially weighted forecaster: . Combining this bound with the one of Theorem 1 and tuning proves the corollary.

3. Changing abstention costs

Let us now consider a slight variation on our problem where the abstention cost c may depend on time: in round t, the cost of abstention is . The main question we address in this section is whether it is possible to attain fast rates even in the case where can get arbitrarily close to small number of rounds. It is easy to see that our algorithm given in Section 2 satisfies the following regret bound with its proof presented in Appendix B.

Notably, this result asserts that it is not necessary to assume that is strictly bounded away from in order to obtain an improvement over the worst-case regret bound. In fact, we show that a clear improvement is possible when making the following quantitative assumption about the behavior of the abstention costs around

Definition 6 (Tsybakov’s condition for abstention costs) Assume that there are constants such that for any x > 0,

As the name suggests, this condition is directly inspired by the renowned Tsybakov’s margin assumption introduced first by Mammen and Tsybakov (1999) in the context of binary classification and which was further analyzed by Tsybakov (2004). In the context of classification where one has to predict a random label based on a random instance X, Tsybakov’s assumption is used to characterize the distribution of the regression function , which is related to the hardness of the classification problem at hand. Indeed, instances such that the regression function is further away from are “easier” to classify than ones closer to this value, so restricting the den- sity of E [Y | X] around this problematic region can make the overall learning problem easier—for more details, and relations of Tsybakov’s assumption with the so-called Bernstein assumption we refer the interested reader to van Erven et al. (2015). Similarly, our own online prediction problem becomes “harder” when the abstention costs are densely distributed around . By analogy with the case of binary classification, we choose to quantify the density of the quantity of interest around

through Tsybakov’s assumption: larger values of correspond to less frequent values around the margin, whereas smaller values correspond to more problematic points. Similarly, the situation where the abstention cost is strictly bounded away from 1/2 can be seen as an analog of Massart’s margin assumption in the statistical learning setup (Massart and N´ed´elec, 2006).

Assuming that the costs satisfy the inequality (3), our algorithm can be easily seen to achieve the following result:

Corollary 7 Suppose that the sequence of abstention costs satisfies Tsybakov’s margin condition.

Then, setting , the regret of our algorithm satisfies

The proof is immediate given Proposition 5 and the definition of Tsybakov’s condition. We remark that the bound of the very same form as (4) has been recently shown in Koolen et al. (2016) under the assumption that the losses are i.i.d. and that the Bernstein assumption is satisfied. However, our assumption is different and does not imply the Bernstein assumption, even if it leads to similar regret bounds.

One downside of Corollary 7 is that it relies on a choice of the learning rate that requires prior knowledge of the sequence of abstention costs. More generally, the learning rate that minimizes the bound of Proposition 5 is clearly a function of the abstention costs. One may wonder if it is possible to attain a regret guarantee comparable to

without having prior knowledge of the sequence of abstention costs.

We answer this question in the positive by considering a simple variation of our algorithm based on the exponentially weighted forecaster with adaptive learning rates. Precisely, in each round t = 1, 2, . . . , T, we will choose a positive learning-rate parameter , compute the weights , and then use these weights in the same way as our basic algorithm did. The intuition driving our algorithm design is to keep the learning rate as large as possible, and only decrease it when observing high abstention costs. Specifically, we let the number of times that the learning rate has exceeded before round t, and define our learning rate as

The following theorem (proved in Appendix B) establishes a regret bound for this algorithm.

Theorem 8 The regret of Algorithm 2 satisfies

4. Lower bound

In this section we prove that the regret is optimal in the fixed abstention cost setup. Precisely, the main result we present here is the following non-asymptotic lower bound which shows that the bound of Theorem 1 is sharp with respect to N, c and T provided that c is separated away from zero:

Theorem 9 Fix the abstention price . There is a set of experts of size N such that for any randomized online prediction strategy there is a strategy of the environment such that

The proof of this result is based on observing that any randomized online algorithm can be converted to a randomized batch algorithm in a way that the excess risk in the latter setting is closely related to the regret in the former setting. Thus, any lower bound in the batch setting will imply a lower bound on the best achievable regret. An additional crucially important ingredient is relating the cost of abstention to a suitably chosen strongly convex loss function.

First, we recall the standard statistical learning setup. In this setting we consider the instance space X and the label space is a constant. Let be an i.i.d. sequence of points sampled according to some unknown distribution will refer to this sequence as the learning sample and to this setup as the batch setting. Given the loss function of a hypothesis , where the expectation is taken with respect to P. We denote We need the following standard result which can be found in, e.g., Audibert (2009).

Lemma 10 (Online to batch conversion) Consider any randomized online learning algorithm A that produces the possibly randomized 1 hypothesis when the environment uses an i.i.d. sequence . We define the randomized batch algorithm L that, given the data and any new observation X, predicts its label according to the hypothesis chosen uniformly at random among . We denote this randomized hypothesis by the risk of L satisfies

where the expectation on the right-hand side integrates over the random hypothesis chosen by L, the instances , and the randomization of the online algorithm.

Our main technical tool is the following lower bound in the batch setting.

if X contains at least points, there exists a set G of N hypotheses satisfying for any (possibly randomized 2) estimator there exists a probability distribution P on Z such that

where the expectation in is taken with respect to the learning sample, (X, Y ) and the (possible) randomness of the algorithm.

With these results in mind we are ready to provide some intuition on the proof of Theorem 9. For it appears that the construction in Lemma 11 uses only the distributions P such that and the set G of hypotheses taking their values in . Using Lemma 10 based on our online learning algorithm we may construct an estimator taking its values in value 0 corresponds to an abstention. Indeed, it is clear that as long as we consider only the loss is equivalent to binary. However, if the loss will be equal to c for any Y as above. The full proof of Theorem 9 is presented in Appendix B.

5. Extensions

In this section we provide two natural extensions of our results to the setting of multiclass classifi-cation, and the case of binary classification with an infinite set of experts.

5.1. Multiclass classification

We now turn our attention to a generalization of our online prediction setup that allows multiple classes. Specifically, we will now consider the case where the outcomes can take K different values, and the learner’s prediction is evaluated through the binary loss. Formally, our setup can be described as follows:

Our algorithm described in the previous sections can be extended to this setup in a straightforward way. Specifically, let us define for each expert i in the same way as before, and define the aggregated probability assigned to class . Given the above notation, Algorithm 1 can be adapted to the multiclass case by making the following adjustments:

1. Redefine , that is, the label attaining the highest probability, and

2. let and predict with probability with probability

Note that whenever , the algorithm abstains with probability 1. While this may appear surprising at first sight, it is easily justified by observing that the probability of predicting incorrectly in such cases results in a loss greater than , thus one can only win by abstaining by our assumption that . We also note that the lower bound of Theorem 9 continues to hold in the multiclass setting since one can clearly embed a binary classification problem into this more general problem setting. Thus, we cannot expect to have rates faster than in the multiclass case either. The following theorem (proved in Appendix B) establishes an upper bound of this exact order, thus matching the lower bound.

Theorem 12 Suppose that

algorithm for multiclass classification satisfies

5.2. Infinite sets of experts

Let us return to the case of binary classification with the twist that we now allow the set of hypotheses to be infinite. As shown by Ben-David, P´al, and Shalev-Shwartz (2009), the complexity of online prediction in this case is characterized by the so-called Littlestone dimension of the class H of hypothesis—for the precise definition, we refer to Ben-David et al. (2009). Assuming that the Littlestone dimension L of the class is finite, Ben-David et al. show that a clever extension of the exponentially weighted forecaster satisfies the regret bound

In our setup we provide the following regret bound, and defer the proof to Appendix B.

Corollary 13 Assume that H has finite Littlestone dimension , and that the price of abstention is . Then, there is a randomized prediction strategy satisfying

In particular, whenever c is separated from the regret scales as O(log T) which is much better than the standard dependence. We also remark that this result can be seen as an online analog of Theorem 2.2 in Bousquet and Zhivotovskiy (2019), with the Littlestone dimension replacing the VC dimension therein. However, as we mentioned, their algorithm and analysis are based on completely different techniques that take their roots in empirical process theory.

6. Concluding remarks and directions of future research

For the first time in the context of online learning, we have shown that endowing the learner with the option of abstaining from prediction allows a significant improvement in the best achievable rates, and that a very natural algorithm can attain the optimal rates. Notably, these results are proved without making any assumption about the sequence of outcomes that the learner has to predict or about the curvature of the loss functions. In this view, our results represent a new flavor of fast rates that, to the best of our knowledge, haven’t been explored before in the online learning literature. Arguably, one key factor that makes these fast rates possible is that the learner has access to an action that the comparator strategy cannot use. However, this feature alone cannot explain our results: since we only require the extra action to be marginally better than a uniform random prediction, the learner cannot abuse it to easily beat the best expert. How to characterize properties of such powerful additional actions for the learner in more general convex optimization problems remains to be seen.

Our key technical contribution is observing that the expected loss of our algorithm is upper bounded by the mix loss. This technique played a crucial role in proving a number of classic results in the literature, and was traditionally enabled by directly making an assumption about the curvature of the loss function evaluating the learner’s predictions. To our knowledge, our work is the first to make use of this technique without making direct assumptions about the loss. Rather, our approach is based on transforming the output of the exponential weights algorithm into a prediction in a way that effectively replaces the linear loss by a mixable one from the perspective of the algorithm producing . Since this transformation can be applied to the output of any algorithm, we find it plausible that our rates could be improved by a more refined prediction algorithm such as the Aggregating Forecaster of Vovk (1998, 2001). Another possible improvement to pursue is proving high-probability versions of our upper bounds.

Another important technique we employed for proving our bounds in Section 3 was an adaptive stepsize schedule that decreases the learning rate in each round when it proves too large for the loss to be mixable. This technique bears a vague resemblance to the AdaHedge algorithm of Van Erven et al. (2011); De Rooij et al. (2014) that uses a learning rate proportional to the cumulative gap between the expected loss and the mix loss. However, their analysis is restricted to the case where this gap is nonnegative, whereas our bounds crucially use that they can be often negative. As a result, our technique can be directly adapted to account for more general mixable losses with time-dependent mixability parameters. Similarly, our variant of Tsybakov’s condition on the abstention costs can be adapted to mixability parameters to characterize the hardness of learning with a non-stationary sequence of loss functions. For instance, one can easily prove the following result by adjusting our proof techniques in a straightforward way:

Proposition 14 Consider the problem of predicting a sequence of outcomes in [0, 1] under a sequence of loss functions that satisfy the following Tsybakov’s type margin assumption for holds for some Then, there is a deterministic prediction strategy satisfying

This result uses a variant of Tsybakov’s condition that characterizes the behaviour of the second derivative of the loss around zero. The statement is proven by observing that the expected loss is upper-bounded by the mix loss whenever , and then closely following the proof steps in Section 3—we leave this as an easy exercise for the reader. The result can be seen to address an open question of Haussler et al. (1998) about whether intermediate rates between O(log N) and are attainable in online prediction under suitable curvature assumptions on the loss functions. It also bears resemblance to the results of Bartlett, Hazan, and Rakhlin (2008) who provide algorithms that adapt to the curvature of the losses in more general online convex optimization problems, the key difference being that the curvature of our losses can be quantified in terms of exp-concavity rather than strong convexity as done by Bartlett et al..

Finally, we note that while our techniques are different from those of Bousquet and Zhivotovskiy (2019) who prove analogous results for the setting of statistical learning, the two analyses actually related in a very subtle way: both of them are based on techniques that are traditionally used for proving fast rates for losses with curvature, in their respective settings. While our approach is based on bounding the mixability gap as discussed above, the one of Bousquet and Zhivotovskiy is based on an extension of the recently proposed technique of Mendelson (2019) for statistical learning with the square loss. Our lower bound also makes use of this connection. Deeper investigation of this connection between curvature and the abstention may lead to further interesting results.

Acknowledgments

We thank the three anonymous reviewers for their valuable feedback that helped us improve the paper, particularly for catching a small bug in the original proof of Theorem 12. G. Neu was supported by “la Caixa” Banking Foundation through the Junior Leader Postdoctoral Fellowship Programme, a Google Faculty Research Award, and a Bosch AI Young Researcher Award.

References

Jean-Yves Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591–1646, 2009.

Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability theory and related fields, 135(3):311–334, 2006.

Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008.

Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65–72, 2008.

Shai Ben-David, D´avid P´al, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, volume 3, page 1, 2009.

Olivier Bousquet and Nikita Zhivotovskiy. Fast classification rates without standard margin as- sumptions. arXiv preprint arXiv:1910.12756, 2019.

Nicol`o Cesa-Bianchi and G´abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16(1):41–46, 1970.

Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with abstention. In Proceedings of the 35th International Conference on Machine Learning, pages 1059–1067, 2018.

Thomas M. Cover. Behavior of sequential predictors of binary sequences. Technical report, Trans. 4th Prague Conf. Information Theory Statistical Decision Functions, Random Processes, 1965.

Steven De Rooij, Tim Van Erven, Peter D. Gr¨unwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. The Journal of Machine Learning Research, 15(1):1281–1316, 2014.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an appli- cation to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the theory of games, 3: 97–139, 1957.

David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Sequential prediction of individual se- quences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906– 1925, 1998.

Radu Herbei and Marten H. Wegkamp. Classification with reject option. Canadian Journal of Statistics, 34(4):709–721, 2006.

Wouter M Koolen, Peter Gr¨unwald, and Tim van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. In Advances in Neural Information Processing Systems, pages 4457–4465, 2016.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.

Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

Pascal Massart and ´Elodie N´ed´elec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

Shahar Mendelson. An unrestricted learning procedure. J. ACM, 66(6), 2019.

Ohad Shamir. The sample complexity of learning linear predictors with the squared loss. The Journal of Machine Learning Research, 16(1):3475–3486, 2015.

Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

Tim Van Erven, Wouter M. Koolen, Steven D Rooij, and Peter Gr¨unwald. Adaptive hedge. In Advances in Neural Information Processing Systems, pages 1656–1664, 2011.

Tim van Erven, Peter D. Gr¨unwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16(54):1793– 1861, 2015.

Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Proc. USSR Acad. Sci., 1968.

Vladimir Vovk. Aggregating strategies. Proc. of Computational Learning Theory, 1990.

Vladimir Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.

Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.

Chicheng Zhang and Kamalika Chaudhuri. The extended Littlestone’s dimension for learning with mistakes and abstentions. In Conference on Learning Theory, pages 1584–1616, 2016.

Appendix A. Why do we need to randomize?

Example 1 Consider the case of two experts such that one of them always predicts zero and the other always predicts one. Assume that the environment waits for the learners prediction and then provides the opposite label as the true label if the learner predicts zero or one. If the learner decides to abstain the environment provides the label according to the prediction of the best expert on T rounds. In what follows, we show that the regret of any deterministic strategy with abstention will be at least

Assume that out of T rounds the learner used the abstention k times and on the remaining rounds they predicted 0 or 1. The total loss suffered by the learner will be equal to Now we need to upper bound the loss of the best expert. The simple upper bound is . Now, since the environment is aware of the deterministic strategy of the learner we can always guarantee that it reveals the labels such that the loss of the best expert on these k rounds is equal to zero. This implies the number of mistakes made by the best expert is at most . Considering the case we have that the regret is at least we have that the regret is at least

Appendix B. Omitted proofs

Proof of Lemma 3

The proof is based on studying the evolution of the cumulative weights

Reordering gives the result.

Proof of Proposition 5

The proof is based on considering two types of rounds depending on the value of . In the case where is such that is satisfied, we have . Otherwise, we use the general bound as in the proof of Corollary 2. The proof is concluded by appealing to Lemma 3.

Proof of Theorem 8

We begin by studying some properties of the learning rate optimizing the function

with at most T discontinuities, and thus it takes its minimum either at one of its discontinuities or when its derivative is zero. In either case,

with equality when the minimum is achieved where the function is continuous. Note that this implies that is bounded as . Furthermore, the bound also entails Regarding the regret of our algorithm, standard arguments can be used to prove the bound

which actually holds for general nonincreasing learning-rate sequences (see, e.g., Lemma 2 in De Rooij et al., 2014). By the same argument as used to prove Corollary 2, each term in the sum on the right-hand side of the above bound can be bounded as

so that the sum itself can be bounded as

where we used that increases precisely whenever . Now notice that by monotonicity of , which implies that there exists a holds for all , so that we have

After round , the remaining number of times that our algorithm updates the learning rate is , which can be bounded as

where the first inequality uses the monotonicity of each of the summands in and the second step adds a number of non-negative terms to the sum. The final step is the definition of . Altogether, this implies the bound , so the cumulative mixability gap is bounded as

On the other hand, we have

Thus, the total regret of our algorithm is bounded as

Comparing this bound with the lower bound on concludes the proof.

Proof of Theorem 9

Our trick will be to construct a randomized prediction algorithm in the batch setup using an algorithm A for our main setting through the online-to-batch conversion described in Lemma 10. First, we fix and observe that it follows from the proof of Lemma 11 that the class of func- tions G used in the lower bound is a cube of -valued functions consisting of functions and that we are only interested in distributions with . For the exact details implying these two facts we refer to the definition of the hypercube of probability distributions (Definition 8.1 in Audibert, 2009) and the first lines of the proof of Theorem 8.3 and 8.4 where Audibert sets . Finally, his proof uses and this implies that the

for each distribution in the hypercube of distributions.

With these important observations in mind we will construct the following game of prediction with expert advice. For each round t = 1, 2, . . . , T, the environment draws , and each expert (we can naturally identify with experts) predicts

This setup allows us to use our online learning algorithm A to produce a randomized hypothesis and taking its values in and matching the bound of Lemma 10. For each t = 1, 2, . . . , T matching the conditions of Lemma 10 we run any produces its prediction by feeding the expert predictions , obtaining a random prediction , and finally mapping to an element of Y according to . Observe that since the following holds if we fix

At the same time, for any , it holds that . To conclude, we have in this setup that the loss of is binary whenever the output is or equal to c whenever the output is equal to zero regardless of the value of . Using Lemma 10 we have

which implies

This corresponds to the regime where the abstentions can give some gain compared to the slow rate bound . Therefore, provided that since the lower bound of Lemma 11 holds for any -valued estimator (and therefore, for ) we have for some C > 0,

where we used (5). We remark that the bound in Lemma 11 has the factor with the first term being dominant if . However, it can be easily seen that the values of q > 1.031 can be controlled by the choice of the absolute constant C above. Finally, by Jensen’s inequality

This implies that is greater than the expected regret of the best expert. There-

fore, we show that for any randomized algorithm with abstentions there is a random environment

such that the expected regret is provided that

This, of course, implies the existence of the deterministic strategy for the environment. To finish the proof we need to consider the regime

and show that it gives the lower bound of order which does not depend of c anymore. Since the total loss of any algorithm grows monotonically as c approaches , we can prove the lower

automatically for smaller values of q (this corresponds to larger values of c). Fortunately, that value of q is covered by the first part of the proof. It is left to observe that in this case we have that is controlled by up to a multiplicative constant factor. The claim follows.

Proof of Theorem 12

We begin by defining the misclassification probability

and noticing that the mix loss can be again written as

On the other hand, the expected loss of our algorithm is

Notice that this equals . In this case, the error probability satisfies so that the mix loss is can be lower bounded as

where in the first line we used the inequalities hold for , and our assumption that in the last line. Thus, the expected loss lower bounds the mix loss in this case.

Considering the case where otherwise (since in this case we have ). In the first case, we have

while in the second one we have

Summarizing both cases, we have

The proof is concluded by noticing that this function is upper-bounded by the mix loss by the same argument as used in the proof of Lemma 4, and using Lemma 3 to upper-bound the cumulative mix loss.

Proof of Corollary 13

Our key technical tool is the following lemma which can be seen as an online learning analog of the renowned lemma by Vapnik and Chervonenkis (1968).

Lemma 15 (Ben-David et al. (2009)) Fix T and assume that H has the Littlestone dimension L. There is a set of at most

predicting as h on any sequence of length at most T.

Now the proof follows immediately from Corollary 2, Lemma 15 and

. We remark that the results of Section 3 can also be straightforwardly adapted for the sets of experts having finite Littlestone dimension.