b

DiscoverSearch
About
My stuff
Model Selection in Contextual Stochastic Bandit Problems
2020·arXiv
Abstract
Abstract

We study bandit model selection in stochastic environments. Our approach relies on a meta-algorithm that selects between candidate base algorithms. We develop a meta-algorithm-base algorithm abstraction that can work with general classes of base algorithms and different type of adversarial meta-algorithms. Our methods rely on a novel and generic smoothing transformation for bandit algorithms that permits us to obtain optimal  O(√T) model selection guarantees for stochastic contextual bandit problems as long as the optimal base algorithm satisfies a high probability regret guarantee. We show through a lower bound that even when one of the base algorithms has O(log T) regret, in general it is impossible to get better than Ω(√T) regret in model selection, even asymptotically. Using our techniques, we address model selection in a variety of problems such as misspecified linear contextual bandits [LSW20], linear bandit with unknown dimension [FKL19] and reinforcement learning with unknown feature maps. Our algorithm requires the knowledge of the optimal base regret to adjust the meta-algorithm learning rate. We show that without such prior knowledge any meta-algorithm can suffer a regret larger than the optimal base regret.

Bandit algorithms have been applied in a variety of decision making and personalization problems in industry. There are many specialized algorithms each designed to perform well in specific environments. For example, algorithms are designed to exploit low variance [AMS09], extra context information, linear reward structure [DHK08; Li+10; AYPS11], sparsity [AYPS12; CM12], etc. The exact properties of the current environment however might not be known in advance, and we might not know which algorithm is going to perform best.

Model selection in contextual bandits aims to solve this problem. More formally, the learner is tasked to solve a bandit problem for which the appropriate bandit algorithm to use is not known in advance. Despite this limitation, the learner does have access to M different algorithms  {Bi}Mi=1, one of which  Bi⋆ is promised to be adequate for the problem the learner wishes to solve. We use regret to measure the learner’s performance1. The problem’s objective is to design algorithms to minimize regret.

The algorithms we develop in this work follow the template of [Aga+17] where a ‘meta-algorithm’ algorithm is placed on top of a couple of ‘base’ algorithms (in this case  {Bi}Mi=1).At the beginning of each round the meta-algorithm selects which base algorithm to ‘listen to’ during that time-step effectively treating the base algorithms as arms to be pulled by the meta-algorithm. The difficulty in using existing algorithms such as UCB or EXP3 [BC12] as a meta-algorithm lies in the non-stationary nature of the rewards collected by a learning base algorithm. The meta-algorithm needs to be sufficiently smart to recognize when a base algorithm is simply performing poorly because it still in the early stages of learning from the case where poor performance is the result of model misspecification.

Adapted and misspecified algorithms We say that an algorithm is adapted to the environment at hand if it satisfies a valid regret guarantee. Let’s illustrate this with an example in the setting of linear bandits with finitely many arms. In this problem the learner has access to K arms. Each arm  i ∈ [K] is associated with a feature vector  zi ∈ Rd,and the reward of arm  i ∈ [K] follows a linear model of the form  ri = ⟨zi, θ⋆⟩ + ξiwhere ξiis conditionally zero mean and  θ⋆is an unknown parameter. An algorithm such as LinUCB [Chu+11] achieves a regret guarantee of order �O(

logarithmic factors in T. In contrast, the UCB algorithm [ACBF02] yields a regret guarantee of order �O(√KT). In this case, both algorithms are well adapted to the problem of linear bandits with finitely many actions, but LinUCB’s regret guarantee may be substantially smaller than UCB’s regret upper bound if d is much smaller than K. If an algorithm is not well adapted, we say it is misspecified. For the sake of exposition let’s assume we are in a similar setting as above, where the learner has access to K arms each of which is associated with a feature vector  zi ∈ Rd. Instead of assuming a linear model as before, let’s instead assume that  ri = (⟨zi, θ⋆⟩)2+  ξiis quadratic. In this case, there is no reason to believe LinUCB can yield a valid regret guarantee since the underlying linearity assumption of LinUCB is violated. We say that in this case LinUCB is misspecified. Consider an instance of LinUCB that instead uses matrix features of the form  ziz⊤i. In this case the quadratic reward is again a linear function of the feature vectors since (⟨zi, θ⋆⟩)2 = ⟨ziz⊤i , θ⋆θ⊤⋆ ⟩. Thusthis version of LinUCB with quadratic features is adapted.

We will assume that all algorithms  Bi for i ∈ [M] are associated with a putative regret guarantee  Ui(t, δ) that is known by the learner and holding with probability 1  − δfor all t ∈ [N] if algorithm i is adaptedto the environment at hand. If the learner knew the identity of the best adapted algorithm  i⋆, it would be able to incur regret of order  Ui⋆(T, δ) by playing  Bi⋆for T time-steps. The learner’s objective in the model selection problem is to design a procedure that would allow a learner to incur in regret that is competitive with the regret upper bound  Ui⋆(t, δ) of the best adapted algorithm among those in  {Bi}Mi=1, so thatthe regret incurred by the learner up to time T scales as a function of T, the parameters defining  Bi⋆(and therefore  Ui⋆(t, δ)) and possibly M. From now on we will refer to each of the M algorithms in  Bi⋆as a base algorithm. We will alert the reader if we have a specific set of M algorithms in mind. In any other case, when we talk about the set of base algorithms we simply mean a set of M algorithms the learner is model selecting from.

The authors of [OM11] were perhaps the first to address the bandit model-selection problem, with a variant of an EXP4 meta-algorithm that works with UCB or EXP3 base algorithms. These results are improved by [Aga+17] via the CORRAL algorithm. The CORRAL algorithm follows the meta-algorithm-base template that we discussed at the beginning of this section. It makes use of a CORRAL meta-algorithm based on a LogBarrier Online Mirror Descent algorithm controlling which of the base algorithms to play at any given round. Let  pt be the M−dimensional probability distribution over the M base algorithms given by the CORRAL meta-algorithm. The learner will sample an algorithm index  jt ∈ [M] with  jt ∼ pt. and play the action prescribed by  Bjtto collect a reward rt. All algorithms  {Bi}Mi=1are then updated using an importance weighted version of  rtregardless of whether they were selected by the meta-algorithm or not.

Unfortunately, this means that in order to use a base algorithm in CORRAL, it needs to be compatible with this importance weighting modification of the rewards. For example, to use UCB as a base, we would need to manually re-derive UCB’s confidence intervals and modify its regret analysis to be compatible with importance weighted feedback. The authors show that a base algorithm can be safely combined with the CORRAL meta-algorithm to yield model selection guarantees provided it satisfies a stability condition (see Definition 3 in [Aga+17]). Verifying that an algorithm satisfies such stability condition is a cumbersome process that requires a detailed analysis of the algorithm’s internal workings. In this work we instead focus on devising a black-box procedure that can solve the model selection problem for a general class of stochastic contextual bandit algorithms. This work introduced the first black-box method for model selection in stochastic contextual bandits, and it has been followed by many others that have expanded and refined these results; most

image

Contributions. We focus on the problem of bandit model-selection in stochastic environments. Our contributions are as follows:

A new model selection algorithm. We introduce Stochastic CORRAL, a two step per round algorithm and an accompanying base “smoothing” wrapper that can be shown to satisfy model selection guarantees when combined with any set of M stochastic contextual bandit algorithms such that at least one of them is adapted and satisfies a high probability regret guarantee. We also show model selection regret guarantees for Stochastic CORRAL with two distinct adversarial meta-algorithms, CORRAL [Aga+17] and EXP3.P [BC12]. Our approach is more general than that of the original CORRAL algorithm [Aga+17] because instead of requiring each base algorithm to be individually modified to satisfy a certain stability condition, our version of the CORRAL algorithm provides the algorithm designer with a generic black-box wrapper that allows to do model selection over any set of M base algorithm with high probability regret guarantees. Stochastic CORRAL has another important difference with respect to the original CORRAL algorithm: instead of importance weighted feedback, the unadulterated reward  rtis sent to algorithm  Bjt, and only this algorithm is allowed to update its internal state at round t. The main consequence of these properties of Stochastic CORRAL is that our model selection strategy can be used with almost any base algorithm developed for stochastic environments. When the learner has knowledge of the function  Ui⋆(t, δ) but not of  i⋆(for example when all the putative upper bounds  Ui(t, δ) are the same), using the CORRAL meta-algorithm achieves optimal regret guarantees. When the optimal target regret is unknown sometimes using a an EXP3.P meta-algorithm can achieve better performance.

A versatile algorithm. We demonstrate the generality and effectiveness of our method by showing how it seamlessly improves existing results or addresses open questions in a variety of problems. We show applications in adapting to the misspecification level in contextual linear bandits [LSW20], adapting to the unknown dimension in nested linear bandit classes [FKL19], tuning the data-dependent exploration rate of bandit algorithms, and choosing feature maps in reinforcement learning. Moreover, our meta-algorithm can simultaneously perform different types of model selection. For example, we show how to choose both the unknown dimension and the unknown mis-specification error at the same time. This is in contrast to algorithms that specialize in a specific type of model selection such as detecting the unknown dimension [FKL19].

Lower bounds. In the stochastic domain, an important question is whether a model selection procedure can inherit the O(log T) regret of a fast stochastic base algorithm. We show a lower bound for the model selection problem that scales as Ω(√T), which implies that our result is minimax optimal. Recall the CORRAL meta-algorithm achieves an oracle optimal model selection regret guarantee provided the algorithm has access to

Ui⋆(t, δ). This begs the question of whether this is an unavoidable statistical limitation of model selection or just a property of the CORRAL meta-algorithm. We show this condition is unavoidable in general: there are problems where the regret of the best base algorithm scales as  O(T x) for an unknown x, and the regret of any meta-algorithm that is unaware of the value of x scales as Ω(T y) for y > x.

We use the notation  δato write the delta distribution at a. For an integer n, we use [n] to denote the set {1, 2, . . . , n}. We consider the following formulation of contextual stochastic bandits. At the beginning of each time-step t, the learner observes a context  Atthat corresponds to a subset of an ‘action-set’ A. After this the learner will select an action at ∈ Atand then collect a reward  rt = f(At, at) +  ξt, a noisy quantity that will depend on the context  At, and the learner’s action  at, a reward function f and a 1−subGaussian conditionally zero mean random noise random variable  ξt. In this work we will restrict ourselves to the case where contexts sets  Atare all subsets of a context generating set A. This is in fact a very general scenario that captures all types of contextual bandit problems ranging from the case of changing linear contexts with linear rewards, to more general contexts and reward sets studied in works such as [FR20]. For simplicity we will assume the contexts  Atare made of the subset of available actions to the learner at time t. Our formulation allows for the action set to vary in size from round to round and even to be infinite. For example, the finite linear bandit setting (where  At = A ⊂ Rdfor all t) fits in this setting. Similarly it is easy to see the linear contextual bandit problem with i.i.d. contexts and K actions can also be written as an instance of our formulation. In the linear contextual bandit problem with  Ktactions (where  Ktmay be infinite) the learner is presented at time t with  Ktaction-vectors  A = {ai}Kti=1with  ai ∈ Rdand the (random) return  raof any action  a ∈ A satisfies ra = ⟨a, θ⋆⟩+ξ. The K−action setting where contexts xt ∈ X(an abstract context set) and the learner has access to K actions per round can also be formulated in this way by defining  A = X × [K] and defining  DSto be a distribution over subsets of the form  {(x, i)}Ki=1 with x ∈ X.

In this work we focus on the setting of stochastic i.i.d. contexts. Let S be the set of all subsets of A and let  DSbe a distribution over S. We assume all contexts  Ati.i.d.∼ DSand that  f : S × A → R. For an arbitrary subset  A ⊂ Awe denote the space of distributions over A as ∆A. A policy  πis a mapping such that for any subset  A ⊂ Ain the support of DSoutputs an element of ∆A. Let’s denote by Π as the space of all policies with domain in Support(DS). We abuse notation and denote  f(A, π) = Ea∼π(X) [f(A, a)]. Notice that in this case  f(A, a) = f(A, δa) for all a ∈ A. We will generally omit the  A ∼ DSdependence from our expectation notation in the future.

In a contextual bandit problem the learner chooses policy  πtat time t, which takes context set  At ∈ Sas an input and outputs a distribution over  At. The learner then selects

an action  at ∼ πt(At) and receives a reward  rt such that rt = f(At, δat) + ξt.We are interested in designing an algorithm with small regret, defined as

image

If for example  Ui(T, δ) = cdi�T log(1/δ) for all i ∈ [M] we would like our algorithm to satisfy a regret guarantee of the form  R(T) ≤ O(Mαdβi⋆�T log(1/δ)) for some α ≥ 0, β ≥ 1and where  i⋆is the index of the best performing adapted base algorithm  Bi⋆. Crucially, we want to avoid this guarantee to depend on other  di > di⋆(if any). From now on we will refer to the policy maximizing the right hand side of the equation above as  π∗. For simplicity we will also make the following assumption regarding the range of f,

Assumption 2.1 (Bounded Expected Rewards). The absolute value of f is bounded by 1,

image

Throughout this work we assume the base algorithms we want to model select from satisfy a high probability regret bound whenever they are well adapted to their environment. We make this more precise in definition 2.1,

Definition 2.1 ((U, δ, T)−Boundedness). Let  U : R ×[0, 1]  → R+. We say an adapted algorithm  B is (U, δ, T)−bounded if with probability at least 1−δand for all rounds  t ∈ [1, T],its cumulative pseudo-regret is bounded above by  U(t, δ): �tl=1 f(Al, π∗)−f(Al, πl) ≤ U(t, δ).

We assume that for all  i ∈ [M] the base algorithm  Bi is (Ui, δ, T)-bounded for a function Uiknown to the learner2. For example in the Multi Armed Bandit Problem with K arms the UCB algorithm is (c�KT log(T/δ), δ, T)−bounded for some universal constant c > 0.

Original CORRAL

We start by reproducing the pesudo-code of CORRAL [Aga+17] (see Algorithm 1) as it will prove helpful in our discussion of our main algorithm: Stochastic CORRAL. As we have explained in the previous section we assume there are M candidate base algorithms and a meta-algorithm which we denote as M. At time-step t the CORRAL meta-algorithm M selects one of the base algorithms in  {Bi}Mi=1according to a distribution  pt ∈ ∆Mby sampling an index  jt ∼ pt. The learner plays action  at ∼ πt,jt(At) and receives reward rt = f(At, δat) + ξt. An importance weighted version of  rtis sent out to all base algorithms, after which all of them update their internal state.

image

In order to better describe the feedback structure of Stochastic CORRAL we abstract the meta-algorithm-base interaction template discussed in Section 1 into Algorithms 4 and 5. As we have mentioned before, one crucial difference between Stochastic CORRAL and CORRAL is that in Stochastic CORRAL only the state of the base algorithm whose action was selected is modified. In contrast in the CORRAL algorithm all the base algorithms’ states are updated during every step.

To make this description more precise we introduce some notation. Each base algorithm Bjmaintains a counter  st,jthat keeps track of the number of times it has been updated up to time t. We will say algorithm  Bjis in ‘state’  st,jat time t. For any base algorithm Bj, πs,jis the policy  Bjuses at state  s. If t1 < t2are two consecutive times when base j is chosen by the meta-algorithm, then base j proposed policy  πst1,j,jat time  t1and policy πst2,j,jduring all times  t1 + 1, . . . , t2 where st2,j = st1,j + 1.

image

Regret Decomposition. Let’s introduce the regret decomposition we will make use of to prove our regret guarantees. This is a similar decomposition as the one appearing in the proofs of Theorem 4,5 and 7 of [Aga+17]. We split the regret R(T) of Equation 1 into two

terms (I and II) by adding and subtracting terms  {f(At, πst,i⋆,i⋆)}Tt=1 :

image

Term I is the regret of the meta-algorithm with respect to the optimal base  Bi⋆, and termII is the regret of the optimal base with respect to the optimal policy  π∗. Analysis of term I is largely based on the adversarial regret guarantees of the Log-Barrier-OMD in CORRAL and of the EXP3.P algorithm.

In order to bound term II we will have to modify the feedback structure of Algorithms 4

and 5. In Algorithm 8 from Section 4 we introduce a smoothing procedure that allows any (U, δ, T)-bounded algorithm to be transformed into a ‘smoothed’ version of itself such that its conditional expected instantaneous regret is bounded with high probability during every even step. We name this procedure ‘smoothing’ because it is based on playing uniformly from the set of previously played policies during the smoothed algorithm’s odd steps. We provide more details in Section 4. For now, the main property we are to use from this discussion is that by smoothing a (U, δ, T)-bounded algorithm it is possible to ensure the conditional expected instantaneous regret of the smoothed algorithm is bounded above by

image

function of  ℓ) when U(ℓ, δ) is concave in  ℓ. In Stochastic CORRAL the smoothing of base algorithms takes the place of the stability condition required by the CORRAL algorithm in [Aga+17].

Let’s sketch some intuition behind why this decreasing instantaneous regret condition can help us bound term II. For all  i ∈ [M] let  {pi1, . . . , piT }be the (random) probabilities used by the Stochastic CORRAL meta-algorithm M (an adversarial meta-algorithm) to chose base i during round  t and let pi = mint pit. Let’s focus on the optimal algorithm  i⋆ andassume  U⋆(t, δ) is convex in  t. Since U⋆(t,δ)tis decreasing, term II is the largest when base  i⋆is selected the least often. For the sake of the argument let’s assume that  pi⋆t = pi⋆ ∀t. In this case base  i⋆will be played roughly  Tpi⋆times, and will repeat its decisions in intervals of length 1pi⋆ , resulting in the following bound:

Lemma 3.1 (informal). If the regret of the optimal base is (U∗, T, δ)-bounded, then we have that

image

We demonstrate the effectiveness of our smoothing transformation by deriving regret bounds with two meta-algorithms: the Log-Barrier-OMD algorithm in CORRAL (introduced by [Aga+17]) which we will henceforth refer to as the CORRAL and EXP3.P (Theorem 3.3 in [BC12]) with forced exploration. The later is a simple algorithm that ensures each base is picked with at least a (horizon dependent) constant probability p. We now state an informal version of our main result, Theorem 4.11.

Theorem 3.2 (informal version of Theorem 4.11). If  U∗(T, δ) = O(c(δ) T α) for some function  c : R → Rand constant  α ∈[1/2, 1) and  B∗is (U⋆, T, δ)-bounded, the regrets of Stochastic CORRAL with an EXP3.P and CORRAL meta-algorithms are:

image

The CORRAL meta-algorithm achieves optimal regret when  αand  c(δ) are known.

When  c(δ) is unknown and  c(δ) > T

an EXP3.P meta-algorithm achieves better regret because �O�T 12−α c(δ)�< �O�T αc(δ)1α�. We complement this result with a couple of lower bounds.

Lower bounds. Theorem 5.3 in Section 5 shows that if the regret of the best base is O(T x), in the worst case a meta-algorithm that does not know x can have regret Ω(T y) with y > x. Theorem 5.2 shows that in general it is impossible for any meta-algorithm to achieve a regret better than Ω(√T) even when the best base has regret O(log(T)). When the regret of the best base is  O(√T), CORRAL with our smoothing achieves the optimal O(√T) regret.

The detailed description of the aforementioned smoothing procedure, its properties and the regret analysis are postponed to Section 4. We also show some applications of our model selection results in Section 6.

Meta-Algorithms

We review the adversarial bandit algorithms used as a Meta-Algorithm in Algorithm 4.

CORRAL Meta-Algorithm

We reproduce the CORRAL meta-algorithm below.

image

EXP3.P Meta-Algorithm

We reproduce the EXP3.P algorithm (Figure 3.1 in [BS12]) below. In this formulation we use  η = 1, γ = 2βk and p = γk.

image

Non-increasing Instantaneous Regret

We introduce a ”smoothing” procedure (Algorithm 8) which, given a (U, δ, T)−bounded algorithm B constructs a smoothed algorithm �Bwith the property that for some time-steps its conditional expected instantaneous regret has a decreasing upper bound. For ease of presentation and instead of making use of odd and even time-steps in the definition of �Bwe assume each round t is split in two types of steps (Step 1 and Step 2). We will denote objects pertaining to the  t−th round step i using a subscript t and a superscript (i). The construction of �Bis simple. The smoothed algorithm maintains an internal copy of the

image

original algorithm B. During step 1 of round t, �Bwill play the action suggested by its internal copy of B. During step 2 of round  t, �Bwill instead sample uniformly from the set of policies previously played by the copy of B maintained by �Bduring steps of type 1 from round 1 to round t.

Let’s define step 2 more formally. If algorithm B is at state s during round t, at step 2 of the corresponding time-step the smoothed strategy will pick an index q in [1, 2, .., s] uniformly at random, and will then re-play the policy B used during step 1 of round q. Since B is (U, δ, T)-bounded we will show in Lemma 4.2 that the expected instantaneous regret of step 2 at round s is at most  U(s, δ)/swith high probability.

image

It is easy to see that if algorithm �Bhas been played  ℓtimes (including step 1 and 2 plays), the internal counter of B equals  ℓ/2. We will make use of this internal counter when we connect a smoothed algorithm with the Stochastic CORRAL meta-algorithm. We now introduce the definition of (U, δ, T (2))−Smoothness which in short corresponds to algorithms that satisfy a high probability conditional expected regret upper bound during steps of type 2.

Definition 4.1 ((U, δ, T (2))−Smoothness). Let  U : R ×[0, 1]  → R+. We say a smoothed algorithm �Bis (U, δ, T (2))−smooth if with probability 1  − δand for all rounds  t ∈ [T], the conditional expected instantaneous regret of Step 2 is bounded above by  U(t, δ)/t:

image

Where �Ft−1 = σ�{A(i)ℓ , �π(i)ℓ , r(i)ℓ , a(i)ℓ }ℓ∈[t−1],i∈{1,2}, ∪{A(1)ℓ , �π(1)ℓ , r(i)ℓ , a(1)ℓ }�is the sigma algebra generated by all contexts, rewards, policies and actions up to time t step 1.

image

During all steps of type 2 algorithm �Breplays the policies it has used when confronted with contexts  A(1)1 , ..., A(1)s . In Lemma 4.2we will use the fact that all contexts are assumed to be generated as i.i.d. samples from the same context generating distribution  DS to showthat �B is (U, δ, T (2))−smooth.

With this objective in mind let’s analyze a slightly more general setting. Let B be a (U, δ, T)−bounded algorithm playing in an environment where the high probability regret upper bound U holds. Let’s assume that B has been played for t time-steps during which it has encountered i.i.d. generated contexts  A1, · · · , Atand has played actions sampled from policies  π1, · · · , πt. Similar to the definition of �Ft−1in Definition 4.1, let’s define Ft−1 = σ�{Aℓ, �πℓ, rℓ, aℓ}ℓ∈[t−1]�, the sigma algebra generated by all contexts, rewards, policies and actions up to time  t −1. We define the “expected replay regret”  Replay(t|Ft−1)as:

image

Where  A′1, · · · , A′tare i.i.d. contexts from  DSall of them conditionally independent from Ft. It is easy to see that the conditional instantaneous regret of a smoothed algorithm �Bduring round t step 2 equals the expected replay regret  Replay(t| �Ft−1) of the Bcopy inside �B.

As a first step in proving that �B is (U, δ, T (2))−smooth in Lemma 4.2 we show the replay regret of a (U, δ, T)-bounded algorithm satisfies a high probability upper bound.

image

Lemma 4.2. If B is (U, δ, T)−bounded with  U(t, δ) > 8

Assumption 2.1, then with probability at least 1  − δ for all t ∈ [T]the expected replay regret of B satisfies:

image

Furthermore, if  δ ≤ 1√T then Replay(t|Ft−1) ≤ 5U(t, δ).

Proof. Let’s condition on the event  E1that B’s plays satisfy the high probability regret guarantee given by U:

image

For all  t ∈ [T] and where  A1, · · · , Atare the contexts algorithm B encountered up to time t. Since B is (U, δ, T)−bounded it must be the case that  P (E1) ≥ 1 − δ.

Let  A′1, · · · , A′t be a collection of t fresh i.i.d. contexts from  DS independent from  Ft. Wenow use martingale concentration arguments to show that �tl=1 f(Al, π∗) ≈ �tl=1 f(A′l, π∗)

image

Since by assumption maxA′,π |f(A′, π)| ≤1 each term in  {M1l } and {M2l }is bounded and satisfies max�|M1l |, |M2l |�≤ 2 for all t. A simple use of Azuma-Hoeffding yields:

image

Summing over all  t, and all i ∈ {1, 2}, using the fact that �Tt=1 1t2 <2 and the union bound implies that for all t, with probability 1  − δ,

image

Denote this event as  E2. We shall proceed to upper bound the replay regret. Let’s condition on  E1 ∩ E2. The following sequence of inequalities holds,

image

For all  t ∈ [T]. Inequality (i) follows by the triangle inequality while (ii) is a consequence of conditioning on  E1 ∩ E2and invoking inequalities 5, 6 and 6. We conclude that with probability at least 1  − 2δand for all  t ∈ [T],

image

image

It is easy to see that in case  δ ≤ 1√T then Replay(t|Ft−1) ≤ 5U(t, δ).

image

In Propositon 4.3 we show that if B is bounded, then �Bis both bounded and smooth. We will then show that several algorithms such as UCB, LinUCB,  ϵ-greedy and EXP3 are (U, δ, T)-bounded for appropriate functions U. By Proposition 4.3 we will then conclude the smoothed versions of these algorithms are smooth.

image

Proposition 4.3. If  U(t, δ) >8

and B is (U, δ, T)−bounded, then �Bis (5U, δ, T (2))−smooth and with probability at least

image

for all  t ∈ [T].

Proof. Let  E1denote the event that �B’s plays during steps of type 1 satisfy the high probability regret guarantee given by U:

image

for all  t ∈ [T]. Since the conditional instantaneous regret of Step 2 of round t equals the average replay regret of the type 1 steps up to t, Lemma 4.2 implies that whenever  E2 holds(see definition for  E2in the proof of Lemma 4.2) which occurs with probability at least 1− δ,the conditional expected instantaneous regret satisfies:  E[f(A′, π∗) − f(A′, π(2)t)| �Ft−1] ≤

image

It is easy to see that if we condition on  E1 ∩ E2the conditional expected instantaneous regret of steps of type 2 satisfy,

image

For all  t ∈ [T]. We now show the regret incurred by �Bsatisfies a high probability upper bound. To bound the regret accrued during time-steps of type 2, consider the following

Martingale difference sequences,

image

As a result of Assumption 2.1,  |Mil | ≤2 for all  i ∈ {1, 2}and therefore a simple use of Azuma-Hoeffding’s inequality,

image

Summing over all t, applying the union bound, using the fact that �Tt=1 1t2 <2 implies that for all  t ∈ [T], with probability 1  − δ,

image

Let’s denote as  E3the event where Equation 11 holds. If  E2 ∩ E3occur, then combining the upper bounds in 10 and 11 we conclude that,

image

combining this last observation with Equation 9, we conclude that for all t with probability at least 1  − 3δ,

image

For all  t ∈ [T]. The result follows.

It remains to show how to adapt the feedback structure of the Stochastic CORRAL meta-algorithm to deal with the two step nature of smoothed algorithms. We reproduce the full pseudo-code of the Stochastic CORRAL meta-algorithm adapted to smoothed

algorithms below,

image

For reasons that have to do with the analysis, Algorithm 9 has a few extra features not present in the meta-algorithm-base template of Algorithm 4. First, whenever the smooth stochastic corral meta-algorithm selects an algorithm  jtit plays it for two steps, thus coinciding with �Bjt’s two time step structure. Second, it updates its distribution  ptusing the feedback 2r(2)t − bjt(st,jt) instead of using the sum  r(1)t + r(2)t . Most notably, the update makes use of a bias adjustment to the reward signal that is not present in the original. The reason behind this modification will become clearer in the regret analysis.

Applications of Proposition 4.3

We now show the smoothed versions of several algorithms satisfy Definition 4.1 by showing they are (U, δ, T)−bounded for an appropriate upper bound function U. We focus on algorithms for the  k−armed bandit setting and the contextual linear bandit setting.

Lemma 4.4 (Theorem 3 in [AYPS11]). In the case of changing and potentially infinite contexts of dimension d, LinUCB is (U, δ, T)-bounded with  U(t, δ) = O(d√t log(1/δ)).

Lemma 4.5 (Theorem 1 in [Chu+11]). In the case of finite linear contexts of size k and dimension d, LinUCB is (U, δ, T)-bounded with  U(t, δ) = O(√dt log3(kT log(T)/δ)).

Lemma 4.6 (Theorem 1 in [Sel+13]). In the k−armed adversarial bandit setting Exp3 is (U, δ, T)−bounded where  U(t, δ) = O(√tk log tkδ ).

Lemma 4.7. In the stochastic  k−armed bandit problem, if we assume the noise  ξtis conditionally 1-sub-Gaussian, UCB is (U, δ, T)-bounded with  U(t, δ) = O(√tk log tkδ ).

image

For the remainder of this section we focus on showing that in the stochastic  k−armed bandit problem, the  ϵ-greedy algorithm (Algorithm 1.2 of [Sli19]) is (U, T, δ)−bounded. At time t the  ϵ-greedy algorithm selects with probability  ϵt = min(c/t,1) an arm uniformly at random, and with probability 1  − ϵtit selects the arm whose empirical estimate of the mean is largest so far. Let’s introduce some notation: we will denote by  µ1, · · · , µkthe unknown means of the K arms use the name  �µ(t)jto denote the empirical estimate of the mean of arm j after using t samples.

Without loss of generality let  µ1be the optimal arm. We denote the sub-optimality gaps as ∆j = µ1 − µjfor all  j ∈ [k]. Let ∆∗be the smallest nonzero gap. We follow the discussion in [ACBF02] and start by showing that under the right assumptions, and for a horizon of size T, the algorithm satisfies a high probability regret bound for all  t ≤ T. Theobjective of this section is to prove the following Lemma:

Lemma 4.8.  If c = 10K log(T 3/γ)∆2∗

with  ϵt = ct is (U, δ, T)−bounded for  δ ≤ ∆2∗T 3 where

image

Proof. Let E(t) = 12k�tl=1 ϵland denote by  Tj(t) the random variable denoting the number of times arm j was selected up to time t. We start by analyzing the probability that a suboptimal arm j > 1 is selected at time t:

image

Let’s bound the second term.

image

The analysis of these two terms is the same. Denote by  T Rj (t) the number of times arm

j was played as a result of a random epsilon greedy move. We have:

image

Inequality a is a consequence of the Azuma-Hoeffding inequality bound. Inequality b follows because �∞l=E+1 exp(−αl) ≤ 1a exp(−αE). Term (1) corresponds to the probability that within the interval [1, · · · , t], the number of greedy pulls to arm j is at most half its expectation. Term (2) is already ”small”. Lets proceed to bound (1). Let  ϵt = min(c/t, 1).with  c = 10k log(T 3/γ)∆2∗for some  γ ∈(0, 1) satisfying  γ ≤ ∆2j. We’ll show that under these assumptions we can lower bound  E(t). If t ≥ 10k log(T 3/γ)∆2∗ :

image

By Bernstein’s inequality (see derivation of equation (13) in [ACBF02]) we can upper bound T Rj (t):

image

Hence for  t ≥ 10k log(T 3/γ)∆2∗ :

image

image

Now we proceed with term (2):

image

The first inequality (a) follows because  E(t) ≥ 5 log(T 3/γ)∆2∗. Inequality (b) follows because by the assumption  γ ≤

By applying the union bound over all arms  j ̸= 1 and time-steps t ≥ 10k log(T 3/γ)∆2∗, we conclude that the probability of choosing a sub-optimal arm  j ≥2 at any time time t for t ≥ 10k log(T 3/γ)∆2∗as a greedy choice is upper bounded by kγT 2 ≤ kγT. In other words after

t ≥ 10k log(T 3/γ)∆2∗rounds, with probability 1  − kγTsub-optimal arms are only chosen as a result of random epsilon greedy move (occurring with probability  ϵt).

A similar argument as the one that gave us Equation 13 can be used to upper bound the probability that  T Rj (t) be much larger than its mean:

image

Using this and the union bound we see that with probability more than 1  − kγTand for all  t ∈ [T] and arms  j ∈ [k], T Rj(t) ≤ 3E(t). Combining this with the observation that after  t ≥ 10k log(T 3/γ)∆2∗and with probability 1  − kγTover all t simultaneously regret is only incurred by random exploration pulls (and not greedy actions), we can conclude that with probability at least 1  − 2kγTsimultaneously for all  t ≥ 10k log(T 3/γ)∆2∗the regret incurred is

image

Term (i) is a crude upper bound on the regret incurred in the first 10k log(T 3/γ)∆2∗ rounds and(ii) is an upper bound for the regret incurred in the subsequent rounds.

Since  E(t) ≤ 20k log(T 3/γ)∆2∗ log(t) we conclude that with probability 1  − 2kγT for all t ≤ Tthe cumulative regret of epsilon greedy is upper bounded by

image

the result follows by identifying  δ = γ/T 3.

image

Lemma 4.8 gives us an instance dependent upper bound for the  ϵ−greedy algorithm. We now show the instance-independent high probability regret bound for  ϵ-greedy:

Lemma 4.9. If  c =10k log( 1δ )∆2∗, then  ϵ−greedy with  ϵt = ctis (δ, U, T)−bounded for  δ ≤ ∆2∗T 3and:

1. U(t, δ) = 16�log( 1δ)t when k = 2.

2. U(t, δ) = 20�k log( 1δ)��Kj=2 ∆j��1/3t2/3 when k > 2.

Proof. Let ∆ be some arbitrary gap value. Let R(t) denote the expected regret up to round t. We recycle the notation from the proof of Lemma 4.8, recall  δ = γ/T 3.

image

When  k = 2, ∆2 = ∆∗and therefore (assuming ∆  < ∆2):

image

second inequality B is satisfied for T large enough. We choose this expression for simplicity of exposition.

When k > 2 notice that we can arrive to a bound similar to 14:

image

The inequality  ξis true for T large enough. We choose this expression for simplicity of exposition.

image

Regret Analysis

In this section we go back to sketch the proof of Theorem 3.2 by explaining how to bound terms I and II in the regret decomposition of Equation 2.

Bounding Term I. Recall that Algorithm 9 only sends the smoothed reward of Step 2 to the meta-algorithm while the base plays and incurs regrets from both Step 1 and Step 2. We show in Section A that this does not affect the regret of the meta-algorithm significantly.

image

with exploration rate  p, E [I] < O(√MT + 1p + MTp).

Bounding Term II. This quantity is the regret of all the policies proposed by the optimal base  i⋆, even during steps when it was not selected by the meta-algorithm. Recall the internal state of any algorithm (including  i⋆) is only updated when selected and played by M. We assume the smoothed base algorithm �Bi⋆satisfies the smoothness and boundedness properties of Definitions 2.1 and 4.1. For the purpose of the analysis we declare that when a smoothed base repeats its policy while not played, it repeats its subsequent Step 2 policy (Algorithm 8). This will become clearer in Section A. Since we select �Bi⋆with probability at least  pi⋆it will be updated at most every 1/pitime-steps and the regret upper bound will be roughly 1pi⋆ Ui⋆(Tpi⋆, δ).

Theorem 4.10. We have that  E [II] ≤ O�E�1pi Ui(Tpi, δ) log T�+ δT(log T + 1)�. Here, the expectation is over the random variable  pi. If  U(t, δ) = tαc(δ) for some  α ∈[1/2, 1) then,  E [II] ≤ �O

Total Regret. Adding Term I and Term II gives us the following worst-case model selection regret bound for the CORRAL meta-algorithm (maximized over  pi⋆and with a chosen  η) and the EXP3.P meta-algorithm (with a chosen p):

Theorem 4.11. If a base algorithm is (U, δ, T)-bounded for  U(T, δ) = T αc(δ) and some α ∈[1/2, 1) and the choice of  δ = 1/T, the regret of the Smooth Stochastic CORRAL (Algorithm 9) where  bj(s) = Uj(s,δ)sis upper bounded by :

In stochastic environments, algorithms such as UCB can achieve logarithmic regret bounds. Our model selection procedure however has a  O(√T) overall regret. In this section, we show that in general it is impossible to obtain a regret better than Ω(√T) even when the optimal base algorithm has 0 regret. In order to formalize this statement, let’s define a model selection problem formally.

image

Table 1: Comparison of model selection guarantees for Stochastic CORRAL between the EXP3.P and CORRAL meta-algorithm. The top row shows the general regret guarantees. The middle row shows the regret guarantees when  α and c(δ) are known. The bottom row shows the regret guarantees when  αis known and  c(δ) is unknown.

Definition 5.1 (Model Selection Problem). We call a tuple ({Bi}Mi=1, Env)a model selection problem where  {Bi}Mi=1 is a set of M base algorithms and Env is a bandit environment4.

Theorem 5.2. Let T ∈ N. For any model selection algorithm there exists a corresponding model selection problem ({B1, B2}, Env)such the regret of this model selection algorithm is lower bounded by  R(T) = Ω� √Tlog(T)�.

Proof. Consider a stochastic 2-arm bandit problem where the best arm has expected reward 1/2 and the second best arm has expected reward 1/4. We construct base algorithms  B1, B2as follows.  B1always chooses the optimal arm and its expected instantaneous reward is 1/2. B2chooses the second best arm at time step t with probability 4c√t+2 log(t+2) (c will bespecified later), and chooses the best arm otherwise. The expected reward at time step t of B2 is 12 − c√t+2 log(t+2).

Let  A∗ be uniformly sampled from {1, 2}. Consider two environments  ν1 and ν2 for themeta-algorithm, each made up of two base algorithms �B1, �B2. Under ν1, �B1 and �B2 are bothinstantiations of  B1. Under  ν2, �BA∗, where  A∗is a uniformly sampled index in {1, 2}, is a copy of  B1 and �B3−A∗is a copy of  B2.

Let  P1, P2denote the probability measures induced by interaction of the meta-algorithm with  ν1 and ν2respectively. Let �BAtdenote the base algorithm chosen by the meta-algorithm at time  t. We have P1(At ̸= A∗) = 12 for all t, since the learner has no information available to identify which algorithm is considered optimal. By Pinskers’ inequality we have

image

By the divergence decomposition [see LS20, proof of Lemma 15.1 for the decomposition

technique] and using that for ∆  < 14 : KL( 12, 12 − ∆) ≤ 3∆2 (Lemma B.3), we have

image

Picking  c =�

bounded by

image

CORRAL needs knowledge of the best base’s regret to achieve the same regret. The following lower bound shows that this requirement is unavoidable:

Theorem 5.3. Let Alg be a model selection algorithm. There exists a model selection problem with two base algorithms where the best base has regret �O(T x) for some 0 < x < 1such that if Alg has no knowledge of x nor of the reward of the best arm, then there exists a potentially different model selection problem where the best base also has regret �O(T x) butthe model selection regret guarantee of Alg is lower bounded by Ω(T y) with y > x.

Proof. Let the set of arms be  {a1, a2, a3}. Let x and y be such that 0  < x < y ≤1. Let ∆ = T x−1+(y−x)/2. Define two environment  E1and  E2with reward vectors {1, 1, 0} and {1 + ∆, 1, 0} for {a1, a2, a3}, respectively. Let  B1 and B2be two base algorithms defined by the following fixed policies when running alone in  E1 or E2:

image

We also construct base  B′2 defined as follows. Let  c2 > 0 and ϵ2 = (y−x)/4 be two constants. Base  B′2mimics base  B2when  t ≤ c2T x−y+1+ϵ2, and picks arm  a1when  t > c2T x−y+1+ϵ2. The instantaneous rewards of  B1and  B2when running alone are  r1t = 1 − T x−1and r2t = 1−T y−1 for all 1 ≤ t ≤ T. Next, consider model selection with base algorithms  B1 andB2 in E1. Let T1 and T2be the number of rounds that  B1 and B2are chosen, respectively.

First, assume case (1): There exist constants c > 0,  ϵ >0,  p ∈(0, 1), and  T0 >0 such that with probability at least  p, T2 ≥ cT x−y+1+ϵ for all T > T0.

The regret of base  B1when running alone for  T rounds is T · T x−1 = T x. The regret of the model selection method is at least

image

Given that the inequality holds for any  T > T0, it proves the statement of the lemma in case (1).

Next, we assume the complement of case (1): For all constants c > 0,  ϵ >0,  p ∈(0, 1), and  T0 >0, with probability at least  p, T2 < cT x−y+1+ϵ for some T > T0.

Let T be any such time horizon. Consider model selection with base algorithms  B1 andB′2 in environment  E2 for Trounds. Let  T ′1 and T ′2 be the number of rounds that  B1 and B′2are chosen. Note that  B2 and B′2 behave the same for  c2T x−y+1+ϵtime steps, and that  B1and  B2never choose action  a1. Therefore for the first  c2T x−y+1+ϵ2time steps, the model selection strategy that selects between  B1and  B′2in  E2behaves the same as when it runs B1and  B2in  E1. Therefore with probability  p > 1/2, T ′2 < c2T x−y+1+ϵ2, which implies T ′1 > T/2.

image

Given that with probability  p > 1/2, T ′1 > T/2, the regret of the learner is lower bounded as,

image

which is larger than the regret of  B′2 running alone because  3x+y4 < x+y2 . The statement of the lemma follows given that for any  T0there exists  T > T0so that the model selection fails.

Misspecified Contextual Linear Bandit

We consider model selection in the misspecified linear bandit problem. The learner selects an action  at ∈ Atand receives a reward  rt such that |E[rt] − a⊤t θ| ≤ ϵ∗ where θ ∈ Rd is anunknown parameter vector and  ϵ∗is the misspecification error. For this problem, [Zan+20] and [LSW20] present variants of LinUCB that achieve a high probability �O(d√T + ϵ∗√dT)regret bound. Both algorithms require knowledge of  ϵ∗, but [LSW20] show a regret bound of the same order without the knowledge of  ϵ∗for the version of the problem with a fixed action set  At = A. Their method relies on G-optimal design, which does not work for contextual settings. It is an open question whether it is possible to achieve the above regret without knowing  ϵ∗for problems with changing action sets.

In this section, we show a �O(d√T + ϵ∗√dT) regret bound for linear bandit problems with changing action sets without knowing  ϵ∗. For problems with fixed action sets, we show an improved regret that matches the lower bound of [LS20].

Given a constant E so that  |ϵ∗| ≤ E, we divide the interval [1, E] into an exponential grid  G = [1, 2, 22, ..., 2log(E)]. We use log(E) modified LinUCB bases, from either [Zan+20] or [LSW20], with each base algorithm instantiated with a value of  ϵin the grid.

Theorem 6.1. For the misspecified linear bandit problem described above, the regret of Stochastic CORRAL with a CORRAL meta-algorithm using learning rate  η = 1√Tdand LinUCB base algorithms with target misspecification level  ϵ ∈ G, is upper bounded by

O(d√T + ϵ∗√dT). In the case of a fixed action linear bandit problem with k arms and √k > d, the regret of Stochastic CORRAL with a CORRAL meta-algorithm using learning rate  η = 1√Tdapplied to a set of base algorithms consisting of one UCB base and one G-optimal base algorithm [LSW20] is upper bounded by �O�min�kd√T, d√T + ϵ∗√dT��.

Proof. From Lemma 4.7, for UCB,  U(T, δ) = O(√Tk log Tkδ ). Therefore from Theorem 4.11, running CORRAL with smooth UCB results in the following regret bound:

image

Maximizing over  ρresults in a regret guarantee of the form �O�√T + 1η + Td2η + ϵ√dT�.For the misspecified linear bandit problem we use M = O(log(T)) LinUCB bases with  ϵdefined in the grid, and choose  η = 1√Td. The resulting regret for Stochastic CORRAL is of

the form �O�√Td + ϵ√dT�.

When the action sets are fixed, by the choice of  η = 1√Td, the regret of Stochastic CORRAL with a CORRAL meta-algorithm over one UCB and one G-optimal base equals:

image

If√k > d, the above expression becomes �O�min�√T kd,√Td + ϵ√dT��

Observe that in the case of a fixed action linear bandit problem, the regret upper bound we achieve for Stochastic CORRAL with a CORRAL meta-algorithm and a learning rate of η = 1√Tdis of the form  �O�min�kd√T, d√T + ϵ∗√dT��. The product of the terms inside

the minimum is of order �O(kT). This result matches the following lower bound that shows that it is impossible to achieve �O(min(√kT, d√T + ϵ∗√dT)) regret:

Lemma 6.2 (Implied by Theorem 24.4 in [LS20]). Let  Rν(T) denote the cumulative regret at time T on environment  ν. For any algorithm, there exists a 1-dimensional linear bandit environment  ν1and a k-armed bandit environment  ν2such that  Rν1(T) · Rν2(T) ≥T(k − 1)e−2.

Experiment (Figure 1). Let d = 2. Consider a contextual bandit problem with k = 50 arms, where each arm j has an associated vector  aj ∈ Rdsampled uniformly at random from [0, 1]d. We consider two cases: (1) For a  θ ∈ Rdsampled uniformly at random from [0, 1]d, reward of arm j at time t is  a⊤j θ + ηt, where  ηt ∼ N(0,1), and (2) There are k parameters  µj for j ∈ [k] all sampled uniformly at random from [0, 10], so that the reward of arm j at time t is sampled from  N(µj,1). We use CORRAL with learning rate  η = 2√Tdand UCB and LinUCB as base algorithm. In case (1) LinUCB performs better while in case (2) UCB performs better. Each experiment is repeated 500 times.

image

Figure 1: CORRAL with UCB and LinUCB bases. Shaded regions denote the standard deviations.

Contextual Bandits with Unknown Dimension

We consider model selection in the nested contextual linear bandit problem studied by [FKL19]. In this problem the context space  A ⊂ RD. Each action is a  D−dimensional vector and each context  Atis a subset of  RD. The unknown parameter vector  θ∗ ∈ RD butonly its first  d∗coordinates are nonzero. Here,  d∗is unknown and possibly much smaller than D. We assume access to a family of LinUCB algorithms  {Bi}Mi=1with increasing dimensionality  di. Algorithm i is designed to ’believe’ the unknown parameter vector  θ∗has only nonzero entries in the first  dientries. In [FKL19] the authors consider the special case when  |At| = k < ∞ for all t. In order to obtain their model selection guarantees they require a lower bound on the average eigenvalues of the covariance matrices of all actions. In contrast, we do not require any such structural assumptions on the context. We provide the first sublinear regret for this problem when the action set is infinite. Further, we have no eigenvalue assumptions and our regret does not scale with the number of actions k.

We use LinUCB with each value of  d ∈[1, 2, 22, ..., 2log(D)] as a base algorithm for CORRAL and EXP3.P. We also consider the case when both the optimal dimension  d∗and the misspecification  ϵ∗are unknown: we use  M = log(E) · log(D) modified LinUCB bases (see the discussion on Misspecified Contextual Linear Bandits above) for each value of (ϵ∗, d∗) in the grid [1, 2, 22, ..., 2log(E)] × [1, 2, 22, ..., 2log(D)].

From Lemma 4.4 and Lemma 4.5, for linear contextual bandit, LinUCB is (U, δ, T)-bounded with  U(t, δ) = O(d√t log(1/δ)) for infinite action sets  U−bounded with  U(t, δ) =O(√dt log3(kT log(T)/δ)) for finite action sets. Choose  δ = 1/Tand ignore the log factor, U(t, δ) = �O(d√t) for infinite action sets and  U(t, δ) = �O(√dt) for finite action sets. Then U(t) = c(δ)tαwith  α = 1/2 and  c(δ) = �O(d) for infinite action sets, and  c(δ) = �O(√d) for finite action sets.

Now consider the misspecified linear contextual bandit problem with unknown  d∗ and ϵ∗.We use the smoothed LinUCB bases [LSW20; Zan+20]. Using the calculation in the proof of Theorem 6.1 in Section 6, using CORRAL with a smooth LinUCB base with parameters (d, ϵ) in the grids results in �O�1η + Td2η + ϵ√dT�regret. Since d is unknown, choosing

η = 1/√Tyields the regret �O�√Td2∗ + ϵ√dT�. Using EXP3.P with a smooth LinUCB base with parameters (d, ϵ) in the grids results in:

image

Since  d∗is unknown, choosing  p = T −1/3yields a �O(T23 d∗ + ϵ∗√dT) regret bound. We

summarize our results in the following table:

image

We study model selection in the setting of non-parametric contextual bandits.[GJ18] consider non-parametric stochastic contextual bandits. At time t and given a context  xt ∈ RD, thelearner selects arm  at ∈ [k] and observes the reward  f(at, xt) +  ξt, where  ξtis a 1-sub-Gaussian random variable and for all  a ∈ [k], the reward function  f(a, ·) is  L−lipschitz in the context  x ∈ RD. It is assumed that the contexts arrive in an IID fashion. [GJ18] obtain a �O�T1+d2+d�regret for this problem. Similar to [FKL19], we assume that only the first  d∗context features are relevant for an unknown  d∗ < D. It is important to find  d∗because  T1+d∗2+d∗ ≪ T1+D2+D. Stochastic CORRAL can successfully adapt to this unknown quantity: we can initialize a smoothed copy of Algorithm 2 of [GJ18] for each value of d in the grid [b0, b1, b2, ..., blogb(D)] for some b > 1 and perform model selection with CORRAL and EXP3.P with these base algorithms.

image

Tuning the Exploration Rate of  ϵ-greedy

We study the problem of selecting for the optimal scaling for the exploration probability in the  ϵ-greedy algorithm. Recall that for a given positive constant  c, the ϵ-greedy algorithm pulls the arm with the largest empirical average reward with probability 1  − c/t, and otherwise pulls an arm uniformly at random. Let  ϵt = c/t. It can be shown that the optimal value for  ϵt is min{1, 5k∆2∗t} where ∆∗is the smallest gap between the optimal arm and the sub-optimal arms [LS20]. With this exploration rate, the regret scales as �O(√T) for k = 2.We would like to find the optimal value of c without the knowledge of ∆∗. In this discussion we show it is possible to obtain such result by applying CORRAL to a set of  ϵ-greedy base algorithms each instantiated with a  c in [1, 2, 22, ..., 2log(kT)].

Theorem 6.3. The regret of CORRAL using smoothed  ϵ-greedy base algorithms defined on the grid is bounded by �O(T 1/2) when k = 2.

image

Figure 2: CORRAL with  ϵ-Greedy bases with different exploration rates. 5

Proof. From Lemma 4.9, we lower bound the smallest gap by 1/T (because the gaps smaller than 1/T will cause constant regret in T time steps) and choose  δ = 1/T 5. From Theorem 4.11, the regret is �O(T 2/3) when k > 2 and �O(T 1/2) when k = 2 with the base running alone.

Next we show that the best value of c in the exponential grid gives a regret that is within a constant factor of the regret above where we known the smallest non-zero gap ∆∗. An exploration rates can be at most  kT. Since 5K∆2∗ >1, we need to search only in the interval [1, KT]. Let  c1be the element in the exponential grid such that  c1 ≤ c∗ ≤ 2c1. Then 2c1 = γc∗ where γ <2 is a constant, and therefore using 2c1 = γc∗ will give a regret up to a constant factor of the optimal regret.

Experiment (Figure 2). Let there be two Bernoulli arms with means 0.5 and 0.45. We use 18  ϵ-greedy base algorithms differing in their choice of c in the exploration rate ϵt = c/t. We take T = 50, 000, η = 20/√T and ϵ’s to lie on a geometric grid in [1, 2T]. Each experiments is repeated 50 times.

Reinforcement Learning

We can instantiate Stochastic CORRAL model selection regret guarantees to the episodic linear MDP setting of [Jin+20], again with nested feature classes of doubling dimension just as in the case of the Contextual Bandits with Unknown Dimension. Let’s formally define a Linear MDP,

image

Figure 3:  ϵ-Greedy vs UCRL2 vs PSRL in the River Swim environment [SL08].

Definition 6.4 (Linear MDP ( Assumption A in [Jin+20])). An episodic MDP (Denoted by the tuple (S, A, H, P, r)) is a linear MDP with a feature map Φ :  S × A → Rd, if for any  h ∈ [H] there exist d unknown (signed) measures  µh = (µ(1)h , · · · , µ(d)h) over S and an unknown vector  θh ∈ Rd, such that for any (s, a) ∈ S × A, we have,

image

The value function for a linear MDP also satisfies a linear parametrization,

Proposition 6.5 (Proposition 2.3 from [Jin+20]). For a linear MDP, and for any policy πthere exist  d−dimensional weights  {wπh}h∈[H]such that for any (s, a, h) ∈ S × A × [H] we have that the value function of policy  π satisfies Qπh(s, a) = ⟨Φ(s, a), wπh⟩.

For the purpose of studying model selection in the setting of linear MDPs we assume access to  D−dimensional feature maps Φ :  S × A → RD. For all policies  πthe unknown parameters  {wπh}h∈[H]are all assumed to have unknown coordinates only in their first d∗dimensions. We assume access to a family of LSVI-UCB (Algorithm 1 of [Jin+20]) algorithms  {Bi}Mi=1with increasing dimensionality  di. Algorithm i is designed to ‘believe’ the unknown parameter vectors  {wπh}h∈[H]has only nonzero entries in the first  dientries for all policies  π.

Theorem 6.6. Let M = (S, A, H, P, r) be a linear MDP parametrized by a feature map {Φ : S × A → RD}. Let {Φi(s, a)}Mi=1 be the family of nested feature maps such that Φi(s, a)corresponds to the top  dientries of Φ(s, a). Assume that for all policies  πthe unknown parameters  {wπh}h∈[H]have nonzero coordinates only in their first  d∗dimensions and that there exists an index  i∗such that  d∗ ≤ di ≤ 2d∗. Selecting among different smoothed LSVI-UCB base algorithms corresponding to the feature maps  {Φi}Mi=1using Stochastic CORRAL with a CORRAL meta-algorithm and  η = M1/2T 1/2d3/2H3/2satisfies a regret guarantee:

R(T)  ≤ �O�√Md3H3T�.

Proof of Theorem 6.6. When well specified the LSVI-UCB algorithm [Jin+20] satisfies the high probability bound �O(√d3H3T) where H is the length of each episode. The result then follows from Theorem 4.11 by setting the CORRAL meta-algorithm learning rate as η = M1/2T 1/2d3/2H3/2 .

We also observe that in practice, smoothing RL algorithms such as UCRL and PSRL and using a CORRAL meta-algorithm on top of them can lead to improved performance. In Figure 3, we present results for the model selection problem among distinct RL algorithms in the River Swim environment [SL08]. We use three different bases,  ϵ−greedy Q−learning with  ϵ = .1, Posterior Sampling Reinforcement Learning (PSRL), as described in [OVR17] and UCRL2 as described in [JOA10]. The implementation of these algorithms and the environment is taken from TabulaRL (https://github.com/iosband/TabulaRL), a popular benchmark suite for tabular reinforcement learning problems. Smooth CORRAL uses a CORRAL meta-algorithm with a learning rate  η = 15√T, all base algorithms are smoothed using Algorithm 8. The curves for UCRL2, PSRL and  ϵ−greedy are all of their un-smoothed versions. Each experiment was repeated 10 times and we have reported the mean cumulative regret and shaded a region around them corresponding to  ±.3 the standard deviation across these 10 runs.

Generalized Linear Bandits with Unknown Link Function

[LLZ17] study the generalized linear bandit model for the stochastic k-armed contextual bandit problem. In round t and given context  xt ∈ Rd×k, the learner chooses arm  itand observes reward  rt = µ(x⊤t,itθ∗) +  ξtwhere  θ∗ ∈ Rdis an unknown parameter vector,  ξtis a conditionally zero-mean random variable and  µ : R → Ris called the link function. [LLZ17] obtain the high probability regret bound �O(√dT) where the link function is known. Suppose we have a set of link functions L that contains the true link function  µ. Since the target regret �O(√dT) is known, we can run CORRAL with the algorithm in [LLZ17] with each link function in the set as a base algorithm. From Theorem 4.11, CORRAL will achieve regret �O(�|L|dT).

Bandits with Heavy Tail

[Sha+18] study the linear stochastic bandit problem with heavy tail. If the reward distribution has finite moment of order 1 +  ϵ∗, [Sha+18] obtain the high probability regret bound �O�T 11+ϵ∗�. We consider the problem when  ϵ∗ ∈(0, 1] is unknown with a known lower bound L where L is a conservative estimate and  ϵ∗could be much larger than L. To the best of our knowledge, we provide the first result when  ϵ∗is unknown. We use the algorithms in [Sha+18] with value of  ϵ∗in the grid [blogb(L), ..., b1, b0] for some 0 < b < 1 as base algorithms with  η = T −1/2 for CORRAL. A direct application of Theorem 4.11 yields regret �O�T 1−0.5bϵ∗�. When  ϵ∗ = 1 (as in the case of finite variance), �O�T 1−0.5bϵ∗�is close to �O�T 0.5�when bis close to 1.

In this work we introduced the Stochastic CORRAL algorithm that successfully combines an EXP3 or CORRAL adversarial meta-algorithm with a wide variety of stochastic base algorithms for contextual bandits and reinforcement learning. We improve the results of the original CORRAL approach [Aga+17] that requires the base algorithms to satisfy a stability condition not often fulfilled by even the simplest stochastic bandit algorithms such as UCB and OFUL. Our approach can make use of the input base algorithms in a fully blackbox fashion without the need of reproving regret bounds for the component base algorithms. This versatility has allowed us to crack several open problems ranging from algorithms that adapt to the misspecification level in linear contextual bandits to the effective dimension in non-parametric problems.

[AYPP20] Yasin Abbasi-Yadkori, Aldo Pacchiano, and My Phan. “Regret Balancing for Bandit and RL Model Selection”. In: arXiv preprint arXiv:2006.05491 (2020) (Cited on page 4).

[AYPS11] Yasin Abbasi-Yadkori, D´avid P´al, and Csaba Szepesv´ari. “Improved Algorithms for Linear Stochastic Bandits”. In: Advances in Neural Information Processing Systems. 2011, pp. 2312–2320 (Cited on pages 2, 17).

[AYPS12] Yasin Abbasi-Yadkori, David P´al, and Csaba Szepesv´ari. “Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2012 (Cited on page 2).

[Aga+17] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. “Corralling a Band of Bandit Algorithms”. In: Conference on Learning Theory. PMLR. 2017, pp. 12–38 (Cited on pages 24, 6, 810, 34, 41, 49).

[AMM21] Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. “Corralling Stochastic Bandit Algorithms”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2021, pp. 2116–2124 (Cited on page 4).

[AMS09] Jean-Yves Audibert, R´emi Munos, and Csaba Szepesv´ari. “Exploration-Exploitation Tradeoff Using Variance Estimates in Multi-Armed Bandits”. In: Theoretical Computer Science 410.19 (2009), pp. 1876–1902 (Cited on page 2).

[ACBF02] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. “Finite-Time Analysis of the Multiarmed Bandit Problem”. In: Machine learning 47.2 (2002), pp. 235– 256 (Cited on pages 2, 18, 19).

[BC12] S´ebastien Bubeck and Nicol`o Cesa-Bianchi. “Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems”. In: CoRR abs/1204.5721 (2012). arXiv: 1204.5721 (Cited on pages 2, 4, 10).

[BS12] S´ebastien Bubeck and Aleksandrs Slivkins. “The Best of Both Worlds: Stochastic and Adversarial Bandits”. In: Conference on Learning Theory. PMLR. 2012, pp. 42–1 (Cited on pages 11, 42).

[CM12] Alexandra Carpentier and R´emi Munos. “Bandit Theory Meets Compressed Sensing for High Dimensional Stochastic Linear Bandit”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2012, pp. 190–198 (Cited on page 2).

[Chu+11] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. “Contextual Bandits with Linear Payoff Functions”. In: International Conference on Artificial Intelligence and Statistics. PMLR, 2011, pp. 208–214 (Cited on pages 2, 17).

[Cut+21] Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, and Manish Purohit. “Dynamic Balancing for Model Selection in Bandits and RL”. In: International Conference on Machine Learning. PMLR. 2021, pp. 2276–2285 (Cited on page 4).

[DHK08] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. “Stochastic Linear Optimization under Bandit Feedback”. In: Conference on Learning Theory. PMLR. 2008 (Cited on page 2).

[FR20] Dylan Foster and Alexander Rakhlin. “Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles”. In: International Conference on Machine Learning. PMLR. 2020, pp. 3199–3210 (Cited on page 5).

[FKL19] Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. “Model Selection for Contextual Bandits”. In: Advances in Neural Information Processing Systems. 2019, pp. 14741–14752 (Cited on pages 1, 4, 29, 30).

[GJ18] Melody Guan and Heinrich Jiang. “Nonparametric Stochastic Contextual Bandits”. In: AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018 (Cited on page 30).

[JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. “Near-Optimal Regret Bounds for Reinforcement Learning”. In: Journal of Machine Learning Research 11.Apr (2010), pp. 1563–1600 (Cited on page 33).

[Jin+20] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. “Provably Efficient Reinforcement Learning with Linear Function Approximation”. In: Conference on Learning Theory. PMLR. 2020, pp. 2137–2143 (Cited on pages 3133).

[LS20] Tor Lattimore and Csaba Szepesv´ari. Bandit Algorithms. Cambridge University Press, 2020 (Cited on pages 24, 27, 28, 30).

[LSW20] Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. “Learning with Good Feature Representations in Bandits and in RL with a Generative Model”. In: International Conference on Machine Learning. PMLR. 2020, pp. 5662–5670 (Cited on pages 1, 4, 26, 27, 29).

[Lee+21] Jonathan Lee, Aldo Pacchiano, Vidya Muthukumar, Weihao Kong, and Emma Brunskill. “Online Model Selection for Reinforcement Learning with Function Approximation”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2021, pp. 3340–3348 (Cited on page 4).

[Li+10] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. “A Contextual Bandit Approach to Personalized News Article Recommendation”. In: International conference on World wide web. 2010, pp. 661–670 (Cited on page 2).

[LLZ17] Lihong Li, Yu Lu, and Dengyong Zhou. “Provably Optimal Algorithms for Generalized Linear Contextual Bandits”. In: International Conference on Machine Learning. PMLR. 2017, pp. 2071–2080 (Cited on page 33).

[OM11] Maillard Odalric and R´emi Munos. “Adaptive Bandits: Towards the Best History-Dependent Strategy”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2011, pp. 570–578 (Cited on page 3).

[OVR17] Ian Osband and Benjamin Van Roy. “Why is Posterior Sampling Better than Optimism for Reinforcement Learning?” In: International Conference on Machine Learning. PMLR. 2017, pp. 2701–2710 (Cited on page 33).

[PDG22] Aldo Pacchiano, Christoph Dann, and Claudio Gentile. “Best of Both Worlds Model Selection”. In: Advances in Neural Information Processing Systems (2022) (Cited on page 4).

[Pac+20] Aldo Pacchiano, Christoph Dann, Claudio Gentile, and Peter Bartlett. “Regret Bound Balancing and Elimination for Model Selection in Bandits and RL”. In: arXiv preprint arXiv:2012.13045 (2020) (Cited on page 4).

[Sel+13] Yevgeny Seldin, Csaba Szepesvari, Peter Auer, and Yasin Abbasi-Yadkori. “Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments”. In: European Workshop on Reinforcement Learning. 2013 (Cited on page 17).

[Sha+18] Han Shao, Xiaotian Yu, Irwin King, and Michael R Lyu. “Almost Optimal Algorithms for Linear Stochastic Bandits with Heavy-Tailed Payoffs”. In: Advances in Neural Information Processing Systems. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. 2018, pp. 8420–8429 (Cited on page 33).

[Sli19] Aleksandrs Slivkins. “Introduction to Multi-Armed Bandits”. In: arXiv preprint arXiv:1904.07272 (2019) (Cited on page 17).

[SL08] Alexander L Strehl and Michael L Littman. “An Analysis of Model-Based Interval Estimation for Markov Decision Processes”. In: Journal of Computer and System Sciences 74.8 (2008), pp. 1309–1331 (Cited on pages 32, 33).

[Zan+20] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. “Learning Near Optimal Policies with Low Inherent Bellman Error”. In: International Conference on Machine Learning. PMLR. 2020, pp. 10978–10989 (Cited on pages 26, 27, 29).

Bounding term I

When the base algorithms are not chosen, they repeat their step 2’s policy to ensure that the conditional instantaneous regret is decreasing. To ensure the decreasing conditional instantaneous regret serves its purpose, when the base algorithms are chosen by the meta-algorithm, we only send step 2’s rewards to the meta-algorithm as feedback signals. This is to ensure that the sequence of rewards the meta-algorithm is competing against satisfies the decreasing instantaneous regret condition. However, since the bases play and incur regrets from both step 1 and step 2 when they are chosen, we must account for the difference between the reward of step 1 and step 2 (that the bases incur when they play the arms), and 2 times the reward of step 2 (what the bases send to the meta-algorithm as feedback signals).

Since we assume all base algorithms to be smoothed and satisfy a two step feedback structure, we also denote by  π(j)tas the policy used by the meta-algorithm during round t, step j. Term I, the regret of the meta-algorithm with respect to base  i⋆can be written as:

image

The reader should keep in mind the meta-algorithm is updated only using the reward of Step 2 of base algorithms even though the bases play both step 1 and 2. Let  Tibe the random subset of rounds when M choose base �Bi, (it = i) for all  i ∈ [M]. Adding and subtracting terms  {f(A(1)t , π(2)t )}Tt=1 we see that:

image

Equality (i) holds because term I0equals zero (recall for all  t ∈ Ti⋆ algorithm i⋆ is chosenby the meta-algorithm) and therefore I0 = I′0and in all steps  t ∈ Tci⋆, base  i⋆repeated a policy of Step 2 so that I1 = I′1. Equality (ii) follows by adding and subtracting term IB. Term  E [IA]is the regret of the meta-algorithm with respect to base  i⋆. Term  E [IB]accounts for the difference between the rewards of Step 1 and Step 2 (that the bases incur) and 2 times the rewards of Step 2 (that the bases send to the meta-algorithm). We now focus on bounding  E [IA] and E [IB].

Biased step 2’s rewards.We set the bias functions to  bj(s) = U(s,δ)sin Algorithm 9. This will become useful to control  E [IB]. Instead of sending the meta-algorithm the unadulterated 2r(2)t,jfeedback, at all time step t, all bases will send the following modified feedback:

image

This reward satisfies:

image

Define the modified rewards �f(A, π(2)t,j ) = f(A, π(2)t,j ) − bj(st,j) for all j ∈ [M] and context A. Let’s write IA + IBin terms of these �f.

image

IA + IB =

image

Where inequality (i) holds because �Tt=12b(st,jt) − �t∈Tci⋆ b(st,jt) ≤0. In the coming

discussion we’ll show that this modification allows us to control term �IB. In the following two sections we will control  E��IA�and  E��IB�. We will control  E��IA�by using standard arguments from the adversarial bandits literature. We will also show that with high probability  E��IB�≤ 8�MT log( 4TMδ ). The use of the biased rewards  �r(2)t,jallows us to ensure the collected reward during steps of type 1 plus the bias terms vs the collected reward of steps of type 2 is close to zero. Without these bias terms, bounding term IB mayprove problematic since the rewards of steps of type 1 may be smaller than the collected rewards of steps of type 2. In this case  E[IB] may give rise to a regret term dependent on the putative regret upper bounds of all algorithms  j ∈ [M] and not only on  U⋆(T, δ).

Bounding term  E��IA�

Let’s start by noting that after taking expectations,

image

The modification of the bases’ rewards in Equation 16 modifies both the bases rewards as well as the comparator. Since both meta-algorithms CORRAL and EXP3.P are k-armed bandit adversarial algorithms, their worst-case performance guarantees hold for this biased pseudo-reward sequence.

CORRAL Meta-Algorithm

image

We can bound Equation 18 using Lemma 13 from [Aga+17]. Indeed, in term �IA, the policy choice for all base algorithms  { �Bm}Mm=1during any round t is chosen before the value of itis revealed. This ensures the estimates 2r(2)tpittand 0 for all  i ̸= itare indeed unbiased estimators of the base algorithm’s rewards. We conclude:

image

EXP3.P Meta-Algorithm

Since  E [IA]is the regret of base i with respect to the meta-algorithm, it can be upper bounded by the k-armed bandit regret of the meta-algorithm with M arms. Choose η = 1, γ = 2kβin Theorem 3.3 in [BS12], we have that if  p ≤ 12k, the regret of EXP3.P:

image

Bounding  E��IB�

Notice that:

image

Substituting the definition of �f(A(2)t , π(2)t) and  bj(st,j) back into the expectation for E��IB�becomes:

image

Equality (1) follows by noting  Tci⋆ = ∪j̸=i⋆Tj. Inequality (2) follows because by Lemma B.1, we have  Uj(sT,j, δ) ≤ �sT,js=1U(s,δ)sfor all  j ∈ [M]. If the  j−th algorithm was adapted to the environment, then with high probability satisfies the following bound:

image

Inequality (A) follows because by definition  f(A(2)t , π∗) ≥ f(A(2)t , π(2)t ) and (B) because if Bjis adapted to the environment it satisfies a high probability regret bound. Let Adapt = {j ∈ [M] s.t. j is adapted }. Let’s rewrite the upper bound for  E��IB�from Equation 19

as a sum of terms corresponding to base algorithms  j ∈ Adapt and j ∈ [M]\Adapt.

image

Equation 20 implies that with probability at least 1  − |Adapt| δ,

image

We are left with controlling the component of the upper bound of  E��IB�that runs over misspecified algorithms. When  Bjis not adapted, Equation 20 may or may not hold. In order to ensure we are able to control  E��IB�we will make sure that algorithms that violate Equation 20 by a large margin are dropped by the meta-algorithm. Since it is impossible to compute the terms  f(A(2)t , π(2)t ) − f(A(2)t , π∗) and f(A(1)t , π∗) − f(A(1)t , π(1)t) directly, we instead rely on the following test:

Base Test. Let  Tj(l) be the set of time indices in [l] when the meta-algorithm chose to play base j. We drop base �Bjif at any point during the history of the algorithm,

image

Let’s start by showing that with high probability �t∈Tj(l) r(2)t,j − r(1)t,j is a good estimator of �t∈Tj(l) f(A(2)t , π(2)t,j ) − f(A(2)t , π∗) + f(A(1)t , π∗) − f(A(1)t , π(1)t,j ) for all j ∈ [M].

As a simple consequence of the Azuma-Hoeffding martingale bound and Assumption 2.1, with probability at least 1  − δ/Mand for all  ℓ ∈ [T] and for any  j ∈ [M]:

image

Combining Equation 22 and Equation 23 we get, with probability at least 1  − δMfor all l ∈ [T] and for any  j ∈ [M]:

image

Equation 20 holds for all  j ∈ Adaptwith probability at least 1  − |Adapt| δ. Combining this result with Equation 24 we conclude that with probability at least 1  − |Adapt|(1 + 1M )δfor all  j ∈Adapt and all  ℓ ∈ [T],

image

for all  ℓ ∈ [T]. Thus with high probability no well adapted algorithm will be eliminated.

Let’s now show that for all  j ∈ [M]\Adaptthe contribution of �Bjto  E��IB�while the test of Equation 21 has not been triggered is small. If Equation 21 holds for algorithm j ∈ [M] (even if �Bjis not adapted), then Equation 24 implies that with probability at least 1  − δM :

image

The last inequality holds because �j̸=i⋆�|Tj| ≤√TM. And therefore,

image

Bounding term II

Recall term II equals:

image

We use  nit to denote the number of rounds base i is chosen up to time  t for all i ∈ [M].Let  tl,ibe the round index of the  l−th time the meta-algorithm chooses algorithm  Bi and letbl,i = tl,i − tl−1,i with t0,i = 0 and tniT +1,i = T + 1. Let Ti ⊂ [T] be the set of rounds where base i is chosen and  Tci = [T]\Ti. For  S ⊂ [T] and  j ∈ {1, 2}, we define the regret of the i−th base algorithm during Step  j of rounds S as R(j)i (S) = �t∈S f(A(j)t , π∗)−f(A(j)t , π(j)t,i ).

The following decomposition of E [II] holds:

image

R(1)i⋆(Ti⋆) consists of the regret when base  i⋆was updated in step 1 while the remaining 3 terms consists of the regret when the policies are reused by step 2.

Biased step 2’s rewards

Note that we modified the rewards of step 2 as defined in Equation 16, both when the base is chosen and not chosen. We now analyze the effect of this modification:

image

We provided a bound for term I-modified at the beginning of Section A. In this section we concern ourselves with II−modified. Notice its expectation can be written as:

image

Now the second part of this sum is easy to deal with as it can be incorporated into the bound of E [II] by slightly modifying the bound given by Equation 28 below and changing 2bl − 1 to 2bl+ 1. The rest of the argument remains the same.

image

From this section onward we drop the subscript  i⋆whenever clear to simplify the notations. In this section we show an upper bound for Term II when there is a value  pi⋆ ∈ (0, 1) thatlower bounds  pi1, · · · , pi⋆T with probability 1. We then use the restarting trick to extend the proof to the case when  piis random in Theorem 4.10

Lemma A.1 (Fixed  pi⋆).Let  pi⋆ ∈(0, 1) be such that 1ρi⋆ = pi⋆ ≤ pi⋆1 , · · · , pi⋆Twith probability one, then,  E [II] ≤ 4ρi⋆ Ui(T/ρi⋆, δ) log T + δT.

Proof of Lemma A.1. Since E [II] ≤ E [1{E}II]+δT, we focus on bounding E [1{E}II]. since base  i is (U, T, δ)−bounded, E�R(1)i⋆ (Ti)1(E)�≤ E�Ui⋆(δ, ni⋆T )1(E)�. We proceed to bound the regret corresponding to the remaining terms in II0:

image

The multiplier 2bl −1 arises because the policies proposed by the base algorithm during the rounds it is not selected by M satisfy  π(1)t,i⋆ = π(2)t,i⋆ = π(2)tl,ifor all  l ≤ nTi⋆+ 1 and t = tl−1+ 1, · · · , tl −1. The factorization is a result of conditional independence between E�r(2)tl,i⋆|Ftl−1�and  E�bl|Ftl−1�where  Ftl−1already includes algorithm �Bi⋆update right after round  tl−1. The inequality holds because �Bi⋆ is (Ui⋆, δ2M , T (2))−smooth and therefore satisfies Equation 3 on event E. Recall that as a consequence of Equation 27 we have

image

The first term is bounded by  E�Ui⋆(ni⋆T , δ)1(E)�while the second term satisfies the bound

in (28). Let  ul = Ui⋆(l,δ/2M)l . By Lemma B.1, �tl=1 ul ≥ Ui⋆(t, δ/M) for all t, and so,

image

By (28) and (29),

image

Let  al = E [bl]for all l. Consider a meta-algorithm that uses  pi⋆instead of  pi⋆t. In this new process let  t′lbe the corresponding rounds when the base is selected, ¯ni⋆Tbe the total number of rounds the base is selected, and  cl = E�t′l − t′l−1�. Since  pi⋆ ≤ pi⋆tfor all t it holds that �jl=1 al ≤ �jl=1 clfor all j. If we use the same coin flips used to generate  tlto generate  t′l, we observe that  t′l ⊂ tland ¯ni⋆T ≤ ni⋆T. Let  f : R →[0, 1] be a decreasing function such that for integer  i⋆, f(i⋆) = ui⋆. Then �ni⋆T +1l=1 aluland �¯ni⋆T +1l=1 clulare two estimates of integral� T0 f(x)dx. Given that  t′l ⊂ tl and ulis a decreasing sequence in l,

image

and thus

image

We proceed to upper bound the right hand side of this inequality:

image

The first inequality holds because  E�t′l − t′l−1�≤ 1pi⋆and the second inequality follows by concavity of  Ui⋆(t, δ) as a function of t. The proof follows.

Proof of Theorem 4.10

We use the restarting trick to extend Lemma A.1 to the case when the lower bound  pi⋆is random (more specifically the algorithm (CORRAL) will maintain a lower bound that in the end will satisfy  pi⋆ ≈ mint pi⋆t) in Theorem 4.10. We restate the theorem statement here for convenience.

Theorem A.2 (Theorem 4.10 ).

image

Here, the expectation is over the random variable  ρi⋆ = maxt 1pi⋆t. If  U(t, δ) = tαc(δ) for some  α ∈ [1/2, 1) then, E [II] ≤ 4 21−α21−α−1T αc(δ)E�ρ1−αi �+ δT(log T + 1).

Restarting trick: Initialize  pi⋆ = 12M . If pi⋆t < pi⋆, set pi⋆ = pi⋆t2 and restart the base.

Proof of Theorem 4.10. The proof follows that of Theorem 15 in [Aga+17]. Let  ℓ1, · · · , ℓdi <T be the rounds where Line 10 of the CORRAL is executed. Let  ℓ0 = 0 and ℓdi⋆+1 = T fornotational convenience. Let  el = [ℓl−1+ 1, · · · , ℓl]. Denote by  pi⋆,ℓlthe probability lower bound maintained by CORRAL during time-steps  t ∈ [ℓl−1, · · · , ℓl] and ρi⋆,ℓl = 1/pi⋆,ℓl. Inthe proof of Lemma 13 in [Aga+17], the authors prove  di⋆ ≤ log(T) with probability one. Therefore,

image

The inequality is a consequence of Lemma A.1 applied to the restarted segment [ℓl−1, · · · , ℓl].This step is valid because by assumption 1ρi⋆,ℓl ≤ mint∈[ℓl−1,··· ,ℓl] pt.If Ui⋆(t, δ) = tαc(δ) for some function  c : R → R+, then  ρi⋆U(T/ρi⋆, δ) = ρ1−αi⋆ T αc(δ). And therefore:

image

Where ¯α = 1 − α. The last inequality follows from the same argument as in Theorem 15 in [Aga+17].

Proof of Theorem 4.11

Proof. For the CORRAL meta-algorithm,

image

Using Theorem 4.10 to control term II, the total regret of CORRAL is:

image

Using Lemma A.1 to control term II, we have the total regret of EXP3.P when  δ = 1/T:

image

When both  α and c(δ) are known, choose  p = T − 1−α2−α M− 12−α c(δ) 12−α. When only  α is known,choose  p = T − 1−α2−α M− 12−α. We then have the following regret:

Lemma B.1. If U(t, δ) = tβc(δ), for 0 ≤ β ≤ 1 then:

image

image

Table 2: The top row shows the general regret guarantees. The middle row shows the regret guarantees when  α and c(δ) are known. The bottom row shows the regret guarantees when αis known and  c(δ) is unknown.

Proof. The LHS follows immediately from observing U(t,δ)tis decreasing as a function of t and therefore �lt=1U(t,δ)t ≥ l U(l,δ)l = U(l, δ). The RHS is a consequence of bounding the sum by the integral� l0U(t,δ)t dt, substituting the definition  U(t, δ) = tβc(δ) and solving it.

Lemma B.2. If f(x) is a concave and doubly differentiable function on  x > 0 and f(0) ≥ 0then f(x)/x is decreasing on x > 0

Proof. In order to show that f(x)/x is decreasing when x > 0, we want to show that

image

g(x) = xf′(x) − f(x) is a non-increasing function on x > 0. We have  g′(x) = xf′′(x) ≤0 when  x ≥0 because f(x) is concave. Therefore  xf′(x) − f(x) ≤ 0f′(0)  − f(0)  ≤0 for all x ≥0, which completes the proof.

image

Proof. By definition  kl(p, q) = p log(p/q) + (1 − p) log( 1−p1−q), so

image


Designed for Accessibility and to further Open Science