Bandit algorithms have been applied in a variety of decision making and personalization problems in industry. There are many specialized algorithms each designed to perform well in specific environments. For example, algorithms are designed to exploit low variance [AMS09], extra context information, linear reward structure [DHK08; Li+10; AYPS11], sparsity [AYPS12; CM12], etc. The exact properties of the current environment however might not be known in advance, and we might not know which algorithm is going to perform best.
Model selection in contextual bandits aims to solve this problem. More formally, the learner is tasked to solve a bandit problem for which the appropriate bandit algorithm to use is not known in advance. Despite this limitation, the learner does have access to M different algorithms , one of which
is promised to be adequate for the problem the learner wishes to solve. We use regret to measure the learner’s performance
. The problem’s objective is to design algorithms to minimize regret.
The algorithms we develop in this work follow the template of [Aga+17] where a ‘meta-algorithm’ algorithm is placed on top of a couple of ‘base’ algorithms (in this case At the beginning of each round the meta-algorithm selects which base algorithm to ‘listen to’ during that time-step effectively treating the base algorithms as arms to be pulled by the meta-algorithm. The difficulty in using existing algorithms such as UCB or EXP3 [BC12] as a meta-algorithm lies in the non-stationary nature of the rewards collected by a learning base algorithm. The meta-algorithm needs to be sufficiently smart to recognize when a base algorithm is simply performing poorly because it still in the early stages of learning from the case where poor performance is the result of model misspecification.
Adapted and misspecified algorithms We say that an algorithm is adapted to the environment at hand if it satisfies a valid regret guarantee. Let’s illustrate this with an example in the setting of linear bandits with finitely many arms. In this problem the learner has access to K arms. Each arm ] is associated with a feature vector
and the reward of arm
] follows a linear model of the form
where
is conditionally zero mean and
is an unknown parameter. An algorithm such as LinUCB [Chu+11] achieves a regret guarantee of order
(
logarithmic factors in T. In contrast, the UCB algorithm [ACBF02] yields a regret guarantee of order ). In this case, both algorithms are well adapted to the problem of linear bandits with finitely many actions, but LinUCB’s regret guarantee may be substantially smaller than UCB’s regret upper bound if d is much smaller than K. If an algorithm is not well adapted, we say it is misspecified. For the sake of exposition let’s assume we are in a similar setting as above, where the learner has access to K arms each of which is associated with a feature vector
. Instead of assuming a linear model as before, let’s instead assume that
+
is quadratic. In this case, there is no reason to believe LinUCB can yield a valid regret guarantee since the underlying linearity assumption of LinUCB is violated. We say that in this case LinUCB is misspecified. Consider an instance of LinUCB that instead uses matrix features of the form
. In this case the quadratic reward is again a linear function of the feature vectors since (
this version of LinUCB with quadratic features is adapted.
We will assume that all algorithms ] are associated with a putative regret guarantee
) that is known by the learner and holding with probability 1
for all
to the environment at hand. If the learner knew the identity of the best adapted algorithm
, it would be able to incur regret of order
) by playing
for T time-steps. The learner’s objective in the model selection problem is to design a procedure that would allow a learner to incur in regret that is competitive with the regret upper bound
) of the best adapted algorithm among those in
the regret incurred by the learner up to time T scales as a function of T, the parameters defining
(and therefore
)) and possibly M. From now on we will refer to each of the M algorithms in
as a base algorithm. We will alert the reader if we have a specific set of M algorithms in mind. In any other case, when we talk about the set of base algorithms we simply mean a set of M algorithms the learner is model selecting from.
The authors of [OM11] were perhaps the first to address the bandit model-selection problem, with a variant of an EXP4 meta-algorithm that works with UCB or EXP3 base algorithms. These results are improved by [Aga+17] via the CORRAL algorithm. The CORRAL algorithm follows the meta-algorithm-base template that we discussed at the beginning of this section. It makes use of a CORRAL meta-algorithm based on a LogBarrier Online Mirror Descent algorithm controlling which of the base algorithms to play at any given round. Let dimensional probability distribution over the M base algorithms given by the CORRAL meta-algorithm. The learner will sample an algorithm index
] with
. and play the action prescribed by
to collect a reward
. All algorithms
are then updated using an importance weighted version of
regardless of whether they were selected by the meta-algorithm or not.
Unfortunately, this means that in order to use a base algorithm in CORRAL, it needs to be compatible with this importance weighting modification of the rewards. For example, to use UCB as a base, we would need to manually re-derive UCB’s confidence intervals and modify its regret analysis to be compatible with importance weighted feedback. The authors show that a base algorithm can be safely combined with the CORRAL meta-algorithm to yield model selection guarantees provided it satisfies a stability condition (see Definition 3 in [Aga+17]). Verifying that an algorithm satisfies such stability condition is a cumbersome process that requires a detailed analysis of the algorithm’s internal workings. In this work we instead focus on devising a black-box procedure that can solve the model selection problem for a general class of stochastic contextual bandit algorithms. This work introduced the first black-box method for model selection in stochastic contextual bandits, and it has been followed by many others that have expanded and refined these results; most
Contributions. We focus on the problem of bandit model-selection in stochastic environments. Our contributions are as follows:
• A new model selection algorithm. We introduce Stochastic CORRAL, a two step per round algorithm and an accompanying base “smoothing” wrapper that can be shown to satisfy model selection guarantees when combined with any set of M stochastic contextual bandit algorithms such that at least one of them is adapted and satisfies a high probability regret guarantee. We also show model selection regret guarantees for Stochastic CORRAL with two distinct adversarial meta-algorithms, CORRAL [Aga+17] and EXP3.P [BC12]. Our approach is more general than that of the original CORRAL algorithm [Aga+17] because instead of requiring each base algorithm to be individually modified to satisfy a certain stability condition, our version of the CORRAL algorithm provides the algorithm designer with a generic black-box wrapper that allows to do model selection over any set of M base algorithm with high probability regret guarantees. Stochastic CORRAL has another important difference with respect to the original CORRAL algorithm: instead of importance weighted feedback, the unadulterated reward is sent to algorithm
, and only this algorithm is allowed to update its internal state at round t. The main consequence of these properties of Stochastic CORRAL is that our model selection strategy can be used with almost any base algorithm developed for stochastic environments. When the learner has knowledge of the function
) but not of
(for example when all the putative upper bounds
) are the same), using the CORRAL meta-algorithm achieves optimal regret guarantees. When the optimal target regret is unknown sometimes using a an EXP3.P meta-algorithm can achieve better performance.
• A versatile algorithm. We demonstrate the generality and effectiveness of our method by showing how it seamlessly improves existing results or addresses open questions in a variety of problems. We show applications in adapting to the misspecification level in contextual linear bandits [LSW20], adapting to the unknown dimension in nested linear bandit classes [FKL19], tuning the data-dependent exploration rate of bandit algorithms, and choosing feature maps in reinforcement learning. Moreover, our meta-algorithm can simultaneously perform different types of model selection. For example, we show how to choose both the unknown dimension and the unknown mis-specification error at the same time. This is in contrast to algorithms that specialize in a specific type of model selection such as detecting the unknown dimension [FKL19].
• Lower bounds. In the stochastic domain, an important question is whether a model selection procedure can inherit the O(log T) regret of a fast stochastic base algorithm. We show a lower bound for the model selection problem that scales as Ω(), which implies that our result is minimax optimal. Recall the CORRAL meta-algorithm achieves an oracle optimal model selection regret guarantee provided the algorithm has access to
). This begs the question of whether this is an unavoidable statistical limitation of model selection or just a property of the CORRAL meta-algorithm. We show this condition is unavoidable in general: there are problems where the regret of the best base algorithm scales as
) for an unknown x, and the regret of any meta-algorithm that is unaware of the value of x scales as Ω(
We use the notation to write the delta distribution at a. For an integer n, we use [n] to denote the set {1, 2, . . . , n}. We consider the following formulation of contextual stochastic bandits. At the beginning of each time-step t, the learner observes a context
that corresponds to a subset of an ‘action-set’ A. After this the learner will select an action
and then collect a reward
) +
, a noisy quantity that will depend on the context
, and the learner’s action
, a reward function f and a 1
subGaussian conditionally zero mean random noise random variable
. In this work we will restrict ourselves to the case where contexts sets
are all subsets of a context generating set A. This is in fact a very general scenario that captures all types of contextual bandit problems ranging from the case of changing linear contexts with linear rewards, to more general contexts and reward sets studied in works such as [FR20]. For simplicity we will assume the contexts
are made of the subset of available actions to the learner at time t. Our formulation allows for the action set to vary in size from round to round and even to be infinite. For example, the finite linear bandit setting (where
for all t) fits in this setting. Similarly it is easy to see the linear contextual bandit problem with i.i.d. contexts and K actions can also be written as an instance of our formulation. In the linear contextual bandit problem with
actions (where
may be infinite) the learner is presented at time t with
action-vectors
with
and the (random) return
of any action
action setting where contexts
(an abstract context set) and the learner has access to K actions per round can also be formulated in this way by defining
] and defining
to be a distribution over subsets of the form
In this work we focus on the setting of stochastic i.i.d. contexts. Let S be the set of all subsets of A and let be a distribution over S. We assume all contexts
and that
. For an arbitrary subset
we denote the space of distributions over A as ∆
. A policy
is a mapping such that for any subset
in the support of
outputs an element of ∆
. Let’s denote by Π as the space of all policies with domain in Support(
). We abuse notation and denote
. Notice that in this case
. We will generally omit the
dependence from our expectation notation in the future.
In a contextual bandit problem the learner chooses policy at time t, which takes context set
as an input and outputs a distribution over
. The learner then selects
an action ) and receives a reward
We are interested in designing an algorithm with small regret, defined as
If for example ] we would like our algorithm to satisfy a regret guarantee of the form
and where
is the index of the best performing adapted base algorithm
. Crucially, we want to avoid this guarantee to depend on other
(if any). From now on we will refer to the policy maximizing the right hand side of the equation above as
. For simplicity we will also make the following assumption regarding the range of f,
Assumption 2.1 (Bounded Expected Rewards). The absolute value of f is bounded by 1,
Throughout this work we assume the base algorithms we want to model select from satisfy a high probability regret bound whenever they are well adapted to their environment. We make this more precise in definition 2.1,
Definition 2.1 ((Boundedness). Let
[0, 1]
. We say an adapted algorithm
bounded if with probability at least 1
and for all rounds
its cumulative pseudo-regret is bounded above by
We assume that for all ] the base algorithm
)-bounded for a function
known to the learner
. For example in the Multi Armed Bandit Problem with K arms the UCB algorithm is (
bounded for some universal constant c > 0.
Original CORRAL
We start by reproducing the pesudo-code of CORRAL [Aga+17] (see Algorithm 1) as it will prove helpful in our discussion of our main algorithm: Stochastic CORRAL. As we have explained in the previous section we assume there are M candidate base algorithms and a meta-algorithm which we denote as M. At time-step t the CORRAL meta-algorithm M selects one of the base algorithms in according to a distribution
by sampling an index
. The learner plays action
) and receives reward
. An importance weighted version of
is sent out to all base algorithms, after which all of them update their internal state.
In order to better describe the feedback structure of Stochastic CORRAL we abstract the meta-algorithm-base interaction template discussed in Section 1 into Algorithms 4 and 5. As we have mentioned before, one crucial difference between Stochastic CORRAL and CORRAL is that in Stochastic CORRAL only the state of the base algorithm whose action was selected is modified. In contrast in the CORRAL algorithm all the base algorithms’ states are updated during every step.
To make this description more precise we introduce some notation. Each base algorithm maintains a counter
that keeps track of the number of times it has been updated up to time t. We will say algorithm
is in ‘state’
at time t. For any base algorithm
is the policy
uses at state
are two consecutive times when base j is chosen by the meta-algorithm, then base j proposed policy
at time
and policy
during all times
Regret Decomposition. Let’s introduce the regret decomposition we will make use of to prove our regret guarantees. This is a similar decomposition as the one appearing in the proofs of Theorem 4,5 and 7 of [Aga+17]. We split the regret R(T) of Equation 1 into two
terms (I and II) by adding and subtracting terms
Term I is the regret of the meta-algorithm with respect to the optimal base II is the regret of the optimal base with respect to the optimal policy
. Analysis of term I is largely based on the adversarial regret guarantees of the Log-Barrier-OMD in CORRAL and of the EXP3.P algorithm.
In order to bound term II we will have to modify the feedback structure of Algorithms 4
and 5. In Algorithm 8 from Section 4 we introduce a smoothing procedure that allows any ()-bounded algorithm to be transformed into a ‘smoothed’ version of itself such that its conditional expected instantaneous regret is bounded with high probability during every even step. We name this procedure ‘smoothing’ because it is based on playing uniformly from the set of previously played policies during the smoothed algorithm’s odd steps. We provide more details in Section 4. For now, the main property we are to use from this discussion is that by smoothing a (
)-bounded algorithm it is possible to ensure the conditional expected instantaneous regret of the smoothed algorithm is bounded above by
function of ) is concave in
. In Stochastic CORRAL the smoothing of base algorithms takes the place of the stability condition required by the CORRAL algorithm in [Aga+17].
Let’s sketch some intuition behind why this decreasing instantaneous regret condition can help us bound term II. For all ] let
be the (random) probabilities used by the Stochastic CORRAL meta-algorithm M (an adversarial meta-algorithm) to chose base i during round
. Let’s focus on the optimal algorithm
assume
) is convex in
is decreasing, term II is the largest when base
is selected the least often. For the sake of the argument let’s assume that
. In this case base
will be played roughly
times, and will repeat its decisions in intervals of length
, resulting in the following bound:
Lemma 3.1 (informal). If the regret of the optimal base is (-bounded, then we have that
We demonstrate the effectiveness of our smoothing transformation by deriving regret bounds with two meta-algorithms: the Log-Barrier-OMD algorithm in CORRAL (introduced by [Aga+17]) which we will henceforth refer to as the CORRAL and EXP3.P (Theorem 3.3 in [BC12]) with forced exploration. The later is a simple algorithm that ensures each base is picked with at least a (horizon dependent) constant probability p. We now state an informal version of our main result, Theorem 4.11.
Theorem 3.2 (informal version of Theorem 4.11). If ) for some function
and constant
[1/2, 1) and
is (
)-bounded, the regrets of Stochastic CORRAL with an EXP3.P and CORRAL meta-algorithms are:
The CORRAL meta-algorithm achieves optimal regret when and
) are known.
When ) is unknown and
an EXP3.P meta-algorithm achieves better regret because . We complement this result with a couple of lower bounds.
Lower bounds. Theorem 5.3 in Section 5 shows that if the regret of the best base is ), in the worst case a meta-algorithm that does not know x can have regret Ω(
) with y > x. Theorem 5.2 shows that in general it is impossible for any meta-algorithm to achieve a regret better than Ω(
) even when the best base has regret O(log(T)). When the regret of the best base is
), CORRAL with our smoothing achieves the optimal
The detailed description of the aforementioned smoothing procedure, its properties and the regret analysis are postponed to Section 4. We also show some applications of our model selection results in Section 6.
Meta-Algorithms
We review the adversarial bandit algorithms used as a Meta-Algorithm in Algorithm 4.
CORRAL Meta-Algorithm
We reproduce the CORRAL meta-algorithm below.
EXP3.P Meta-Algorithm
We reproduce the EXP3.P algorithm (Figure 3.1 in [BS12]) below. In this formulation we use
Non-increasing Instantaneous Regret
We introduce a ”smoothing” procedure (Algorithm 8) which, given a (bounded algorithm B constructs a smoothed algorithm
with the property that for some time-steps its conditional expected instantaneous regret has a decreasing upper bound. For ease of presentation and instead of making use of odd and even time-steps in the definition of
we assume each round t is split in two types of steps (Step 1 and Step 2). We will denote objects pertaining to the
th round step i using a subscript t and a superscript (i). The construction of
is simple. The smoothed algorithm maintains an internal copy of the
original algorithm B. During step 1 of round t, will play the action suggested by its internal copy of B. During step 2 of round
will instead sample uniformly from the set of policies previously played by the copy of B maintained by
during steps of type 1 from round 1 to round t.
Let’s define step 2 more formally. If algorithm B is at state s during round t, at step 2 of the corresponding time-step the smoothed strategy will pick an index q in [1, 2, .., s] uniformly at random, and will then re-play the policy B used during step 1 of round q. Since B is ()-bounded we will show in Lemma 4.2 that the expected instantaneous regret of step 2 at round s is at most
with high probability.
It is easy to see that if algorithm has been played
times (including step 1 and 2 plays), the internal counter of B equals
2. We will make use of this internal counter when we connect a smoothed algorithm with the Stochastic CORRAL meta-algorithm. We now introduce the definition of (
Smoothness which in short corresponds to algorithms that satisfy a high probability conditional expected regret upper bound during steps of type 2.
Definition 4.1 ((Smoothness). Let
[0, 1]
. We say a smoothed algorithm
is (
smooth if with probability 1
and for all rounds
], the conditional expected instantaneous regret of Step 2 is bounded above by
Where is the sigma algebra generated by all contexts, rewards, policies and actions up to time t step 1.
During all steps of type 2 algorithm replays the policies it has used when confronted with contexts
we will use the fact that all contexts are assumed to be generated as i.i.d. samples from the same context generating distribution
that
With this objective in mind let’s analyze a slightly more general setting. Let B be a (bounded algorithm playing in an environment where the high probability regret upper bound U holds. Let’s assume that B has been played for t time-steps during which it has encountered i.i.d. generated contexts
and has played actions sampled from policies
. Similar to the definition of
in Definition 4.1, let’s define
, the sigma algebra generated by all contexts, rewards, policies and actions up to time
1. We define the “expected replay regret”
as:
Where are i.i.d. contexts from
all of them conditionally independent from
. It is easy to see that the conditional instantaneous regret of a smoothed algorithm
during round t step 2 equals the expected replay regret
copy inside
As a first step in proving that smooth in Lemma 4.2 we show the replay regret of a (
)-bounded algorithm satisfies a high probability upper bound.
bounded with
Assumption 2.1, then with probability at least 1 the expected replay regret of B satisfies:
Furthermore, if
Proof. Let’s condition on the event that B’s plays satisfy the high probability regret guarantee given by U:
For all ] and where
are the contexts algorithm B encountered up to time
bounded it must be the case that
Let be a collection of t fresh i.i.d. contexts from
independent from
now use martingale concentration arguments to show that
Since by assumption max1 each term in
is bounded and satisfies max
. A simple use of Azuma-Hoeffding yields:
Summing over all , using the fact that
2 and the union bound implies that for all t, with probability 1
Denote this event as . We shall proceed to upper bound the replay regret. Let’s condition on
. The following sequence of inequalities holds,
For all ]. Inequality (i) follows by the triangle inequality while (ii) is a consequence of conditioning on
and invoking inequalities 5, 6 and 6. We conclude that with probability at least 1
and for all
It is easy to see that in case
In Propositon 4.3 we show that if B is bounded, then is both bounded and smooth. We will then show that several algorithms such as UCB, LinUCB,
-greedy and EXP3 are (
)-bounded for appropriate functions U. By Proposition 4.3 we will then conclude the smoothed versions of these algorithms are smooth.
Proposition 4.3. If 8
and B is (bounded, then
is (5
smooth and with probability at least
for all
Proof. Let denote the event that
’s plays during steps of type 1 satisfy the high probability regret guarantee given by U:
for all ]. Since the conditional instantaneous regret of Step 2 of round t equals the average replay regret of the type 1 steps up to t, Lemma 4.2 implies that whenever
(see definition for
in the proof of Lemma 4.2) which occurs with probability at least 1
the conditional expected instantaneous regret satisfies:
)
It is easy to see that if we condition on the conditional expected instantaneous regret of steps of type 2 satisfy,
For all ]. We now show the regret incurred by
satisfies a high probability upper bound. To bound the regret accrued during time-steps of type 2, consider the following
Martingale difference sequences,
As a result of Assumption 2.1, 2 for all
and therefore a simple use of Azuma-Hoeffding’s inequality,
Summing over all t, applying the union bound, using the fact that 2 implies that for all
], with probability 1
Let’s denote as the event where Equation 11 holds. If
occur, then combining the upper bounds in 10 and 11 we conclude that,
combining this last observation with Equation 9, we conclude that for all t with probability at least 1
For all ]. The result follows.
It remains to show how to adapt the feedback structure of the Stochastic CORRAL meta-algorithm to deal with the two step nature of smoothed algorithms. We reproduce the full pseudo-code of the Stochastic CORRAL meta-algorithm adapted to smoothed
algorithms below,
For reasons that have to do with the analysis, Algorithm 9 has a few extra features not present in the meta-algorithm-base template of Algorithm 4. First, whenever the smooth stochastic corral meta-algorithm selects an algorithm it plays it for two steps, thus coinciding with
’s two time step structure. Second, it updates its distribution
using the feedback 2
) instead of using the sum
. Most notably, the update makes use of a bias adjustment to the reward signal that is not present in the original. The reason behind this modification will become clearer in the regret analysis.
Applications of Proposition 4.3
We now show the smoothed versions of several algorithms satisfy Definition 4.1 by showing they are (bounded for an appropriate upper bound function U. We focus on algorithms for the
armed bandit setting and the contextual linear bandit setting.
Lemma 4.4 (Theorem 3 in [AYPS11]). In the case of changing and potentially infinite contexts of dimension d, LinUCB is (-bounded with
Lemma 4.5 (Theorem 1 in [Chu+11]). In the case of finite linear contexts of size k and dimension d, LinUCB is (-bounded with
Lemma 4.6 (Theorem 1 in [Sel+13])armed adversarial bandit setting Exp3 is (
bounded where
Lemma 4.7. In the stochastic armed bandit problem, if we assume the noise
is conditionally 1-sub-Gaussian, UCB is (
-bounded with
For the remainder of this section we focus on showing that in the stochastic armed bandit problem, the
-greedy algorithm (Algorithm 1.2 of [Sli19]) is (
bounded. At time t the
-greedy algorithm selects with probability
1) an arm uniformly at random, and with probability 1
it selects the arm whose empirical estimate of the mean is largest so far. Let’s introduce some notation: we will denote by
the unknown means of the K arms use the name
to denote the empirical estimate of the mean of arm j after using t samples.
Without loss of generality let be the optimal arm. We denote the sub-optimality gaps as ∆
for all
]. Let ∆
be the smallest nonzero gap. We follow the discussion in [ACBF02] and start by showing that under the right assumptions, and for a horizon of size T, the algorithm satisfies a high probability regret bound for all
objective of this section is to prove the following Lemma:
Lemma 4.8.
with bounded for
and denote by
) the random variable denoting the number of times arm j was selected up to time t. We start by analyzing the probability that a suboptimal arm j > 1 is selected at time t:
Let’s bound the second term.
The analysis of these two terms is the same. Denote by ) the number of times arm
j was played as a result of a random epsilon greedy move. We have:
Inequality a is a consequence of the Azuma-Hoeffding inequality bound. Inequality b follows because ). Term (1) corresponds to the probability that within the interval [1
], the number of greedy pulls to arm j is at most half its expectation. Term (2) is already ”small”. Lets proceed to bound (1). Let
with
for some
(0, 1) satisfying
. We’ll show that under these assumptions we can lower bound
By Bernstein’s inequality (see derivation of equation (13) in [ACBF02]) we can upper bound
Hence for
Now we proceed with term (2):
The first inequality (a) follows because . Inequality (b) follows because by the assumption
By applying the union bound over all arms , we conclude that the probability of choosing a sub-optimal arm
2 at any time time t for
as a greedy choice is upper bounded by
. In other words after
rounds, with probability 1
sub-optimal arms are only chosen as a result of random epsilon greedy move (occurring with probability
A similar argument as the one that gave us Equation 13 can be used to upper bound the probability that ) be much larger than its mean:
Using this and the union bound we see that with probability more than 1 and for all
] and arms
(
). Combining this with the observation that after
and with probability 1
over all t simultaneously regret is only incurred by random exploration pulls (and not greedy actions), we can conclude that with probability at least 1
simultaneously for all
the regret incurred is
Term (i) is a crude upper bound on the regret incurred in the first (ii) is an upper bound for the regret incurred in the subsequent rounds.
Since ) we conclude that with probability 1
the cumulative regret of epsilon greedy is upper bounded by
the result follows by identifying
Lemma 4.8 gives us an instance dependent upper bound for the greedy algorithm. We now show the instance-independent high probability regret bound for
Lemma 4.9. If , then
greedy with
is (
bounded for
and:
Proof. Let ∆ be some arbitrary gap value. Let R(t) denote the expected regret up to round t. We recycle the notation from the proof of Lemma 4.8, recall
When and therefore (assuming ∆
second inequality B is satisfied for T large enough. We choose this expression for simplicity of exposition.
When k > 2 notice that we can arrive to a bound similar to 14:
The inequality is true for T large enough. We choose this expression for simplicity of exposition.
Regret Analysis
In this section we go back to sketch the proof of Theorem 3.2 by explaining how to bound terms I and II in the regret decomposition of Equation 2.
Bounding Term I. Recall that Algorithm 9 only sends the smoothed reward of Step 2 to the meta-algorithm while the base plays and incurs regrets from both Step 1 and Step 2. We show in Section A that this does not affect the regret of the meta-algorithm significantly.
with exploration rate
Bounding Term II. This quantity is the regret of all the policies proposed by the optimal base , even during steps when it was not selected by the meta-algorithm. Recall the internal state of any algorithm (including
) is only updated when selected and played by M. We assume the smoothed base algorithm
satisfies the smoothness and boundedness properties of Definitions 2.1 and 4.1. For the purpose of the analysis we declare that when a smoothed base repeats its policy while not played, it repeats its subsequent Step 2 policy (Algorithm 8). This will become clearer in Section A. Since we select
with probability at least
it will be updated at most every 1
time-steps and the regret upper bound will be roughly
Theorem 4.10. We have that . Here, the expectation is over the random variable
. If
) for some
[1/2, 1) then,
Total Regret. Adding Term I and Term II gives us the following worst-case model selection regret bound for the CORRAL meta-algorithm (maximized over and with a chosen
) and the EXP3.P meta-algorithm (with a chosen p):
Theorem 4.11. If a base algorithm is ()-bounded for
) and some
[1/2, 1) and the choice of
, the regret of the Smooth Stochastic CORRAL (Algorithm 9) where
is upper bounded by :
In stochastic environments, algorithms such as UCB can achieve logarithmic regret bounds. Our model selection procedure however has a ) overall regret. In this section, we show that in general it is impossible to obtain a regret better than Ω(
) even when the optimal base algorithm has 0 regret. In order to formalize this statement, let’s define a model selection problem formally.
Table 1: Comparison of model selection guarantees for Stochastic CORRAL between the EXP3.P and CORRAL meta-algorithm. The top row shows the general regret guarantees. The middle row shows the regret guarantees when ) are known. The bottom row shows the regret guarantees when
is known and
) is unknown.
Definition 5.1 (Model Selection Problem). We call a tuple (a model selection problem where
is a set of M base algorithms and Env is a bandit environment
. For any model selection algorithm there exists a corresponding model selection problem (
such the regret of this model selection algorithm is lower bounded by
Proof. Consider a stochastic 2-arm bandit problem where the best arm has expected reward 1/2 and the second best arm has expected reward 1/4. We construct base algorithms as follows.
always chooses the optimal arm and its expected instantaneous reward is 1
chooses the second best arm at time step t with probability
specified later), and chooses the best arm otherwise. The expected reward at time step t of
Let be uniformly sampled from {1, 2}. Consider two environments
meta-algorithm, each made up of two base algorithms
instantiations of
. Under
,
, where
is a uniformly sampled index in {1, 2}, is a copy of
is a copy of
Let denote the probability measures induced by interaction of the meta-algorithm with
respectively. Let
denote the base algorithm chosen by the meta-algorithm at time
, since the learner has no information available to identify which algorithm is considered optimal. By Pinskers’ inequality we have
By the divergence decomposition [see LS20, proof of Lemma 15.1 for the decomposition
technique] and using that for ∆
Picking
bounded by
CORRAL needs knowledge of the best base’s regret to achieve the same regret. The following lower bound shows that this requirement is unavoidable:
Theorem 5.3. Let Alg be a model selection algorithm. There exists a model selection problem with two base algorithms where the best base has regret such that if Alg has no knowledge of x nor of the reward of the best arm, then there exists a potentially different model selection problem where the best base also has regret
the model selection regret guarantee of Alg is lower bounded by Ω(
Proof. Let the set of arms be . Let x and y be such that 0
1. Let
. Define two environment
and
with reward vectors {1, 1, 0} and
, respectively. Let
be two base algorithms defined by the following fixed policies when running alone in
We also construct base defined as follows. Let
4 be two constants. Base
mimics base
when
, and picks arm
when
. The instantaneous rewards of
and
when running alone are
and
. Next, consider model selection with base algorithms
be the number of rounds that
are chosen, respectively.
First, assume case (1): There exist constants c > 0, 0,
(0, 1), and
0 such that with probability at least
The regret of base when running alone for
. The regret of the model selection method is at least
Given that the inequality holds for any , it proves the statement of the lemma in case (1).
Next, we assume the complement of case (1): For all constants c > 0, 0,
(0, 1), and
0, with probability at least
Let T be any such time horizon. Consider model selection with base algorithms in environment
rounds. Let
be the number of rounds that
are chosen. Note that
behave the same for
time steps, and that
and
never choose action
. Therefore for the first
time steps, the model selection strategy that selects between
and
in
behaves the same as when it runs
and
in
. Therefore with probability
, which implies
Given that with probability 2, the regret of the learner is lower bounded as,
which is larger than the regret of running alone because
. The statement of the lemma follows given that for any
there exists
so that the model selection fails.
Misspecified Contextual Linear Bandit
We consider model selection in the misspecified linear bandit problem. The learner selects an action and receives a reward
unknown parameter vector and
is the misspecification error. For this problem, [Zan+20] and [LSW20] present variants of LinUCB that achieve a high probability
regret bound. Both algorithms require knowledge of
] show a regret bound of the same order without the knowledge of
for the version of the problem with a fixed action set
. Their method relies on G-optimal design, which does not work for contextual settings. It is an open question whether it is possible to achieve the above regret without knowing
for problems with changing action sets.
In this section, we show a ) regret bound for linear bandit problems with changing action sets without knowing
. For problems with fixed action sets, we show an improved regret that matches the lower bound of [LS20].
Given a constant E so that , we divide the interval [1, E] into an exponential grid
) modified LinUCB bases, from either [Zan+20] or [LSW20], with each base algorithm instantiated with a value of
in the grid.
Theorem 6.1. For the misspecified linear bandit problem described above, the regret of Stochastic CORRAL with a CORRAL meta-algorithm using learning rate and LinUCB base algorithms with target misspecification level
, is upper bounded by
). In the case of a fixed action linear bandit problem with k arms and
, the regret of Stochastic CORRAL with a CORRAL meta-algorithm using learning rate
applied to a set of base algorithms consisting of one UCB base and one G-optimal base algorithm [LSW20] is upper bounded by
Proof. From Lemma 4.7, for UCB, ). Therefore from Theorem 4.11, running CORRAL with smooth UCB results in the following regret bound:
Maximizing over results in a regret guarantee of the form
For the misspecified linear bandit problem we use M = O(log(T)) LinUCB bases with
defined in the grid, and choose
. The resulting regret for Stochastic CORRAL is of
the form
When the action sets are fixed, by the choice of , the regret of Stochastic CORRAL with a CORRAL meta-algorithm over one UCB and one G-optimal base equals:
If, the above expression becomes
Observe that in the case of a fixed action linear bandit problem, the regret upper bound we achieve for Stochastic CORRAL with a CORRAL meta-algorithm and a learning rate of is of the form
. The product of the terms inside
the minimum is of order ). This result matches the following lower bound that shows that it is impossible to achieve
Lemma 6.2 (Implied by Theorem 24.4 in [LS20]). Let ) denote the cumulative regret at time T on environment
. For any algorithm, there exists a 1-dimensional linear bandit environment
and a k-armed bandit environment
such that
Experiment (Figure 1). Let d = 2. Consider a contextual bandit problem with k = 50 arms, where each arm j has an associated vector sampled uniformly at random from [0, 1]
. We consider two cases: (1) For a
sampled uniformly at random from [0, 1]
, reward of arm j at time t is
, where
1), and (2) There are k parameters
] all sampled uniformly at random from [0, 10], so that the reward of arm j at time t is sampled from
1). We use CORRAL with learning rate
and UCB and LinUCB as base algorithm. In case (1) LinUCB performs better while in case (2) UCB performs better. Each experiment is repeated 500 times.
Figure 1: CORRAL with UCB and LinUCB bases. Shaded regions denote the standard deviations.
Contextual Bandits with Unknown Dimension
We consider model selection in the nested contextual linear bandit problem studied by [FKL19]. In this problem the context space . Each action is a
dimensional vector and each context
is a subset of
. The unknown parameter vector
only its first
coordinates are nonzero. Here,
is unknown and possibly much smaller than D. We assume access to a family of LinUCB algorithms
with increasing dimensionality
. Algorithm i is designed to ’believe’ the unknown parameter vector
has only nonzero entries in the first
entries. In [FKL19] the authors consider the special case when
. In order to obtain their model selection guarantees they require a lower bound on the average eigenvalues of the covariance matrices of all actions. In contrast, we do not require any such structural assumptions on the context. We provide the first sublinear regret for this problem when the action set is infinite. Further, we have no eigenvalue assumptions and our regret does not scale with the number of actions k.
We use LinUCB with each value of [1
] as a base algorithm for CORRAL and EXP3.P. We also consider the case when both the optimal dimension
and the misspecification
are unknown: we use
) modified LinUCB bases (see the discussion on Misspecified Contextual Linear Bandits above) for each value of (
) in the grid [1
From Lemma 4.4 and Lemma 4.5, for linear contextual bandit, LinUCB is ()-bounded with
)) for infinite action sets
bounded with
)) for finite action sets. Choose
and ignore the log factor,
) for infinite action sets and
) for finite action sets. Then
with
2 and
) for infinite action sets, and
) for finite action sets.
Now consider the misspecified linear contextual bandit problem with unknown We use the smoothed LinUCB bases [LSW20; Zan+20]. Using the calculation in the proof of Theorem 6.1 in Section 6, using CORRAL with a smooth LinUCB base with parameters (
) in the grids results in
regret. Since d is unknown, choosing
yields the regret
. Using EXP3.P with a smooth LinUCB base with parameters (
) in the grids results in:
Since is unknown, choosing
yields a
) regret bound. We
summarize our results in the following table:
We study model selection in the setting of non-parametric contextual bandits.[GJ18] consider non-parametric stochastic contextual bandits. At time t and given a context learner selects arm
] and observes the reward
) +
, where
is a 1-sub-Gaussian random variable and for all
], the reward function
) is
lipschitz in the context
. It is assumed that the contexts arrive in an IID fashion. [GJ18] obtain a
regret for this problem. Similar to [FKL19], we assume that only the first
context features are relevant for an unknown
. It is important to find
because
. Stochastic CORRAL can successfully adapt to this unknown quantity: we can initialize a smoothed copy of Algorithm 2 of [GJ18] for each value of d in the grid [
] for some b > 1 and perform model selection with CORRAL and EXP3.P with these base algorithms.
Tuning the Exploration Rate of -greedy
We study the problem of selecting for the optimal scaling for the exploration probability in the -greedy algorithm. Recall that for a given positive constant
-greedy algorithm pulls the arm with the largest empirical average reward with probability 1
, and otherwise pulls an arm uniformly at random. Let
. It can be shown that the optimal value for
is the smallest gap between the optimal arm and the sub-optimal arms [LS20]. With this exploration rate, the regret scales as
We would like to find the optimal value of c without the knowledge of ∆
. In this discussion we show it is possible to obtain such result by applying CORRAL to a set of
-greedy base algorithms each instantiated with a
Theorem 6.3. The regret of CORRAL using smoothed -greedy base algorithms defined on the grid is bounded by
Figure 2: CORRAL with -Greedy bases with different exploration rates.
Proof. From Lemma 4.9, we lower bound the smallest gap by 1/T (because the gaps smaller than 1/T will cause constant regret in T time steps) and choose . From Theorem 4.11, the regret is
) when k > 2 and
) when k = 2 with the base running alone.
Next we show that the best value of c in the exponential grid gives a regret that is within a constant factor of the regret above where we known the smallest non-zero gap ∆. An exploration rates can be at most
1, we need to search only in the interval [1, KT]. Let
be the element in the exponential grid such that
. Then 2
2 is a constant, and therefore using 2
will give a regret up to a constant factor of the optimal regret.
Experiment (Figure 2). Let there be two Bernoulli arms with means 0.5 and 0.45. We use 18 -greedy base algorithms differing in their choice of c in the exploration rate
’s to lie on a geometric grid in [1, 2T]. Each experiments is repeated 50 times.
Reinforcement Learning
We can instantiate Stochastic CORRAL model selection regret guarantees to the episodic linear MDP setting of [Jin+20], again with nested feature classes of doubling dimension just as in the case of the Contextual Bandits with Unknown Dimension. Let’s formally define a Linear MDP,
Figure 3: -Greedy vs UCRL2 vs PSRL in the River Swim environment [SL08].
Definition 6.4 (Linear MDP ( Assumption A in [Jin+20])). An episodic MDP (Denoted by the tuple (S, A, H, P, r)) is a linear MDP with a feature map Φ : , if for any
] there exist d unknown (signed) measures
) over S and an unknown vector
, such that for any (
The value function for a linear MDP also satisfies a linear parametrization,
Proposition 6.5 (Proposition 2.3 from [Jin+20]). For a linear MDP, and for any policy there exist
dimensional weights
such that for any (
] we have that the value function of policy
For the purpose of studying model selection in the setting of linear MDPs we assume access to dimensional feature maps Φ :
. For all policies
the unknown parameters
are all assumed to have unknown coordinates only in their first
dimensions. We assume access to a family of LSVI-UCB (Algorithm 1 of [Jin+20]) algorithms
with increasing dimensionality
. Algorithm i is designed to ‘believe’ the unknown parameter vectors
has only nonzero entries in the first
entries for all policies
Theorem 6.6. Let M = (S, A, H, P, r) be a linear MDP parametrized by a feature map be the family of nested feature maps such that Φ
corresponds to the top
entries of Φ(s, a). Assume that for all policies
the unknown parameters
have nonzero coordinates only in their first
dimensions and that there exists an index
such that
. Selecting among different smoothed LSVI-UCB base algorithms corresponding to the feature maps
using Stochastic CORRAL with a CORRAL meta-algorithm and
satisfies a regret guarantee:
R(T)
Proof of Theorem 6.6. When well specified the LSVI-UCB algorithm [Jin+20] satisfies the high probability bound ) where H is the length of each episode. The result then follows from Theorem 4.11 by setting the CORRAL meta-algorithm learning rate as
We also observe that in practice, smoothing RL algorithms such as UCRL and PSRL and using a CORRAL meta-algorithm on top of them can lead to improved performance. In Figure 3, we present results for the model selection problem among distinct RL algorithms in the River Swim environment [SL08]. We use three different bases, greedy
learning with
1, Posterior Sampling Reinforcement Learning (PSRL), as described in [OVR17] and UCRL2 as described in [JOA10]. The implementation of these algorithms and the environment is taken from TabulaRL (https://github.com/iosband/TabulaRL), a popular benchmark suite for tabular reinforcement learning problems. Smooth CORRAL uses a CORRAL meta-algorithm with a learning rate
, all base algorithms are smoothed using Algorithm 8. The curves for UCRL2, PSRL and
greedy are all of their un-smoothed versions. Each experiment was repeated 10 times and we have reported the mean cumulative regret and shaded a region around them corresponding to
3 the standard deviation across these 10 runs.
Generalized Linear Bandits with Unknown Link Function
[LLZ17] study the generalized linear bandit model for the stochastic k-armed contextual bandit problem. In round t and given context , the learner chooses arm
and observes reward
) +
where
is an unknown parameter vector,
is a conditionally zero-mean random variable and
is called the link function. [LLZ17] obtain the high probability regret bound
) where the link function is known. Suppose we have a set of link functions L that contains the true link function
. Since the target regret
) is known, we can run CORRAL with the algorithm in [LLZ17] with each link function in the set as a base algorithm. From Theorem 4.11, CORRAL will achieve regret
Bandits with Heavy Tail
[Sha+18] study the linear stochastic bandit problem with heavy tail. If the reward distribution has finite moment of order 1 + , [Sha+18] obtain the high probability regret bound
. We consider the problem when
(0, 1] is unknown with a known lower bound L where L is a conservative estimate and
could be much larger than L. To the best of our knowledge, we provide the first result when
is unknown. We use the algorithms in [Sha+18] with value of
in the grid [
] for some 0 < b < 1 as base algorithms with
for CORRAL. A direct application of Theorem 4.11 yields regret
. When
is close to
is close to 1.
In this work we introduced the Stochastic CORRAL algorithm that successfully combines an EXP3 or CORRAL adversarial meta-algorithm with a wide variety of stochastic base algorithms for contextual bandits and reinforcement learning. We improve the results of the original CORRAL approach [Aga+17] that requires the base algorithms to satisfy a stability condition not often fulfilled by even the simplest stochastic bandit algorithms such as UCB and OFUL. Our approach can make use of the input base algorithms in a fully blackbox fashion without the need of reproving regret bounds for the component base algorithms. This versatility has allowed us to crack several open problems ranging from algorithms that adapt to the misspecification level in linear contextual bandits to the effective dimension in non-parametric problems.
[AYPP20] Yasin Abbasi-Yadkori, Aldo Pacchiano, and My Phan. “Regret Balancing for Bandit and RL Model Selection”. In: arXiv preprint arXiv:2006.05491 (2020) (Cited on page 4).
[AYPS11] Yasin Abbasi-Yadkori, D´avid P´al, and Csaba Szepesv´ari. “Improved Algorithms for Linear Stochastic Bandits”. In: Advances in Neural Information Processing Systems. 2011, pp. 2312–2320 (Cited on pages 2, 17).
[AYPS12] Yasin Abbasi-Yadkori, David P´al, and Csaba Szepesv´ari. “Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2012 (Cited on page 2).
[Aga+17] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. “Corralling a Band of Bandit Algorithms”. In: Conference on Learning Theory. PMLR. 2017, pp. 12–38 (Cited on pages 2–4, 6, 8–10, 34, 41, 49).
[AMM21] Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. “Corralling Stochastic Bandit Algorithms”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2021, pp. 2116–2124 (Cited on page 4).
[AMS09] Jean-Yves Audibert, R´emi Munos, and Csaba Szepesv´ari. “Exploration-Exploitation Tradeoff Using Variance Estimates in Multi-Armed Bandits”. In: Theoretical Computer Science 410.19 (2009), pp. 1876–1902 (Cited on page 2).
[ACBF02] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. “Finite-Time Analysis of the Multiarmed Bandit Problem”. In: Machine learning 47.2 (2002), pp. 235– 256 (Cited on pages 2, 18, 19).
[BC12] S´ebastien Bubeck and Nicol`o Cesa-Bianchi. “Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems”. In: CoRR abs/1204.5721 (2012). arXiv: 1204.5721 (Cited on pages 2, 4, 10).
[BS12] S´ebastien Bubeck and Aleksandrs Slivkins. “The Best of Both Worlds: Stochastic and Adversarial Bandits”. In: Conference on Learning Theory. PMLR. 2012, pp. 42–1 (Cited on pages 11, 42).
[CM12] Alexandra Carpentier and R´emi Munos. “Bandit Theory Meets Compressed Sensing for High Dimensional Stochastic Linear Bandit”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2012, pp. 190–198 (Cited on page 2).
[Chu+11] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. “Contextual Bandits with Linear Payoff Functions”. In: International Conference on Artificial Intelligence and Statistics. PMLR, 2011, pp. 208–214 (Cited on pages 2, 17).
[Cut+21] Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, and Manish Purohit. “Dynamic Balancing for Model Selection in Bandits and RL”. In: International Conference on Machine Learning. PMLR. 2021, pp. 2276–2285 (Cited on page 4).
[DHK08] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. “Stochastic Linear Optimization under Bandit Feedback”. In: Conference on Learning Theory. PMLR. 2008 (Cited on page 2).
[FR20] Dylan Foster and Alexander Rakhlin. “Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles”. In: International Conference on Machine Learning. PMLR. 2020, pp. 3199–3210 (Cited on page 5).
[FKL19] Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. “Model Selection for Contextual Bandits”. In: Advances in Neural Information Processing Systems. 2019, pp. 14741–14752 (Cited on pages 1, 4, 29, 30).
[GJ18] Melody Guan and Heinrich Jiang. “Nonparametric Stochastic Contextual Bandits”. In: AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018 (Cited on page 30).
[JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. “Near-Optimal Regret Bounds for Reinforcement Learning”. In: Journal of Machine Learning Research 11.Apr (2010), pp. 1563–1600 (Cited on page 33).
[Jin+20] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. “Provably Efficient Reinforcement Learning with Linear Function Approximation”. In: Conference on Learning Theory. PMLR. 2020, pp. 2137–2143 (Cited on pages 31– 33).
[LS20] Tor Lattimore and Csaba Szepesv´ari. Bandit Algorithms. Cambridge University Press, 2020 (Cited on pages 24, 27, 28, 30).
[LSW20] Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. “Learning with Good Feature Representations in Bandits and in RL with a Generative Model”. In: International Conference on Machine Learning. PMLR. 2020, pp. 5662–5670 (Cited on pages 1, 4, 26, 27, 29).
[Lee+21] Jonathan Lee, Aldo Pacchiano, Vidya Muthukumar, Weihao Kong, and Emma Brunskill. “Online Model Selection for Reinforcement Learning with Function Approximation”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2021, pp. 3340–3348 (Cited on page 4).
[Li+10] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. “A Contextual Bandit Approach to Personalized News Article Recommendation”. In: International conference on World wide web. 2010, pp. 661–670 (Cited on page 2).
[LLZ17] Lihong Li, Yu Lu, and Dengyong Zhou. “Provably Optimal Algorithms for Generalized Linear Contextual Bandits”. In: International Conference on Machine Learning. PMLR. 2017, pp. 2071–2080 (Cited on page 33).
[OM11] Maillard Odalric and R´emi Munos. “Adaptive Bandits: Towards the Best History-Dependent Strategy”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2011, pp. 570–578 (Cited on page 3).
[OVR17] Ian Osband and Benjamin Van Roy. “Why is Posterior Sampling Better than Optimism for Reinforcement Learning?” In: International Conference on Machine Learning. PMLR. 2017, pp. 2701–2710 (Cited on page 33).
[PDG22] Aldo Pacchiano, Christoph Dann, and Claudio Gentile. “Best of Both Worlds Model Selection”. In: Advances in Neural Information Processing Systems (2022) (Cited on page 4).
[Pac+20] Aldo Pacchiano, Christoph Dann, Claudio Gentile, and Peter Bartlett. “Regret Bound Balancing and Elimination for Model Selection in Bandits and RL”. In: arXiv preprint arXiv:2012.13045 (2020) (Cited on page 4).
[Sel+13] Yevgeny Seldin, Csaba Szepesvari, Peter Auer, and Yasin Abbasi-Yadkori. “Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments”. In: European Workshop on Reinforcement Learning. 2013 (Cited on page 17).
[Sha+18] Han Shao, Xiaotian Yu, Irwin King, and Michael R Lyu. “Almost Optimal Algorithms for Linear Stochastic Bandits with Heavy-Tailed Payoffs”. In: Advances in Neural Information Processing Systems. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. 2018, pp. 8420–8429 (Cited on page 33).
[Sli19] Aleksandrs Slivkins. “Introduction to Multi-Armed Bandits”. In: arXiv preprint arXiv:1904.07272 (2019) (Cited on page 17).
[SL08] Alexander L Strehl and Michael L Littman. “An Analysis of Model-Based Interval Estimation for Markov Decision Processes”. In: Journal of Computer and System Sciences 74.8 (2008), pp. 1309–1331 (Cited on pages 32, 33).
[Zan+20] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. “Learning Near Optimal Policies with Low Inherent Bellman Error”. In: International Conference on Machine Learning. PMLR. 2020, pp. 10978–10989 (Cited on pages 26, 27, 29).
Bounding term I
When the base algorithms are not chosen, they repeat their step 2’s policy to ensure that the conditional instantaneous regret is decreasing. To ensure the decreasing conditional instantaneous regret serves its purpose, when the base algorithms are chosen by the meta-algorithm, we only send step 2’s rewards to the meta-algorithm as feedback signals. This is to ensure that the sequence of rewards the meta-algorithm is competing against satisfies the decreasing instantaneous regret condition. However, since the bases play and incur regrets from both step 1 and step 2 when they are chosen, we must account for the difference between the reward of step 1 and step 2 (that the bases incur when they play the arms), and 2 times the reward of step 2 (what the bases send to the meta-algorithm as feedback signals).
Since we assume all base algorithms to be smoothed and satisfy a two step feedback structure, we also denote by as the policy used by the meta-algorithm during round t, step j. Term I, the regret of the meta-algorithm with respect to base
can be written as:
The reader should keep in mind the meta-algorithm is updated only using the reward of Step 2 of base algorithms even though the bases play both step 1 and 2. Let be the random subset of rounds when M choose base
, (
) for all
]. Adding and subtracting terms
we see that:
Equality (i) holds because term Iequals zero (recall for all
by the meta-algorithm) and therefore I
and in all steps
, base
repeated a policy of Step 2 so that I
. Equality (ii) follows by adding and subtracting term I
. Term
is the regret of the meta-algorithm with respect to base
. Term
accounts for the difference between the rewards of Step 1 and Step 2 (that the bases incur) and 2 times the rewards of Step 2 (that the bases send to the meta-algorithm). We now focus on bounding
We set the bias functions to
in Algorithm 9. This will become useful to control
. Instead of sending the meta-algorithm the unadulterated 2
feedback, at all time step t, all bases will send the following modified feedback:
This reward satisfies:
Define the modified rewards ] and context A. Let’s write I
in terms of these
I
Where inequality (i) holds because 2
0. In the coming
discussion we’ll show that this modification allows us to control term . In the following two sections we will control
and
. We will control
by using standard arguments from the adversarial bandits literature. We will also show that with high probability
. The use of the biased rewards
allows us to ensure the collected reward during steps of type 1 plus the bias terms vs the collected reward of steps of type 2 is close to zero. Without these bias terms, bounding term I
prove problematic since the rewards of steps of type 1 may be smaller than the collected rewards of steps of type 2. In this case
] may give rise to a regret term dependent on the putative regret upper bounds of all algorithms
] and not only on
Bounding term
Let’s start by noting that after taking expectations,
The modification of the bases’ rewards in Equation 16 modifies both the bases rewards as well as the comparator. Since both meta-algorithms CORRAL and EXP3.P are k-armed bandit adversarial algorithms, their worst-case performance guarantees hold for this biased pseudo-reward sequence.
CORRAL Meta-Algorithm
We can bound Equation 18 using Lemma 13 from [Aga+17]. Indeed, in term , the policy choice for all base algorithms
during any round t is chosen before the value of
is revealed. This ensures the estimates
and 0 for all
are indeed unbiased estimators of the base algorithm’s rewards. We conclude:
EXP3.P Meta-Algorithm
Since is the regret of base i with respect to the meta-algorithm, it can be upper bounded by the k-armed bandit regret of the meta-algorithm with M arms. Choose
in Theorem 3.3 in [BS12], we have that if
, the regret of EXP3.P:
Bounding
Notice that:
Substituting the definition of ) and
) back into the expectation for
Equality (1) follows by noting . Inequality (2) follows because by Lemma B.1, we have
for all
]. If the
th algorithm was adapted to the environment, then with high probability satisfies the following bound:
Inequality (A) follows because by definition ) because if
is adapted to the environment it satisfies a high probability regret bound. Let Adapt =
] s.t. j is adapted }. Let’s rewrite the upper bound for
from Equation 19
as a sum of terms corresponding to base algorithms
Equation 20 implies that with probability at least 1
We are left with controlling the component of the upper bound of that runs over misspecified algorithms. When
is not adapted, Equation 20 may or may not hold. In order to ensure we are able to control
we will make sure that algorithms that violate Equation 20 by a large margin are dropped by the meta-algorithm. Since it is impossible to compute the terms
) directly, we instead rely on the following test:
Base Test. Let ) be the set of time indices in [l] when the meta-algorithm chose to play base j. We drop base
if at any point during the history of the algorithm,
Let’s start by showing that with high probability is a good estimator of
As a simple consequence of the Azuma-Hoeffding martingale bound and Assumption 2.1, with probability at least 1 and for all
] and for any
Combining Equation 22 and Equation 23 we get, with probability at least 1 for all
] and for any
Equation 20 holds for all with probability at least 1
. Combining this result with Equation 24 we conclude that with probability at least 1
for all
Adapt and all
for all ]. Thus with high probability no well adapted algorithm will be eliminated.
Let’s now show that for all the contribution of
to
while the test of Equation 21 has not been triggered is small. If Equation 21 holds for algorithm
is not adapted), then Equation 24 implies that with probability at least 1
The last inequality holds because . And therefore,
Bounding term II
Recall term II equals:
We use to denote the number of rounds base i is chosen up to time
Let
be the round index of the
th time the meta-algorithm chooses algorithm
] be the set of rounds where base i is chosen and
. For
] and
, we define the regret of the
th base algorithm during Step
The following decomposition of E [II] holds:
(
) consists of the regret when base
was updated in step 1 while the remaining 3 terms consists of the regret when the policies are reused by step 2.
Biased step 2’s rewards
Note that we modified the rewards of step 2 as defined in Equation 16, both when the base is chosen and not chosen. We now analyze the effect of this modification:
We provided a bound for term I-modified at the beginning of Section A. In this section we concern ourselves with IImodified. Notice its expectation can be written as:
Now the second part of this sum is easy to deal with as it can be incorporated into the bound of E [II] by slightly modifying the bound given by Equation 28 below and changing 2+ 1. The rest of the argument remains the same.
From this section onward we drop the subscript whenever clear to simplify the notations. In this section we show an upper bound for Term II when there is a value
lower bounds
with probability 1. We then use the restarting trick to extend the proof to the case when
is random in Theorem 4.10
Lemma A.1 (Fixed Let
(0, 1) be such that
with probability one, then,
, we focus on bounding E [1{E}II]. since base
. We proceed to bound the regret corresponding to the remaining terms in II
The multiplier 21 arises because the policies proposed by the base algorithm during the rounds it is not selected by M satisfy
for all
+ 1 and
+ 1
1. The factorization is a result of conditional independence between
and
where
already includes algorithm
update right after round
. The inequality holds because
smooth and therefore satisfies Equation 3 on event E. Recall that as a consequence of Equation 27 we have
The first term is bounded by while the second term satisfies the bound
in (28). Let
Let for all l. Consider a meta-algorithm that uses
instead of
. In this new process let
be the corresponding rounds when the base is selected, ¯
be the total number of rounds the base is selected, and
. Since
for all t it holds that
for all j. If we use the same coin flips used to generate
to generate
, we observe that
and ¯
. Let
[0, 1] be a decreasing function such that for integer
. Then
and
are two estimates of integral
. Given that
is a decreasing sequence in l,
and thus
We proceed to upper bound the right hand side of this inequality:
The first inequality holds because and the second inequality follows by concavity of
) as a function of t. The proof follows.
Proof of Theorem 4.10
We use the restarting trick to extend Lemma A.1 to the case when the lower bound is random (more specifically the algorithm (CORRAL) will maintain a lower bound that in the end will satisfy
) in Theorem 4.10. We restate the theorem statement here for convenience.
Theorem A.2 (Theorem 4.10 ).
Here, the expectation is over the random variable . If
) for some
Restarting trick: Initialize and restart the base.
Proof of Theorem 4.10. The proof follows that of Theorem 15 in [Aga+17]. Let T be the rounds where Line 10 of the CORRAL is executed. Let
notational convenience. Let
+ 1
]. Denote by
the probability lower bound maintained by CORRAL during time-steps
the proof of Lemma 13 in [Aga+17], the authors prove
) with probability one. Therefore,
The inequality is a consequence of Lemma A.1 applied to the restarted segment [This step is valid because by assumption
) for some function
, then
). And therefore:
Where ¯. The last inequality follows from the same argument as in Theorem 15 in [Aga+17].
Proof of Theorem 4.11
Proof. For the CORRAL meta-algorithm,
Using Theorem 4.10 to control term II, the total regret of CORRAL is:
Using Lemma A.1 to control term II, we have the total regret of EXP3.P when
When both ) are known, choose
. When only
choose
. We then have the following regret:
Table 2: The top row shows the general regret guarantees. The middle row shows the regret guarantees when ) are known. The bottom row shows the regret guarantees when
is known and
) is unknown.
Proof. The LHS follows immediately from observing is decreasing as a function of t and therefore
). The RHS is a consequence of bounding the sum by the integral
, substituting the definition
) and solving it.
Lemma B.2. If f(x) is a concave and doubly differentiable function on then f(x)/x is decreasing on x > 0
Proof. In order to show that f(x)/x is decreasing when x > 0, we want to show that
) is a non-increasing function on x > 0. We have
0 when
0 because f(x) is concave. Therefore
(0)
(0)
0 for all
0, which completes the proof.
Proof. By definition