For any positive integer n, we denote by [n]. The transpose of a vector A is denoted by
. For positive semidefinite matrix
and for any vector
to
. We also use notation
to denote the d-by-d identity matrix.
Let A be the set of all possible actions and let (be a sequence of T random subsets of
will be referred to as the time horizon. A policy
sequentially interacts with this environment in T rounds. At time
], the action set
is revealed to the policy and it chooses an action
and receives a stochastic reward R(
). We also assume that
and is
bounded; that is, there exists a positive constant . Moreover, we assume there exists a random vector Θ
for all where
is the standard dotproduct on
. We also assume that there exists a
positive parameter such that distribution of Θ
For example, if Θis bounded then Eq. (2.2) easily holds. Another important setting where Eq. (2.2) holds is when Θ
has a normal distribution.
Next, we introduce the notation
and assume that there exists (random) optimal action such that the following holds almost
surely for all
Now, consider the sequence of that encode history of observations up
to time t and are defined by
In this model, a is formally defined as a deterministic function that maps
to an element of
Specifically,
The performance measure for evaluating the policies is the standard cumulative Bayesian regret
defined as
The expectation is taken with respect to the entire randomness in our model, including the prior distribution of Θ. Although we assumed that Θ
is random, and we described a Bayesian formulation of regret, our model and results include the deterministic Θ
setting as well. This can be achieved by considering the prior distribution to be the distribution with a point mass at Θ
Action sets. Action sets and their structure play key roles in this paper which require introducing a number of important notions associated with them. We start by defining the extremal points of an action set.
Definition 2.1 (Extremal points) For an action set A, define its extremal points E(A) to be all that are not a convex combination of other actions in A, i.e., actions in A for which one
cannot find actions and coefficients
satisfying
The importance of this definition is that all the algorithms studied in this paper choose only extremal points in action sets, because of the linearity assumption on the mean reward as stated in Eq. (2.1). This observation implies that the rewards attained by any of these algorithms, when
provided with the action set A, belong to the reward profile of A defined by
Recall from Eq. (2.3) that ) is the maximum attainable reward of an action set A. Building
on this, we define gap of an action set A as
Moreover, for any 0, we define
In the sequel, we may simplify the above notation and use subscript t instead of . For instance,
. We now define a gapped problem as follows:
Definition 2.2 (Gapped problem) We call a linear bandit problem gapped if for some positive numbers , the following inequality holds:
where the probability is calculated with respect to the randomness of the action sets. Moreover, for a fixed gap level to be the indicator of the event
Remark 2.1 The above notion of gap is more general than the well-known notion of gap in the literature, as in (Abbasi-Yadkori et al. 2011), which is a deterministic concept. Specifically, we do not assume that the probability is equal to 0, for a fixed
Remark 2.2 All problems are gapped for all 0 and
= 1 since Eq. (2.4) will be trivially satisfied. This observation will help us obtain gap-independent bounds.
In this section we introduce the notion of uncertainty structure, which will be a key parameter in obtaining regret bounds in subsequent sections. We also calculate this parameter in three examples to help build intuition.
By uncertainty structure, we simply refer to a sequence of functions : (
, where
. By a slight abuse of notation, for any policy
Finally, for a set of policies P, the uncertainty complexity is defined as
Note that uncertainty complexity is not a unique quantity for a given problem as the choice of functions can vary. We will see in the following sections that any uncertainty structure, together with an associated gain rate that is defined in Section 4, can be used to provide an upper bound for the regret of any policy. However, the quality of the regret bound does depend on the choice of uncertainty structure.
In order to get a better of sense of uncertainty complexity, in the remainder of this section we provide upper bounds for the uncertainty complexity of several well-known problems. We then use these bounds in Section 6 to derive rate-optimal regret bounds for OFUL and variants of TS. Overall, the optimal selection of an uncertainty structure is an interesting and challenging research question, but one that is well beyond the scope of this paper.
Remark 3.1 The above notion of uncertainty complexity is similar to the notion of maximum information by Kirschner and Krause (2018), see their Eq. (2). The main difference is that we do not require the essential supremum of to exist. This makes our analysis simpler; see, e.g., the second paragraph on page 6 of (Kirschner and Krause 2018). Also, in contrast to Kirschner and Krause (2018), our proof technique provides regret bounds for the (generalized) gapped version of the problem as well.
be a positive and fixed real number and, for
any
Then, we choose the following uncertainty structure:
Lemmas 10 and 11 of Abbasi-Yadkori et al. (2011) essentially prove that
Example 3.2 (Bayesian linear bandit with fixed finite action sets) Consider a finite
action set that does not change over time. In other words,
for all
almost surely. Following a similar notation as Russo and Van Roy (2016), for all
let
and
Now, defining , we consider the following uncertainty functions:
where
The analysis of Russo and Van Roy (2016) implies that
where is the entropy of
. For completeness, we provide a slightly modified version of their proof in Section A.
Example 3.3 (Bayesian linear bandit with normal prior and noise) In this example, we
focus on the Bayesian setting in which Θ), and at round t, the reward of selecting
action is given by R(
) =
where
) is independent of
. However, we allow the action sets to change over time and also allow the action sets to have infinite size.
Inspired by the previous example, we define
where
It is easy to see that in the setting of Example 3.2, the above definition is equivalent to Eq. (3.3). We now use a different technique to bound the uncertainty complexity. Notice that the normality
assumption yields
Therefore, Example 3.1 implies that
In this section, building on the notion of uncertainty structure, we introduce the notion of gain rate of any policy and then use that to obtain an upper bound for the regret.
be fixed. We say that a policy
has gain rate
respect to an uncertainty structure
for all
Remark 4.1 The constant is meant to account for very unlikely cases where the observations deviate from generic cases (this will be formalized by tail bounds). In most cases,
can be set to 0 or 1.
We are ready now to state a general result on the regret of any policy for any gap level that relies on uncertainty complexity K, gain rate
Theorem 4.1 Given an uncertainty structure , gap level
, and associated parameter
,
the regret of any policy
Remark 4.2 (Problem-independent bound) In most examples of this paper we will prove the following stronger variant of Eq. (4.1):
This inequality, implies that the gain rate is not a function of , which means that the regret bound in Eq. (4.2) holds for any
. Therefore, one can take the infimum of the right-hand side of Eq. (4.2)
over -independent regret bound
Remark 4.3 The above notion of gain rate is similar to the notion of information ratio of Russo and Van Roy (2016). Specifically, if is defined as in Example 3.2 and D = 0, then 1/G becomes the information ratio. We also note that Russo and Van Roy (2016) consider a Bayesian setting while our gain rate is defined for both Bayesian and frequentist settings.
be the shorthand for the indicator function
We then have
By summing both sides of the above inequality over t, we get
Finally, note that
This completes the proof of Theorem 4.1.
In this section we generalize the well-known optimism principle that is at the core of the OFUL algorithm of Abbasi-Yadkori et al. (2011). Specifically, we introduce the new notion of optimism in expectation, which allows us to propose a more general and more flexible version of OFUL, which we call the randomized OFUL (ROFUL) algorithm. We then show how optimism in expectation for a policy leads to a high gain rate and, hence a small regret bound. This allows us to prove a regret bound for ROFUL. In the next section we will show that, in addition to OFUL, Thompson sampling (TS) is also a special case of ROFUL. We will also see that our regret bound for ROFUL leads to a unified proof of rate optimality for both OFUL and TS.
Before executing the above plan, let us start with a few definitions.
Definition 5.1 (Confidence bounds) Confidence bounds are real-valued functions
such that, with probability at least 1
Also, refers to the indicator function for the event that
)] holds for all
Definition 5.2 (Baseline) For confidence bounds ) and
), the baseline
at time t is
defined by,
Next, we state an assumption that allows us to provide results in situations where Θis unbounded. In most of the prior literature
is bounded almost surely, which results in the exclusion of normal priors. The assumption allows us to overcome this constraint.
Assumption 5.1 For any constant refer to the family of all Bernoulli random
variables . Assume that,
Note that random variables can be correlated with
The expressionin Assumption 5.1 is the maximum attainable regret of any policy at time t. Applying the Cauchy–Schwarz inequality, we can see that a sufficient
condition for Assumption 5.1 to hold is that
For example, in the special case where 1 almost surely, the parameter D can be set to 4
We are ready now to introduce the ROFUL algorithm.
ROFUL Algorithm. ROFUL receives a ) that maps each arm
and each history instance
into a real number. The policy then chooses the action with the highest worth. Algorithm 1 presents the pseudocode of ROFUL.
Regret of ROFUL. The ROFUL algorithm as formulated in Algorithm 1 may not perform well, unless the worth functions ) are well behaved. We formally define what “well behaved” means by introducing two conditions of reasonableness and optimism. Intuitively, an algorithm that explores too much or too little incurs a high regret. Reasonableness and optimism are mechanisms for controlling these potential flaws, respectively. To define these notions rigorously, we assume that for each action A we are given upper and lower confidence bounds
), where as in Definition 5.1, the interval [
)] contains
) with high probability. In Section 6, we provide examples of these confidence bounds for several examples of problems.
We are now ready to define the reasonableness for worth functions.
Definition 5.3 (Reasonableness) Given confidence bounds ) and
), a worth function
reasonable if, with probability at least 1
Moreover, the notation refers to the indicator function for the event that
holds for all
As we saw in Definition 5.1, the confidence bounds are such that for each arm A, the true mean reward ) lies in the confidence interval [
)] with high probability. Therefore, reasonableness ensures that the action chosen by ROFUL is close to the best action that ensures that ROFUL does not explore actions unnecessarily.
Next, we define optimism in expectation, which guarantees that ROFUL explores sufficiently.
Definition 5.4 (Optimism in expectation)
function is called optimistic with parameter
Note that are defined as in Definition 5.1 and Definition 5.3, respectively.
Figure 1 shows an illustration of the confidence bounds, the baseline, the interval used in optimism, and the worth functions.
The above notion requires the ROFUL algorithm to avoid paying the price of pure optimism (as
OFUL does). Specifically, OFUL ensures that the inequality
holds almost surely since for OFUL (as we will see in Section 6) the worth function is However, the analysis of ROFUL shows that all we need is that the inequality holds in expectation and up to a constant p. In Section 6.5, we will leverage the above intuition and introduce our sieved greedy (SG) algorithm that selects actions more greedily than OFUL while maintaining OFUL’s regret guarantees up to a constant. The core idea behind SG is to use data to stay as close as possible
Figure 1 Illustration of the building blocks of the ROFUL algorithm. Specifically, confidence bounds, baseline,
to the greedy policy while ensuring that the selected action satisfies the optimism-in-expectation condition.
Next, we show that the gain rate of ROFUL can be controlled by p, when the worth functions are reasonable and optimistic in expectation with parameter p.
Theorem 5.1 (Gain rate of ROFUL) Assume that is reasonable and optimistic in expec-
tation (with parameter p). Also assume that Assumption 5.1 holds with constant D; then we have
Before proving Theorem 5.1, we state our main regret bound for ROFUL which is a corollary of Theorem 4.1, Remark 4.2, and Theorem 5.1.
Corollary 5.1 (Regret of ROFUL) If one defines an uncertainty structure by
Theorem 4.1 implies the following gap-dependent regret bound, for any as in Definition 2.2:
and (by Remark 4.2) the following gap-independent regret bound:
Proof of Theorem 5.1. Define indicator variable as
:=
. Since these are indicator variables, we have 1
(1
) + (1
). Therefore, by the definition of
and of
, we
obtain that,
Using Assumption 5.1 we obtain
On the other hand,
The result now follows by summing both sides of Eq. (5.2) and Eq. (5.3).
The goal of this section is to demonstrate tangible examples of the ROFUL algorithm that may have seemed rather abstract up to this point. First, in Sections 6.1 to 6.4 we show that OFUL and variations of TS are special cases of ROFUL, which helps us to recover known regret bounds for them via our machinery from Sections 3 to 5. Then, in Section 6.5, motivated by our notion of optimism in expectation and its role in the regret of ROFUL, we introduce a new algorithm (sieved greedy) that enjoys similar theoretical guarantees to those of OFUL, but tends to make more greedy decisions, and hence achieves better empirical performance.
6.1. Worst-case analysis of OFUL
As our first example, we study the OFUL algorithm of Abbasi-Yadkori et al. (2011). First, building
on the notation from Eq. (3.1), we define
Using Theorem 1 of Abbasi-Yadkori et al. (2011), we realize that
where
Therefore, we can apply the Cauchy–Schwartz inequality and conclude that, for all , the following functions satisfy the confidence bounds definition:
Moreover, OFUL can be written as an instance of ROFUL as follows:
Reasonableness follows from the definition of this worth function and will be always equal to 1.
For optimism, note that, whenever = 1, we have
We thus get
This in turn implies that the optimism in expectation holds with p = 1. Using Corollary 5.1 together with Eq. (3.2) leads to the following gap-dependent bound:
and the following gap-independent bound:
Note that, if we ignore logarithmic factors, this bound is ) since
is
) and D is constant.
6.2. Bayesian analysis of TS
We obtain a Bayesian regret upper bound for TS similar to the one proved by Russo and Van Roy (2014). Let be given as in Section 6.1. Unlike in the previous section where Θ
fixed, here we assume that Θ
is also drawn from a prior distribution. Define the worth function by
where is a sample drawn from the posterior distribution of Θ
at time t that is used in TS. Therefore,
are exchangeable, given
; this, together with the definition of TS, gives
almost surely. This implies optimism in expectation with p = 1. For reasonableness, we can leverage
the same argument and obtain
Hence, we obtain the same gap-dependent bound of
and the same gap-independent bound of
6.3. Worst-case analysis of TS
In this section, we study the worst-case (frequentist) regret of TS with inflated posterior variance. We recover the same bounds as the ones by Agrawal and Goyal (2013) and Abeille et al. (2017). Algorithm 2 shows the pseudocode for this instance of TS. We also make the additional assumption that
Due to the inflated variance, we need to redefine ). Specifically, let
and define
As , we infer that
) satisfy the confidence bounds condition (Definition 5.1). We note that this definition replaces the
. Next, we prove that the worth function
given by
is reasonable. This is achieved by the following lemma, which is proved in Section A.
In order to derive our regret bound, we also need to verify the optimism in expectation assumption.
Whenever
Sinceis distributed as
Therefore, for sufficiently large T, we have
Therefore, optimism in expectation holds with
Thus, similar to Section 6.2, Corollary 5.1 gives a gap-dependent bound of
and a similar gap-independent bound of
6.4. Bayesian analysis of TS with finitely many arms
Following the same technique as in the previous example, we can prove a sharper regret bound for TS in the normal prior and normal noise setting. The main idea is to use smaller confidence
bounds. More precisely, write
and then define
Using the same techniques as in the proof of reasonableness in the previous example, we can show that these functions satisfy the confidence bounds condition and that the worth function defined by Eq. (6.3) is reasonable with respect to these confidence bounds. This yields an regret bound that is sharper than the well-known
regret bound. The only comparable result that we are aware of is the
bound provided by Russo and Van Roy (2016). Although their bound does not require normality and is sharper than ours, it does not allow changing action sets as does our bound.
Algorithm 3 Sieved-Greedy (SG)
Require:
6.5. Toward a better use of data: sieved greedy (SG)
In this section, we present a novel algorithm that enjoys the same regret bound as the one we proved for OFUL. This new policy, nonetheless, tends to make more greedy decisions. As we will see in Section 7, this algorithm achieves a similar cumulative regret to that of the best policy in each scenario.
Figure 2 Illustration of how SG works compared to OFUL and greedy.
This algorithm receives a as input. Then, at time t, this algorithm first
discards all the actions that lack sufficient uncertainty, i.e., that satisfy
Denoting the set of remaining (sieved) actions by , we note that the algorithm makes a greedy
decision over
Therefore, we call the algorithm sieved greedy (SG). When = 1, this algorithm is identical to OFUL and
= 0 leads to the greedy algorithm. Algorithm 3 shows the pseudocode for SG and Figure 2 is an illustration of how SG works.
We show that this algorithm is also an instance of ROFUL. To do so, we introduce the following worth function:
We need to show that the ROFUL algorithm with this worth function chooses the same action as
SG. Notice that
Next, the reasonableness of the worth function is evident from the definition of ). For the
optimism, note that
Therefore, optimism in expectation holds with , which means that Corollary 5.1 gives the
gap-dependent bound of
and the gap-independent bound of
Remark 6.1 (Sieved version of general ROFUL) Here we showed that SG is an instance of
ROFUL. However, as shown in Eq. (6.4), this reduction is general. Specifically, need not to be
selected greedily. In fact, any action that is selected from the set of sieved actions can be replaced with , and the above regret analysis of SG stays valid. This means one can apply the sieving idea to any instance of ROFUL, including TS and OFUL. We expect SG to outperform such “sieved TS” or “sieved OFUL”, at least empirically, because it makes more greedy decisions. But, there could be other circumstances, under which, sieved TS or sieved OFUL may be more preferred. For example, a decision-maker may prefer TS as it is a randomized policy, and in such a scenario, she can use sieved TS, with the same theoretical guarantees as SG, while maintaining a TS-based policy.
In this section, we compare the performance of OFUL, TS, greedy, and SG (with sieving rates 0,2, 0.5, and 0.8) in two scenarios. In each scenario, the unknown parameter vector Θis first sampled from
= 120. Then, in round t, a set of n = 10 actions is generated. More precisely:
Scenario I. A random vector is picked uniformly at random on the sphere of radius 5 in
Then, the action
10 is constructed by copying
-th block of size 12. Although d = 120, this scenario is equivalent to a 10-armed 12-dimensional contextual bandit problem with a shared feature vector
, embedded in the linear bandit framework, as explained by
Scenario II. Motivated by the more general linear bandit problem, each action is chosen uniformly at random on the sphere of radius 5 in
Each policy chooses an action
and receives the reward
is a sequence of i.i.d. standard normal random variables. We run each experiment for T = 10,000 rounds and repeat this procedure 50 times. The average regret of each policy (and error bars of width 2
in each direction) is shown in Figure 3. As is clear from the plots, in Scenario I, TS is the best policy and SG has a very similar performance, while greedy performs very poorly. But in Scenario II, greedy and SG achieve a substantially better performance compared to OFUL and TS. We also see that the performance of SG is generally less dependent on the sieving rate
Specifically, in Scenario II, all versions have the same performance as greedy. In Scenario I, while all versions outperform OFUL and greedy,
5 slightly outperforms the other two variants and nearly ties with TS.
These results underscore that SG inherits beneficial properties of both greedy and OFUL. It performs similar to greedy when greedy works well, but does not prematurely drop potentially optimal arms, which causes greedy to perform very poorly sometimes.
We can strengthen our regret bounds for ROFUL for two important special cases of the stochastic linear bandit problem. Specifically, in Section 8.1, motivated by the k-armed contextual bandit
Figure 3 Comparison of the cumulative regret incurred by SG with sieving rates 0.2, 0.5, and 0.8 to the cumulative
problem, we show how our proof technique allows improving all regret bounds of Section 6 by a factor. Then, in Section 8.2, making similar (generalized gap and margin) assumptions to
those by Goldenshluger and Zeevi (2013) and Bastani and Bayati (2020), we obtain polylogarithmic
regret bounds for ROFUL and the obtain the first such results for OFUL and TS.
8.1. Grouped linear bandit
Here we focus on improving our previous regret bounds for a family of subproblems. Although these improvements are mainly motivated by the special case of the k-armed contextual bandit, we formulate a slightly more general case of the stochastic linear bandit which we will refer to as the
grouped linear bandit.
Definition 8.1 (Grouped linear bandit) Let k and d be two integers and let (Z
sequence of d-dimensional subspaces of such that each vector
can be uniquely decomposed
to grouped linear bandit (GLB) problem simply refers to a linear bandit problem in which
As mentioned above, the GLB formulation is meant to capture the specific structure in the contextual setting. In fact, a k-armed d-dimensional contextual bandit problem can be modeled as a kd-dimensional linear bandit one, as discussed by Abbasi-Yadkori (2012). However, the GLB problem also includes the original linear bandit problem if we assume that k = 1. Notice that in the GLB problem we also have Θ, which in turn implies that the number of parameters is kd (rather than d). Therefore, our previous problem-independent regret bounds from Section 6 would be
-dimensional problem. As we will see shortly, this bound can be tightened by a factor of
The key observation is that the radius of the confidence set can be shrunk to
Note that as defined in Eq. (6.1) for this problem is given by
which is worse than as defined in Eq. (8.1) by an asymptotic factor of
as T grows large. Specifically, we can show that the following functions satisfy the confidence bounds definition:
This can be shown by noting that when , then
can be bounded by applying Theorem 1 of Abbasi-Yadkori et al. (2011) in a d-dimensional rather than kd-dimensional setting.
Combining this with the union bound, we obtain
By the same argument as in Section 6.1, we get
and by tuning as before we get
Moreover, in the Bayesian setting, one can use these confidence bounds to prove a similar Bayesian regret bound for TS with the proper update rule. In the frequentist setting, on the other hand, this
idea can be used to show that
and by tuning as before,
where
This shows that the posterior variance inflation can be reduced by a factor of ) in TS as T
grows large.
8.2. Polylogarithmic regret bounds
In this subsection we provide regret bounds for ROFUL when confidence sets are defined such that they grow with T polylogarithmically, under additional assumptions. Our assumptions are similar
to those made by Goldenshluger and Zeevi (2013) and Bastani and Bayati (2020). Throughout this
section we consider special classes of ROFUL where confidence intervals (Definition 5.1) are defined as in Eq. (6.2) with varying definitions of . This special class includes all examples of Section 6 such as OFUL, TS, and SG algorithms.
To state the first assumption, recall the gap parameters
Assumption 8.1 (Margin condition)
for all 0
Before stating the next condition, we need to define a notion of near-optimal space for the family of GLB problems.
Definition 8.2 (Near-optimal space) Consider a GLB problem as defined in Definition 8.1.
Let be the smallest number such that there exists
We define near-optimal space and, with a slight abuse of notation, we also treat W as the projection of
onto the subspace W.
Remark 8.1 The main purpose of this notion is to handle suboptimal arms in the special case of a k-armed contextual bandit. One might harmlessly assume that , or equivalently, assuming it is the identity function if viewed as an operator, and follow the rest of this section.
The next assumption demands the selected actions to be diverse in the near-optimal space. Specifi-cally, recall the inverse covariance matrix of the actions chosen by a policy from Eq. (3.1).
Assumption 8.2 (Linear expansion) We say that linear expansion holds for a policy if
for some constants 0 and all
] with
. We denote the indicator variable for the event
We will show in Lemma 8.2 that ROFUL satisfies the linear expansion assumption, under a variant of the optimism-in-expectation assumption from Section 5 as well as a certain diversity assumption. This fact, combined with the following lemma, leads to our main result of this section which is presented as Corollary 8.1. The next lemma operates on the same setting as Theorem 4.1 with additional assumptions on the reasonableness of the worth functions, the margin condition, and the linear expansions.
Lemma 8.1 Consider an uncertainty structure with uncertainty complexity K, gap level
, associated parameter
, as well as gain-rate parameters
and
. Also, assume that worth functions of Algorithm 1 (policy
) are reasonable (Definition 5.3), and that the margin condition (Assumption 8.1) and the linear expansion condition (Assumption 8.2) hold. Then, the cumulative regret of
satisfies the following inequality:
The proof of Lemma 8.1 is given in Section A.3.
Linear expansion and ROFUL. In what follows, we will show that under a certain diversity condition (Assumption 8.3) and a generalization of the optimism assumption in Section 5, the ROFUL algorithm satisfies linear expansion.
Assumption 8.3 (Diversity condition) We say that a GLB problem satisfies the diversity con-
dition with parameter is independent of
Assumption 8.3 is similar to Assumption A3 of Goldenshluger and Zeevi (2013) and Assumption 4
Definition 8.3 (Optimism in probability) optimistic
in probability if for some
almost surely.
Remark 8.2 A slightly stronger version of Eq. (8.4) is that
almost surely. It is worth noting that this stronger condition also implies optimism in expectation (Definition 5.4). First, note that is a deterministic function of (Θ
).
Therefore, we have
This proves optimism in expectation since,
It is worthwhile to mention that in the worst-case analysis of an algorithm, Θis a deterministic constant and, therefore, Eq. (8.4) and Eq. (8.5) are equivalent. Nevertheless, the stronger condition Eq. (8.5) need not hold when Θ
is drawn from a prior distribution. An example of this situation is the Bayesian analysis of TS in which Eq. (5.1) and Eq. (8.4) hold simultaneously, although Eq. (8.5) fails to hold.
Now, we are ready to state our result that ROFUL satisfies the linear expansion assumption if its worth functions are optimistic in probability.
Lemma 8.2 (ROFUL satisfies linear expansion) If the diversity condition (Assumption 8.3) holds and the worth functions of ROFUL are optimistic in probability (Definition 8.3), then ROFUL
satisfies the linear expansion condition (Assumption 8.2) with
where
The proof of Lemma 8.2 is given in Section A.4.
Next, we state the main result of this section, which directly follows from Lemma 8.1 and Lemma 8.2.
Corollary 8.1 Consider a GLB problem that satisfies the margin condition (Assumption 8.1) and the diversity condition (Assumption 8.3). Also, assume that the worth functions of Algorithm 1 (denoted by policy ) are reasonable (Definition 5.3) and optimistic in probability (Definition 8.3). Then the cumulative regret of
satisfies the following inequality:
where constants are defined as in Lemma 8.2.
Remark 8.3 Note that in terms of dependence in T, by Corollary 8.1 we prove a regret bound that is under similar conditions as the ones by Goldenshluger and Zeevi (2013) Bastani and Bayati (2020). Since OFUL and TS are special cases of ROFUL, this immediately provides an
regret bound for OFUL and TS as well. To the best of our knowledge, these results are new.
Remark 8.4 Corollary 8.1 also holds when a more general -margin condition for
0, as by
Goldenshluger and Zeevi (2009) Bastani et al. (2017) that replaces Eq. (8.2) with
is satisfied. In this case, the term 16) in the regret bound would be replaced by a term
of order
through the same proof technique.
This work was supported by the Stanford Data Science Initiative, and by National Science Foundation CAREER award CMMI: 1554140.
Abbasi-Yadkori, Yasin. 2012. Online Learning for Linearly Parametrized Control Problems. PhD. Thesis.
Abbasi-Yadkori, Yasin, D´avid P´al, Csaba Szepesv´ari. 2011. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems. 2312–2320.
Abe, Naoki, Philip M. Long. 1999. Associative reinforcement learning using linear probabilistic concepts. Proceedings of the Sixteenth International Conference on Machine Learning. ICML ’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3–11.
Abeille, Marc, Alessandro Lazaric, et al. 2017. Linear thompson sampling revisited. Electronic Journal of Statistics 11(2) 5165–5197.
Agrawal, Shipra, Navin Goyal. 2013. Thompson sampling for contextual bandits with linear payoffs. ICML (3). 127–135.
Auer, Peter. 2003. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 397–422.
Bastani, Hamsa, Mohsen Bayati. 2020. Online decision making with high-dimensional covariates. Operations Research 68(1) 276–294. doi:10.1287/opre.2019.1902.
Bastani, Hamsa, Mohsen Bayati, Khashayar Khosravi. 2017. Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011 .
Bayati, Mohsen, Nima Hamidi, Ramesh Johari, Khashayar Khosravi. 2020. The unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. Advances in Neural Information Processing Systems 33.
Dani, Varsha, Thomas P. Hayes, Sham M. Kakade. 2008. Stochastic linear optimization under bandit feedback. COLT.
Dong, Shi, Benjamin Van Roy. 2018. An information-theoretic analysis for thompson sampling with many actions. Advances in Neural Information Processing Systems. 4157–4165.
Goldenshluger, Alexander, Assaf Zeevi. 2009. Woodroofe’s one-armed bandit problem revisited. Ann. Appl. Probab. 19(4) 1603–1633. doi:10.1214/08-AAP589. URL https://doi.org/10.1214/08-AAP589.
Goldenshluger, Alexander, Assaf Zeevi. 2013. A linear response bandit problem. Stochastic Systems 3(1) 230–261.
Hao, Botao, Tor Lattimore, Csaba Szepesvari. 2019. Adaptive exploration in linear contextual bandit. arXiv preprint arXiv:1910.06996 .
Johari, Ramesh, Pete Koomen, Leonid Pekelis, David Walsh. 2017. Peeking at a/b tests: Why it matters, and what to do about it. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, 1517–1525. doi:10.1145/3097983.3097992.
Kannan, Sampath, Jamie H Morgenstern, Aaron Roth, Bo Waggoner, Zhiwei Steven Wu. 2018. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. Advances in Neural Information Processing Systems. 2227–2236.
Kirschner, Johannes, Andreas Krause. 2018. Information directed sampling and bandits with heteroscedastic noise. Proc. International Conference on Learning Theory (COLT).
Lai, Tze Leung, Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1) 4–22.
Lattimore, Tor. 2015. The pareto regret frontier for bandits. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett, eds., Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 208– 216. URL http://papers.nips.cc/paper/6032-the-pareto-regret-frontier-for-bandits.pdf.
Laurent, Beatrice, Pascal Massart. 2000. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 1302–1338.
Raghavan, Manish, Aleksandrs Slivkins, Jennifer Wortman Vaughan, Zhiwei Steven Wu. 2018. The externalities of exploration and how data diversity helps exploitation. arXiv preprint arXiv:1806.00543 .
Rusmevichientong, Paat, John N Tsitsiklis. 2010. Linearly parameterized bandits. Mathematics of Operations Research 35(2) 395–411.
Russo, Daniel, Benjamin Van Roy. 2014. Learning to optimize via posterior sampling. Mathematics of Operations Research 39(4) 1221–1243. doi:10.1287/moor.2014.0650.
Russo, Daniel, Benjamin Van Roy. 2016. An information-theoretic analysis of thompson sampling. The Journal of Machine Learning Research 17(1) 2442–2471.
Scott, Steven L. 2010. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 26(6) 639–658.
Scott, Steven L. 2015. Multi-armed bandit experiments in the online service economy. Appl. Stoch. Model. Bus. Ind. 31(1) 37–45. doi:10.1002/asmb.2104.
Srinivas, Niranjan, Andreas Krause, Sham Kakade, Matthias Seeger. 2010. Gaussian process optimization in the bandit setting: No regret and experimental design. ICML’10, Omnipress, Madison, WI, USA, 1015–1022.
Thompson, William R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4) 285–294.
Tropp, Joel A. 2012. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12(4) 389–434.
Appendix A: Additional Proofs
A.1. Proof of Eq. (3.4)
Proof. Noting that
we get, by Lemma 3 of Russo and Van Roy (2016), that for all
This in turn implies that
For any policy
Using the assumption that is independent of
conditional on
, we can write
Therefore, by summing up both sides of the above inequalities, we get
A.2. Proof of Lemma 6.1
First, we state the following lemma.
, then for all positive constants
Proof. The proof follows directly from applying Lemma 1 of Laurent and Massart (2000) which
gives
Proof of Lemma 6.1. First, assume that 6log 2+ 12log T. Since
), it follows from Lemma A.1 (with
Therefore, combining this with log 2
In the finite action set case, we provide a different bound using the union bound. For each
note that ). Hence, we have,
where in the last step we used the fact that Φ() for all positive
A.3. Proof of Lemma 8.1
Proof. The main idea is to refine the proof of Theorem 4.1. We first recall Eq. (4.5):
where denotes
). We next improve the upper bound for each individual
term in the above sum. For , where
is defined as in Assumption 8.2, we use our previous
bound
Next, we consider . Whenever
= 0, we have
(
, which in turn implies
that . By recalling the indicator variable
defined in the proof of Theorem 5.1, and
provided that (1 = 1, we have
In the above, (a) holds since = 1, (b) follows from
= 1, (c) uses the fact that ROFUL chooses
the action with maximum worth ), and (d) is a consequence of
= 1. Now, using this inequality,
we can write
where the last step uses Assumption 8.1. This inequality in combination with Eq. (A.1) yields
which is the desired result.
A.4. Proof of Lemma 8.2
be a Bernoulli random variable with
almost surely such that
The existence of this random variable is guaranteed by the optimism-in-probability assumption.
Next, for
Moreover, it follows from the definition of
for all ]. Therefore, we get
By recalling Υ
We now bound each term separately. Next, we prove that the smallest singular value of
grows linearly with high probability. Using the noncommutative Bernstein’s inequality for 0,
(e.g., Theorem 1.4 of Tropp (2012)), we get
Setting 2 and applying the triangle inequality yields
Our next goal is to prove an upper bound for the largest singular value of
the following bound:
Using Eq. (A.2), we can deduce that, whenever = 1, we have
Therefore, we can write
Hence, it is a direct consequence of Eq. (A.3) that, for any
We prove that for sufficiently large t, the right-hand side of the above inequality is bounded above
by 3. This is equivalent to
Using Lemma A.2 below, we infer that this is satisfied for all
be given. Then, for all
Proof. First, note that is an increasing function of t for all
. To see this, we compute the derivative of f as follows:
Next, setting