A General Theory of the Stochastic Linear Bandit and Its Applications

2020·arXiv

1. Introduction

2. Problem Formulation and Notation

For any positive integer n, we denote by [n]. The transpose of a vector A is denoted by . For positive semidefinite matrix and for any vector to. We also use notation to denote the d-by-d identity matrix.

Let A be the set of all possible actions and let (be a sequence of T random subsets of will be referred to as the time horizon. A policy sequentially interacts with this environment in T rounds. At time ], the action set is revealed to the policy and it chooses an action and receives a stochastic reward R( ). We also assume that and is

bounded; that is, there exists a positive constant . Moreover, we assume there exists a random vector Θ

for all whereis the standard dotproduct on . We also assume that there exists a

positive parameter such that distribution of Θ

For example, if Θis bounded then Eq. (2.2) easily holds. Another important setting where Eq. (2.2) holds is when Θhas a normal distribution.

Next, we introduce the notation

and assume that there exists (random) optimal action such that the following holds almost

surely for all

Now, consider the sequence of that encode history of observations up

to time t and are defined by

In this model, a is formally defined as a deterministic function that maps to an element of

Specifically,

The performance measure for evaluating the policies is the standard cumulative Bayesian regret

defined as

The expectation is taken with respect to the entire randomness in our model, including the prior distribution of Θ. Although we assumed that Θis random, and we described a Bayesian formulation of regret, our model and results include the deterministic Θsetting as well. This can be achieved by considering the prior distribution to be the distribution with a point mass at Θ

Action sets. Action sets and their structure play key roles in this paper which require introducing a number of important notions associated with them. We start by defining the extremal points of an action set.

Definition 2.1 (Extremal points) For an action set A, define its extremal points E(A) to be all that are not a convex combination of other actions in A, i.e., actions in A for which one

cannot find actions and coefficients satisfying

The importance of this definition is that all the algorithms studied in this paper choose only extremal points in action sets, because of the linearity assumption on the mean reward as stated in Eq. (2.1). This observation implies that the rewards attained by any of these algorithms, when

provided with the action set A, belong to the reward profile of A defined by

Recall from Eq. (2.3) that ) is the maximum attainable reward of an action set A. Building

on this, we define gap of an action set A as

Moreover, for any 0, we define

In the sequel, we may simplify the above notation and use subscript t instead of . For instance, . We now define a gapped problem as follows:

Definition 2.2 (Gapped problem) We call a linear bandit problem gapped if for some positive numbers , the following inequality holds:

where the probability is calculated with respect to the randomness of the action sets. Moreover, for a fixed gap level to be the indicator of the event

Remark 2.1 The above notion of gap is more general than the well-known notion of gap in the literature, as in (Abbasi-Yadkori et al. 2011), which is a deterministic concept. Specifically, we do not assume that the probability is equal to 0, for a fixed

Remark 2.2 All problems are gapped for all 0 and = 1 since Eq. (2.4) will be trivially satisfied. This observation will help us obtain gap-independent bounds.

3. Uncertainty Complexity

In this section we introduce the notion of uncertainty structure, which will be a key parameter in obtaining regret bounds in subsequent sections. We also calculate this parameter in three examples to help build intuition.

By uncertainty structure, we simply refer to a sequence of functions : (, where

. By a slight abuse of notation, for any policy

Finally, for a set of policies P, the uncertainty complexity is defined as

Note that uncertainty complexity is not a unique quantity for a given problem as the choice of functions can vary. We will see in the following sections that any uncertainty structure, together with an associated gain rate that is defined in Section 4, can be used to provide an upper bound for the regret of any policy. However, the quality of the regret bound does depend on the choice of uncertainty structure.

In order to get a better of sense of uncertainty complexity, in the remainder of this section we provide upper bounds for the uncertainty complexity of several well-known problems. We then use these bounds in Section 6 to derive rate-optimal regret bounds for OFUL and variants of TS. Overall, the optimal selection of an uncertainty structure is an interesting and challenging research question, but one that is well beyond the scope of this paper.

Remark 3.1 The above notion of uncertainty complexity is similar to the notion of maximum information by Kirschner and Krause (2018), see their Eq. (2). The main difference is that we do not require the essential supremum of to exist. This makes our analysis simpler; see, e.g., the second paragraph on page 6 of (Kirschner and Krause 2018). Also, in contrast to Kirschner and Krause (2018), our proof technique provides regret bounds for the (generalized) gapped version of the problem as well.

be a positive and fixed real number and, for

any

Then, we choose the following uncertainty structure:

Lemmas 10 and 11 of Abbasi-Yadkori et al. (2011) essentially prove that

Example 3.2 (Bayesian linear bandit with fixed finite action sets) Consider a finite

action set that does not change over time. In other words, for all

almost surely. Following a similar notation as Russo and Van Roy (2016), for all

let

and

Now, defining , we consider the following uncertainty functions:

where

The analysis of Russo and Van Roy (2016) implies that

where is the entropy of . For completeness, we provide a slightly modified version of their proof in Section A.

Example 3.3 (Bayesian linear bandit with normal prior and noise) In this example, we

focus on the Bayesian setting in which Θ), and at round t, the reward of selecting

action is given by R( ) = where ) is independent of . However, we allow the action sets to change over time and also allow the action sets to have infinite size.

Inspired by the previous example, we define

where

It is easy to see that in the setting of Example 3.2, the above definition is equivalent to Eq. (3.3). We now use a different technique to bound the uncertainty complexity. Notice that the normality

assumption yields

Therefore, Example 3.1 implies that

4. Regret Bound and Gain Rate

In this section, building on the notion of uncertainty structure, we introduce the notion of gain rate of any policy and then use that to obtain an upper bound for the regret.

be fixed. We say that a policy has gain rate

respect to an uncertainty structure

for all

Remark 4.1 The constant is meant to account for very unlikely cases where the observations deviate from generic cases (this will be formalized by tail bounds). In most cases, can be set to 0 or 1.

We are ready now to state a general result on the regret of any policy for any gap level that relies on uncertainty complexity K, gain rate

Theorem 4.1 Given an uncertainty structure , gap level , and associated parameter ,

the regret of any policy

Remark 4.2 (Problem-independent bound) In most examples of this paper we will prove the following stronger variant of Eq. (4.1):

This inequality, implies that the gain rate is not a function of , which means that the regret bound in Eq. (4.2) holds for any . Therefore, one can take the infimum of the right-hand side of Eq. (4.2)

over -independent regret bound

Remark 4.3 The above notion of gain rate is similar to the notion of information ratio of Russo and Van Roy (2016). Specifically, if is defined as in Example 3.2 and D = 0, then 1/G becomes the information ratio. We also note that Russo and Van Roy (2016) consider a Bayesian setting while our gain rate is defined for both Bayesian and frequentist settings.

be the shorthand for the indicator function

We then have

By summing both sides of the above inequality over t, we get

Finally, note that

This completes the proof of Theorem 4.1.

5. ROFUL Algorithm

In this section we generalize the well-known optimism principle that is at the core of the OFUL algorithm of Abbasi-Yadkori et al. (2011). Specifically, we introduce the new notion of optimism in expectation, which allows us to propose a more general and more flexible version of OFUL, which we call the randomized OFUL (ROFUL) algorithm. We then show how optimism in expectation for a policy leads to a high gain rate and, hence a small regret bound. This allows us to prove a regret bound for ROFUL. In the next section we will show that, in addition to OFUL, Thompson sampling (TS) is also a special case of ROFUL. We will also see that our regret bound for ROFUL leads to a unified proof of rate optimality for both OFUL and TS.

Before executing the above plan, let us start with a few definitions.

Definition 5.1 (Confidence bounds) Confidence bounds are real-valued functions

such that, with probability at least 1

Also, refers to the indicator function for the event that )] holds for all

Definition 5.2 (Baseline) For confidence bounds ) and ), the baseline at time t is

defined by,

Next, we state an assumption that allows us to provide results in situations where Θis unbounded. In most of the prior literature is bounded almost surely, which results in the exclusion of normal priors. The assumption allows us to overcome this constraint.

Assumption 5.1 For any constant refer to the family of all Bernoulli random

variables . Assume that,

Note that random variables can be correlated with

The expressionin Assumption 5.1 is the maximum attainable regret of any policy at time t. Applying the Cauchy–Schwarz inequality, we can see that a sufficient

condition for Assumption 5.1 to hold is that

For example, in the special case where 1 almost surely, the parameter D can be set to 4

We are ready now to introduce the ROFUL algorithm.

ROFUL Algorithm. ROFUL receives a ) that maps each arm and each history instance into a real number. The policy then chooses the action with the highest worth. Algorithm 1 presents the pseudocode of ROFUL.

Regret of ROFUL. The ROFUL algorithm as formulated in Algorithm 1 may not perform well, unless the worth functions ) are well behaved. We formally define what “well behaved” means by introducing two conditions of reasonableness and optimism. Intuitively, an algorithm that explores too much or too little incurs a high regret. Reasonableness and optimism are mechanisms for controlling these potential flaws, respectively. To define these notions rigorously, we assume that for each action A we are given upper and lower confidence bounds ), where as in Definition 5.1, the interval [)] contains ) with high probability. In Section 6, we provide examples of these confidence bounds for several examples of problems.

We are now ready to define the reasonableness for worth functions.

Definition 5.3 (Reasonableness) Given confidence bounds ) and ), a worth function reasonable if, with probability at least 1

Moreover, the notation refers to the indicator function for the event that holds for all

As we saw in Definition 5.1, the confidence bounds are such that for each arm A, the true mean reward ) lies in the confidence interval [)] with high probability. Therefore, reasonableness ensures that the action chosen by ROFUL is close to the best action that ensures that ROFUL does not explore actions unnecessarily.

Next, we define optimism in expectation, which guarantees that ROFUL explores sufficiently.

Definition 5.4 (Optimism in expectation)

function is called optimistic with parameter

Note that are defined as in Definition 5.1 and Definition 5.3, respectively.

Figure 1 shows an illustration of the confidence bounds, the baseline, the interval used in optimism, and the worth functions.

The above notion requires the ROFUL algorithm to avoid paying the price of pure optimism (as

OFUL does). Specifically, OFUL ensures that the inequality

holds almost surely since for OFUL (as we will see in Section 6) the worth function is However, the analysis of ROFUL shows that all we need is that the inequality holds in expectation and up to a constant p. In Section 6.5, we will leverage the above intuition and introduce our sieved greedy (SG) algorithm that selects actions more greedily than OFUL while maintaining OFUL’s regret guarantees up to a constant. The core idea behind SG is to use data to stay as close as possible

Figure 1 Illustration of the building blocks of the ROFUL algorithm. Specifically, confidence bounds, baseline,

to the greedy policy while ensuring that the selected action satisfies the optimism-in-expectation condition.

Next, we show that the gain rate of ROFUL can be controlled by p, when the worth functions are reasonable and optimistic in expectation with parameter p.

Theorem 5.1 (Gain rate of ROFUL) Assume that is reasonable and optimistic in expec-

tation (with parameter p). Also assume that Assumption 5.1 holds with constant D; then we have

Before proving Theorem 5.1, we state our main regret bound for ROFUL which is a corollary of Theorem 4.1, Remark 4.2, and Theorem 5.1.

Corollary 5.1 (Regret of ROFUL) If one defines an uncertainty structure by

Theorem 4.1 implies the following gap-dependent regret bound, for any as in Definition 2.2:

and (by Remark 4.2) the following gap-independent regret bound:

Proof of Theorem 5.1. Define indicator variable as := . Since these are indicator variables, we have 1 (1 ) + (1 ). Therefore, by the definition of and of , we

obtain that,

Using Assumption 5.1 we obtain

On the other hand,

The result now follows by summing both sides of Eq. (5.2) and Eq. (5.3).

6. Examples of ROFUL and Sieved Greedy

The goal of this section is to demonstrate tangible examples of the ROFUL algorithm that may have seemed rather abstract up to this point. First, in Sections 6.1 to 6.4 we show that OFUL and variations of TS are special cases of ROFUL, which helps us to recover known regret bounds for them via our machinery from Sections 3 to 5. Then, in Section 6.5, motivated by our notion of optimism in expectation and its role in the regret of ROFUL, we introduce a new algorithm (sieved greedy) that enjoys similar theoretical guarantees to those of OFUL, but tends to make more greedy decisions, and hence achieves better empirical performance.

6.1. Worst-case analysis of OFUL

As our first example, we study the OFUL algorithm of Abbasi-Yadkori et al. (2011). First, building

on the notation from Eq. (3.1), we define

Using Theorem 1 of Abbasi-Yadkori et al. (2011), we realize that

where

Therefore, we can apply the Cauchy–Schwartz inequality and conclude that, for all , the following functions satisfy the confidence bounds definition:

Moreover, OFUL can be written as an instance of ROFUL as follows:

Reasonableness follows from the definition of this worth function and will be always equal to 1.

For optimism, note that, whenever = 1, we have

We thus get

This in turn implies that the optimism in expectation holds with p = 1. Using Corollary 5.1 together with Eq. (3.2) leads to the following gap-dependent bound:

and the following gap-independent bound:

Note that, if we ignore logarithmic factors, this bound is ) since is ) and D is constant.

6.2. Bayesian analysis of TS

We obtain a Bayesian regret upper bound for TS similar to the one proved by Russo and Van Roy (2014). Let be given as in Section 6.1. Unlike in the previous section where Θfixed, here we assume that Θis also drawn from a prior distribution. Define the worth function by

where is a sample drawn from the posterior distribution of Θat time t that is used in TS. Therefore, are exchangeable, given ; this, together with the definition of TS, gives

almost surely. This implies optimism in expectation with p = 1. For reasonableness, we can leverage

the same argument and obtain

Hence, we obtain the same gap-dependent bound of

and the same gap-independent bound of

6.3. Worst-case analysis of TS

In this section, we study the worst-case (frequentist) regret of TS with inflated posterior variance. We recover the same bounds as the ones by Agrawal and Goyal (2013) and Abeille et al. (2017). Algorithm 2 shows the pseudocode for this instance of TS. We also make the additional assumption that

Due to the inflated variance, we need to redefine ). Specifically, let

and define

As , we infer that ) satisfy the confidence bounds condition (Definition 5.1). We note that this definition replaces the . Next, we prove that the worth function

given by

is reasonable. This is achieved by the following lemma, which is proved in Section A.

In order to derive our regret bound, we also need to verify the optimism in expectation assumption.

Whenever

Sinceis distributed as

Therefore, for sufficiently large T, we have

Therefore, optimism in expectation holds with

Thus, similar to Section 6.2, Corollary 5.1 gives a gap-dependent bound of

and a similar gap-independent bound of

6.4. Bayesian analysis of TS with finitely many arms

Following the same technique as in the previous example, we can prove a sharper regret bound for TS in the normal prior and normal noise setting. The main idea is to use smaller confidence

bounds. More precisely, write

and then define

Using the same techniques as in the proof of reasonableness in the previous example, we can show that these functions satisfy the confidence bounds condition and that the worth function defined by Eq. (6.3) is reasonable with respect to these confidence bounds. This yields an regret bound that is sharper than the well-known regret bound. The only comparable result that we are aware of is the bound provided by Russo and Van Roy (2016). Although their bound does not require normality and is sharper than ours, it does not allow changing action sets as does our bound.

Algorithm 3 Sieved-Greedy (SG)

Require:

6.5. Toward a better use of data: sieved greedy (SG)

In this section, we present a novel algorithm that enjoys the same regret bound as the one we proved for OFUL. This new policy, nonetheless, tends to make more greedy decisions. As we will see in Section 7, this algorithm achieves a similar cumulative regret to that of the best policy in each scenario.

Figure 2 Illustration of how SG works compared to OFUL and greedy.

This algorithm receives a as input. Then, at time t, this algorithm first

discards all the actions that lack sufficient uncertainty, i.e., that satisfy

Denoting the set of remaining (sieved) actions by , we note that the algorithm makes a greedy

decision over

Therefore, we call the algorithm sieved greedy (SG). When = 1, this algorithm is identical to OFUL and = 0 leads to the greedy algorithm. Algorithm 3 shows the pseudocode for SG and Figure 2 is an illustration of how SG works.

We show that this algorithm is also an instance of ROFUL. To do so, we introduce the following worth function:

We need to show that the ROFUL algorithm with this worth function chooses the same action as

SG. Notice that

Next, the reasonableness of the worth function is evident from the definition of ). For the

optimism, note that

Therefore, optimism in expectation holds with , which means that Corollary 5.1 gives the

gap-dependent bound of

and the gap-independent bound of

Remark 6.1 (Sieved version of general ROFUL) Here we showed that SG is an instance of

ROFUL. However, as shown in Eq. (6.4), this reduction is general. Specifically, need not to be

selected greedily. In fact, any action that is selected from the set of sieved actions can be replaced with , and the above regret analysis of SG stays valid. This means one can apply the sieving idea to any instance of ROFUL, including TS and OFUL. We expect SG to outperform such “sieved TS” or “sieved OFUL”, at least empirically, because it makes more greedy decisions. But, there could be other circumstances, under which, sieved TS or sieved OFUL may be more preferred. For example, a decision-maker may prefer TS as it is a randomized policy, and in such a scenario, she can use sieved TS, with the same theoretical guarantees as SG, while maintaining a TS-based policy.

7. Numerical Simulations

In this section, we compare the performance of OFUL, TS, greedy, and SG (with sieving rates 0,2, 0.5, and 0.8) in two scenarios. In each scenario, the unknown parameter vector Θis first sampled from = 120. Then, in round t, a set of n = 10 actions is generated. More precisely:

Scenario I. A random vector is picked uniformly at random on the sphere of radius 5 in Then, the action 10 is constructed by copying -th block of size 12. Although d = 120, this scenario is equivalent to a 10-armed 12-dimensional contextual bandit problem with a shared feature vector , embedded in the linear bandit framework, as explained by

Scenario II. Motivated by the more general linear bandit problem, each action is chosen uniformly at random on the sphere of radius 5 in

Each policy chooses an action and receives the reward is a sequence of i.i.d. standard normal random variables. We run each experiment for T = 10,000 rounds and repeat this procedure 50 times. The average regret of each policy (and error bars of width 2 in each direction) is shown in Figure 3. As is clear from the plots, in Scenario I, TS is the best policy and SG has a very similar performance, while greedy performs very poorly. But in Scenario II, greedy and SG achieve a substantially better performance compared to OFUL and TS. We also see that the performance of SG is generally less dependent on the sieving rate Specifically, in Scenario II, all versions have the same performance as greedy. In Scenario I, while all versions outperform OFUL and greedy, 5 slightly outperforms the other two variants and nearly ties with TS.

These results underscore that SG inherits beneficial properties of both greedy and OFUL. It performs similar to greedy when greedy works well, but does not prematurely drop potentially optimal arms, which causes greedy to perform very poorly sometimes.

8. Improved Bounds for Two Important Subproblems

We can strengthen our regret bounds for ROFUL for two important special cases of the stochastic linear bandit problem. Specifically, in Section 8.1, motivated by the k-armed contextual bandit

Figure 3 Comparison of the cumulative regret incurred by SG with sieving rates 0.2, 0.5, and 0.8 to the cumulative

problem, we show how our proof technique allows improving all regret bounds of Section 6 by a factor. Then, in Section 8.2, making similar (generalized gap and margin) assumptions to

those by Goldenshluger and Zeevi (2013) and Bastani and Bayati (2020), we obtain polylogarithmic

regret bounds for ROFUL and the obtain the first such results for OFUL and TS.

8.1. Grouped linear bandit

Here we focus on improving our previous regret bounds for a family of subproblems. Although these improvements are mainly motivated by the special case of the k-armed contextual bandit, we formulate a slightly more general case of the stochastic linear bandit which we will refer to as the

grouped linear bandit.

Definition 8.1 (Grouped linear bandit) Let k and d be two integers and let (Z

sequence of d-dimensional subspaces of such that each vector can be uniquely decomposed

to grouped linear bandit (GLB) problem simply refers to a linear bandit problem in which

As mentioned above, the GLB formulation is meant to capture the specific structure in the contextual setting. In fact, a k-armed d-dimensional contextual bandit problem can be modeled as a kd-dimensional linear bandit one, as discussed by Abbasi-Yadkori (2012). However, the GLB problem also includes the original linear bandit problem if we assume that k = 1. Notice that in the GLB problem we also have Θ, which in turn implies that the number of parameters is kd (rather than d). Therefore, our previous problem-independent regret bounds from Section 6 would be -dimensional problem. As we will see shortly, this bound can be tightened by a factor of

The key observation is that the radius of the confidence set can be shrunk to

Note that as defined in Eq. (6.1) for this problem is given by

which is worse than as defined in Eq. (8.1) by an asymptotic factor ofas T grows large. Specifically, we can show that the following functions satisfy the confidence bounds definition:

This can be shown by noting that when , then can be bounded by applying Theorem 1 of Abbasi-Yadkori et al. (2011) in a d-dimensional rather than kd-dimensional setting.

Combining this with the union bound, we obtain

By the same argument as in Section 6.1, we get

and by tuning as before we get

Moreover, in the Bayesian setting, one can use these confidence bounds to prove a similar Bayesian regret bound for TS with the proper update rule. In the frequentist setting, on the other hand, this

idea can be used to show that

and by tuning as before,

where

This shows that the posterior variance inflation can be reduced by a factor of ) in TS as T

grows large.

8.2. Polylogarithmic regret bounds

In this subsection we provide regret bounds for ROFUL when confidence sets are defined such that they grow with T polylogarithmically, under additional assumptions. Our assumptions are similar

to those made by Goldenshluger and Zeevi (2013) and Bastani and Bayati (2020). Throughout this

section we consider special classes of ROFUL where confidence intervals (Definition 5.1) are defined as in Eq. (6.2) with varying definitions of . This special class includes all examples of Section 6 such as OFUL, TS, and SG algorithms.

To state the first assumption, recall the gap parameters

Assumption 8.1 (Margin condition)

for all 0

Before stating the next condition, we need to define a notion of near-optimal space for the family of GLB problems.

Definition 8.2 (Near-optimal space) Consider a GLB problem as defined in Definition 8.1.

Let be the smallest number such that there exists

We define near-optimal space and, with a slight abuse of notation, we also treat W as the projection of onto the subspace W.

Remark 8.1 The main purpose of this notion is to handle suboptimal arms in the special case of a k-armed contextual bandit. One might harmlessly assume that , or equivalently, assuming it is the identity function if viewed as an operator, and follow the rest of this section.

The next assumption demands the selected actions to be diverse in the near-optimal space. Specifi-cally, recall the inverse covariance matrix of the actions chosen by a policy from Eq. (3.1).

Assumption 8.2 (Linear expansion) We say that linear expansion holds for a policy if

for some constants 0 and all ] with . We denote the indicator variable for the event

We will show in Lemma 8.2 that ROFUL satisfies the linear expansion assumption, under a variant of the optimism-in-expectation assumption from Section 5 as well as a certain diversity assumption. This fact, combined with the following lemma, leads to our main result of this section which is presented as Corollary 8.1. The next lemma operates on the same setting as Theorem 4.1 with additional assumptions on the reasonableness of the worth functions, the margin condition, and the linear expansions.

Lemma 8.1 Consider an uncertainty structure with uncertainty complexity K, gap level , associated parameter , as well as gain-rate parameters and . Also, assume that worth functions of Algorithm 1 (policy ) are reasonable (Definition 5.3), and that the margin condition (Assumption 8.1) and the linear expansion condition (Assumption 8.2) hold. Then, the cumulative regret of satisfies the following inequality:

The proof of Lemma 8.1 is given in Section A.3.

Linear expansion and ROFUL. In what follows, we will show that under a certain diversity condition (Assumption 8.3) and a generalization of the optimism assumption in Section 5, the ROFUL algorithm satisfies linear expansion.

Assumption 8.3 (Diversity condition) We say that a GLB problem satisfies the diversity con-

dition with parameter is independent of

Assumption 8.3 is similar to Assumption A3 of Goldenshluger and Zeevi (2013) and Assumption 4

Definition 8.3 (Optimism in probability) optimistic

in probability if for some

almost surely.

Remark 8.2 A slightly stronger version of Eq. (8.4) is that

almost surely. It is worth noting that this stronger condition also implies optimism in expectation (Definition 5.4). First, note that is a deterministic function of (Θ).

Therefore, we have

This proves optimism in expectation since,

It is worthwhile to mention that in the worst-case analysis of an algorithm, Θis a deterministic constant and, therefore, Eq. (8.4) and Eq. (8.5) are equivalent. Nevertheless, the stronger condition Eq. (8.5) need not hold when Θis drawn from a prior distribution. An example of this situation is the Bayesian analysis of TS in which Eq. (5.1) and Eq. (8.4) hold simultaneously, although Eq. (8.5) fails to hold.

Now, we are ready to state our result that ROFUL satisfies the linear expansion assumption if its worth functions are optimistic in probability.

Lemma 8.2 (ROFUL satisfies linear expansion) If the diversity condition (Assumption 8.3) holds and the worth functions of ROFUL are optimistic in probability (Definition 8.3), then ROFUL

satisfies the linear expansion condition (Assumption 8.2) with

where

The proof of Lemma 8.2 is given in Section A.4.

Next, we state the main result of this section, which directly follows from Lemma 8.1 and Lemma 8.2.

Corollary 8.1 Consider a GLB problem that satisfies the margin condition (Assumption 8.1) and the diversity condition (Assumption 8.3). Also, assume that the worth functions of Algorithm 1 (denoted by policy ) are reasonable (Definition 5.3) and optimistic in probability (Definition 8.3). Then the cumulative regret of satisfies the following inequality:

where constants are defined as in Lemma 8.2.

Remark 8.3 Note that in terms of dependence in T, by Corollary 8.1 we prove a regret bound that is under similar conditions as the ones by Goldenshluger and Zeevi (2013) Bastani and Bayati (2020). Since OFUL and TS are special cases of ROFUL, this immediately provides an regret bound for OFUL and TS as well. To the best of our knowledge, these results are new.

Remark 8.4 Corollary 8.1 also holds when a more general -margin condition for 0, as by

Goldenshluger and Zeevi (2009) Bastani et al. (2017) that replaces Eq. (8.2) with

is satisfied. In this case, the term 16) in the regret bound would be replaced by a term

of order

through the same proof technique.

Acknowledgments

This work was supported by the Stanford Data Science Initiative, and by National Science Foundation CAREER award CMMI: 1554140.

References

Abbasi-Yadkori, Yasin. 2012. Online Learning for Linearly Parametrized Control Problems. PhD. Thesis.

Abbasi-Yadkori, Yasin, D´avid P´al, Csaba Szepesv´ari. 2011. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems. 2312–2320.

Abe, Naoki, Philip M. Long. 1999. Associative reinforcement learning using linear probabilistic concepts. Proceedings of the Sixteenth International Conference on Machine Learning. ICML ’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3–11.

Abeille, Marc, Alessandro Lazaric, et al. 2017. Linear thompson sampling revisited. Electronic Journal of Statistics 11(2) 5165–5197.

Agrawal, Shipra, Navin Goyal. 2013. Thompson sampling for contextual bandits with linear payoffs. ICML (3). 127–135.

Auer, Peter. 2003. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 397–422.

Bastani, Hamsa, Mohsen Bayati. 2020. Online decision making with high-dimensional covariates. Operations Research 68(1) 276–294. doi:10.1287/opre.2019.1902.

Bastani, Hamsa, Mohsen Bayati, Khashayar Khosravi. 2017. Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011 .

Bayati, Mohsen, Nima Hamidi, Ramesh Johari, Khashayar Khosravi. 2020. The unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. Advances in Neural Information Processing Systems 33.

Dani, Varsha, Thomas P. Hayes, Sham M. Kakade. 2008. Stochastic linear optimization under bandit feedback. COLT.

Dong, Shi, Benjamin Van Roy. 2018. An information-theoretic analysis for thompson sampling with many actions. Advances in Neural Information Processing Systems. 4157–4165.

Goldenshluger, Alexander, Assaf Zeevi. 2009. Woodroofe’s one-armed bandit problem revisited. Ann. Appl. Probab. 19(4) 1603–1633. doi:10.1214/08-AAP589. URL https://doi.org/10.1214/08-AAP589.

Goldenshluger, Alexander, Assaf Zeevi. 2013. A linear response bandit problem. Stochastic Systems 3(1) 230–261.

Hao, Botao, Tor Lattimore, Csaba Szepesvari. 2019. Adaptive exploration in linear contextual bandit. arXiv preprint arXiv:1910.06996 .

Johari, Ramesh, Pete Koomen, Leonid Pekelis, David Walsh. 2017. Peeking at a/b tests: Why it matters, and what to do about it. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, 1517–1525. doi:10.1145/3097983.3097992.

Kannan, Sampath, Jamie H Morgenstern, Aaron Roth, Bo Waggoner, Zhiwei Steven Wu. 2018. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. Advances in Neural Information Processing Systems. 2227–2236.

Kirschner, Johannes, Andreas Krause. 2018. Information directed sampling and bandits with heteroscedastic noise. Proc. International Conference on Learning Theory (COLT).

Lai, Tze Leung, Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1) 4–22.

Lattimore, Tor. 2015. The pareto regret frontier for bandits. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett, eds., Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 208– 216. URL http://papers.nips.cc/paper/6032-the-pareto-regret-frontier-for-bandits.pdf.

Laurent, Beatrice, Pascal Massart. 2000. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 1302–1338.

Raghavan, Manish, Aleksandrs Slivkins, Jennifer Wortman Vaughan, Zhiwei Steven Wu. 2018. The externalities of exploration and how data diversity helps exploitation. arXiv preprint arXiv:1806.00543 .

Rusmevichientong, Paat, John N Tsitsiklis. 2010. Linearly parameterized bandits. Mathematics of Operations Research 35(2) 395–411.

Russo, Daniel, Benjamin Van Roy. 2014. Learning to optimize via posterior sampling. Mathematics of Operations Research 39(4) 1221–1243. doi:10.1287/moor.2014.0650.

Russo, Daniel, Benjamin Van Roy. 2016. An information-theoretic analysis of thompson sampling. The Journal of Machine Learning Research 17(1) 2442–2471.

Scott, Steven L. 2010. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 26(6) 639–658.

Scott, Steven L. 2015. Multi-armed bandit experiments in the online service economy. Appl. Stoch. Model. Bus. Ind. 31(1) 37–45. doi:10.1002/asmb.2104.

Srinivas, Niranjan, Andreas Krause, Sham Kakade, Matthias Seeger. 2010. Gaussian process optimization in the bandit setting: No regret and experimental design. ICML’10, Omnipress, Madison, WI, USA, 1015–1022.

Thompson, William R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4) 285–294.

Tropp, Joel A. 2012. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12(4) 389–434.

Appendix A: Additional Proofs

A.1. Proof of Eq. (3.4)

Proof. Noting that

we get, by Lemma 3 of Russo and Van Roy (2016), that for all

This in turn implies that

For any policy

Using the assumption that is independent of conditional on , we can write

Therefore, by summing up both sides of the above inequalities, we get

A.2. Proof of Lemma 6.1

First, we state the following lemma.

, then for all positive constants

Proof. The proof follows directly from applying Lemma 1 of Laurent and Massart (2000) which

gives

Proof of Lemma 6.1. First, assume that 6log 2+ 12log T. Since

), it follows from Lemma A.1 (with

Therefore, combining this with log 2

In the finite action set case, we provide a different bound using the union bound. For each

note that ). Hence, we have,

where in the last step we used the fact that Φ() for all positive

A.3. Proof of Lemma 8.1

Proof. The main idea is to refine the proof of Theorem 4.1. We first recall Eq. (4.5):

where denotes ). We next improve the upper bound for each individual

term in the above sum. For , where is defined as in Assumption 8.2, we use our previous

bound

Next, we consider . Whenever = 0, we have ( , which in turn implies

that . By recalling the indicator variable defined in the proof of Theorem 5.1, and

provided that (1 = 1, we have

In the above, (a) holds since = 1, (b) follows from = 1, (c) uses the fact that ROFUL chooses

the action with maximum worth ), and (d) is a consequence of = 1. Now, using this inequality,

we can write

where the last step uses Assumption 8.1. This inequality in combination with Eq. (A.1) yields

which is the desired result.

A.4. Proof of Lemma 8.2

be a Bernoulli random variable with

almost surely such that

The existence of this random variable is guaranteed by the optimism-in-probability assumption.

Next, for

Moreover, it follows from the definition of

for all ]. Therefore, we get

By recalling Υ

We now bound each term separately. Next, we prove that the smallest singular value of

grows linearly with high probability. Using the noncommutative Bernstein’s inequality for 0,

(e.g., Theorem 1.4 of Tropp (2012)), we get

Setting 2 and applying the triangle inequality yields

Our next goal is to prove an upper bound for the largest singular value of

the following bound:

Using Eq. (A.2), we can deduce that, whenever = 1, we have

Therefore, we can write

Hence, it is a direct consequence of Eq. (A.3) that, for any

We prove that for sufficiently large t, the right-hand side of the above inequality is bounded above

by 3. This is equivalent to

Using Lemma A.2 below, we infer that this is satisfied for all

be given. Then, for all

Proof. First, note that is an increasing function of t for all . To see this, we compute the derivative of f as follows:

Next, setting

Designed for Accessibility and to further Open Science