b

DiscoverSearch
About
My stuff
A General Theory of the Stochastic Linear Bandit and Its Applications
2020·arXiv
1. Introduction
1. Introduction

Recently, multiarmed bandit (MAB) experiments have received extensive attention due to their potential for reducing the opportunity cost of running online experiments (Scott 2010, 2015, Johari et al. 2017). Specifically, MAB experiments allow adaptive adjustments to the design of the experiments, based on partially available data during the experiment. The MAB approach was first motivated by the cost of experimentation in clinical trials (Thompson 1933, Lai and Robbins 1985).

More formally, in a MAB problem, a decision-maker, also known as the policy or algorithm, sequentially chooses actions from given action sets and receives rewards corresponding to the selected actions. The goal is to maximize the cumulative reward throughout the experimentation periods, by utilizing the history of previous observations. Alternatively, the aim is to choose a policy that minimizes the cumulative regret, which is the difference between the highest achievable reward by a clairvoyant decision-maker who knows the expected reward of each action relative to the reward obtained by the policy. This paper considers a variant of this problem, called stochastic linear bandit, in which all actions are elements of  Rd for a positive integer d and the expected value of the reward depends on the actions via a linear function. This class of problems includes the well-known subclass of k-armed contextual MABs as a special case, when the action sets are allowed to be time-dependent.

Since its introduction by Abe and Long (1999), the linear bandit problem has attracted a great deal of attention. Several algorithms based on the idea of optimism or upper confidence bound (UCB), due to Lai and Robbins (1985), have been proposed and analyzed. Notable examples

are by Auer (2003), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Abbasi-Yadkori

et al. (2011), and Lattimore (2015). The best algorithm in this class is the optimism-in-the-face-of-uncertainty-linear-bandit (OFUL) algorithm of Abbasi-Yadkori et al. (2011) with the regret O�d√T log3/2 T�that matches the best lower bound due to Dani et al. (2008) up to logarithmic factors.

A second line of research examines the performance of Thompson sampling (TS) or posterior sampling, a Bayesian heuristic due to Thompson (1933) that employs the posterior distribution of the reward function to balance exploration and exploitation and reduce regret. Russo and Van Roy (2014), Dong and Van Roy (2018) proved an  O�d√T log T�upper bound for the Bayesian regret of TS, thereby indicating its near-optimality.

In addition, when there is a deterministic gap  δbetween the expected rewards of the top two actions, OFUL and TS are shown to have a regret with a poly(log  T)/δdependence in T instead of a√T log Tone. But this bound is not applicable when  δis exactly zero. In fact, this happens for the well-known subclass of linear k-armed contextual MABs. Hence, the prior log  T/δbounds for OFUL or TS are not applicable. One needs a more general (probabilistic) notion of the gap to study these types of problems. This is in fact the subject of the third stream of research, pioneered by Goldenshluger and Zeevi (2013), that leverages a so-called margin condition to model probabilistic reward  δ. They showed that the best lower bound for the contextual MAB is logarithmic in T, and proposed a variant of the  ϵ-greedy algorithm that achieves this bound. This idea was extended by Bastani and Bayati (2020) to settings where contexts are high-dimensional (i.e., d becomes very large). However, both of these papers propose algorithms that require an input parameter h to

adjust for the probabilistic gap. It is an open problem whether such logarithmic (in T) bounds

for OFUL and TS, that do not take any gap parameter as input can be proved, under the same

conditions as in (Goldenshluger and Zeevi 2013). In addition, while the first two streams of the aforementioned research were mostly united by the results of Russo and Van Roy (2014) and Abeille et al. (2017) that connected OFUL and TS, there was a disconnect between them and the third stream of research.

Contributions. In this paper we propose an analysis framework for the stochastic linear bandit problem that bridges all three aforementioned streams of literature and yields a number of new results. To be explicit, the main contributions of this paper are as follows:

1. We propose a general family of algorithms, called randomized OFUL (ROFUL), for the

image

2. Most importantly, we employ the margin assumption of Goldenshluger and Zeevi (2013) to

image

3. Our analysis of ROFUL naturally leads us to introduce a new rate-optimal policy, Sieved

image

4. Motivated by the fact that the k-armed d-dimensional contextual MAB problem is a special

image

image

1.1. Other literature

Some of the components in our analysis of the ROFUL algorithm have similarities with prior

literature (Srinivas et al. 2010, Russo and Van Roy 2016, Kirschner and Krause 2018). Specifically,

our notion of uncertainty complexity is similar to the notion of maximum information by Kirschner and Krause (2018). We discuss the differences between the two in Remark 3.1, but in summary our approach provides regret bounds for the more general probabilistic  δsetting as well. In addition, our notion of gain rate is similar to the notion of information ratio by Russo and Van Roy (2016). We discuss the differences between the two in Remark 4.3, but in summary Russo and Van Roy (2016) consider a Bayesian setting while we consider both Bayesian and frequentist settings.

1.2. Organization

We introduce notation and the problem formulation in Section 2. Then we introduce uncertainty complexity and its connection to regret in Sections 3 and 4. Our ROFUL algorithm and its regret analysis are presented in Section 5. In Section 6, we first demonstrate how OFUL and TS are special cases of ROFUL and then introduce our rate-optimal SG algorithm, which is empirically compared with existing benchmarks in Section 7. Finally, in Section 8, we provide extensions of our results to obtain polylogarithmic regret bounds for OFUL and TS and sharper bounds for the k-armed contextual bandit problem.

For any positive integer n, we denote  {1,2,··· ,n}by [n]. The transpose of a vector A is denoted by  A⊤. For positive semidefinite matrix  Σ ∈ Rn×n and for any vector  A ∈ Rn, notation ∥A∥Σ refersto√A⊤ΣA. We also use notation  Idto denote the d-by-d identity matrix.

Let A be the set of all possible actions and let (At)Tt=1be a sequence of T random subsets of A, where T ∈ Nwill be referred to as the time horizon. A policy  πsequentially interacts with this environment in T rounds. At time  t ∈ [T], the action set  Atis revealed to the policy and it chooses an action �At ∈ Atand receives a stochastic reward R( �At). We also assume that  A ⊂ Rdand is

bounded; that is, there exists a positive constant  a such that ∥A∥2 ≤ a for all A ∈ A. Moreover, we assume there exists a random vector Θ⋆ ∈ Rd for which

image

for all  A ∈ Atwhere�·,·�is the standard dotproduct on  Rd. We also assume that there exists a

positive parameter  θsuch that distribution of Θ⋆ satisfies

image

For example, if Θ⋆ is bounded then Eq. (2.2) easily holds. Another important setting where Eq. (2.2) holds is when Θ⋆ has a normal distribution.

Next, we introduce the notation

image

and assume that there exists (random) optimal action  A⋆t ∈ At such that the following holds almost

surely for all  A ∈ At,

image

Now, consider the sequence of  σ-algebras F0 ⊆ F1 ··· ⊆ Ft−1that encode history of observations up

to time t and are defined by

image

In this model, a  policy πis formally defined as a deterministic function that maps  Ft−1to an element of  At.

image

Specifically,  E[|εt| | Ft−1] < ∞ and

image

The performance measure for evaluating the policies is the standard cumulative Bayesian regret

defined as

image

The expectation is taken with respect to the entire randomness in our model, including the prior distribution of Θ⋆. Although we assumed that Θ⋆ is random, and we described a Bayesian formulation of regret, our model and results include the deterministic Θ⋆ setting as well. This can be achieved by considering the prior distribution to be the distribution with a point mass at Θ⋆.

Action sets. Action sets and their structure play key roles in this paper which require introducing a number of important notions associated with them. We start by defining the extremal points of an action set.

Definition 2.1 (Extremal points) For an action set A, define its extremal points E(A) to be all A′ ∈ Athat are not a convex combination of other actions in A, i.e., actions in A for which one

cannot find actions  {A1,··· ,An} ⊆ A \ {A′}and coefficients  {c1,··· ,cn} ⊂ [0,1]satisfying

image

The importance of this definition is that all the algorithms studied in this paper choose only extremal points in action sets, because of the linearity assumption on the mean reward as stated in Eq. (2.1). This observation implies that the rewards attained by any of these algorithms, when

provided with the action set A, belong to the reward profile of A defined by

image

Recall from Eq. (2.3) that  Mt(A⋆t) is the maximum attainable reward of an action set A. Building

on this, we define gap of an action set A as

image

Moreover, for any  z ≥0, we define

image

In the sequel, we may simplify the above notation and use subscript t instead of  At. For instance, ∆t refers to ∆At. We now define a gapped problem as follows:

Definition 2.2 (Gapped problem) We call a linear bandit problem gapped if for some positive numbers  δ and qδ, the following inequality holds:

image

where the probability is calculated with respect to the randomness of the action sets. Moreover, for a fixed gap level  δ, we define Gtto be the indicator of the event  {∆t ≥ δ}.

Remark 2.1 The above notion of gap is more general than the well-known notion of gap in the literature, as in (Abbasi-Yadkori et al. 2011), which is a deterministic concept. Specifically, we do not assume that the probability  qδis equal to 0, for a fixed  δ > 0.

Remark 2.2 All problems are gapped for all  δ >0 and  qδ= 1 since Eq. (2.4) will be trivially satisfied. This observation will help us obtain gap-independent bounds.

In this section we introduce the notion of uncertainty structure, which will be a key parameter in obtaining regret bounds in subsequent sections. We also calculate this parameter in three examples to help build intuition.

By uncertainty structure, we simply refer to a sequence of functions  Vt: (Ft−1,A) �→ R, where

A ∈ At. By a slight abuse of notation, for any policy  π, we define expected uncertainty to be

image

Finally, for a set of policies P, the uncertainty complexity is defined as

image

Note that uncertainty complexity is not a unique quantity for a given problem as the choice of functions  Vtcan vary. We will see in the following sections that any uncertainty structure, together with an associated gain rate that is defined in Section 4, can be used to provide an upper bound for the regret of any policy. However, the quality of the regret bound does depend on the choice of uncertainty structure.

In order to get a better of sense of uncertainty complexity, in the remainder of this section we provide upper bounds for the uncertainty complexity of several well-known problems. We then use these bounds in Section 6 to derive rate-optimal regret bounds for OFUL and variants of TS. Overall, the optimal selection of an uncertainty structure is an interesting and challenging research question, but one that is well beyond the scope of this paper.

Remark 3.1 The above notion of uncertainty complexity is similar to the notion of maximum information by Kirschner and Krause (2018), see their Eq. (2). The main difference is that we do not require the essential supremum of �t Vtto exist. This makes our analysis simpler; see, e.g., the second paragraph on page 6 of (Kirschner and Krause 2018). Also, in contrast to Kirschner and Krause (2018), our proof technique provides regret bounds for the (generalized) gapped version of the problem as well.

Example 3.1 (Unstructured linear bandit) Let λbe a positive and fixed real number and, for

any  t ∈ [T], define

image

Then, we choose the following uncertainty structure:

image

Lemmas 10 and 11 of Abbasi-Yadkori et al. (2011) essentially prove that

image

Example 3.2 (Bayesian linear bandit with fixed finite action sets) Consider a finite

action set  A = {A1,A2,··· ,Ak}that does not change over time. In other words,  At = Afor all

t ∈ [T]almost surely. Following a similar notation as Russo and Van Roy (2016), for all  j ∈ [k], we

let

image

and

image

Now, defining  µt := E[Θ⋆ | Ft−1], we consider the following uncertainty functions:

image

where

image

The analysis of Russo and Van Roy (2016) implies that

image

where  H(A⋆)is the entropy of  A⋆. For completeness, we provide a slightly modified version of their proof in Section A.

Example 3.3 (Bayesian linear bandit with normal prior and noise) In this example, we

focus on the Bayesian setting in which Θ⋆ ∼ N(0,λId), and at round t, the reward of selecting

action �Atis given by R( �At) =  ⟨Θ⋆, �At⟩ + εtwhere  εt ∼ N(0,σ2) is independent of  Ft−1. However, we allow the action sets to change over time and also allow the action sets to have infinite size.

Inspired by the previous example, we define

image

where

image

It is easy to see that in the setting of Example 3.2, the above definition is equivalent to Eq. (3.3). We now use a different technique to bound the uncertainty complexity. Notice that the normality

assumption yields

image

Therefore, Example 3.1 implies that

image

In this section, building on the notion of uncertainty structure, we introduce the notion of gain rate of any policy and then use that to obtain an upper bound for the regret.

Definition 4.1 (Gain rate) Let δ > 0be fixed. We say that a policy  πhas gain rate  Gδ > 0 with

respect to an uncertainty structure  {Vt}t≥1 if

image

for all  t ∈ [T].

Remark 4.1 The constant  Dδis meant to account for very unlikely cases where the observations deviate from generic cases (this will be formalized by tail bounds). In most cases,  Dδcan be set to 0 or 1.

We are ready now to state a general result on the regret of any policy for any gap level  δthat relies on uncertainty complexity K, gain rate  Gδ, and Dδ.

Theorem 4.1 Given an uncertainty structure  {Vt}t≥1, gap level  δ, and associated parameter  qδ,

the regret of any policy  π satisfies

image

Remark 4.2 (Problem-independent bound) In most examples of this paper we will prove the following stronger variant of Eq. (4.1):

image

This inequality, implies that the gain rate is not a function of  δ, which means that the regret bound in Eq. (4.2) holds for any  δ. Therefore, one can take the infimum of the right-hand side of Eq. (4.2)

over  δ to get a δ-independent regret bound

image

Remark 4.3 The above notion of gain rate is similar to the notion of information ratio of Russo and Van Roy (2016). Specifically, if  {Vt}t≥1is defined as in Example 3.2 and D = 0, then 1/G becomes the information ratio. We also note that Russo and Van Roy (2016) consider a Bayesian setting while our gain rate is defined for both Bayesian and frequentist settings.

Proof of Theorem 4.1. Let Btbe the shorthand for the indicator function  I�Mt(A⋆t) − Mt( �At) ≥ δ�.

We then have

image

By summing both sides of the above inequality over t, we get

image

Finally, note that

image

This completes the proof of Theorem 4.1. □

In this section we generalize the well-known optimism principle that is at the core of the OFUL algorithm of Abbasi-Yadkori et al. (2011). Specifically, we introduce the new notion of optimism in expectation, which allows us to propose a more general and more flexible version of OFUL, which we call the randomized OFUL (ROFUL) algorithm. We then show how optimism in expectation for a policy leads to a high gain rate and, hence a small regret bound. This allows us to prove a regret bound for ROFUL. In the next section we will show that, in addition to OFUL, Thompson sampling (TS) is also a special case of ROFUL. We will also see that our regret bound for ROFUL leads to a unified proof of rate optimality for both OFUL and TS.

Before executing the above plan, let us start with a few definitions.

Definition 5.1 (Confidence bounds) Confidence bounds are real-valued functions  Lt(·) and Ut(·)

such that, with probability at least 1  − t−3,

image

Also,  T⋆trefers to the indicator function for the event that  Mt(A) ∈ [Lt(A),Ut(A)] holds for all

A ∈ At.

Definition 5.2 (Baseline) For confidence bounds  Lt(·) and  Ut(·), the baseline  Btat time t is

defined by,

image

Next, we state an assumption that allows us to provide results in situations where Θ⋆ is unbounded. In most of the prior literature  ∥Θ⋆∥2is bounded almost surely, which results in the exclusion of normal priors. The assumption allows us to overcome this constraint.

Assumption 5.1 For any constant  ρ ∈ [T −2,1], let Bρrefer to the family of all Bernoulli random

variables  Z such that E[Z] = ρ. Assume that,

image

Note that random variables  Z in Bρcan be correlated with�supA∈At Mt(A) − infA∈At Mt(A)�.

The expression�supA∈At Mt(A) − infA∈At Mt(A)�in Assumption 5.1 is the maximum attainable regret of any policy at time t. Applying the Cauchy–Schwarz inequality, we can see that a sufficient

condition for Assumption 5.1 to hold is that

image

For example, in the special case where  ∥Θ⋆∥2 ≤1 almost surely, the parameter D can be set to 4a2.

We are ready now to introduce the ROFUL algorithm.

image

ROFUL Algorithm. ROFUL receives a  worth function �Mt(·) that maps each arm  A ∈ Atand each history instance  Ft−1into a real number. The policy then chooses the action with the highest worth. Algorithm 1 presents the pseudocode of ROFUL.

image

Regret of ROFUL. The ROFUL algorithm as formulated in Algorithm 1 may not perform well, unless the worth functions �Mt(·) are well behaved. We formally define what “well behaved” means by introducing two conditions of reasonableness and optimism. Intuitively, an algorithm that explores too much or too little incurs a high regret. Reasonableness and optimism are mechanisms for controlling these potential flaws, respectively. To define these notions rigorously, we assume that for each action A we are given upper and lower confidence bounds  Ut(A) ≥ Lt(A), where as in Definition 5.1, the interval [Lt(A),Ut(A)] contains  Mt(A) with high probability. In Section 6, we provide examples of these confidence bounds for several examples of problems.

We are now ready to define the reasonableness for worth functions.

Definition 5.3 (Reasonableness) Given confidence bounds  Lt(·) and  Ut(·), a worth function �Mt(·) is calledreasonable if, with probability at least 1  − t−3,

image

Moreover, the notation �Ttrefers to the indicator function for the event that �Mt(A) ∈ [Lt(A),Ut(A)]holds for all  A ∈ At.

As we saw in Definition 5.1, the confidence bounds are such that for each arm A, the true mean reward Mt(A) lies in the confidence interval [Lt(A),Ut(A)] with high probability. Therefore, reasonableness ensures that the action chosen by ROFUL is close to the best action that ensures that ROFUL does not explore actions unnecessarily.

Next, we define optimism in expectation, which guarantees that ROFUL explores sufficiently.

Definition 5.4 (Optimism in expectation)  Given confidence bounds Lt(·) and Ut(·), a worth

function �Mt(·)is called optimistic with parameter  p ∈ (0,1], if

image

Note that  T⋆t and �Tt are defined as in Definition 5.1 and Definition 5.3, respectively.

Figure 1 shows an illustration of the confidence bounds, the baseline, the interval used in optimism, and the worth functions.

The above notion requires the ROFUL algorithm to avoid paying the price of pure optimism (as

OFUL does). Specifically, OFUL ensures that the inequality

image

holds almost surely since for OFUL (as we will see in Section 6) the worth function is �Mt( �At) = Ut( �At).However, the analysis of ROFUL shows that all we need is that the inequality holds in expectation and up to a constant p. In Section 6.5, we will leverage the above intuition and introduce our sieved greedy (SG) algorithm that selects actions more greedily than OFUL while maintaining OFUL’s regret guarantees up to a constant. The core idea behind SG is to use data to stay as close as possible

image

Figure 1 Illustration of the building blocks of the ROFUL algorithm. Specifically, confidence bounds, baseline,

image

to the greedy policy while ensuring that the selected action �Atsatisfies the optimism-in-expectation condition.

Next, we show that the gain rate of ROFUL can be controlled by p, when the worth functions are reasonable and optimistic in expectation with parameter p.

image

Theorem 5.1 (Gain rate of ROFUL) Assume that �Mt(·)is reasonable and optimistic in expec-

tation (with parameter p). Also assume that Assumption 5.1 holds with constant D; then we have

image

Before proving Theorem 5.1, we state our main regret bound for ROFUL which is a corollary of Theorem 4.1, Remark 4.2, and Theorem 5.1.

Corollary 5.1 (Regret of ROFUL) If one defines an uncertainty structure by

image

Theorem 4.1 implies the following gap-dependent regret bound, for any  δ and qδas in Definition 2.2:

image

and (by Remark 4.2) the following gap-independent regret bound:

image

Proof of Theorem 5.1. Define indicator variable  Ttas  Tt:= �Tt · T⋆t. Since these are indicator variables, we have 1  − Tt ≤(1  − �Tt) + (1  − T⋆t). Therefore, by the definition of  �Ttand of  T⋆t, we

obtain that,

image

Using Assumption 5.1 we obtain

image

On the other hand,

image

The result now follows by summing both sides of Eq. (5.2) and Eq. (5.3). □

The goal of this section is to demonstrate tangible examples of the ROFUL algorithm that may have seemed rather abstract up to this point. First, in Sections 6.1 to 6.4 we show that OFUL and variations of TS are special cases of ROFUL, which helps us to recover known regret bounds for them via our machinery from Sections 3 to 5. Then, in Section 6.5, motivated by our notion of optimism in expectation and its role in the regret of ROFUL, we introduce a new algorithm (sieved greedy) that enjoys similar theoretical guarantees to those of OFUL, but tends to make more greedy decisions, and hence achieves better empirical performance.

6.1. Worst-case analysis of OFUL

As our first example, we study the OFUL algorithm of Abbasi-Yadkori et al. (2011). First, building

on the notation from Eq. (3.1), we define

image

Using Theorem 1 of Abbasi-Yadkori et al. (2011), we realize that

image

where

image

Therefore, we can apply the Cauchy–Schwartz inequality and conclude that, for all  A ∈ At, the following functions satisfy the confidence bounds definition:

image

Moreover, OFUL can be written as an instance of ROFUL as follows:

image

Reasonableness follows from the definition of this worth function and �Ttwill be always equal to 1.

For optimism, note that, whenever  T⋆t = 1, we have

image

We thus get

image

This in turn implies that the optimism in expectation holds with p = 1. Using Corollary 5.1 together with Eq. (3.2) leads to the following gap-dependent bound:

image

and the following gap-independent bound:

image

Note that, if we ignore logarithmic factors, this bound is  O(d√T) since  ρis  O(√d) and D is constant.

6.2. Bayesian analysis of TS

We obtain a Bayesian regret upper bound for TS similar to the one proved by Russo and Van Roy (2014). Let �Θt, ρ, Lt, and Utbe given as in Section 6.1. Unlike in the previous section where Θ⋆ wasfixed, here we assume that Θ⋆ is also drawn from a prior distribution. Define the worth function by

image

where �Θtis a sample drawn from the posterior distribution of Θ⋆at time t that is used in TS. Therefore, �Θt and Θ⋆ are exchangeable, given  Ft−1; this, together with the definition of TS, gives

image

almost surely. This implies optimism in expectation with p = 1. For reasonableness, we can leverage

the same argument and obtain

image

Hence, we obtain the same gap-dependent bound of

image

and the same gap-independent bound of

image

6.3. Worst-case analysis of TS

In this section, we study the worst-case (frequentist) regret of TS with inflated posterior variance. We recover the same bounds as the ones by Agrawal and Goyal (2013) and Abeille et al. (2017). Algorithm 2 shows the pseudocode for this instance of TS. We also make the additional assumption that  |At| ≤ n for all t.

Due to the inflated variance, we need to redefine  Lt(·) and Ut(·). Specifically, let

image

image

and define

image

As  ρ′ ≥ ρ, we infer that  Lt(·) and Ut(·) satisfy the confidence bounds condition (Definition 5.1). We note that this definition replaces the  ρ2 term in K with ρ′2. Next, we prove that the worth function

given by

image

is reasonable. This is achieved by the following lemma, which is proved in Section A.

Lemma 6.1 For all t ∈ [T], we have

image

In order to derive our regret bound, we also need to verify the optimism in expectation assumption.

Whenever��Θt − Θ⋆,A⋆t�≥ −ρ��A⋆t��Σt, we have

image

Since��Θt − �Θt,A⋆t�is distributed as  N�0,ι2��A⋆t��2Σt

image

image

Therefore, for sufficiently large T, we have

image

Therefore, optimism in expectation holds with  p = Φ(−ρ/ι)/2.

Thus, similar to Section 6.2, Corollary 5.1 gives a gap-dependent bound of

image

and a similar gap-independent bound of

image

6.4. Bayesian analysis of TS with finitely many arms

Following the same technique as in the previous example, we can prove a sharper regret bound for TS in the normal prior and normal noise setting. The main idea is to use smaller confidence

bounds. More precisely, write

image

and then define

image

Using the same techniques as in the proof of reasonableness in the previous example, we can show that these functions satisfy the confidence bounds condition and that the worth function defined by Eq. (6.3) is reasonable with respect to these confidence bounds. This yields an O��dT log(T)log(nT)�regret bound that is sharper than the well-known  O�dlog(T)√T�regret bound. The only comparable result that we are aware of is the  O��dTH(A⋆)�bound provided by Russo and Van Roy (2016). Although their bound does not require normality and is sharper than ours, it does not allow changing action sets as does our bound.

image

Algorithm 3 Sieved-Greedy (SG)

Require:  ρ, λ, σ, α.

image

6.5. Toward a better use of data: sieved greedy (SG)

In this section, we present a novel algorithm that enjoys the same regret bound as the one we proved for OFUL. This new policy, nonetheless, tends to make more greedy decisions. As we will see in Section 7, this algorithm achieves a similar cumulative regret to that of the best policy in each scenario.

image

Figure 2 Illustration of how SG works compared to OFUL and greedy.

This algorithm receives a  sieving-rate parameter αas input. Then, at time t, this algorithm first

discards all the actions that lack sufficient uncertainty, i.e., that satisfy

image

Denoting the set of remaining (sieved) actions by  A′t, we note that the algorithm makes a greedy

decision over  A′t, i.e.,

image

Therefore, we call the algorithm sieved greedy (SG). When  α= 1, this algorithm is identical to OFUL and  α= 0 leads to the greedy algorithm. Algorithm 3 shows the pseudocode for SG and Figure 2 is an illustration of how SG works.

We show that this algorithm is also an instance of ROFUL. To do so, we introduce the following worth function:

image

We need to show that the ROFUL algorithm with this worth function chooses the same action as

SG. Notice that

image

Next, the reasonableness of the worth function is evident from the definition of �Mt(·). For the

optimism, note that

image

Therefore, optimism in expectation holds with  p = α2, which means that Corollary 5.1 gives the

gap-dependent bound of

image

and the gap-independent bound of

image

Remark 6.1 (Sieved version of general ROFUL) Here we showed that SG is an instance of

ROFUL. However, as shown in Eq. (6.4), this reduction is general. Specifically, �Atneed not to be

selected greedily. In fact, any action that is selected from the set of sieved actions can be replaced with �At, and the above regret analysis of SG stays valid. This means one can apply the sieving idea to any instance of ROFUL, including TS and OFUL. We expect SG to outperform such “sieved TS” or “sieved OFUL”, at least empirically, because it makes more greedy decisions. But, there could be other circumstances, under which, sieved TS or sieved OFUL may be more preferred. For example, a decision-maker may prefer TS as it is a randomized policy, and in such a scenario, she can use sieved TS, with the same theoretical guarantees as SG, while maintaining a TS-based policy.

In this section, we compare the performance of OFUL, TS, greedy, and SG (with sieving rates 0,2, 0.5, and 0.8) in two scenarios. In each scenario, the unknown parameter vector Θ⋆ is first sampled from  N(0,Id), where d= 120. Then, in round t, a set of n = 10 actions is generated. More precisely:

Scenario I. A random vector  Vtis picked uniformly at random on the sphere of radius 5 in  R12.Then, the action  At,i ∈ R120 for i = 1,2,...,10 is constructed by copying  Vt into the i-th block of size 12. Although d = 120, this scenario is equivalent to a 10-armed 12-dimensional contextual bandit problem with a shared feature vector  Vt, embedded in the linear bandit framework, as explained by

image

Scenario II. Motivated by the more general linear bandit problem, each action is chosen uniformly at random on the sphere of radius 5 in  R120.

Each policy  πchooses an action �Aπt ∈ At and receives the reward  R( �Aπt ) = ⟨Θ⋆, �Aπt ⟩ + εt, where εtis a sequence of i.i.d. standard normal random variables. We run each experiment for T = 10,000 rounds and repeat this procedure 50 times. The average regret of each policy (and error bars of width 2  × SDin each direction) is shown in Figure 3. As is clear from the plots, in Scenario I, TS is the best policy and SG has a very similar performance, while greedy performs very poorly. But in Scenario II, greedy and SG achieve a substantially better performance compared to OFUL and TS. We also see that the performance of SG is generally less dependent on the sieving rate  α.Specifically, in Scenario II, all versions have the same performance as greedy. In Scenario I, while all versions outperform OFUL and greedy,  α = 0.5 slightly outperforms the other two variants and nearly ties with TS.

These results underscore that SG inherits beneficial properties of both greedy and OFUL. It performs similar to greedy when greedy works well, but does not prematurely drop potentially optimal arms, which causes greedy to perform very poorly sometimes.

We can strengthen our regret bounds for ROFUL for two important special cases of the stochastic linear bandit problem. Specifically, in Section 8.1, motivated by the k-armed contextual bandit

image

Figure 3 Comparison of the cumulative regret incurred by SG with sieving rates 0.2, 0.5, and 0.8 to the cumulative

image

problem, we show how our proof technique allows improving all regret bounds of Section 6 by a factor√k. Then, in Section 8.2, making similar (generalized gap and margin) assumptions to

those by Goldenshluger and Zeevi (2013) and Bastani and Bayati (2020), we obtain polylogarithmic

regret bounds for ROFUL and the obtain the first such results for OFUL and TS.

8.1. Grouped linear bandit

Here we focus on improving our previous regret bounds for a family of subproblems. Although these improvements are mainly motivated by the special case of the k-armed contextual bandit, we formulate a slightly more general case of the stochastic linear bandit which we will refer to as the

grouped linear bandit.

Definition 8.1 (Grouped linear bandit) Let k and d be two integers and let (Zj)kj=1 be a

sequence of d-dimensional subspaces of  Rkd such that each vector  v ∈ Rkd can be uniquely decomposed

to  v = �kj=1 zj, where zj ∈ Zj, i.e., Rkd = Z1 ⊕ Z2 ⊕ ··· ⊕ Zk. Then, agrouped linear bandit (GLB) problem simply refers to a linear bandit problem in which  A ⊆ �kj=1 Zj.

As mentioned above, the GLB formulation is meant to capture the specific structure in the contextual setting. In fact, a k-armed d-dimensional contextual bandit problem can be modeled as a kd-dimensional linear bandit one, as discussed by Abbasi-Yadkori (2012). However, the GLB problem also includes the original linear bandit problem if we assume that k = 1. Notice that in the GLB problem we also have Θ⋆ ∈ Rkd, which in turn implies that the number of parameters is kd (rather than d). Therefore, our previous problem-independent regret bounds from Section 6 would be  O(kd√T) for this kd-dimensional problem. As we will see shortly, this bound can be tightened by a factor of√k to O(dlog(T)√kT).

The key observation is that the radius of the confidence set can be shrunk to

image

Note that  ρas defined in Eq. (6.1) for this problem is given by

image

which is worse than  ρas defined in Eq. (8.1) by an asymptotic factor of√kas T grows large. Specifically, we can show that the following functions satisfy the confidence bounds definition:

image

This can be shown by noting that when  A ∈ Zj, then  ⟨�Θt − Θ⋆,A⟩can be bounded by applying Theorem 1 of Abbasi-Yadkori et al. (2011) in a d-dimensional rather than kd-dimensional setting.

Combining this with the union bound, we obtain

image

By the same argument as in Section 6.1, we get

image

and by tuning  δas before we get

image

Moreover, in the Bayesian setting, one can use these confidence bounds to prove a similar Bayesian regret bound for TS with the proper update rule. In the frequentist setting, on the other hand, this

idea can be used to show that

image

and by tuning  δas before,

image

where

image

This shows that the posterior variance inflation can be reduced by a factor of  O(√k) in TS as T

grows large.

8.2. Polylogarithmic regret bounds

In this subsection we provide regret bounds for ROFUL when confidence sets are defined such that they grow with T polylogarithmically, under additional assumptions. Our assumptions are similar

to those made by Goldenshluger and Zeevi (2013) and Bastani and Bayati (2020). Throughout this

section we consider special classes of ROFUL where confidence intervals (Definition 5.1) are defined as in Eq. (6.2) with varying definitions of  ρ. This special class includes all examples of Section 6 such as OFUL, TS, and SG algorithms.

To state the first assumption, recall the gap parameters  ∆t and δ from §2.

Assumption 8.1 (Margin condition)  There exists constants c0,t0 > 0 such that

image

for all 0  ≤ z ≤ δ and t ∈ [T] with t ≥ t0.

Before stating the next condition, we need to define a notion of near-optimal space for the family of GLB problems.

Definition 8.2 (Near-optimal space) Consider a GLB problem as defined in Definition 8.1.

Let  l ≤ kbe the smallest number such that there exists  I ⊆ [k] with |I| = l and

image

We define near-optimal space  W as W := ⊕j∈IZjand, with a slight abuse of notation, we also treat W as the projection of  Rkd onto the subspace W.

Remark 8.1 The main purpose of this notion is to handle suboptimal arms in the special case of a k-armed contextual bandit. One might harmlessly assume that  W = Rkd, or equivalently, assuming it is the identity function if viewed as an operator, and follow the rest of this section.

The next assumption demands the selected actions to be diverse in the near-optimal space. Specifi-cally, recall the inverse covariance matrix  Σtof the actions chosen by a policy from Eq. (3.1).

Assumption 8.2 (Linear expansion) We say that linear expansion holds for a policy if

image

for some constants  c1,c2 >0 and all  t ∈ [T] with  t ≥ t0. We denote the indicator variable for the event  ∥W⊤ΣtW∥op < c2/t by Lt.

We will show in Lemma 8.2 that ROFUL satisfies the linear expansion assumption, under a variant of the optimism-in-expectation assumption from Section 5 as well as a certain diversity assumption. This fact, combined with the following lemma, leads to our main result of this section which is presented as Corollary 8.1. The next lemma operates on the same setting as Theorem 4.1 with additional assumptions on the reasonableness of the worth functions, the margin condition, and the linear expansions.

Lemma 8.1 Consider an uncertainty structure  {Vt}t≥1with uncertainty complexity K, gap level δ, associated parameter  qδ, as well as gain-rate parameters  Gδand  Dδ. Also, assume that worth functions of Algorithm 1 (policy  πROFUL) are reasonable (Definition 5.3), and that the margin condition (Assumption 8.1) and the linear expansion condition (Assumption 8.2) hold. Then, the cumulative regret of  πROFUL satisfies the following inequality:

image

The proof of Lemma 8.1 is given in Section A.3.

Linear expansion and ROFUL. In what follows, we will show that under a certain diversity condition (Assumption 8.3) and a generalization of the optimism assumption in Section 5, the ROFUL algorithm satisfies linear expansion.

Assumption 8.3 (Diversity condition) We say that a GLB problem satisfies the diversity con-

dition with parameter  υ if Atis independent of  σ(A1, �A1,R( �A1),...,At−1, �At−1,R( �At−1)) and

image

Assumption 8.3 is similar to Assumption A3 of Goldenshluger and Zeevi (2013) and Assumption 4

image

Definition 8.3 (Optimism in probability)  We say that the worth function �Mt(·) isoptimistic

in probability if for some  ω and p in (0,1] we have

image

almost surely.

Remark 8.2 A slightly stronger version of Eq. (8.4) is that

image

almost surely. It is worth noting that this stronger condition also implies optimism in expectation (Definition 5.4). First, note that  ω�Mt(A⋆t) − Bt�· T⋆tis a deterministic function of ⋆,Ft−1).

Therefore, we have

image

This proves optimism in expectation since,

image

It is worthwhile to mention that in the worst-case analysis of an algorithm, Θ⋆is a deterministic constant and, therefore, Eq. (8.4) and Eq. (8.5) are equivalent. Nevertheless, the stronger condition Eq. (8.5) need not hold when Θ⋆ is drawn from a prior distribution. An example of this situation is the Bayesian analysis of TS in which Eq. (5.1) and Eq. (8.4) hold simultaneously, although Eq. (8.5) fails to hold.

Now, we are ready to state our result that ROFUL satisfies the linear expansion assumption if its worth functions are optimistic in probability.

Lemma 8.2 (ROFUL satisfies linear expansion) If the diversity condition (Assumption 8.3) holds and the worth functions of ROFUL are optimistic in probability (Definition 8.3), then ROFUL

satisfies the linear expansion condition (Assumption 8.2) with

image

where

image

The proof of Lemma 8.2 is given in Section A.4.

Next, we state the main result of this section, which directly follows from Lemma 8.1 and Lemma 8.2.

Corollary 8.1 Consider a GLB problem that satisfies the margin condition (Assumption 8.1) and the diversity condition (Assumption 8.3). Also, assume that the worth functions of Algorithm 1 (denoted by policy  πROFUL) are reasonable (Definition 5.3) and optimistic in probability (Definition 8.3). Then the cumulative regret of  πROFUL satisfies the following inequality:

image

where constants  c1, c2, and t0are defined as in Lemma 8.2.

Remark 8.3 Note that in terms of dependence in T, by Corollary 8.1 we prove a regret bound that is  O(log2(T))under similar conditions as the ones by Goldenshluger and Zeevi (2013) Bastani and Bayati (2020). Since OFUL and TS are special cases of ROFUL, this immediately provides an O(log2(T))regret bound for OFUL and TS as well. To the best of our knowledge, these results are new.

Remark 8.4 Corollary 8.1 also holds when a more general  τ-margin condition for  τ >0, as by

Goldenshluger and Zeevi (2009) Bastani et al. (2017) that replaces Eq. (8.2) with

image

is satisfied. In this case, the term 16a2ρ2c2 log(T) in the regret bound would be replaced by a term

of order

image

through the same proof technique.

This work was supported by the Stanford Data Science Initiative, and by National Science Foundation CAREER award CMMI: 1554140.

Abbasi-Yadkori, Yasin. 2012. Online Learning for Linearly Parametrized Control Problems. PhD. Thesis.

Abbasi-Yadkori, Yasin, D´avid P´al, Csaba Szepesv´ari. 2011. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems. 2312–2320.

Abe, Naoki, Philip M. Long. 1999. Associative reinforcement learning using linear probabilistic concepts. Proceedings of the Sixteenth International Conference on Machine Learning. ICML ’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3–11.

Abeille, Marc, Alessandro Lazaric, et al. 2017. Linear thompson sampling revisited. Electronic Journal of Statistics 11(2) 5165–5197.

Agrawal, Shipra, Navin Goyal. 2013. Thompson sampling for contextual bandits with linear payoffs. ICML (3). 127–135.

Auer, Peter. 2003. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 397–422.

Bastani, Hamsa, Mohsen Bayati. 2020. Online decision making with high-dimensional covariates. Operations Research 68(1) 276–294. doi:10.1287/opre.2019.1902.

Bastani, Hamsa, Mohsen Bayati, Khashayar Khosravi. 2017. Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011 .

Bayati, Mohsen, Nima Hamidi, Ramesh Johari, Khashayar Khosravi. 2020. The unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. Advances in Neural Information Processing Systems 33.

Dani, Varsha, Thomas P. Hayes, Sham M. Kakade. 2008. Stochastic linear optimization under bandit feedback. COLT.

Dong, Shi, Benjamin Van Roy. 2018. An information-theoretic analysis for thompson sampling with many actions. Advances in Neural Information Processing Systems. 4157–4165.

Goldenshluger, Alexander, Assaf Zeevi. 2009. Woodroofe’s one-armed bandit problem revisited. Ann. Appl. Probab. 19(4) 1603–1633. doi:10.1214/08-AAP589. URL https://doi.org/10.1214/08-AAP589.

Goldenshluger, Alexander, Assaf Zeevi. 2013. A linear response bandit problem. Stochastic Systems 3(1) 230–261.

Hao, Botao, Tor Lattimore, Csaba Szepesvari. 2019. Adaptive exploration in linear contextual bandit. arXiv preprint arXiv:1910.06996 .

Johari, Ramesh, Pete Koomen, Leonid Pekelis, David Walsh. 2017. Peeking at a/b tests: Why it matters, and what to do about it. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, 1517–1525. doi:10.1145/3097983.3097992.

Kannan, Sampath, Jamie H Morgenstern, Aaron Roth, Bo Waggoner, Zhiwei Steven Wu. 2018. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. Advances in Neural Information Processing Systems. 2227–2236.

Kirschner, Johannes, Andreas Krause. 2018. Information directed sampling and bandits with heteroscedastic noise. Proc. International Conference on Learning Theory (COLT).

Lai, Tze Leung, Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1) 4–22.

Lattimore, Tor. 2015. The pareto regret frontier for bandits. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett, eds., Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 208– 216. URL http://papers.nips.cc/paper/6032-the-pareto-regret-frontier-for-bandits.pdf.

Laurent, Beatrice, Pascal Massart. 2000. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 1302–1338.

Raghavan, Manish, Aleksandrs Slivkins, Jennifer Wortman Vaughan, Zhiwei Steven Wu. 2018. The externalities of exploration and how data diversity helps exploitation. arXiv preprint arXiv:1806.00543 .

Rusmevichientong, Paat, John N Tsitsiklis. 2010. Linearly parameterized bandits. Mathematics of Operations Research 35(2) 395–411.

Russo, Daniel, Benjamin Van Roy. 2014. Learning to optimize via posterior sampling. Mathematics of Operations Research 39(4) 1221–1243. doi:10.1287/moor.2014.0650.

Russo, Daniel, Benjamin Van Roy. 2016. An information-theoretic analysis of thompson sampling. The Journal of Machine Learning Research 17(1) 2442–2471.

Scott, Steven L. 2010. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 26(6) 639–658.

Scott, Steven L. 2015. Multi-armed bandit experiments in the online service economy. Appl. Stoch. Model. Bus. Ind. 31(1) 37–45. doi:10.1002/asmb.2104.

Srinivas, Niranjan, Andreas Krause, Sham Kakade, Matthias Seeger. 2010. Gaussian process optimization in the bandit setting: No regret and experimental design. ICML’10, Omnipress, Madison, WI, USA, 1015–1022.

Thompson, William R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4) 285–294.

Tropp, Joel A. 2012. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12(4) 389–434.

Appendix A: Additional Proofs

A.1. Proof of Eq. (3.4)

Proof. Noting that

image

we get, by Lemma 3 of Russo and Van Roy (2016), that for all  i ∈ [k],

image

This in turn implies that

image

For any policy  π ∈ P, we have

image

Using the assumption that �Atis independent of  A⋆ conditional on  Ft−1, we can write

image

Therefore, by summing up both sides of the above inequalities, we get

image

image

A.2. Proof of Lemma 6.1

First, we state the following lemma.

Lemma A.1 If X ∼ χ2d, then for all positive constants  γ, we have P(X ≥ 2d + 3γ) ≤ exp(−γ).

Proof. The proof follows directly from applying Lemma 1 of Laurent and Massart (2000) which

gives

image

Proof of Lemma 6.1. First, assume that 6log 2nT ≥ 2d+ 12log T. Since  ι−1Σ− 12t (�Θt − �Θt) ∼

N(0,Id), it follows from Lemma A.1 (with  γ = 4log T) that

image

Therefore, combining this with log 2nT ≥ 2d + 12log T, we have

image

In the finite action set case, we provide a different bound using the union bound. For each  A ∈ At,

note that  ⟨�Θt − �Θt,A⟩ ∼ N(0,ι2∥A∥2Σt). Hence, we have,

image

where in the last step we used the fact that Φ(−x) ≤ exp(−x2/2)/(x√2π) for all positive  x. □

A.3. Proof of Lemma 8.1

Proof. The main idea is to refine the proof of Theorem 4.1. We first recall Eq. (4.5):

image

where  Btdenotes  I(Mt(A⋆t) − Mt( �At) ≥ δ). We next improve the upper bound for each individual

term in the above sum. For  t ≤ t0, where  t0is defined as in Assumption 8.2, we use our previous

bound

image

Next, we consider  t > t0. Whenever  Bt= 0, we have  Mt( �At) > Mt(A⋆t) − δ, which in turn implies

that  A⋆t, �At ∈ W. By recalling the indicator variable  Ttdefined in the proof of Theorem 5.1, and

provided that (1  − Bt)LtTt= 1, we have

image

In the above, (a) holds since  T⋆t = 1, (b) follows from  �Tt = 1, (c) uses the fact that ROFUL chooses

the action with maximum worth �Mt(·), and (d) is a consequence of  Lt= 1. Now, using this inequality,

we can write

image

where the last step uses Assumption 8.1. This inequality in combination with Eq. (A.1) yields

image

which is the desired result. □

A.4. Proof of Lemma 8.2

Proof. For any t ∈ [T], let Otbe a Bernoulli random variable with

image

almost surely such that

image

The existence of this random variable is guaranteed by the optimism-in-probability assumption.

Next, for  t ∈ [T], we have

image

Moreover, it follows from the definition of  Bi and Gi that

image

for all  i ∈ [T]. Therefore, we get

image

By recalling Υi = A⋆i A⋆i⊤ · Gi, we have

image

We now bound each term separately. Next, we prove that the smallest singular value of �ti= t2 Υi ·Oi

grows linearly with high probability. Using the noncommutative Bernstein’s inequality for  s ≥0,

(e.g., Theorem 1.4 of Tropp (2012)), we get

image

Setting  s := tpυ/2 and applying the triangle inequality yields

image

Our next goal is to prove an upper bound for the largest singular value of �ti= t2 Υi · BiOi. We apply

the following bound:

image

Using Eq. (A.2), we can deduce that, whenever  Tt = 1 and BiOi= 1, we have

image

Therefore, we can write

image

Hence, it is a direct consequence of Eq. (A.3) that, for any  t ≥ t0 ≥ 16a2ρ2kdpυω2δ2 · log�λ + T a2d �, we get

image

We prove that for sufficiently large t, the right-hand side of the above inequality is bounded above

by 3/t2. This is equivalent to

image

Using Lemma A.2 below, we infer that this is satisfied for all  t ≥ t0. □

Lemma A.2 Let a ≥ 3be given. Then, for all  t ≥ 3alog(a), we have tlog(t) ≥ a.

Proof. First, note that  f : t �→ tlog(t)is an increasing function of t for all  t ≥ e. To see this, we compute the derivative of f as follows:

image

Next, setting  t0 = 3alog(a), we have

image


Designed for Accessibility and to further Open Science