b

DiscoverSearch
About
My stuff
Best Arm Identification for Cascading Bandits in the Fixed Confidence Setting
2020·arXiv
Abstract
Abstract

We design and analyze CASCADEBAI, an algorithm for finding the best set of K items, also called an arm, within the framework of cascading bandits. An upper bound on the time complexity of CASCADEBAI is derived by overcoming a crucial analytical challenge, namely, that of probabilistically estimating the amount of available feedback at each step. To do so, we define a new class of random variables (r.v.’s) which we term as left-sided sub-Gaussian r.v.’s; these are r.v.’s whose cumulant generating functions (CGFs) can be bounded by a quadratic only for non-positive arguments of the CGFs. This enables the application of a sufficiently tight Bernstein-type concentration inequality. We show, through the derivation of a lower bound on the time complexity, that the performance of CASCADEBAI is optimal in some practical regimes. Finally, extensive numerical simulations corroborate the efficacy of CAS- CADEBAI as well as the tightness of our upper bound on its time complexity.

Online recommender systems seek to recommend a small list of items (such as movies or hotels) to users based on a larger ground set [L] := {1, . . . , L} of items. In this paper, we consider the cascading bandits model (Craswell et al., 2008; Kveton et al., 2015a), which is widely used in information retrieval and online advertising. Upon seeing the chosen list, the user looks at the items sequentially. She clicks on an item if she is attracted by it and skips to the next one otherwise. This process stops when she clicks on one item in the list or if no item is clicked, it is deemed that she is not attracted by any of the items. The items that are in the ground set but not in the chosen list and those in the list that come after the attractive one are unobserved.

Each item  i ∈ [L], with a certain click probability  w(i) ∈[0, 1] which is unknown to the learning agent, attracts the user independently of other items. Under this assumption, the optimal solution is the list of items with largest w(i)’s. Based on the chosen lists and obtained feedback in previous steps, the agent tries to learn the click probabilities (explore the combinatorial space) in order to find the optimal list with high probability in as few time steps as possible.

Main Contributions. Given  δ > 0, a learning agent aims to find a list of optimal items of size K with probability at least  1 − δin minimal time steps. To achieve a greater generality, we provide results for identifying a list of nearoptimal items (Even-Dar et al., 2002; Mannor & Tsitsiklis, 2004; Kalyanakrishnan et al., 2012), where the notion of near-optimality is precisely defined in Section 2. First, we design CASCADEBAI(ϵ, δ, K) and derive an upper bound on its time complexity. Second, we establish a lower bound on the time complexity of any best arm identification (BAI) algorithm in cascading bandits, which implies that the performance of CASCADEBAI(ϵ, δ, K) is optimal in some regimes. Finally, our extensive numerical results corroborate the efficacy of CASCADEBAI(ϵ, δ, K) and the tightness of our upper bound on its time complexity.

Different from combinatorial semi-bandit settings, the amount of feedback in cascading bandits is, in general, random. The analysis of cascading bandits involves the unique challenge in adapting to the variation of the amount of feedback across time. To this end, we define a random variable (r.v.) that describes the feedback from the user at a step and bound its expectation. We define a novel class of r.v.’s, known as left-sided sub-Gaussian (LSG) r.v.’s, and apply a concentration inequality to quantify the variation of the amount of feedback.

Bernstein-type concentration inequalities are applied in many stochastic bandit problems and indicate that subGaussian (SG) distributions possess light tails (Audibert & Bubeck, 2010). Since it turns out that we only need to analyze a one-sided tail in this work, it suffices to consider a one-sided SG condition, which motivates the definition of LSG. We also provide a general estimate of a certain corresponding parameter in Theorem 5.4, which is crucial for the utilization of the inequality. This may be of independent interest. Summary and future work are deferred to Appendix 7.

Literature review. In a stochastic combinatorial bandit (SCB) model, an arm corresponds to a list of items in the ground set, and each item is associated with an r.v. at each time step. The corresponding reward depends on the constituent items’ realizations. We first review the related works on the BAI problem, in which a learning agent aims to identify an optimal arm, i.e., a list of optimal items. (i) Given δ > 0, a learning agent aims to identify an optimal arm with probability  1−δin minimal time steps (Jamieson & Nowak, 2014; Kalyanakrishnan et al., 2012). (ii) Given B > 0, an agent aims to maximize the probability of identifying an optimal arm in B steps (Auer et al., 2002; Audibert & Bubeck, 2010; Carpentier & Locatelli, 2016). These two settings are known as the fixed-confidence and fixed-budget setting respectively. Under the fixed-confidence setting, the early works aim to identify only one optimal item (Audibert & Bubeck, 2010) and the later ones aim to find an optimal arm (Chen et al., 2014; Rejwan & Mansour, 2019). Besides, Mannor & Tsitsiklis (2004); Kaufmann et al. (2016); Agar- wal et al. (2017) provide problem-dependent lower bounds on the time complexity when Kalyanakrishnan et al. (2012) establishes a problem-independent one. All these existing works are under the semi-bandit feedback setting, where an agent observes realizations of all pulled items.

Secondly, we review the relevant works on the regret minimization (RM) problem, in which an agent aims to maximize his overall reward, or equivalently to minimize the so-called cumulative regret. Under the semi-bandit feedback setting, this problem has been extensively studied by Lai & Robbins (1985); Anantharam et al. (1987); Kveton et al. (2014); Li et al. (2010); Qin et al. (2014). Moreover, motivated by numerous applications in clinical analysis and online advertisement, some researchers consider SCB models with partial feedback, where an agent observes realizations of only a portion of pulled items. One prime model that incorporates the partial feedback is cascading bandits (Craswell et al., 2008; Kveton et al., 2015a). Recently, Kveton et al. (2015b); Li et al. (2016); Zong et al. (2016); Wang & Chen (2017); Cheung et al. (2019) studied this model and derived various regret bounds.

When the RM problem is studied with both semi-bandit and partial feedback, the BAI problem has only been studied in the semi-bandit feedback setting thus far. Despite existing works, analysis of the BAI problem in the more challenging case of partial feedback is yet to be done. Our work fills in this gap in the literature by studying the fixed-confidence setting in cascading bandits, and our analysis provides tools for handling the statistical dependence between the amount of feedback and that of time steps in the cascading bandit setting.

For brevity, we denote the set {1, . . . , n} by [n] for any n ∈ N, and the set of all m-permutations of [n], i.e., all ordered m-subsets of [n], by  [n](m)for any  m ≤ n. Let there be  L ∈ Nground items, contained in [L]. Each item i ∈ [L]is associated with a weight  w(i) ∈ [0, 1], signifying the item’s click probability. We define an arm as a list of K ≤ Litems in  [L](K). At each time step t, the agent pulls an arm  St := (it1, . . . , itK) ∈ [L](K). Then the user examines the items from  it1to  itKone at a time, until one item is clicked or all items are examined. For each item  i ∈[L], Wt(i) ∼ Bern(w(i))are i.i.d. across time. The agent observes  Wt(i) = 1iff the user clicks on i. The feedback Otfrom the user is defined as a vector in  {0, 1, ⋆}K, where0, 1, ⋆represents observing no click, observing a click and no observation respectively. For example, if K = 4 and the user clicks on the third item at time step 2, we have O2 = {0, 0, 1, ⋆}. Clearly, there is a one-to-one mapping from  Otto the integer

image

where we assume  min ∅ = ∞. If ˜kt < ∞(i.e.,  Otis not the all-zero vector), the agent observes  Wt(itk) = 0for 1 ≤ k < ˜kt, and also observes  Wt(it˜kt) = 1, but does not observe  Wt(itk)for  k > ˜kt. Otherwise, we have  ˜kt =∞(i.e.,  Otis the all-zero vector), then the agent observes Wt(itk) = 0for  1 ≤ k ≤ K. We denote  ¯w(i) = 1 − w(i), w = (w(1), . . . , w(L)), and the probability law (resp. the expectation) of the process  ({Wt(i)}i,t) by Pw (resp. Ew).

Without loss of generality, we assume  w∗ := w(1) ≥w(2) ≥ . . . ≥ w(L) := w′. We say item i is optimal if  w(i) ≥ w(K). We assume w(K) > w(K +1) to ensure there are exactly K optimal items. Next, we say item i is  ϵ-optimal (ϵ ≥ 0) if  w(i) ≥ w(K) − ϵand set K′ϵ := max{i ∈ [L] : w(i) ≥ w(K) − ϵ}. Then [K′ϵ] is theset of all  ϵ-optimal items,  [K](K)is the set of all optimal arms  S∗(up to permutation), and  [K′ϵ](K)is the set of all ϵ-optimal arms.

To identify an  ϵ-optimal arm, an agent uses an algorithm πthat decides which arms to pull, when to stop pulling, and which arm ˆSπto choose eventually. A deterministic and non-anticipatory online algorithm consists in a triple π := ((πt)t, T π, φπ) in which:

the sampling rule  πtdetermines, based on the observation history, the arm  Sπt to pull at time step t; in other words,  Sπtis  Ft−1-measurable, with  Ft := σ(Sπ1 , Oπ1 , . . . , Sπt , Oπt );the stopping rule determines the termination of the algo-

rithm, which leads to a stopping time  T πwith respect to (Ft)t∈Nsatisfying  P(T π < +∞) = 1;

the recommendation rule  φπ chooses an arm ˆSπ, which isFT π-measurable.

We define the time complexity of  π as T π. Under the fixed-confidence setting, a risk parameter (failure probability)  δ ∈(0, 1) is fixed. We say an algorithm  π is (ϵ, δ, K)-PAC (probably approximately correct) if  Pw( ˆSπ ⊂[K′ϵ]) ≥ 1−δ. Thegoal is to obtain an  (ϵ, δ, K)-PAC algorithm  πsuch that EwT πis small and  T πis small with high probability. We also define the optimal expected time complexity over all (ϵ, δ, K)-PAC algorithms as

image

This measures the hardness of the problem. We abbreviate (0, δ, K)-PAC as (δ, K)-PAC, Ew as E, Pw as P, K′ϵ as K′,T π as T , T∗(w, ϵ, δ, K) as T∗ when there is no ambiguity.

image

Our algorithm CASCADEBAI(ϵ, δ, K) is presented in Algorithm 1. Intuitively, to identify an  ϵ-optimal arm, an agent needs to learn the true weights w(i) of a number of items in [L] by exploring the combinatorial arm space.

At each step t, we classify an item as surviving, accepted or rejected. Initially, all items are surviving and belong to the survival set  Dt. Over time, an item may be eliminated from  Dt, in which case we say that it is identified. Once an item is identified, it can be moved to either the accept set  Atif it is deemed to be  ϵ-optimal, or the reject set  Rtotherwise. (i) At step 1, all items are in  D1. (ii) At each step t, the agent selects  min{K, |Dt|}surviving items with the least number of previous observations,  Tt(i)’s, pulls them in ascending order of the  Tt(i), and gets cascading feedback from the user in the form of the ˜kt’s. Similarly to a Racing algorithm (Even-Dar et al., 2002; Maron & Moore, 1994; Heidrich-Meisner & Igel, 2009; Jun et al., 2016), this design of  Stincreases the  Tt(i)’s of all surviving items almost uniformly and avoids the wastage of time steps. (iii) Next, we maintain upper and lower confidence bounds (UCB, LCB) across time to facilitate the identification of items as in Lines 13–17. The confidence radius is defined as follows:

image

We set  Ct(i, δ) = +∞when  Tt(i) = 0. (iv) Lastly, the algorithm stops once  Dt = ∅, |At| ≥ K or |Rt| ≥ L − K.

We develop an upper bound on the time complexity of CASCADEBAI(ϵ, δ, K) and a lower bound on the expected time complexity of any  (δ, K)-PAC algorithm. We also discuss the gap between the bounds. We use  c1, c2, . . .to denote finite and positive universal constants whose values may vary from line to line. The proofs are sketched in Section 5 and more details are provided in Appendix D.

4.1. Upper bound

The gaps between the click probabilities determine the hardness to identify the items. The gaps are defined as

image

Here ¯∆iis a slight variation of  ∆ithat takes into account the  ϵ-optimality of items. Moreover, to correctly identify item i with probability at least  1 − δ/2, our algorithm needs to observe it at least

image

image

times. Similarly to existing works (Even-Dar et al., 2002; Mannor & Tsitsiklis, 2004), we derive the upper bound with ¯∆i’s and ¯Ti,δ’s. A larger  ¯∆ileads to a smaller  ¯Ti,δ, implyingthat it requires fewer observations to identify item i correctly. The permutation  σdefines the ordering of ¯∆: ¯∆σ(1) ≥. . . ≥ ¯∆σ(L). At step t, we set ˆktas the number of surviving items in  St, and Xˆkt;tas the number of observations of them. Note that ˆktis an r.v. We lower bound  EXk;t with

image

µ(k, w):=

image

and upper bound  EX2k;twith  v(k, w) := min{k,√2/w′}. We abbreviate ¯Ti,δas ¯Ti, ρ(δ)as  ρ, µ(k, w)as  µk, v(k, w)as  vkwhen there is no ambiguity. In anticipation of Theorem 4.1, we define three more notations:

image

K2 := max{K′ − K, 1}, Mk := K + 1 − kµK+1−k − K − kµK−k.

Theorem 4.1. Assume  K′ < 2K − 1. With probability at least  1 − δ, Algorithm 1 outputs an  ϵ-optimal arm after at most  (c1N1 + c2N2 + c3N3)steps, where

image

When  ϵ = 0, ¯∆i = ∆ifor all  i ∈ [L]and  K′ = K. We note that it is a waste to pull identified items. This occurs only when  K′ < 2K − 1(see Lemma 5.9) and this scenario is more complicated to analyze. The scenario  K′ ≥ 2K − 1is relatively easier to analyze and the result is deferred to Proposition C.1 (see Appendix C).

Interpretation of the bound. The first term  N1in the bound is unique to the cascading model, which results from the gap between  Xˆkt;tand  EXˆkt;t. We can bound  N1in terms of the maximum and minimum weights,  w∗ and w′.

Proposition 4.2. Assume  0 < w′ < w∗ ≤ 1. We have

image

Next, recall that we say that an item is identified by time t if it is put into  Aτor  Rτfor some  τ ≤ t. In the worstcase scenario, the agent identifies items in descending order of ¯∆i’s. With probability at least  1 − δ, it costs at most  c2N2steps to identify items  σ(1), . . . , σ(L − K) andc3N3is for identifying the remaining ones. More precisely, after item  σ(L−K −k−1)is identified, the number of steps required for identifying item  σ(L−K −k)is (c3/µK−k+1)·(K − k + 1)[ ¯Tσ(L−K+k) − ¯Tσ(L−K+k−1)];we sum these steps up to obtain (4.1). Since the results in many existing works (Even-Dar et al., 2002; Jun et al., 2016) mainly involve ¯Ti’s, we show the dependence of  N3 on ¯Ti’smore concretely in (4.2).

Technique. The crucial analytical challenge to derive our bound, especially to establish  µk, vk, N1, is to quantify the impact of partial feedback that results from the cascading model. Firstly, we bound  EXˆkt;tby exploiting some properties of the cascading feedback. Next, to bound the gap between �nt=1 Xˆkt;tand  �nt=1 EXˆkt;tfor some  n ∈ N, we propose a novel class of r.v.’s, known as LSG r.v.’s, provide an estimate of a certain LSG parameter, and utilize a Berstein-type concentration inequality to bound the tail probability of a certain LSG r.v.. Details are in Section 5.1.

To facilitate the remaining discussion in Section 4.1, we specialize our analysis and results henceforth to the case of ϵ=0, in which ¯∆i =∆iand the agent aims to find  S∗. Theremaining results in Section 4.1 can be directly generalized to the scenario of  ϵ>0by replacing  ∆i’s with ¯∆i’s.

Comparison to the semi-bandit problem. A related algorithm in the setting of semi-bandit feedback and  ϵ = 0is the BATCHRACING Algorithm, which was proposed by Jun et al. (2016). This algorithm has three paramters k, r and b which respectively represent the number of optimal items, the maximum number of pulls of one item at one step and the size of a pulled arm. When r = 1 and b = k, we denote it as BATRAC(k). The fact that our algorithm observes between 1 and K items per step motivates a comparison among CASCADEBAI(0, δ, K), BATRAC(K) and BATRAC(1).

Corollary 4.3. (i) If all w(i)’s are at most 1/K, with probability at least  1 − δ, Algorithm 1 outputs  S∗ after at most

image

image

steps; (ii) if all w(i)’s are at least 1/2, with probability at least  1 − δ, Algorithm 1 outputs  S∗ after at most

image

steps.

The results of Corollary 4.3 are intuitive: (i) if all w(i)’s are close to 0 (i.e., at most 1/K), the bound on the time complexity of CASCADEBAI(0, δ, K) is of the same order as that of BATRAC(K); (ii) if all w(i)’s are close to 1 (i.e., at least 1/2), the bound corresponds with that of BATRAC(1) (Jun et al., 2016). We further upper bound the expected time complexity of our algorithm (denoted by  π1) in these cases.

Proposition 4.4. (i) If all w(i)’s are at most 1/K,

image

(ii) if all w(i)’s are at least 1/2,

image

According to the definition of  T∗ in Section 2, T∗ ≤ ET π1and hence also satisfies the above bounds. Corollary 4.3 and Proposition 4.4 indicate that the high probability upper bound on  T π1 and the upper bound on  ET π1 are of the same order in the sense that (i) if all w(i)’s are at most 1/K, both upper bounds are ˜O�(1/K)·�L−Ki=1 ∆−2σ(i) +∆−2σ(L−1)�; (ii)if all w(i)’s are at least  1/2, both are ˜O� �L−1i=1 ∆−2σ(i)�.

Specialization to the case of two click probabilities. We consider a simplified scenario with the following assumption; this allows us to present the upper bound on the time complexity with greater clarity.

Assumption 4.5. With  0 < w′ < w∗ ≤ 1, the K optimal and  L − Ksuboptimal items have click probabilities  w∗and  w′ respectively.

Proposition 4.6. Under Assumption 4.5, (i) if  0 < w∗ ≤1/K, with probability at least  1 − δ, Algorithm 1 outputs S∗ after at most

image

steps; (ii) if  1/K < w∗ ≤ 1, with probability at least  1 − δ,

Algorithm 1 outputs  S∗ after at most

image

steps.

In the second case, if  L ≥ w∗(w∗ − w′)2/w′2, the first term dominates the bound. For instance,  w′ ≥ 1/√Lsatisfies this condition.

Proposition 4.7. Under Assumption 4.5, (i) if  0<w∗ ≤1/K,

image

(ii) if  w′ ≥ 1/2 or w∗/w′ ≤ 2,

image

Proposition 4.7 also upper bounds  T∗ since T∗ ≤ ET π1. It,together with Proposition 4.6 implies that the high probability bound on  T π1and the bound on  ET π1are of the same order in these cases.

4.2. Lower bound

We set  ϵ = 0, in which scenario the agent aims to find an optimal arm  S∗. We also upper bound the expected number of observations of items per step by  ˜µ(K, w) where

image

We write ˜µ(k, w) = ˜µk, d(m, n) = KL(w(m), w(n)), where KL(p, q) = p log pq + (1 − p) log 1−p1−q .

Theorem 4.8. We have

image

Comparison to the semi-bandit problem. First, if  w′is close to 1 (i.e.,  w′ ≥ 1/2), ˜µK = 1/w′ ≤ 2, i.e., at one step, the agent observes at most 2 items in expectation. We can recover the lower bound in Kaufmann et al. (2016) by replacing  ˜µKwith 1, which is of the same order as our bound in this case. Next, if  w′ is close to  0 (i.e., w′ ≤ 1/K),the agent observes  ˜µK = Kitems in expectation. Then the bound is the same as that incurred by pulling K items per step and getting semi-bandit feedback, which is

image

Specialization to the case of two click probabilities.

Corollary 4.9. Under Assumption 4.5, we have

image

4.3. Comparison of the upper and lower bounds

To see whether the upper and lower bounds on  T∗match, we set  ϵ = 0and consider the following simplified cases.

Corollary 4.10. Set  ϵ = 0. (i) If 0 < w∗ ≤ 1/K,

image

(ii) if  1 > w′ ≥ 1/2,

image

The upper bounds above are achieved by Algorithm 1.

In the first case, the gap between the upper and lower bounds is manifested in the terms  1/K and w′2. In the second case, the gap is manifested in  w∗ and w′(1 − w∗).

5.1. Analysis of partial feedback for cascading bandits

At a high level, the time complexity T can be established by analyzing �Tt=1 Xˆkt;tand  Xˆkt;t. The first term is de- termined by ¯Ti,δ’s, the number of observations that guarantees the correct identification of items with high probability. These ¯Ti,δ’s are invariant to the scenario whether the agent receives semi-bandit or partial feedback from the user. The second term  Xˆkt;t equals to ˆktin the semi-bandit feedback setting while it is an r.v. in the partial feedback setting. Since ¯Ti,δ’s have already been studied by a number of works on the semi-bandit feedback (Even-Dar et al., 2002; Jun et al., 2016), the crucial challenge of analyzing cascading bandits is to estimate  Xˆkt;tprobabilitistically.

According to Algorithm 1, ˆkt = min{K, |Dt|}. When ˆkt =K ≤ |Dt|, the agent pulls K surviving (i.e., not identified) items. Otherwise, the agent pulls all surviving items first and then complements  Stwith some identified items. In the cascading bandit setting, the agent observes only one item when the first item  it1 is clicked, and the corresponding probability is  w(it1); the agent observes two items when it1is not clicked but  it2is clicked, and the probability is [1 − w(it1)]w(it2); and so on. Therefore,

image

Since  EXˆkt;tdepends only on  St(the pulled arm at step t) and  Stis learnt online, it is difficult to estimate  EXˆkt;tfor each step separately. Therefore, the second best thing one can do is to bound  EXˆkt;tas a function of ˆktand w. We now present some properties of  EXˆkt;t.

Theorem 5.1. Consider a set of items with weights u = (u1, . . . , uk)such that  u1 ≥ . . . ≥ uk, and let  µk(u, I)be the expected number of observations when items are placed with order I. Let  Idec = (1, . . . , k), Iinc = (k, . . . , 1), and I be any order, then

(i) boundedness:  µk(u, Idec)≤ µk(u, I) ≤ µk(u, Iinc);

(ii) monotonicity: let  v=(v1, . . . , vk)be another vector of weights, then  µk(u, I)≥µk(v, I) if ui ≤vi for all i∈[k].

Theorem 5.1 implies that when w is fixed,  EXk;tattains its minimum when the agent pulls items 1, 2, . . . , k in this order and attains its maximum when the agent pulls  L, L −1, . . . , L − K + 1in this order. Moreover, if  w(i) = w∗for all  i ∈ [k], EXk;tis even smaller; if  w(j) = w′for all  j ∈ {L − k + 1, . . . , L}, EXk;tis even larger. This observation inspires Lemma 5.2.

Lemma 5.2. For any k, t,

image

Next, since  Xk;t, instead of  EXk;t, affects the dynamics, we examine the gap between �nt=1 Xk;tand  �nt=1 EXk;t. Clearly, a tight concentration inequality is essential to estimate this gap well. Since  Xk;tis a bounded r.v., there are some applicable Bernstein-type inequalities. For instance, we can apply Azuma’s inequality to analyze SG r.v.’s. However, (i) it is challenging to find an SG parameter of  Xk;tthat is good enough for our purpose (a more detailed explanation is provided after Lemma 5.6), and (ii) we only require a one-sided concentration inequality. Hence, we resort to defining a new class of r.v.’s — known as LSG r.v.’s — and provide an estimate of the relevant LSG parameter.

Definition 5.3 (LSG). An r.v.  X is v-LSG (v ≥ 0) if

image

Theorem 5.4. Let X be an almost surely bounded nonnegative r.v.. If  EX2 ≤ v2, then X is v-LSG.

Furthermore, we bound  EX2k;t(Lemma 5.5) and adapt a variation of Azuma’s inequality as in Theorem B.1 (Shamir, 2011) to evaluate the dependence between the number of observations and the number of time steps.

Lemma 5.5. For any  k, t, EX2k;t ≤ v2k = min{k2, 2/w′2 }.

Lemma 5.6. For any  k, t, δ > 0, set

image

then  Pr(E∗) ≤ δ. Further when  E∗holds, for any T > 0, �nt=1 Xk;t ≤Timplies that  n≤2T/µk+2 log(1/δ)v2k/µ2k.

Lemma 5.6 implies that with high probability, we can lower bound the amount of observations on the surviving items over the whole horizon. Subsequently, with probability at least  1 − δ, the agent would have received sufficiently many observations on the surviving items to return an  ϵ-optimal arm after at most  (c1N1 + c2N2 + c3N3)time steps (see Theorem 4.1). The lemma also indicates that a smaller LSG/SG parameter of  Xk;tleads to a smaller upper bound on the number of time steps. Since we can show  Xk;tis vk-LSG but cannot show it is  vk-SG (a detailed discussion is deferred to Appendix D.9), it is beneficial to consider the class of LSG distributions for our problem. The class of LSG r.v.’s and the general estimate of the LSG parameter, which is crucial for the utilization of the concentration inequality, may be of independent interest.

5.2. Proof sketch of Theorem 4.1

Concentration. As the algorithm proceeds, the agent moves items from  Dtto  Ator  Rtaccording to the con-fidence bounds of all surviving items in  Dt. This motivates us to define a “nice event”

image

To show that �Li=1 E(i, δ)holds with high probability, we utilize Theorem B.2 (Jamieson et al., 2014; Jun et al., 2016) and the SG property of  Wt(i)(the r.v. that reflects whether item i is clicked at time step t).

Lemma 5.7. For any  δ∈[0, 1], P� �Li=1 E(i, δ)�≥ 1−δ/2.

Sufficient observations. Next, we assume �Li=1 E(i, δ)holds and find the number of observations that guarantees the correct identification of an item. To facilitate the analysis of the expected time complexity (Proposition 4.4, 4.7), we assume �Li=1 E(i, δ′)holds for a fixed  δ′ ∈ (0, δ]in Lemma 5.8, which generalizes Jun et al. (2016, Lemma 2). Lemma 5.8. Fix any  0<δ′ ≤δ, assume �Li=1 E(i, δ′) holds.Set  T ′t := mini∈Dt Tt(i), then for any time step t,

∀i≤K′, T ′(t)≥ ¯Ti,δ′ ⇒ Lt(i, δ)>Ut(j∗, δ)−ϵ ⇒ i∈At,∀i>K′, T ′(t)≥ ¯Ti,δ′ ⇒ Ut(i, δ)<Lt(j′, δ)−ϵ ⇒ i∈Rt.

Lemmas 5.7 and 5.8 imply that with sufficiently many observations, the agent can correctly identify items with probability at least  1−δ/2.

Time complexity. Subsequently, we observe that our algorithm stops before identifying all items.

Lemma 5.9. Assume �Li=1 E(i, δ)holds. Algorithm 1 stops after identifying at most  L − max{K′ − K, 1} items.

Lemma 5.9 indicates that it suffices to count the number of time steps needed to identify at most  L − K′ + K items.

We consider the worst case in which the agent identifies items in descending order of the ¯∆i’s. We divide the whole horizon into several phrases according to  |Dt|, the number of surviving items. During each phrase, we upper bound the required number of observations with Lemma 5.8; then Lemma 5.6 helps to upper bound the required number of time steps with high probability. Lastly, we bound the total error probability by  δ/2and utilize the Lagrange multipliers to solve the following problem:

image

Altogether, we upper bound the time complexity.

5.3. Proof sketch of Theorem 4.8

image

In particular,  S∗ ∈ [K](K)is optimal under instance 0, while suboptimal under instance  1 ≤ ℓ ≤ L. Bearing the risk of overloading the notations, under instance  ℓ, we denote  S∗,ℓas an optimal arm,  Sπ,ℓtas the arm chosen by algorithm  πat step t and  Oπ,ℓtas the corresponding stochastic outcome (see its definition in Section 2).

image

Figure 5.1: Optimal set  S∗,ℓ in instance  ℓ(shaded in green)

KL divergence. We first apply chain rule to decompose KL�{Sπ,0t , Oπ,0t }Tt=1, {Sπ,ℓt , Oπ,ℓt }Tt=1�.

Lemma 5.10 (Cheung et al. (2018, Lemma 6.4)). For any 1 ≤ ℓ ≤ L,

image

Next, we lower bound  E[TT (ℓ)]with the KL divergence by

applying a result from Kaufmann et al. (2016).

Lemma 5.11. For any  1 ≤ ℓ ≤ L,

image

Define the event  E := { ˆSπ ∈ [K](K)} ∈ FT. We estab- lish that, for any  (δ, K)-PAC algorithm,  P0(E) ≥ 1 − δand  Pℓ(E) ≤ δ (∀1 ≤ ℓ ≤ L). Lastly, by revisiting the definition of  Xk;tin Section 4.1, we see that  ˜µKalso upper bounds the expected number of observations of items at one step for any  (δ, K)-PAC algorithm (Lemma 5.2). This allows us to lower bound  T∗.

We evaluate the performance of CASCADEBAI(ϵ, δ, K) andsome related algorithms. For each choice of algorithm and instance, we run 20 independent trials. The standard deviations of the time complexities of our algorithm are negligible compared to the averages, and therefore are omitted from the plots. More details are provided in Appendix E.

6.1. Order of pulled items

image

Figure 6.1: Average time complexity incurred by different sorting order of  St: ascending order of  Ti(t)(Algorithm 1), ascending/descending order of  ˆµt(i)/Ut(i)/Lt(i) with L =64, K = 16, δ = 0.1 and ϵ = 0in the cascading bandits.

As shown in Lines 5–9 of Algorithm 1, CASCADEBAI(ϵ, δ, K) sorts items in  Stbased on ascending order of  Tt−1(i)’s. This order is crucial for proving our theoretical results. To learn the impact of ordering on the time complexity, we evaluate the empirical performance of sorting  Stin descending or ascending order of  ˆwt(i)’s,  Ut(i, δ)’s or  Lt(i, δ)’s. We compare these methods under various problem settings and set the maximum time step as  107. Figure 6.1 shows that CASCADEBAI(ϵ, δ, K) empirically performs as well as the best among the other heuristics, but CASCADEBAI(ϵ, δ, K)is the only one with a theoretical guarantee.

6.2. Comparison to semi-feedback setting

We compare CASCADEBAI(0, δ, K), BATRAC(K) and BATRAC(1) (Jun et al., 2016) empirically. In Figure 6.2, if  w∗, w′are sufficiently small as the parameters shown in subfigure (a), CASCADEBAI(0, δ, K) performs similarly to BATRAC(K); if  w∗, w′are large as in subfigure (b), it behaves similarly to BATRAC(1). This corroborates the implications of Corollary 4.3.

image

Figure 6.2: Average time complexity of CASCADEBAI(ϵ, δ, K), BATRAC(1), BATRAC(K) with  L = 128, δ = 0.1, ϵ = 0, K = 20, . . . , 60.

6.3. Further empirical evidence

Our analysis of the cascading feedback involves  vk, µkin the upper bound of the time complexity; these parameters depend strongly on  w∗, w′ and K (Lemma 5.2, 5.5). Hence,to assess the tightness of our analyses, we consider several simplified cases by choosing  w∗, w′ as functions of K and examine whether the dependence of the resultant time complexity (Proposition 4.6) on K is materialized through numerical experiments.

Table 6.1: Upper bounds on the time complexity of Al- gorithm 1 with  L = 128, K = 20, . . . , 60, δ = 0.1, ϵ = 0(Proposition 4.6), and their fitting results.

image

We fit a model to the averaged time complexity under each setting as stated in Table 6.1. In each case, the  R2-statistic is almost 1, implying that the variability of the time complexity is almost fully explained by the proposed polynomial model (Glantz et al., 1990). Therefore, the empirical results show that the dependence of the upper bound on K (Proposition 4.6) is rather tight, which implies that using the new concept of LSG r.v.’s, our quantifications of the cascading feedback are also rather tight.

image

Figure 6.3: Fit the averaged time complexity with functions of K for two cases. Fix  L = 128, δ = 0.1, ϵ = 0. Blue dots are the averaged time complexity, red line is the fitted curve, and cyan dashed lines show the 95% confidence interval.

This work presents the first theoretical analysis of best arm identification problem with partial feedback. We also show that the upper bound for the CASCADEBAI(ϵ, δ, K) algorithm closely matches the lower bound in some cases. Empirical experiments further support the theoretical analysis. Moreover, the relation between the second moment and the LSG property of a bounded random variable may be of independent mathematical interest.

The assumption of  w∗ < 1/K(ensuring tightness of the sample complexity in Corollary 4.10) is relevant in practical applications since CTRs are low in real applications. For most applications (e.g. online advertising), K is small (≈ 5-10), so our assumption is reasonable. We are cautiously optimistic that, the framework could be enhanced for better bounds in the remaining less practically relevant regime, which is an avenue for future work.

The following are some more avenues for further investigations. First, we note that estimating the number of observations per time step is key to establishing a high probability bound on the total number of time steps. In this work, we bound the expectation of this term with  w∗and w (Lemma 5.2). This bound is tight in some cases (cf. Corollary 4.10). These include the difficult case where all click probabilities w(i)’s are close to one another and hence the gaps are small. Nevertheless, a tighter bound for each individual time step may improve the results; this, however, will require a more elaborate and delicate analysis of the stochastic outcomes and their impacts at each time step. This is especially so since the order of selected items also affects the number of observations in the cascading bandit setting. Secondly, this work focuses on the fixed-confidence setting of the BAI problem. We see that the consideration of the fixed-budget setting for cascading bandits is still not available. It is envisioned that the analysis of the statistical dependence between the number of observations and time steps would be non-trivial. Thirdly, we envision that the analysis may be generalized to the contextual setting (Soare et al., 2014; Tao et al., 2018).

We thank the reviewers for their insightful comments. This work is partially funded by a National University of Singapore Start-Up Grant (R-266-000-136-133) and a Singapore National Research Foundation (NRF) Fellowship (R-263-000-D02-281).

Agarwal, A., Agarwal, S., Assadi, S., and Khanna, S. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Kale, S. and Shamir, O. (eds.), Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pp. 39–75, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR. URL http://proceedings.mlr.press/ v65/agarwal17c.html.

Anantharam, V., Varaiya, P., and Walrand, J. Asymptot- ically efficient allocation rules for the multiarmed bandit problem with multiple plays–Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 32(11):968– 976, November 1987.

Audibert, J.-Y. and Bubeck, S. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010, pp. 13–p, 2010.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.

Carpentier, A. and Locatelli, A. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp. 590–604, 2016.

Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 379–387. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5433- combinatorial-pure-exploration-of- multi-armed-bandits.pdf.

Cheung, W. C., Tan, V. Y. F., and Zhong, Z. Thompson sam- pling for cascading bandits. CoRR, abs/1810.01187, 2018. URL http://arxiv.org/abs/1810.01187.

Cheung, W. C., Tan, V., and Zhong, Z. A Thompson sam- pling algorithm for cascading bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 438–447, 2019.

Cover, T. M. and Thomas, J. A. Elements of information theory. John Wiley & Sons, 2012.

Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. An experimental comparison of click position-bias models. In Proceedings of the 1st ACM International Conference on Web Search and Data Mining, pp. 87–94, 2008.

Even-Dar, E., Mannor, S., and Mansour, Y. Pac bounds for multi-armed bandit and markov decision processes. In International Conference on Computational Learning Theory, pp. 255–270. Springer, 2002.

Glantz, S. A., Slinker, B. K., and Neilands, T. B. Primer of applied regression and analysis of variance, volume 309. McGraw-Hill New York, 1990.

Heidrich-Meisner, V. and Igel, C. Hoeffding and bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 401–408. ACM, 2009.

Jamieson, K. and Nowak, R. Best-arm identification algo- rithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6. IEEE, 2014.

Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. lilucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pp. 423–439, 2014.

Jun, K.-S., Jamieson, K. G., Nowak, R. D., and Zhu, X. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pp. 139–148, 2016.

Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pp. 655–662, 2012.

Kaufmann, E., Capp´e, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17 (1):1–42, 2016.

Kveton, B., Wen, Z., Ashkan, A., Eydgahi, H., and Eriksson, B. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pp. 420– 429, 2014.

Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. Cas- cading bandits: Learning to rank in the cascade model.

In International Conference on Machine Learning, pp. 767–776, 2015a.

Kveton, B., Wen, Z., Ashkan, A., and Szepesv´ari, C. Combinatorial cascading bandits. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 1450–1458, 2015b.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1): 4–22, 1985.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.

Li, S., Wang, B., Zhang, S., and Chen, W. Contextual combinatorial cascading bandits. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 1245–1253, 2016.

Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.

Maron, O. and Moore, A. W. Hoeffding races: Accelerat- ing model selection search for classification and function approximation. In Advances in neural information processing systems, pp. 59–66, 1994.

Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In SDM, pp. 461–469. SIAM, 2014.

Rejwan, I. and Mansour, Y. Combinatorial bandits with full-bandit feedback: Sample complexity and regret minimization. arXiv preprint arXiv:1905.12624, 2019.

Shamir, O. A variant of azuma’s inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392, 2011.

Soare, M., Lazaric, A., and Munos, R. Best-arm identifica- tion in linear bandits. In Advances in Neural Information Processing Systems, pp. 828–836, 2014.

Tao, C., Blanco, S., and Zhou, Y. Best arm identifi-cation in linear bandits with linear dimension dependency. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4877–4886, Stockholmsmssan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http:// proceedings.mlr.press/v80/tao18a.html.

Wang, Q. and Chen, W. Improving regret bounds for com- binatorial semi-bandits with probabilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, pp. 1161–1171, 2017.

Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pp. 835–844, 2016.

image

image

Here are some basic facts from the literature that we will use:

Theorem B.1 (Azumas Inequality for Martingales with Subgaussian Tails, implied by Shamir (2011)). Let  {(Dt, Ft)}∞t=1be a martingale difference sequence, and suppose that for any  λ ≤ 0, we have E[eλDt|Ft−1] ≤ eλ2ω2/2 almost surely. Then for all  ω ≥ 0,

image

Theorem B.2 (Non-asymptotic law of the iterated logarithm (Jamieson et al., 2014; Jun et al., 2016)). Let  X1, X2, . . .be i.i.d. zero-mean sub-Gaussian random variables with scale  σ > 0; i.e. EeλXi ≤ eλ2σ22 . Let ω ∈ (0,�1/6). Then,

image

In general, a larger  ϵindicates a smaller time complexity. Here are two explanations. (i) When  ϵgrows,  K′ϵ, the number of  ϵ-optimal items also grows. Then it should be easier to identify an  ϵ-optimal arm. (ii) If  ϵis sufficiently large such that  K′ϵ ≥ 2K − 1, then there are at least K items left in the survival set  Dtbefore the algorithm stops. Otherwise, when |Dt| < K, the agent pulls  |Dt| < Ksurviving items at some steps and this results in a wastage in the number of time steps.

Proposition C.1. Assume  K′ ≥ 2K − 1. With probability at least  1 − δ, Algorithm 1 outputs an  ϵ-optimal arm after at most  (c1N ′1 + c2N ′2)steps where

image

In this Section, we provide proofs of Proposition 4.2, Corollary 4.3, Propositions 4.4, 4.6, 4.7, Corollary 4.9, Theorem 5.1, 5.4, Lemmas 5.2 5.11, and complete the proof of Theorems 4.1, 4.8, Proposition C.1 in this order.

D.1. Proof of Proposition 4.2

Proposition 4.2. Assume  0 < w′ < w∗ ≤ 1. We have

image

Proof. According to Lemma 5.2 and Theorem 5.4,

image

We upper bound  vk/µk and k/µkin two cases:

(i):  0 < w∗ ≤ 1/K: 0 < w′ < w∗ ≤ 1/K, vk = k, µk ≥ k/2.

image

(ii):  1/K < w∗ ≤ 1: vk ≤√2/w′, µk ≥ 1/(2w∗).

image

D.2. Proof of Corollary 4.3

image

steps; (ii) if all w(i)’s are at least 1/2, with probability at least  1 − δ, Algorithm 1 outputs  S∗ after at most

image

steps.

Proof. According to Lemma 5.2 and Theorem 5.4,

image

We first upper bound  vk/µk and k/µkin two cases:

(i):  0 < w∗ ≤ 1/K: 0 < w′ < w∗ ≤ 1/K, vk = k, µk ≥ k/2.

image

(ii):  1/K < w∗ ≤ 1: vk ≤√2/w′, µk ≥ 1/(2w∗).

Next, we separate the upper bound in Theorem 4.1 into two parts and bound them separately:

image

Case 1: All click probabilities w(i) are at most  1/K. 1/w∗ ≥ K and vk/µk ≤ 2, K1 = K − 1.

image

Case 2: All click probabilities w(i) are at least  1/2. 1/w∗ ≤ K implies vk/µk ≤ 4√2, K1 ≥ 1.

image

image

Recall that when  ϵ = 0,

image

for all  i ∈ [L]. Altogether, we complete the proof.

D.3. Proof of Proposition 4.4

Proposition 4.4. (i) If all w(i)’s are at most 1/K,

image

(ii) if all w(i)’s are at least 1/2,

Proof. (i) Consider all click probabilities  w(i)′’s are at most 1/K. For any  0 < δ′ ≤ δ, revisit the proof and result of Theorem 4.1. First, Lemma 5.7 implies that  P��Li=1 E(ϵ, δ′)�≥ 1 − δ′/2. Assume �Li=1 E(ϵ, δ′)holds from now on. Secondly, Lemma 5.8 indicates that Algorithm 1 can correctly identify item  i after ¯Ti,δobservations. Thirdly, similar to the discussion in Section 5.2, we set �K−1k=1 δk ≤ δ′/2. Additionally applying the analysis in Proposition 4.2 and Corollary 4.3, with probability at least  1 − δ′, we can bound the time complexity by

image

In short, set

image

then  Pr(T > −A log δ′) < δ′ for any 0 ≤ δ′ ≤ δ.

Meanwhile, Tonelli’s Theorem implies that

Since  x = −A log δ implies δ = e−x/A and� +∞0 e−x/A dx = Ae−x/A|0x=+∞ = A,

image

(ii) Consider all click probabilities w(i)’s are at least 1/2. The analysis is similar to that in Case (i). With the analysis in Theorem 4.1 and results in Proposition 4.2 and Corollary 4.3, for any  0 < δ′ ≤ δ, with probability at least  1 − δ′, the time complexity is at most

image

D.4. Proof of Proposition 4.6

Proposition 4.6. Under Assumption 4.5, (i) if  0 < w∗ ≤ 1/K, with probability at least  1 − δ, Algorithm 1 outputs  S∗ afterat most

image

steps; (ii) if  1/K < w∗ ≤ 1, with probability at least  1 − δ, Algorithm 1 outputs  S∗ after at most

image

steps.

Proof. We first remind ourselves how the algorithm proceeds. In this instance,  ϵ = 0yields  K∗ = 1. For any item i ∈ [L], ¯∆i = ∆i = w∗ − w′. And according to Lemma 5.8, item i will be correctly classified with high probability after ¯Tiobservations where  ρ = δ/(12L),

image

image

This implies that each item requires the same number of observations to be correctly identified. According to the design of algorithm,  Tt(j) − 1 ≤ Tt(i) ≤ Tt(j) + 1for any remaining items  i ̸= j. Therefore, the worst case is as follows:

• the agent observes one item for ¯T(w)times and the others for ¯T(w) − 1times after  t′ steps, and identifies one item per step for the subsequent  L − 2 steps.

Therefore, we now turn to upper bounding the number of steps required to eliminate an item for the first time. According to Lemma 5.6, we set  δ0 = δ/2, k = K, n = t′, ω′K = −�−2t′v2K log(δ/2). Then the total number of observations during  t′steps should be larger than  t′µK + ω′K with probability at least  1 − δ/2. The number of observations can be upper bounded as follows:

image

Then with Lemma 5.7 and its ensuing discussion in Section 5.2, with probability at least  1 − δ, the time complexity is upper bounded by

image

Next, we consider how the values of  w∗ and w′ affect the bound. According to Lemma 5.2 and Theorem 5.4,

image

We discuss two cases separately:

Case 1:  0 < w∗ ≤ 1/K: 0 < w′ < w∗ ≤ 1/K, vK = K, µK ≥ K/2. The upper bound becomes:

image

Case 2:  1/K < w∗ ≤ 1: vk ≤√2/w′, µk ≥ 1/(2w∗). The bound becomes

image

D.5. Proof of Proposition 4.7

Proposition 4.7. Under Assumption 4.5, (i) if  0<w∗ ≤1/K,

image

(ii) if  w′ ≥ 1/2 or w∗/w′ ≤ 2,

Proof. For any  0 < δ′ ≤ δ, we revisit the proof of Proposition 4.6. Firstly, Lemma 5.7 implies that  P��Li=1 E(ϵ, δ′)�≥1 − δ′/2. Assume �Li=1 E(ϵ, δ′)holds from now on. Secondly, Lemma 5.8 implies that the agent can identify any item correctly after

image

observations. Then with analysis similar to Appendix D.4, we can upper bound the time complexity of Algorithm 1 with probability  1 − δ′.

Case 1:  0 < w∗ ≤ 1/K: with probability at least  1 − δ′, the time complexity is upper bounded by

image

Then for any  0 < δ′ ≤ δ, Pr(T > −A log δ′) < δ′. Meanwhile, Tonelli’s Theorem implies that

image

Since  x = −A log δ implies δ = e−x/A and� +∞0 e−x/A dx = Ae−x/A|0x=+∞ = A,

image

Case 2:  1/2 ≤ w′ < 1 or w∗/w′ ≤ 2: with a similar analysis, for any  0 < δ ≤ δ′, with

image

Pr(T > −A log δ′) < δ′. Lastly,

image

D.6. Proof of Corollary 4.9

Corollary 4.9. Under Assumption 4.5, we have

Proof. First, by setting  w(i) = w∗ for all 1 ≤ i ≤ K and w(j) = w′ for all k < j ≤ L, the result in Theorem 4.8 becomes

image

Next, according to Pinsker’s and reverse Pinsker’s inequality for any two distributions P and Q defined in the same finite space X we have

image

where  δ(P, Q) = sup{|P(A)−Q(A)|��A ⊂ X} and αQ = minx∈X:Q(x)>0 Q(x). In our case, set  δ(w∗, w′) = (w∗ −w′)2and  α = min{w′, w∗, 1 − w∗, 1 − w′} = min{w′, 1 − w∗}, we have

image

Further since  ˜µK ≤ 1/w′ as stated by Lemma 5.2, the lower bound becomes

image

D.7. Proof of Theorem 5.1

Theorem 5.1. Consider a set of items with weights  u = (u1, . . . , uk)such that  u1 ≥ . . . ≥ uk, and let  µk(u, I)be the expected number of observations when items are placed with order  I. Let Idec = (1, . . . , k), Iinc = (k, . . . , 1), and I be any

order, then

Proof. (i) Consider any ordered set  I = (iI1, . . . , iIk). To show µk(u, Idec) ≤ µk(u, I) ≤ µk(u, Iinc), it is sufficient to show the following:

(∗):If there exists  1 ≤ m < ksuch that  uiIm < uiIm+1, we can change their positions to get  I′and have  µk(u, I) >µk(u, I′).

The proof of  (∗)is as follows:

image

µk(u, I) − µk(u, I′) = m ·

image

µk(u, I) − µk(u, I′) = m ·

(ii) To show the monotonicity, it is sufficient to show the following:

(#): Set two sets of click probabilities u, v such that  viIm > uiImfor some  1 ≤ m ≤ kand  viIj = uiIjfor  j ̸= m. Then we have  µk(u, I) ≥ µk(v, I).

Here is the proof of (#). If m = k, then obviously we have  µk(u, I) = µk(v, I). If 1 ≤ m < k, we exchange positions of the m-th and (m + 1)-th item to get a new ordered set  I1, then

image

Hence

If m + 1 < k, note that the only difference between  (u, I1) and (v, I1)now lies in the click probability of the (m + 1)-th item. In detail,

image

We exchange positions of the (m + 1)-th and (m + 2)-th item in  I1to get a new ordered set  I2. Similarly we have

image

We can continue this operation for  n = k − mtimes and get  In. Iteratively, we have  µk(u, I) − µk(v, I) ≥ µk(u, In) −µk(v, In). Besides, the only difference between  (u, In) and (v, In)now lies in the click probability of the k-th item:

image

Since µk(u, In) = µk(v, In), µk(u, I) ≥ µk(v, I).

D.8. Proof of Lemma 5.2

Lemma 5.2. For any k, t,

Proof. Lower bound. According to Lemma 5.1, the expectation of observations attains its minimum when we pull an ordered set {1, 2, . . . , k}, and attains its maximum when we pull an ordered set  {L − k + 1, L − k + 2, . . . , L}. In other words, depending on the instance, the expectation of observations can be lower bounded as follows:

image

Moreover, since the lower bound  µkis larger than the expectation of observations when  w(i) = w∗ for all 1 ≤ i ≤ k or wepull item 1 for K times (note that this is not allowed in Algorithm 1), we can utilize only  w∗ to lower bound the expectation:

image

then

image

image

Let  w∗ = k−β ∈ [0, 1], then  β ≥ 0. Since  (1 − 1/x)xis a nondecreasing function of x and  limx→∞(1 − 1/x)x = 1/e, k1−β ≥ 0,

image

If  β ≥ 1, let f(x) = e−x, then f (n)(x) = (−1)n · e−x. For any x ≥ 0, there exists  y ∈ [0, x] such that

image

This leads to  1 − e−x ≥ x(1 − x/2) and

image

Otherwise,  0 ≤ β < 1. Since

image

1 − e−k1−β decreases as  βincreases. Then,

Upper bound. Similarly we can see that the expectation of observations attains its maximum when we pull an ordered set {L, L − 1, . . . , L − k + 1}, and therefore upper bounded by

image

Furthermore, the upper bound  ˜µkis smaller than the expectation of observations when  w(j) = w′ for all L − k + 1 ≤ j ≤ Lor we pull item L for K times (again note that this is not allowed in Algorithm 1):

image

D.9. Proof of Theorem 5.4

Theorem 5.4. Let X be an almost surely bounded nonnegative r.v.. If  EX2 ≤ v2, then X is v-LSG.

Proof. Set  EX = µand  0 ≤ X ≤ Ma.s., then  M ≥ 0and  0 ≤ µ ≤ M. It is equivalent to show that for any  v ≥ EX2, λ ≤ 0,

image

Set

it is further equivalent to show  f(λ) ≥ 0. Then since  0 ≤ X ≤ Ma.s., for any  λ ≤ 0, by Bounded Convergence Theorem,

image

Therefore,

Let  µbe the probability measure of  X on R. Since 0 ≤ X ≤ M a.s., µ([0, M]) = 1 and

image

Since  E[exp(λX)])2 > 0, g′(λ) ≤ 0. Hence  g(λ)is monotonically decreasing on R. Further, for any  λ ≤ 0, since v2 ≥ EX2

image

Given  v2 ≥ EX2, it is more challenging to show X is v-SG than to show X is v-LSG. By revisiting the proof above, we see that given X is v-LSG, to show X is v-SG suffices to show  f(λ) ≥ 0 for any λ ≥ 0. Since it is hard to directly tell whether the inequality above holds for any  λ ≥ 0, it is natural to look at how  f(λ) grows in R.

Fix any  M0 > 0. For any λ ∈ [0, M0], again, applying the Bounded Convergence Theorem, we have

image

Since  f(0) = 0 and f ′ is differentiable on R, it requires at least  r > 0 such that f ′(λ) ≥ 0 for any λ ∈ [0, r]. Furthermore, since  f ′(0) = 0, one may consider showing that  f ′′(λ) ≥ 0 on [0, r].

In the proof above, we define a function g to show that  f ′′(λ) ≥ g(λ) ≥ 0on  (−∞, 0]. However, since  g(λ) ≤ 0on [0, +∞), this cannot help to show  f ′′(λ) ≥ 0 on [0, r].

The discussion above indicates that showing X is v-SG is more challenging than showing X is v-LSG.

D.10. Proof of Lemma 5.5

Lemma 5.5. For any  k, t, EX2k;t ≤ v2k = min{k2, 2/w′2 }.

Proof. Recall  w′is the minimum click probability. We abbreviate  Xk;tas X. Firstly, since  X ∈ [1, k], EX2 ≤ k2. Next, note that  EX2increases when the click probabilities decrease or k increases. Set Y as a random variable drawn from a geometric distribution with parameter  w′, then EX2 ≤ EY 2. Since EY 2 = 2/w′2 − 1/w, EX2 ≤ 2/w′2.

D.11. Proof of Lemma 5.6

Lemma 5.6. For any  k, t, δ > 0, set

then  Pr(E∗) ≤ δ. Further when  E∗ holds, for any  T >0, �nt=1 Xk;t ≤Timplies that  n≤2T/µk+2 log(1/δ)v2k/µ2k.

Proof. We abbreviate  Xk;t as Xt(the number of observations of surviving items at step t when pulling k surviving items), and set  Dt = Xt − EXt, Ftdenote the decisions and observations up to step t. Besides, let  Stbe the set to pull at step t, then  Stis determined by  Ft−1, and Xtdepends on  St. Since

image

D1, . . . , Dti s a martingale difference sequence adapted to  F = (Ft)t. Besides, according to Theorem 5.4, for any t, any λ ≤ 0, v2k ≥ EX2 yields E[eλDt|Ft−1] ≤ eλ2v2/2. Then for any  ω > 0,

image

Let the probability bound in the right hand side be  δ, then

image

Note that  EXt ≥ µk for any t,

image

Next, for any T > 0, consider

image

Set

then  x ≥ 0 and x2 − a0x − b0 ≤ 0. Note that  (p + q)2 ≤ 2(p2 + q2),

image

D.12. Proof of Lemma 5.7

Proof of Remark D.1. Any non-negative random variable bounded in [a, b] a.s. is sub-Gaussian with parameter  (b − a)/2.Meanwhile,  Wt(i) ∈ [0, 1]yields that  ηt(i) ∈ [w(i) − 1, w(i)]. [w(i) − (w(i) − 1)]/2 = 1/2.

Proof. For all  i ∈ [L], E(i, δ) = {∀t ≥ 1, | ˆwt(i) − w(i)| ≤ Ct(i, δ)}. Recall that

image

then according to Theorem B.2,

image

D.13. Proof of Lemma 5.8

image

Preliminary. Since we use the UCB of the empirical top-(kt + 1)item to accept  ϵ-optimal items, hopefully it should be close to the true click probability of item  (kt + 1); likewise, the LCB of the empirical top-(kt)item should be close to the true click probability of item  (kt). This is stated in Lemma D.2.

Lemma D.2 (Jun et al. (2016, Lemma 3)). Denote by ˆithe index of the item with empirical mean is i-th largest: i.e., ˆw(ˆ1) ≥ . . . ≥ ˆw(ˆL).Assume that the empirical means of the arms are controlled by  ϵ : i.e., ∀i, | ˆw(i) − w(i)| < ϵ. Then,

image

After that, Lemma 5.8 shows that the agent will correctly classify the items after a sufficient number of observations, and also show what is the sufficient number of observations for each item.

Proof. Recall

And We use  ρ and ρ′ as abbreviations for  ρ(δ) and ρ(δ′)respectively.

It suffices to show for the case where  At and Rtare empty since otherwise the problem is equivalent to removing rejected or accepted arms from consideration and starting a new problem with  Lnew = L − |At| − |Rt|and  Knew = K − |At|while maintaining the observations so far. Note that when  At is empty, kt = K.

First of all,  Tt(i) ≥ T ′t implies that

Then since �Li=1 E(i, δ′) holds, | ˆwt(i)−w(i)| ≤ ˜C (Tt(i), ρ′) ≤ ˜C (T ′t, ρ′) for all i ∈ Dt. Combining this with Lemma D.2, we have

image

We first prove that for any  i ≤ K′,

For clarity, we write  j∗ = �K + 1, which is the item with (K + 1)-th largest empirical mean at the t-th step. We assume the contrary:  Lt(i, δ) ≤ Ut( �K + 1, δ) − ϵ. We can apply (D.1) and (D.2) to obtain

image

Next,

Part (a) of the second line above follows from: (i) if  i ≤ K, w(i)−w(K +1)+ϵ = ∆i +ϵ > 0; (ii) else,  K < i ≤ K′, sincew(i) ≥ w(K) − ϵ, we have w(i) − w(K + 1) + ϵ = w(i) − w(K) + w(K) − w(K + 1) + ϵ = ∆K − ∆i + ϵ ≥ ∆K > 0.Then invert to the third line using

image

image

Therefore, ¯T ′t ≥ ¯Ti,δ′implies that  Lt(i, δ) > Ut(j∗, δ) − ϵ where j∗ = arg max(kt+1)j∈Dt ˆwt. Then i ∈ Atis accepted.

Subsequently, we prove that for any  i > K′,

Again for brevity, we write ˆK = j′, the item with K-th largest empirical mean at the t-th step. We assume the contrary: Ut(i, δ) ≥ Lt( ˆK, δ) − ϵ. Again applying (D.1) and (D.2), we have

image

Next,

image

Similar to the first case, with

image

D.14. Proof of Lemma 5.9

Lemma 5.9. Assume �Li=1 E(i, δ)holds. Algorithm 1 stops after identifying at most  L − max{K′ − K, 1} items.

Proof. Assume �Li=1 E(i, δ) holds.Case (i):  K′ = K. In the worst case, the algorithm does not terminate before identifying the  (L − 1)-th one. In this case,

after identifying the  (L − 1)-th one with sufficient observations, either the accept set or the reject set is full, i.e.,  |At| = Kor  |Rt| = L − K, the the agent can just place the remaining item in the unfilled set.

Hence, the algorithm terminates after sufficiently observing and identifying at most  L − 1 = L − max{K′ + K, 1} items.

Case (ii):  K′ > K. The algorithm classify all items correctly according to Lemma 5.8. since the number of  ϵ-optimal items

is  K′ = max{i : w(i) ≥ w(K) − ϵ} ≥ K, the number of suboptimal items is  L − K′ ≤ L − K. Hence, |Rt| ≤ L − K′.Besides,  |At| ≤ Kaccording to the design of the algorithm. Therefore,

image

In other words, the algorithm terminates after sufficiently observing and identifying at most  L − K′ + K = L − max{K′ +K, 1} items.

D.15. Proof of Lemma 5.11

Lemma 5.11. For any  1 ≤ ℓ ≤ L,

To manifest the difference between instance  ℓand other instances, with  w(0)(i) = w(i) for all i ∈ [L] we write

 {w(0)(1), w(0)(2), . . . , w(0)(L)}under instance 0;

 {w(0)(1), w(0)(2), . . . , w(0)(ℓ − 1), w(ℓ)(ℓ), w(0)(ℓ + 1), . . . , w(0)(L)}under instance .

We combine Lemma 5.10 and a result from Kaufmann et al. (2016) to relate the time complexity and KL divergence together. Lemma D.3 ((Kaufmann et al., 2016, Lemma 19)). Let T be any almost surely finite stopping time with respect to  Ft. Forevery event  E ∈ FT, instance  1 ≤ ℓ ≤ L,

image

Notations. Before presenting the proof, we remind the reader of the definition of the KL divergence (Cover & Thomas, 2012). For two discrete random variables X and Y with common support A,

image

denotes the KL divergence between probability mass functions of X and Y . Next, we also use  KL(PX∥PY )to also signify this KL divergence. Lastly, when a and b are two real numbers between 0 and  1, KL(a, b) = KL (Bern(a)∥Bern(b)), i.e., KL(a, b) denotes the KL divergence between Bern(a) and Bern(b).

Proof. For any certain  st, we can observe that the KL divergence  KL�POπ,0t |Sπ,0t (· | st)��� POπ,ℓt |Sπ,ℓt (· | st)�should grow with the KL divergence of observed items. Further, for each observed item i, there is a KL divergence of KL�w(0)(i), w(ℓ)(i)�. Whenever Sπ,0t = st, we have

image

Then according to Lemma 5.10,

image

D.16. Proof of Theorem 4.1

Preliminary. Recall that ¯∆σ(1) ≥ ¯∆σ(2) ≥ . . . ≥ ¯∆σ(L), and  Tt(i)counts the number of observations of item i up to the t-th step. The worst case is that the algorithm eliminates  σ(1), σ(2), . . .in order, and the algorithm eliminates at most 1 item at one time step. Besides, the design of Algorithm 1 implies that

image

In the following discussion,we assume �Li=1 E(i, δ) holds and K′ < 2K −1(discussion on  K′ ≥ 2K −1is in Appendix C). Note that Lemma 5.7 implies that  P��Li=1 E(i, δ)�≥ 1 − δ/2. Besides, we write  µ(k, w) as µk, v(k, w) as vk, ¯Ti,δ as ¯Ti,ρ(δ) as ρfor simplicity.

Bound the number of observations per phrase. Observe that there are less than K surviving items remaining in the survival set  Dtat some steps before the algorithm terminates, we separate the steps into several phrases:

(i) During the first phrase, the agent eliminates  L − K + 1items within  t1steps. According to Lemma 5.8 and Line (D.3),

image

Then the total number of observations of surviving items in  Dtwithin this phrase can be bounded as follows:

image

(ii) During the k-th phrase for any  2 ≤ k ≤ K − max{K′ − K, 1}, the agent eliminates the  L − K + k-th item within  tksteps. Again apply Lemma 5.8 and Line (D.3):

image

Then the total number of observations of surviving items in  Dtwithin this phrase can also be bounded:

image

Bound the number of time steps per phrase. Recall that the k-th (1 ≤ k ≤ K − max{K′ − K, 1}) phrase consist of  tktime steps. Let  Zkbe the total number of observations within the  tksteps. Lemma 5.6 indicates that

image

Then according to Lemma 5.6, for any  k (1 ≤ k ≤ K − max{K′ − K, 1}), with probability at least  1 − δk,

image

Bound the time complexity. Altogether, we would have �K−max{K′−K,1}k=1 tkas the time complexity. Besides, we bound the total error incurred by partial observation by  δ/2. In other words,

image

Depending on the value of  K′ − K, there are two cases:

image

For brevity, we only go through the remaining analysis for the first case, the analysis for the second one is similar.

Since the second term in the bound on T merely depends on the problem, we turn to analyze the first term. Since the first term holds for any values of  δk’s such that �2K−K′k=1 δk ≤ δ/2, we minimize the first term with the method of Lagrange

image

Let

image

then for all  1 ≤ k ≤ 2K − K′,

(▲)attains its maximum when  δk = δ∗k for all 1 ≤ k ≤ 2K − K′. Hence

image

Now we bound  (♠), (♥), (♣)individually.

Bounding  (♠): note that  µK+1−k ≥ 2 for all 1 ≤ k ≤ 2K − K′, K′ ≥ K and ck =2v2K−k+1µ2K−k+1 ,

image

Bounding  (♥) : Let g(x) = log xx for x > 0, then g′(x) = 1−log xx2 . Since g′(x) > 0 when x ∈ (0, e), g′(e) = 0, g′(x) < 0when  x ∈ (e, +∞), g(x)is increasing on (0, e), is decreasing on  (e, +∞)and attains its global maximum  g(e) = 1eat x = e. Hence,

image

Bounding  (♣): We first rewrite this term according to the definition of ˜Tk’s:

image

image

Next, since  µk ≥ min{k/2, 1/(2w∗)}as shown in Lemma 5.2, when  K − k + 1 ≤ 1/w∗,

image

and

image

Further,

image

Summation of  (♠), (♥), (♣). Recall ρ = δ/(12L) and

image

The time complexity is upper bounded by

image

where

image

image

= c4

image

D.17. Proof of Theorem 4.8

Recall that  Oπtis a vector in  {0, 1, ⋆}K, where  0, 1, ⋆represents observing no click, observing a click and no observation respectively. For example, when  Sπt = (2, 1, 5, 4)and  Oπt = (0, 0, 1, ⋆), items 2, 1, 5, 4 are listed in the displayed order; items 2, 1 are not clicked, item 5 is clicked, and the response to item 4 is not observed. By the definition of the cascading model, the outcome  Oπt = (0, 0, 1, ⋆)is in general a (possibly emtpy) string of 0s, followed by a 1 (if the realized reward is 1), and then followed by a possibly empty string of  ⋆s. Clearly,  Sπ,ℓt , Oπ,ℓtare random variables with distribution depending on  w(ℓ)(hence these random variables distribute differently for different  ℓ), albeit a possibly complicated dependence on w(ℓ).

With the analysis in Section 5.3, according to Lemma 5.11 and the definition of the instance  ℓ, one obtains for  i ∈ {1, . . . , K}or  j ∈ {K + 1, . . . , L}respectively,

image

Let  Ytdenote the number of observations of items at time step t. By revisiting the definition of  Xk;tin Section 4.1, we see that  XK;tactually counts the observation of all pulled items at time step  t. Hence, Yt ≤ XK;t. Setting α → 0and summing over the items yields a bound on the expected number of total observations  E��Tt=1 Yt�= �Li=1 E[TT (i)]. Meanwhile, an upper bound of  EXK;tas stated in Lemma 5.2 and tower property indicates that

image

Note that  KL(x, 1 − x) ≥ log(1/2.4x) for any x ∈ [0, 1], we complete the proof of Theorem 4.8.

D.18. Proof of Proposition C.1

Proposition C.1. Assume  K′ ≥ 2K − 1. With probability at least  1 − δ, Algorithm 1 outputs an  ϵ-optimal arm after at most  (c1N ′1 + c2N ′2)steps where

image

Proof. Consider  K′ ≥ 2K − 1, i.e, K′ − K ≥ K − 1. According to Lemma 5.9, there are at least  K′ − K + 1 ≥ K itemsin the survival set  Dtbefore the algorithm terminates, so the algorithm pulls K items from the surviving set  Dtat each time step. And for simplicity, we again write  µ(k, w) as µk, v(k, w) as vk, ¯Ti,δ as ¯Ti, ρ(δ) as ρ.

Recall Lemma 5.6, we set  δ0 = δ/2, k = K, n = t′0, ρ′ = −�−2t′0v2K log(δ/2). Then the total number of observations during  t′0 steps should be larger than  t′0µK + ρ′with probability at least  1 − δ/2. And since the number of observations can

be upper bounded, we consider

Lastly, with Lemma 5.6, 5.7 and 5.8, we obtain that with probability at least  1 − δ, Algorithm 1 stops after at most

image

steps.

E.1. Order of pulled items

image

image

image

image

Figure E.1: Average time complexity incurred by different sorting order of  St: ascending order of  Ti(t)(Algorithm 1), ascending/descending order of  ˆµt(i)/Ut(i)/Lt(i)in the cascading bandits.

After a large amount of observations, it is likely that the empirical mean  ˆwt(i)approaches the true weight w(i), and w(i) lies between the confidence bounds  Ut(i, δ) and Lt(i, δ)with high probability. Therefore, one may consider to sort  St in thedescending or ascending order of  ˆwt(i)’s, Ut(i, δ)’s or Lt(i, δ)’s (the difference to Algorithm 1 reveals in Line 5–9). Diving into the numerical results, we found an algorithm always manages to find an  ϵ-optimal arm provided that it is not terminated by the limit of  107 steps. Hence, we focus on the comparison of averaged stopping time.

In Figure E.1, we can see that sorting  Stin the ascending order of  ˆµt(i)or  Ut(i), especially the latter one, incurs an apparently larger averaged stopping time than other methods in most cases. Next, the descending order of  ˆµt(i)does not work well in some cases. Thirdly, the ascending order of  Lt(i)performs almost the same as our algorithm in most cases but there are several cases where it performs much worse and does not terminate even after  107 iterations. Lastly, the descending order of  Ut(i)works almost as well as Algorithm 1 empirically but is in lack of theoretical guarantee on time complexity. Meanwhile, the standard deviation of the stopping time of our algorithm is negligible comparing to the average value. For instance, in the left-most case of Figure 6.1, the standard deviation is about 22318.54 when the average is about 754140.65.

E.2. Further empirical evidence

Table E.1: Fitted results of upper bounds on the stopping time T of Algorithm 1 with  ϵ = 0(Proposition 4.6).

image

As shown in Table E.1, p-value is the probability that we reject the assumption of our fitting model versus a constant model (Glantz et al., 1990). Hence, the small p-values indicates that our fitting models are reasonable. Next, all  c1’s are positive, implying all averaged stopping time grows with K, which corroborates our theoretical results.

image

Figure E.2: Fit the averaged stopping time with functions of K for each case in order. Fix  L = 128, δ = 0.1, ϵ = 0. Blue dots are the averaged stopping time, red line is the fitted curve, and cyan dashed lines show the 95% confidence interval.


Designed for Accessibility and to further Open Science