We consider a high-dimensional online batch decision making problem, a setting in which the decision-maker must interacts with a group of users, instead of a user, at each decision epoch. Such setting arises very naturally in real-world applications but received less attention in the literature. In interactive marketing Bertsi-
Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).
mas and Mersereau (2007), marketers choose among a set of marketing messages to be sent to several customers simultaneously; updating marketing strategy once receiving a feedback is computationally impractical, and a more practical approach is to aggregate data in a prescribed length of period before adopting new strategy. In clinical trials Ahuja and Birge (2016), physicians choose among a set of available treatments to be administered to a group of patients simultaneously; updating treatment policy once measuring a response is unrealistic, and therapy in real practice is to collect data in a pre-approved length of period before designing new policy. In real-world practice, applicability of online decision-making methodology turns out to be impeded by such limited adaptivity. Similar issue has been addressed by Bai et al. (2019) in reinforcement learning setting, but it remains open in the setting of online decision making under bandit feedback.
One feature of modern economy digitization for decision makers is to deliver personalized products, services, and solutions based on individual-level data. Additionally, in many practical settings, individual-level data are high-dimensional but typically only a small number of the observed covariates are decisive. An additional layer of complexity in online batch decision-making is that, the whole decision making process is learned from bandit feedback: decision makers only observe the outcome for the product that was delivered, but not for any other available products that could have been delivered. Taking heterogeneity and bandit feedback into account, approaches of approximate dynamic programming or Markov decision process in Ahuja and Birge (2016) addressed the dimensionality issue by transforming the covariate space into a finite number of states and then solving the corresponding Bellman equation. However, how to construct such a transformation is often unspecified in this line of approach and particularly ambiguous in the case of high-dimensional covariate space, impeding itself to embrace the blessing of modern economy digitization.
In this paper, we propose a new class of approaches, named Teamwork LASSO Bandit algorithm, for online
Figure 1: A realization of proposed Teamwork-Selfish policy in online batch decision making. N is the batch size. K is the number of available treatments. q is the number of repetition in a block. The agent runs teamwork mode in blue blocks and runs selfish mode in red blocks. The samples collected from blue blocks is called Teamwork sample. The samples collected from red blocks is called Selfish sample.
batch decision-making to tackle the curse of heterogeneity and high-dimensional covariate space under bandit feedback. To achieve both goals simultaneously, a deliberate balance between exploration and exploitation is required to efficiently learn a personalize decision-making rule that maximizes the cumulative treatment efficacy. In particular, in Teamwork LASSO Bandit algorithm, the agent switches between teamwork mode and selfish mode during the whole decision process to alternatively perform pure exploration and pure exploitation. By the proposed Teamwork-Selfish policy, the resulting sample set strikes a balance on optimal allocation in the sense that the rate of optimal allocation is beyond certain level for each treatment (see Defini-tion 1) with high probability. This strategy ensures that the LASSO regression accommodates dependency arising in a sequence of batch observations and enjoys certain statistical properties in order to achieve optimal allocation for each incoming user.
Contributions. Our contribution is three-fold. First, we propose the “Teamwork LASSO Bandit algorithm,” that solves the online batch decision-making problem with high-dimensional user covariates via learning LASSO estimates of treatment efficacy. Second, we propose a general “optimal allocation rate condition” on sample set as a theoretical guideline in designing a data collection scheme to identify the optimal treatment given observed user covariates. Such a scheme is illustrated by the proposed Teamwork-Selfish policy. Last, we establish an upper bound of the expected cumulative regret that scales linearly with the batch size. As a technical by-product, we develop a deviation inequality of LASSO regression that adaptes to a sequence of batch observations. To our knowledge, this is the first work addressing batch version of explore-exploitation dilemma in the setting of online batch decision-making.
Related works Batched bandit problem has received less research attention over the last decade. The special case of two-armed bandit setting has been addressed in Perchet et al. (2016); recently, Gao et al. (2019) extends the setting to K-armed bandit. While these works addressed the problem by mean model, we addressed batched bandit problem with high-dimensional linear model. In non-batch setting, Rusmevichientong and Tsitsiklis (2010) addressed the bandit problem by linear model in low dimensional setting; Bastani and Bayati (2020) then extends the setting to high-dimensional linear model. Comparing to these works, our model extends such setting to batched bandit problem.
A toy example for illustration
Consider the simplest scenario where the agent allocates one of two available treatments to each user in a size-2 batch at every decision epoch (That is, N = 2, K = 2 in Figure 1). In this case, each block contains 2q epochs and each epoch has two user. In the blue block, the agent runs in a teamwork mode: it sends both users to the first treatment at the first q epochs and to the second treatment at the second q epochs; in the red block, the agent runs in a selfish mode: it sends each user to her estimated optimal treatment based on all historical information over all 2q epochs. The agent runs in a teamwork mode (blue) on the block whose number is of power of 2, that is, and runs in a selfish mode on the red block. The samples collected from blue blocks are called Teamwork sample; the samples collected from red blocks are called Selfish sample. More specifically, in red block, the agent runs a selfish mode by a two-step procedure: the first step is to run LASSO regression on Teamwork sample to output a set of optimal treatment candidates for the current user; the second step is to run LASSO regression on the union of Teamwork sample and Selfish sample to decide the optimal treatment for the current user.
Notation For any positive integer n, define [n] = and for any
and
denotes the number of elements for any collection A. For any vector
the indexes of non-zero coordinate. For a vector u =
denotes the maximum absolute value of its entries;
the
Treatment efficacy model We consider an agent, who needs to allocate one of K different available treatments (arms), denoted as , for a size-N batch of users in each decision epoch
where T denotes the length of the decision epoch horizon. The users in the current batch are represented by N observable vectors of features (covariates)
. We assume that feature vectors
may vary across decision epoch and are sampled independently from a fixed, but a priori unknown distribution
with bounded support X.
The efficacy of treatment w for the ith user (feedback) at epoch , where the function y is unknown. At each decision epoch t, the agent allocates a batch of treatments
, and then observes a batch of efficacy
. The objective is to design an allocation policy, which maps users to their own treatments, that maximizes the cumulative sum of observed efficacy.
We assume that the efficacy of treatment w for user is a linear function of her covariates
where are martingale difference noise. At each epoch
’s are drawn independently from a mean zero
-sub-Gaussian distribution (that is,
for all real
) and
forms a martingale difference sequence (that is,
). The noise is often introduced to account for the features that are not included in the model.
Efficacy parameters are a prior unknown to agent. Therefore, the agent deals with exploration-exploitation trade-off as it needs to choose between learning B and exploiting what has been learned so far to maximize treatment efficacy.
Our proposed algorithm exploits the structure (sparsity) of the feature space to improve its performance. For this purpose, let denote the number of nonzero coordinates of
. Define
and note that
is a priori unknown to the agent.
Technical assumptions For ease of presentation, we assume that , for all
, and
for a known constant b.
We denote by the set of feasible parameters, that is,
and we write
To measure the performance of proposed bandit algorithm, we present three technical assumptions.
Assumption 1. (Margin Condition) There exists a constant such that for
in W, P(0 <
Assumption 1 is referred to the Margin Condition in the classification literature Tsybakov et al. (2004) and is introduced in multi-armed linear bandit literature to ensure only a small fraction of features can be drawn near the classification boundary 0} in which efficacy of both treatments are almost equivalent; see Rusmevichientong and Tsitsiklis (2010), Bastani and Bayati (2020).
Assumption 2. (Treatment Optimality Condition) There exist some constant h > 0 and two mutually exclusive sets, denoted as and
with W =
(a) For each treatment , it holds for every covariate vector
(b) For each treatment , there exists a constant
Assumption 2 is referred to the Treatment Optimality Condition in Rusmevichientong and Tsitsiklis (2010), Bastani and Bayati (2020) and is to separate available treatments into an optimal subset and a sub-optimal subset
such that every optimal treatment
is strictly optimal for some users (denoted by the set
) and every sub-optimal treatment is strictly sub-optimal for every users.
Assumption 3. (Compatibility Condition) There exists a constant such that for each optimal treatment
, the population covariance matrix
belongs to the compatibility set of its treatment efficacy parameter
where is defined as
Assumption 3 is referred to as the Compatibility Condition in high-dimensional statistics literature Bühlmann and Van De Geer (2011) and is to ensure that LASSO estimate trained on samples converges to the true parameter
with high probability as the number of samples grows to infinity.
Oracle Policy and Performance Metric We evaluate the performance of our algorithm using the usual notion of regret: the expected treatment efficacy loss compared with the oracle allocation policy that has full knowledge of B, but not of the realizations of noise ). Let us first characterize this benchmark policy. Based on the model (1), the optimal treatment allocation, denoted by
, is defined as
Throughout the paper, denotes the optimal treat- ment allocation for the ith user at epoch t.
We now define regret of a policy. Let be the agent’s policy that allocates the treatment
to user
, and the choice of
may depend on the information up to decision epoch
. The worst-case regret is defined as:
where the expectation is with respect to the distribution of martingale difference noise, , and the distribution of feature vector,
. The notation Q(X) denotes a set of probability distributions with bounded support in X. We want to point out that Regret
corresponds to the usual cumulative expected regret in online learning when N = 1 and the empirical expected estimation error in batch learning when T = 1.
Rate of Optimal Allocation Condition Suppose the agent has collected a sample set for treatment w. The set
consists of both optimal and sub-optimal allocation. The optimal allocation subsample set is defined as
Intuitively, bigger size of improves the accuracy of statistical procedure, which is the prize of achieving optimal allocation. On the contrary, bigger size of
undermines the accuracy of statistical procedure, which is the price of misallocating users to any suboptimal treatment. The ratio
matters and is termed as the rate of optimal allocation of sample set
To gain a deeper insight of such balance into high-dimensional setting, consider the LASSO regression:
where -dimension response vector and X is a
covariate matrix.
Note that, if satisfies the compatability condi- tion with constant
, then
satisfies the compatability condition with constant
above observation suggests that certain balance between and
should be stricken during the decision process. We make this intuition precise for sample sets characterization by introducing the optimal
Definition 1. A sample set satisfies the rate r optimal allocation condition, if it satisfies both conditions
We briefly call ple set if
satisfies a optimal allocation condition of certain rate r. Note that the optimal allocation subsample set
is not directly observable. The rate r optimal allocation condition is to lower bound the size of
to ensure enough accuracy of LASSO estimator.
Our first contribution is a deviation inequality for LASSO regression based on the sample set collected in online batch decision making setting:
adapted LASSO) Given a rate r optimal allocation sample set follows the dependence structure of online batch decision making problem. If
then for any
, the oracle inequality holds that
Theorem 1 is a general version of LASSO deviation inequality for adapted observations (see Proposition 1 in Bastani and Bayati (2020)). Our contribution is to extend the deviation inequality for adapted sequence of batch observations with martingale difference noise.
In this section, we propose the Teamwork LASSO Bandit algorithm that runs in an interactive fashion: the agent switches between Teamwork mode (for pure exploration) and Selfish mode (for pure exploitation). The collection of decision epochs that the agent runs in teamwork mode and selfish mode are called teamwork stage and selfish stage, respectively. In the former stage, the agent allocates the current batch of users to a prescribed and possibly sub-optimal treatment for studying treatment efficacy. In the latter stage, the agent allocates each user in the current batch to her estimated optimal treatment for maximizing treatment efficacy. Epochs in teamwork and selfish stages are collected into sets , respectively.
3.1 Allocation in Teamwork mode
The teamwork stage T consists of exploration on all available treatments; that is, where
denotes the prescribed decision epochs for studying efficacy of treatment
. We defer the exact speci-fication of
to Section 4.1.
Our treatment allocation policy now runs in Teamwork mode that, in any decision epoch , the agent allocates the whole batch of users to the treatment
for all . Such an allocation results in a batch of covariate-efficacy pairs,
[N]}, which is aggregated into the teamwork sample set up to epoch t
3.2 Allocation in Selfish mode
The selfish stage deploys exploitation by allocating the current best-possible treatment to each user in the current batch. This is done by a two-step procedure: estimating candidates of personal optimal treatment and then committing selfish allocation. Such an allocation results in another batch of covariate-efficacy pairs,
. Then, for each treatment
, the set
is aggregated into the selfish sample set up to epoch t
Here we outline the above two-step procedure whose details are deferred to Section 4.2. At a selfish decision epoch , the agent first estimates
by using LASSO regression based on teamwork sample set
and the resulting estimator
is called Teamwork LASSO. The Teamwork LASSO screens out the sub-optimal treatments for each user
in current batch and then outputs a set of optimal treatment candidates
(See Eq.(11) for detailed description).
The agent then estimates by using LASSO regression based on full sample set
and the resulting estimator
is called All LASSO. The All LASSO then selects the treatment with highest expected efficacy for the user
Observe that by design of our interactive policy (see Figure 2), the agent maintains for each treatment w two different sample sets (teamwork and greedy). The independence in teamwork sample set is preserved by the pure exploration nature of teamwork stage, facilitating the subsequent analysis of LASSO estimate. However, the full sample set mixes independence from teamwork stage with the dependency arising in the selfish stage, complicating the subsequent analysis of LASSO estimate. Tackling such dependency requires a closer look at every epoch of decision making process.
4.1 Planing Teamwork Stage
One factor contributes to the success of our Teamwork-Selfish policy is the design of in teamwork mode (See eq.(3.1)). Here we explain our design of
reveal the quantity q that characterized the complexity of online batch decision making problem.
Recall that the collection of teamwork decision epochs up to epoch t is the set . Such set is a truncation to epoch range [t] of the collection of treatment
teamwork decision epoch
which is developed by the
The is defined as a prescribed range of decision epochs
(Refer to the blue block in Figure.1) Observe that each teamwork round prescribes q decision epochs for treatment pure exploration. The magnitude q characterizes the complexity of a online batch decision problem.
Our theoretical results suggests a lower bound of q by (See Eq.(10) for
). Such lower bound ensures both teamwork sample set and full sample set collected from the Teamwork-Selfish policy to satisfy the rate of optimal allocation condition (Definition 1) with high probability. Consequently, the Teamwork LASSO and ALL LASSO enjoy their statistical properties and the regret bound (2) can be guaranteed.
Remark that our lower bound on q is proportional to 1/N, where N is the batch size . Compare to the full adaptive setting in Bastani and Bay- ati (2020), we have . In terms of update frequency, LASSO Bandit requires
, while Teamwork LASSO Bandit requires
. In terms of regret, LASSO Bandit is of rate
Teamwork LASSO Bandit is of rate
there is a trade off between regret and update frequency.
Our design of teamwork stage is a generalization of the forced sampling for the two-arm and non-batch setting in Rusmevichientong and Tsitsiklis (2010) and for the K-arm and non-batch setting in Bastani and Bay- ati (2020). Our contribution is to redesign such type of policy when batch is the fundamental sampling unit to accommodate the restrictions from limited adaptivity.
4.2 Two Step Procedure in Selfish Stage
Another factor contributing to the success of our Teamwork-Selfish policy is the two step procedure to commit selfish allocation. In a selfish decision epoch t, the agent first estimates every treatment’s efficacy based on information up to decision epoch by
Figure 2: Teamwork-Selfish policy.
Teamwork LASSO to separate optimal treatments from suboptimal treatments. Then, the agent estimates candidate treatments’ efficacy by ALL LASSO
to identify the true optimal treatment for current user. In particular, the success of step 1 relies on a "good event“
which marks the case that every Teamwork LASSO are accurate enough to screen out suboptimal treatments.
Now we give a high-probability statement for access the performance of Step 1.
Lemma 1 is an immediate consequence of Corollary 1
Corollary 1. (Deviation inequality for Teamwork LASSO) For all treatments , if
for
, the Teamwork LASSO estimator satisfies the deviation inequality (set
where
Corollary 1 is an application of Theorem 1, given a deviation inequality of LASSO regression based on . As shown in lemma 7,
allocation sample set with probability at least
Step 1: Screen out Sub-optimal Treatments
Given a current user x, the agent’s objective at step 1 is to output a set of user’s potential optimal treatments. To construct such set of treatment candidates, we require that a candidate treatment should have its estimated treatment efficacy almost as good as the maximum of estimated treatment efficacy of all available treatments, as the Teamwork LASSO can tell.
Formulating such intuition motivates us to define the set of
The exact members in candidate set depend on the region that x belongs to. Recall that
denotes the optimal treatment of x. For the case
by definition of
. As shown in lemma 13, given the event
holds ,the candidate set
the optimal treatment of x, that is,
Therefore, the agent definitely has an optimal allocation for every user under the event
For the case , the user covariate x lies near a decision boundary
some comparable treatment
. As shown in lemma 12, the candidate set
the optimal treatment of x, that is,
but may contain other comparable treatments performs almost equally well as the optimal treatment
, under the event
Step 2: Commit Selfish Allocation The agent’s objective in step 2 is to commit a selfish allocation for current user x. Such selfish allocation is done by allocating user x to the treatment with highest estimated efficacy, as far as the All LASSO can tell:
An application of Theorem 1 gives a deviation inequality of LASSO regression based on
Corollary 2. (Deviation inequality for All LASSO) For treatments , if
, the All LASSO estimator satisfies the deviation inequality (Set
with probability at least
The following theorem bounds the regret of our Teamwork-Selfish treatment allocation policy.
Theorem 2. Suppose Assumptions 1, 2 and 3 hold. Then, the regret of the Teamwork-Selfish policy over decision horizon [T] satisfies
Below we provide a roadmap for the proof of Theorem 2. The proof is motivated by the regret analysis in Bastani and Bayati (2020). Our contribution is to generalize the approach for regret analysis in online batch decision making setting.
1. Regret Guarantee in Good and Bad Epochs
To reason about different sources of regret contribution, we decompose the decision process into four subcases (i-iv) so that we can examine each one independently:
As shown in lemma 14 in section F.3, the expected regret in case (iv) is guaranteed to be bounded by the function f(t) as
For this reason, we define case (iv) as good epochs
and then we interpret lemma 14 in section F.3 as
Outside the good epochs are bad epochs epoch, allocations cannot be guaranteed to be optimal, due to the pure exploration nature in teamwork mode or the insufficient accuracy of LASSO estimate in selfish mode, resulting in the worst-case regret guarantee
for cases (i-iii). We interpret the above fact as
.
Combine (16) and (17), the expected instantaneous regret is upper bounded by
Apply (18) to gain
Note that, by lemma 1, we have
We illustrate the trade-off between regret and update frequency by comparing the cumulative regret between LASSO Bandit algorithm (high update frequency) and our Teamwork LASSO Bandit algorithm (low update frequency) in Figure 3 ( Appendix, section H). Here we give remarks on the experiment:
1. In terms of the number of updates, say case q=1, LASSO Bandit (N=1) (high update frequency) requires updates, Teamwork LASSO Bandit (low update frequency) requires
1224 updates for N=4 case and
updates for N=12 case. Note that both give comparable regrets.
2. If the length of exploration phase q is sufficiently large, high update frequency algorithm has lower cumulative regret than low update frequency algorithm; if q is not sufficiently large, low update frequency algorithm outperforms high update frequency algorithm.
3. In general, the performance of high update frequency algorithm has higher variance than the performance of low update frequency algorithm. In particular, the performance of high update frequency algorithm is more sensitive to the increase in covariate dimension than our low update frequency algorithm.
In conclusion, high update frequency algorithm (Lasso Bandit) do have lower cumulative regret than low update frequency algorithm ( Teamwork Lasso Bandit) if the length of exploration phase q is sufficiently large. However, it is hard to determine how large of q can be thought of as being sufficiently large in practice. On the other hand, low update frequency algorithm is immune from such a concern in the sense that we can simply set q=1 when we have a batch of new samples at every decision epoch.
We have proposed a framework to address batch-version explore-exploit dilemma in the setting of online batch decision making with high dimensional covariate. In terms of regret analysis, we formulate the rate of optimal allocation condition on the collected sample set to characterize the underlying constraint behind the data collection scheme and to serve as a guideline for designing policy in bandit algorithms. Based on the rate of optimal allocation condition, we propose the Teamwork LASSO Bandit algorithm for sequential decision making. In theory, the cumulative total regret of the Teamwork LASSO Bandit algorithm of constant batch size N over finite time horizon T is shown to be bounded by . In terms of observed covariate dimension p and sparsity parameter
, the cumulative total regret of the Teamwork LASSO Bandit algorithm grows as
In the end, we highlight a few particularly relevant questions that are left as future works. The first one is the minimax lower bound of regret over all possible algorithms solving batched bandit problem with covariates. In one pull situation, Rusmevichientong and Tsitsiklis (2010) showed the lower bound is O(log T). Recently, Wang et al. (2018) showed that the regret of Bastani and Bayati (2020) can be reduced from O(log T) by invoking the Minimax Concave Penalized (MCP) penalty. Hence, the MCP-Bandit algorithm matches the oracle policy with high probability. We expect the regret of Teamwork LASSO Bandit algorithm can also be reduced from
to O(N log T) if MCP penalty instead of the lasso one is adopted, although whether this rate matches the theoretical lower bound remains unknown. The second one is more relevant to practice: can we design an effective teamwork strategy when batch size is non-constant? In Grover et al. (2018), the authors have proposed four different kinds of delayed feedback mechanism that frequently happen in online advertising context, which may lead to non-constant batch size in our setting. When we are performing batch update in the above delayed feedback scenario, is there a guideline for algorithm design? In particular, can the rate of optimal allocation condition be extended to handle delayed feedback situation?
Chi-Hua Wang thanks Yang Yu (Purdue University) for discussion and Zhanyu Wang (Purdue University) for help on experiment. Guang Cheng would like to acknowledge support by NSF DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research (ONR N00014- 18-2759). While completing this work, Guang Cheng was a member of Institute for Advanced Study, Princeton and visiting Fellow of SAMSI for the Deep Learning Program in the Fall of 2019; he would like to thank both Institutes for their hospitality.
Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9.
Ahuja, V. and Birge, J. R. (2016). Response-adaptive designs for clinical trials: Simultaneous learning from multiple patients. European Journal of Operational Research, 248(2):619–633.
Alon, N. and Spencer, J. H. (2004). The probabilistic method. John Wiley & Sons.
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. (2019). Provably efficient q-learning with low switching cost. arXiv preprint arXiv:1905.12849.
Bastani, H. and Bayati, M. (2020). Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294.
Bertsimas, D. and Mersereau, A. J. (2007). A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135.
Bolton, P. and Harris, C. (1999). Strategic experimentation. Econometrica, 67(2):349–374.
Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
Dimakopoulou, M., Athey, S., and Imbens, G. (2017). Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077.
Gao, Z., Han, Y., Ren, Z., and Zhou, Z. (2019). Batched multi-armed bandits problem. arXiv preprint arXiv:1904.01763.
Goldenshluger, A. and Zeevi, A. (2013). A linear response bandit problem. Stochastic Systems, 3(1):230– 261.
Grover, A., Markov, T., Attia, P., Jin, N., Perkins, N., Cheong, B., Chen, M., Yang, Z., Harris, S., Chueh, W., et al. (2018). Best arm identification in multi-armed bandits with delayed feedback. arXiv preprint arXiv:1803.10937.
Jun, K.-S., Jamieson, K. G., Nowak, R. D., and Zhu, X. (2016). Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pages 139–148.
Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670.
Perchet, V., Rigollet, P., Chassang, S., Snowberg, E., et al. (2016). Batched bandit problems. The Annals of Statistics, 44(2):660–681.
Perchet, V., Rigollet, P., et al. (2013). The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721.
Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Lin- early parameterized bandits. Mathematics of Operations Research, 35(2):395–411.
Tsybakov, A. B. et al. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166.
Wang, X., Wei, M., and Yao, T. (2018). Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, pages 5200–5208.
Supplement to “Online Batch Decision-Making with High-Dimensional Covariates”
Proof of Theorem 1. Based on standard arguments in high dimensional statistics, the Template LASSO on when choosing
where which is a high-probability event by carefully choosing the threshold
stated in the following lemma:
Lemma 2. Given a sample set and choose
Proof. See Section B.1.
Besides, the sample covariance matrix of the template sample set satisfies the compatibility condition with high probability, as stated in the following lemma:
Lemma 3. Given a sample set that satisfies rate r optimal allocation condition. Then
Proof. See Section B.2.
Now, lemma 2 and lemma 3 together turn the equation (20) into
Then Theorem 1 follows by solving from the condition
B.1 Proof of Lemma 2
Lemma 2 The event
holds with probability at least by choosing
Proof. Let denote the number of users allocated to treatment
at the sth decision epoch. The sample collected at epoch s is denoted by
is the jth column of covariate matrix X and the good event
With the help of union bound, we have
The quantity actually has sub-Gaussian tail. To see this, define the filtration
From the tower property of expectation, independence between , sub-Gaussian assumption on
, bounded assumption on covariates
The above result gives us a bound on the moment generating function of
We find -sub-Gaussian. The tail probability bound of sub-Gaussian distribution gives
Now, to reformat this into a desired tail probability form, we note
The above suggests us to choose
B.2 Proof of Lemma 3
Lemma 3 Given a sample set satisfying template condition with rate r. Then
Proof. From our population assumption, the population covariance matrix satisfies the compatability condition
By carefully controlling
and
, one could first show
, which implies
with high probability (by using Corollary 6.8 in page 152 of Bühlmann and Van De Geer (2011)). Next, by estimating an upper bound of the quadratic form induced by the covariance matrix of sample set
, we can show, with high probability, that
C.1 Basic Inequality
Lemma 4. (Basic Ineuality from Optimality Condition) In LASSO,
Proof. To perform optimality analysis, we play with true beta and empirical minimizer
(Short hand of
). From the argument min, we start with
Direct calculation gives us
Lemma 5. (Basic Inequality on Good Event) On good event F and with , the basic inequality can be
further reduced to
Proof. Recall the basic inequality
Multiply it by 2 to get
Plug in the upper bound to get
Then on good event F, it becomes
To further reduce equation (52), we play with sparsity component. Let denote the sparsity location of truth
On the other hand, the empirical minimizer only has identity
Combine these two observations into equation (52), we have inequality
Reorganize them into the inequality
Lemma 6. (Compatibility passes norm to square root of
On good event , and compatability condition associate with gram matrix
Proof. To further reduce the basic inequality on good event (48), we impose condition on sparsity component In lemma 5, two quantities we play with are
An implication of equation (48) is, on good event F, it is true that
On such class, the compatability condition with Gram matrix
Thus, we have
C.2 Static Oracle Inequality
Theorem 3. (Oracle Inequality of LASSO minimizor)
On good event , and compatability condition associate with gram matrix
Proof. Plus both side of basic inequality on good event (48) an addition term
Now we show two types of sample set–teamwork sample set and all sample set-produced from our proposed data collection protocol both satisfies the template condition.
The following lemmas are used to prove template condition of teamwork sample set and all sample set(lemma 7 and lemma 10).
D.1 Teamwork Sample Set
Lemma 7. For any decision epoch , the teamwork sample set for treatment
up to time t,
a template sample set of rate
, with probability at least
Proof of Lemma 7. To check (i): Lemma 8,
To check (ii): Lemma 9 shows that, for
Lemma 8. (Size of Teamwork Sample Set) If
At , we have finished round
teamwork stage for arm k, each of size Nq, therefore
With this, our task becomes to derive the lower bound and upper bound for n and n + 1 in terms of log t by using the condition
which means
Use condition
On the other hand, we have
Proof. Apply in Alon and Spencer (2004) to the indicator random variable
D.2 All Sample Set
We set in this subsection.
Lemma 10. For any decision epoch , the all sample set for treatment
up to
, is a template
sample set of rate , with probability at least
Proof of Lemma 10 To check (i): Lemma 8,
To check (ii): Lemma 11 shows that, for
Lemma 11.
Proof. We start from noting the fact that the all sample set for treatment , can have at most t elements up to time t (
To handle RHS, we note that the size of admits a representation
The strategy to utilize such representation is first to construct a martingale difference sequence and then apply Azuma’s inequality to attain desired result.
First, all samples been collected in are optimal allocation in selfish stage given good event happens. Thus, whether a sample
belongs to
has a representation
Recall that samples in also satisfies model assumption and hence can be written as
the sigma algebra generated by the first
rows of the design matrix X and the first N(s) entries of the noise vector
. With this,
measurable;
measurable and independent of
is deterministic by planning of teamwork stage. Follow the Doob’s martingale construction, define
for all . The resulting sequence
is a martingale adapted to the filtration
The desired martingale differences is thus
Now since the martingale differences are bounded by
, the Azuma’s inequality, (see. Theorem 7.2.1 from Alon and Spencer 1992), to obtain for all
Now a lower bound for expected size of follows from adopted policy that
where the last inequality from the definition of constant . Thus, taking
We show the properties of for
and for
of a available treatment w. In words, for any given observed covariate
, Teamwork LASSO excludes those sub-optimal treatment of x up to tolerance level h. If
, then Teamwork excludes all treatment other than the optimal treatment of x. Therefore, the probability of random covariate X belongs to
F.1 Proof of lemma 12
Lemma 12. (For ) Suppose the (
)th decision epoch is in selfish stage and event
holds. Then for each available treatment
and any possible observed covariate
, the estimated optimal treatment candidate set contains the optimal treatment of
and no any sub-optimal treatment
Proof. First, we show . Note at the tth decision epoch, the optimal treatment suggested by Teamwork LASSO is
holds, it implies
for all available treatment w, which includes
where the last inequality is from the definition of
Second, we show . Since
holds, it implies
for all available treatment w, which includes
where the last inequality is from the definition of
F.2 Proof of lemma 13
) Suppose the (
)th decision epoch is in selfish stage and event
holds. Then for each available treatment
, if a observed covariate x belongs to
, then the estimated optimal treatment candidate set contains only treatment
Proof. For every treatment other than
, definition of
holds, it implies
. Combine them to obtain, for every treatment
other than
That is, for every treatment other than
Therefore, by construction of optimal treatment candidate set, we conclude
Lemma 14.
Proof. Without loss of generality, for a observed covariate vector of the ith user at the tth decision epoch, assume
is the optimal treatment, that is
. First, we note that the instantaneous regret occurs if we allocate treatment other than
to covariate x. This happens when
for some treatments w. This observation suggests
Second, to handle RHS, define a function consider the set
The boundness assumption on observed covariate x and efficacy parameter suggests
for all
; the definition of set
. This observation suggests
Third, we handle equation (110) and (111). We note the marginal condition implies
Based on this observation, we have
Last, combine above results and take
as desired.
F.4 Full Regret Bound–Proof of Theorem 2
The regret can be bounded by:
Here we list the constants that appear in the proof.
•
•
•
•
•
•
In Figure. 3, we compare our Teamwork LASSO Bandit with batch size N = 4 and N = 12 to the LASSO Bandit in Bastani and Bayati (2020). In the attached plot, covariate dimension d = 200, 500 and 1000, number of treatments (arms) K =3, the length of exploration phase q = 1,2,3,4,5,6 with a total number of decisions 5000. N is the batch size, where N=1 corresponds to LASSO Bandit and N=4, 12 corresponds our Teamwork LASSO Bandit. We run 100 replications for each setting.
Remark on cumulative regret and covariate vector dimension. In the experiment, we increase the covariate vector dimension from 200, 500 to 1000. The performance of high update frequency algorithm is more sensitive to the increase in covariate dimension than our low update frequency algorithm.
Remark on the length of exploration phase q. In real world practice, the length of exploration phase q is pre-specified and then an explore-exploitation policy follows the choice of q. Given the same total number of decisions, it is often the case that one prefers a smaller value of q, which means fewer regret from exploration and is more time efficient in the sense that more rounds of explore-exploit can be done.
Figure 3: Comparison of our Teamwork LASSO Bandit with batch size N = 4 and N = 12 to the LASSO Bandit in Bastani and Bayati (2020). The error bars represent the maximum and minimum of the regret among 100 replications.