Online Batch Decision-Making with High-Dimensional Covariates

2020·Arxiv

Abstract

Abstract

We propose and investigate a class of new algorithms for sequential decision making that interacts with a batch of users simultaneously instead of a user at each decision epoch. This type of batch models is motivated by interactive marketing and clinical trial, where a group of people are treated simultaneously and the outcomes of the whole group are collected before the next stage of decision. In such a scenario, our goal is to allocate a batch of treatments to maximize treatment efficacy based on observed high-dimensional user covariates. We deliver a solution, named Teamwork LASSO Bandit algorithm, that resolves a batch version of explore-exploit dilemma via switching between teamwork stage and selfish stage during the whole decision process. This is made possible based on statistical properties of LASSO estimate of treatment efficacy that adapts to a sequence of batch observations. In general, a rate of optimal allocation condition is proposed to delineate the exploration and exploitation trade-off on the data collection scheme, which is sufficient for LASSO to identify the optimal treatment for observed user covariates. An upper bound on expected cumulative regret of the proposed algorithm is provided.

1 Introduction

We consider a high-dimensional online batch decision making problem, a setting in which the decision-maker must interacts with a group of users, instead of a user, at each decision epoch. Such setting arises very naturally in real-world applications but received less attention in the literature. In interactive marketing Bertsi-

Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

mas and Mersereau (2007), marketers choose among a set of marketing messages to be sent to several customers simultaneously; updating marketing strategy once receiving a feedback is computationally impractical, and a more practical approach is to aggregate data in a prescribed length of period before adopting new strategy. In clinical trials Ahuja and Birge (2016), physicians choose among a set of available treatments to be administered to a group of patients simultaneously; updating treatment policy once measuring a response is unrealistic, and therapy in real practice is to collect data in a pre-approved length of period before designing new policy. In real-world practice, applicability of online decision-making methodology turns out to be impeded by such limited adaptivity. Similar issue has been addressed by Bai et al. (2019) in reinforcement learning setting, but it remains open in the setting of online decision making under bandit feedback.

One feature of modern economy digitization for decision makers is to deliver personalized products, services, and solutions based on individual-level data. Additionally, in many practical settings, individual-level data are high-dimensional but typically only a small number of the observed covariates are decisive. An additional layer of complexity in online batch decision-making is that, the whole decision making process is learned from bandit feedback: decision makers only observe the outcome for the product that was delivered, but not for any other available products that could have been delivered. Taking heterogeneity and bandit feedback into account, approaches of approximate dynamic programming or Markov decision process in Ahuja and Birge (2016) addressed the dimensionality issue by transforming the covariate space into a finite number of states and then solving the corresponding Bellman equation. However, how to construct such a transformation is often unspecified in this line of approach and particularly ambiguous in the case of high-dimensional covariate space, impeding itself to embrace the blessing of modern economy digitization.

In this paper, we propose a new class of approaches, named Teamwork LASSO Bandit algorithm, for online

Figure 1: A realization of proposed Teamwork-Selfish policy in online batch decision making. N is the batch size. K is the number of available treatments. q is the number of repetition in a block. The agent runs teamwork mode in blue blocks and runs selfish mode in red blocks. The samples collected from blue blocks is called Teamwork sample. The samples collected from red blocks is called Selfish sample.

batch decision-making to tackle the curse of heterogeneity and high-dimensional covariate space under bandit feedback. To achieve both goals simultaneously, a deliberate balance between exploration and exploitation is required to efficiently learn a personalize decision-making rule that maximizes the cumulative treatment efficacy. In particular, in Teamwork LASSO Bandit algorithm, the agent switches between teamwork mode and selfish mode during the whole decision process to alternatively perform pure exploration and pure exploitation. By the proposed Teamwork-Selfish policy, the resulting sample set strikes a balance on optimal allocation in the sense that the rate of optimal allocation is beyond certain level for each treatment (see Defini-tion 1) with high probability. This strategy ensures that the LASSO regression accommodates dependency arising in a sequence of batch observations and enjoys certain statistical properties in order to achieve optimal allocation for each incoming user.

Contributions. Our contribution is three-fold. First, we propose the “Teamwork LASSO Bandit algorithm,” that solves the online batch decision-making problem with high-dimensional user covariates via learning LASSO estimates of treatment efficacy. Second, we propose a general “optimal allocation rate condition” on sample set as a theoretical guideline in designing a data collection scheme to identify the optimal treatment given observed user covariates. Such a scheme is illustrated by the proposed Teamwork-Selfish policy. Last, we establish an upper bound of the expected cumulative regret that scales linearly with the batch size. As a technical by-product, we develop a deviation inequality of LASSO regression that adaptes to a sequence of batch observations. To our knowledge, this is the first work addressing batch version of explore-exploitation dilemma in the setting of online batch decision-making.

Related works Batched bandit problem has received less research attention over the last decade. The special case of two-armed bandit setting has been addressed in Perchet et al. (2016); recently, Gao et al. (2019) extends the setting to K-armed bandit. While these works addressed the problem by mean model, we addressed batched bandit problem with high-dimensional linear model. In non-batch setting, Rusmevichientong and Tsitsiklis (2010) addressed the bandit problem by linear model in low dimensional setting; Bastani and Bayati (2020) then extends the setting to high-dimensional linear model. Comparing to these works, our model extends such setting to batched bandit problem.

A toy example for illustration

Consider the simplest scenario where the agent allocates one of two available treatments to each user in a size-2 batch at every decision epoch (That is, N = 2, K = 2 in Figure 1). In this case, each block contains 2q epochs and each epoch has two user. In the blue block, the agent runs in a teamwork mode: it sends both users to the first treatment at the first q epochs and to the second treatment at the second q epochs; in the red block, the agent runs in a selfish mode: it sends each user to her estimated optimal treatment based on all historical information over all 2q epochs. The agent runs in a teamwork mode (blue) on the block whose number is of power of 2, that is, and runs in a selfish mode on the red block. The samples collected from blue blocks are called Teamwork sample; the samples collected from red blocks are called Selfish sample. More specifically, in red block, the agent runs a selfish mode by a two-step procedure: the first step is to run LASSO regression on Teamwork sample to output a set of optimal treatment candidates for the current user; the second step is to run LASSO regression on the union of Teamwork sample and Selfish sample to decide the optimal treatment for the current user.

Notation For any positive integer n, define [n] = and for any and denotes the number of elements for any collection A. For any vector the indexes of non-zero coordinate. For a vector u = denotes the maximum absolute value of its entries; the

2 Online Batch Decision Making

Treatment efficacy model We consider an agent, who needs to allocate one of K different available treatments (arms), denoted as , for a size-N batch of users in each decision epoch where T denotes the length of the decision epoch horizon. The users in the current batch are represented by N observable vectors of features (covariates) . We assume that feature vectors may vary across decision epoch and are sampled independently from a fixed, but a priori unknown distribution with bounded support X.

The efficacy of treatment w for the ith user (feedback) at epoch , where the function y is unknown. At each decision epoch t, the agent allocates a batch of treatments , and then observes a batch of efficacy . The objective is to design an allocation policy, which maps users to their own treatments, that maximizes the cumulative sum of observed efficacy.

We assume that the efficacy of treatment w for user is a linear function of her covariates

where are martingale difference noise. At each epoch ’s are drawn independently from a mean zero -sub-Gaussian distribution (that is, for all real ) and forms a martingale difference sequence (that is, ). The noise is often introduced to account for the features that are not included in the model.

Efficacy parameters are a prior unknown to agent. Therefore, the agent deals with exploration-exploitation trade-off as it needs to choose between learning B and exploiting what has been learned so far to maximize treatment efficacy.

Our proposed algorithm exploits the structure (sparsity) of the feature space to improve its performance. For this purpose, let denote the number of nonzero coordinates of . Define and note that is a priori unknown to the agent.

Technical assumptions For ease of presentation, we assume that , for all , and for a known constant b.

We denote by the set of feasible parameters, that is,

and we write

To measure the performance of proposed bandit algorithm, we present three technical assumptions.

Assumption 1. (Margin Condition) There exists a constant such that for in W, P(0 <

Assumption 1 is referred to the Margin Condition in the classification literature Tsybakov et al. (2004) and is introduced in multi-armed linear bandit literature to ensure only a small fraction of features can be drawn near the classification boundary 0} in which efficacy of both treatments are almost equivalent; see Rusmevichientong and Tsitsiklis (2010), Bastani and Bayati (2020).

Assumption 2. (Treatment Optimality Condition) There exist some constant h > 0 and two mutually exclusive sets, denoted as and with W =

(a) For each treatment , it holds for every covariate vector

(b) For each treatment , there exists a constant

Assumption 2 is referred to the Treatment Optimality Condition in Rusmevichientong and Tsitsiklis (2010), Bastani and Bayati (2020) and is to separate available treatments into an optimal subset and a sub-optimal subset such that every optimal treatment is strictly optimal for some users (denoted by the set ) and every sub-optimal treatment is strictly sub-optimal for every users.

Assumption 3. (Compatibility Condition) There exists a constant such that for each optimal treatment , the population covariance matrix belongs to the compatibility set of its treatment efficacy parameter

where is defined as

Assumption 3 is referred to as the Compatibility Condition in high-dimensional statistics literature Bühlmann and Van De Geer (2011) and is to ensure that LASSO estimate trained on samples converges to the true parameter with high probability as the number of samples grows to infinity.

Oracle Policy and Performance Metric We evaluate the performance of our algorithm using the usual notion of regret: the expected treatment efficacy loss compared with the oracle allocation policy that has full knowledge of B, but not of the realizations of noise ). Let us first characterize this benchmark policy. Based on the model (1), the optimal treatment allocation, denoted by , is defined as

Throughout the paper, denotes the optimal treat- ment allocation for the ith user at epoch t.

We now define regret of a policy. Let be the agent’s policy that allocates the treatment to user , and the choice of may depend on the information up to decision epoch . The worst-case regret is defined as:

where the expectation is with respect to the distribution of martingale difference noise, , and the distribution of feature vector, . The notation Q(X) denotes a set of probability distributions with bounded support in X. We want to point out that Regretcorresponds to the usual cumulative expected regret in online learning when N = 1 and the empirical expected estimation error in batch learning when T = 1.

Rate of Optimal Allocation Condition Suppose the agent has collected a sample set for treatment w. The set consists of both optimal and sub-optimal allocation. The optimal allocation subsample set is defined as

Intuitively, bigger size of improves the accuracy of statistical procedure, which is the prize of achieving optimal allocation. On the contrary, bigger size of undermines the accuracy of statistical procedure, which is the price of misallocating users to any suboptimal treatment. The ratio matters and is termed as the rate of optimal allocation of sample set

To gain a deeper insight of such balance into high-dimensional setting, consider the LASSO regression:

where -dimension response vector and X is a covariate matrix.

Note that, if satisfies the compatability condi- tion with constant , then satisfies the compatability condition with constant

above observation suggests that certain balance between and should be stricken during the decision process. We make this intuition precise for sample sets characterization by introducing the optimal

Definition 1. A sample set satisfies the rate r optimal allocation condition, if it satisfies both conditions

We briefly call ple set if satisfies a optimal allocation condition of certain rate r. Note that the optimal allocation subsample set is not directly observable. The rate r optimal allocation condition is to lower bound the size of to ensure enough accuracy of LASSO estimator.

Our first contribution is a deviation inequality for LASSO regression based on the sample set collected in online batch decision making setting:

adapted LASSO) Given a rate r optimal allocation sample set follows the dependence structure of online batch decision making problem. If then for any , the oracle inequality holds that

Theorem 1 is a general version of LASSO deviation inequality for adapted observations (see Proposition 1 in Bastani and Bayati (2020)). Our contribution is to extend the deviation inequality for adapted sequence of batch observations with martingale difference noise.

3 Teamwork LASSO Bandit Algorithm

In this section, we propose the Teamwork LASSO Bandit algorithm that runs in an interactive fashion: the agent switches between Teamwork mode (for pure exploration) and Selfish mode (for pure exploitation). The collection of decision epochs that the agent runs in teamwork mode and selfish mode are called teamwork stage and selfish stage, respectively. In the former stage, the agent allocates the current batch of users to a prescribed and possibly sub-optimal treatment for studying treatment efficacy. In the latter stage, the agent allocates each user in the current batch to her estimated optimal treatment for maximizing treatment efficacy. Epochs in teamwork and selfish stages are collected into sets , respectively.

3.1 Allocation in Teamwork mode

The teamwork stage T consists of exploration on all available treatments; that is, where denotes the prescribed decision epochs for studying efficacy of treatment . We defer the exact speci-fication of to Section 4.1.

Our treatment allocation policy now runs in Teamwork mode that, in any decision epoch , the agent allocates the whole batch of users to the treatment

for all . Such an allocation results in a batch of covariate-efficacy pairs, [N]}, which is aggregated into the teamwork sample set up to epoch t

3.2 Allocation in Selfish mode

The selfish stage deploys exploitation by allocating the current best-possible treatment to each user in the current batch. This is done by a two-step procedure: estimating candidates of personal optimal treatment and then committing selfish allocation. Such an allocation results in another batch of covariate-efficacy pairs, . Then, for each treatment , the set is aggregated into the selfish sample set up to epoch t

Here we outline the above two-step procedure whose details are deferred to Section 4.2. At a selfish decision epoch , the agent first estimates by using LASSO regression based on teamwork sample set and the resulting estimator is called Teamwork LASSO. The Teamwork LASSO screens out the sub-optimal treatments for each user in current batch and then outputs a set of optimal treatment candidates (See Eq.(11) for detailed description).

The agent then estimates by using LASSO regression based on full sample set and the resulting estimator is called All LASSO. The All LASSO then selects the treatment with highest expected efficacy for the user

Observe that by design of our interactive policy (see Figure 2), the agent maintains for each treatment w two different sample sets (teamwork and greedy). The independence in teamwork sample set is preserved by the pure exploration nature of teamwork stage, facilitating the subsequent analysis of LASSO estimate. However, the full sample set mixes independence from teamwork stage with the dependency arising in the selfish stage, complicating the subsequent analysis of LASSO estimate. Tackling such dependency requires a closer look at every epoch of decision making process.

4 A Teamwork-Selﬁsh Policy

4.1 Planing Teamwork Stage

One factor contributes to the success of our Teamwork-Selfish policy is the design of in teamwork mode (See eq.(3.1)). Here we explain our design of reveal the quantity q that characterized the complexity of online batch decision making problem.

Recall that the collection of teamwork decision epochs up to epoch t is the set . Such set is a truncation to epoch range [t] of the collection of treatment teamwork decision epoch which is developed by the

The is defined as a prescribed range of decision epochs

(Refer to the blue block in Figure.1) Observe that each teamwork round prescribes q decision epochs for treatment pure exploration. The magnitude q characterizes the complexity of a online batch decision problem.

Our theoretical results suggests a lower bound of q by (See Eq.(10) for ). Such lower bound ensures both teamwork sample set and full sample set collected from the Teamwork-Selfish policy to satisfy the rate of optimal allocation condition (Definition 1) with high probability. Consequently, the Teamwork LASSO and ALL LASSO enjoy their statistical properties and the regret bound (2) can be guaranteed.

Remark that our lower bound on q is proportional to 1/N, where N is the batch size . Compare to the full adaptive setting in Bastani and Bay- ati (2020), we have . In terms of update frequency, LASSO Bandit requires , while Teamwork LASSO Bandit requires . In terms of regret, LASSO Bandit is of rate Teamwork LASSO Bandit is of rate there is a trade off between regret and update frequency.

Our design of teamwork stage is a generalization of the forced sampling for the two-arm and non-batch setting in Rusmevichientong and Tsitsiklis (2010) and for the K-arm and non-batch setting in Bastani and Bay- ati (2020). Our contribution is to redesign such type of policy when batch is the fundamental sampling unit to accommodate the restrictions from limited adaptivity.

4.2 Two Step Procedure in Selfish Stage

Another factor contributing to the success of our Teamwork-Selfish policy is the two step procedure to commit selfish allocation. In a selfish decision epoch t, the agent first estimates every treatment’s efficacy based on information up to decision epoch by

Figure 2: Teamwork-Selfish policy.

Teamwork LASSO to separate optimal treatments from suboptimal treatments. Then, the agent estimates candidate treatments’ efficacy by ALL LASSO to identify the true optimal treatment for current user. In particular, the success of step 1 relies on a "good event“

which marks the case that every Teamwork LASSO are accurate enough to screen out suboptimal treatments.

Now we give a high-probability statement for access the performance of Step 1.

Lemma 1 is an immediate consequence of Corollary 1

Corollary 1. (Deviation inequality for Teamwork LASSO) For all treatments , if for , the Teamwork LASSO estimator satisfies the deviation inequality (set

where

Corollary 1 is an application of Theorem 1, given a deviation inequality of LASSO regression based on . As shown in lemma 7, allocation sample set with probability at least

Step 1: Screen out Sub-optimal Treatments

Given a current user x, the agent’s objective at step 1 is to output a set of user’s potential optimal treatments. To construct such set of treatment candidates, we require that a candidate treatment should have its estimated treatment efficacy almost as good as the maximum of estimated treatment efficacy of all available treatments, as the Teamwork LASSO can tell.

Formulating such intuition motivates us to define the set of

The exact members in candidate set depend on the region that x belongs to. Recall that denotes the optimal treatment of x. For the case by definition of . As shown in lemma 13, given the event holds ,the candidate set the optimal treatment of x, that is,

Therefore, the agent definitely has an optimal allocation for every user under the event

For the case , the user covariate x lies near a decision boundary some comparable treatment . As shown in lemma 12, the candidate set the optimal treatment of x, that is,

but may contain other comparable treatments performs almost equally well as the optimal treatment , under the event

Step 2: Commit Selfish Allocation The agent’s objective in step 2 is to commit a selfish allocation for current user x. Such selfish allocation is done by allocating user x to the treatment with highest estimated efficacy, as far as the All LASSO can tell:

An application of Theorem 1 gives a deviation inequality of LASSO regression based on

Corollary 2. (Deviation inequality for All LASSO) For treatments , if , the All LASSO estimator satisfies the deviation inequality (Set

with probability at least

5 Regret Analysis

The following theorem bounds the regret of our Teamwork-Selfish treatment allocation policy.

Theorem 2. Suppose Assumptions 1, 2 and 3 hold. Then, the regret of the Teamwork-Selfish policy over decision horizon [T] satisfies

Below we provide a roadmap for the proof of Theorem 2. The proof is motivated by the regret analysis in Bastani and Bayati (2020). Our contribution is to generalize the approach for regret analysis in online batch decision making setting.

1. Regret Guarantee in Good and Bad Epochs

To reason about different sources of regret contribution, we decompose the decision process into four subcases (i-iv) so that we can examine each one independently:

As shown in lemma 14 in section F.3, the expected regret in case (iv) is guaranteed to be bounded by the function f(t) as

For this reason, we define case (iv) as good epochs

and then we interpret lemma 14 in section F.3 as

Outside the good epochs are bad epochs epoch, allocations cannot be guaranteed to be optimal, due to the pure exploration nature in teamwork mode or the insufficient accuracy of LASSO estimate in selfish mode, resulting in the worst-case regret guarantee for cases (i-iii). We interpret the above fact as

.

Combine (16) and (17), the expected instantaneous regret is upper bounded by

Apply (18) to gain

Note that, by lemma 1, we have

6 Experiment

We illustrate the trade-off between regret and update frequency by comparing the cumulative regret between LASSO Bandit algorithm (high update frequency) and our Teamwork LASSO Bandit algorithm (low update frequency) in Figure 3 ( Appendix, section H). Here we give remarks on the experiment:

1. In terms of the number of updates, say case q=1, LASSO Bandit (N=1) (high update frequency) requires updates, Teamwork LASSO Bandit (low update frequency) requires 1224 updates for N=4 case and updates for N=12 case. Note that both give comparable regrets.

2. If the length of exploration phase q is sufficiently large, high update frequency algorithm has lower cumulative regret than low update frequency algorithm; if q is not sufficiently large, low update frequency algorithm outperforms high update frequency algorithm.

3. In general, the performance of high update frequency algorithm has higher variance than the performance of low update frequency algorithm. In particular, the performance of high update frequency algorithm is more sensitive to the increase in covariate dimension than our low update frequency algorithm.

In conclusion, high update frequency algorithm (Lasso Bandit) do have lower cumulative regret than low update frequency algorithm ( Teamwork Lasso Bandit) if the length of exploration phase q is sufficiently large. However, it is hard to determine how large of q can be thought of as being sufficiently large in practice. On the other hand, low update frequency algorithm is immune from such a concern in the sense that we can simply set q=1 when we have a batch of new samples at every decision epoch.

7 Discussion and Conclusions

We have proposed a framework to address batch-version explore-exploit dilemma in the setting of online batch decision making with high dimensional covariate. In terms of regret analysis, we formulate the rate of optimal allocation condition on the collected sample set to characterize the underlying constraint behind the data collection scheme and to serve as a guideline for designing policy in bandit algorithms. Based on the rate of optimal allocation condition, we propose the Teamwork LASSO Bandit algorithm for sequential decision making. In theory, the cumulative total regret of the Teamwork LASSO Bandit algorithm of constant batch size N over finite time horizon T is shown to be bounded by . In terms of observed covariate dimension p and sparsity parameter , the cumulative total regret of the Teamwork LASSO Bandit algorithm grows as

In the end, we highlight a few particularly relevant questions that are left as future works. The first one is the minimax lower bound of regret over all possible algorithms solving batched bandit problem with covariates. In one pull situation, Rusmevichientong and Tsitsiklis (2010) showed the lower bound is O(log T). Recently, Wang et al. (2018) showed that the regret of Bastani and Bayati (2020) can be reduced from O(log T) by invoking the Minimax Concave Penalized (MCP) penalty. Hence, the MCP-Bandit algorithm matches the oracle policy with high probability. We expect the regret of Teamwork LASSO Bandit algorithm can also be reduced from to O(N log T) if MCP penalty instead of the lasso one is adopted, although whether this rate matches the theoretical lower bound remains unknown. The second one is more relevant to practice: can we design an effective teamwork strategy when batch size is non-constant? In Grover et al. (2018), the authors have proposed four different kinds of delayed feedback mechanism that frequently happen in online advertising context, which may lead to non-constant batch size in our setting. When we are performing batch update in the above delayed feedback scenario, is there a guideline for algorithm design? In particular, can the rate of optimal allocation condition be extended to handle delayed feedback situation?

Acknowledgements

Chi-Hua Wang thanks Yang Yu (Purdue University) for discussion and Zhanyu Wang (Purdue University) for help on experiment. Guang Cheng would like to acknowledge support by NSF DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research (ONR N00014- 18-2759). While completing this work, Guang Cheng was a member of Institute for Advanced Study, Princeton and visiting Fellow of SAMSI for the Deep Learning Program in the Fall of 2019; he would like to thank both Institutes for their hospitality.

References

Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9.

Ahuja, V. and Birge, J. R. (2016). Response-adaptive designs for clinical trials: Simultaneous learning from multiple patients. European Journal of Operational Research, 248(2):619–633.

Alon, N. and Spencer, J. H. (2004). The probabilistic method. John Wiley & Sons.

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.

Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. (2019). Provably efficient q-learning with low switching cost. arXiv preprint arXiv:1905.12849.

Bastani, H. and Bayati, M. (2020). Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294.

Bertsimas, D. and Mersereau, A. J. (2007). A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135.

Bolton, P. and Harris, C. (1999). Strategic experimentation. Econometrica, 67(2):349–374.

Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.

Dimakopoulou, M., Athey, S., and Imbens, G. (2017). Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077.

Gao, Z., Han, Y., Ren, Z., and Zhou, Z. (2019). Batched multi-armed bandits problem. arXiv preprint arXiv:1904.01763.

Goldenshluger, A. and Zeevi, A. (2013). A linear response bandit problem. Stochastic Systems, 3(1):230– 261.

Grover, A., Markov, T., Attia, P., Jin, N., Perkins, N., Cheong, B., Chen, M., Yang, Z., Harris, S., Chueh, W., et al. (2018). Best arm identification in multi-armed bandits with delayed feedback. arXiv preprint arXiv:1803.10937.

Jun, K.-S., Jamieson, K. G., Nowak, R. D., and Zhu, X. (2016). Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pages 139–148.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670.

Perchet, V., Rigollet, P., Chassang, S., Snowberg, E., et al. (2016). Batched bandit problems. The Annals of Statistics, 44(2):660–681.

Perchet, V., Rigollet, P., et al. (2013). The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721.

Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Lin- early parameterized bandits. Mathematics of Operations Research, 35(2):395–411.

Tsybakov, A. B. et al. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166.

Wang, X., Wei, M., and Yao, T. (2018). Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, pages 5200–5208.

Supplement to “Online Batch Decision-Making with High-Dimensional Covariates”

A Proof of Theorem 1

Proof of Theorem 1. Based on standard arguments in high dimensional statistics, the Template LASSO on when choosing

where which is a high-probability event by carefully choosing the threshold stated in the following lemma:

Lemma 2. Given a sample set and choose

Proof. See Section B.1.

Besides, the sample covariance matrix of the template sample set satisfies the compatibility condition with high probability, as stated in the following lemma:

Lemma 3. Given a sample set that satisfies rate r optimal allocation condition. Then

Proof. See Section B.2.

Now, lemma 2 and lemma 3 together turn the equation (20) into

Then Theorem 1 follows by solving from the condition

B Proof of key Lemmas

B.1 Proof of Lemma 2

Lemma 2 The event

holds with probability at least by choosing

Proof. Let denote the number of users allocated to treatment at the sth decision epoch. The sample collected at epoch s is denoted by is the jth column of covariate matrix X and the good event

With the help of union bound, we have

The quantity actually has sub-Gaussian tail. To see this, define the filtration

From the tower property of expectation, independence between , sub-Gaussian assumption on , bounded assumption on covariates

The above result gives us a bound on the moment generating function of

We find -sub-Gaussian. The tail probability bound of sub-Gaussian distribution gives

Now, to reformat this into a desired tail probability form, we note

The above suggests us to choose

B.2 Proof of Lemma 3

Lemma 3 Given a sample set satisfying template condition with rate r. Then

Proof. From our population assumption, the population covariance matrix satisfies the compatability condition By carefully controlling and , one could first show , which implies with high probability (by using Corollary 6.8 in page 152 of Bühlmann and Van De Geer (2011)). Next, by estimating an upper bound of the quadratic form induced by the covariance matrix of sample set , we can show, with high probability, that

C Theory of LASSO

C.1 Basic Inequality

Lemma 4. (Basic Ineuality from Optimality Condition) In LASSO,

Proof. To perform optimality analysis, we play with true beta and empirical minimizer (Short hand of ). From the argument min, we start with

Direct calculation gives us

Lemma 5. (Basic Inequality on Good Event) On good event F and with , the basic inequality can be

further reduced to

Proof. Recall the basic inequality

Multiply it by 2 to get

Plug in the upper bound to get

Then on good event F, it becomes

To further reduce equation (52), we play with sparsity component. Let denote the sparsity location of truth

On the other hand, the empirical minimizer only has identity

Combine these two observations into equation (52), we have inequality

Reorganize them into the inequality

Lemma 6. (Compatibility passes norm to square root of

On good event , and compatability condition associate with gram matrix

Proof. To further reduce the basic inequality on good event (48), we impose condition on sparsity component In lemma 5, two quantities we play with are

An implication of equation (48) is, on good event F, it is true that

On such class, the compatability condition with Gram matrix

Thus, we have

C.2 Static Oracle Inequality

Theorem 3. (Oracle Inequality of LASSO minimizor)

On good event , and compatability condition associate with gram matrix

Proof. Plus both side of basic inequality on good event (48) an addition term

D Checking Optimal Allocation Condition

Now we show two types of sample set–teamwork sample set and all sample set-produced from our proposed data collection protocol both satisfies the template condition.

The following lemmas are used to prove template condition of teamwork sample set and all sample set(lemma 7 and lemma 10).

D.1 Teamwork Sample Set

Lemma 7. For any decision epoch , the teamwork sample set for treatment up to time t, a template sample set of rate , with probability at least

Proof of Lemma 7. To check (i): Lemma 8,

To check (ii): Lemma 9 shows that, for

Lemma 8. (Size of Teamwork Sample Set) If

At , we have finished round teamwork stage for arm k, each of size Nq, therefore

With this, our task becomes to derive the lower bound and upper bound for n and n + 1 in terms of log t by using the condition

which means

Use condition

On the other hand, we have

Proof. Apply in Alon and Spencer (2004) to the indicator random variable

D.2 All Sample Set

We set in this subsection.

Lemma 10. For any decision epoch , the all sample set for treatment up to , is a template

sample set of rate , with probability at least

Proof of Lemma 10 To check (i): Lemma 8,

To check (ii): Lemma 11 shows that, for

Lemma 11.

Proof. We start from noting the fact that the all sample set for treatment , can have at most t elements up to time t (

To handle RHS, we note that the size of admits a representation

The strategy to utilize such representation is first to construct a martingale difference sequence and then apply Azuma’s inequality to attain desired result.

First, all samples been collected in are optimal allocation in selfish stage given good event happens. Thus, whether a sample belongs to has a representation

Recall that samples in also satisfies model assumption and hence can be written as the sigma algebra generated by the first rows of the design matrix X and the first N(s) entries of the noise vector . With this, measurable; measurable and independent of is deterministic by planning of teamwork stage. Follow the Doob’s martingale construction, define

for all . The resulting sequence is a martingale adapted to the filtration The desired martingale differences is thus

Now since the martingale differences are bounded by , the Azuma’s inequality, (see. Theorem 7.2.1 from Alon and Spencer 1992), to obtain for all

Now a lower bound for expected size of follows from adopted policy that

where the last inequality from the definition of constant . Thus, taking

E Deviation Inequalities of Teamwork LASSO and ALL LASSO

F Regret Analysis

We show the properties of for and for of a available treatment w. In words, for any given observed covariate , Teamwork LASSO excludes those sub-optimal treatment of x up to tolerance level h. If , then Teamwork excludes all treatment other than the optimal treatment of x. Therefore, the probability of random covariate X belongs to

F.1 Proof of lemma 12

Lemma 12. (For ) Suppose the ()th decision epoch is in selfish stage and event holds. Then for each available treatment and any possible observed covariate , the estimated optimal treatment candidate set contains the optimal treatment of and no any sub-optimal treatment

Proof. First, we show . Note at the tth decision epoch, the optimal treatment suggested by Teamwork LASSO is holds, it implies for all available treatment w, which includes

where the last inequality is from the definition of

Second, we show . Since holds, it implies for all available treatment w, which includes

where the last inequality is from the definition of

F.2 Proof of lemma 13

) Suppose the ()th decision epoch is in selfish stage and event holds. Then for each available treatment , if a observed covariate x belongs to , then the estimated optimal treatment candidate set contains only treatment

Proof. For every treatment other than , definition of holds, it implies . Combine them to obtain, for every treatment other than

That is, for every treatment other than

Therefore, by construction of optimal treatment candidate set, we conclude

Lemma 14.

Proof. Without loss of generality, for a observed covariate vector of the ith user at the tth decision epoch, assume is the optimal treatment, that is . First, we note that the instantaneous regret occurs if we allocate treatment other than to covariate x. This happens when for some treatments w. This observation suggests

Second, to handle RHS, define a function consider the set

The boundness assumption on observed covariate x and efficacy parameter suggests for all ; the definition of set . This observation suggests

Third, we handle equation (110) and (111). We note the marginal condition implies

Based on this observation, we have

Last, combine above results and take

as desired.

F.4 Full Regret Bound–Proof of Theorem 2

The regret can be bounded by:

G Constants

Here we list the constants that appear in the proof.

•

•

•

•

•

•

H Experiment

In Figure. 3, we compare our Teamwork LASSO Bandit with batch size N = 4 and N = 12 to the LASSO Bandit in Bastani and Bayati (2020). In the attached plot, covariate dimension d = 200, 500 and 1000, number of treatments (arms) K =3, the length of exploration phase q = 1,2,3,4,5,6 with a total number of decisions 5000. N is the batch size, where N=1 corresponds to LASSO Bandit and N=4, 12 corresponds our Teamwork LASSO Bandit. We run 100 replications for each setting.

Remark on cumulative regret and covariate vector dimension. In the experiment, we increase the covariate vector dimension from 200, 500 to 1000. The performance of high update frequency algorithm is more sensitive to the increase in covariate dimension than our low update frequency algorithm.

Remark on the length of exploration phase q. In real world practice, the length of exploration phase q is pre-specified and then an explore-exploitation policy follows the choice of q. Given the same total number of decisions, it is often the case that one prefers a smaller value of q, which means fewer regret from exploration and is more time efficient in the sense that more rounds of explore-exploit can be done.

Figure 3: Comparison of our Teamwork LASSO Bandit with batch size N = 4 and N = 12 to the LASSO Bandit in Bastani and Bayati (2020). The error bars represent the maximum and minimum of the regret among 100 replications.