On conditional versus marginal bias in multi-armed bandits

2020·Arxiv

Abstract

Abstract

The bias of the sample means of the arms in multi-armed bandits is an important issue in adaptive data analysis that has recently received considerable attention in the literature. Existing results relate in precise ways the sign and magnitude of the bias to various sources of data adaptivity, but do not apply to the conditional inference setting in which the sample means are computed only if some specific conditions are satisfied. In this paper, we characterize the sign of the conditional bias of monotone functions of the rewards, including the sample mean. Our results hold for arbitrary conditioning events and leverage natural monotonicity properties of the data collection policy. We further demonstrate, through several examples from sequential testing and best arm identification, that the sign of the conditional and marginal bias of the sample mean of an arm can be different, depending on the conditioning event. Our analysis offers new and interesting perspectives on the subtleties of assessing the bias in data adaptive settings.

1. Introduction

In modern data analysis, it is often the case that both the data collection and the selection of the target for statistical inference are determined adaptively, i.e. in a manner that depends on the realization of the data observed so far. In these fairly common data-adaptive scenarios, the validity of traditional procedures for statistical inference (that were designed to work in nonadaptive or in i.i.d. settings) can be severely affected.

This issue is perhaps best exemplified within the stochastic multi-armed bandit (MAB) framework (Robbins, 1952).

1Department of Statistics and Data Science, Carnegie Mellon University 2Machine Learning Department, Carnegie Mellon University. Correspondence to: Jaehyeok Shin <shinjaehyeok@cmu.edu>.

Proceedings of the International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

The data are collected sequentially in stages, during which the analyst draws a sample from one among finitely many unknown distributions or arms. At each stage, the determination of whether no more samples are to be acquired (adaptive stopping), the choice of the arm to draw from (adaptive sampling) and the selection of the target arm to analyze once the experiment has terminated (adaptive choosing) are all made based on previously observed data.

The combined effect of these sources of data adaptivity will impact the correctness of standard, nonadaptive statistical procedures in ways that are generally difficult to discern and to quantify. The recent literature on adaptive data analysis in MAB settings has focused primarily on the fundamental problem of mean estimation (Xu et al., 2013; Villar et al., 2015; Bowden & Trippa, 2017; Nie et al., 2018; Shin et al., 2019a). It is now known that the sample mean of an arm is a biased estimator of the true mean and that the sign and magnitude of the bias can be related to adaptive sampling, stopping and choosing in precise ways.

The current theoretical understanding of the bias in MAB experiments is, however limited in two aspects. First, virtually all results concern the bias in mean estimation and do not cover other functionals of the arms. Secondly, and perhaps more interestingly, the existing guarantees cover only the marginal bias of the sample mean, i.e., the bias obtained by accounting for all possible outcomes of the MAB experiment. However, in practice, one is often interested in the sample means of the arms only when certain outcomes have occurred. For instance, the analyst may wish to evaluate the sample mean of a given arm only when that arm was identified as the best arm or, in a sequential framework corresponding to a MAB experiment with only one arm, when the null hypothesis has been rejected or when the random criterion for determining whether enough samples have been collected has been met. In all these cases, it is of interest to compute the conditional bias, i.e., the bias of the sample mean given a certain conditioning event, such as that the arm of interest turned out to be the best arm. A priori, it is not at all clear how the sign of the conditional bias is affected by the choice of the conditioning event and by other sources of data adaptivity (e.g., sampling and stopping), or whether the signs of the marginal and conditional bias should match. See Figure 1 for illustrations of marginal and conditional biases.

Figure 1. Illustrations of marginal and conditional biases in the K prototypes example. (a) Marginal bias is obtained by accounting for all possible outcomes. (b) Conditional bias is obtained by accounting for outcomes related to a conditioning event.

As a concrete example, suppose we have K prototypes of an online service and wish to test whether the potential average revenue of each prototype (i.e., arm), , would be larger than a pre-specified threshold based on a stream of test user samples. The usual sequential testing approach will be based on an appropriate upper stopping boundary and will reject each null if the corresponding sample mean has crossed such boundary during the testing period. If the null is rejected, then we conclude the prototype as a promising one. If not, we disregard it. It is well known (see, e.g., Starr & Woodroofe, 1968; Shin et al., 2019a) that for each prototype, the sample mean based on data collected through this sequential testing procedure is positively biased: that is, for each regardless of whether the true mean is larger or smaller than the threshold . This positive bias result can provide a useful warning signal about possible overestimation of the true revenues. At the same time, however, without careful consideration of the conditioning effect, we may end up with a false sense of comfort with low sample mean estimates from “disregarded” prototypes. Indeed, we can naively expect the true revenues of the disregarded prototypes should be even lower than the observed estimates based on the positive bias result . In fact, this would be a wrong conclusion: conditioned on the disregarding event C, the sample mean is negatively biased and we have , as demonstrated in Section 4.1.

In this paper, we make the following contributions:

termining the sign of the conditional bias that hold under arbitrary conditioning events and rely on certain natural, highly interpretable monotonicity properties of the rules used for adaptive sampling, stopping and selecting. Our analysis captures in a rigorous and intuitive way the interaction between adaptivity arising from the data collection stage (sampling and stopping) with adaptivity in the choice of the target for inference.

• We characterize the sign of the conditional bias of monotone functions of the rewards of each arm, which includes the sample mean as a special case.

• In Section 4, we demonstrate with several examples in best arm identification and sequential testing problems how the marginal and conditional bias of the sample means of the arms can have opposite signs. These are, we believe, instances of a general, important phenomenon of theoretical and practical relevance.

Overall, our results advance our ability to assess the impact of the bias in adaptive data analysis problems and offer several new and interesting insights on this important issue.

The rest of this paper is organized as follows. In Section 2, we briefly formalize the three notions of adaptivity by introducing a stochastic MAB framework. Section 3 derives results on when the bias can be positive or negative. In Section 4, we demonstrate the correctness of our theoretical predictions through simulations in a variety of practical situations. We end with a brief summary in Section 5, and for reasons of space, we defer all proofs to the Appendix.

2. The Stochastic Multi-Armed Bandit Model

In this section we formally describe the MAB experiment setting, detailing the sources of data adaptivity. Let distributions of K arms with finite means . If it is clear from context, every inequality and equality between two random variables is understood in the almost sure sense.

2.1. Three Sources of Adaptivity in MABs

A general stochastic MAB algorithm consists of the following sampling, stopping, and choosing rules:

• Sampling: For each time , choose a possibly randomized index of arm from a multinomial distribution . Then, we draw a sample (also called a reward) from the chosen arm refers to conditional probabilities of sampling each arm defined by

with the constraint , and refers to observed data up to time including all external randomness. If each does not depend on the previous data , we call it a nonadaptive sampling rule.

With a proper choice of , this sampling rule captures a broad class of existing methods including fixed design, random allocation, -greedy (Sutton & Barto, 1998), upper confidence bound (UCB) algorithms (Auer et al., 2002; Audibert & Bubeck, 2009; Garivier & Capp´e, 2011; Kalyanakrishnan et al., 2012; Jamieson et al., 2014) and Thompson sampling (Thompson, 1933; Agrawal & Goyal, 2012; Kaufmann et al., 2012).

• Stopping: Given a sampling rule, let be a filtra-tion such that each -measurable and each and are -measurable. Let T be a stopping time with respect to the filtration . Then, for each time t, the stopping event -measurable which can be used to characterize a stopping rule of a MAB algorithm.

For example, the following stopping time corresponds to the stopping rule in which we stop whenever the absolute difference between sample means for arm 1 and arm 2 exceeding a fixed threshold

where and are the running sum and number of draws for arm k defined respectively and for each and .1 Note that, by the construction, are -measurable and is -measurable for each

If a stopping rule is nonadaptive, that is, if the corresponding stopping time is -measurable, we use T instead of T to refer the stopping time which includes all fixed time and, more generally, all data-independent stopping rules.

• Choosing: After stopping, let denote collected data up to the stopping time T . Based on with possible randomization, we choose a data-dependent arm based on a choosing rule . In this paper, we denote the index of chosen arm for simplicity.

For example, in the best arm identification, we often choose the best arm as the arm with the largest sample mean at the stopping time T . In this case, the

choosing rule is given as does not depend on the observed data, we call it a nonadaptive choosing rule.

2.2. The Tabular Model of Stochastic MABs

To analyze the theoretical properties of a MAB experiment, it is convenient to express it using an infinite tabular representation of the arm rewards ( Section 2.2 of Shin et al. (2019a), Lattimore & Szepesv´ari (2019, Section 4.6)).

In the tabular model, we assume that each observation comes from an imaginary table or stacks of samples is an independent draw from , and the observation is equal to if

Now, given a MAB algorithm, let be a random variable which accounts for all possible external randomness of the algorithm for each time t. Next, we define as the collection of all possible arm rewards and external randomness. Then, given , every property and outcome of a MAB experiment becomes a deterministic function of . In particular and , for each t and k, can be expressed as deterministic functions of . See Section 2.2 in Shin et al. (2019a) for details.

3. Monotonicity Property and Signs of Biases

In a recent work by Shin et al. (2019a), a notion of monotonicity of the data collecting was introduced to characterize the marginal bias of a chosen sample mean at a stopping time. In this section, we summarize the previous result with some minor modifications in notations.

Definition 1. For each , we say a MAB algorithm satisfies the “monotonic increasing (or decreasing) property for arm k” if for any , the function , is an increasing (or decreas- ing) function of while keeping all other entries in fixed. Further, we say that

• A sampling rule is optimistic for arm k if the function is an increasing function of while keeping all other entries in fixed for any fixed

• A stopping rule is optimistic for arm k if the function is a decreasing function of while keep- ing all other entries in fixed for any fixed

• A choosing rule is optimistic for arm k if the function is an increasing function of while keeping all other entries in fixed for any fixed

It can be shown that many MAB algorithms, including -greedy, UCB, and Thompson sampling algorithms, have optimistic sampling rules. As for optimistic stopping, a sequential testing procedure with an upper stopping boundary yields a typical example of an optimistic stopping rule based on the stopping time where is a deterministic function characterizing the upper stopping boundary. Finally, the index of the arm with the largest sample mean, can be viewed as a typical optimistic choosing rule.

3.1. The Sign of the Marginal Bias

Shin et al. (2019a) demonstrated how certain natural monotonicity properties of the MAB experiment can be exploited to determine the sign of the marginal bias and of the conditional bias given conditioning events corresponding to adaptive choosing of arms. For completeness, we recall this result next.

Proposition 1 (Theorem 7 in Shin et al. (2019a)). Given a MAB algorithm, suppose we have and for all k with . If the MAB algorithm satisfies the monotonic decreasing property for each arm then we have

which also also implies that

Similarly if the MAB algorithm satisfies the monotonic increasing property for each arm k with then we have

which also implies that

If each arm has a bounded distribution then the condition can be dropped.

Note that if a MAB algorithm consists of an optimistic sampling but nonadaptive stopping and choosing rules, then the algorithm satisfies the monotonic decreasing property. This is the reason why -greedy, UCB, and Thompson sampling algorithms yield a negatively biased sample mean for each arm, marginally.

Also, if a MAB algorithm consists of an optimistic stopping but nonadaptive sampling and choosing, the algorithm satisfies the monotonic increasing property, which explains why sequential testing procedures (corresponding to onearm MBA experiments) with an upper stopping boundary produces a positively biased sample mean.

3.2. The Sign of the Conditional Bias

In this section, we generalize Proposition 1 and derive suf-ficient conditions for the sign of the conditional bias under arbitrary conditioning events of monotone functions of the arms rewards.

We first introduce some notation. For each arm k, let denote the corresponding cumulative distribution function (CDF): be the empirical CDF for arm k based on samples up to time is a random function on R and taking values in [0, 1] given by

which returns the fractions of samples from arm k whose values are no larger than y. For a stopping time T with respect to the MAB filtration , we then define be the empirical CDF of the rewards of arm k up to time T ; that is, for each

The use of the limit in the above definition is a necessary technicality in order to allow for the possibility that . Note that, on the event , we have that, for each , almost surely, based on the strong law of large numbers; see Theorem 2.1 in Gut (2009).

Next, for any function integrable with respect to , let be the corresponding expectation. Similarly, we let

to be the expectation of f under the empirical measure of the k-th arm at time t. Clearly, is a random variable. If f is the identity function, then it is immediate to see that and . Also, for any , setting f to be the indicator function yields that and . Finally, for a (possibly infinite) stopping time T with respect to the filtration

We are now ready to present the main result of this paper. The proof is given in Appendix A.1. For any event C, we let denote the conditional expectation given C.

Theorem 2. Let T be a stopping time with respect to the MAB experiment natural filtration . For a fixed [K], suppose . Let be any event at stopping time T with P(C) > 0. Assume that, for each i, the function is a decreasing function

of while keeping all other entries in fixed. Then, for each

or, equivalently, for any non-decreasing function with

Similarly, if, for each i, the function is an increasing function of while keeping all other entries in fixed, then we have

or, equivalently,

for any non-decreasing function satisfying

The important conclusion from the above theorem is that the conditional expected value of the empirical CDF of the rewards of any given arm can be stochastically smaller or greater than the true CDF of the corresponding arm. And furthermore, the sign of the bias can be determined in the basis of natural and often verifiable (as shown below) monotonicity conditions that depend on (i) the specific sampling and stopping rules deployed in the MAB experiment and (ii) the choice of the conditioning event C.

Note that if in the theorem we choose C as the conditioning event and the function f as the identity function, Theorem 2 yields Proposition 1 of Shin et al. (2019a) as a special case (indeed the statements about the marginal bias in that result follows from those on the conditional bias given the conditioning event ). We emphasize that Theorem 2 is a generalization of Proposition 1 of Shin et al. (2019a) in at least two ways: (i) it provides a more general guarantee about the conditional bias of monotone functions of the empirical CDF of the arms as opposed to just the sample mean and (ii) it allows for virtually any conditioning event that depends on the outcome of the MAB experiment (formally, that is measurable with respect to

The monotonicity assumption in Theorem 2 captures in a mathematically concise yet intuitive manner how adaptivity in the data collection process combines with adaptivity in the selective data analysis, exemplified by the conditioning event, to affect the sign of the bias. The interaction between these two sources of adaptivity is, at least to us, not at all apparent and, in fact, rather subtle. To illustrate this phenomenon, in the next section, we apply Theorem 2 in several, fairly routine, situations in adaptive data analysis to demonstrates how marginal and conditional biases may very well have opposite signs.

4. Applications

In this section, we discuss several practical examples of the conditional bias results in Theorem 2.

4.1. Conditional Versus Marginal Bias of a Stopped Sequential Test

Suppose we have a stream of samples from a single arm with a finite mean be the sample mean of the rewards observed up to time t.

To test whether is larger than a reference value construct an upper boundary and conclude that if the sample mean ever crosses the upper boundary.

Let T be the first time the sample mean crosses the boundary, i.e. . The stopping time T is an example of an optimistic stopping and, from Proposition 1, we can check that the stopped sample mean is always positively biased (Shin et al., 2019a).

However, the stopping time T can be large or even infinite with non-zero probability. Thus, in practice, we may want to allow for the possibility of stopping the sequential test before reaching the stopping time. Let M be any fixed pre-defined time at which we stop the testing procedure (if still ongoing), and let be the corresponding stopping time which takes account the early stopping option. Again, by Proposition 1, we know that the stopped sample mean is still positively biased, i.e. since the function is an increasing function of while keeping all other entries in fixed. From Theorem 2, we can also check that the expected empirical CDF at stopping time is negatively biased:

Conditioned on the early stopping event {M < T }, however, Theorem 2 shows that the early stopped sample mean and the empirical CDF are negatively and positively biased, respectively, since the function

is an decreasing function of while keeping all other entries in fixed. That is,

However, we can also check that the function

Figure 2. Average of marginal (red dashed line) and conditional empirical CDFs (green dotted and blue dot-dashed lines) from repeated stopped sequential test in Section 4.1. The black solid line refers to the true CDF of the underlying arm

implies that, on the line-crossing event ple mean and empirical CDF are positively and negatively biased, respectively. Thus, depending on the conditioning event, the conditional bias can be positive or negative. Also note that, without the early stopping condition, unbiased estimator of

Experiment: We verify these facts with simulations, where we repeat a stopped sequential test times. In each test, the arm corresponds to a standard normal distribution, and the upper boundary is constructed by the point-wise upper confidence bounds , where is the - upper quantile of the standard normal distribution. Note that this upper boundary does not yield a valid testing procedure since it inflates the type 1 error. However, we choose this boundary with unusual parameters manifest the difference between marginal and conditional biases.

Figure 2 shows the point-wise averages of the observed marginal and conditional CDFs from the simulated stopped sequential tests. The black solid line refers to the true CDF of the underlying arm. The red dashed line cor- responds to the marginal CDF, which lies below the true CDF, as expected, and thus the corresponding mean bias is positive. The green dotted and blue dot-dashed lines correspond to the averages of empirical CDFs conditioned on the early stopping and line-crossing events, respectively. As predicted, conditioned on the early stopping event, the empirical CDF and the sample mean are positively and negatively biased, respectively. In contrast, conditioned on the line-crossing event, the empirical CDF and the sample mean

are negatively and positively biased, respectively.

4.2. Sequential Test for Two Arms: Conditional Biases from Upper and Lower Stopping Boundaries

Suppose we have two independent arms with unknown means and . In this subsection, we consider the following testing problem:

To test this hypothesis, we draw a sample from arm 1 for every odd time and from arm 2 for every even time. Then, at each even time t, we check whether the difference between the sample means from the two arms crosses predefined upper and lower stopping boundaries, and

To be specific, define stopping times as follows:

Let M > 0 be a predetermined maximum time budget. Based on and M, we stop sampling whenever crosses one of the boundaries or the maximum time budget M is met. Define the corresponding stopping time as . At time , we accept if (upper-crossing event), and accept if (lower-crossing event). Otherwise, we declare that we do not have enough evidence to accept either one of two hypotheses.

In this case, we cannot apply Proposition 1 to determine the sign of the marginal bias since the stopping rule is neither optimistic nor pessimistic, marginally. However, we can determine the sign of the conditional bias based on Theorem 2 since the the function

is an increasing (resp. decreasing) function of (resp. , keeping all other entries in

In detail, conditioned on the event of accepting , the sample mean and empirical CDF for arm 1 are negatively and positively biased, respectively. That is,

Similarly, for arm 2, we have opposite signs of the sample mean and empirical CDF.

Figure 3. Average of marginal and conditional empirical CDFs of arm 1 from repeated sequential tests for two arms in Section 4.2. The black solid line refers to the true CDF of the underlying arm.

By the same reasoning, the signs of the conditional biases conditioned on the event of accepting are reversed:

Thus, we conclude that, in all cases, the expected difference between the sample means are exaggerated toward “the direction of decision”.

Remark 3. These results hold regardless of whether the underlying distribution is under the null or alternative.

Experiment: To demonstrate the conditional bias result, we set two standard normal arms with same means 0. In this experiment, we use upper and lower stopping boundaries based on naive point-wise confidence intervals:

where to show the bias better. Figure 3 show the marginal and conditional biases of the empirical CDFs and sample means for arm repetitions of the experiment. The black solid line corresponds to the true underlying CDF. The red dashed line refers to the average of the marginal CDFs, and the purple long-dashed line corresponds to the average of the empirical CDFs conditioned on reaching the maximal time. Note that for these two cases, the empirical CDFs are neither positively nor negatively biased across all

However, for the cases corresponding to accepting (green dotted line) and accepting (blue dot-dashed line), the empirical CDFs are positively and negatively biased, respectively: see inequalities (17) and (19). The conditional bias of the sample mean is also negative conditioning on the event of accepting and positive conditioning on the event of accepting : see inequalities (16) and (18).

4.3. Best-Arm Identification: (i) lil’UCB Algorithm

Suppose we have K > 2 sub-Gaussian arms with a common and known parameter . In many applications, we may want to identify which of the K arms has the largest mean parameter by using as few samples as possible.

In the previous subsection, we observed that, in the twoarmed bandit setting, if we use a boundary-crossing compatible with a best-arm identification style algorithm with deterministic sampling, then the optimal arm has positive bias and the sub-optimal arm has a negative bias.

Shin et al. (2019a) showed that the same phenomenon happens for the best arm as chosen by the lil’UCB algorithm, which consists of the following sampling, stopping and choosing rules (Jamieson et al., 2014):

• Sampling: For t = 1, . . . K, draw a sample from each arm. For t > K, draw a sample from arm k if

where are algorithm parameters and

• Stopping: the stopping time T is defined as the first time at which the following inequality holds for some

Proposition 4 (Corollary 8 in Shin et al. (2019a)). In the settings of the lil’UCB algorithm, for each k with k) > 0, we have that

Figure 4. Average of conditional empirical CDFs of arm 1 from lil’UCB algorithm runs on three unit-variance normal arms with , as described in Section 4.3. The black solid line refers to the true CDF of arm 1.

The following corollary to Theorem 2 shows a complementary result—the sample mean and the empirical CDF of a given arm are negatively and positively biased, respectively, conditioned on the event that the arm is not selected as the best arm. In contrast, on the same conditioning event, the empirical distribution of the selected arm is negatively biased.

Corollary 5. In the settings of the lil’UCB algorithm, for each , we have that

Also, for each

The proof of Corollary 5 can be found in Appendix A.2.

Remark 6. Corollary 5 remains true for any other choice of if the function is decreasing.

Experiment: To verify the previous claims, we conducted trials of the lil’UCB algorithm on three unit-variance normal arms with and . It is important to note that the signs of the biases do not depend on the choice of parameters or of the underlying distributions, but the magnitudes of the biases do. To best illustrates the bias results, we use an unusual set of algorithm parameters as in this experiment.

Figure 5. Average of conditional empirical CDFs of arm 2 from lil’UCB algorithm runs on three unit-variance normal arms with , as described in Section 4.3. The black solid line refers to the true CDF of arm 2.

Figure 4 shows the averages of the empirical CDFs of arm 1 (the arm with the larges mean) conditioned on each arm being chosen as the best arm. The black solid line corresponds to the true underlying CDF. The red dashed line, which lies below the true CDF, indicates that the empirical CDF of arm 1 conditioned on the event that the arm 1 is chosen as the best arm (i.e., ) is negatively biased; this then implies that the sample mean of the chosen arm is positively biased. In contrast, the green dotted and blue dot-dashed lines, lying above the true CDF, show that conditioned on the event the arm 1 is not chosen as the best arm (i.e., ), the empirical CDF is positively biased and the sample mean is negatively biased.

Figure 5 displays the averages of the empirical CDFs of arm 2. Though arm 2 is not the best arm, we can check that the signs of the conditional biases follow the same pattern as arm 1. Conditioned on the event that arm 2 is chosen as the best arm (), the empirical CDF is negatively biased (green dotted line), but conditioned on the event that arm 2 is not chosen as the best arm (, the corresponding CDFs are now positively biased (red dashed and blue dot-dashed lines), as expected.

4.4. Best-Arm Identification: (ii) Sequential Halving

The action elimination (Paulson et al., 1964; Even-Dar et al., 2002; 2006; Karnin et al., 2013) is another class of best arm identification algorithms in which each arm in an active set is sampled a fixed number of times and some of arms are eliminated from the active set at each round. In this subsection, we focus on the sequential halving algorithm (Karnin et al., 2013), which is designed to identify the best arm with high probability and based on a fixed number of samples. To be specific, the sequential halving algorithm for a given time budget T is defined as follows :

i.i.d. samples from it, and let be the average of the observed samples in the round.

• Let be the set of arms in with the largest average based on data collected in round r.

• Choose the unique arm in as the best one.

Let be the (random) index of the chosen arm at the final time T, and for each be the sample mean of arm k at time T based on all previously collected samples from that arm.

The following two corollaries show that under the sequential halving algorithm, the sample mean and empirical CDF of each arm is negatively and positively biased, respectively, marginally and conditionally on the event that the arm is not selected as the best one (). On the other hand, conditionally on the event that the arm is selected as the best one (), the sample mean and empirical CDF of each arm is positively and negatively biased, respectively.

Corollary 7. Suppose that data are collected by the sequential halving algorithm with a time budget T. Then, for each and for each , we have that

Corollary 8. In the settings of the sequential halving algorithm with a time budget T, for each k with , we have that

Also, for each

The proofs of Corollary 7 and 8 can be found in Appendix A.3. Due to limited space, an experiment about the bias of the sequential halving algorithm is presented in Appendix B.2.

5. Summary

In this paper, we have investigated the sign of the conditional bias of monotone functions of the rewards in the MAB framework under an arbitrary conditioning event, generalizing the results in Shin et al. (2019a), and complementing the results on the magnitude of the bias (Shin et al., 2019b). In our analysis, we have exploited certain natural monotonicity properties of MAB experiments and have characterized the impact on the bias of both adaptivity in the data acquisition process and in the selection of the target for inference.

Several interesting extensions of our results are worth pursuing. We emphasize two important ones:

• It is still an open problem how to characterize the bias (marginal or conditional) of other important functionals that are not necessarily monotone or linear, such as the sample variance and sample quantiles.

• Several debiasing methods have been proposed in the MAB literature: see, e.g., Xu et al. (2013); Desh- pande et al. (2018); Neel & Roth (2018); Nie et al. (2018); Hadad et al. (2019). However, the existing approaches typically only adjust for adaptive sampling but ignore the other sources of adaptivity. Current debiasing methods do not offer theoretical guarantees at stopping times, and also do not account for natural conditioning events in MABs, like selection of a promising arm. Furthermore, they are typically not interested in characterizing the sign of the bias, and thus have complementary aims. It is of interest to investigate how our results and techniques can be used in order to design more general debiased estimators.

References

Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012.

Audibert, J.-Y. and Bubeck, S. Minimax policies for ad- versarial and stochastic bandits. In COLT, pp. 217–226, 2009.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.

Bowden, J. and Trippa, L. Unbiased estimation for response adaptive clinical trials. Statistical methods in medical research, 26(5):2376–2388, 2017.

Deshpande, Y., Mackey, L., Syrgkanis, V., and Taddy, M. Accurate inference for adaptive linear models. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Even-Dar, E., Mannor, S., and Mansour, Y. Pac bounds for multi-armed bandit and markov decision processes. In International Conference on Computational Learning Theory, pp. 255–270. Springer, 2002.

Even-Dar, E., Mannor, S., and Mansour, Y. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(Jun):1079–1105, 2006.

Garivier, A. and Capp´e, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference On Learning Theory, pp. 359–376, 2011.

Gut, A. Stopped random walks. Springer, 2009.

Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S., and Athey, S. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.

Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. lil’ UCB: An Optimal Exploration Algorithm for MultiArmed Bandits. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pp. 423–439, 2014.

Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pp. 655–662, 2012.

Karnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238–1246, 2013.

Kaufmann, E., Korda, N., and Munos, R. Thompson sam- pling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pp. 199–213. Springer, 2012.

Lattimore, T. and Szepesv´ari, C. Bandit algorithms. Cambridge University Press, 2019.

Neel, S. and Roth, A. Mitigating bias in adaptive data gath- ering via differential privacy. In International Conference on Machine Learning, pp. 3717–3726, 2018.

Nie, X., Tian, X., Taylor, J., and Zou, J. Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, pp. 1261–1269, 2018.

Paulson, E. et al. A sequential procedure for selecting the population with the largest mean from k normal populations. The Annals of Mathematical Statistics, 35(1): 174–180, 1964.

Robbins, H. Some aspects of the sequential design of exper- iments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Shin, J., Ramdas, A., and Rinaldo, A. Are sample means in multi-armed bandits positively or negatively biased? In Advances in Neural Information Processing Systems, pp. 7100–7109, 2019a.

Shin, J., Ramdas, A., and Rinaldo, A. On the bias, risk and consistency of sample means in multi-armed bandits. arXiv preprint arXiv:1902.00746, 2019b.

Starr, N. and Woodroofe, M. B. Remarks on a stopping time. Proceedings of the National Academy of Sciences of the United States of America, 61(4):1215, 1968.

Sutton, R. S. and Barto, A. G. Introduction to reinforcement learning. MIT press Cambridge, 1998.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Villar, S. S., Bowden, J., and Wason, J. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015.

Xu, M., Qin, T., and Liu, T.-Y. Estimation bias in multi- armed bandit algorithms for search advertising. In Advances in Neural Information Processing Systems, pp. 2400–2408, 2013.

SUPPLEMENTARY MATERIAL A. Proofs

A.1. Proof of Theorem 2

The proof of Theorem 2 is closely related to the proof of Proposition 1 in Shin et al. (2019a). However, in this proof, we decouple the bias part from the integrability condition which makes the proof significantly simpler than the one in Shin et al. (2019a).

Under the condition in Theorem 2, we first prove that if the function is an decreasing function of while keeping all other entries in fixed for each i then, for any , the following inequality holds.

Proof of inequality (32). First note that if is a decreasing function of then the following function is also a decreasing function of

Then, by the tabular representation of MAB, we can rewrite the LHS of inequality (32) as follows:

where the third equality comes from the fact . Therefore, to prove the inequality (32), it is enough to show the following inequality:

Note that the term does not depend on by the definition of

Now, let be an other the tabular representation which is identical to -th entry of replaced with an independent copy from the same distribution

which implies that

By multiplying and taking expectations on both sides, we can show the inequality (32) hold as

follows:

where the first equality comes from the independence between and , and the second inequality holds since

Based on the inequality (32), we are ready to prove Theorem 2.

Proof of Theorem 2. First, suppose the function is an decreasing function of while keeping all other entries in fixed for each be a sequence of random variables defined as follows:

From the inequality (32), we have

Note that since it is understood that for . Therefore, we know that, for each , the sequence of random variables converges to

almost surely. Also, it can be easily checked for each is upper bounded by 2. Hence, from the dominated convergence theorem and the inequality (42), we have

Since almost surely, the last inequality also implies that

Since we assumed P(C) > 0, by multiplying 1/P(C) on both sides, we have

as desired. The inequality (46) shows that the underlying distribution of arm k stochastically dominates the empirical distribution of arm k in the conditional expectation. In this case, it is well-known that for any non-decreasing integrable function f, the following inequality holds

For the completeness of the proof, we formally prove the inequality (47). Since f is integrable, without loss of generality, we may assume is non-decreasing, for any probability measure P, the following equality holds

for all but at most countably many which implies that

where the first and last equalities come from the Fubini’s theorem with the integrability condition on f, and the first inequality comes from the inequality (46).

From the same argument with reversed inequalities, it can be shown that if the function is an increasing function of while keeping all other entries in fixed for each i, we have

Equivalently, for any non-decreasing integrable function f, we have

which completes the proof of Theorem 2.

A.2. Proof of Corollary 5

Before proving Corollary 5 formally, we first provide an intuition as to why, for any reasonable and efficient algorithm for the best-arm identification problem, the sample mean and empirical CDF of an arm are negative and positive biases, respectively, conditionally on the event that the arm is not chosen as the best arm.

For any and , let and be two collections of all possible arm rewards and external randomness that agree with each other except . Since we have a smaller reward from arm k in the second scenario , if under the first scenario , any reasonable algorithm also would not pick the arm k as the best arm under the more unfavorable scenario . In this case, we know that . Also note that any efficient algorithm should be able to exploit the more unfavorable scenario to easily identify arm k as a suboptimal arm and choose another arm as the best one by using less samples from arm k. Therefore, we would have . As a result, we can expect that, from any reasonable and efficient algorithm, we would have which implies that for each i, the function is a decreasing function of while keeping all other entries in fixed. Then, from Theorem 2, we have that the sample mean and empirical CDF of arm k are negatively and positively biased conditionally on the event , respectively. Below, we formally verify that this intuition works for the lil’UCB algorithm. The proof is based on the following two facts about the lil’UCB algorithm:

• Fact 1. The lil’UCB algorithm has an optimistic sampling rule. That is, for any fixed , and , the function is an increasing function of while keeping all other entries in fixed (see Fact 3 in Shin et al., 2019a).

• Fact 2. Let be two collections of all possible arm rewards and external randomness that agree with each other except in their k-th column of stacks of rewards be the numbers of draws from arm respectively. Then for each , the following implications hold for lil’UCB algorithm (see Fact 3 and Lemma 9 in Shin et al., 2019a):

Proof of Corollary 5. For any given and , let and be two collections of all possible arm rewards and external randomness that agree with each other except (i, k)-th entries, and of their stacks of rewards. Let denote the numbers of draws from arm k up to time t. Let be the stopping times and be choosing functions of the lil’UCB algorithm under respectively.

Suppose . To prove the claimed bias result, it is enough to show that the function is a decreasing function of while keeping all other entries in fixed which corresponds to prove the following inequality holds:

Note that if , the inequality (55) holds trivially. Therefore, for the rest of the proof, we assume and

We will first prove the inequality holds. From Fact 1 and the assumption , we have for any fixed t > 0. Then, by Fact 2, we also have for any . Since for all t, we can rewrite the lil’UCB stopping rule as stopping whenever there exists such that the inequality holds. Therefore, from the definition of the stopping rule with the fact for any , at the stopping time T , we have

for some which also implies that the stopping condition is also satisfied for arm which implies that the stopping time under must be at most T . Therefore we have . Now, since the inequality holds for any , we have . Finally, since is a non-decreasing function, we can conclude

Since we proved , to complete the proof of Corollary 5, it is enough to show that implies . We prove this statement by the proof by contradiction. Suppose . Then, there exists By the definition of , we know that

It is important to note that these inequalities are strict. Note that since we draw a singe sample at each time, if then at the time , either arm j or k should satisfy the stopping rule which contradicts to the definition of T .

Recall that, in Equation (56), we showed that if , at stopping time T , we have

which implies . By the same argument, at the stopping time with the assumption

which also implies . From these two inequalities on stopping times, we have . Finally, by combining inequalities between pairs of with the observation

where the first inequality comes from the inequality (58). The second inequality come from . The first equality comes from and the third inequality comes from the inequality (57). The last inequality comes from the final equality comes from

This is a contradiction, and, therefore, implies that . This proves that for each i, the function is a decreasing function of while keeping all other entries in fixed and, from Theorem 2, we can conclude that the sample mean and empirical CDF of arm k from the lil’UCB algorithm are negatively and positive biased conditionally on the event the arm k is not chosen as the best arm, respectively.

A.3. Proofs of Corollary 7 and 8

For any given be two infinite collections of arm rewards and external randomness that agree with each other except on the (i, k)-th entries, of their stacks of rewards. Without loss of generality, we assume . Finally, let denote the numbers of draws from arm k up to time

We first prove that the following inequality holds:

From Theorem 2, the above inequality implies that the sample mean and empirical CDF of each arm at a fixed time are negatively and positively biased, respectively as claimed in Corollary 7.

Proof of inequality (59). For each , let and be numbers of rounds in which arm k is in the active set up to time t under and , respectively. Since and , it is enough to show . Now, for the sake of contradiction, suppose . Since arm k is active at rounds under both and , we must have that for each , where and

are averages of the observed samples from arm k at each round r under and , respectively. By the same logic, we have that for each , where and are the sets of active arms under and , respectively. In particular, at the -th round, under both scenarios and , we have the exactly same active sets and the same sample averages of all the active arms except for arm k, for which it hols that it cannot be the case that arm k is eliminated from the active set after the -th round under but the same arm still remains active after the same round under the scenario in which the arm k has even a smaller sample average. This proves , which implies the claimed inequality (59) as desired.

For the proof of Corollary 8, let and be indices of the chosen arm under and , respectively. To prove the first two inequalities, we need to to show that

For the sake of contradiction, suppose but . Recall that and are the number of rounds in which arm k is in the active set up to time T under and respectively. Since but , we have . However this contradicts the fact , since and . Hence, inequality (61) is true. By Theorem 2 this implies the first two inequalities in Corollary 8.

Similarly, to prove the last two inequalities in Corollary 8, we need to to show that

If , the above inequality holds trivially. Hence, we can assume and, from the inequality (61), we also have . Since arm k is chosen as the best one under both and , we have that which proves inequality (62). The result again follows from Theorem 2.

Figure 6. Average of conditional empirical CDFs of arm 1 from repeated sequential tests for two arms under the alternative hypothesis (). See Section 4.2 for the detailed explanation about the sequential test.

B. Additional Simulation Results

In this section, we present additional simulation results for Section 4 and 5 which are omitted from the main part due to the page limit.

B.1. Conditional Bias Under Alternative Hypothesis in Section 4.2

As we conducted in Section 4.2, we have two standard normal arms with means and . Then, we use the following upper and lower stopping boundaries to test whether

where to show the bias better. In contrast to the experiment in Section 4.2 in which the true means are equal to each other, in this experiment, we set to make the alternative hypothesis is true.

Figure 6 show the marginal and conditional biases of the empirical CDFs and sample means for arm 1 based on

repetitions of the experiment. The black solid line corresponds to the true underlying CDF. The red dashed line refers to the average of the marginal CDFs, and the purple long-dashed line corresponds to the average of the empirical CDFs conditionally on reaching the maximal time. For these two cases, although the marginal CDF is negatively and the conditional CDF is positively biased, these are not general phenomena and the sign of bias can be changed as we change mean parameters.

However, for the cases corresponding to accepting (blue dot-dashed line) and accepting (green dotted line), we can check that signs of biases of CDFs and sample means are consistent with what Theorem 2 and corresponding inequalities (18) to (17) described. Also note that the bias results do not depend on whether the arms are under the null or alternative hypotheses.

B.2. Experiment About the Bias of the Sequential Halving Algorithm in Section 4.4

To verify the conditional bias results in Corollary 7 and 8, we conducted trials of the sequential halving algorithm on three unit-variance normal arms with and as we did for the lil’UCB algorithm in Section 4.3. Again, it is important to note that the signs of the biases do not depend on the choice of parameters or of the underlying distributions, but the magnitudes of the biases do. To best illustrates the bias results, we use an unusually small time budget T = 10 in this experiment.

The left side of Figure 7 shows the averages of the empirical CDFs of arm 1 (the arm with the largest mean) conditionally

Figure 7. Average of conditional empirical CDFs of arm 1 (left) and arm 2 (right) from the sequential halving algorithm runs on three unit-variance normal arms with , as described in Appendix B.2. Black solid lines refer to the true CDF of arm 1 and arm 2.

on each arm being chosen as the best arm. The thick black line corresponds to the true underlying CDF. The red dashed line, which lies below the true CDF, indicates that the empirical CDF of arm 1 conditionally on the event that arm 1 is chosen as the best arm (i.e., ) is negatively biased; this then implies that the sample mean of the chosen arm is positively biased. In contrast, the green dotted and blue dot-dashed lines, lying above the true CDF, show that conditionally on the event that arm 1 is not chosen as the best arm (i.e., ), the empirical CDF is positively biased and the sample mean is negatively biased.

The right side of Figure 7 displays the averages of the empirical CDFs of arm 2. Though arm 2 is not the best arm, we can check that the signs of the conditional biases follow the same pattern as arm 1. Conditionally on the event that arm 2 is chosen as the best arm (), the empirical CDF is negatively biased (green dotted line), but conditionally on the event that arm 2 is not chosen as the best arm (, the corresponding CDFs are now positively biased (red dashed and blue dot-dashed lines), as expected.

B.3. Experiments on Conditional Biases of Sample Variance and Median in MABs

As stated in Section 5, characterizing the bias of other important functionals such as sample variance and sample quantiles is an important open problem. In this subsection, we present a simulation study on the bias of sample variance and median.

For a given i.i.d. samples from a distribution P, the sample variance are defined by

where corresponds to the sample mean and refers to the i-th smallest sample.

It is well-known that for any distribution P with finite variance , the sample variance is an unbiased estimator of . Also, though the sample median is not necessarily unbiased, for any symmetric distribution P including the normal distribution as a special case, the sample median is unbiased. However, for adaptively collected data from a MAB experiment, it is unclear whether the sample variance and median are unbiased or not. Furthermore, it is an open question how to characterize the bias of sample variance and median estimators if they are biased estimators.

As an initial step, we conduct the repeated sequential experiments described in Section 4.1 and empirically investigate the biases of the sample variance and median estimators. Figure 8 describes a simulation study on the bias of the sample

Figure 8. Left: Densities of observed sample variances from repeated stopped sequential test as described in Section 4.1. Right: Densities of observed sample median from the same repeated stopped sequential test. For both figures, vertical dashed lines correspond to averages of sample variances and medians on each conditions.

variance in the sequential testing setting of Section 4.1. Recall that, in this experiment, we have a stream of samples from a standard normal distribution. Each test terminates once either the number of samples reaches a fixed early stopping time M = 10 or the sample mean crosses the upper boundary

Figure 8 shows the marginal and conditional distributions of the sample variance and median from stopped sequential tests. Vertical lines correspond to averages of the sample variances and medians over repetitions of the experiment and under different conditions. For the sample variance, the simulation shows that the sample variance is negatively biased marginally and conditionally on the early stopping event. On the other hand, conditionally on the line-crossing event, the sample variance has a heavy right tail and is positively biased. For the sample median, we can check that, marginally and conditionally on the line-crossing event, the sample median is positively biased. In contrast, the sample median is negatively biased conditionally on the early stopping event. Note that, for the sample median, sizes of biases of the sample median are similar to ones from the sample means which were equal to