Experimentation platform is now an essential part in online services for providing the best service by evaluating and comparing multiple candidates in real services [3]. A/B test and multi-armed bandit are main methodologies that direct a design and a decision in an experiment. Multi-armed bandit methods provide simple but flexible experiment framework compared to A/B test [3, 4]. For example, thompson sampling [5], a popular method of multi-armed bandit, outputs a result of experiment in terms of allocation of next experiment, which allows a sequential experiment [6].
In real service, it is common to have a type of non-stationary environment, that is time-varying effect [3, 7, 8, 9, 10, 11]. However, multi-armed bandit including thompson sampling is sensitive to this irregular condition in nature compared to A/B test, where sample sizes for all variants do not change during an experiment. In a continuous experiment where arms are added and dropped in the middle of the experiment, the time-varying effect would cause more severe sub-optimal result of the multi-armed bandit policy. It is because each arm is tested in different period.
In this study, we propose a natural way of dealing with time-varying effect with reparameterizing base model for thompson sampling. First, we will describe different parameterizations of logistic model and propose a novel thompson sampling (Odds-ratio thompson sampling) policy with a specific parameterization. Then its use in continuous experiment will be described. Finally we will evaluate its empirical regret together with other methods in simulated data.
Many multi-armed bandit applications adopt Batch Update, where arms are played multiple times, then, policy and related parameters are updated with aggregated rewards [2, 6]. Batch update, which is sometimes called as delay update, is practical set up, because it requires much less computational resources than online or realtime update. There are many chances that the temporal effect changes concurrent with batch update, making reward probabilities change through rounds. Our study discusses thompson sampling that are applied with the batch update, while there is common effect in each round.
Mean reward of i-th arm is represented by , where is a random variable for binary reward. Later we will discuss these mean rewards, or performances can drift through rounds. Reward observations are aggregated at each round and given as , where is the number of trials of i-th arm and is its number of successes. For simplicity of exposition, we suppress a subscript t if it is not necessary. For a task finding the arm with the maximum click through rate (CTR), page view counts (or impression) and click counts can be used for and , respectively.
In batch update, we allocate samples (or traffics) to each arm according to their thompson probabilities, instead of choosing actions with their computed probabilities. In the context of batch update, we call thompson probability as proportion.
In this section, we will parameterize logistic model, and introduce the first version of thompson sampling policy based on the logistic model, which we denote as full rank thompson sampling (Full-TS). Logistic model for multi-armed bandit for K number of arms is given:
We can see each parameter except is written as:
, which represents (log of) Odds Ratio, i-th arm’s reward probability with respect to reference K-th arm in logit scale. Note if and only if . We denote parameter as follows:
Our logistic model, (1) is written differently from typical setting, where in (1) is replaced by . Let us denote this parameterization as . In fact, the two parameterizations have a relationship of linear transformation, , where is matrix with entries,
This transformation does not change the model’s representation, thus the two parameterizations are equivalent to each other.
Once our thompson sampling begins with initial proportion for each arm, it alternates two steps: posterior update and allocation proportion, each step of which is described in detail in separate sections.
3.1 Prior and Initial Proportions
We use non-informative, uniform prior, which is an imaginary distribution that has constant density on IR. We found use of this improper prior has many advantages, described below.
We manually give initial proportions, ’s as 1/K. We found setting as zero-matrix from makes the following computation simpler.
3.2 Posterior Update
Thompson sampling begins with prior, then as rewards are observed, posterior is updated using bayes rule:
, where is posterior at previous round, or prior if is likelihood and is posterior that is updated in current round. For given round, t, we will call as prior at the round.
We follow a general Bayesian logistic regression described in [12]. Each time we will do Laplacian approximation to keep posterior as Gaussian distribution:
, where and is mean and covariance parameter of Gaussian distribution, respectively. Note that our parameters, s are not independent to each other, so posterior is multivariate distribution that is not factorized into univariate distributions.
Negative log of posterior is
, where is log of binomial likelihood and is log of prior, that is log probability density function of Gaussian distribution.
The two parameters are updated sequentially,
, where is the second derivative of (3) at . Using which is derived from is represented with its (i, j)-th entry,
3.3 Getting Allocation Proportion
Order among mean reward of arms is determined by order among and 0. That is, we set
Then, Thompson Sampling states probability with which each arm is pulled (or selected), is given,
, where I is an indicator function, is a function that maps to , which is derived from (1). This can
be obtained at round t by generating N number of multivariate samples from and compute i-th arm’s proportion, argmaxthen, randomly allocate i-th arm to proportion of next round.
3.4 Linear Transformation and Invariance
In our logistic model, it may be less intuitive whether the result is affected by an indexing of arms, for example, changing a reference arm.
Any indexing of same arm set can be transformed to the independent parameterization, , and its posterior is accordingly transformed to . Once transformed to the parameterization, univariate parameter in is independent to each other, where it is obvious different indexings give the identical posterior.
We call this property as "reference-invariance" property. In other domain, it has been understood that different encoding of categorical variable does not matter. Here, we see that this is true for multi-armed bandit using the logistic model with uniform prior.
In previous section, the logistic model (1) has the interesting property: ”Order among reward probability ’s is determined only by comparing among and 0, rather than .” In other words, parameter required for step (5) is , and does not contribute any information.
This property motivated us to devise a bandit model where each round shares only OR parameters , rather than full K parameters, . We name this model as Odds-Ratio Thompson Sampling (OR-TS). Freeing intercept parameter at each round corresponds to allow each arm’s reward probability drift of the same interval, in logit scale at each round.
We describe Odds-Ratio Logistic Bandit. Posterior update at round would produce posterior for to be used as prior at round t. Then updating only odds-ratio parameters can be symbolically described as changing the equality from to .
We can obtain prior for as -variate Gaussian distribution, by marginalizing full rank prior with respect to , where and are
We use a uniform prior for as we have done for parameters at t = 1. Then, bayes rule for posterior update is written:
(7) Once we replace of (3) by , we follow the same downstream process as in Full-TS.
It is worth to noting that this marginalization does not depend on a reference arm. In other words, OR update also has a reference-invariance property. This can be seen by the fact that any indexed marginalized prior is transform to independent parameterization, that is identical (degenerate) Gaussian distribution.
Note that OR-TS and Full-TS is defined for a round, thus, one can switch using between OR-TS or Full-TS at each round through an experiment. That is, at one round, assuming and at other round, .
Multi-Armed Bandit has been discussed with strict setting, where an experiment period or an arms set is assumed to be fixed during an experiment. However, multi-armed bandit can be used in more flexible way, such as without fixing the experiment period beforehand [7]. We define continuous experiment, the one where a set of arms to be allocated (denoted by ) changes over rounds and its period is not fixed. Simple scenarios include a case when new arms are added in the middle of an experiment.
Odds Ratio Thompson Sampling is more desirable in the continuous experiment, because periods for which each arm has been observed, differ, thus non-stationary environment affects full rank bandit worse.
We denote a set of arms observed until round t as . Generally it can be represented as cumulative set: . For simplicity, we assume all arms in are pulled (i.e. observed) at round t. Note that information from previous observations is delivered only when
, where is the number of items in a group. Therefore, to continue OR bandit, one should design satisfying the condition, (8). Nevertheless, when , we can just begin new multi-armed bandit by initializing and , or we can use full rank prior.
5.1 Getting Allocation Proportions in Continuous Experiment
Parameter, its prior and posterior can be freely transformed for any indexing of arms due to the reference-invariance property. This transformation is required when a previous reference arm is not included in , for example, because it is removed due to its low performance. In this case, we need to change a reference arm to one from and reparameterize it accordingly. Let f be the mapping function, that is, f(i) is a new index for previous i-th arm. The transform matrix, C can be represented as a chain of transformations: , where is corresponding matrix for f, with entries:
Getting allocation proportions for in a continuous experiment is straightforward using the rules previously applied. Assume that comprises arms already observed, thus in and the new ones which is not in . Denote the two groups as and , respectively. First, we manually set proportions for arms in as . We transform
Figure 1: Cumulative Regrets for different level of time-varying effect (d): Left panel for no effect (d = 0) and right panel for with time-varying effect (d = 20). Total sample size per round is set to N = 10, 000.
posterior from indexing in posterior to new indexing for arms in . Then, we can obtain s for i-th arm in on transformed posterior as described in section 3.3. Then, allocation proportions for arms in is given . Note this step is identical in Full-TS.
5.2 Posterior Update in Continuous Experiment
Posterior for at round is updated as we observe r for . Again, if a reference arm in a given parameterization of posterior at previous round, is not in , change the reference arm to one in , and transform parameterization accordingly. Note that in previous allocation step, we transform to arms in , but for posterior update we transform posterior to all arms in . This previous posterior is marginalized in terms of reference arm, in equation 7. Since all parameters are correlated, parameters for arms in are also updated. As far as grows with satisfying (8) at round t, then posterior for all arms in are successfully accumulated.
Information about relative performance among arms in is efficiently summarized in posterior. Note that this includes arm pair which has not been directly observed at the same round. For example, assume {B, C, D}. Even though two arms, A and D are not directly compared at the same round, indirect information from their odds ratios with respect to a common reference, that is, B or C, is summarized in posterior.
6.1 Simulation Study
We compared three different thompson sampling policies for binary reward: Full-TS, OR-TS and Beta-Bernoulli Thompson Sampling (Beta-TS). We simulated various environments and investigated behaviours of the three policies.
We set K = 10 arms, one of which has greater mean reward than the other nine arms. In specific, for nine sub-optimal arms are set to have , and 0.31 for an optimal arm. Then at each round, all arms have common background effect, , which is generated from Gaussian distribution:
Level of time-varying effect is controlled by variance of Gaussian distribution, . We represent the level of time-varying effect by a scaled metric, d which represents relative scale to mean reward difference between optimal and sub-optimal arms. We repeated simulation 100 times with 50 rounds for each simulation. Each round consists of 10,000 trials.
Mean of cumulative regret is shown in Figure 1. For both panels, we see that Beta-TS and Full-TS has very similar regret as expected. In environment of no time-varying effect (left panel), OR-TS have a slightly greater regret than the other two policies. However, when there is time-varying effect (right panel), OR-TS is robust to the effect, while Beta-TS and Full-TS are affected severely.
6.2 Experiment Based on Real Digital Advertisement Data
We investigated the performance of the three policies with real digital advertisement data of online messenger users. Even number of users have been exposed to four different listings which is sub module in a mobile page for 18 days.
Figure 2: Expected Click Counts for Advertisement Data
Setting an arm as showing each listing, we consider a multi-armed bandit problem of finding the best listing that has the maximum CTRs among the four listings. On the 11-th day of the experiment, the listings are shifted up into very top screen. Due to this change, performance values of the arms have been changed: page views have increased more than two times, while CTRs have decreased because of increase of views. Therefore, the data is expected to have two types of varying effects: one minor effect daily and the other major effect between the two periods.
We set ground truth CTRs as estimated from the real data and simulated multi-armed bandit. To investigate behaviors of the policies more efficiently, we assume total sample size is about 20, 000 to 200, 000 page views per round, which amounts to 20% of real data. Then, we record expected rewards, i.e. click counts for each round from the three policies. Figure 2 shows click counts for Beta-TS, Full-TS, and OR-TS. From the figure, indirectly we see time-varying effect through the period. OR-TS has gained more clicks compared to the other policies, especially during the two days after a major event occurred. We may infer that the temporal effects in day 1-10 (especially low performance in 10-th day) resulted in biased posterior for Beta-TS and Full-TS, while OR-TS worked robust to the effects. During the period, it is estimated that OR-TS gains 3.9% more clicks than Beta-TS and Full-TS.
In this study, we investigated an alternative expression of logistic model for thompson sampling. Based on the expression we showed that Odds-Ratio only update makes multi-armed bandit policy robust in many practical environment. Therefore, we believe that one may consider Odds Ratio thompson sampling as standard policy for binary reward data in batch update settings.
This study focused on the case when there is a common background effect, which does not change an optimal arm. However, real data can be confounded with the fact that an optimal arm changes over time. In this case, use of plain OR-TS would be also affected. In this case, we can use discount TS [8] or aggressiveness parameter [3]. For example, we can use multiply in second term of (1), where is a decay parameter ranging [0, 1].
We have discussed a base logistic model. Its extension to other generalized linear model, or contextual bandit is straightforward. Our implementation of OR-TS and Full-TS is available at http://github.com/sulgik/orts.
[1] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
[2] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
[3] Steven L Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
[4] Giuseppe Burtini, Jason Loeppky, and Ramon Lawrence. A survey of online experiment design with the stochas- tic multi-armed bandit. arXiv preprint arXiv:1510.00757, 2015.
[5] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
[6] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends Rin Machine Learning, 11(1):1–96, 2018.
[7] Rocío Cañamares, Marcos Redondo, and Pablo Castells. Multi-armed recommender system bandit ensembles. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 432–436, 2019.
[8] Vishnu Raj and Sheetal Kalyani. Taming non-stationary bandits: A bayesian approach. arXiv preprint arXiv:1707.09727, 2017.
[9] Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. Online context-aware recommendation with time varying multi-armed bandit. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 2025–2034. ACM, 2016.
[10] Giuseppe Burtini, Jason L Loeppky, and Ramon Lawrence. Improving online marketing experiments with drift- ing multi-armed bandits. In ICEIS (1), pages 630–636, 2015.
[11] Neha Gupta, Ole-Christoffer Granmo, and Ashok Agrawala. Thompson sampling for dynamic multi-armed bandits. In 2011 10th International Conference on Machine Learning and Applications and Workshops, volume 1, pages 484–489. IEEE, 2011.
[12] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.