Odds-Ratio Thompson Sampling to Control for Time-Varying Effect

Multi-armed bandit methods have been used for dynamic experiments particularly in online services. Among the methods, thompson sampling is widely used because it is simple but shows desirable performance [1, 2]. Many thompson sampling methods for binary rewards use logistic model that is written in a specific parameterization. In this study, we reparameterize logistic model with odds ratio parameters. This shows that thompson sampling can be used with subset of parameters. Based on this finding, we propose a novel method, “Odds-ratio thompson sampling”, which is expected to work robust to time-varying effect. Use of the proposed method in continuous experiment is described with discussing a desirable property of the method. In simulation studies, the novel method works robust to temporal background effect, while the loss of performance was only marginal in case with no such effect. Finally, using dataset from real service, we showed that the novel method would gain greater rewards in practical environment.

Experimentation platform is now an essential part in online services for providing the best service by evaluating and comparing multiple candidates in real services [3]. A/B test and multi-armed bandit are main methodologies that direct a design and a decision in an experiment. Multi-armed bandit methods provide simple but flexible experiment framework compared to A/B test [3, 4]. For example, thompson sampling [5], a popular method of multi-armed bandit, outputs a result of experiment in terms of allocation of next experiment, which allows a sequential experiment [6].

In real service, it is common to have a type of non-stationary environment, that is time-varying effect [3, 7, 8, 9, 10, 11]. However, multi-armed bandit including thompson sampling is sensitive to this irregular condition in nature compared to A/B test, where sample sizes for all variants do not change during an experiment. In a continuous experiment where arms are added and dropped in the middle of the experiment, the time-varying effect would cause more severe sub-optimal result of the multi-armed bandit policy. It is because each arm is tested in different period.

In this study, we propose a natural way of dealing with time-varying effect with reparameterizing base model for thompson sampling. First, we will describe different parameterizations of logistic model and propose a novel thompson sampling (Odds-ratio thompson sampling) policy with a specific parameterization. Then its use in continuous experiment will be described. Finally we will evaluate its empirical regret together with other methods in simulated data.

Many multi-armed bandit applications adopt Batch Update, where arms are played multiple times, then, policy and related parameters are updated with aggregated rewards [2, 6]. Batch update, which is sometimes called as delay update, is practical set up, because it requires much less computational resources than online or realtime update. There are many chances that the temporal effect changes concurrent with batch update, making reward probabilities change through rounds. Our study discusses thompson sampling that are applied with the batch update, while there is common effect in each round.

Mean reward of i-th arm is represented by  E(Ri) = pi, where  Riis a random variable for binary reward. Later we will discuss these mean rewards, or performances can drift through rounds. Reward observations are aggregated at each round and given as  ri = (ni, ci), where  niis the number of trials of i-th arm and  ciis its number of successes. For simplicity of exposition, we suppress a subscript t if it is not necessary. For a task finding the arm with the maximum click through rate (CTR), page view counts (or impression) and click counts can be used for  niand  ci, respectively.

In batch update, we allocate samples (or traffics) to each arm according to their thompson probabilities, instead of choosing actions with their computed probabilities. In the context of batch update, we call thompson probability as proportion.

In this section, we will parameterize logistic model, and introduce the first version of thompson sampling policy based on the logistic model, which we denote as full rank thompson sampling (Full-TS). Logistic model for multi-armed bandit for K number of arms is given:


We can see each parameter except  βKis written as:


, which represents (log of) Odds Ratio, i-th arm’s reward probability with respect to reference K-th arm in logit scale. Note  βi > 0if and only if  pi > pK. We denote parameter as follows:


Our logistic model, (1) is written differently from typical setting, where  βKin (1) is replaced by  β′KxK. Let us denote this parameterization as ˜βind. In fact, the two parameterizations have a relationship of linear transformation, ˜βind = Cind ˜β, where  Cindis  K × Kmatrix with entries,


This transformation does not change the model’s representation, thus the two parameterizations are equivalent to each other.

Once our thompson sampling begins with initial proportion for each arm, it alternates two steps: posterior update and allocation proportion, each step of which is described in detail in separate sections.

3.1 Prior and Initial Proportions

We use non-informative, uniform prior, which is an imaginary distribution that has constant density on IR. We found use of this improper prior has many advantages, described below.

We manually give initial proportions,  pT S,i’s as 1/K. We found setting  Σ−10as zero-matrix from ˜β ∼ N(µ0, Σ0)makes the following computation simpler.

3.2 Posterior Update

Thompson sampling begins with prior, then as rewards are observed, posterior is updated using bayes rule:


, where  P(˜β)is posterior at previous round, or prior if  t = 1, P(r|˜β)is likelihood and  P(˜β|r)is posterior that is updated in current round. For given round, t, we will call  P(˜β)as prior at the round.

We follow a general Bayesian logistic regression described in [12]. Each time we will do Laplacian approximation to keep posterior as Gaussian distribution:


, where  µtand  Σtis mean and covariance parameter of Gaussian distribution, respectively. Note that our parameters, βis are not independent to each other, so posterior is multivariate distribution that is not factorized into univariate distributions.

Negative log of posterior is


, where  l(r|µ)is log of binomial likelihood and  l(µ, Σt−1)is log of prior, that is log probability density function of Gaussian distribution.

The two parameters are updated sequentially,


, where  Λis the second derivative of (3) at  µ = µt. Using  ˆpiwhich is derived from  µt, Λis represented with its (i, j)-th entry,


3.3 Getting Allocation Proportion

Order among mean reward of arms is determined by order among  βi’s (i = 1, · · · , K − 1)and 0. That is, we set


Then, Thompson Sampling states probability with which each arm is pulled (or selected),  pT S,iis given,


, where I is an indicator function,  pj�˜β�is a function that maps ˜βto  pj, which is derived from (1). This can

be obtained at round t by generating N number of multivariate samples ˜βsfrom  N (µt, Σt)and compute i-th arm’s proportion,  pT S,i = 1N� I(i =argmaxj(pj(˜βs)))then, randomly allocate i-th arm to  pT S,iproportion of next round.

3.4 Linear Transformation and Invariance

In our logistic model, it may be less intuitive whether the result is affected by an indexing of arms, for example, changing a reference arm.

Any indexing of same arm set can be transformed to the independent parameterization, ˜βind, and its posterior is accordingly transformed to ˜βind ∼ N(Cindµ, CindΣCTind). Once transformed to the parameterization, univariate parameter in ˜βindis independent to each other, where it is obvious different indexings give the identical posterior.

We call this property as "reference-invariance" property. In other domain, it has been understood that different encoding of categorical variable does not matter. Here, we see that this is true for multi-armed bandit using the logistic model with uniform prior.

In previous section, the logistic model (1) has the interesting property: ”Order among reward probability  pi’s is determined only by comparing among ˜βORand 0, rather than ˜β.” In other words, parameter required for step (5) is βOR, and  βKdoes not contribute any information.

This property motivated us to devise a bandit model where each round shares only OR parameters ˜βOR, rather than full K parameters, ˜β. We name this model as Odds-Ratio Thompson Sampling (OR-TS). Freeing intercept parameter at each round corresponds to allow each arm’s reward probability drift of the same interval, in logit scale at each round.

We describe Odds-Ratio Logistic Bandit. Posterior update at round  t − 1would produce posterior for  βt−1to be used as prior at round t. Then updating only odds-ratio parameters can be symbolically described as changing the equality from  βK,t−1 = βK,tto  βK,t−1 ̸= βK,t.

We can obtain prior for  βOR,tas  (k−1)-variate Gaussian distribution,  N(µ−K, Σ−K)by marginalizing full rank prior with respect to  βK,t−1, where  µ−Kand  Σ−Kare


We use a uniform prior for  βK,tas we have done for parameters at t = 1. Then, bayes rule for posterior update is written:

P(˜β|r) = P(r|˜β)P(˜βOR)(7) Once we replace  l(µ, Σt−1)of (3) by  l(µ−K, Σ−K,t−1), we follow the same downstream process as in Full-TS.

It is worth to noting that this marginalization does not depend on a reference arm. In other words, OR update also has a reference-invariance property. This can be seen by the fact that any indexed marginalized prior is transform to independent parameterization, that is identical (degenerate) Gaussian distribution.

Note that OR-TS and Full-TS is defined for a round, thus, one can switch using between OR-TS or Full-TS at each round through an experiment. That is, at one round, assuming  βK,t−1 ̸= βK,tand at other round,  βK,t−1 = βK,t.

Multi-Armed Bandit has been discussed with strict setting, where an experiment period or an arms set is assumed to be fixed during an experiment. However, multi-armed bandit can be used in more flexible way, such as without fixing the experiment period beforehand [7]. We define continuous experiment, the one where a set of arms to be allocated (denoted by  At) changes over rounds and its period is not fixed. Simple scenarios include a case when new arms are added in the middle of an experiment.

Odds Ratio Thompson Sampling is more desirable in the continuous experiment, because periods for which each arm has been observed, differ, thus non-stationary environment affects full rank bandit worse.

We denote a set of arms observed until round t as  Ct. Generally it can be represented as cumulative set:  Ct =A1 ∪ · · · ∪ At. For simplicity, we assume all arms in  Atare pulled (i.e. observed) at round t. Note that information from previous observations is delivered only when


, where  n(·)is the number of items in a group. Therefore, to continue OR bandit, one should design  Atsatisfying the condition, (8). Nevertheless, when  n (At ∩ Ct−1) = 1, we can just begin new multi-armed bandit by initializing  µand  Σ, or we can use full rank prior.

5.1 Getting Allocation Proportions in Continuous Experiment

Parameter, its prior and posterior can be freely transformed for any indexing of arms due to the reference-invariance property. This transformation is required when a previous reference arm is not included in  At, for example, because it is removed due to its low performance. In this case, we need to change a reference arm to one from  At ∩ Ct−1and reparameterize it accordingly. Let f be the mapping function, that is, f(i) is a new index for previous i-th arm. The transform matrix, C can be represented as a chain of transformations:  C = Cind−1CfCind, where  Cfis corresponding matrix for f, with entries:


Getting allocation proportions for  Atin a continuous experiment is straightforward using the rules previously applied. Assume that  Atcomprises arms already observed, thus in  Ct−1and the new ones which is not in  Ct−1. Denote the two groups as  Otand  Ut, respectively. First, we manually set proportions for arms in  Utas  1/n(At). We transform


Figure 1: Cumulative Regrets for different level of time-varying effect (d): Left panel for no effect (d = 0) and right panel for with time-varying effect (d = 20). Total sample size per round is set to N = 10, 000.

posterior from indexing in posterior to new indexing for arms in  Ot. Then, we can obtain  pT S,is for i-th arm in  Oton transformed posterior as described in section 3.3. Then, allocation proportions for arms in  Otis given n(Ot)n(At)pT S,i. Note this step is identical in Full-TS.

5.2 Posterior Update in Continuous Experiment

Posterior for  Ct−1at round  t−1is updated as we observe r for  At. Again, if a reference arm in a given parameterization of posterior at previous round,  t − 1is not in  At, change the reference arm to one in  At ∪ Ct−1, and transform parameterization accordingly. Note that in previous allocation step, we transform to arms in  At, but for posterior update we transform posterior to all arms in  Ct. This previous posterior is marginalized in terms of reference arm, in equation 7. Since all parameters are correlated, parameters for arms in  Utare also updated. As far as  Ctgrows with satisfying (8) at round t, then posterior for all arms in  Ctare successfully accumulated.

Information about relative performance among arms in  Ctis efficiently summarized in posterior. Note that this includes arm pair which has not been directly observed at the same round. For example, assume  A1 = {A, B, C}, A2 ={B, C, D}. Even though two arms, A and D are not directly compared at the same round, indirect information from their odds ratios with respect to a common reference, that is, B or C, is summarized in posterior.

6.1 Simulation Study

We compared three different thompson sampling policies for binary reward: Full-TS, OR-TS and Beta-Bernoulli Thompson Sampling (Beta-TS). We simulated various environments and investigated behaviours of the three policies.

We set K = 10 arms, one of which has greater mean reward than the other nine arms. In specific,  βifor nine sub-optimal arms are set to have  pi = 0.30, and 0.31 for an optimal arm. Then at each round, all arms have common background effect,  δt, which is generated from Gaussian distribution:


Level of time-varying effect is controlled by variance of Gaussian distribution,  σ. We represent the level of time-varying effect by a scaled metric, d which represents relative scale to mean reward difference between optimal and sub-optimal arms. We repeated simulation 100 times with 50 rounds for each simulation. Each round consists of 10,000 trials.

Mean of cumulative regret is shown in Figure 1. For both panels, we see that Beta-TS and Full-TS has very similar regret as expected. In environment of no time-varying effect (left panel), OR-TS have a slightly greater regret than the other two policies. However, when there is time-varying effect (right panel), OR-TS is robust to the effect, while Beta-TS and Full-TS are affected severely.

6.2 Experiment Based on Real Digital Advertisement Data

We investigated the performance of the three policies with real digital advertisement data of online messenger users. Even number of users have been exposed to four different listings which is sub module in a mobile page for 18 days.


Figure 2: Expected Click Counts for Advertisement Data

Setting an arm as showing each listing, we consider a multi-armed bandit problem of finding the best listing that has the maximum CTRs among the four listings. On the 11-th day of the experiment, the listings are shifted up into very top screen. Due to this change, performance values of the arms have been changed: page views have increased more than two times, while CTRs have decreased because of increase of views. Therefore, the data is expected to have two types of varying effects: one minor effect daily and the other major effect between the two periods.

We set ground truth CTRs as estimated from the real data and simulated multi-armed bandit. To investigate behaviors of the policies more efficiently, we assume total sample size is about 20, 000 to 200, 000 page views per round, which amounts to 20% of real data. Then, we record expected rewards, i.e. click counts for each round from the three policies. Figure 2 shows click counts for Beta-TS, Full-TS, and OR-TS. From the figure, indirectly we see time-varying effect through the period. OR-TS has gained more clicks compared to the other policies, especially during the two days after a major event occurred. We may infer that the temporal effects in day 1-10 (especially low performance in 10-th day) resulted in biased posterior for Beta-TS and Full-TS, while OR-TS worked robust to the effects. During the period, it is estimated that OR-TS gains 3.9% more clicks than Beta-TS and Full-TS.

In this study, we investigated an alternative expression of logistic model for thompson sampling. Based on the expression we showed that Odds-Ratio only update makes multi-armed bandit policy robust in many practical environment. Therefore, we believe that one may consider Odds Ratio thompson sampling as standard policy for binary reward data in batch update settings.

This study focused on the case when there is a common background effect, which does not change an optimal arm. However, real data can be confounded with the fact that an optimal arm changes over time. In this case, use of plain OR-TS would be also affected. In this case, we can use discount TS [8] or aggressiveness parameter [3]. For example, we can use multiply  (1 − λ)in second term of (1), where  λis a decay parameter ranging [0, 1].

We have discussed a base logistic model. Its extension to other generalized linear model, or contextual bandit is straightforward. Our implementation of OR-TS and Full-TS is available at http://github.com/sulgik/orts.

[1] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.

[2] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.

[3] Steven L Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.

[4] Giuseppe Burtini, Jason Loeppky, and Ramon Lawrence. A survey of online experiment design with the stochas- tic multi-armed bandit. arXiv preprint arXiv:1510.00757, 2015.

[5] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[6] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends R⃝in Machine Learning, 11(1):1–96, 2018.

[7] Rocío Cañamares, Marcos Redondo, and Pablo Castells. Multi-armed recommender system bandit ensembles. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 432–436, 2019.

[8] Vishnu Raj and Sheetal Kalyani. Taming non-stationary bandits: A bayesian approach. arXiv preprint arXiv:1707.09727, 2017.

[9] Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. Online context-aware recommendation with time varying multi-armed bandit. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 2025–2034. ACM, 2016.

[10] Giuseppe Burtini, Jason L Loeppky, and Ramon Lawrence. Improving online marketing experiments with drift- ing multi-armed bandits. In ICEIS (1), pages 630–636, 2015.

[11] Neha Gupta, Ole-Christoffer Granmo, and Ashok Agrawala. Thompson sampling for dynamic multi-armed bandits. In 2011 10th International Conference on Machine Learning and Applications and Workshops, volume 1, pages 484–489. IEEE, 2011.

[12] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.

Designed for Accessibility and to further Open Science