Regime Switching Bandits

2020·Arxiv

Abstract

Abstract

We study a multi-armed bandit problem where the rewards exhibit regime-switching. Specifically, the distributions of the random rewards generated from all arms depend on a common underlying state modeled as a finite-state Markov chain. The agent does not observe the underlying state and has to learn the unknown transition probability matrix as well as the reward distribution. We propose an efficient learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models and upper confidence bound methods for reinforcement learning. We also establish ) bound on the regret of the proposed learning algorithm where T is the unknown horizon. Finally, we conduct numerical experiments to illustrate the effectiveness of the learning algorithm.

1 Introduction

The multi-armed bandit (MAB) problem is a popular model for sequential decision making with unknown information: the decision maker makes decisions repeatedly among I different options, or arms. After each decision she receives a random reward having an unknown probability distribution that depends on the chosen arm. The objective is to maximize the expected total reward over a finite horizon of T periods. The MAB problem has been extensively studied in various fields and applications including Internet advertising, dynamic pricing, recommender systems, clinical trials and medicine. See, e.g., Bouneffouf and Rish [2019], Bubeck and Cesa-Bianchi [2012], Slivkins [2019].

In the classical MAB problem, it is typically assumed that the random reward of each arm is i.i.d. (independently and identically distributed) over time and independent of the rewards from other arms. However, these assumptions do not necessarily hold in practice [Besbes et al., 2014, Bubeck and Cesa-Bianchi, 2012]. To address the drawback, a growing body of literature studies MAB problems with non-stationary rewards to capture temporal changes in the reward distributions in applications, see e.g. Besbes et al. [2014, 2019], Cheung et al. [2018], Garivier and Moulines [2011], Levine et al. [2017], Ortner et al. [2014], Wei and Srivatsva [2018] and the references therein.

In this paper, we study a non-stationary MAB model with Markovian regime-switching rewards. We assume that the random rewards associated with all the arms are modulated by a common unobserved state (or regime) process modeled as a finite-state Markov chain. Given , the reward of arm i is i.i.d., whose distribution is denoted ). Such structural change of the environment is usually referred to as regime switching in finance [Mamon and Elliott, 2007]. The agent doesn’t observe or control the underlying state , and has to learn the transition probability matrix as well as the distribution of reward of each arm ), based on the observed historical rewards. The goal of the agent is to design a learning policy that decides which arm to pull in each period to maximize the expected cumulative reward over T periods.

To measure the performance of a learning policy, we use the cumulative T-period regret, a common measure used in the literature [Bubeck and Cesa-Bianchi, 2012]. The regret is the difference between the total rewards received from a learning policy and a benchmark policy (typically an oracle that knows all parameters). We consider an oracle who knows P and ), but doesn’t observe the hidden state . As a result, the problem faced by the oracle is a partially observable Markov decision process (POMDP) [Krishnamurthy, 2016]. We reformulate the POMDP as a Markov decision process (MDP) with a continuous belief space (i.e. the state of the MDP is a posterior distribution over the underlying hidden state we follow the literature [Jaksch et al., 2010] to use the optimal policy that maximizes the long-run average reward over an infinite horizon as the benchmark policy to compute the regret.

In this work, we propose a novel learning algorithm (see Algorithm 2) for the regime-switching MAB model. It has the following two key ingredients.

First, we build on recent advances on the estimation of the parameters of hidden Markov models (HMMs) using spectral method-of-moments methods, which involve the spectral decomposition of certain low-order multivariate moments computed from data [Anandkumar et al., 2012, 2014, Azizzadenesheli et al., 2016]. Spectral method-of-moments for HMM can yield finite sample performance bound on the accuracy of their estimates that is necessary for a finite-sample analysis of regret, while the finite-sample guarantees of other alternatives such as maximum likelihood estimators remain an open problem [Lehricy, 2019]. To apply the spectral method, we divide the horizon into nested exploration and exploitation phases, so that the arms pulled in the exploration phase are purely random and the rewards observed can be treated as the observations from an HMM. Using the finite-sample performance of the spectral estimators, we are able to gauge the estimation error of ) in the exploration phase.

Second, we use the well-known “upper confidence bound” (UCB) method in reinforcement learning [Auer and Ortner, 2007, Jaksch et al., 2010] to control the regret in the exploitation phase. Once the confidence set of the unknown parameters (the transition probability and the reward distribution) are formed in the exploration phase, we choose the most optimistic estimator inside the set and compute the corresponding optimal policy that gives the highest long-run average reward of the POMDP. This policy is implemented until one exploitation phase ends and a new exploration phase starts. Since the agent also forms belief of the underlying state based on the estimator, what differentiates our work from other UCB algorithms is the re-computation of the belief at the beginning of the exploitation phase.

We establish a regret bound of ) for our proposed learning algorithm, assuming access to an optimization oracle that provides the optimal policy of the belief MDPs. Our regret analysis draws inspirations from Jaksch et al. [2010], Ortner and Ryabko [2012] for learning MDPs and undiscounted reinforcement learning problem, but the analysis differs significantly from theirs since there are two main technical challenges in our problem.

The first challenge in regret analysis is the control of the error of the belief state which is not directly observed. Note that the estimation error of the parameters affect the updating of the belief in every previous period. Hence one would expect that the error in the belief state might accumulate over time and explode even if the estimation of the parameters is quite accurate. We prove that in our algorithm this cannot happen and as a result the regret is well controlled.

The second challenge is bounding the span of the bias function (sometimes referred as the relative value function [Puterman, 2014]) for the POMDP or the belief MDP which is a continuous-state MDP. Such a bound is often critical in regret analysis for undiscounted reinforcement learning problems. Previous works in learning continuous MDPs, e.g. Ortner and Ryabko [2012], Lakshmanan et al. [2015], Qian et al. [2018], establish bounds of span of the bias function under the assumptions that rewards and transition probabilities are H¨older continuous in states under metric, but such assumptions are violated for the belief transition densities in our problem. Instead, we use the Kantorovich metric for the belief transition kernel together with the classical vanishing discount approach for POMDPs to establish bounded solutions to the Bellman equation for the average-reward belief MDP. This further yields an upper bound on the span of the bias function.

The regime-switching MAB model we study may find applications in various industries. For instance, in financial trading, a market participant trading a financial asset is deciding among two different strategies (“arms”): “trend-following” and “mean-reverting”. Either strategy may be advantageous under certain market regime, which changes dynamically and a standard model for such behavior is a hidden Markov chain [Mamon and Elliott, 2007]. In each period, the random reward from adopting each strategy depends on the market regime, and the participant attempts to maximize the expected reward by the end of the trading horizon. As another example, consider the learning-and-earning problem in revenue management [den Boer, 2015, Keskin and Zeevi, 2016]. A firm (e.g. an online retailer) selling a product is deciding among a finite set of prices (“arms”) to maximize the expected total revenue, but does not have information about the market demand for each price initially. The demand and hence the revenue depend on market factors including the local economy and consumer sentiments that may switch between the “good” and “bad” states [Ketter et al., 2009]. These states are usually unobservable to the firm, and their switching behavior can be modeled by a Markov chain [Aviv and Pazgal, 2005]. If the firm can infer the underlying state and learn the demand at the same time, then it may plan accordingly and earn more revenues.

Other Related Work. Our work is related to the restless Markov bandit problem [Guha et al., 2010, Ortner et al., 2014, Slivkins and Upfal, 2008] in which the state of each arm evolves according to independent Markov chains independently of the agents actions. In contrast, our regime-switching MAB model assumes a common underlying Markov chain so that the rewards of all arms are correlated, and the underlying state or regime is unobservable to the agent. In addition, our work is related to MAB studies where rewards of all arms depend on a common unknown parameter or a latent random variable, see, e.g., Atan et al. [2018], Gupta et al. [2018], Lattimore and Munos [2014], Maillard and Mannor [2014], Mersereau et al. [2009]. Our paper differs from them mainly in that the common latent state variable follows a dynamic stochastic process. We also mention Fiez et al. [2019], Yemini et al. [2019] which also consider correlated rewards driven by hidden Markov chains. Our work differs from them in many aspects including the reward structure, the policy class considered, the definition of regret and the algorithmic design.

Our paper is also related to studies on reinforcement learning for POMDPs with theoretical guarantees, see e.g. Azizzadenesheli et al. [2016]. The main difference is that they define regret with respect to the optimal memoryless policy that directly maps observations to actions, while we use the optimal policy that maps belief states to actions for average reward POMDPs as the benchmark to compute the regret.

2 MAB with Markovian Regime switching

2.1 Problem Formulation

Consider a MAB problem with arms I := {1, . . . , I}. There is a Markov chain with states M := {1, 2, . . . , M} and transition probability matrix . In period t = 1, 2, . . . , if the state of the Markov chain and the agent chooses arm , then the reward in that period is with bounded support, and its distribution is denoted by ) := ), with ]. We use to denote the mean reward matrix.

The agent knows M and I, but has no knowledge about the underlying state (also referred to as the regime), the transition matrix P or the reward distribution ). The goal is to design a learning policy that is adapted to the filtration generated by the observed rewards to decide which arm to pull in each period to maximize the expected cumulative reward over T periods where T is unknown in advance.

If an oracle knows ) and the underlying state , then the problem becomes trivial as s/he would select = argmaxin period t. If we benchmark a learning policy against the oracle, then the regret must be linear in T, because the oracle always observes while the agent cannot predict the transition based on the history. Whenever a transition occurs, there is non-vanishing regret incurred. Since the number of transitions during [0, T] is linear in T, the total regret is of the same order.

Since comparing to the oracle knowing is uninformative, we consider a weaker oracle who knows P, . In this case, the oracle solves a POMDP since the states are unobservable. The total expected reward of the POMDP can be shown to scale linearly in T, and asymptotically the reward per period converges to . See more details in Section 2.2.

For a learning policy , the arm pulled in period t, which we denote by , is adapted to filtration denotes the reward received under the learning policy (which does not know ) initially) in period t. Then we follow the literature (see, e.g., Jaksch et al. 2010, Ortner

et al. 2014, Agrawal and Jia 2017) and define its total regret after T periods by

For technical reasons we consider the regret with respect to , our results also bound the regret with respect to the optimal T-period reward. See Section A in the appendix for further discussions. We look for efficient learning algorithms with theoretical guarantees including high probability and expectation bounds (sublinear in T) on the total regret.

For the simplicity of presentation, we consider Bernoulli rewards. Hence the mean characterizes the distribution ). In addition, we impose the following assumptions.

Assumption 1. The finite-state Markov chain is geometrically ergodic with a unique stationary distribution = ()), where 0 for every state . In addition, the transition matrix P is invertible.

Assumption 2. The mean reward matrix has full row rank with

Assumption 3. The smallest element of the transition matrix

The first two assumptions are required for the finite-sample guarantee of spectral estimators for HMMs [Anandkumar et al., 2012, 2014]. See Section 3.1. The third assumption is required for controlling the error in the belief state of the belief MDP we describe next; see Proposition 3 in Section 5. The Bernoulli rewards are without loss of generality. Our analysis holds generally for random rewards with discrete and bounded support.

2.2 Reduction of POMDP to Belief MDP

To present our learning algorithm and analyze the regret, we first investigate the POMDP problem faced by the oracle where parameters (equivalently Q for Bernoulli rewards) are known with unobserved states . For notational simplicity, we write . It is well known that a POMDP can be reformualted as a belief MDP. We demonstrate the reduction of our problem below.

Based on the historical observed rewards, the oracle forms a belief of the underlying state. The belief can be encoded by a M-dimension vector )) defined as

Note that we must have

The POMDP of the oracle can be seen as a MDP built on a continuous state space B where the belief is constantly being updated [Krishnamurthy, 2016, Puterman, 2014]. We next introduce a few notations that facilitate the analysis. We first generalize the definition of c(m, i) to a belief vector. Given belief , the expected reward of arm i is

In period t, given belief , the selected arm and observed reward , the updated belief follows the Bayes theorem:

where the m-th entry is

We remark here that the forward function H depends on the transition matrix P and the reward matrix . We can also define the transition probability of the belief state conditional on the arm pulled:

Given this belief MDP with state , we can define policies of the oracle , which map the current belief into an arm to pull. The long-run average reward of the infinite-horizon belief MDP following policy given the initial belief b can be written as := lim sup= b]. The optimal policy maximizes for a given b. In our problem, one can show that it is independent of the initial belief b (Proposition 8.2.1 [Puterman, 2014]). Therefore, we use

to denote the optimal long-run average reward. Under this belief MDP formulation, for all , the Bellman equation states that

where is the bias function of the belief state. The bias function, or sometimes referred to as the relative value function (Chapter 8 [Puterman, 2014]), records the deviation in total expected reward of an initial state b relative to the long-run average . It will be shown in Lemma 2 in Section 5 that the corresponding optimal policy are well defined.

3 The SEEU Algorithm

This section describes our learning algorithm for the regime switching MAB model: the Spectral Exploration and Exploitation with UCB (SEEU) algorithm.

To device a learning policy for the POMDP with unknown and P, one needs a procedure to estimate those quantities from observed rewards. Anandkumar et al. [2012, 2014] propose the so-called spectral estimator for the unknown parameters in hidden Markov models (HMMs). However, the algorithm is not directly applicable to ours, because there is no decision making in HMMs. To use the spectral estimator, we divide the learning horizon T into nested “exploration” and “exploitation” phases. In the exploration phase, we randomly select an arm in each period. This transforms the system into a HMM so that we can apply the spectral method to estimate and P from the observed rewards in the phase. In the exploitation phase, based on the estimators obtained from the exploration phase, we use a UCB-type policy to further narrow down the optimal policy in the POMDP introduced in Section 2.2.

3.1 Spectral Estimator

We introduce the spectral estimator in Anandkumar et al. [2012, 2014] where the observation samples can be generated from a finite set, and we adapt it to our problem setting.

To simplify the notation, suppose the exploration phase starts from period 1 until period N, with realized arms , and realized rewards sampled from Bernoulli distributions. Note that is randomly selected in I. Let I be the cardinality of the arm set I. One can create a one-to-one mapping from a pair (i, r) into a scalar . Therefore, the pair can be expressed as a vector such that in each period is a basis vector with its s-th element being one and zero otherwise. Let be the observation probability matrix conditional on the state:

It can be shown that We denote Θ := -th column vector of Θ. For three consecutive observations

where represents the tensor product. By Theorem 3.6 in Anandkumar et al. [2014], if the initial state has the stationary distribution, then we have

where )) is the stationary distribution of the underlying Markov chain From the observations, we may construct the sample average for

to estimate ]. Following equation (6), we define

and construct estimators ˆas follows:

From ˆ, one may apply the tensor decomposition to obtain the estimator ˆfor the unknown mean reward matrix and ˆP for the transition matrix. This procedure is summarized in Algorithm 1. We use ˆ(respectively ˆ) to denote the th column vector of ˆA (respectively ˆB).

In addition, we can obtain the following result from page 2 of Azizzadenesheli et al. [2016] and it provides the confidence regions of the estimators in Algorithm 1.

Proposition 1. Under Assumptions 1 and 2, for any and any initial distribution, there exists such that when , with probability 1 , the estimated ˆby Algorithm 1 satisfy

where (and (ˆare the m-th row vectors of and ˆ, respectively, S = 2I, and are explicit constants independent of n.

The explicit expressions of constants are given in Section B in the appendix. Note that parameters are identifiable up to permutations of the hidden states labels, and we remark that we ignored such permutations in the statement of Proposition 1 for simplicity. The reason is that this permutation does not affect our learning algorithm: given a sequence of rewards and arms pulled, it can be shown from the Bellman equation (5) that the optimal belief-based policy is permutation-free.

3.2 The SEEU Algorithm

The SEEU algorithm proceeds with episodes with increasing length, similar to UCRL2 algorithm in [Jaksch et al., 2010]. As mentioned before, each episode is divided into exploration and exploitation phases. In episode k, it starts with the exploration phase that lasts for a fixed number of periods , and the algorithm uniformly randomly chooses an arm and observes the rewards. After the exploration phase, the algorithm applies Algorithm 1 to (re-)estimate . Moreover, it constructs a confidence interval based on Proposition 1 with a confidence level 1 , where is a vanishing sequence. Then the algorithm enters the exploitation phase. Its length is proportional to. In the exploitation phase, it conducts UCB-type learning:

the arm is pulled according to a policy that corresponds to the optimistic estimator of and P inside the

confidence interval. The detailed steps are listed in Algorithm 2. We have the following remarks about Algorithm 2: (1) In Line 6 of Algorithm 2, we use all the observations (action-reward pairs) in the previous exploration

phases. It is not hard to adapt Algorithm 1 to accommodate non-consecutive observations. In fact, any three

consecutive observations can be used to construct as a sample in (9). So we have in each episode’s exploration phase, and calculate the average of (2)k samples as the approximation of

((2) In Line 8, for given parameters (), we need to solve the Bellman equation (5) to obtain the

optimal average reward ) that depends on the parameters. Since the belief state is continuous, in our

experiments, we discretize the state and then solve it approximately by value iteration. In addition, to find the optimisitic POMDP, we need to search over infinite combinations () in ). For learning finite-state MDPs, Jaksch et al. [2010] propose the extended value iteration method to solve this problem. However, we can not apply this method for two reasons: (a) the reward function ¯c and the transition probability kernel ¯T of belief states both depend on the reward matrix . (b) We do not have the norm of each column of the transition probability kernel ¯T because it depends on and P in a complex way. So to find in experiments, we discretize ) into grids, calculating the corresponding at each grid point, choosing the one with the largest as our optimistic . For simplicity, we do not consider such discretization errors in the regret bound.

4 Regret Bound

In this section we give the regret bound for the SEEU algorithm. The first result is a high probability bound.

Theorem 1. Under Assumptions 1 to 3, when , with probability 1 , the regret of Algorithm 2 satisfies

where S = 2are constants independent of T whose expressions are given in Equation (38) and (41) in the appendix.

Since the reward is bounded, we can choose an appropriate , and readily infer from Theorem 1 the following expectation bound on the regret. The proof is omitted.

Theorem 2. Under the same assumptions as Theorem 1, the regret of Algorithm 2 satisfies

Remark 3. For the lower bound of the regret, consider the following I problem instance (equal to the number of arms): In instance i, let = 0.5 + for all m while = 0.5 for all m and . Such structure makes sure that the oracle policy simply pulls one arm without the need to infer the state. Since the problem reduces to the classic MAB, the regret is at least ) in this case. Note that the setup of the instances may violate Assumption 2, but this can be easily fixed by introducing an arbitrarily small perturbation to remains an open problem if it can be attained.

5 Analysis: Proof of Theorem 1

5.1 Preliminaries

Recall the belief MDP discussed in Section 2.2 and the Bayesian updating of the belief state in (3). We first give a lemma to show that the impact of the initial belief decays exponentially. The proof follows from Proposition 4.3.26 in Capp et al. [2005], so it is omitted here.

Lemma 1. Suppose Assumption 3 holds. Let be two different initial beliefs, then for arbitrary reward-action sequence be the corresponding beliefs at time t, we have

Based on Lemma 1, we next show the existence of the optimal policy and the boundedness of the bias function for the Bellman equation (5) for the belief MDP. The proof is given in the appendix.

Lemma 2. Under Assumptions 2 and 3, given and P, there exists a function and a constant such that the Bellman optimality equation holds:

Moreover, there exists a stationary optimal policy which maps a belief state to an arm to pull.

Based on the above two lemmas, we next show a uniform bound on the span of where is the bias function satisfying (11) in the optimistic belief MDP in episode k.

Proposition 2. Suppose Assumptions 2 and 3 hold, and (satisfies the Bellman equation (11). Then there exists a constant is bounded by D uniformly in k.

An explicit expression of D can be found in equation (25) in the appendix. The next result controls the error in the belief state.

Proposition 3. Suppose Assumption 3 holds. Given estimators (), for an arbitrary reward-action sequence , let and be the corresponding beliefs in period t under () and (), respectively. Then there exists constants

where

with are the maximum and minimum element of the matrix , respectively.

Recall that in Algorithm 2, we re-compute the belief after re-estimating the parameters in each exploration phase. If a small error in the parameter estimation may propagate over time and cause the belief to deviate from the true belief, then the algorithm cannot perform well in the exploitation phase. Proposition 3 guarantees that the error doesn’t accumulate and as a result, the regret incurred in the exploitation phase is connected to the estimation error. The proof is similar to the proof of Proposition 3 in De Castro et al. [2017] if we consider a Hidden Markov model where the reward-action pair () is the observation, so we omit details.

5.2 Proof of Theorem 1

We provide a road map for the proof of Theorem 1, while leave the complete proof in Section E in the appendix. Our approach builds on the regret analysis of UCRL2 algorithms Jaksch et al. [2010] while heavily leverages the analysis of POMDPs and the results in Section 5.1.

Recall the definition of regret in (1), it can be rewritten as:

where represents taking expectation with respect to the reward under the true environment parameter and policy is the filtration generated by the arms pulled and rewards received under policy to time

First, applying the Azuma-Hoeffding inequality [Azuma, 1967] for martingales with bounded increments, we have the following result for the second term of equation (12):

Next we bound the first term of (12). One can show that

where ¯c(b, i) given in (2) denotes the expected reward of arm i given belief state b. Let K be the total number of episodes up to time T. For each episode k = 1, let denote the exploration and exploitation phases respectively. Then, we can split the summation in equation (13) according to these two

phases and bound the regret. We proceed with the following three steps to bound (13). Step 1: Bounding the regret in exploration phases The regret in exploration phases can be simply bounded by:

We can obtain that the per-episode regret in the exploration phase is O(1). Because each exploration phase is of constant length (), the regret incurred in the exploration phases is proportional to the total number of episodes K up to T.

A failure event means the true POMDP environment () does not lie in confidence regions ) for

some episode k. By the choice , one can show that the probability of failure events is at most

. Next we consider the success event where the set of plausible POMDPs defined by confidence regions ) contains the true POMDP model. Note in the beginning of exploitation phase k, SEEU chooses an optimistic POMDP denoted by that has the largest average reward among the plausible POMDPS, and we denote its corresponding reward matrix, value function, and the optimal average reward by

respectively. Then when “success” events happen we have for any episode k. It follows that

where ¯) denotes the expected reward of arm given belief state which is updated based on the parameters () from episode k. Using Bellman equation (11) for the optimistic POMDP and Proposition 2, one can bound the first term of (14) by

where (is the th column vector of the matrix . Using the definition of ¯c(b, i) in (2), the second term of (14) can be upper bounded by

Applying Proposition 3 to bound and using Proposition 1 to bound events, one can infer that (14) is bounded by

i.e., K. A simple calculation of

6 Numerical Experiment

We compare our algorithm (SEEU) with classical MAB algorithms including UCB, -greedy (= 0.1), and greedy algorithms (see, e.g., Bubeck and Cesa-Bianchi [2012]). For illustrations, we consider a 2-state, 2-arm setting with a transition probability matrix P =

the random reward follows a Bernouli distribution with mean when the underlying state is m and arm i is pulled.

In Figure 1, we plot the expected regret as a function of time horizon T of the four algorithms in log-log scale. As we can see from Figure 1, the slope of UCB, -greedy, and greedy algorithms are all close to one, suggesting that they incur linear regrets. This is expected, because they don’t take into account the hidden states and treat the problem as a stationary one. On the other hand, the slope of of the curve for the regret of the SEEU algorithm is close to 2/3. This is consistent with our theoretical result (Theorem 2). It also demonstrates the effectiveness of the SEEU algorithm, particularly when the horizon length T is relatively large.

Figure 1: Regret performances of four different algorithms

7 Conclusion and Future Work

In this paper, we study a non-stationary MAB model with Markovian regime-switching rewards. We propose an efficient learning algorithm that integrates spectral estimators for hidden Markov Models and upper confidence methods from reinforcement learning. We also establish a regret bound of order of for the learning algorithm. As far as we know, this is the first algorithm with sublinear regret for MAB with unobservable regime switching. It would be interesting to find out whether the regret bound nearly-optimal. It is also an open question whether the spectral method can be applied to samples generated from different policies while maintaining the theoretical guarantees. We leave them for future work.

References

S. Agrawal and R Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.

A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden markov models. In Conference on Learning Theory, pages 33–1, 2012.

A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.

O. Atan, C. Tekin, and M. van der Schaar. Global bandits. IEEE transactions on neural networks and learning systems, 29(12):5798–5811, 2018.

P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56, 2007.

Y. Aviv and A. Pazgal. A partially observed markov decision process for dynamic pricing. Management Science, 51(9):1400–1416, 2005.

K. Azizzadenesheli, A. Lazaric, and A. Anandkumar. Reinforcement learning of POMDPs using spectral methods. arXiv preprint arXiv:1602.07764, 2016.

K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.

D. P. Bertsekas. Dynamic programming and stochastic control. New York, NY, USA: Academic Press, 1976.

O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in neural information processing systems, pages 199–207, 2014.

O. Besbes, Y. Gur, and A. Zeevi. Optimal exploration-exploitation in a multi-armed-bandit problem with non-stationary rewards. Stochastic Systems, 2019.

D. Bouneffouf and I. Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040, 2019.

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. , 5(1):1–122, 2012.

O. Capp, E. Moulines, and T. Rydn. Inference in hidden Markov models. New York, NY, USA: Springer-Verlag, 2005.

W. C. Cheung, D. Simchi-Levi, and R. Zhu. Learning to optimize under non-stationarity. arXiv preprint arXiv:1810.03024, 2018.

Y. De Castro, E. Gassiat, and S. Le Corff. Consistent estimation of the filtering and marginal smoothing distributions in nonparametric hidden markov models. IEEE Transactions on Information Theory, 63(8): 4758–4777, 2017.

A. V. den Boer. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1–18, 2015.

T. Fiez, S. Sekar, and L. J. Ratliff. Multi-armed bandits for correlated markovian environments with smoothed reward feedback. arXiv preprint arXiv:1803.04008, 2019.

A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188, Berlin, Heidelberg, 2011. Springer.

S. Guha, K. Munagala, and P. Shi. Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58(1):3, 2010.

S. Gupta, G. Joshi, and O. Ya˘gan. Correlated multi-armed bandits with a latent random source. arXiv preprint arXiv:1808.05904, 2018.

K. Hinderer. Lipschitz continuity of value functions in markovian decision processes. Mathematical Methods of Operations Research, 62(1):3–22, 2005.

T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

N. B. Keskin and A. Zeevi. Chasing demand: Learning and earning in a changing environment. Mathematics of Operations Research, 42(2):277–307, 2016.

W. Ketter, J. Collins, M. Gini, A. Gupta, and P. Schrater. Detecting and forecasting economic regimes in multi-agent automated exchanges. Decision Support Systems, 47(4):307318, 2009.

V. Krishnamurthy. Partially observed Markov decision processes. Cambridge University Press, 2016.

K. Lakshmanan, R. Ortner, and D. Ryabko. Improved regret bounds for undiscounted continuous reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 524–532, 2015.

T. Lattimore and R. Munos. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages 550–558, 2014.

T. Lattimore and C. Szepesv´ari. Bandit algorithms. preprint, 2018.

L. Lehricy. Consistent order estimation for nonparametric hidden markov models. Bernoulli, 25(1):464–498, 2019.

N. Levine, K. Crammer, and S. Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, pages 3074–3083, 2017.

O. A. Maillard and S. Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144, 2014.

Rogemar S Mamon and Robert J Elliott. Hidden Markov models in finance, volume 4. Springer, 2007.

A. J. Mersereau, P. Rusmevichientong, and J. N. Tsitsiklis. A structured multiarmed bandit problem and the greedy policy. IEEE Transactions on Automatic Control, 54(12):2787–2802, 2009.

R. Ortner and D. Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems, pages 1763–1771, 2012.

R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless markov bandits. Theoretical Computer Science, 558:62–76, 2014.

M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 2014.

J. Qian, R. Fruit, M. Pirotta, and A. Lazaric. Exploration bonus for regret minimization in undiscounted discrete and continuous markov decision processes. arXiv preprint arXiv:1812.04363, 2018.

A. Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.

A. Slivkins and E. Upfal. Adapting to a changing environment: the brownian restless bandits. In Conference on Learning Theory, pages 343–354, 2008.

L. Wei and V. Srivatsva. On abruptly-changing and slowly-varying multiarmed bandit problems. In 2018 Annual American Control Conference (ACC), pages 6291–6296, 2018.

M. Yemini, A. Leshem, and A. Somekh-Baruch. Restless hidden markov bandits with linear rewards. arXiv preprint arXiv:1910.10271, 2019.

Online Appendix

Table of Notations

Table 1: Summary of notations

A Remark on the deﬁnition of regret in (1)

Note that we use as the benchmark evaluating the regret of any learning policy, where is the optimal long-run average reward for the POMDP or belief MDP with known parameters (state unobserved) . One may argue that a finite-horizon formulation is more reasonable as the learning problem has a finite horizon T. In fact, next we show that the infinite-horizon approximation incurs negligible error compared to the regret. Let be the optimal value function for the t-step belief MDP discussed in Section 2.2, the maximal cumulative reward achievable with t periods remaining given the initial belief. The Bellman equation for satisfies:

As shown in Lattimore and Szepesv´ari [2018], for any initial state span(v) := max) denotes the range of v(b) introduced in (5). This means that if we replace ) in the definition of the regret in 1, the regret is off by span(v) which is upper bounded by a constant D as shown in Proposition 2. Because the regret, as we have seen, scales , the constant-gap approximation of doesn’t significantly change the magnitude.

B Proof of Proposition 1

Proof of Proposition 1. We consider the action-reward pair () as our observation of the underlying state . We encode the pair (r, i) into a variable through a suitable one-to-one mapping. We rewrite our observable random vector () as a random variable . Hence we can define the following matrix

for . It follows from Lemma 5, Lemma 8 and Theorem 3 in Azizzadenesheli et al. [2016] that the spectral estimators ˆhave the following guarantee: pick any 1), when the number of samples N with arbitrary initial state distribution satisfies

then with probability 1

for up to permutation, with

where is the smallest nonzero singular value of the covariance matrix = min) represents the smallest nonzero singular value of the matrix are the mixing rate parameters such that

where ) denotes the probability distribution vector of the underlying state starting from initial state

C Proof of Lemma 2

Proof of Lemma 2. Let ) be the value function of the discounted version of the POMDP with discount factor and initial state ) to be the value function deviation from a fix point s, so called the bias function.

We resort to Proposition 2 in Bertsekas [1976] on page 335, which connects the undiscounted problem to the limit of discounted problem. For the discounted problems, when the discount factor varies in (0, 1), if the difference of the value function is bounded , then the value function of the undiscounted problem is the limit of discounted problem as the discount factor is vanishing:

Lemma 3 (Bertsekas [1976]). Assume that there exists a constant L such that for some state

Then the following results hold:

From definition of Lipschitz module (15), we have:

factor . Standard theory in dynamic programming states that

With the following lemma, we can show ) is bounded by the sequence ). The proof is straightforward is thus omitted.

be functions on the same metric space (converges pointwisely to v for

Thus, to bound the left-hand side in (16), it suffices to bound the Lipschitz module We define the n-step observation kernel ¯) is a probability measure on

Recall that H is the forward kernel defined in Equation (3), we can define the n-step forward kernel

and we define ¯) to be the probability measure of steps from the initial state b:

So we have:

Consider the Bellman equation for a finite horizon MDP problem with discounted factor and length We have

Next we derive ) from the above equation, We introduce the Lipschitz module of the transition kernel ) be the Kantorovich metric of two probability measure

where Lip) is the set of functions on B with Lipschitz module 1. Then the Lipschitz module of the transition kernel ) is defined as:

We introduce the Kantorovich metric as the distance measure because the Lipschitz module of the integral in equation (18) can be bounded by the following lemma (Lemma 3.3 Hinderer [2005]):

Lemma 5. Let f be a function on be a mapping from a state into a probability measure ¯be defined as above, there holds

The next lemma (Lemma 2.1(c) [Hinderer, 2005]) shows that the Lipschitz module is a seminorm on the vector space of Lipschitz functions on B.

Lemma 6. On the vector space of functions on the metric space (, the Lipschitz module following properties:

Moreover, with the following lemma (Lemma 3.2(a), [Hinderer, 2005]), we can change the order of the operator of taking “supreme” and “Lipschitz module”.

Lemma 7. Consider a function is finite for all b. Let

Applying Lemma 5, Lemma 6 and Lemma 7 to (18), we have:

where

, there exists an

The proof will be shown later. Next construct a subsequence with and apply the above inequality, (19) implies that

Keep iterating over . It is easy to see that

Noting the fact that 1, we have

Combined with inequalities (16) and (17), we obtain:

That is say, we have verified the assumption of Lemma 3.

Thus, applying this crucial lemma, there exists a constant and a function v satisfy equation (11). Applying Proposition 1 in Bertsekas [1976] page 332, we can get if ) attains the maximum of equation (11) for , then the stationary policy is optimal for average reward problem.

Proof of Lemma 8. Rewriting the Kantorovich metric, we have:

The last equality follows from the following argument. Note that the span of a function f with Lipschitz module 1 is bounded by Diam(B). So for any we can find a constant c that Moreover, let ) = ) ¯) ¯, we know ) = ) for any constant c. Without loss of generality, we can constrain 2. Note that Diam(B) = sup

where the first inequality follows from , and the second inequality can be obtained by H¨older’s inequality with implies that the Lipschitz module of the second term of (21) decays exponentially fast as n goes to infinity. It remains to bound the first term (21).

Recall that b defines the initial pobability distribution steps observation kernel is

Define a vector

where the next to last equality follows from Next we bound maxbe the space of the sample path of the initial state

For any two initial states , we can bound the likelihood ratio:

Note that the last expression only depends on ) and doesn’t depend n. It implies that there exists 0 independent n, such that:

for any

where are the maximumn and minimum element of respectively, and are the maximumn and minimum element of P respectively. Write ). We have:

Thus, we have proved that for such

D Proof of Proposition 2

Proof of Proposition 2. In the proof of Lemma 2, we have verified the assumption of Lemma 3. Applying part (2) of Lemma 3, we have that there exists a sequence

Following the same analysis in the proof of Lemma 2, we have that for given

Thus, from (16), we can obtain

and it can be verified

where are the maximum and minimum of the matrix respectively,

where are the maximum and minimum of the matrix P respectively,

Recall that . For some given positive constants ∆we can choose the length of the exploration period , so that each element of () in the confidence interval in (10) in each episode satisfy:

Then span() is uniformly bounded over k by a constant D, where

E Proof of Theorem 1

In this section we provide a complete version of the proof of Theorem 1 sketched in Section 5.2.

Proof of Theorem 1. We need the following result, the proof of which is deferred to the end of this section.

Proposition 4. Let K be the number of total episodes up to time T. For each episode k = 1, let be the index set of the kth exploitation phase. Then with probability at most

where the expectation is taken respect to the true parameters and P under policy , and the filtration is defined as

Based on Proposition 4, next we prove Theorem 1. We follow our prior simplication that random reward follows the Bernoulli distribution. Here we want to clarify two notations in advance. means the expectation is taken respect to true mean reward matrix and transition probabilities P, and denotes that the underlying parameters are estimators

Recalling the definition of regret in (1), it can be rewritten as:

We first bound the second term of (26), i.e. the total bias between the conditional expectation of reward and the realization. Define a stochastic process

then the second term in (26) is . It is easy to see that is a martingale. Moreover, due to the Bernoulli distribution of

Applying the Azuma-Hoeffding inequality [Azuma, 1967], we have

Next we bound the first term of (26). Recall that the definition of belief state under the optimistic and true parameters: ), and the definition of reward functions with respect to the true belief state ¯) = ). We can also define the reward functions with respect to the optimistic belief state

where (-th column vector of the reward matrix , respectively. Then we can rewrite the first term of (26):

where the first equation is due to the tower property and the fact that and are conditionally independent given

Let K be the number of total episodes. For each episode k = 1, let be the exploration and exploitation phases, respectively. Then we can split equation (29) to the summation of the bias in these two phases as:

(, as it may end at period T.

We bound it by separating into “success” and “failure” events below. Recall that in episode k, we define the set of plausible POMDPs ), which is defined in terms of confidence regions ) around the estimated mean reward matrix and the transition probabilities . Then choose an optimistic POMDP ) that has the optimal average reward among the plausible POMDPS and denote its corresponding reward matrix, value function, and the optimal average reward by , respectively. Thus, we say a “success” event if and only if the set of plausible POMDPs ) contains the true POMDP G. In the following proof, we omit the dependence on ) for simplicity. From Algorithm 2, the confidence level of in episode , we can obtain:

Thus, with probability at least 1 , “success” events happen. It means the optimal average reward of the optimistic POMDP from the set . Then we can bound the regret of “success” events in exploitation phases as follows.

To bound the first term of formula (32), we use the Bellman optimality equation for the optimistic belief MDP on the continous belief state space B:

where ¯) = ) means transition probability of the belief state conditional on pulled arm under estimated reward matrix and transition matrix of underlying Markov chain

Moreover, we note that if value function satisfies the Bellman equation (5), then so is . Thus, without loss of generality, we assume that needs to satisfy 2. Then from Proposition 2, suppose the uniform bound of span(, then we have:

Thus the first term of (32) can be bounded by

where recall that ¯) in the second equality are the belief state transition probabilities under estimated and true parameters, respectively. And the last inequality is applying the H¨older’s inequality. For the first term of (34), we have

where the first term in the last equality is due to the telescoping from + 1 to , the start and end of the exploitation phase in episode k. The second term in the last equality is because:

Applying Proposition 2, we have

Applying Proposition 4, with probability at least 1

Thus, the first term of (34) can be upper bounded by:

For the second term of (34), we note that ¯) is zero except for two points where are excatly the Bayesian updating after receiving an observation of Bernouli reward taking value 0 or 1. Thus, we have the following transition kernel:

and

Plug these two equations in the second term of equation (34), it equals to:

where the inequality follows from the facts that ) is a probability mass function and

Summing up (35) and (36), the first term of formula (32) can be bounded by

then from the H¨older’s inequality, we can further bounded the right hand side of above formula to

If after episode , then from Proposition 1, under the confidence level

Together with the fact that the matrix norms satisfy , can be bounded by

Step 3: Summing up the regret Combining (31) and (39), we can get that with probability at least 1 , the regret is bounded by

Finally, combining (27) and (40), we can see that with probability at least 1 , the regret presented in (26) can be bounded by

Note that

where S = 2I, and

where the maximum and minimum element of the matrix respectively.

E.1 Proof of Proposition 4

Proof of Proposition 4. Let K be the number of total episodes. For each episode the index set of the kth exploration phase, and be the set of all exploitation periods among the horizon T. For an arbitrary time , which means the number of exploitation periods up to time t. Define a stochastic process

where mean the corresponding episode and period of jth exploitation, respectively.

We first show that is a martingale. Note that . It remains to show holds, i.e., ] = 0. Note that

where the last equality is due to and then applying the tower property of conditional expectations. Therefore, is a martingale for any given policy . Moreover, by Proposition 2, we have

Note that ¯). Thus, setting we can obtain

Moreover, we remark here that the length of the last exploitation phase , as it may end at period T.