b

DiscoverSearch
About
My stuff
A Hoeffding Inequality for Finite State Markov Chains and its Applications to Markovian Bandits
2020·arXiv
Abstract
Abstract

This paper develops a Hoeffding inequality for the partial sums �nk=1 f(Xk), where{Xk}k∈Z>0is an irreducible Markov chain on a finite state space  S, and f : S →[a, b] is a real-valued function. Our bound is simple, general, since it only assumes irreducibility and finiteness of the state space, and powerful. In order to demonstrate its usefulness we provide two applications in multi-armed bandit problems. The first is about identifying an approximately best Markovian arm, while the second is concerned with regret minimization in the context of Markovian bandits.

Let  {Xk}k∈Z>0be a Markov chain on a finite state space S, with initial distribution q, and irreducible transition probability matrix P, governed by the probability law  Pq. Let π beits stationary distribution, and  f : S → [a, b] be a real-valued function on the state space. Then the strong law of large numbers for Markov chains asserts that,

image

Moreover, the central limit theorem for Markov chains provides a rate for this convergence,

image

where  σ2 = limn→∞1n varq (�nk=1 f(Xk)) is the limiting variance. Those asymptotic results are insufficient in many applications which require finite-sample estimates. One of the most central such application is the convergence of Markov chain

Monte Carlo (MCMC) approximation techniques [Metropolis et al., 1953], where a finite-sample estimate is needed to bound the approximation error. Further applications include theoretical computer science and the approximation of the permanent [Jerrum et al., 2001], as well as statistical learning theory and multi-armed bandit problems [Moulos, 2019].

Motivated by this discussion we provide a finite-sample Hoeffding inequality for finite Markov chains. In the special case that the random variables  {Xk}k∈Z>0are independent and identically distributed according to  π, Hoeffding’s classic inequality [Hoeffding, 1963] states that,

image

where  ν2 = 14n(b−a)2. In Theorem 1 we develop a version of Hoeffding’s inequality for finite state Markov chains. Our bound is very simple and easily computable, since it is based on martingale techniques and it only involves hitting times of Markov chains which are very well studied for many types of Markov chains [Aldous and Fill, 2002]. It is worth mentioning that our bound is based solely on irreducibility, and it does not make any extra assumptions like aperiodicity or reversibility which prior works require.

There is a rich literature on finite-sample bounds for Markov chains. One of the earliest works [Davisson et al., 1981] uses counting and a generalization of the method of types, in order to derive a Chernoff bound for Markov chains which are irreducible and aperiodic. An alternative approach [Watanabe and Hayashi, 2017, Moulos and Anantharam, 2019], uses the theory of large deviations to derive sharper Chernoff bounds. When reversibility is assumed, the transition probability matrix is symmetric with respect to the space  L2(π), which enables the use of matrix perturbation theory. This idea leads to Hoeffding inequalities that involve the spectral gap of the Markov chain and was initiated in [Gillman, 1993]. Refinements of this bound were given in a series of works [Dinwoodie, 1995, Kahale, 1997, Lezaud, 1998, Le´on and Perron, 2004, Miasojedow, 2014]. In [Paulin, 2015, Rao, 2019, Fan et al., 2018] a generalized spectral gap is introduced in order to obtain bounds even for a certain class of irreversible Markov chains as long as they posses a strictly positive generalized spectral gap. Information-theoretic ideas are used in [Kontoyiannis et al., 2006] in order to derive a Hoeffding inequality for Markov chains with general state spaces that satisfy Doeblin’s minorization condition, which in the case of a finite state space can be written as,

image

Of course there are irreducible transition probability matrices P for which (1) fails, but if we further assume aperiodicity, then (1) is satisfied. Our approach uses Doob’s martingale combined with Azuma’s inequality, and is probably closest related to the work in [Glynn and Ormoneit, 2002], where a bound for Markov chains with general state spaces is established using Dynkin’s martingale. But the result in [Glynn and Ormoneit, 2002] heavily relies on the Markov chains satisfying Doeblin’s condition (1). A regeneration approach for uniformly ergodic Markov chains, where one splits the Markov chain in i.i.d. blocks, and reduces the problem to the concentration of an i.i.d. process, can be found in [Douc et al., 2011].

Another line of research is related to the concentration of a function of n random variables around its mean, under Markovian or other dependent structures. This was pioneered

by the works of Marton [Marton, 1996a, Marton, 1996b, Marton, 1998] who used the transportation method, and further developed using coupling ideas in [Samson, 2000, Marton, 2003, Chazottes et al., 2007, Kontorovich and Ramanan, 2008, Paulin, 2015].

To illustrate the applicability of our bound we use it to study two Markovian multi-armed bandit problems. The stochastic multi-armed bandits problem is a prototypical statistical learning problem that exhibits an exploration-exploitation trade-off. One is given multiple options, referred to as arms, and each of them associated with a probability distribution. The emphasis is put on focusing as quickly as possible on the best available option, rather than estimating with high confidence the statistics of each option. The cornerstone of this field is the pioneering work of Lai and Robbins [Lai and Robbins, 1985]. Here we study two variants of the multi-armed bandits problem where the probability distributions of the arms form Markov chains. First we consider the task of identifying with some fixed confidence an approximately best arm, and we use our bound to analyze the median elimination algorithm, originally proposed in [Even-Dar et al., 2006] for the case of i.i.d. bandits. Then we turn into the problem of regret minimization for Markovian bandits, where we analyze the UCB algorithm that was introduced in [Auer et al., 2002] for i.i.d. bandits. For a thorough introduction to multi-armed bandits we refer the interested reader to the survey [Bubeck and Cesa-Bianchi, 2012], and the books [Lattimore and Szepesv´ari, 2019, Slivkins, 2019].

image

The central quantity that shows up in our Hoeffding inequality, and makes it differ from the classical i.i.d. Hoeffding inequality, is the maximum hitting time of a Markov chain with an irreducible transition probability matrix P. This is defined as,

image

where  Ty = inf{n ≥ 0 : Xn+1 = y}is the number of transitions taken in order to visit state y for the first time. HitT(P) is ensured to be finite due to irreducibility and the finiteness of the state space.

Theorem 1. Let {Xk}k∈Z>0be a Markov chain on a finite state space S, driven by an initial distribution q, and an irreducible transition probability matrix  P. Let f : S → [a, b] be areal-valued function. Then, for any t > 0,

image

Proof. We define the sums  Sl,m = f(Xl)+. . .+f(Xm), for 1 ≤ l ≤ m ≤ n, and the filtration

image

a zero mean martingale with respect to  {Fk}nk=0, and let ∆k = E(S1,n | Fk) − E(S1,n | Fk−1),for k = 1, . . . , n, be the martingale differences.

We first note the following bounds on the martingale differences,

image

Therefore, in order to bound the variation of ∆kit suffices to control,

image

where in the first equality we used the Markov property, and in the second the timehomogeneity. We now use a hitting time argument. Observe the following pointwise statements,

image

from which we deduce that,

image

Taking  E[· | X1 = x]-expectations, and using the strong Markov property we obtain,

image

Therefore,

image

With this in our possession we apply Hoeffding’s lemma, see for instance Lemma 2.3 in [De- vroye and Lugosi, 2001], in order to get,

image

Using Markov’s inequality, and successive conditioning we obtain that for  θ > 0,

image

Plugging in  θ = t/ν2, we see that,

image

The conclusion follows by combining the inequality above for  f and −f.

Example 1. Consider a two-state Markov chain with S = {0, 1} and P(0, 1) = p, P(1, 0) =

image

and Theorem 1 takes the form,

image

Example 2. Consider the random walk on the m-cycle with state space  S = {0, 1, . . . , m −1}, and transition probability matrix  P(x, y) = (1{y ≡ x + 1 (mod m)} + 1{y ≡ x − 1(mod m)})/2. If m is odd, then the Markov chain is aperiodic, while if m is even, then the Markov chain has period 2. Then,

image

and Theorem 1 takes the form,

image

Remark 1. Observe that the technique used to establish Theorem 1 is limited to Markov chains with a finite state space S. Indeed, if  {Xk}k∈Z>0is a Markov chain on a countably infinite state space S with an irreducible and positive recurrent transition probability matrix P and a stationary distribution  π, then we claim that,

image

from which it follows that supx,y∈S E[Ty | X1 = x] = ∞, due to the fact that �y∈S π(y) = 1and S is countably infinite. The aforementioned inequality can be established as follows.

image

Moreover, through Theorem 1 we can obtain a concentration inequality for sums of a function evaluated on the transitions of a Markov chain. In particular, let

image

On the state space  S(2) define the transition probability matrix,

image

It is straightforward to verify that the fact that P is irreducible, implies that  P (2) is irre-ducible as well. This readily gives the following theorem.

Theorem 2. Let {Xk}k∈Z>0be a Markov chain on a finite state space S, driven by an initial distribution q, and an irreducible transition probability matrix  P. Let f (2) : S(2) → [a, b] be areal-valued function evaluated on the transitions of the Markov chain. Then, for any t > 0,

image

Corollary 1. When the Markov chain is initialized with its stationary distribution,  π, The-orem 1 and Theorem 2 give the following nonasymptotic versions of the weak law of large numbers for irreducible Markov chains. For any  ϵ > 0,

image

and,

image

3.1 Setup

There are  K ≥2 arms, and each arm  a ∈ [K] = {1, . . . , K}is associated with a parameter θa ∈ Rwhich uniquely encodes1 an irreducible transition probability matrix  Pθa. We willdenote the overall parameter configuration of all  K arms with θθθ = (θ1, . . . , θK) ∈ RK. Arma evolves according to the stationary Markov chain,  {Xan}n∈Z>0, driven by the irreducible transition probability matrix  Pθawhich has a unique stationary distribution  πθa, so thatXa1 ∼ πθa. There is a common reward function  f : S → [c, d] which generates the reward process  {Y an }n∈Z>0 = {f(Xan)}n∈Z>0. The reward process, in general, is not going to be a Markov chain, unless f is injective, and it will have more complicated dependencies than the underlying Markov chain. Each time that we select arm a, this arm evolves by one transition and we observe the corresponding sample from the reward process  {Y an }n∈Z>0, while all the other arms stay rested.

The stationary reward of arm  a is µ(θa) = �x∈S f(x)πθa(x). Let µ∗(θθθ) = maxa∈[K] µ(θa)be the maximum stationary mean, and for simplicity assume that there exists a unique arm, a∗(θθθ), attaining this maximum stationary mean, i.e.  {a∗(θθθ)} = arg maxa∈[K] µ(θa). In thefollowing sections we will consider two objectives: identifying an  ϵbest arm with some fixed confidence level  δusing as few samples as possible, and minimizing the expected regret given some fixed time horizon T.

3.2 Approximate Best Arm Identification

In the approximate best arm identification problem, we are given an approximation accuracy ϵ >0, and a confidence level  δ ∈ (0,1). Our goal is to come up with an adaptive algorithm A which collects a total of N samples, and returns an arm ˆa that is within  ϵfrom the best arm,  a∗(θθθ), with probability at least 1  − δ, i.e.

image

Such an algorithm is called (ϵ, δ)-PAC (probably approximately correct).

In [Mannor and Tsitsiklis, 2004] a lower bound for the sample complexity of any (ϵ, δ)-PAC algorithm is derived. The lower bound states that no matter the (ϵ, δ)-PAC algorithm A, there exists an instance  θθθsuch that the sample complexity is at least,

image

A matching upper bound is provided for i.i.d. bandits in [Even-Dar et al., 2006] in the form of the median elimination algorithm. We demonstrate the usefulness of our Hoeffding inequality, by providing an analysis of the median elimination algorithm in the more general setting of Markovian bandits.

Theorem 3. If β ≥ 12(d−c)2 maxa∈[K] HitT(Pθa)2 then, the β-Median-Elimination algorithm is (ϵ, δ)-PAC, and its sample complexity is upper bounded by  O� Kϵ2 log 1δ�.

Proof. The total number of sampling rounds is at most  ⌈log2 K⌉, and we can set them equal to  ⌈log2 K⌉by setting  Ar = {ˆa}, for r ≥ R0, where AR0 = {ˆa}. Fix r ∈ {1, . . . , ⌈log2 K⌉}.We claim that,

image

We condition on the value of  Ar. If |Ar|= 1, then the claim is trivially true, so we only consider the case  |Ar| ≥ 2. Let µ∗r = maxa∈Ar µ(θa), and a∗r ∈ arg maxa∈Ar:µ(θa)=µ∗r ¯Ya[r]. Weconsider the following set of bad arms,

image

image

and observe that,

image

In order to upper bound the latter fix  b ∈ Arand write,

image

where in the last inequality we used Corollary 1. Now via Markov’s inequality this yields,

image

Furthermore, Corollary 1 gives that for any  a ∈ Ar,

image

We obtain (2) by using (4) and (5) in (3). With (2) in our possession, the fact that median elimination is (ϵ, δ)-PAC follows through a union bound,

image

Regarding the sample complexity, we have that the total number of samples is at most,

image

3.3 Regret Minimization

Our device to solve the regret minimization problem is an  adaptive allocation rule, φφφ ={φt}t∈Z>0, which is a sequence of random variables where  φt ∈ [K] is the arm that we select at time  t. Let Na(t) = �ts=1 I{φs=a}, be the number of times we selected arm a up to time t. Our decision,  φt, at time tis based on the information that we have accumulated so far. More precisely, the event  {φt = a}is measurable with respect to the  σ-field generated by the past decisions  φ1, . . . , φt−1, and the past observations  {X1n}N1(t−1)n=1 , . . . , {XKn }NK(t−1)n=1 .Given a time horizon T, and a parameter configuration  θθθ, the expected regret incurred when the adaptive allocation rule  φφφis used, is defined as,

image

where ∆b(θθθ) = µ∗(θθθ) − µ(θb). Our goal is to come up with an adaptive allocation rule that makes the expected regret as small as possible.

There is a known asymptotic lower bound on how much we can minimize the expected regret. Any adaptive allocation rule that is uniformly good across all parameter configurations should satisfy the following instance specific, asymptotic regret lower bound (see [Anan- tharam et al., 1987] for details),

image

where  D (θ ∥ λ) is the Kullback-Leibler divergence rate between the Markov chains with transition probability matrices  Pθ and Pλ, given by,

image

Here we utilize our Theorem 1 to provide a finite-time analysis of the  β-UCB adaptive allocation rule for Markovian bandits, which is order optimal. The  β-UCB adaptive allocation rule, is a simple and computationally efficient index policy based on upper confidence bounds which was initially proposed in [Auer et al., 2002] for i.i.d. bandits. It has already been studied in the context of Markovian bandits in [Tekin and Liu, 2010], but in a more restrictive setting under the further assumptions of aperiodicity and reversibility due to the

image

use of the bounds from [Gillman, 1993, Lezaud, 1998]. For adaptive allocation rules that asymptotically match the lower bound we refer the interested reader to [Anantharam et al., 1987, Moulos, 2020].

image

Proof. Fix b ̸= a∗(θθθ), and observe that,

image

On the event�φt+1 = b, Nb(t) ≥ 8β∆b(θθθ)2 log T�, we have that, either ¯Yb(t) ≥ µ(θb)+�

or ¯Ya∗(θθθ)(t) ≤ µ∗(θθθ) −�

image

Similarly we can see that,

image

The conclusion now follows by putting everything together and using the integral estimate,

image

This research was supported in part by the NSF grant CCF-1816861.

[Aldous and Fill, 2002] Aldous, D. and Fill, J. (2002). Reversible markov chains and random walks on graphs.

[Anantharam et al., 1987] Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymp- totically efficient allocation rules for the multiarmed bandit problem with multiple plays. II. Markovian rewards. IEEE Trans. Automat. Control, 32(11):977–982.

[Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn., 47(2-3):235–256.

[Bubeck and Cesa-Bianchi, 2012] Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122.

[Chazottes et al., 2007] Chazottes, J.-R., Collet, P., K¨ulske, C., and Redig, F. (2007). Con- centration inequalities for random fields via coupling. Probab. Theory Related Fields, 137(1-2):201–225.

[Davisson et al., 1981] Davisson, L. D., Longo, G., and Sgarro, A. (1981). The error expo- nent for the noiseless encoding of finite ergodic Markov sources. IEEE Trans. Inform. Theory, 27(4):431–438.

[Devroye and Lugosi, 2001] Devroye, L. and Lugosi, G. (2001). Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York.

[Dinwoodie, 1995] Dinwoodie, I. H. (1995). A probability inequality for the occupation measure of a reversible Markov chain. Ann. Appl. Probab., 5(1):37–43.

[Douc et al., 2011] Douc, R., Moulines, E., Olsson, J., and van Handel, R. (2011). Con- sistency of the maximum likelihood estimator for general hidden Markov models. Ann. Statist., 39(1):474–513.

[Even-Dar et al., 2006] Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elim- ination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079–1105.

[Fan et al., 2018] Fan, J., Jiang, B., and Sun, Q. (2018). Hoeffding’s lemma for Markov Chains and its applications to statistical learning.

[Gillman, 1993] Gillman, D. (1993). A Chernoff bound for random walks on expander graphs. In 34th Annual Symposium on Foundations of Computer Science (Palo Alto, CA, 1993), pages 680–691. IEEE Comput. Soc. Press, Los Alamitos, CA.

[Glynn and Ormoneit, 2002] Glynn, P. W. and Ormoneit, D. (2002). Hoeffding’s inequality for uniformly ergodic Markov chains. Statist. Probab. Lett., 56(2):143–146.

[Hoeffding, 1963] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30.

[Jerrum et al., 2001] Jerrum, M., Sinclair, A., and Vigoda, E. (2001). A polynomial-time approximation algorithm for the permanent of a matrix with non-negative entries. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 712–721. ACM, New York.

[Kahale, 1997] Kahale, N. (1997). Large deviation bounds for Markov chains. Combin. Probab. Comput., 6(4):465–474.

[Kontorovich and Ramanan, 2008] Kontorovich, L. and Ramanan, K. (2008). Concentration inequalities for dependent random variables via the martingale method. Ann. Probab., 36(6):2126–2158.

[Kontoyiannis et al., 2006] Kontoyiannis, I., Lastras-Monta˜no, L. A., and Meyn, S. P. (2006). Exponential Bounds and Stopping Rules for MCMC and General Markov Chains. In Proceedings of the 1st International Conference on Performance Evaluation Methodolgies and Tools, valuetools ’06, New York, NY, USA. ACM.

[Lai and Robbins, 1985] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22.

[Lattimore and Szepesv´ari, 2019] Lattimore, T. and Szepesv´ari, C. (2019). Bandit Algorithms.

[Le´on and Perron, 2004] Le´on, C. A. and Perron, F. (2004). Optimal Hoeffding bounds for discrete reversible Markov chains. Ann. Appl. Probab., 14(2):958–970.

[Lezaud, 1998] Lezaud, P. (1998). Chernoff-type bound for finite Markov chains. Ann. Appl. Probab., 8(3):849–867.

[Mannor and Tsitsiklis, 2004] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complex- ity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648.

[Marton, 1996a] Marton, K. (1996a). Bounding d-distance by informational divergence: a method to prove measure concentration. Ann. Probab., 24(2):857–866.

[Marton, 1996b] Marton, K. (1996b). A measure concentration inequality for contracting Markov chains. Geom. Funct. Anal., 6(3):556–571.

[Marton, 1998] Marton, K. (1998). Measure concentration for a class of random processes. Probab. Theory Related Fields, 110(3):427–439.

[Marton, 2003] Marton, K. (2003). Measure concentration and strong mixing. Studia Sci. Math. Hungar., 40(1-2):95–113.

[Metropolis et al., 1953] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.

[Miasojedow, 2014] Miasojedow, B. a. (2014). Hoeffding’s inequalities for geometrically er- godic Markov chains on general state space. Statist. Probab. Lett., 87:115–120.

[Moulos, 2019] Moulos, V. (2019). Optimal Best Markovian Arm Identification with Fixed Confidence. In Advances in Neural Information Processing Systems (NeurIPS), pages 5605–5614.

[Moulos, 2020] Moulos, V. (2020). Finite-Time Analysis of Round-Robin Kullback-Leibler Upper Confidence Bounds for Optimal Adaptive Allocation with Multiple Plays and Markovian Rewards.

[Moulos and Anantharam, 2019] Moulos, V. and Anantharam, V. (2019). Optimal Chernoff and Hoeffding Bounds for Finite State Markov Chains.

[Paulin, 2015] Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Probab., 20:no. 79, 32.

[Rao, 2019] Rao, S. (2019). A Hoeffding inequality for Markov chains. Electron. Commun. Probab., 24:Paper No. 14, 11.

[Samson, 2000] Samson, P.-M. (2000). Concentration of measure inequalities for Markov chains and Φ-mixing processes. Ann. Probab., 28(1):416–461.

[Slivkins, 2019] Slivkins, A. (2019). Introduction to Multi-Armed Bandits. Foundations and Trends R⃝ in Machine Learning, 12(1-2):1–286.

[Tekin and Liu, 2010] Tekin, C. and Liu, M. (2010). Online algorithms for the multi-armed bandit problem with Markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675–1682.

[Watanabe and Hayashi, 2017] Watanabe, S. and Hayashi, M. (2017). Finite-length analysis on tail probability for Markov chain and application to simple hypothesis testing. Ann. Appl. Probab., 27(2):811–845.


Designed for Accessibility and to further Open Science