On Learning Markov Chains

2018·Arxiv

Abstract

Abstract

The problem of estimating an unknown discrete distribution from its samples is a fundamental tenet of statistical learning. Over the past decade, it attracted significant research effort and has been solved for a variety of divergence measures. Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider two problems related to the min-max risk (expected loss) of estimating an unknown k-state Markov chain from its n sequential samples: predicting the conditional distribution of the next sample with respect to the KL-divergence, and estimating the transition matrix with respect to a natural loss induced by KL or a more general f-divergence measure. For the first measure, we determine the min-max prediction risk to within a linear factor in the alphabet size, showing it is and . For the second, if the transition probabilities can be arbitrarily small, then only trivial uniform risk upper bounds can be derived. We therefore consider transition probabilities that are bounded away from zero, and resolve the problem for essentially all sufficiently smooth f-divergences, including KL-, -, Chi-squared, Hellinger, and Alpha-divergences.

1 Introduction

Many natural phenomena are inherently probabilistic. With past observations at hand, probabilistic models can therefore help us predict, estimate, and understand, future outcomes and trends. The two most fundamental probabilistic models for sequential data are i.i.d. processes and Markov chains. In an i.i.d. process, for each , a sample is generated independently according to the same underlying distribution. In Markov chains, for each , the distribution of sample is determined by just the value of

Let us confine our discussion to random processes over finite alphabets, without loss of generality, assumed to be [k] := {1, 2, . . . , k}. An i.i.d. process is defined by a single distribution p over [k], while a Markov chain is characterized by a transition probability matrix . We denote the initial and stationary distributions of a Markov model by and , respectively. For notational consistency let P = (p) denote an i.i.d. model and P = (M) denote a Markov model.

Having observed a sample sequence unknown i.i.d. process or Markov chain, a natural problem is to predict the next sample point . Since is a random

variable, this task is typically interpreted as estimating the conditional probability distribution of the next sample point Let denote the collection of all finite-length sequences over [k].

Therefore, conditioning on , our first objective is to estimate the conditional distribution To be more precise, we would like to find an estimator , that associates with every sequence a distribution that approximates in a suitable sense.

Perhaps a more classical problem is parameter estimation, which describes the underlying process. An i.i.d. process is completely characterized by , hence this problem coincides with the previous one. For Markov chains, we seek to estimate the transition matrix M. Therefore, instead of producing a probability distribution , the estimator maps every sequence to a transition matrix

For two distributions p and q over [k], let L(p, q) be the loss when p is approximated by q. For the prediction problem, we measure the performance of an estimator in terms of its prediction risk,

the expected loss with respect to the sample sequence

For the estimation problem, we quantify the performance of the estimator by estimation risk. We first consider the expected loss of with respect to a single state

We then define the estimation risk of given sample sequence as the maximum expected loss over all states,

While the process P we are trying to learn is unknown, it often belongs to a known collection P. The worst prediction risk of an estimator over all distributions in P is

The minimal possible worst-case prediction risk, or simply the minimax prediction risk, incurred by any estimator is

The worst-case estimation risk and the minimax estimation risk are defined similarly. Given P, our goals are to approximate the minimax prediction/estimation risk to a universal constant-factor, and to devise estimators that achieve this performance.

An alternative definition of the estimation risk, considered in (1) and mentioned by a reviewer, is

We denote the corresponding minimax estimation risk by

Let o(1) represent a quantity that vanishes as . In the following, we use to denote

For the collection of all the i.i.d. processes over [k], the above two formulations coincide and the problem is essentially the classical discrete distribution estimation problem. The problem of determining was introduced by (2) and studied in a sequence of papers (3; 4; 5; 6; 7). For fixed k and KL-divergence loss, as n goes to infinity, (7) showed that

KL-divergence and many other important similarity measures between two distributions can be expressed as f-divergences (8). Let f be a convex function with f(1) = 0, the f-divergence between two distributions p and q over [k], whenever well-defined, is . Call an f-divergence ordinary if f is thrice continuously differentiable over , sub-exponential, namely, , and satisfies

Observe that all the following notable measures are ordinary f-divergences: Chi-squared divergence (9) from , KL-divergence (10) from f(x) = x log x, Hellinger divergence (11) from , and Alpha-divergence (12) from , where

Related Work For any f-divergence, we denote the corresponding minimax prediction risk for an n-element sample over set P by . Researchers in (13) considered the problem of de- termining for the ordinary f-divergences. Except the above minimax formulation, re- cently, researchers also considered formulations that are more adaptive to the underlying i.i.d. processes (14) (15). Surprisingly, while the min-max risk of i.i.d. processes was addressed in a large body of work, the risk of Markov chains, which frequently arise in practice, was not studied until very recently.

Let denote the collection of all the Markov chains over [k]. For prediction with KLdivergence, (16) showed that , but did not specify the dependence on k. For estimation, (17) considered the class of Markov Chains whose pseudo-spectral gap is bounded away from 0 and approximated the estimation risk to within a log n factor. Some of their techniques, in particular the lower-bound construction in their displayed equation (4.3), are of similar nature and were derived independently of results in Section 5 in our paper.

Our first main result determines the dependence of on both k and n, to within a factor of roughly k: Theorem 1. The minimax KL-prediction risk of Markov chains satisfies

Depending on M, some states may be observed very infrequently, or not at all. This does not drastically affect the prediction problem as these states will be also have small impact on the prediction risk . For estimation, however, rare and unobserved states still need to be well approximated, hence does not decrease with n, and for example

We therefore parametrize the risk by the lowest probability in the transition matrix. For

be the collection of Markov chains whose lowest transition probability exceeds . Note that is trivial if , we only consider . We characterize the minimax estimation risk of almost precisely.

Theorem 2. For all ordinary f-divergences and all

We can further refine the estimation-risk bounds by partitioning based on the smallest probability in the chain’s stationary distribution . Clearly, . For and

be the collection of all Markov chains in whose lowest stationary probability is . We determine the minimax estimation risk over nearly precisely.

Theorem 3. For all ordinary f-divergences,

and

For -distance corresponding to the squared Euclidean norm, we prove the following risk bounds.

Theorem 4. For all

Theorem 5. For all

and

The rest of the paper is organized as follows. Section 2 introduces add-constant estimators and additional definitions and notation for Markov chains. Note that each of the above results consists of a lower bound and an upper bound. We prove the lower bound by constructing a suitable prior distribution over the relevant collection of processes. Section 3 and 5 describe these prior distributions for the prediction and estimation problems, respectively. The upper bounds are derived via simple variants of the standard add-constant estimators. Section 4 and 6 describe the estimators for the prediction and estimation bounds, respectively. For space considerations, we relegate all the proofs to Section 9 to 12.

2 Deﬁnitions and Notation

2.1 Add-constant estimators

Given a sample sequence denote the number of times symbol i appears in . The classical empirical estimator estimates p by

The empirical estimator performs poorly for loss measures such as KL-divergence. For example, if p assigns a tiny probability to a symbol so that it is unlikely to appear in , then with high probability the KL-divergence between will be infinity.

A common solution applies the Laplace smoothing technique (18) that assigns to each symbol i a probability proportional to , where is a fixed constant. The resulting add-estimator, is denoted by . Due to their simplicity and effectiveness, add-estimators are widely used in various machine learning algorithms such as naive Bayes classifiers (19). As shown in (7), for the i.i.d. processes, a variant of the add-3/4 estimator achieves the minimax estimation risk Analogously, given a sample sequence generated by a Markov chain, let denote the number of times symbol j appears right after symbol i in , and let denote the number of times that symbol i appears in . We define the add-

2.2 More on Markov chains

Adopting notation in (20), let denote the collection of discrete distributions over [k]. Let and be the collection of even and odd integers in [k], respectively. By convention, for a Markov chain over [k], we call each symbol . Given a Markov chain, the hitting time first time the chain reaches state j. We denote by Prthe probability that starting from i, the hitting time of j is exactly t. For a Markov chain (M), we denote by the distribution of if we draw . Additionally, for a fixed Markov chain (M), the mixing time denotes the smallest index t such that . Finally, for notational convenience, we write instead of M(i, j) whenever appropriate.

3 Minimax prediction: lower bound

A standard lower-bound argument for minimax prediction risk uses the fact that

for any prior distribution . One advantage of this approach is that the optimal estimator that minimizes can often be computed explicitly.

Perhaps the simplest prior is the uniform distribution over a subset . Let be the optimal estimator minimizing . Computing for all the possible sample sequences may be unrealistic. Instead, let be an arbitrary subset of , we can lower bound

The key to applying the above arguments is to find a proper pair

Without loss of generality, assume that k is even. Let , and define

In addition, let

and let denote the uniform distribution over [k]. Finally, given n, define

Next, let be the collection of sequences whose last appearing state didn’t transition to any other state. For example, 3132, or 31322, but not 21323. In other words, for any state represent an arbitrary state in [k] \ {i}, then

4 Minimax prediction: upper bound

For the defined in the last section,

We denote the partial minimax prediction risk over

Let . Define and in the same manner. As the consequence of being a function from , we have the following triangle inequality,

Turning back to Markov chains, let denote the estimator that maps one can show that

Recall the following lower bound

This together with the above upper bound on and the triangle inequality shows that an upper bound on also suffices to bound the leading term of . The following construction yields such an upper bound.

We partition according to the last appearing state and the number of times it transitions to itself,

For any , there is a unique . Consider the following estimator

we can show that

The upper-bound proof applies the following lemma that uniformly bounds the hitting probability of any k-state Markov chain. Lemma 1. (21) For any Markov chain over [k] and any two states

5 Minimax estimation: lower bound

Analogous to Section 3, we use the following standard argument to lower bound the minimax risk

where and is the uniform distribution over . Setting , we outline the construction of as follows.

Let be the uniform distribution over ), denote the ball of radius r around

where distance between two distributions. Define

and

Noting that the uniform distribution over , is induced by , the uniform distribution over and thus is well-defined.

An important property of the above construction is that for a sample sequence , , the number of times that state k appears in , is a binomial random variable with parameters n and . Therefore, by the following lemma, is highly concentrated around its mean

In order to prove the lower bound on , we only need to modify the above construction as follows. Instead of drawing the last row of the transition matrix uniformly from the distribution induced by , we draw all rows independently in the same fashion. The proof is omitted due to similarity.

6 Minimax estimation: upper bound

The proof of the upper bound relies on a concentration inequality for Markov chains in , which can be informally expressed as

Note that this inequality is very similar to the Hoeffding’s inequality for i.i.d. processes.

The difficulty in analyzing the performance of the original add-estimator is that the chain’s initial distribution could be far away from its stationary distribution and finding a simple expression for and could be hard. To overcome this difficulty, we ignore the first few sample points and construct a new add-estimator based on the remaining sample points. Specifically, let be a length-n sample sequence drawn from the Markov chain (M). Removing the first m sample points, can be viewed as a length-sample sequence drawn from (M) whose initial distribution

Let . For sufficiently large and . Hence without loss of generality, we assume that the original initial distribution already satisfies we can simply replace

To prove the desired upper bound for ordinary f-divergences, it suffices to use the add-

For the -distance, instead of an add-constant estimator, we apply an add-

7 Experiments

We augment the theory with experiments that demonstrate the efficacy of our proposed estimators and validate the functional form of the derived bounds.

We briefly describe the experimental setup. For the first three figures, , and . For the last figure, , and . In all the experiments, the initial distribution of the Markov chain is drawn from the k-Dirichlet(1) distribution. For the transition matrix M, we first construct a transition matrix where each row is drawn independently from the k-Dirichlet(1) distribution. To ensure that each element of M is at least , let represent the all-ones matrix, and set . We generate a new Markov chain for each curve in the plots. And each data point on the curve shows the average loss of 100 independent restarts of the same Markov chain.

The plots use the following abbreviations: Theo for theoretical minimax-risk values; Real for real experimental results: using the estimators described in Sections 4 and 6; Pre for average prediction loss and Est for average estimation loss; Const for add-constant estimator; Prop for proposed add-estimator described in Section 6; Hell, Chi, and Alpha(c) for Hellinger divergence, Chi-squared divergence, and Alpha-divergence with parameter c. In all three graphs, the theoretical min-max curves are precisely the upper bounds in the corresponding theorems, except that in the prediction curve in Figure 1a the constant factor 2 in the upper bound is adjusted to 1/2 to better fit the experiments. Note the excellent fit between the theoretical bounds and experimental results.

Figure 1a shows the decay of the experimental and theoretical KL-prediction and KL-estimation losses with the sample size n. Figure 1b compares the -estimation losses of our proposed estimator and the add-one estimator, and the theoretical minimax values. Figure 1c compares the experimental estimation losses and the theoretical minimax-risk values for different loss measures. Finally, figure 1d presents an experiment on KL-learning losses that scales k up while n is fixed. All the four plots demonstrate that our theoretical results are accurate and can be used to estimate the loss incurred in learning Markov chains. Additionally, Figure 1b shows that our proposed add-estimator is uniformly better than the traditional add-one estimator for different values of sample size n. We have also considered add-constant estimators with different constants varying from 2 to 10 and our proposed estimator outperformed all of them.

8 Conclusions

We studied the problem of learning an unknown k-state Markov chain from its n sequential sample points. We considered two formulations: prediction and estimation. For prediction, we determined the minimax risk up to a multiplicative factor of k. For estimation, when the transition probabilities are bounded away from zero, we obtained nearly matching lower and upper bounds on the minimax risk for and ordinary f-divergences. The effectiveness of our proposed estimators was verified through experimental simulations. Future directions include closing the gap in the prediction problem in Section 1, extending the results on the min-max estimation problem to other classes of Markov chains, and extending the work from the classical setting , to general k and n.

Figure 1: Experiments

References

[1] Moein Falahatgar, Mesrob I. Ohannessian, and Alon Orlitsky. Near-optimal smoothing of structured conditional probability matrices? In In Advances in Neural Information Processing Systems (NIPS), pages 4860–4868, 2016.

[2] Edgar Gilbert. Codes based on inaccurate source probabilities. IEEE Transactions on Information Theory, 17, 3:304–314, 1971.

[3] Thomas Cover. Admissibility properties or gilbert’s encoding for unknown source probabilities (corresp.). IEEE Transactions on Information Theory, 18.1:216–217, 1972.

[4] Raphail Krichevsky and Victor Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27.2:199–207, 1981.

[5] Dietrich Braess, Jürgen Forster, Tomas Sauer, and Hans U. Simon. How to achieve minimax expected kullback-leibler distance from an unknown finite distribution. In International Conference on Algorithmic Learning Theory, pages 380–394. Springer, 2002.

[6] Liam Paninski. Variational minimax estimation of discrete distributions under kl loss. In Advances in Neural Information Processing Systems, pages 1033–1040, 2004.

[7] Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.

[8] I. Csiszár. Information type measures of differences of probability distribution and indirect observations. Studia Math. Hungarica, 2:299–318, 1967.

[9] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21, no. 1:10–13, 2014.

[10] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22, no. 1:79–86, 1951.

[11] Mikhail S Nikulin. Hellinger distance. Encyclopedia of mathematics, 151, 2001.

[12] Gavin E. Crooks. On measures of entropy and information. Tech. Note 9 (2017): v4, 2017.

[13] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Annual Conference on Learning Theory (COLT), pages 1066–1100, 2015.

[14] Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is good-turing good. In Advances in Neural Information Processing Systems, pages 2143–2151, 2015.

[15] Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In 48th annual ACM symposium on Theory of Computing, pages 142–155, 2016.

[16] Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Learning markov distributions: Does estimation trump compression? In IEEE International Symposium on Information Theory (ISIT), pages 2689–2693, 2016.

[17] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic markov chains. arXiv:1809.05014, 2018.

[18] Kai Lai Chung and Farid AitSahlia. Elementary probability theory: With stochastic processes and an introduction to mathematical finance. Springer Science & Business Media, 2012.

[19] Christopher M. Bishop and Tom M. Mitchell. Pattern recognition and machine learning. Springer, 2006.

[20] David A.Levin and Yuval Peres. Markov chains and mixing Times. American Mathematical Soc., 2017.

[21] James Norris, Yuval Peres, and Alex Zhai. Surprise probabilities in markov chains. Combinatorics, Probability and Computing, 26.4:603–627, 2017.

[22] Fan RK Chung and Linyuan Lu. Complex graphs and networks. American Mathematical Soc., 2006.

[23] Daniel Paulin. Concentration inequalities for markov chains by marton couplings and spectral methods. Electronic Journal of Probability, 20, 2015.

9 Minimax prediction: lower bound

A standard argument for lower bounding the minimax prediction risk is

where is a prior distribution over P. The advantage of this approach is that the optimal estimator that minimizes can often be computed explicitly.

Perhaps the simplest prior is the uniform distribution over some subset of P. Consider the uniform distribution over , say , the following lemma shows an explicit way of computing the optimal estimator for

Lemma 3. Let be the optimal estimator that minimizes , then for any and any symbol

Clearly, computing for all the possible sample sequences may be unrealistic. Instead, let be an arbitrary subset of , we can lower bound

The key to apply the above arguments is to find a proper pair . The rest of this section is organized as follows. In Subsection 9.1, we present our construction of and . In Subsection 9.2, we find the exact form of the optimal estimator using Lemma 3. Then we analyze its prediction risk over in Subsection 9.3, where we further partition into smaller subsets and lower bound the KL-divergence over and the probability in Lemma 6 and 7, respectively. Finally, we consolidate all the previous results and prove the desired lower bound on

9.1 Prior construction

Without loss of generality, we assume that k is an even integer. For notational convenience, we denote by the uniform distribution over [k] and define

Given n, we set

Then, we choose to be the collection of sequences whose last appearing state didn’t transition to any other symbol. In other words, for any state represent an arbitrary state other than i, then

According to both the last appearing state and the number of times it transitions to itself, we can partition

9.2 The optimal estimator

Let denote the optimal estimator that minimizes . The following lemma presents the exact form of

Lemma 4. For any , there exists a unique that contains it. Consider

Proof. Given

By Lemma 3, for any evaluates to

Noting that implies and . Besides, for any and is uniquely determined by

Thus, for s = 0 or 1, we can rewrite

where is a constant that only depends on

Hence, for any

Combining in the nominator,

This completes the proof.

9.3 Analysis

Next, for any , we lower bound in terms of

Lemma 5. For any

Proof. By the previous lemma,

Noting that

This completes the proof.

Lemma 6. For any where , and sufficiently large n, if

then,

Proof. Consider

where

We have the following bounds on these quantities.

Bounds for

Bounds for

Bounds for

Bounds for

where we use the inequality

Hence,

Bounds for

Similarly, consider a single term of without the factor

where we use the inequality

Hence,

where we use the inequality

Putting everything together:

This completes the proof.

Another quantity that will be appear later is where . We need the following lower bound. Lemma 7. For

Proof. By our construction of , for and , we have the following observations.

2. The probability of transitioning from some state to some state that is not

3. The probability of transitioning from some state

4. The probability of transitioning from state i to itself is

Therefore,

This completes the proof.

By Lemma 7,

Therefore,

where the last step follows by symmetry.

Next, we show that for any

Noting that is simply the summation of a geometric sequence, we can compute it as follows

To provide a lower bound for , we use the following inequalities:

Consolidating these three inequalities, the sum can be lower bounded by

Finally,

10 Minimax prediction: upper bound

The proof makes use of the following lemma, which provides a uniform upper bound for the hitting probability of any k-state Markov chain. Lemma 8. (21) For any Markov chain over [k] and any two states

Let be the same as is in the previous section. Recall that

we denote the partial minimax prediction risk over

Let , we define and in the same manner. As the consequence of being a function from , we have the following triangle inequality,

Turning back to Markov chains, the next lemma upper bounds

Lemma 9. Let denote the estimator that maps

which implies

Proof. The proof of this lemma is essentially a combination of the upper bounds’ proofs in (16) and in Section 12. Instead of using the fact that are bounded away from 0 (see Section 12), we partition into different subsets according to how close the counts are to their expected values, the number of times that the last appearing state transitioning to itself, and the number of times that the last appearing state transitioning to other states. Then, we bound the estimator’s expected loss over each set of the partition by . We omit the proof for the sake of brevity.

Recall the following lower bound,

This together with Lemma 5 and the triangle inequality above shows that an upper bound on also suffices to bound the leading term of . The following lemma provides such an upper bound. Recall that for any is defined as

Lemma 10. For any , there exists a unique pair . Consider the following estimator

and

then we have

Proof. Let be an arbitrary state. For simplicity of illustration, we use the following notation: for any , denote ; for any , denote ; for any , denote . By Lemma 8, the hitting probability is upper bounded by for all . We can write

Now, we break the right hand side into two sums according to whether is greater than n/2 or not. For

Similarly, for

This completes the proof.

11 Minimax estimation: lower bound

The proof of the lower bound makes use of the following concentration inequality, which upper

bounds the probability that a binomial random variable exceeds its mean.

Lemma 11. (22) Let Y be a binomial random variable with parameters

11.1 Prior construction

Again we use the following standard argument to lower bound the minimax risk,

where is the uniform distribution over , we outline the construction of as follows.

We adopt the notation in (13) and denote the ball of radius , the uniform distribution

where distance between two distributions. For simplicity, define

and

where

Given

Noting that the uniform distribution over , is induced by , the uniform distribution over and thus is well-defined.

An important property of the above construction is that for a sample sequence , , the number of times that state k appears in , is a binomial random variable with parameters n and . Therefore, Lemma 11 implies that is highly concentrated around its mean

11.2 L2-divergence lower bound

Let us first consider the -distance. Similar to Lemma 3, , the estimator that minimizes , can be computed exactly. In particular, we have the following lemma.

Lemma 12. There exists an estimator

such that

Based on the above lemma, we can relate the minimax estimation risk of Markov chains to the minimax prediction risk of i.i.d. processes. For simplicity, denote . The following lemma holds.

Lemma 13. For any , let be the collection of indexes such that . Then,

where is an arbitrary non-empty subset of is a constant whose value only depends on

Proof. We first consider the inner expectation on the left-hand side of the equality. For any

Let us partition into two parts: the collection of indexes such that and , say , and the remaining elements in . By the construction of , we have

For any denote the subsequence . We can further partition the last summation according to

Fixing , there is a bijective mapping from . Furthermore, we have Hence, we can denote for and treat it as a mapping from implies that Thus,

By definition, minimizes and for each , its value is completely determined by . Besides, forms a partition of . Therefore, by the linearity of expectation and the definition of , the estimator also minimizes , where the minimization is over all the possible mappings . Equivalently, we have

This immediately yields the lemma.

For any , denote by the number of times that state k appears in which is a random variable induced by (M) and n. Lemma 13, we can deduce that Lemma 14.

By Lemma 11 and our construction of , the probability that . This together with Lemma 14 and

11.3 Lower bound for ordinary f-divergences

Now we proceed from the -distance to ordinary f-divergences. The following lemma from (13) shows that decreases if we move q closer to p. Lemma 16. For

Based on the above lemma, we show that for any , the value of the optimal estimator is always close to

Let , we claim that either and . Otherwise, Lemma 16 implies that we can reduce the estimation risk by moving

If ˆand ˆ, then ˆ, and ˆ. Similarly, if ˆand ˆ

Now we relate . For simplicity, denote Lemma 17. For sufficiently large n,

Proof. By the previous lemma, Therefore,

Let us denote the interval on the right hand side by I.

For sufficiently large n, we can apply the second-order Taylor expansion to f at point 1.

12 Minimax estimation: upper bound

12.1 Concentration of the counts

The proof of the upper bound relies on the following concentration inequality, which shows that for any Markov chain in and any state , with high probability stays close to sufficiently large n.

Lemma 19. Given a sample sequence from any Markov chain , let denote the number of times that symbol i appears in . Then for any

where is the stationary distribution of (M) and

Proof. Given , recall that denotes the distribution of if we draw (M). First, we show that

Let be the matrix such that for all . Noting that , we can define

which is also a valid transition matrix.

By induction, we can show

Hence, let denote the norm, we have

This implies that we can upper bound

The remaining proof follows from Proposition 3.4, Theorem 3.4, and Proposition 3.10 of (23) and is omitted here for the sake of brevity.

Noting that Pr

Informally, we can express the above inequality as

which is very similar to the Hoeffding’s inequality for the i.i.d. processes. As an important implication, the following lemma bounds the moments of Lemma 20. For defined in Lemma 19 and any

Proof. The statement follows from

12.2 A modified add-β estimator

The difficulty with analyzing the performance of the original add-estimator is that the chain’s initial distribution could be far away from its stationary distribution and finding a simple expression for could be hard. To overcome such difficulty, we ignore the first few sample points and construct a new add-estimator based on the remaining sample points. To be more specific, let be a length-n sample sequence drawn from the Markov chain (M). Removing the first m sample points, can be viewed as a length-sample sequence drawn from (M) whose initial distribution

Setting . Noting that for sufficiently large n, without loss of generality, we assume that the original distribution already satisfies . If not, we can simply replace

To prove the upper bound, we consider the following (modified) add-estimator:

where is a fixed constant.

We can compute the expected values of these counts as

and

12.3 Analysis

For notational convenience, let us re-denote

The second inequality follows from the fact that can be viewed as the sum of counts from the following two Markov chains over whose transition probabilities are greater than

In other words, and are highly concentrated around and , respectively. Let denote the event that and . Let denote the event that does not happen. Applying the union bound, we have

Now consider

the corresponding estimation risk of over a particular sate can be decomposed as

Noting that

and

Hence, we can bound the second term as

where the last step follows from our assumption that f is sub-exponential.

By the definition of

Let , then h is thrice continuously differentiable around some neighborhood of point 0 and

We apply Taylor expansion to h at point 0 and rewrite the expectation on the right-hand side as

where by our definition of

Now, we bound individual terms. Taking out , the first term evaluates to:

Taking out , the second term evaluates to:

Finally, taking out , the last term can be bounded as

where we have used the ineuqality

By the definition of

and

Hence, consolidating all the previous results,

It remains to analyze

Thus, the desired quantity evaluates to

The above inequality yields

This completes our proof for ordinary f-divergences.

12.4 L2-divergence upper bound

Finally, we consider the -divergence. Again, we assume that the sample sequence

Instead of using an add-constant estimator, we use the following add-estimator:

Now, consider the expected loss for a particular state

We first show that the last term is negligible. Noting that

we can apply Taylor expansion to the function

at point

where the last step follows from Lemma 20. It remains to consider

According to the previous derivations, for

For 2

Therefore,

Finally,

We have

This completes our proof for the -divergence.

designed for accessibility and to further open science