Many natural phenomena are inherently probabilistic. With past observations at hand, probabilistic models can therefore help us predict, estimate, and understand, future outcomes and trends. The two most fundamental probabilistic models for sequential data are i.i.d. processes and Markov chains. In an i.i.d. process, for each , a sample
is generated independently according to the same underlying distribution. In Markov chains, for each
, the distribution of sample
is determined by just the value of
Let us confine our discussion to random processes over finite alphabets, without loss of generality, assumed to be [k] := {1, 2, . . . , k}. An i.i.d. process is defined by a single distribution p over [k], while a Markov chain is characterized by a transition probability matrix . We denote the initial and stationary distributions of a Markov model by
and
, respectively. For notational consistency let P = (p) denote an i.i.d. model and P = (M) denote a Markov model.
Having observed a sample sequence unknown i.i.d. process or Markov chain, a natural problem is to predict the next sample point
. Since
is a random
variable, this task is typically interpreted as estimating the conditional probability distribution of the next sample point
Let
denote the collection of all finite-length sequences over [k].
Therefore, conditioning on , our first objective is to estimate the conditional distribution To be more precise, we would like to find an estimator
, that associates with every sequence
a distribution
that approximates
in a suitable sense.
Perhaps a more classical problem is parameter estimation, which describes the underlying process. An i.i.d. process is completely characterized by , hence this problem coincides with the previous one. For Markov chains, we seek to estimate the transition matrix M. Therefore, instead of producing a probability distribution
, the estimator
maps every sequence
to a transition matrix
For two distributions p and q over [k], let L(p, q) be the loss when p is approximated by q. For the prediction problem, we measure the performance of an estimator in terms of its prediction risk,
the expected loss with respect to the sample sequence
For the estimation problem, we quantify the performance of the estimator by estimation risk. We first consider the expected loss of with respect to a single state
We then define the estimation risk of given sample sequence
as the maximum expected loss over all states,
While the process P we are trying to learn is unknown, it often belongs to a known collection P. The worst prediction risk of an estimator over all distributions in P is
The minimal possible worst-case prediction risk, or simply the minimax prediction risk, incurred by any estimator is
The worst-case estimation risk and the minimax estimation risk
are defined similarly. Given P, our goals are to approximate the minimax prediction/estimation risk to a universal constant-factor, and to devise estimators that achieve this performance.
An alternative definition of the estimation risk, considered in (1) and mentioned by a reviewer, is
We denote the corresponding minimax estimation risk by
Let o(1) represent a quantity that vanishes as . In the following, we use
to denote
For the collection of all the i.i.d. processes over [k], the above two formulations coincide and the problem is essentially the classical discrete distribution estimation problem. The problem of determining
was introduced by (2) and studied in a sequence of papers (3; 4; 5; 6; 7). For fixed k and KL-divergence loss, as n goes to infinity, (7) showed that
KL-divergence and many other important similarity measures between two distributions can be expressed as f-divergences (8). Let f be a convex function with f(1) = 0, the f-divergence between two distributions p and q over [k], whenever well-defined, is . Call an f-divergence ordinary if f is thrice continuously differentiable over
, sub-exponential, namely,
, and satisfies
Observe that all the following notable measures are ordinary f-divergences: Chi-squared divergence (9) from , KL-divergence (10) from f(x) = x log x, Hellinger divergence (11) from
, and Alpha-divergence (12) from
, where
Related Work For any f-divergence, we denote the corresponding minimax prediction risk for an n-element sample over set P by . Researchers in (13) considered the problem of de- termining
for the ordinary f-divergences. Except the above minimax formulation, re- cently, researchers also considered formulations that are more adaptive to the underlying i.i.d. processes (14) (15). Surprisingly, while the min-max risk of i.i.d. processes was addressed in a large body of work, the risk of Markov chains, which frequently arise in practice, was not studied until very recently.
Let denote the collection of all the Markov chains over [k]. For prediction with KLdivergence, (16) showed that
, but did not specify the dependence on k. For estimation, (17) considered the class of Markov Chains whose pseudo-spectral gap is bounded away from 0 and approximated the
estimation risk to within a log n factor. Some of their techniques, in particular the lower-bound construction in their displayed equation (4.3), are of similar nature and were derived independently of results in Section 5 in our paper.
Our first main result determines the dependence of on both k and n, to within a factor of roughly k: Theorem 1. The minimax KL-prediction risk of Markov chains satisfies
Depending on M, some states may be observed very infrequently, or not at all. This does not drastically affect the prediction problem as these states will be also have small impact on the prediction risk
. For estimation, however, rare and unobserved states still need to be well approximated, hence
does not decrease with n, and for example
We therefore parametrize the risk by the lowest probability in the transition matrix. For
be the collection of Markov chains whose lowest transition probability exceeds . Note that
is trivial if
, we only consider
. We characterize the minimax estimation risk of
almost precisely.
Theorem 2. For all ordinary f-divergences and all
We can further refine the estimation-risk bounds by partitioning based on the smallest probability in the chain’s stationary distribution
. Clearly,
. For
and
be the collection of all Markov chains in whose lowest stationary probability is
. We determine the minimax estimation risk over
nearly precisely.
Theorem 3. For all ordinary f-divergences,
and
For -distance corresponding to the squared Euclidean norm, we prove the following risk bounds.
Theorem 4. For all
Theorem 5. For all
and
The rest of the paper is organized as follows. Section 2 introduces add-constant estimators and additional definitions and notation for Markov chains. Note that each of the above results consists of a lower bound and an upper bound. We prove the lower bound by constructing a suitable prior distribution over the relevant collection of processes. Section 3 and 5 describe these prior distributions for the prediction and estimation problems, respectively. The upper bounds are derived via simple variants of the standard add-constant estimators. Section 4 and 6 describe the estimators for the prediction and estimation bounds, respectively. For space considerations, we relegate all the proofs to Section 9 to 12.
2.1 Add-constant estimators
Given a sample sequence denote the number of times symbol i appears in
. The classical empirical estimator estimates p by
The empirical estimator performs poorly for loss measures such as KL-divergence. For example, if p assigns a tiny probability to a symbol so that it is unlikely to appear in , then with high probability the KL-divergence between
will be infinity.
A common solution applies the Laplace smoothing technique (18) that assigns to each symbol i a probability proportional to , where
is a fixed constant. The resulting add-
estimator, is denoted by
. Due to their simplicity and effectiveness, add-
estimators are widely used in various machine learning algorithms such as naive Bayes classifiers (19). As shown in (7), for the i.i.d. processes, a variant of the add-3/4 estimator achieves the minimax estimation risk
Analogously, given a sample sequence
generated by a Markov chain, let
denote the number of times symbol j appears right after symbol i in
, and let
denote the number of times that symbol i appears in
. We define the add-
2.2 More on Markov chains
Adopting notation in (20), let denote the collection of discrete distributions over [k]. Let
and
be the collection of even and odd integers in [k], respectively. By convention, for a Markov chain over [k], we call each symbol
. Given a Markov chain, the hitting time
first time the chain reaches state j. We denote by Pr
the probability that starting from i, the hitting time of j is exactly t. For a Markov chain (M), we denote by
the distribution of
if we draw
. Additionally, for a fixed Markov chain (M), the mixing time
denotes the smallest index t such that
. Finally, for notational convenience, we write
instead of M(i, j) whenever appropriate.
A standard lower-bound argument for minimax prediction risk uses the fact that
for any prior distribution . One advantage of this approach is that the optimal estimator that minimizes
can often be computed explicitly.
Perhaps the simplest prior is the uniform distribution over a subset
. Let
be the optimal estimator minimizing
. Computing
for all the possible sample sequences
may be unrealistic. Instead, let
be an arbitrary subset of
, we can lower bound
The key to applying the above arguments is to find a proper pair
Without loss of generality, assume that k is even. Let , and define
In addition, let
and let denote the uniform distribution over [k]. Finally, given n, define
Next, let be the collection of sequences
whose last appearing state didn’t transition to any other state. For example, 3132, or 31322, but not 21323. In other words, for any state
represent an arbitrary state in [k] \ {i}, then
For the defined in the last section,
We denote the partial minimax prediction risk over
Let . Define
and
in the same manner. As the consequence of
being a function from
, we have the following triangle inequality,
Turning back to Markov chains, let denote the estimator that maps
one can show that
Recall the following lower bound
This together with the above upper bound on and the triangle inequality shows that an upper bound on
also suffices to bound the leading term of
. The following construction yields such an upper bound.
We partition according to the last appearing state and the number of times it transitions to itself,
For any , there is a unique
. Consider the following estimator
we can show that
The upper-bound proof applies the following lemma that uniformly bounds the hitting probability of any k-state Markov chain. Lemma 1. (21) For any Markov chain over [k] and any two states
Analogous to Section 3, we use the following standard argument to lower bound the minimax risk
where and
is the uniform distribution over
. Setting
, we outline the construction of
as follows.
Let be the uniform distribution over
), denote the
ball of radius r around
where distance between two distributions. Define
and
Noting that the uniform distribution over , is induced by
, the uniform distribution over
and thus is well-defined.
An important property of the above construction is that for a sample sequence ,
, the number of times that state k appears in
, is a binomial random variable with parameters n and
. Therefore, by the following lemma,
is highly concentrated around its mean
In order to prove the lower bound on , we only need to modify the above construction as follows. Instead of drawing the last row of the transition matrix
uniformly from the distribution induced by
, we draw all rows independently in the same fashion. The proof is omitted due to similarity.
The proof of the upper bound relies on a concentration inequality for Markov chains in , which can be informally expressed as
Note that this inequality is very similar to the Hoeffding’s inequality for i.i.d. processes.
The difficulty in analyzing the performance of the original add-estimator is that the chain’s initial distribution could be far away from its stationary distribution and finding a simple expression for
and
could be hard. To overcome this difficulty, we ignore the first few sample points and construct a new add-
estimator based on the remaining sample points. Specifically, let
be a length-n sample sequence drawn from the Markov chain (M). Removing the first m sample points,
can be viewed as a length-
sample sequence drawn from (M) whose initial distribution
Let . For sufficiently large
and
. Hence without loss of generality, we assume that the original initial distribution
already satisfies
we can simply replace
To prove the desired upper bound for ordinary f-divergences, it suffices to use the add-
For the -distance, instead of an add-constant estimator, we apply an add-
We augment the theory with experiments that demonstrate the efficacy of our proposed estimators and validate the functional form of the derived bounds.
We briefly describe the experimental setup. For the first three figures, , and
. For the last figure,
, and
. In all the experiments, the initial distribution
of the Markov chain is drawn from the k-Dirichlet(1) distribution. For the transition matrix M, we first construct a transition matrix
where each row is drawn independently from the k-Dirichlet(1) distribution. To ensure that each element of M is at least
, let
represent the
all-ones matrix, and set
. We generate a new Markov chain for each curve in the plots. And each data point on the curve shows the average loss of 100 independent restarts of the same Markov chain.
The plots use the following abbreviations: Theo for theoretical minimax-risk values; Real for real experimental results: using the estimators described in Sections 4 and 6; Pre for average prediction loss and Est for average estimation loss; Const for add-constant estimator; Prop for proposed add-estimator described in Section 6; Hell, Chi, and Alpha(c) for Hellinger divergence, Chi-squared divergence, and Alpha-divergence with parameter c. In all three graphs, the theoretical min-max curves are precisely the upper bounds in the corresponding theorems, except that in the prediction curve in Figure 1a the constant factor 2 in the upper bound is adjusted to 1/2 to better fit the experiments. Note the excellent fit between the theoretical bounds and experimental results.
Figure 1a shows the decay of the experimental and theoretical KL-prediction and KL-estimation losses with the sample size n. Figure 1b compares the -estimation losses of our proposed estimator and the add-one estimator, and the theoretical minimax values. Figure 1c compares the experimental estimation losses and the theoretical minimax-risk values for different loss measures. Finally, figure 1d presents an experiment on KL-learning losses that scales k up while n is fixed. All the four plots demonstrate that our theoretical results are accurate and can be used to estimate the loss incurred in learning Markov chains. Additionally, Figure 1b shows that our proposed add-
estimator is uniformly better than the traditional add-one estimator for different values of sample size n. We have also considered add-constant estimators with different constants varying from 2 to 10 and our proposed estimator outperformed all of them.
We studied the problem of learning an unknown k-state Markov chain from its n sequential sample points. We considered two formulations: prediction and estimation. For prediction, we determined the minimax risk up to a multiplicative factor of k. For estimation, when the transition probabilities are bounded away from zero, we obtained nearly matching lower and upper bounds on the minimax risk for and ordinary f-divergences. The effectiveness of our proposed estimators was verified through experimental simulations. Future directions include closing the gap in the prediction problem in Section 1, extending the results on the min-max estimation problem to other classes of Markov chains, and extending the work from the classical setting
, to general k and n.
Figure 1: Experiments
[1] Moein Falahatgar, Mesrob I. Ohannessian, and Alon Orlitsky. Near-optimal smoothing of structured conditional probability matrices? In In Advances in Neural Information Processing Systems (NIPS), pages 4860–4868, 2016.
[2] Edgar Gilbert. Codes based on inaccurate source probabilities. IEEE Transactions on Information Theory, 17, 3:304–314, 1971.
[3] Thomas Cover. Admissibility properties or gilbert’s encoding for unknown source probabilities (corresp.). IEEE Transactions on Information Theory, 18.1:216–217, 1972.
[4] Raphail Krichevsky and Victor Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27.2:199–207, 1981.
[5] Dietrich Braess, Jürgen Forster, Tomas Sauer, and Hans U. Simon. How to achieve minimax expected kullback-leibler distance from an unknown finite distribution. In International Conference on Algorithmic Learning Theory, pages 380–394. Springer, 2002.
[6] Liam Paninski. Variational minimax estimation of discrete distributions under kl loss. In Advances in Neural Information Processing Systems, pages 1033–1040, 2004.
[7] Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.
[8] I. Csiszár. Information type measures of differences of probability distribution and indirect observations. Studia Math. Hungarica, 2:299–318, 1967.
[9] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21, no. 1:10–13, 2014.
[10] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22, no. 1:79–86, 1951.
[11] Mikhail S Nikulin. Hellinger distance. Encyclopedia of mathematics, 151, 2001.
[12] Gavin E. Crooks. On measures of entropy and information. Tech. Note 9 (2017): v4, 2017.
[13] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Annual Conference on Learning Theory (COLT), pages 1066–1100, 2015.
[14] Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is good-turing good. In Advances in Neural Information Processing Systems, pages 2143–2151, 2015.
[15] Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In 48th annual ACM symposium on Theory of Computing, pages 142–155, 2016.
[16] Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Learning markov distributions: Does estimation trump compression? In IEEE International Symposium on Information Theory (ISIT), pages 2689–2693, 2016.
[17] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic markov chains. arXiv:1809.05014, 2018.
[18] Kai Lai Chung and Farid AitSahlia. Elementary probability theory: With stochastic processes and an introduction to mathematical finance. Springer Science & Business Media, 2012.
[19] Christopher M. Bishop and Tom M. Mitchell. Pattern recognition and machine learning. Springer, 2006.
[20] David A.Levin and Yuval Peres. Markov chains and mixing Times. American Mathematical Soc., 2017.
[21] James Norris, Yuval Peres, and Alex Zhai. Surprise probabilities in markov chains. Combinatorics, Probability and Computing, 26.4:603–627, 2017.
[22] Fan RK Chung and Linyuan Lu. Complex graphs and networks. American Mathematical Soc., 2006.
[23] Daniel Paulin. Concentration inequalities for markov chains by marton couplings and spectral methods. Electronic Journal of Probability, 20, 2015.
A standard argument for lower bounding the minimax prediction risk is
where is a prior distribution over P. The advantage of this approach is that the optimal estimator that minimizes
can often be computed explicitly.
Perhaps the simplest prior is the uniform distribution over some subset of P. Consider the uniform distribution over , say
, the following lemma shows an explicit way of computing the optimal estimator for
Lemma 3. Let be the optimal estimator that minimizes
, then for any
and any symbol
Clearly, computing for all the possible sample sequences
may be unrealistic. Instead, let
be an arbitrary subset of
, we can lower bound
The key to apply the above arguments is to find a proper pair . The rest of this section is organized as follows. In Subsection 9.1, we present our construction of
and
. In Subsection 9.2, we find the exact form of the optimal estimator using Lemma 3. Then we analyze its prediction risk over
in Subsection 9.3, where we further partition
into smaller subsets
and lower bound the KL-divergence over
and the probability
in Lemma 6 and 7, respectively. Finally, we consolidate all the previous results and prove the desired lower bound on
9.1 Prior construction
Without loss of generality, we assume that k is an even integer. For notational convenience, we denote by the uniform distribution over [k] and define
Given n, we set
Then, we choose to be the collection of sequences
whose last appearing state didn’t transition to any other symbol. In other words, for any state
represent an arbitrary state other than i, then
According to both the last appearing state and the number of times it transitions to itself, we can partition
9.2 The optimal estimator
Let denote the optimal estimator that minimizes
. The following lemma presents the exact form of
Lemma 4. For any , there exists a unique
that contains it. Consider
Proof. Given
By Lemma 3, for any evaluates to
Noting that implies
and
. Besides, for any
and
is uniquely determined by
Thus, for s = 0 or 1, we can rewrite
where is a constant that only depends on
Hence, for any
Combining in the nominator,
This completes the proof.
9.3 Analysis
Next, for any , we lower bound
in terms of
Lemma 5. For any
Proof. By the previous lemma,
Noting that
This completes the proof.
Lemma 6. For any where
, and sufficiently large n, if
then,
Proof. Consider
where
We have the following bounds on these quantities.
Bounds for
Bounds for
Bounds for
Bounds for
where we use the inequality
Hence,
Bounds for
Similarly, consider a single term of without the factor
where we use the inequality
Hence,
where we use the inequality
Putting everything together:
This completes the proof.
Another quantity that will be appear later is where
. We need the following lower bound. Lemma 7. For
Proof. By our construction of , for
and
, we have the following observations.
2. The probability of transitioning from some state to some state that is not
3. The probability of transitioning from some state
4. The probability of transitioning from state i to itself is
Therefore,
This completes the proof.
By Lemma 7,
Therefore,
where the last step follows by symmetry.
Next, we show that for any
Noting that is simply the summation of a geometric sequence, we can compute it as follows
To provide a lower bound for , we use the following inequalities:
Consolidating these three inequalities, the sum can be lower bounded by
Finally,
The proof makes use of the following lemma, which provides a uniform upper bound for the hitting probability of any k-state Markov chain. Lemma 8. (21) For any Markov chain over [k] and any two states
Let be the same as is in the previous section. Recall that
we denote the partial minimax prediction risk over
Let , we define
and
in the same manner. As the consequence of
being a function from
, we have the following triangle inequality,
Turning back to Markov chains, the next lemma upper bounds
Lemma 9. Let denote the estimator that maps
which implies
Proof. The proof of this lemma is essentially a combination of the upper bounds’ proofs in (16) and in Section 12. Instead of using the fact that are bounded away from 0 (see Section 12), we partition
into different subsets according to how close the counts are to their expected values, the number of times that the last appearing state transitioning to itself, and the number of times that the last appearing state transitioning to other states. Then, we bound the estimator’s expected loss over each set of the partition by
. We omit the proof for the sake of brevity.
Recall the following lower bound,
This together with Lemma 5 and the triangle inequality above shows that an upper bound on also suffices to bound the leading term of
. The following lemma provides such an upper bound. Recall that for any
is defined as
Lemma 10. For any , there exists a unique pair
. Consider the following estimator
and
then we have
Proof. Let be an arbitrary state. For simplicity of illustration, we use the following notation: for any
, denote
; for any
, denote
; for any
, denote
. By Lemma 8, the hitting probability
is upper bounded by
for all
. We can write
Now, we break the right hand side into two sums according to whether is greater than n/2 or not. For
Similarly, for
This completes the proof.
The proof of the lower bound makes use of the following concentration inequality, which upper
bounds the probability that a binomial random variable exceeds its mean.
Lemma 11. (22) Let Y be a binomial random variable with parameters
11.1 Prior construction
Again we use the following standard argument to lower bound the minimax risk,
where is the uniform distribution over
, we outline the construction of
as follows.
We adopt the notation in (13) and denote the ball of radius
, the uniform distribution
where distance between two distributions. For simplicity, define
and
where
Given
Noting that the uniform distribution over , is induced by
, the uniform distribution over
and thus is well-defined.
An important property of the above construction is that for a sample sequence ,
, the number of times that state k appears in
, is a binomial random variable with parameters n and
. Therefore, Lemma 11 implies that
is highly concentrated around its mean
11.2 L2-divergence lower bound
Let us first consider the -distance. Similar to Lemma 3,
, the estimator that minimizes
, can be computed exactly. In particular, we have the following lemma.
Lemma 12. There exists an estimator
such that
Based on the above lemma, we can relate the minimax estimation risk of Markov chains to the minimax prediction risk of i.i.d. processes. For simplicity, denote . The following lemma holds.
Lemma 13. For any , let
be the collection of indexes
such that
. Then,
where is an arbitrary non-empty subset of
is a constant whose value only depends on
Proof. We first consider the inner expectation on the left-hand side of the equality. For any
Let us partition into two parts: the collection of indexes
such that
and
, say
, and the remaining elements in
. By the construction of
, we have
For any denote the subsequence
. We can further partition the last summation according to
Fixing , there is a bijective mapping from
. Furthermore, we have
Hence, we can denote
for
and treat it as a mapping from
implies that
Thus,
By definition, minimizes
and for each
, its value
is completely determined by
. Besides,
forms a partition of
. Therefore, by the linearity of expectation and the definition of
, the estimator
also minimizes
, where the minimization is over all the possible mappings
. Equivalently, we have
This immediately yields the lemma.
For any , denote by
the number of times that state k appears in
which is a random variable induced by (M) and n. Lemma 13, we can deduce that Lemma 14.
By Lemma 11 and our construction of , the probability that
. This together with Lemma 14 and
11.3 Lower bound for ordinary f-divergences
Now we proceed from the -distance to ordinary f-divergences. The following lemma from (13) shows that
decreases if we move q closer to p. Lemma 16. For
Based on the above lemma, we show that for any , the value of the optimal estimator is always close to
Let , we claim that either
and
. Otherwise, Lemma 16 implies that we can reduce the estimation risk by moving
If ˆand ˆ
, then ˆ
,
and ˆ
. Similarly, if ˆ
and ˆ
Now we relate . For simplicity, denote
Lemma 17. For sufficiently large n,
Proof. By the previous lemma, Therefore,
Let us denote the interval on the right hand side by I.
For sufficiently large n, we can apply the second-order Taylor expansion to f at point 1.
12.1 Concentration of the counts
The proof of the upper bound relies on the following concentration inequality, which shows that for any Markov chain in and any state
, with high probability
stays close to
sufficiently large n.
Lemma 19. Given a sample sequence from any Markov chain
, let
denote the number of times that symbol i appears in
. Then for any
where is the stationary distribution of (M) and
Proof. Given , recall that
denotes the distribution of
if we draw
(M). First, we show that
Let be the
matrix such that
for all
. Noting that
, we can define
which is also a valid transition matrix.
By induction, we can show
Hence, let denote the
norm, we have
This implies that we can upper bound
The remaining proof follows from Proposition 3.4, Theorem 3.4, and Proposition 3.10 of (23) and is omitted here for the sake of brevity.
Noting that Pr
Informally, we can express the above inequality as
which is very similar to the Hoeffding’s inequality for the i.i.d. processes. As an important implication, the following lemma bounds the moments of Lemma 20. For
defined in Lemma 19 and any
Proof. The statement follows from
12.2 A modified add-β estimator
The difficulty with analyzing the performance of the original add-estimator is that the chain’s initial distribution could be far away from its stationary distribution and finding a simple expression for
could be hard. To overcome such difficulty, we ignore the first few sample points and construct a new add-
estimator based on the remaining sample points. To be more specific, let
be a length-n sample sequence drawn from the Markov chain (M). Removing the first m sample points,
can be viewed as a length-
sample sequence drawn from (M) whose initial distribution
Setting . Noting that
for sufficiently large n, without loss of generality, we assume that the original distribution
already satisfies
. If not, we can simply replace
To prove the upper bound, we consider the following (modified) add-estimator:
where is a fixed constant.
We can compute the expected values of these counts as
and
12.3 Analysis
For notational convenience, let us re-denote
The second inequality follows from the fact that can be viewed as the sum of counts from the following two Markov chains over
whose transition probabilities are greater than
In other words, and
are highly concentrated around
and
, respectively. Let
denote the event that
and
. Let
denote the event that
does not happen. Applying the union bound, we have
Now consider
the corresponding estimation risk of over a particular sate
can be decomposed as
Noting that
and
Hence, we can bound the second term as
where the last step follows from our assumption that f is sub-exponential.
By the definition of
Let , then h is thrice continuously differentiable around some neighborhood of point 0 and
We apply Taylor expansion to h at point 0 and rewrite the expectation on the right-hand side as
where by our definition of
Now, we bound individual terms. Taking out , the first term evaluates to:
Taking out , the second term evaluates to:
Finally, taking out , the last term can be bounded as
where we have used the ineuqality
By the definition of
and
Hence, consolidating all the previous results,
It remains to analyze
Thus, the desired quantity evaluates to
The above inequality yields
This completes our proof for ordinary f-divergences.
12.4 L2-divergence upper bound
Finally, we consider the -divergence. Again, we assume that the sample sequence
Instead of using an add-constant estimator, we use the following add-estimator:
Now, consider the expected loss for a particular state
We first show that the last term is negligible. Noting that
we can apply Taylor expansion to the function
at point
where the last step follows from Lemma 20. It remains to consider
According to the previous derivations, for
For 2
Therefore,
Finally,
We have
This completes our proof for the -divergence.