b

DiscoverSearch
About
My stuff
On Learning Markov Chains
2018·arXiv
Abstract
Abstract

The problem of estimating an unknown discrete distribution from its samples is a fundamental tenet of statistical learning. Over the past decade, it attracted significant research effort and has been solved for a variety of divergence measures. Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider two problems related to the min-max risk (expected loss) of estimating an unknown k-state Markov chain from its n sequential samples: predicting the conditional distribution of the next sample with respect to the KL-divergence, and estimating the transition matrix with respect to a natural loss induced by KL or a more general f-divergence measure. For the first measure, we determine the min-max prediction risk to within a linear factor in the alphabet size, showing it is  Ω(k log log n/n)and  O(k2 log log n/n). For the second, if the transition probabilities can be arbitrarily small, then only trivial uniform risk upper bounds can be derived. We therefore consider transition probabilities that are bounded away from zero, and resolve the problem for essentially all sufficiently smooth f-divergences, including KL-,  L2-, Chi-squared, Hellinger, and Alpha-divergences.

Many natural phenomena are inherently probabilistic. With past observations at hand, probabilistic models can therefore help us predict, estimate, and understand, future outcomes and trends. The two most fundamental probabilistic models for sequential data are i.i.d. processes and Markov chains. In an i.i.d. process, for each  i ≥ 1, a sample  Xiis generated independently according to the same underlying distribution. In Markov chains, for each  i ≥ 2, the distribution of sample  Xiis determined by just the value of  Xi−1.

Let us confine our discussion to random processes over finite alphabets, without loss of generality, assumed to be [k] := {1, 2, . . . , k}. An i.i.d. process is defined by a single distribution p over [k], while a Markov chain is characterized by a transition probability matrix  M over [k] × [k]. We denote the initial and stationary distributions of a Markov model by  µand  π, respectively. For notational consistency let P = (p) denote an i.i.d. model and P = (M) denote a Markov model.

Having observed a sample sequence  Xn := X1, . . . , Xn from anunknown i.i.d. process or Markov chain, a natural problem is to predict the next sample point  Xn+1. Since  Xn+1is a random

variable, this task is typically interpreted as estimating the conditional probability distribution Pxn := Pr(Xn+1 = ·|Xn = xn)of the next sample point  Xn+1.Let  [k]∗ denote the collection of all finite-length sequences over [k].

Therefore, conditioning on  Xn = xn, our first objective is to estimate the conditional distribution To be more precise, we would like to find an estimator ˆP, that associates with every sequence  xn ∈ [k]∗a distribution ˆPxn over [k]that approximates  Pxnin a suitable sense.

Perhaps a more classical problem is parameter estimation, which describes the underlying process. An i.i.d. process is completely characterized by  Pxn = p, hence this problem coincides with the previous one. For Markov chains, we seek to estimate the transition matrix M. Therefore, instead of producing a probability distribution ˆPxn, the estimator ˆMmaps every sequence  xn ∈ [k]∗to a transition matrix ˆMxn over [k] × [k].

For two distributions p and q over [k], let L(p, q) be the loss when p is approximated by q. For the prediction problem, we measure the performance of an estimator ˆPin terms of its prediction risk,

image

the expected loss with respect to the sample sequence  Xn, where P(xn) := Pr(Xn = xn).

For the estimation problem, we quantify the performance of the estimator by estimation risk. We first consider the expected loss of ˆMwith respect to a single state  i ∈ [k]:

image

We then define the estimation risk of ˆMgiven sample sequence  Xnas the maximum expected loss over all states,

image

While the process P we are trying to learn is unknown, it often belongs to a known collection P. The worst prediction risk of an estimator ˆPover all distributions in P is

image

The minimal possible worst-case prediction risk, or simply the minimax prediction risk, incurred by any estimator is

image

The worst-case estimation risk  εLn(P, ˆM)and the minimax estimation risk  εLn(P)are defined similarly. Given P, our goals are to approximate the minimax prediction/estimation risk to a universal constant-factor, and to devise estimators that achieve this performance.

An alternative definition of the estimation risk, considered in (1) and mentioned by a reviewer, is

image

We denote the corresponding minimax estimation risk by  ˜εLn(P).

Let o(1) represent a quantity that vanishes as  n → ∞. In the following, we use  a ≲ bto denote a ≤ b(1 + o(1)), and a ≍ b to denote a ≤ b(1 + o(1)) and b ≤ a(1 + o(1)).

For the collection  IIDkof all the i.i.d. processes over [k], the above two formulations coincide and the problem is essentially the classical discrete distribution estimation problem. The problem of determining  ρLn(IIDk)was introduced by (2) and studied in a sequence of papers (3; 4; 5; 6; 7). For fixed k and KL-divergence loss, as n goes to infinity, (7) showed that

image

KL-divergence and many other important similarity measures between two distributions can be expressed as f-divergences (8). Let f be a convex function with f(1) = 0, the f-divergence between two distributions p and q over [k], whenever well-defined, is  Df(p, q) := �i∈[k] q(i)f(p(i)/q(i)). Call an f-divergence ordinary if f is thrice continuously differentiable over  (0, ∞), sub-exponential, namely,  limx→∞ |f(x)|/ecx = 0 for all c > 0, and satisfies  f ′′(1) ̸= 0.

Observe that all the following notable measures are ordinary f-divergences: Chi-squared divergence (9) from  f(x) = (x−1)2, KL-divergence (10) from f(x) = x log x, Hellinger divergence (11) from  f(x) = (√x − 1)2, and Alpha-divergence (12) from  fα(x) := 4(1 − x(1+α)/2)/(1 − α2), where  α ̸= ±1.

Related Work For any f-divergence, we denote the corresponding minimax prediction risk for an n-element sample over set P by  ρfn(P). Researchers in (13) considered the problem of de- termining  ρfn(IIDk)for the ordinary f-divergences. Except the above minimax formulation, re- cently, researchers also considered formulations that are more adaptive to the underlying i.i.d. processes (14) (15). Surprisingly, while the min-max risk of i.i.d. processes was addressed in a large body of work, the risk of Markov chains, which frequently arise in practice, was not studied until very recently.

Let  Mkdenote the collection of all the Markov chains over [k]. For prediction with KLdivergence, (16) showed that  ρKLn (Mk) = Θk (log log n/n), but did not specify the dependence on k. For estimation, (17) considered the class of Markov Chains whose pseudo-spectral gap is bounded away from 0 and approximated the  L1estimation risk to within a log n factor. Some of their techniques, in particular the lower-bound construction in their displayed equation (4.3), are of similar nature and were derived independently of results in Section 5 in our paper.

Our first main result determines the dependence of  ρKLn (Mk)on both k and n, to within a factor of roughly k: Theorem 1. The minimax KL-prediction risk of Markov chains satisfies

image

Depending on M, some states may be observed very infrequently, or not at all. This does not drastically affect the prediction problem as these states will be also have small impact on  ρKLn (Mk) inthe prediction risk  ρLn(P, ˆP). For estimation, however, rare and unobserved states still need to be well approximated, hence  εLn(Mk)does not decrease with n, and for example  εKLn (Mk) = log k for all n.

We therefore parametrize the risk by the lowest probability in the transition matrix. For  δ > 0 let

image

be the collection of Markov chains whose lowest transition probability exceeds  δ. Note that  Mkδis trivial if  δ ≥ 1/k, we only consider  δ ∈ (0, 1/k). We characterize the minimax estimation risk of Mkδ almost precisely.

Theorem 2. For all ordinary f-divergences and all  δ ∈ (0, 1/k),

image

We can further refine the estimation-risk bounds by partitioning  Mkδ based on the smallest probability in the chain’s stationary distribution  π. Clearly,  mini∈[k] πi ≤ 1/k. For  0 < π∗ ≤ 1/kand

image

be the collection of all Markov chains in  Mkδ whose lowest stationary probability is  π∗. We determine the minimax estimation risk over  Mkδ,π∗nearly precisely.

Theorem 3. For all ordinary f-divergences,

image

and

image

For  L2-distance corresponding to the squared Euclidean norm, we prove the following risk bounds.

Theorem 4. For all  δ ∈ (0, 1/k),

image

Theorem 5. For all  δ ∈ (0, 1/k) and π∗ ∈ (0, 1/k],

image

and

image

The rest of the paper is organized as follows. Section 2 introduces add-constant estimators and additional definitions and notation for Markov chains. Note that each of the above results consists of a lower bound and an upper bound. We prove the lower bound by constructing a suitable prior distribution over the relevant collection of processes. Section 3 and 5 describe these prior distributions for the prediction and estimation problems, respectively. The upper bounds are derived via simple variants of the standard add-constant estimators. Section 4 and 6 describe the estimators for the prediction and estimation bounds, respectively. For space considerations, we relegate all the proofs to Section 9 to 12.

2.1 Add-constant estimators

Given a sample sequence  Xn from an i.i.d. process (p), let N ′i denote the number of times symbol i appears in  Xn. The classical empirical estimator estimates p by

image

The empirical estimator performs poorly for loss measures such as KL-divergence. For example, if p assigns a tiny probability to a symbol so that it is unlikely to appear in  Xn, then with high probability the KL-divergence between  p and ˆpXnwill be infinity.

A common solution applies the Laplace smoothing technique (18) that assigns to each symbol i a probability proportional to  N ′i + β, where  β > 0is a fixed constant. The resulting add-βestimator, is denoted by  ˆp+β. Due to their simplicity and effectiveness, add-βestimators are widely used in various machine learning algorithms such as naive Bayes classifiers (19). As shown in (7), for the i.i.d. processes, a variant of the add-3/4 estimator achieves the minimax estimation risk  ρKLn (IIDk).Analogously, given a sample sequence  Xn generated by a Markov chain, let  Nijdenote the number of times symbol j appears right after symbol i in  Xn, and let  Nidenote the number of times that symbol i appears in  Xn−1. We define the add-β estimator ˆM +β as

image

2.2 More on Markov chains

Adopting notation in (20), let  ∆kdenote the collection of discrete distributions over [k]. Let  [k]eand  [k]o be the collection of even and odd integers in [k], respectively. By convention, for a Markov chain over [k], we call each symbol  i ∈ [k] a state. Given a Markov chain, the hitting time  τ(j) is thefirst time the chain reaches state j. We denote by Pri(τ(j) = t)the probability that starting from i, the hitting time of j is exactly t. For a Markov chain (M), we denote by  P tthe distribution of  Xtif we draw  Xt ∼ (M). Additionally, for a fixed Markov chain (M), the mixing time  tmixdenotes the smallest index t such that  L1(P t, π) < 1/2. Finally, for notational convenience, we write  Mijinstead of M(i, j) whenever appropriate.

A standard lower-bound argument for minimax prediction risk uses the fact that

image

for any prior distribution  Π over P. One advantage of this approach is that the optimal estimator that minimizes  EP ∼Π[ρKLn (P, ˆP)]can often be computed explicitly.

Perhaps the simplest prior is the uniform distribution  U(PS)over a subset  PS ⊂ P. Let ˆP ∗be the optimal estimator minimizing  EP ∼U(PS)[ρKLn (P, ˆP)]. Computing  ˆP ∗for all the possible sample sequences  xn may be unrealistic. Instead, let  Knbe an arbitrary subset of  [k]n, we can lower bound

image

The key to applying the above arguments is to find a proper pair  (PS, Kn).

Without loss of generality, assume that k is even. Let  a := 1n and b := 1 − k−2n , and define

image

In addition, let

image

and let  ukdenote the uniform distribution over [k]. Finally, given n, define

image

Next, let  Knbe the collection of sequences  xn ∈ [k]n whose last appearing state didn’t transition to any other state. For example, 3132, or 31322, but not 21323. In other words, for any state  i ∈ [k], let¯irepresent an arbitrary state in [k] \ {i}, then

image

For the  Kndefined in the last section,

image

We denote the partial minimax prediction risk over  Kn by

image

Let  Kn := [k]n \ Kn. Define  ρKLn (P, ˆP; Kn)and  ρKLn (P; Kn)in the same manner. As the consequence of ˆPbeing a function from  [k]n to ∆k, we have the following triangle inequality,

image

Turning back to Markov chains, let ˆP + 12denote the estimator that maps  Xn ∼ (M) to ˆM + 12 (Xn, ·),one can show that

image

Recall the following lower bound

image

This together with the above upper bound on  ρKLn (Mk; Kn)and the triangle inequality shows that an upper bound on  ρKLn (Mk; Kn)also suffices to bound the leading term of  ρKLn (Mk). The following construction yields such an upper bound.

We partition  Knaccording to the last appearing state and the number of times it transitions to itself,

image

For any  xn ∈ Kn, there is a unique  Kℓ(i) such that xn ∈ Kℓ(i). Consider the following estimator

image

we can show that

image

The upper-bound proof applies the following lemma that uniformly bounds the hitting probability of any k-state Markov chain. Lemma 1. (21) For any Markov chain over [k] and any two states  i, j ∈ [k], if n > k, then

image

Analogous to Section 3, we use the following standard argument to lower bound the minimax risk

image

where  MS ⊂ Mand  U(MS)is the uniform distribution over  MS. Setting  M = Mk(δ, π∗), we outline the construction of  MSas follows.

Let  uk−1be the uniform distribution over  [k − 1]. As in (13), denote the  L∞ball of radius r around

image

where  L∞(·, ·) is the L∞distance between two distributions. Define

image

and

image

Noting that the uniform distribution over  MS, U(MS), is induced by  U(Bk−1(1/n′)), the uniform distribution over  Bk−1(1/n′)and thus is well-defined.

An important property of the above construction is that for a sample sequence  Xn ∼ (M) ∈ MS, Nk, the number of times that state k appears in  Xn, is a binomial random variable with parameters n and  π∗. Therefore, by the following lemma,  Nkis highly concentrated around its mean  nπ∗.

image

In order to prove the lower bound on  ˜εfn(Mkδ,π∗), we only need to modify the above construction as follows. Instead of drawing the last row of the transition matrix  Mn(p′)uniformly from the distribution induced by  U(Bk−1(1/n′)), we draw all rows independently in the same fashion. The proof is omitted due to similarity.

The proof of the upper bound relies on a concentration inequality for Markov chains in  Mkδ, which can be informally expressed as

image

Note that this inequality is very similar to the Hoeffding’s inequality for i.i.d. processes.

The difficulty in analyzing the performance of the original add-βestimator is that the chain’s initial distribution could be far away from its stationary distribution and finding a simple expression for E[Ni]and  E[Nij]could be hard. To overcome this difficulty, we ignore the first few sample points and construct a new add-βestimator based on the remaining sample points. Specifically, let  Xnbe a length-n sample sequence drawn from the Markov chain (M). Removing the first m sample points,  Xnm+1 := Xm+1, . . . , Xncan be viewed as a length-(n−m)sample sequence drawn from (M) whose initial distribution  µ′ satisfies

image

Let  m = √n. For sufficiently large  n, L1(µ′, π) ≪ 1/n2and √n ≪ n. Hence without loss of generality, we assume that the original initial distribution  µalready satisfies  L1(µ, π) < 1/n2. If not,we can simply replace  Xn by Xn√n+1.

To prove the desired upper bound for ordinary f-divergences, it suffices to use the add-β estimator

image

For the  L2-distance, instead of an add-constant estimator, we apply an add-√Ni/k estimator

image

We augment the theory with experiments that demonstrate the efficacy of our proposed estimators and validate the functional form of the derived bounds.

We briefly describe the experimental setup. For the first three figures,  k = 6, δ = 0.05, and 10, 000 ≤ n ≤ 100, 000. For the last figure,  δ = 0.01, n = 100, 000, and  4 ≤ k ≤ 36. In all the experiments, the initial distribution  µof the Markov chain is drawn from the k-Dirichlet(1) distribution. For the transition matrix M, we first construct a transition matrix  M ′ where each row is drawn independently from the k-Dirichlet(1) distribution. To ensure that each element of M is at least  δ, let  Jkrepresent the  k × kall-ones matrix, and set  M = M ′(1 − kδ) + δJk. We generate a new Markov chain for each curve in the plots. And each data point on the curve shows the average loss of 100 independent restarts of the same Markov chain.

The plots use the following abbreviations: Theo for theoretical minimax-risk values; Real for real experimental results: using the estimators described in Sections 4 and 6; Pre for average prediction loss and Est for average estimation loss; Const for add-constant estimator; Prop for proposed add-√Ni/kestimator described in Section 6; Hell, Chi, and Alpha(c) for Hellinger divergence, Chi-squared divergence, and Alpha-divergence with parameter c. In all three graphs, the theoretical min-max curves are precisely the upper bounds in the corresponding theorems, except that in the prediction curve in Figure 1a the constant factor 2 in the upper bound is adjusted to 1/2 to better fit the experiments. Note the excellent fit between the theoretical bounds and experimental results.

Figure 1a shows the decay of the experimental and theoretical KL-prediction and KL-estimation losses with the sample size n. Figure 1b compares the  L2-estimation losses of our proposed estimator and the add-one estimator, and the theoretical minimax values. Figure 1c compares the experimental estimation losses and the theoretical minimax-risk values for different loss measures. Finally, figure 1d presents an experiment on KL-learning losses that scales k up while n is fixed. All the four plots demonstrate that our theoretical results are accurate and can be used to estimate the loss incurred in learning Markov chains. Additionally, Figure 1b shows that our proposed add-√Ni/kestimator is uniformly better than the traditional add-one estimator for different values of sample size n. We have also considered add-constant estimators with different constants varying from 2 to 10 and our proposed estimator outperformed all of them.

We studied the problem of learning an unknown k-state Markov chain from its n sequential sample points. We considered two formulations: prediction and estimation. For prediction, we determined the minimax risk up to a multiplicative factor of k. For estimation, when the transition probabilities are bounded away from zero, we obtained nearly matching lower and upper bounds on the minimax risk for  L2and ordinary f-divergences. The effectiveness of our proposed estimators was verified through experimental simulations. Future directions include closing the gap in the prediction problem in Section 1, extending the results on the min-max estimation problem to other classes of Markov chains, and extending the work from the classical setting  k ≪ n, to general k and n.

image

Figure 1: Experiments

[1] Moein Falahatgar, Mesrob I. Ohannessian, and Alon Orlitsky. Near-optimal smoothing of structured conditional probability matrices? In In Advances in Neural Information Processing Systems (NIPS), pages 4860–4868, 2016.

[2] Edgar Gilbert. Codes based on inaccurate source probabilities. IEEE Transactions on Information Theory, 17, 3:304–314, 1971.

[3] Thomas Cover. Admissibility properties or gilbert’s encoding for unknown source probabilities (corresp.). IEEE Transactions on Information Theory, 18.1:216–217, 1972.

[4] Raphail Krichevsky and Victor Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27.2:199–207, 1981.

[5] Dietrich Braess, Jürgen Forster, Tomas Sauer, and Hans U. Simon. How to achieve minimax expected kullback-leibler distance from an unknown finite distribution. In International Conference on Algorithmic Learning Theory, pages 380–394. Springer, 2002.

[6] Liam Paninski. Variational minimax estimation of discrete distributions under kl loss. In Advances in Neural Information Processing Systems, pages 1033–1040, 2004.

[7] Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.

[8] I. Csiszár. Information type measures of differences of probability distribution and indirect observations. Studia Math. Hungarica, 2:299–318, 1967.

[9] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21, no. 1:10–13, 2014.

[10] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22, no. 1:79–86, 1951.

[11] Mikhail S Nikulin. Hellinger distance. Encyclopedia of mathematics, 151, 2001.

[12] Gavin E. Crooks. On measures of entropy and information. Tech. Note 9 (2017): v4, 2017.

[13] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Annual Conference on Learning Theory (COLT), pages 1066–1100, 2015.

[14] Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is good-turing good. In Advances in Neural Information Processing Systems, pages 2143–2151, 2015.

[15] Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In 48th annual ACM symposium on Theory of Computing, pages 142–155, 2016.

[16] Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Learning markov distributions: Does estimation trump compression? In IEEE International Symposium on Information Theory (ISIT), pages 2689–2693, 2016.

[17] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic markov chains. arXiv:1809.05014, 2018.

[18] Kai Lai Chung and Farid AitSahlia. Elementary probability theory: With stochastic processes and an introduction to mathematical finance. Springer Science & Business Media, 2012.

[19] Christopher M. Bishop and Tom M. Mitchell. Pattern recognition and machine learning. Springer, 2006.

[20] David A.Levin and Yuval Peres. Markov chains and mixing Times. American Mathematical Soc., 2017.

[21] James Norris, Yuval Peres, and Alex Zhai. Surprise probabilities in markov chains. Combinatorics, Probability and Computing, 26.4:603–627, 2017.

[22] Fan RK Chung and Linyuan Lu. Complex graphs and networks. American Mathematical Soc., 2006.

[23] Daniel Paulin. Concentration inequalities for markov chains by marton couplings and spectral methods. Electronic Journal of Probability, 20, 2015.

A standard argument for lower bounding the minimax prediction risk is

image

where  Πis a prior distribution over P. The advantage of this approach is that the optimal estimator that minimizes  EP ∼Π[ρKLn (P, ˆP)]can often be computed explicitly.

Perhaps the simplest prior is the uniform distribution over some subset of P. Consider the uniform distribution over  PS ⊂ P, say  U(PS), the following lemma shows an explicit way of computing the optimal estimator for  EP ∼U(PS)[ρKLn (P, ˆP)] when PS is finite.

Lemma 3. Let ˆP ∗be the optimal estimator that minimizes  EP ∼U(PS)[ρKLn (P, ˆP)], then for any xn ∈ [k]n and any symbol  i ∈ [k],

image

Clearly, computing ˆP ∗ for all the possible sample sequences  xn may be unrealistic. Instead, let  Knbe an arbitrary subset of  [k]n, we can lower bound

image

The key to apply the above arguments is to find a proper pair  (PS, Kn). The rest of this section is organized as follows. In Subsection 9.1, we present our construction of  PSand  Kn. In Subsection 9.2, we find the exact form of the optimal estimator using Lemma 3. Then we analyze its prediction risk over  Knin Subsection 9.3, where we further partition  Kninto smaller subsets  Kℓ(i),and lower bound the KL-divergence over  Kℓ(i)and the probability  P(Xn ∈ Kℓ(i))in Lemma 6 and 7, respectively. Finally, we consolidate all the previous results and prove the desired lower bound on  ρKLn (P).

9.1 Prior construction

Without loss of generality, we assume that k is an even integer. For notational convenience, we denote by  ukthe uniform distribution over [k] and define

image

Given n, we set

image

Then, we choose  Knto be the collection of sequences  xn ∈ [k]nwhose last appearing state didn’t transition to any other symbol. In other words, for any state  i ∈ [k], let ¯irepresent an arbitrary state other than i, then

image

According to both the last appearing state and the number of times it transitions to itself, we can partition  Kn as

image

9.2 The optimal estimator

Let ˆP ∗denote the optimal estimator that minimizes  EP ∼U(PS)[ρKLn (P, ˆP; Kn)]. The following lemma presents the exact form of ˆP ∗.

Lemma 4. For any  xn ∈ Kn, there exists a unique  Kℓ(i)that contains it. Consider ˆP ∗xn, we have:

image

Proof. Given  (M) ∈ PS, consider Xn ∼ (M),

image

By Lemma 3, for any  xn ∈ Kℓ(i) and j ∈ [k], ˆP ∗xn(j)evaluates to

image

Noting that  xn ∈ Kℓ(i)implies  Nii = ℓ − 1and  Nij = 0, ∀j ̸= i. Besides, for any  j1 ∈ [k]and i1 ∈ [k] \ {j1, j1 + 1}, Mi1j1is uniquely determined by  i1 and j1 for all (M) ∈ PS.

Thus, for s = 0 or 1, we can rewrite  Mijs �i1∈[k]�j1∈[k]MNi1j1i1j1 as

image

where  C(xn, k)is a constant that only depends on  xn and k.

Hence, for any  xn ∈ Kℓ(i),

image

image

Combining  MjjNjj with Mjjin the nominator,

image

This completes the proof.

9.3 Analysis

Next, for any  xn ∈ Kℓ(i), we lower bound  DKL(Pxn, ˆP ∗xn)in terms of  Mi(i−1) and ˆP ∗xn(i − 1).

Lemma 5. For any  (M) ∈ PS and xn ∈ Kℓ(i),

image

Proof. By the previous lemma,

image

Noting that xx+1 ≤ log(x + 1) for all x > −1,

image

This completes the proof.

image

Lemma 6. For any  (M) ∈ PS, xn ∈ Kℓ(i)where  i ∈ [k]e, Mi(i−1) = 1(log n)m ∈ V ′n, and sufficiently large n, if

image

then,

image

Proof. Consider  Mi(i−1) = 1(log n)m ∈ V ′n, where m ∈ [1, log n4 log log n].

image

where

image

We have the following bounds on these quantities.

Bounds for  Xℓ

image

Bounds for  Aℓ

image

Bounds for  Dℓ

image

Bounds for  Cℓ

image

where we use the inequality  i ≥ m + 1 and (1 − 1x)x ≥ 14 for x ≥ 2.

Hence,

image

Bounds for  Bℓ

Similarly, consider a single term of  Bℓwithout the factor  (log n)m−i,

image

where we use the inequality  (1 − 1x)x ≤ 1e for x ≥ 2.

Hence,

image

where we use the inequality  x − 2(log x)2 ≥ 0 for x ≥ 1.

Putting everything together:

image

This completes the proof.

Another quantity that will be appear later is  Pr(Xn ∈ Kℓ(i))where  Xn ∼ (M) ∈ PS. We need the following lower bound. Lemma 7. For  Xn ∼ (M) ∈ PS and i ∈ [k]e,

image

Proof. By our construction of  PS, for  Xn ∼ (M) ∈ PSand  i ∈ [k]e, we have the following observations.

image

2. The probability of transitioning from some state  j ̸= ito some state that is not  i is 1 − 1n.

3. The probability of transitioning from some state  j ̸= i to state i is 1n.

4. The probability of transitioning from state i to itself is  1 − k−2n − Mi(i−1).

Therefore,

image

This completes the proof.

image

By Lemma 5 and 6, for  ℓ1(M) ≤ ℓ ≤ ℓ2(M) and Mi(i−1) ∈ V ′n,

image

By Lemma 7,

image

Therefore,

image

where the last step follows by symmetry.

Next, we show that for any  v = 1(log n)m ∈ V ′n,

image

Noting that  Tmis simply the summation of a geometric sequence, we can compute it as follows

image

To provide a lower bound for  Tm, we use the following inequalities:

image

Consolidating these three inequalities, the sum  Tmcan be lower bounded by

image

Finally,

image

The proof makes use of the following lemma, which provides a uniform upper bound for the hitting probability of any k-state Markov chain. Lemma 8. (21) For any Markov chain over [k] and any two states  i, j ∈ [k], if n > k, then

image

Let  Knbe the same as is in the previous section. Recall that

image

we denote the partial minimax prediction risk over  Kn by

image

Let  Kn := [k]n \ Kn, we define  ρKLn (P, ˆP; Kn)and  ρKLn (P; Kn)in the same manner. As the consequence of ˆPbeing a function from  [k]n to ∆k, we have the following triangle inequality,

image

Turning back to Markov chains, the next lemma upper bounds  ρKLn (Mk; Kn).

Lemma 9. Let ˆP + 12denote the estimator that maps  Xn ∼ (M) to ˆM + 12 (Xn, ·), then

image

which implies

image

Proof. The proof of this lemma is essentially a combination of the upper bounds’ proofs in (16) and in Section 12. Instead of using the fact that  Mijare bounded away from 0 (see Section 12), we partition  Kninto different subsets according to how close the counts are to their expected values, the number of times that the last appearing state transitioning to itself, and the number of times that the last appearing state transitioning to other states. Then, we bound the estimator’s expected loss over each set of the partition by  Ok (1/n). We omit the proof for the sake of brevity.

Recall the following lower bound,

image

This together with Lemma 5 and the triangle inequality above shows that an upper bound on ρKLn (Mk; Kn)also suffices to bound the leading term of  ρKLn (Mk). The following lemma provides such an upper bound. Recall that for any  i ∈ [k], Kℓ(i)is defined as  {xn ∈ [k]n : xn = ¯in−ℓiℓ}.

Lemma 10. For any  xn ∈ Kn, there exists a unique pair  (ℓ, i) such that xn ∈ Kℓ(i). Consider the following estimator

image

and

image

then we have

image

Proof. Let  i ∈ [k]be an arbitrary state. For simplicity of illustration, we use the following notation: for any  xn = ¯in−ℓiℓ, denote  ˆpℓ := ˆPxn; for any  (M) ∈ Mk, denote  pi := M(i, ·); for any  ℓ ≤ n, denote  hi,ℓ := Pr(τ(i) = ℓ). By Lemma 8, the hitting probability  hi,ℓis upper bounded by  k/ℓfor all  ℓ > k. We can write

image

Now, we break the right hand side into two sums according to whether  ℓis greater than n/2 or not. For  ℓ > n/2, we have

image

Similarly, for  ℓ ≤ n/2, we have

image

This completes the proof.

The proof of the lower bound makes use of the following concentration inequality, which upper

bounds the probability that a binomial random variable exceeds its mean.

Lemma 11. (22) Let Y be a binomial random variable with parameters  m ∈ N and p ∈ [0, 1], then

image

11.1 Prior construction

Again we use the following standard argument to lower bound the minimax risk,

image

where  MS ⊂ M and U(MS)is the uniform distribution over  MS. Setting M = Mkδ,π∗, we outline the construction of  MSas follows.

We adopt the notation in (13) and denote the  L∞ball of radius  r around uk−1, the uniform distribution

image

where  L∞(·, ·) is the L∞distance between two distributions. For simplicity, define

image

and

image

where  ¯π∗ = 1 − π∗ and �k−1i=1 pi = 1.

Given  n and ϵ ∈ (0, 1), let n′ := (n(1 + ϵ)π∗)1/5. We set

image

Noting that the uniform distribution over  MS, U(MS), is induced by  U(Bk−1(1/n′)), the uniform distribution over  Bk−1(1/n′)and thus is well-defined.

An important property of the above construction is that for a sample sequence  Xn ∼ (M) ∈ MS, Nk, the number of times that state k appears in  Xn, is a binomial random variable with parameters n and  π∗. Therefore, Lemma 11 implies that  Nkis highly concentrated around its mean  nπ∗.

11.2 L2-divergence lower bound

Let us first consider the  L2-distance. Similar to Lemma 3, ˆM ∗, the estimator that minimizes E(M)∼U(MS)[εL2n (M, ˆM)], can be computed exactly. In particular, we have the following lemma.

Lemma 12. There exists an estimator ˆM ∗ with

image

such that ˆM ∗ minimizes E(M)∼U(MS)[εL2n (M, ˆM)].

Based on the above lemma, we can relate the minimax estimation risk of Markov chains to the minimax prediction risk of i.i.d. processes. For simplicity, denote  Bi.i.d. := {(p) ∈ IIDk−1 : p ∈Bk−1(1/n′)}. The following lemma holds.

Lemma 13. For any  xn ∈ [k]n, let  I(xn)be the collection of indexes  j ∈ [n]such that  xj = k. Then,

image

where  I0is an arbitrary non-empty subset of  [n] and C(I0, π∗, p∗, n)is a constant whose value only depends on  I0, π∗, p∗, and n.

Proof. We first consider the inner expectation on the left-hand side of the equality. For any  (M) ∈

image

Let us partition  I0into two parts: the collection of indexes  m ∈ I0 ∩ [n − 1]such that  m ∈ I0and m + 1 ̸∈ I0, say  {m1, . . . , ms}, and the remaining elements in  I0. By the construction of  MS, we have

image

For any  xn, let xn \ I0denote the subsequence  xj1, . . . , xjn−|I0|−s such that j1 < j2 . . . < jn−|I0|−s,jt ̸∈ I0 and jt − 1 ̸∈ {m1, . . . , ms}, ∀t. We can further partition the last summation according to

image

Fixing  yn−|I0|−s ∈ [k − 1]n−|I0|−s, there is a bijective mapping from  S(I0, yn−|I0|−s) := {xn :xj = k, ∀j ∈ I0 and xn\I0 = yn−|I0|−s} to [k−1]s, say g(·). Furthermore, we have ˆM ∗(k, k) = π∗.Hence, we can denote  q∗g(xn) := ˆM ∗xn(k,[k−1])¯π∗for  xn ∈ S(I0, yn−|I0|−s)and treat it as a mapping from  [k−1]s to ∆k−1. Also, (M) ∈ MSimplies that  M(k, [k−1]) = p′ for some p′ ∈ Bk−1(1/n′).Thus,

image

By definition, ˆM ∗minimizes  E(M)∼U(MS)[εL2n (M, ˆM)]and for each  xn ∈ [k]n, its value ˆM ∗xnis completely determined by  xn. Besides,  {S(I0, yn−|I0|−s) : I0 ⊂ [n] and yn−|I0|−s ∈ [k −1]n−|I0|−s}forms a partition of  [k]n. Therefore, by the linearity of expectation and the definition of  q∗, the estimator  q∗also minimizes  Ep′∼U(Bk−1(1/n′))[EZs∼(p′)[(¯π∗)2L2(p′, qZs)]], where the minimization is over all the possible mappings  q from [k − 1]s to ∆k−1. Equivalently, we have

image

This immediately yields the lemma.

For any  (M) ∈ MS, denote by  Nk((M), n)the number of times that state k appears in  Xn ∼ (M),which is a random variable induced by (M) and n. Lemma 13, we can deduce that Lemma 14.

image

By Lemma 11 and our construction of  MS, the probability that  Nk((M), n) ≥ (1 + ϵ)nπ∗ is at mostexp(−ϵ2nπ∗/3) for any (M) ∈ MS and ϵ ∈ (0, 1). This together with Lemma 14 and

image

11.3 Lower bound for ordinary f-divergences

Now we proceed from the  L2-distance to ordinary f-divergences. The following lemma from (13) shows that  Df(p, q)decreases if we move q closer to p. Lemma 16. For  p1 > q1, p2 < q2 and d ≤ min{p1 − q1, q2 − p2},

image

Based on the above lemma, we show that for any  xn ∈ [k]n, the value of the optimal estimator is always close to  (uk−1¯π∗, π∗).

Let  ˆp∗xn := ˆM ∗xn(k, ·). For any xn ∈ [k]n, we claim that either  ˆp∗xn(j) ≥ ( 1k−1 − 1n′ )¯π∗, ∀j ∈ [k −1]and  ˆp∗xn(k) ≥ π∗ OR ˆp∗xn(j) ≤ ( 1k−1 + 1n′ )¯π∗, ∀j ∈ [k−1] and ˆp∗xn(k) ≤ π∗. Otherwise, Lemma 16 implies that we can reduce the estimation risk by moving  ˆp∗xn closer to (uk−1¯π∗, π∗).

If ˆp∗xn(j) ≥ ( 1k−1 − 1n′ )¯π∗, ∀j ∈ [k − 1]and ˆp∗xn(k) ≥ π∗, then ˆp∗xn(j) ≤ ( 1k−1 + k−2n′ )¯π∗, ∀j ∈ [k − 1]and ˆp∗xn(k) ≤ π∗ + k−1n′ ¯π∗. Similarly, if ˆp∗xn(j) ≤ ( 1k−1 + 1n′ )¯π∗, ∀j ∈ [k − 1]and ˆp∗xn(k) ≤ π∗, then ˆp∗xn(j) ≥ ( 1k−1 − k−2n′ )¯π∗, ∀j ∈ [k − 1] and ˆp∗xn(k) ≥ π∗ − k−1n′ ¯π∗.

Now we relate  Df(p, ˆp∗) to L2(p, ˆp∗). For simplicity, denote  p := M(k, ·) and drop xn from ˆp∗xn.Lemma 17. For sufficiently large n,

image

Proof. By the previous lemma,  ˆp∗xn(j) = ( 1k−1 ± k−2n′ )¯π∗, ∀j ∈ [k − 1] and ˆp∗xn(k) = π∗ ± k−1n′ ¯π∗.Therefore,

image

Let us denote the interval on the right hand side by I.

For sufficiently large n, we can apply the second-order Taylor expansion to f at point 1.

image

12.1 Concentration of the counts

The proof of the upper bound relies on the following concentration inequality, which shows that for any Markov chain in  Mkδ and any state  i ∈ [k], with high probability  Nistays close to  (n − 1)πi, forsufficiently large n.

Lemma 19. Given a sample sequence  Xnfrom any Markov chain  (M) ∈ Mkδ, let  Nidenote the number of times that symbol i appears in  Xn−1. Then for any  t ≥ 0,

image

where  πis the stationary distribution of (M) and

image

Proof. Given  (M) ∈ Mkδ, recall that  P n+1denotes the distribution of  Xn+1if we draw  Xn+1 ∼(M). First, we show that

image

Let  Πbe the  k × kmatrix such that  Π(i, ·) = πfor all  i ∈ [k]. Noting that  M(i, j) ≥ δΠ(i, j), we can define

image

which is also a valid transition matrix.

By induction, we can show

image

Hence, let  | · |denote the  L1norm, we have

image

This implies that we can upper bound  tmix by C(δ).

The remaining proof follows from Proposition 3.4, Theorem 3.4, and Proposition 3.10 of (23) and is omitted here for the sake of brevity.

Noting that Pr(|Ni − (n − 1)πi| > (n − 1)) = 0, we have

image

Informally, we can express the above inequality as

image

which is very similar to the Hoeffding’s inequality for the i.i.d. processes. As an important implication, the following lemma bounds the moments of  |Ni − (n − 1)πi|.Lemma 20. For  Nidefined in Lemma 19 and any  m ∈ Z+,

image

Proof. The statement follows from

image

12.2 A modified add-β estimator

The difficulty with analyzing the performance of the original add-βestimator is that the chain’s initial distribution could be far away from its stationary distribution and finding a simple expression for E[Ni] and E[Nij]could be hard. To overcome such difficulty, we ignore the first few sample points and construct a new add-βestimator based on the remaining sample points. To be more specific, let Xn be a length-n sample sequence drawn from the Markov chain (M). Removing the first m sample points,  Xnm+1 := Xm+1, . . . , Xncan be viewed as a length-(n−m)sample sequence drawn from (M) whose initial distribution  µ′ satisfies

image

Setting  m = √n, we have L1(µ′, π) ≲ 1/n2. Noting that √n ≪ nfor sufficiently large n, without loss of generality, we assume that the original distribution  µalready satisfies  L1(µ, π) < 1/n2. If not, we can simply replace  Xn by Xn√n+1.

To prove the upper bound, we consider the following (modified) add-βestimator:

image

where  β > 0is a fixed constant.

We can compute the expected values of these counts as

image

and

image

12.3 Analysis

For notational convenience, let us re-denote  n′ := n − 1.

image

The second inequality follows from the fact that  Nijcan be viewed as the sum of counts from the following two Markov chains over  [k] × [k]whose transition probabilities are greater than  δ2:

image

In other words,  Niand  Nijare highly concentrated around  n′πiand  n′πiMij, respectively. Let  Aidenote the event that  Ni = n′πi(1 ± δ/2)and  Nij = n′πiMij(1 ± δ/2), ∀j ∈ [k]. Let  ACidenote the event that  Aidoes not happen. Applying the union bound, we have

image

Now consider

image

the corresponding estimation risk of ˆM +β over a particular sate  i ∈ [k]can be decomposed as

image

Noting that

image

and  Mij ∈ [δ, 1], we have

image

Hence, we can bound the second term as

image

where the last step follows from our assumption that f is sub-exponential.

By the definition of  Df and ˆM +β,

image

Let  h(x) := f� 11+x�, then h is thrice continuously differentiable around some neighborhood of point 0 and

image

We apply Taylor expansion to h at point 0 and rewrite the expectation on the right-hand side as

image

where by our definition of  Ai, we set

image

Now, we bound individual terms. Taking out  h′(0), the first term evaluates to:

image

Taking out  h′′(0)/2, the second term evaluates to:

image

Finally, taking out  M(δ)/6, the last term can be bounded as

image

where we have used the ineuqality  (a + b)3 ≤ 4(|a|3 + |b|3) twice.

By the definition of  h(·), we have

image

and

image

Hence, consolidating all the previous results,

image

It remains to analyze  EN 2i and EN 2ij.

image

Thus, the desired quantity evaluates to

image

The above inequality yields

image

This completes our proof for ordinary f-divergences.

12.4 L2-divergence upper bound

Finally, we consider the  L2-divergence. Again, we assume that the sample sequence  Xn ∼ (M) and

image

Instead of using an add-constant estimator, we use the following add-√Ni/kestimator:

image

Now, consider the expected loss for a particular state  i ∈ [k].

image

We first show that the last term is negligible. Noting that

image

we can apply Taylor expansion to the function

image

at point  x = E[Ni] and set x = Ni:

image

image

where the last step follows from Lemma 20. It remains to consider

image

According to the previous derivations, for  M 2ijEN 2i , we have

image

For 2MijENijNi, we have

image

Therefore,

image

Finally,

image

We have

image

This completes our proof for the  L2-divergence.


Designed for Accessibility and to further Open Science