Loop Estimator for Discounted Values in Markov Reward Processes

2020·Arxiv

Abstract

Abstract

At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of states with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator called loop estimator that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of O(1) when estimating the value of a single positive recurrent state s unlike TD with O(S ) or model-based methods with O. Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of T on a single sample path, where is the maximal expected hitting time to state s. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.

1 Introduction

The problem of policy evaluation arises naturally in the context of reinforcement learning (RL) (Sutton and Barto 2018) when one wants to evaluate the (action) values of a policy in a Markov decision process (MDP). In particular, policy iteration (Howard 1960) is a classic algorithmic framework for solving MDPs that poses and solves a policy evaluation problem during each iteration. Being motivated by the setting of reinforcement learning, i.e., the underlying MDP parameters are unknown and samples are obtained interactively, we focus on solving the policy evaluation problem given only a single sample path. Following a stationary Markov policy in an MDP, i.e., actions are determined based solely on the current state, gives rise to a Markov reward process (MRP) (Puterman 1994). For the rest of the article, we focus on MRPs and consider the problem of estimating the infinite-horizon discounted state values of an unknown MRP. A straightforward approach to policy evaluation is to estimate the parameters of the MRP and then the value by plugging them into the classic Bellman equation (5) (Bert- sekas and Tsitsiklis 1996). We call this the model-based estimator in the sequel. This approach is recently proved to be minimax-optimal given a generative model (Pananjady

and Wainwright 2019) and it provides excellent estimates of discounted values in the single sample path setting as well, as our numerical experiments show (Section 5). However, model-based estimators suffer from a space complexity of OS 2, where S is the number of states in the MRP. In contrast, model-free methods enjoy a lower space complexity of O(S ) by not explicitly estimating the model parameters (Sut- ton 1988) but tend to exhibit a greater estimation error.

A popular class of estimators, k-step bootstrapping temporal difference or TD(k)1 estimates a state’s value based on the estimated values of other states. Like the model-based estimator, TD(k) is based on the classic Bellman equation (5). The key property of the Bellman equation (5) is that the estimate of a state’s value is tied to the estimates of other states which makes it hard to study the convergence of a specific state’s value estimate in isolation and motivates the traditional approach of generative model in the literature.

Traditionally, prior works (Kearns and Singh 1999; Even- Dar and Mansour 2003; Gheshlaghi Azar, Munos, and Kap- pen 2013; Pananjady and Wainwright 2019) first show effi-cient estimation of all state values under the assumption that we can generate a sample of next states and rewards starting in each states, and then invoke an argument that such a batch of samples can be obtained over a single sample path when all states are visited for at least once, i.e., over cover times. In this work, we break with the traditional approach by directly studying the convergence of the value estimate of a single state over the sample path. The convergence over all states is obtained as a simple consequence of the union bound. Our key insight is that it is possible to circumvent the general difficulties of non-independent samples in the single sample path setting by recognizing the embedded regenerative structure of an MRP. We alleviate the reliance on estimates of other states by studying segments of the sample path that start and end in the same state, i.e., loops. This results in a novel and simple algorithm we call the loop estimator (Algorithm 1) which is a plug-in estimator based on a novel loop Bellman equation (10). One important consequence is that the loop estimator can estimate the value of a single state with a space complexity of O(1) which neither TD(k) or the model-based estimator can achieve.

We first review the requisite definitions (Section 3) and then propose the loop estimator (Section 4.2). First, we analyze the algorithm’s rate of convergence over visits to a single state (Theorem 4.2). Second, we study many steps it takes to visit a state. Using the exponential concentration of first return times (Lemma 4.3), we relate visits to their waiting times and establish the rate of convergence over steps (Theorem 4.5). Lastly, we obtain the convergence in -norm over all states via the union bound as a consequence (Corollary 4.6). Besides theoretical analysis, we also compare the loop estimator to several other estimators numerically on a commonly used example (Section 5). Finally, we discuss the model-based vs. model-free status of the loop estimator (Section 6).

Our main contributions in this paper are two-fold:

• By recognizing the embedded regenerative structure in MRPs, we derive a new Bellman equation over loops, segments that start and end in the same state.

• We introduce loop estimator, a novel algorithm that can provably efficiently estimate the discounted values of a single state in an MRP from a single sample path.

In the interest of a concise presentation, we defer detailed proofs to Appendix A with fully expanded logarithmic factors and constants. Similarly, see Appendix B for extra results. An implementation of the proposed loop estimator and presented experiments is publicly available.2

2 Related works

Much work that formally studies the convergence of value estimators (particularly the TD estimators) relies on having access to independent trajectories that start in all states (Dayan and Sejnowski 1994; Even-Dar and Mansour 2003; Jaakkola, Jordan, and Singh 1994; Kearns and Singh 2000). This is called a generative model or sometimes a parallel sampling model (Kearns and Singh 1999). Given a convergence over batches of generative samples, we still need some reduction arguments to actually obtain a batch of generative (or parallel) samples over the sample path of a MRP. Kearns and Singh (1999) consider how a set of independent trajectories can be obtained via mixing, i.e., approximately samples from the stationary distribution. This suggests on average it takes O(tmix)-many steps where tmix is the expected steps to get close to the stationary distribution (1in total variation distance) and pis the smallest probability in the stationary distribution. This reduction can be improved by considering the steps the chain takes to visit all states at least once, i.e., cover times, which is exactly when we have a batch of generative samples. This is an improved reduction in that we can study its convergence rate with high probability instead of the average behavior. But the cover time of a Markov chain can be quite large: its concentration can be related to that of the hitting times to all states. In contrast, for a single state, our re-

sults scale more favorably with the maximal expected hitting time of that state by a factor of log S . To ensure consistency of estimation is at all possible, we assume that the specific state to estimate is positive recurrent (Assumption 3.1), otherwise we cannot hope to (significantly) improve its value estimate after the final visit (see Appendix B.1 for an illustrative example). We think that this assumption is reasonable as recurrence is a key feature of many Markov chains and it connects naturally to the online (interactive) setting where we cannot arbitrarily restart the chain. Moreover, this assumption is no stronger than the assumption used in the cover time reduction which assumes that we can repeatedly visit all states. If a resetting mechanism is available, values of transient states can be estimated from values of the recurrent states. Furthermore, in a finite MRP, there is at least one recurrent state due to the infinite length of a trajectory.

Besides the interest in the RL community to study the policy evaluation problem, operation researchers were also motivated to study estimation in order to leverage simulations as a computational tool. In such settings, the restriction of estimating only from a single sample path is usually not a concern. Classic work in simulations by Fox and Glynn (1989) deals with estimating discounted value in a continuous time setting, including an estimator using regenerative structure. In comparison to their work, we provides an instance-dependent rate based on the transition structure which is relevant for the single sample path setting. Haviv and Puterman (1992) and Derman (1970) propose unbiased value estimators whereas the loop estimator is biased due to inversion.

Outside of the studies on reward processes, the regenerative structure of Markov chains has found application in the local computation of PageRank (Lee, Ozdaglar, and Shah 2013). We make use of a lemma (Lemma 4.3, whose proof is included in the Appendix A.3 for completeness) from this work to establish an upper bound on waiting times (Corollary 4.4). Furthermore, we provide an example to support why hitting times do not exponentially concentrate over its expectation in general (see Appendix B.2). Similar in spirit to the concept of locality studied by Lee, Ozdaglar, and Shah (2013), our loop estimator enables space-efficient estimation of a single state value with a space complexity of O(1) and an error bound without explicit dependency on the size of the state space. As a consequence, the loop estimator can provably estimate the value of a state with a finite maximal expected hitting time even if the state space is infinite.

Recently, an independent work by Subramanian and Ma- hajan (2019) makes a similar observation of the regenerative structure and studies using estimates similar to the loop estimator in the context of a policy gradient algorithm. It provides promising experimental results that complement our novel theoretical guarantees on the rates of convergence. Taken together, these works show that regenerative structure is a promising direction in RL.

3 Preliminaries

3.1 Markov reward processes and Markov chains

Definition 3.1 (Expected recurrence time). Given a Markov chain, we define the expected recurrence time of state s as the expected first return time of s starting in s

A state s is positive recurrent if its expected recurrence time is finite, i.e., .

Definition 3.2 (Maximal expected hitting time). Given a Markov chain, we define the maximal expected hitting time of state s as the maximal expected first return time over starting states

3.2 Discounted total rewards

In RL, we are generally interested in some expected longterm rewards that will be collected by following a policy. In the infinite-horizon discounted total reward setting, following a Markov policy on an MDP induces an MRP and the state value of state s is

where [0, 1) is the discount factor. Note that since the reward is bounded by rmax, state values are also bounded by rmax. A fundamental result relating values to the MRP parameters (P, ¯r) is the Bellman equation for each state s S (Sutton and Barto 2018)

3.3 Problem statement

Suppose that we have a sample path (Xt, Rt)0of length T from an MRP whose parameters (P, ¯r) are unknown. Given a state s and discount factor , we want to estimate v(s).

Assumption 3.1 (State s is reachable). We assume state s is reachable from all states, i.e., .

Otherwise there is some non-negligible probability that state s will not be visited from some starting state. This will prevent the convergence in probability (in the form of a PACstyle error bound) that we seek (see Appendix B.1). Remark 3.1. Assumption 3.1 can be weakened to the assumption that s is positive recurrent and the MRP starts in the recurrent class containing s. All following results can be recovered by restricting S in the definition of to the recurrent class containing s. However, for ease of presentation, we will adopt Assumption 3.1 in the rest of the article without loss of generality.

Note that Assumption 3.1 implies the positive recurrence of s, i.e., , by definition, and that the MRP visits state s for infinitely many times with probability 1.

3.4 Renewal theory and loops

Stochastic processes in general can exhibit complex dependencies between random variables at different steps, and thus often fall outside of the applicability of approaches that rely on independence assumptions. Renewal theory (Ross 1996) focuses on a class of stochastic processes where the process restarts after a renewal event. Such regenerative structure allows us to apply results from the independent and identical distribution (IID) settings.

In particular, we consider the visits to state s as renewal events and define waiting times Wn(s) for n , to be the number of steps before the n-th visit

and the interarrival times In(s) to be the steps between the n-th and (n + 1)-th visit

Remark 3.2. The random times relate to each other in a few intuitive relations. The waiting time of the first visit is the same as the hitting time W1(s) = Hs . Waiting times relate to interarrival times Wn(s) = W1(s) Ii(s).

To justify treating visits to s as renewal events, consider the sub-processes starting at W1(s) and at W2(s)—both MRPs start in state s—due to Markov property of MRP, they are statistical replica of each other. Since segments (Xt, Rt)Wn(s)start and end in the same state, we call them loops. It follows that loops are independent of each other and obey the same statistical law. Intuitively speaking, an MRP is (probabilistically) specified by its starting state. Definition 3.3 (Loop -discounted rewards). Given a Markov reward process and a positive recurrent state s, we define the n-th loop -discounted rewards as the discounted total rewards over the n-th loop

Definition 3.4 (Loop -discount). Given a Markov reward process and a positive recurrent state s, we define the n-th loop -discount as the total discounting over the n-th loop

(In(s),Gn(s))nforms a regenerative process that has nice independence relations. Specifically, In(s) Im(s), Gn(s) Gm(s), and Gn(s) Im(s) when n m. Furthermore, (In(s))nare identically distributed the same as Hwhen starting in s. Similarly, (Gn(s))nare identically distributed. Note however that Gn(s) In(s).

4 Main results

4.1 Bellman equations over loops

Given the regenerative process (In(s),Gn(s))n, we derive a new Bellman equation over the loops for state value v(s).

Theorem 4.1 (Loop Bellman equations). Suppose the expected loop -discount is (s) (s)] and the expected loop -discounted rewards is (s) [G1(s)], we can relate the state value v(s) to itself

Remark 4.1. The key difference between the loop Bellman equations (10) and the classic Bellman equations (5) is the state values involved. Only state value v(s) appears on the right-hand side of (10).

4.2 Loop estimator

We plug in the empirical means for the expected loop -discount (s) and the expected loop -discounted rewards (s) into the loop Bellman equation (10) and define the n-th loop estimator for state value v(s)

where ˆ(s) and ˆ(s) Gi(s). Fur- thermore, we have visited state s for (N + 1) times before step T where N is a random variable that counts the number of loops before step T

and the estimate ˆvN(s) would be the last estimate before step T. Hence, with a slight abuse of notations, we define

By using incremental updates to keep track of empirical means, Algorithm 1 implements the loop estimator ˆvT(s) with a space complexity of O(1). Running S -many copies of loop estimators, one for each state s , takes a space complexity of O(S ).

4.3 Rates of convergence

Now we investigate the convergence of the loop estimator, first over visits, i.e., ˆvn(s) pv(s) as n , then over steps, i.e., ˆvT(s) pv(s) as T . By applying Hoeffding bound to the definition of loop estimator (11), we obtain a PACstyle upper bound on the estimation error.

Theorem 4.2 (Convergence rate over visits). Given a sample path from an MRP (Xt, Rt)t, a discount factor [0, 1), and a positive recurrent state s, with probability of at least 1 , the loop estimator converges to v(s)

To determine the convergence rate over steps, we need to study the concentration of waiting times which allows us to lower-bound the random visits with high probability. As an intermediate step, we use the fact that the tail of the distribution of first return times is upper-bounded by an exponential distribution per the Markov property of MRP (Lee, Ozdaglar, and Shah 2013; Aldous and Fill 1999). Lemma 4.3 (Exponential concentration of first return times (Lee, Ozdaglar, and Shah 2013; Aldous and Fill 1999)). Given a Markov chain (Xt)tdefined on a finite state space S, for any state s and any t > 0, we have

Remark 4.2. Note that the waiting time Wn(s) is nearly linear in n with a dependency on the Markov chain structure via the maximal expected hitting time of s, namely O(n ). In contrast, the expected waiting time scales with the expected recurrence time E[Wn(s)] (n ). However, an exponential concentration with the expected recurrence time is not possible in general (see Appendix B.2 for a counterexample).

Using Lambert W function, we invert Corollary 4.4 to lower-bound the visits by step T with high probability. Finally, the convergence rate of ˆvT(s) follows from Theorem 4.2. Theorem 4.5 (Convergence rate over steps). With probability of at least 1 , for any T , the MRP (Xt, Rt)0visits state s for at least (T) many times, and the last loop estimate converges to v(s)

Suppose we run a copy of loop estimator to estimate each state’s value in S, and denote them with a vector ˆvT : s ˆvT(s). Convergence of the estimation error ˆvT v in terms of the -norm follows immediately by applying the union bound.

Corollary 4.6 (Convergence rate over all states). With probability of at least 1 , for any T maxs , the MRP (Xt, Rt)0visits each state s for at least (T) many times, and the last loop estimates converge to state values v

5 Numerical experiments

We consider RiverSwim, an MDP proposed by Strehl and Littman (2008) that is often used to illustrate the challenge of exploration in RL. The MDP consists of six states S = {s1s6} and two actions A = {“swim downstream”, “swim upstream”}. Executing the “swim upstream” action often fails due to the strong current, while there is a high reward for staying in the most upstream state s6. For our experiments, we use the MRP induced by always taking the “swim upstream” action (see Figure 1a for numerical details). The most relevant aspect of the induced MRP is that the maximal expected hitting times are very different for different states: 752, 237, 68, 15, 17, 22. Figure 1b shows a plot of the estimation errors of the loop estimator for each state over the square root of maximal expected hitting times of that state. The observed linear relationship between the two quantities (supported by a good linear fit) is consistent with the instance-dependence in our result of |ˆvT(s) v(s), c.f., Theorem 4.5.

5.1 Alternative estimators

We define several alternative estimators for v(s) and briefly mention their relevance for comparison.

Model-based. We compute add-1 smoothed maximum likelihood estimates (MLE) of the MRP parameters (P, ¯r) from the sample path

and

where I is the identity matrix.

TD(k). k-step temporal difference (or k-step backup) estimators are commonly recursively defined (Kearns and Singh 2000) with TD(0) being a textbook classic (Bertsekas and

Figure 1: (a) The induced RiverSwim MRP. The arrows are labeled with transition probabilities. The rewards are all zero except for state s6, where r(s6) = 1. (b) With discount factor 105. The estimation error of each state (normalized by maxs v(s)) is plotted over the square root of maximal expected hitting times of that state. Error bars show the standard deviations over 200 runs.

Tsitsiklis 1996; Sutton and Barto 2018). Let ˆvTD(0, s) 0 for all states s . And for t > 0

where is the learning rates. A common choice is to set [Xu = s]which satisfies the RobbinsMonro conditions (Bertsekas and Tsitsiklis 1996). But it has been shown to lead to slower convergence than 1[Xu = s]where d (11) (Even-Dar and Man- sour 2003).

It is more accurate to consider TD methods as a large family of estimators each with different choices of k, . Choosing these parameters can create extra work and sometimes confusion for practitioners. Whereas the loop estimator, like the model-based estimator, has no parameters to tune. In any case, it is not our intention to compare with the TD family exhaustively (see more results on TD on (Kearns and Singh 2000; Even-Dar and Mansour 2003)). Instead, we will compare with TD(0) and TD(10), both with d = 1, and TD(0)with d .

5.2 Comparative experiments

We experiment with different values for the discount factor , because, roughly speaking, 1) sets the horizon beyond which rewards are discounted too heavily to matter. We compare the estimation errors measured in -norm, which is important in RL. The results are shown in Figure 2.

• The model-based estimator dominates all estimators for every discount setting we tested.

• TD(k) estimators perform well if k ).

• The loop estimator performs worse than, but is competitive with, the model-based estimator. Furthermore, similar to the model-based estimator and unlike the TD(k) estimators, its performance seems to be less influenced by discounting.

Figure 2: Estimation errors (normalized by maxs v(s) to be comparable across discount factors) of different estimators at different discount factors (left) 9 and (right) 99. Shaded areas represent the standard deviations over 200 runs. Note the vertical log scale.

6 Discussions

The elementary identity below relates the expected first re- turn times Ys sto the transition probabilities Ps sfor a finite Markov chain. Using the matrix notations, suppose that the expected first return times are organized in a matrix Y, and P the transition matrix of the Markov chain, then we have Y = P diagY where diagY is a matrix with the same diagonal as Y and zero elsewhere, and E is a matrix with all ones. Thus, knowing Y is equivalent to knowing the full model, as we can compute P using this identity. Recall that by definition E [I1(s)] , which is exactly the diagonal of Y. But only knowing the diagonal is not sufficient to determine the entire set of model parameters, namely Y, the loop estimator based on (In)nindeed falls short of being a model-based method. It may be considered a semi-model-based method as it estimates some but not all of the model parameters.

For large MRPs, a natural extension of our work is to consider recurrence of features instead of states, e.g., a video game screen might not repeat itself completely but the same items might reappear. After all, without repetition exactly or approximately, it would not be possible for an agent to learn and improve its decisions.

We believe that regenerative structure can be further exploited in RL (particularly in the form of the loop Bellman equation (10)) and we think this article provides the fundamental results for future study in this direction.

Acknowledgments

This work was supported in part by the National Science Foundation under Grant No. 1830660. We thank Mesrob I. Ohannessian for a helpful discussion on Markov chains, and Christina Lee Yu for discussing an early version of this work. We also thank anonymous reviewers for their constructive feedback, in particular, for bringing an independent work (Subramanian and Mahajan 2019) to our attention.

References

Aldous, D.; and Fill, J. 1999. Reversible Markov chains and random walks on graphs. Book in preparation (available at http://www.stat.berkeley.edualdous/RWG/Chap2.pdf).

Bertsekas, D. P.; and Tsitsiklis, J. N. 1996. Neuro-dynamic programming. Athena Scientific Belmont, MA.

Dayan, P.; and Sejnowski, T. J. 1994. TD () converges with probability 1. Machine Learning 14(3): 295–301.

Derman, C. 1970. Finite state Markovian decision processes. Academic Press.

Even-Dar, E.; and Mansour, Y. 2003. Learning rates for Q- learning. Journal of machine learning Research 5(Dec): 1– 25.

Fox, B. L.; and Glynn, P. W. 1989. Simulating discounted costs. Management Science 35(11): 1297–1315.

Gheshlaghi Azar, M.; Munos, R.; and Kappen, H. J. 2013. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning 91(3): 325–349. ISSN 1573-0565. doi:10.1007/ s10994-013-5368-1.

Haviv, M.; and Puterman, M. L. 1992. Estimating the value of a discounted reward process. Operations Research Letters 11(5): 267–272. ISSN 01676377. doi:10.1016/0167-6377(92)90002-K.

Howard, R. A. 1960. Dynamic programming and Markov processes. John Wiley.

Jaakkola, T.; Jordan, M. I.; and Singh, S. P. 1994. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, 703–710.

Kearns, M. J.; and Singh, S. P. 1999. Finite-sample con- vergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, 996– 1002.

Kearns, M. J.; and Singh, S. P. 2000. Bias-Variance Error Bounds for Temporal Difference Updates. In Conference on Learning Theory, 142–147.

Lee, C. E.; Ozdaglar, A.; and Shah, D. 2013. Approximat- ing the Stationary Probability of a Single State in a Markov chain. arXiv preprint URL https://arxiv.org/abs/1312.1986.

Pananjady, A.; and Wainwright, M. J. 2019. Value function estimation in Markov reward processes: Instance-dependent -bounds for policy evaluation. arXiv preprint URL https: //arxiv.org/abs/1909.08749.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.

Ross, S. M. 1996. Stochastic processes. John Wiley, 2nd edition.

Strehl, A. L.; and Littman, M. L. 2008. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences 74(8): 1309–1331.

Subramanian, J.; and Mahajan, A. 2019. Renewal Monte Carlo: Renewal theory based reinforcement learning. IEEE Transactions on Automatic Control 1–1.

Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1): 9–44.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT press, 2nd edition.

A Detailed proofs

A.1 Proof of Theorem 4.1

Proof. Note that since X0 = s, we have W1(s) = 0 and I1(s) = W2(s). Since only state s appears here, we will suppress s from the random variables below to simplify the notation. We use Assumption 3.1 or the weaker assumption that s is positive recurrent, i.e., , to guarantee that W2 with probability 1.

A.2 Proof of Theorem 4.2

Proof. Since only state s appears below, we will suppress it in the interest of conciseness. Consider

By the definition of an MRP, we have v [0] and G1 [0). Furthermore, I1 1 implies that (0] and 0 . Hence the estimation error is bounded as follows

With failure probability of at most 2, from Hoeffding’s inequality, we have

and similarly

Applying the union bound, we have

A.3 Proof of Lemma 4.3

This proof largely follows the proof by Lee, Ozdaglar, and Shah (2013) and is presented here in the interest of selfcontainedness.

Proof. Suppose a, b > 0, consider the probability of the event that s is not visited in the next a steps given that it is not visited in the previous b steps, that is

Let a , and apply the above-many times to

A.4 Proof of Corollary 4.4

Proof. For conciseness, we suppress s here since only state s appears. Suppose a > 0. By Remark 3.2, we have Wn < n a if W1 < a and Ii < a for i 1. Note that W1 and Ii distribute identically to H. Immediately from inverting Lemma 4.3, we have with failure probability of at most , W1 is bounded

Suppose each Ii < a fails with probability of at most , then we similarly have

A.5 Proof of Theorem 4.5

Proof. First, we introduce the Lambert W function to invert Corollary 4.4. Recall that the Lambert W function is a transcendental function defined such that W(x)ex, and thus it is a monotonically increasing function. At step T, suppose

Use the fact that if x > e, then log x log log x < W(x). So given T , we can lower-bound the number of visits

Plugging this into Theorem 4.2, we obtain the desired expression

A.6 Proof of Corollary 4.6

Proof. We run S many copies of loop estimators, one for each state s . Following Theorem 4.5, with failure probability of at most , we can ensure that each estimator has an error of at most

The largest upper bound comes from the state with the largest maximal expected hitting time maxsof the Markov chain. Apply the union bound and we have

ˆvT rmax(1 )2

B Additional results

B.1 Conditions for consistency We provide an example to show that if a state is not positive recurrent, i.e., transient, then we cannot attain a consistent estimate of its value in general. This suggests that Assumption 3.1 is not too strong as a sufficient condition to study. Recall that we are interested in consistent estimation of the discounted value of a state given a single sample path from an unknown MRP. If a state is not positive recurrent, then without assuming any reset mechanisms, it is visited finitely many times over any sample path almost surely. Consider the following three MRPs in Figure 3. It is obvious that v(s, v(s) = 0, and v(s1) . Suppose we start in s1 (of the MRP in the middle), there are only two possible sample paths: (s1, 0, s2, 1, s2) and (s1, 0, s3, 0, s3). Note that s1 is only visited once in either sample path thus transient. Furthermore, we obtain the first sample path with probability of 1in which case we

Figure 3: Diagram of three Markov reward processes with transition probabilities labeled on the edges. The rewards are 1 for s2 and 0 elsewhere.

cannot distinguish it from a sample path from the MRP on the top. Similarly, with probability of 1, we get the second sample path which is indistinguishable from a sample path from the MRP at the bottom. However, the values of sand sare different (and both not equal to that of s1). Hence we cannot devise an estimator that can consistently estimate v(s1), v(s) and v(s).

B.2 Concentration of first return times

We provide an example to show that an exponential concentration of first return times given the expected recurrence time is impossible. In contrast, in Lemma 4.3 we proved an exponential concentration given the maximal expected hitting time. Furthermore, this is consistent with what one would expect from Markov’s inequality since first return times are nonnegative random variables.

Consider a class of Markov chains {Mk} indexed by k 3 where Markov chain Mk has a state space {s1sk} and a transition kernel as depicted in Figure 4. Starting in s1, the chain Mk can either transition back to s1 in one step with probability 1 or to s2 with probability 1k. Thus, there are only two possible values for the first return time to s1: 1 by the self-transition, and k by going through s2, s3, . . . , sk, s1. We can calculate the expected recurrence

time as

Suppose that there is an exponential concentration of Hgiven , then we can upper-bound by some exponential function of t. However Hk with probability of 1kin Mk makes such an exponential bound impossible as the upper bound has to work for all Mk.

Figure 4: Diagram of Markov chain Mk with transition probabilities labeled on the edges.