36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2002.06299","publisher":"arxiv","paperJSON":{"title":"Loop Estimator for Discounted Values in Markov Reward Processes","paperID":"2002.06299","avgLineHeight":10.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of states with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and e","element":"span"},{"text":"ffi","element":"span"},{"text":"cient estimator called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"loop estimator ","element":"span"},{"text":"that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) when estimating the value of a single positive recurrent state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"unlike TD with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":") or model-based methods with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"style":{"height":21.6},"width":63.31,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/0-0.png","element":"img","alt":"�S 2�","inline":true},{"text":". 
Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of ","element":"span"},{"style":{"height":21.6},"width":306.74,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/0-1.png","element":"img","alt":"�O� √τs/T�over steps","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"on a single sample path, where ","element":"span"},{"style":{"height":8.15},"width":27.2,"height":20.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/0-2.png","element":"img","alt":" τs","inline":true,"padRight":true},{"text":"is the maximal expected hitting time to state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"The problem of policy evaluation arises naturally in the context of reinforcement learning (RL) ","element":"span"},{"href":"#id-0","referenceIndex":20,"text":"(Sutton and Barto ","element":"a"},{"href":"#id-0","referenceIndex":20,"text":"2018) ","element":"a"},{"text":"when one wants to evaluate the (action) values of a policy in a Markov decision process (MDP). In particular, policy iteration ","element":"span"},{"href":"#id-1","referenceIndex":9,"text":"(Howard ","element":"a"},{"href":"#id-1","referenceIndex":9,"text":"1960) ","element":"a"},{"text":"is a classic algorithmic framework for solving MDPs that poses and solves a policy evaluation problem during each iteration. 
Motivated by the setting of reinforcement learning, where the underlying MDP parameters are unknown and samples are obtained interactively, we focus on solving the policy evaluation problem given only a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"single ","element":"span"},{"text":"sample path. Following a stationary Markov policy in an MDP, i.e., actions are determined based solely on the current state, gives rise to a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov reward process ","element":"span"},{"text":"(MRP) ","element":"span"},{"href":"#id-2","referenceIndex":15,"text":"(Puterman ","element":"a"},{"href":"#id-2","referenceIndex":15,"text":"1994)","element":"a"},{"text":". For the rest of the article, we focus on MRPs and consider the problem of estimating the infinite-horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"discounted ","element":"span"},{"text":"state values of an unknown MRP. A straightforward approach to policy evaluation is to estimate the parameters of the MRP and then the value by plugging them into the classic Bellman equation ","element":"span"},{"href":"#id-3","text":"(5) ","element":"a"},{"href":"#id-4","referenceIndex":2,"text":"(Bert","element":"a"},{"href":"#id-4","referenceIndex":2,"text":"sekas and Tsitsiklis ","element":"a"},{"href":"#id-4","referenceIndex":2,"text":"1996)","element":"a"},{"text":". We call this the model-based estimator in the sequel. 
This approach was recently proved to be minimax-optimal given a generative model ","element":"span"},{"href":"#id-5","referenceIndex":14,"text":"(Pananjady ","element":"a"},{"href":"#id-5","referenceIndex":14,"text":"and Wainwright ","element":"a"},{"href":"#id-5","referenceIndex":14,"text":"2019) ","element":"a"},{"text":"and it provides excellent estimates of discounted values in the single sample path setting as well, as our numerical experiments show (Section ","element":"span"},{"href":"#id-6","text":"5)","element":"a"},{"text":". However, model-based estimators su","element":"span"},{"text":"ff","element":"span"},{"text":"er from a space complexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"style":{"height":23.6},"width":14,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/0-3.png","element":"img","alt":"(","inline":true},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"2","element":"span"},{"style":{"height":23.6},"width":14,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/0-4.png","element":"img","alt":")","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"is the number of states in the MRP. 
In contrast, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"model-free ","element":"span"},{"text":"methods enjoy a lower space complexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":") by not explicitly estimating the model parameters ","element":"span"},{"href":"#id-7","referenceIndex":19,"text":"(Sut","element":"a"},{"href":"#id-7","referenceIndex":19,"text":"ton ","element":"a"},{"href":"#id-7","referenceIndex":19,"text":"1988) ","element":"a"},{"text":"but tend to exhibit a greater estimation error.","element":"span"}],[{"text":"A popular class of estimators, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-step bootstrapping temporal di","element":"span"},{"text":"ff","element":"span"},{"text":"erence or TD(k)","element":"span"},{"text":"1 ","element":"span"},{"text":"estimates a state’s value based on the estimated values of other states. Like the model-based estimator, TD(k) is based on the classic Bellman equation ","element":"span"},{"href":"#id-3","text":"(5)","element":"a"},{"text":". 
The key property of the Bellman equation ","element":"span"},{"href":"#id-3","text":"(5) ","element":"a"},{"text":"is that the estimate of a state’s value is tied to the estimates of other states, which makes it hard to study the convergence of a specific state’s value estimate in isolation and motivates the traditional generative model approach in the literature.","element":"span"}],[{"text":"Traditionally, prior works ","element":"span"},{"href":"#id-8","referenceIndex":11,"text":"(Kearns and Singh ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"1999; ","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"Even-","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"Dar and Mansour ","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"2003; ","element":"a"},{"href":"#id-10","referenceIndex":7,"text":"Gheshlaghi Azar, Munos, and Kap","element":"a"},{"href":"#id-10","referenceIndex":7,"text":"pen ","element":"a"},{"href":"#id-10","referenceIndex":7,"text":"2013; ","element":"a"},{"href":"#id-5","referenceIndex":14,"text":"Pananjady and Wainwright ","element":"a"},{"href":"#id-5","referenceIndex":14,"text":"2019) ","element":"a"},{"text":"first show efficient estimation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"text":"state values under the assumption that we can generate a sample of next states and rewards starting in each state, and then invoke an argument that such a batch of samples can be obtained over a single sample path when all states are visited at least once, i.e., over cover times. In this work, we break with the traditional approach by directly studying the convergence of the value estimate of a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"single ","element":"span"},{"text":"state over the sample path. The convergence over all states is obtained as a simple consequence of the union bound. 
Our key insight is that it is possible to circumvent the general di","element":"span"},{"text":"ffi","element":"span"},{"text":"culties of non-independent samples in the single sample path setting by recognizing the embedded regenerative structure of an MRP. We alleviate the reliance on estimates of other states by studying segments of the sample path that start and end in the same state, i.e., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"loops","element":"span"},{"text":". This results in a novel and simple algorithm we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"loop estimator ","element":"span"},{"text":"(Algorithm ","element":"span"},{"href":"#id-11","text":"1)","element":"a"},{"text":", which is a plug-in estimator based on a novel loop Bellman equation ","element":"span"},{"href":"#id-12","text":"(10)","element":"a"},{"text":". One important consequence is that the loop estimator can estimate the value of a single state with a space complexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1), which neither ","element":"span"},{"style":{"fontStyle":"italic"},"text":"TD","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":") nor the model-based estimator can achieve.","element":"span"}],[{"text":"We first review the requisite definitions (Section ","element":"span"},{"href":"#id-13","text":"3) ","element":"a"},{"text":"and then propose the loop estimator (Section ","element":"span"},{"href":"#id-14","text":"4.2)","element":"a"},{"text":". First, we analyze the algorithm’s rate of convergence over visits to a single state (Theorem ","element":"span"},{"href":"#id-15","text":"4.2)","element":"a"},{"text":". Second, we study how many steps it takes to visit a state. 
Using the exponential concentration of first return times (Lemma ","element":"span"},{"href":"#id-16","text":"4.3)","element":"a"},{"text":", we relate visits to their waiting times and establish the rate of convergence over steps (Theorem ","element":"span"},{"href":"#id-17","text":"4.5)","element":"a"},{"text":". Lastly, we obtain the convergence in ","element":"span"},{"style":{"height":13.19},"width":40.62,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/1-0.png","element":"img","alt":" ℓ∞","inline":true},{"text":"-norm over all states via the union bound as a consequence (Corollary ","element":"span"},{"href":"#id-18","text":"4.6)","element":"a"},{"text":". Besides theoretical analysis, we also compare the loop estimator to several other estimators numerically on a commonly used example (Section ","element":"span"},{"href":"#id-6","text":"5)","element":"a"},{"text":". Finally, we discuss the model-based vs. model-free status of the loop estimator (Section ","element":"span"},{"href":"#id-19","text":"6)","element":"a"},{"text":".","element":"span"}],[{"text":"Our main contributions in this paper are two-fold:","element":"span"}],[{"text":"• ","element":"span"},{"text":"By recognizing the embedded regenerative structure in MRPs, we derive a new Bellman equation over loops, segments that start and end in the same state.","element":"span"}],[{"text":"• ","element":"span"},{"text":"We introduce ","element":"span"},{"style":{"fontStyle":"italic"},"text":"loop estimator","element":"span"},{"text":", a novel algorithm that can provably e","element":"span"},{"text":"ffi","element":"span"},{"text":"ciently estimate the discounted values of a single state in an MRP from a single sample path.","element":"span"}],[{"text":"In the interest of a concise presentation, we defer detailed proofs to Appendix ","element":"span"},{"href":"#id-20","text":"A ","element":"a"},{"text":"with fully expanded logarithmic factors and constants. 
Similarly, see Appendix ","element":"span"},{"href":"#id-21","text":"B ","element":"a"},{"text":"for extra results. An implementation of the proposed loop estimator and of the presented experiments is publicly available.","element":"span"},{"text":"2","element":"span"}]]},{"heading":"2 Related works","paragraphs":[[{"text":"Much work that formally studies the convergence of value estimators (particularly the TD estimators) relies on having access to independent trajectories that start in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"text":"states ","element":"span"},{"href":"#id-22","referenceIndex":3,"text":"(Dayan and Sejnowski ","element":"a"},{"href":"#id-22","referenceIndex":3,"text":"1994; ","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"Even-Dar and Mansour ","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"2003; ","element":"a"},{"href":"#id-23","referenceIndex":10,"text":"Jaakkola, Jordan, and Singh ","element":"a"},{"href":"#id-23","referenceIndex":10,"text":"1994; ","element":"a"},{"href":"#id-24","referenceIndex":12,"text":"Kearns and Singh ","element":"a"},{"href":"#id-24","referenceIndex":12,"text":"2000)","element":"a"},{"text":". This is called a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"generative model ","element":"span"},{"text":"or sometimes a parallel sampling model ","element":"span"},{"href":"#id-8","referenceIndex":11,"text":"(Kearns and Singh ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"1999)","element":"a"},{"text":". Given a convergence result over batches of generative samples, we still need a reduction argument to actually obtain a batch of generative (or parallel) samples over the sample path of an MRP. 
","element":"span"},{"href":"#id-8","referenceIndex":11,"text":"Kearns ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"and Singh ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"(1999) ","element":"a"},{"text":"consider how a set of independent trajectories can be obtained via mixing, i.e., approximate samples from the stationary distribution. This suggests that on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"average ","element":"span"},{"text":"it takes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"mix","element":"span"},{"style":{"height":14.8},"width":53.49,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/1-1.png","element":"img","alt":"/p∗","inline":true},{"text":")-many steps where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"mix ","element":"span"},{"text":"is the expected number of steps to get close to the stationary distribution (","element":"span"},{"text":"1","element":"span"},{"style":{"height":14},"width":29.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/1-2.png","element":"img","alt":"/4","inline":true,"padRight":true},{"text":"in total variation distance) and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"style":{"height":5.2},"width":13,"height":13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/1-3.png","element":"img","alt":"∗","inline":true,"padRight":true},{"text":"is the smallest probability in the stationary distribution. This reduction can be improved by considering the number of steps the chain takes to visit all states at least once, i.e., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"cover times","element":"span"},{"text":", which is exactly when we have a batch of generative samples. 
This is an improved reduction in that we can study its convergence rate with high probability instead of the average behavior. But the cover time of a Markov chain can be quite large: its concentration can be related to that of the hitting times to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"text":"states. In contrast, for a single state, our results scale more favorably with the maximal expected hitting time of that state by a factor of log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":". To ensure that consistent estimation is at all possible, we assume that the specific state to estimate is positive recurrent (Assumption ","element":"span"},{"href":"#id-25","text":"3.1)","element":"a"},{"text":"; otherwise we cannot hope to (significantly) improve its value estimate after the final visit (see Appendix ","element":"span"},{"href":"#id-21","text":"B.1 ","element":"a"},{"text":"for an illustrative example). We think that this assumption is reasonable as recurrence is a key feature of many Markov chains and it connects naturally to the online (interactive) setting where we cannot arbitrarily restart the chain. Moreover, this assumption is no stronger than the assumption used in the cover time reduction, which assumes that we can repeatedly visit all states. If a resetting mechanism is available, values of transient states can be estimated from values of the recurrent states. Furthermore, in a finite MRP, there is at least one recurrent state due to the infinite length of a trajectory.","element":"span"}],[{"text":"Besides the interest of the RL community in studying the policy evaluation problem, operations researchers were also motivated to study estimation in order to leverage simulations as a computational tool. In such settings, the restriction of estimating only from a single sample path is usually not a concern. 
Classic work in simulations by ","element":"span"},{"href":"#id-26","referenceIndex":6,"text":"Fox and ","element":"a"},{"href":"#id-26","referenceIndex":6,"text":"Glynn ","element":"a"},{"href":"#id-26","referenceIndex":6,"text":"(1989) ","element":"a"},{"text":"deals with estimating discounted values in a continuous-time setting, including an estimator using regenerative structure. In comparison to their work, we provide an instance-dependent rate based on the transition structure, which is relevant for the single sample path setting. ","element":"span"},{"href":"#id-27","referenceIndex":8,"text":"Haviv ","element":"a"},{"href":"#id-27","referenceIndex":8,"text":"and Puterman ","element":"a"},{"href":"#id-27","referenceIndex":8,"text":"(1992) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","referenceIndex":4,"text":"Derman ","element":"a"},{"href":"#id-28","referenceIndex":4,"text":"(1970) ","element":"a"},{"text":"propose unbiased value estimators whereas the loop estimator is biased due to inversion.","element":"span"}],[{"text":"Outside of the studies on reward processes, the regenerative structure of Markov chains has found application in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"local ","element":"span"},{"text":"computation of PageRank ","element":"span"},{"href":"#id-29","referenceIndex":13,"text":"(Lee, Ozdaglar, and Shah ","element":"a"},{"href":"#id-29","referenceIndex":13,"text":"2013)","element":"a"},{"text":". We make use of a lemma (Lemma ","element":"span"},{"href":"#id-16","text":"4.3, ","element":"a"},{"text":"whose proof is included in Appendix ","element":"span"},{"href":"#id-30","text":"A.3 ","element":"a"},{"text":"for completeness) from this work to establish an upper bound on waiting times (Corollary ","element":"span"},{"href":"#id-16","text":"4.4)","element":"a"},{"text":". 
Furthermore, we provide an example to illustrate why hitting times do not exponentially concentrate around their expectation in general (see Appendix ","element":"span"},{"href":"#id-31","text":"B.2)","element":"a"},{"text":". Similar in spirit to the concept of locality studied by ","element":"span"},{"href":"#id-29","referenceIndex":13,"text":"Lee, Ozdaglar, and Shah ","element":"a"},{"href":"#id-29","referenceIndex":13,"text":"(2013)","element":"a"},{"text":", our loop estimator enables space-e","element":"span"},{"text":"ffi","element":"span"},{"text":"cient estimation of a single state value with a space complexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) and an error bound without explicit dependency on the size of the state space. As a consequence, the loop estimator can provably estimate the value of a state with a finite maximal expected hitting time even if the state space is infinite.","element":"span"}],[{"text":"Recently, an independent work by ","element":"span"},{"href":"#id-32","referenceIndex":18,"text":"Subramanian and Ma","element":"a"},{"href":"#id-32","referenceIndex":18,"text":"hajan ","element":"a"},{"href":"#id-32","referenceIndex":18,"text":"(2019) ","element":"a"},{"text":"makes a similar observation about the regenerative structure and studies the use of estimates similar to the loop estimator in the context of a policy gradient algorithm. It provides promising experimental results that complement our novel theoretical guarantees on the rates of convergence. 
Taken together, these works show that regenerative structure is a promising direction in RL.","element":"span"}]]},{"heading":"3 Preliminaries","paragraphs":[[{"id":"id-13","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Markov reward processes and Markov chains","element":"span"}],[{"style":{"width":"100%"},"width":954,"height":725,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Definition 3.1 ","element":"span"},{"text":"(Expected recurrence time)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given a Markov chain, we define the ","element":"span"},{"text":"expected recurrence time of state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s as the expected first return time of s starting in s","element":"span"}],[{"style":{"width":"62%"},"width":593,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-1.png","element":"img"}],[{"text":"A state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"positive recurrent ","element":"span"},{"text":"if its expected recurrence time is finite, i.e., ","element":"span"},{"style":{"height":12},"width":115.97,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-2.png","element":"img","alt":" ρs < ∞","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 3.2 ","element":"span"},{"text":"(Maximal expected hitting time)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given a Markov chain, we define the ","element":"span"},{"text":"maximal expected hitting time of state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s as the maximal expected first return time over starting states","element":"span"}],[{"style":{"width":"65%"},"width":625,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Discounted total rewards","element":"span"}],[{"text":"In RL, we are generally interested in some expected longterm rewards that will be collected by following a policy. In the infinite-horizon discounted total reward setting, following a Markov policy on an MDP induces an MRP and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"state value ","element":"span"},{"text":"of state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is","element":"span"}],[{"style":{"width":"67%"},"width":642,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.6},"width":58.48,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-5.png","element":"img","alt":" γ ∈","inline":true,"padRight":true},{"text":"[0","element":"span"},{"text":", ","element":"span"},{"text":"1) is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"discount factor","element":"span"},{"text":". 
Note that since the reward is bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"max","element":"span"},{"text":", state values are also bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"max","element":"span"},{"style":{"height":14},"width":62.05,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-6.png","element":"img","alt":"/1−γ","inline":true},{"text":". A fundamental result relating values to the MRP parameters (","element":"span"},{"style":{"fontWeight":"bold"},"text":"P","element":"span"},{"text":", ","element":"span"},{"text":"¯","element":"span"},{"style":{"fontWeight":"bold"},"text":"r","element":"span"},{"text":") is the Bellman equation for each state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"style":{"height":8},"width":21,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-7.png","element":"img","alt":" ∈","inline":true,"padRight":true},{"text":"S ","element":"span"},{"href":"#id-0","referenceIndex":20,"text":"(Sutton and Barto ","element":"a"},{"href":"#id-0","referenceIndex":20,"text":"2018)","element":"a"}],[{"id":"id-3","style":{"width":"71%"},"width":683,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-8.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"3.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Problem statement","element":"span"}],[{"text":"Suppose that we have a sample path (","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":")","element":"span"},{"text":"0","element":"span"},{"style":{"height":8},"width":61.74,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/2-9.png","element":"img","alt":"≤t0","inline":true,"padRight":true},{"text":"forms a regenerative process that has nice independence relations. Specifically, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"height":0},"width":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-1.png","element":"img","alt":" �","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"height":0},"width":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-2.png","element":"img","alt":" �","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"), and 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"height":0},"width":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-3.png","element":"img","alt":" �","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"height":10.4},"width":25,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-4.png","element":"img","alt":" �","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":". Furthermore, (","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"))","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"height":7.6},"width":31.74,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-5.png","element":"img","alt":">0","inline":true,"padRight":true},{"text":"are identically distributed the same as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"height":15.72},"width":18.4,"height":39.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-6.png","element":"img","alt":"+s","inline":true,"padRight":true},{"text":"when starting in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". 
Similarly, (","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"))","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"height":7.6},"width":31.74,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-7.png","element":"img","alt":">0","inline":true,"padRight":true},{"text":"are identically distributed. Note however that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"height":13.6},"width":40,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-8.png","element":"img","alt":" ̸�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":").","element":"span"}]]},{"heading":"4 Main results","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bellman equations over loops","element":"span"}],[{"text":"Given the regenerative process 
(","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":")","element":"span"},{"text":",","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"))","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"height":7.6},"width":31.74,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-9.png","element":"img","alt":">0","inline":true},{"text":", we derive a new Bellman equation over the loops for state value ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":").","element":"span"}],[{"id":"id-39","style":{"fontWeight":"bold"},"text":"Theorem 4.1 ","element":"span"},{"text":"(Loop Bellman equations)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose the expected loop ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-10.png","element":"img","alt":" γ","inline":true},{"style":{"fontStyle":"italic"},"text":"-discount is ","element":"span"},{"style":{"height":7.2},"width":24,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-11.png","element":"img","alt":" α","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"height":13.2},"width":132.84,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-12.png","element":"img","alt":" � Es[Γ1","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":")] ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and the expected loop ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-13.png","element":"img","alt":" γ","inline":true},{"style":{"fontStyle":"italic"},"text":"-discounted rewards is ","element":"span"},{"style":{"height":14.4},"width":25,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-14.png","element":"img","alt":" β","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"height":12.79},"width":85.34,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-15.png","element":"img","alt":" � 
Es","inline":true},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":"1","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":")]","element":"span"},{"style":{"fontStyle":"italic"},"text":", we can relate the state value v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"to itself","element":"span"}],[{"id":"id-12","style":{"width":"68%"},"width":656,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-16.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Remark 4.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The key di","element":"span"},{"text":"ff","element":"span"},{"style":{"fontStyle":"italic"},"text":"erence between the loop Bellman equations ","element":"span"},{"href":"#id-12","text":"(10) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and the classic Bellman equations ","element":"span"},{"href":"#id-3","text":"(5) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"is the state values involved. 
Only state value v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"appears on the right-hand side of ","element":"span"},{"href":"#id-12","text":"(10)","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-14","style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Loop estimator","element":"span"}],[{"text":"We plug in the empirical means for the expected loop ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-17.png","element":"img","alt":" γ","inline":true},{"text":"-discount ","element":"span"},{"style":{"height":7.2},"width":24,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-18.png","element":"img","alt":" α","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") and the expected loop ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-19.png","element":"img","alt":" γ","inline":true},{"text":"-discounted rewards ","element":"span"},{"style":{"height":14.4},"width":25,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-20.png","element":"img","alt":"β","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") into the loop Bellman equation ","element":"span"},{"href":"#id-12","text":"(10) ","element":"a"},{"text":"and define the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"-th ","element":"span"},{"style":{"fontStyle":"italic"},"text":"loop estimator ","element":"span"},{"text":"for state value 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":")","element":"span"}],[{"id":"id-34","style":{"width":"71%"},"width":682,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-21.png","element":"img"}],[{"text":"where ˆ","element":"span"},{"style":{"height":9.19},"width":37.63,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-22.png","element":"img","alt":"αn","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"height":19.38},"width":229.49,"height":48.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-23.png","element":"img","alt":" � 1n�ni=1 γIi(s)","inline":true,"padRight":true},{"text":"and ˆ","element":"span"},{"style":{"height":14.4},"width":34.25,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-24.png","element":"img","alt":"βn","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"height":19.38},"width":151.69,"height":48.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-25.png","element":"img","alt":" � 1n�ni=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"). 
Furthermore, we have visited state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"+ ","element":"span"},{"text":"1) times before step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is a random variable that counts the number of loops before step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"}],[{"style":{"width":"72%"},"width":690,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-26.png","element":"img"}],[{"text":"and the estimate ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") would be the last estimate before step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". Hence, with a slight abuse of notation, we define","element":"span"}],[{"style":{"width":"61%"},"width":591,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-27.png","element":"img"}],[{"text":"By using incremental updates to keep track of empirical means, Algorithm ","element":"span"},{"href":"#id-11","text":"1 ","element":"a"},{"text":"implements the loop estimator ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") with a space complexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1). 
Running ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"-many copies of loop estimators, one for each state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"style":{"height":12},"width":64.94,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-28.png","element":"img","alt":" ∈ S","inline":true},{"text":", takes a space complexity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Rates of convergence","element":"span"}],[{"text":"Now we investigate the convergence of the loop estimator, first over visits, i.e., ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"style":{"height":7.6},"width":39,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-29.png","element":"img","alt":"−→","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"height":7.6},"width":83.49,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-30.png","element":"img","alt":" → ∞","inline":true},{"text":", then over steps, i.e., 
ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"style":{"height":7.6},"width":39,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-31.png","element":"img","alt":"−→","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"height":7.6},"width":84.77,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-32.png","element":"img","alt":" → ∞","inline":true},{"text":". By applying the Hoe","element":"span"},{"text":"ff","element":"span"},{"text":"ding bound to the definition of the loop estimator ","element":"span"},{"href":"#id-34","text":"(11)","element":"a"},{"text":", we obtain a PAC-style upper bound on the estimation error.","element":"span"}],[{"id":"id-15","style":{"fontWeight":"bold"},"text":"Theorem 4.2 ","element":"span"},{"text":"(Convergence rate over visits)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given a sample path from an MRP ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"height":8.4},"width":31.74,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-33.png","element":"img","alt":"≥0","inline":true},{"style":{"fontStyle":"italic"},"text":", a discount factor ","element":"span"},{"style":{"height":11.6},"width":66.98,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-34.png","element":"img","alt":" γ ∈","inline":true,"padRight":true},{"text":"[0","element":"span"},{"text":", ","element":"span"},{"text":"1)","element":"span"},{"style":{"fontStyle":"italic"},"text":", and a positive recurrent state s, with probability of at least ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":11.2},"width":53.2,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-35.png","element":"img","alt":" − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", the loop estimator converges to v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":")","element":"span"}],[{"style":{"width":"66%"},"width":636,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-36.png","element":"img"}],[{"id":"id-11","style":{"width":"100%"},"width":954,"height":848,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-37.png","element":"img"}],[{"text":"To determine the convergence rate over steps, we need to study the concentration of waiting 
times which allows us to lower-bound the random visits with high probability. As an intermediate step, we use the fact that the tail of the distribution of first return times is upper-bounded by an exponential distribution per the Markov property of MRP ","element":"span"},{"href":"#id-29","referenceIndex":13,"text":"(Lee, ","element":"a"},{"href":"#id-29","referenceIndex":13,"text":"Ozdaglar, and Shah ","element":"a"},{"href":"#id-29","referenceIndex":13,"text":"2013; ","element":"a"},{"href":"#id-35","referenceIndex":1,"text":"Aldous and Fill ","element":"a"},{"href":"#id-35","referenceIndex":1,"text":"1999)","element":"a"},{"text":". ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Lemma 4.3 ","element":"span"},{"text":"(Exponential concentration of first return times ","element":"span"},{"href":"#id-29","referenceIndex":13,"text":"(Lee, Ozdaglar, and Shah ","element":"a"},{"href":"#id-29","referenceIndex":13,"text":"2013; ","element":"a"},{"href":"#id-35","referenceIndex":1,"text":"Aldous and Fill ","element":"a"},{"href":"#id-35","referenceIndex":1,"text":"1999)","element":"a"},{"text":")","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given a Markov chain ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"height":8.4},"width":31.74,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-38.png","element":"img","alt":"≥0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"defined on a finite state space ","element":"span"},{"text":"S","element":"span"},{"style":{"fontStyle":"italic"},"text":", for any state s ","element":"span"},{"style":{"height":12},"width":59.44,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-39.png","element":"img","alt":" ∈ S","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and any t ","element":"span"},{"text":"> ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"id":"id-16","style":{"width":"100%"},"width":954,"height":303,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-40.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Remark 4.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Note that the waiting time W","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is nearly linear in n with a dependency on the Markov chain structure via the maximal expected hitting time of s, namely ","element":"span"},{"style":{"height":0},"width":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-41.png","element":"img","alt":"�","inline":true},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"height":9.19},"width":31.28,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-42.png","element":"img","alt":" τs","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
In contrast, the ","element":"span"},{"text":"expected ","element":"span"},{"style":{"fontStyle":"italic"},"text":"waiting time scales with the expected recurrence time ","element":"span"},{"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":")] ","element":"span"},{"style":{"height":10.8},"width":65.42,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-43.png","element":"img","alt":" = Θ","inline":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"height":10.4},"width":32.64,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-44.png","element":"img","alt":" ρs","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". However, an exponential concentration with the expected recurrence time is not possible in general (see Appendix ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"B.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"for a counterexample).","element":"span"}],[{"text":"Using the Lambert W function, we invert Corollary ","element":"span"},{"href":"#id-16","text":"4.4 ","element":"a"},{"text":"to lower-bound the visits by step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"with high probability. Finally, the convergence rate of ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") follows from Theorem ","element":"span"},{"href":"#id-15","text":"4.2. 
","element":"a"},{"id":"id-17","style":{"fontWeight":"bold"},"text":"Theorem 4.5 ","element":"span"},{"text":"(Convergence rate over steps)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability of at least ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":11.2},"width":51.58,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-45.png","element":"img","alt":" − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", for any T ","element":"span"},{"style":{"height":13.19},"width":117.2,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-46.png","element":"img","alt":" > e δ τs","inline":true},{"style":{"fontStyle":"italic"},"text":", the MRP ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":")","element":"span"},{"text":"0","element":"span"},{"style":{"height":8},"width":61.74,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/3-47.png","element":"img","alt":"≤t e δ","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"style":{"height":9.19},"width":31.28,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-5.png","element":"img","alt":" τs","inline":true},{"style":{"fontStyle":"italic"},"text":", the MRP ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":")","element":"span"},{"text":"0","element":"span"},{"style":{"height":8},"width":61.74,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-6.png","element":"img","alt":"≤t ","element":"span"},{"text":"0","element":"span"}],[{"style":{"width":"88%"},"width":849,"height":249,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-27.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":10.8},"width":28.72,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-28.png","element":"img","alt":" ηt","inline":true,"padRight":true},{"text":"is the learning rate. A common choice is to set ","element":"span"},{"style":{"height":23.6},"width":262.82,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-29.png","element":"img","alt":" η_t = 1/(∑_{u=0}^{t} 1","inline":true},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"]","element":"span"},{"style":{"height":23.6},"width":14,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-30.png","element":"img","alt":")","inline":true},{"text":"which satisfies the Robbins-Monro conditions ","element":"span"},{"href":"#id-4","referenceIndex":2,"text":"(Bertsekas and Tsitsiklis ","element":"a"},{"href":"#id-4","referenceIndex":2,"text":"1996)","element":"a"},{"text":". 
However, it has been shown to lead to slower convergence than ","element":"span"},{"style":{"height":10.8},"width":77.28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-31.png","element":"img","alt":" ηt =","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"height":23.6},"width":142.23,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-32.png","element":"img","alt":"/(∑_{u=0}^{t} 1","inline":true},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"]","element":"span"},{"style":{"height":26.59},"width":29.19,"height":66.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-33.png","element":"img","alt":")^d","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"style":{"height":8},"width":21,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-34.png","element":"img","alt":" ∈","inline":true,"padRight":true},{"text":"(","element":"span"},{"text":"1","element":"span"},{"style":{"height":14},"width":39.31,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-35.png","element":"img","alt":"/2,","inline":true,"padRight":true},{"text":"1) ","element":"span"},{"href":"#id-9","referenceIndex":5,"text":"(Even-Dar and ","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"Mansour ","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"2003)","element":"a"},{"text":".","element":"span"}],[{"text":"It is more accurate to consider TD methods as a large family of estimators, each with di","element":"span"},{"text":"ff","element":"span"},{"text":"erent choices of 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":" and ","element":"span"},{"style":{"height":10.8},"width":28.72,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-36.png","element":"img","alt":" ηt","inline":true},{"text":". Choosing these parameters can create extra work and sometimes confusion for practitioners. In contrast, the loop estimator, like the model-based estimator, has no parameters to tune. In any case, it is not our intention to compare with the TD family exhaustively (for more results on TD, see ","element":"span"},{"href":"#id-24","referenceIndex":12,"text":"(Kearns and Singh ","element":"a"},{"href":"#id-24","referenceIndex":12,"text":"2000; ","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"Even-Dar and Mansour ","element":"a"},{"href":"#id-9","referenceIndex":5,"text":"2003)","element":"a"},{"text":"). Instead, we will compare with TD(0) and TD(10), both with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= ","element":"span"},{"text":"1, and TD(0)","element":"span"},{"style":{"height":5.2},"width":13,"height":13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-37.png","element":"img","alt":"∗","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"style":{"height":14},"width":75.29,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/4-38.png","element":"img","alt":" = 1/2","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Comparative experiments","element":"span"}],[{"text":"We experiment with di","element":"span"},{"text":"ff","element":"span"},{"text":"erent values for the discount factor 
","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-0.png","element":"img","alt":"γ","inline":true},{"text":", because, roughly speaking, 1","element":"span"},{"style":{"height":14.8},"width":116.17,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-1.png","element":"img","alt":"/(1 − γ","inline":true},{"text":") sets the horizon beyond which rewards are discounted too heavily to matter. We compare the estimation errors measured in ","element":"span"},{"style":{"height":7.6},"width":34,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-2.png","element":"img","alt":" ∞","inline":true},{"text":"-norm, which is important in RL. The results are shown in Figure ","element":"span"},{"href":"#id-38","text":"2.","element":"a"}],[{"text":"• ","element":"span"},{"text":"The model-based estimator dominates all estimators for every discount setting we tested.","element":"span"}],[{"text":"• ","element":"span"},{"text":"TD(k) estimators perform well if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"style":{"height":14.8},"width":172.16,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-3.png","element":"img","alt":" ≥ 1/(1 − γ","inline":true},{"text":").","element":"span"}],[{"text":"• ","element":"span"},{"text":"The loop estimator performs worse than, but is competitive with, the model-based estimator. 
Furthermore, similar to the model-based estimator and unlike the TD(k) estimators, its performance seems to be less influenced by discounting.","element":"span"}],[{"id":"id-38","style":{"width":"100%"},"width":954,"height":1388,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-4.png","element":"img"}],[{"text":"Figure 2: Estimation errors (normalized by max","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"s ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"v","element":"figcaption","subtype":"caption"},{"text":"(","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"s","element":"figcaption","subtype":"caption"},{"text":") to be comparable across discount factors) of di","element":"figcaption","subtype":"caption"},{"text":"ff","element":"figcaption","subtype":"caption"},{"text":"erent estimators at di","element":"figcaption","subtype":"caption"},{"text":"ff","element":"figcaption","subtype":"caption"},{"text":"erent discount factors (left) ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":114.59,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-5.png","element":"img","alt":" γ = 0.","inline":true},{"text":"9 and (right) ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":119.22,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-6.png","element":"img","alt":"γ = 0.","inline":true},{"text":"99. Shaded areas represent the standard deviations over 200 runs. 
Note the vertical log scale.","element":"figcaption","subtype":"caption"}]]},{"heading":"6 Discussions","paragraphs":[[{"id":"id-19","text":"The elementary identity below relates the expected first ","element":"span"},{"text":"return times ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y","element":"span"},{"style":{"fontStyle":"italic"},"text":"s s","element":"span"},{"style":{"height":23.2},"width":192.6,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-7.png","element":"img","alt":"′ � Es�H+s′�","inline":true},{"text":"to the transition probabilities ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"style":{"fontStyle":"italic"},"text":"s s","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-8.png","element":"img","alt":"′","inline":true,"padRight":true},{"text":"for a finite Markov chain. In matrix notation, suppose that the expected first return times are organized in a matrix ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Y","element":"span"},{"text":" and that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"P ","element":"span"},{"text":"is the transition matrix of the Markov chain; then we have ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Y ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontWeight":"bold"},"text":"P ","element":"span"},{"style":{"height":16},"width":76.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-9.png","element":"img","alt":"(Y −","inline":true,"padRight":true},{"text":"diag","element":"span"},{"style":{"fontWeight":"bold"},"text":"Y ","element":"span"},{"style":{"height":16},"width":74.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-10.png","element":"img","alt":" + E)","inline":true,"padRight":true},{"text":"where 
diag","element":"span"},{"style":{"fontWeight":"bold"},"text":"Y ","element":"span"},{"text":"is a matrix with the same diagonal as ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Y ","element":"span"},{"text":"and zero elsewhere, and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E ","element":"span"},{"text":"is a matrix with all ones. Thus, knowing ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Y ","element":"span"},{"text":"is equivalent to knowing the full model, as we can compute ","element":"span"},{"style":{"fontWeight":"bold"},"text":"P ","element":"span"},{"text":"using this identity. Recall that by definition ","element":"span"},{"text":"E ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"text":"1","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":")] ","element":"span"},{"style":{"height":15.72},"width":159.43,"height":39.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-11.png","element":"img","alt":" = Es�H+s�","inline":true},{"text":", which is exactly the diagonal of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Y","element":"span"},{"text":". 
Because knowing only the diagonal is not su","element":"span"},{"text":"ffi","element":"span"},{"text":"cient to determine the entire set of model parameters, namely ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Y","element":"span"},{"text":", the loop estimator based on (","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"height":7.6},"width":31.74,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-12.png","element":"img","alt":">0","inline":true,"padRight":true},{"text":"indeed falls short of being a model-based method. It may be considered a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"semi","element":"span"},{"text":"-model-based method as it estimates some but not all of the model parameters.","element":"span"}],[{"text":"For large MRPs, a natural extension of our work is to consider recurrence of features instead of states, e.g., a video game screen might not repeat itself completely but the same items might reappear. After all, without exact or approximate repetition, it would not be possible for an agent to learn and improve its decisions.","element":"span"}],[{"text":"We believe that regenerative structure can be further exploited in RL (particularly in the form of the loop Bellman equation ","element":"span"},{"href":"#id-12","text":"(10)","element":"a"},{"text":") and we think this article provides the fundamental results for future study in this direction.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"This work was supported in part by the National Science Foundation under Grant No. 1830660. We thank Mesrob I. Ohannessian for a helpful discussion on Markov chains, and Christina Lee Yu for discussing an early version of this work. 
We also thank anonymous reviewers for their constructive feedback, in particular, for bringing an independent work ","element":"span"},{"href":"#id-32","referenceIndex":18,"text":"(Subramanian and Mahajan ","element":"a"},{"href":"#id-32","referenceIndex":18,"text":"2019) ","element":"a"},{"text":"to our attention.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-35","text":"Aldous, D.; and Fill, J. 1999. Reversible Markov chains and ","element":"span"},{"text":"random walks on graphs. Book in preparation (available at http:","element":"span"},{"text":"//","element":"span"},{"href":"http://www.stat.berkeley.edu/~aldous/RWG/Chap2.pdf","text":"www.stat.berkeley.edu","element":"a"},{"style":{"height":14},"width":29.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-13.png","element":"img","alt":"/∼","inline":true},{"text":"aldous","element":"span"},{"text":"/","element":"span"},{"text":"RWG","element":"span"},{"text":"/","element":"span"},{"text":"Chap2.pdf).","element":"span"}],[{"id":"id-4","text":"Bertsekas, D. P.; and Tsitsiklis, J. N. 1996. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neuro-dynamic programming","element":"span"},{"text":". Athena Scientific Belmont, MA.","element":"span"}],[{"id":"id-22","text":"Dayan, P.; and Sejnowski, T. J. 1994. TD (","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/5-14.png","element":"img","alt":"λ","inline":true},{"text":") converges with probability 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning ","element":"span"},{"text":"14(3): 295–301.","element":"span"}],[{"id":"id-28","text":"Derman, C. 1970. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Finite state Markovian decision processes","element":"span"},{"text":". Academic Press.","element":"span"}],[{"id":"id-9","text":"Even-Dar, E.; and Mansour, Y. 2003. 
Learning rates for Q-","element":"span"},{"text":"learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research ","element":"span"},{"text":"5(Dec): 1–25.","element":"span"}],[{"id":"id-26","text":"Fox, B. L.; and Glynn, P. W. 1989. Simulating discounted ","element":"span"},{"text":"costs. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Management Science ","element":"span"},{"text":"35(11): 1297–1315.","element":"span"}],[{"id":"id-10","text":"Gheshlaghi Azar, M.; Munos, R.; and Kappen, H. J. 2013. ","element":"span"},{"text":"Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning ","element":"span"},{"text":"91(3): 325–349. ISSN 1573-0565. doi:10.1007","element":"span"},{"text":"/","element":"span"},{"text":"s10994-013-5368-1.","element":"span"}],[{"id":"id-27","text":"Haviv, M.; and Puterman, M. L. 1992. Estimating the value ","element":"span"},{"text":"of a discounted reward process. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Operations Research Letters ","element":"span"},{"text":"11(5): 267–272. ISSN 0167-6377. doi:10.1016","element":"span"},{"text":"/","element":"span"},{"text":"0167-6377(92)90002-K.","element":"span"}],[{"id":"id-1","text":"Howard, R. A. 1960. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dynamic programming and Markov processes. ","element":"span"},{"text":"John Wiley.","element":"span"}],[{"id":"id-23","text":"Jaakkola, T.; Jordan, M. I.; and Singh, S. P. 1994. ","element":"span"},{"text":"Convergence of stochastic iterative dynamic programming algorithms. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 703–710.","element":"span"}],[{"id":"id-8","text":"Kearns, M. J.; and Singh, S. P. 1999. 
Finite-sample con","element":"span"},{"text":"vergence rates for Q-learning and indirect algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 996–1002.","element":"span"}],[{"id":"id-24","text":"Kearns, M. J.; and Singh, S. P. 2000. Bias-Variance Error ","element":"span"},{"text":"Bounds for Temporal Di","element":"span"},{"text":"ff","element":"span"},{"text":"erence Updates. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", 142–147.","element":"span"}],[{"id":"id-29","text":"Lee, C. E.; Ozdaglar, A.; and Shah, D. 2013. Approximat","element":"span"},{"text":"ing the Stationary Probability of a Single State in a Markov chain. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint ","element":"span"},{"text":"URL https:","element":"span"},{"text":"//","element":"span"},{"text":"arxiv.org","element":"span"},{"text":"/","element":"span"},{"text":"abs","element":"span"},{"text":"/","element":"span"},{"href":"https://arxiv.org/abs/1312.1986","text":"1312.1986.","element":"a"}],[{"id":"id-5","text":"Pananjady, A.; and Wainwright, M. J. 2019. Value function ","element":"span"},{"text":"estimation in Markov reward processes: Instance-dependent ","element":"span"},{"style":{"height":13.19},"width":40.62,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/6-0.png","element":"img","alt":"ℓ∞","inline":true},{"text":"-bounds for policy evaluation. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint ","element":"span"},{"text":"URL ","element":"span"},{"href":"https://arxiv.org/abs/1909.08749","text":"https: ","element":"a"},{"text":"//","element":"span"},{"text":"arxiv.org","element":"span"},{"text":"/","element":"span"},{"text":"abs","element":"span"},{"text":"/","element":"span"},{"href":"https://arxiv.org/abs/1909.08749","text":"1909.08749.","element":"a"}],[{"id":"id-2","text":"Puterman, M. L. 1994. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov Decision Processes: Discrete Stochastic Dynamic Programming","element":"span"},{"text":". ","element":"span"},{"text":"John Wiley & Sons, Inc.","element":"span"}],[{"id":"id-33","text":"Ross, S. M. 1996. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic processes","element":"span"},{"text":". John Wiley, 2nd edition.","element":"span"}],[{"id":"id-36","text":"Strehl, A. L.; and Littman, M. L. 2008. ","element":"span"},{"text":"An analysis of model-based interval estimation for Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computer and System Sciences ","element":"span"},{"text":"74(8): 1309–1331.","element":"span"}],[{"id":"id-32","text":"Subramanian, J.; and Mahajan, A. 2019. Renewal Monte ","element":"span"},{"text":"Carlo: Renewal theory based reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control ","element":"span"},{"text":"1–1.","element":"span"}],[{"id":"id-7","text":"Sutton, R. S. 1988. Learning to predict by the methods of ","element":"span"},{"text":"temporal di","element":"span"},{"text":"ff","element":"span"},{"text":"erences. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning ","element":"span"},{"text":"3(1): 9–44.","element":"span"}],[{"id":"id-0","text":"Sutton, R. S.; and Barto, A. G. 2018. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement learning: An introduction","element":"span"},{"text":". MIT press, 2nd edition.","element":"span"}]]},{"heading":"A Detailed proofs","paragraphs":[[{"id":"id-20","style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-39","style":{"fontWeight":"bold"},"text":"4.1","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Note that since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":"0 ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":"1","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"text":"= ","element":"span"},{"text":"0 and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"text":"1","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":"2","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"). Since only state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"appears here, we will suppress ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"from the random variables below to simplify the notation. 
We use Assumption ","element":"span"},{"href":"#id-25","text":"3.1 ","element":"a"},{"text":"or the weaker assumption that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is positive recurrent, i.e., ","element":"span"},{"style":{"height":12},"width":131.08,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-0.png","element":"img","alt":" ρs < ∞","inline":true},{"text":", to guarantee that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":"2 ","element":"span"},{"style":{"height":8.4},"width":70.42,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-1.png","element":"img","alt":" < ∞","inline":true,"padRight":true},{"text":"with probability 1.","element":"span"}],[{"style":{"width":"92%"},"width":885,"height":1641,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"A.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-15","style":{"fontWeight":"bold"},"text":"4.2","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Since only state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"appears below, we will suppress it in the interest of conciseness. 
Consider","element":"span"}],[{"style":{"width":"47%"},"width":454,"height":290,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-3.png","element":"img"}],[{"style":{"width":"61%"},"width":585,"height":179,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-4.png","element":"img"}],[{"text":"By the definition of an MRP, we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v ","element":"span"},{"style":{"height":8},"width":21,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-5.png","element":"img","alt":" ∈","inline":true,"padRight":true},{"text":"[0","element":"span"},{"style":{"height":14},"width":121.38,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-6.png","element":"img","alt":", rmax/1−γ","inline":true},{"text":"] and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":8},"width":21,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-7.png","element":"img","alt":" ∈","inline":true,"padRight":true},{"text":"[0","element":"span"},{"style":{"height":14},"width":121.38,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-8.png","element":"img","alt":", rmax/1−γ","inline":true},{"text":"). 
Furthermore, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-9.png","element":"img","alt":" ≥","inline":true,"padRight":true},{"text":"1 implies that ","element":"span"},{"style":{"height":16.59},"width":87.15,"height":41.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-10.png","element":"img","alt":" γI1 ∈","inline":true,"padRight":true},{"text":"(0","element":"span"},{"style":{"height":10.8},"width":38.62,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-11.png","element":"img","alt":", γ","inline":true},{"text":"] and 0 ","element":"span"},{"style":{"height":14.4},"width":273.75,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-12.png","element":"img","alt":" < 1 − γ ≤ 1 − ˆαn","inline":true},{"text":". Hence the estimation error is bounded as follows","element":"span"}],[{"style":{"width":"90%"},"width":863,"height":348,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-13.png","element":"img"}],[{"text":"With failure probability of at most ","element":"span"},{"style":{"height":14},"width":36.53,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-14.png","element":"img","alt":" δ/","inline":true},{"text":"2, from Hoe","element":"span"},{"text":"ff","element":"span"},{"text":"ding’s inequality, we have","element":"span"}],[{"style":{"width":"43%"},"width":411,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-15.png","element":"img"}],[{"text":"and similarly","element":"span"}],[{"style":{"width":"36%"},"width":352,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-16.png","element":"img"}],[{"text":"Applying the union bound, we 
have","element":"span"}],[{"style":{"width":"97%"},"width":928,"height":360,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-17.png","element":"img"}],[{"id":"id-30","style":{"fontWeight":"bold"},"text":"A.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-16","style":{"fontWeight":"bold"},"text":"4.3","element":"a"}],[{"text":"This proof largely follows the proof by ","element":"span"},{"href":"#id-29","referenceIndex":13,"text":"Lee, Ozdaglar, and ","element":"a"},{"href":"#id-29","referenceIndex":13,"text":"Shah ","element":"a"},{"href":"#id-29","referenceIndex":13,"text":"(2013) ","element":"a"},{"text":"and is presented here in the interest of self-containedness.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"> ","element":"span"},{"text":"0, and consider the probability of the event that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is not visited in the next ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"steps given that it is not visited in the previous ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"steps, that is","element":"span"}],[{"style":{"width":"80%"},"width":766,"height":441,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/7-18.png","element":"img"}],[{"style":{"width":"84%"},"width":810,"height":250,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-0.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a 
","element":"span"},{"style":{"height":9.59},"width":92.03,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-1.png","element":"img","alt":" = e τs","inline":true},{"text":", and apply the above","element":"span"},{"style":{"height":23.6},"width":53.29,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-2.png","element":"img","alt":"� ta�","inline":true},{"text":"-many times to","element":"span"}],[{"style":{"width":"81%"},"width":781,"height":557,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"A.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Corollary ","element":"span"},{"href":"#id-16","style":{"fontWeight":"bold"},"text":"4.4","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For conciseness, we suppress ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"here since only state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"appears. Suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"> ","element":"span"},{"text":"0. 
By Remark ","element":"span"},{"href":"#id-40","text":"3.2, ","element":"a"},{"text":"we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"< ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n a ","element":"span"},{"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":"1 ","element":"span"},{"text":"< ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"< ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"style":{"height":13.2},"width":211.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-4.png","element":"img","alt":" = 1, · · · , n −","inline":true,"padRight":true},{"text":"1. Note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":15.72},"width":93.43,"height":39.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-5.png","element":"img","alt":" ≤ H+s","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is distributed identically to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"height":15.72},"width":18.4,"height":39.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-6.png","element":"img","alt":"+s","inline":true,"padRight":true},{"text":". 
Immediately ","element":"span"},{"text":"from inverting Lemma ","element":"span"},{"href":"#id-16","text":"4.3, ","element":"a"},{"text":"we have that, with failure probability of at most ","element":"span"},{"style":{"height":14},"width":37.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-7.png","element":"img","alt":"δ/n","inline":true},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":"1 ","element":"span"},{"text":"is bounded","element":"span"}],[{"style":{"width":"38%"},"width":364,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-8.png","element":"img"}],[{"text":"Suppose each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"< ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"fails with probability of at most ","element":"span"},{"style":{"height":14},"width":37.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-9.png","element":"img","alt":"δ/n","inline":true},{"text":"; then we similarly have","element":"span"}],[{"style":{"width":"100%"},"width":954,"height":296,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"A.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-17","style":{"fontWeight":"bold"},"text":"4.5","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"First, we introduce the Lambert ","element":"span"},{"text":"W ","element":"span"},{"text":"function to invert Corollary ","element":"span"},{"href":"#id-16","text":"4.4. 
","element":"a"},{"text":"Recall that the Lambert ","element":"span"},{"text":"W ","element":"span"},{"text":"function is a transcendental function defined such that ","element":"span"},{"text":"W","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"style":{"height":13.39},"width":111.58,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-11.png","element":"img","alt":"W(x) =","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", and thus it is a monotonically increasing function. At step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", suppose","element":"span"}],[{"style":{"width":"69%"},"width":659,"height":451,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-12.png","element":"img"}],[{"style":{"width":"47%"},"width":457,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-13.png","element":"img"}],[{"text":"Use the fact that if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"> ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"text":", then log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"style":{"height":4.8},"width":25,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-14.png","element":"img","alt":" −","inline":true,"padRight":true},{"text":"log log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"< ","element":"span"},{"text":"W","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"). 
So given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"height":13.19},"width":103.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-15.png","element":"img","alt":" > eδτs","inline":true},{"text":", we can lower-bound the number of visits","element":"span"}],[{"style":{"width":"38%"},"width":363,"height":340,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-16.png","element":"img"}],[{"text":"Plugging this into Theorem ","element":"span"},{"href":"#id-15","text":"4.2, ","element":"a"},{"text":"we obtain the desired expression","element":"span"}],[{"style":{"width":"88%"},"width":842,"height":175,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-17.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"A.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Corollary ","element":"span"},{"href":"#id-18","style":{"fontWeight":"bold"},"text":"4.6","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We run ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"many copies of loop estimators, one for each state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"style":{"height":12},"width":60,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-18.png","element":"img","alt":" ∈ S","inline":true},{"text":". 
Following Theorem ","element":"span"},{"href":"#id-17","text":"4.5, ","element":"a"},{"text":"with failure probability of at most ","element":"span"},{"style":{"height":14},"width":38.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-19.png","element":"img","alt":"δ/S","inline":true},{"text":", we can ensure that each estimator has an error of at most","element":"span"}],[{"style":{"width":"78%"},"width":752,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-20.png","element":"img"}],[{"text":"The largest upper bound comes from the state with the largest maximal expected hitting time max","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"height":9.59},"width":73.8,"height":23.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-21.png","element":"img","alt":"∈S τs","inline":true,"padRight":true},{"text":"of the Markov chain. Apply the union bound and we have","element":"span"}],[{"style":{"width":"49%"},"width":475,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-22.png","element":"img"}],[{"style":{"height":13.6},"width":16,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-23.png","element":"img","alt":"∥","inline":true},{"text":"ˆ","element":"span"},{"style":{"fontWeight":"bold"},"text":"v","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"height":13.97},"width":131.92,"height":34.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-24.png","element":"img","alt":" − v∥∞ <","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"max","element":"span"},{"text":"(1 ","element":"span"},{"style":{"height":10.8},"width":56.2,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-25.png","element":"img","alt":" − 
γ","inline":true},{"text":")","element":"span"},{"text":"2","element":"span"}],[{"style":{"width":"1%"},"width":31,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-26.png","element":"img"}]]},{"heading":"B Additional results","paragraphs":[[{"id":"id-21","style":{"fontWeight":"bold"},"text":"B.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Conditions for consistency ","element":"span"},{"text":"We provide an example to show that if a state is not positive recurrent, i.e., transient in a finite chain, then we cannot attain a consistent estimate of its value in general. This suggests that Assumption ","element":"span"},{"href":"#id-25","text":"3.1 ","element":"a"},{"text":"is not too strong as a su","element":"span"},{"text":"ffi","element":"span"},{"text":"cient condition to study. Recall that we are interested in consistent estimation of the discounted value of a state given a single sample path from an unknown MRP. If a state is not positive recurrent, then without assuming any reset mechanisms, it is almost surely visited only finitely many times along any sample path. Consider the three MRPs in Figure ","element":"span"},{"href":"#id-41","text":"3. 
","element":"a"},{"text":"It is obvious that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"height":16.45},"width":170.46,"height":41.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-27.png","element":"img","alt":"′1) = γ/1−γ","inline":true},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"height":11.04},"width":37.68,"height":27.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-28.png","element":"img","alt":"′′1","inline":true,"padRight":true},{"text":") ","element":"span"},{"text":"= ","element":"span"},{"text":"0, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"text":") ","element":"span"},{"style":{"height":14},"width":151.56,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-29.png","element":"img","alt":" = γ/2(1−γ)","inline":true},{"text":". 
","element":"span"},{"text":"Suppose we start in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1 ","element":"span"},{"text":"(of the MRP in the middle). Then there are only two possible sample paths: (","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"0","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"2","element":"span"},{"style":{"height":13.2},"width":96.32,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-30.png","element":"img","alt":", 1, · · ·","inline":true,"padRight":true},{"text":") and (","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"0","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"0","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"3","element":"span"},{"style":{"height":13.2},"width":96.33,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-31.png","element":"img","alt":", 0, · · ·","inline":true,"padRight":true},{"text":"). Note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1 ","element":"span"},{"text":"is only visited ","element":"span"},{"style":{"fontStyle":"italic"},"text":"once ","element":"span"},{"text":"in either sample path and is thus transient. 
Furthermore, we obtain the first sample path with probability ","element":"span"},{"text":"1","element":"span"},{"style":{"height":14},"width":29.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/8-32.png","element":"img","alt":"/2","inline":true,"padRight":true},{"text":"in which case we","element":"span"}],[{"id":"id-41","style":{"width":"55%"},"width":525,"height":1089,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-0.png","element":"img"}],[{"text":"Figure 3: Diagram of three Markov reward processes with transition probabilities labeled on the edges. The rewards are 1 for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"s","element":"figcaption","subtype":"caption"},{"text":"2 ","element":"figcaption","subtype":"caption"},{"text":"and 0 elsewhere.","element":"figcaption","subtype":"caption"}],[{"text":"cannot distinguish it from a sample path from the MRP on the top. Similarly, with probability ","element":"span"},{"text":"1","element":"span"},{"style":{"height":14},"width":29.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-1.png","element":"img","alt":"/2","inline":true},{"text":", we get the second sample path, which is indistinguishable from a sample path from the MRP at the bottom. 
However, the values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"height":11.04},"width":28,"height":27.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-2.png","element":"img","alt":"′1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"height":11.04},"width":37.68,"height":27.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-3.png","element":"img","alt":"′′1","inline":true,"padRight":true},{"text":"are different (and neither equals that of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"text":"). Hence ","element":"span"},{"text":"we cannot devise an estimator that can ","element":"span"},{"style":{"fontStyle":"italic"},"text":"consistently ","element":"span"},{"text":"estimate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"text":"), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"height":11.04},"width":28,"height":27.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-4.png","element":"img","alt":"′1","inline":true},{"text":") and 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"height":11.04},"width":37.68,"height":27.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-5.png","element":"img","alt":"′′1","inline":true,"padRight":true},{"text":").","element":"span"}],[{"id":"id-31","style":{"fontWeight":"bold"},"text":"B.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Concentration of first return times","element":"span"}],[{"text":"We provide an example to show that an exponential concentration of first return times given the expected recurrence time is impossible. In contrast, in Lemma ","element":"span"},{"href":"#id-16","text":"4.3 ","element":"a"},{"text":"we proved an exponential concentration given the maximal expected hitting time. Furthermore, this is consistent with what one would expect from Markov’s inequality since first return times are nonnegative random variables.","element":"span"}],[{"text":"Consider a class of Markov chains ","element":"span"},{"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"} ","element":"span"},{"text":"indexed by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-6.png","element":"img","alt":" ≥","inline":true,"padRight":true},{"text":"3 where Markov chain ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"has a state space 
","element":"span"},{"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"style":{"height":7.6},"width":83.06,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-7.png","element":"img","alt":", · · · ,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"} ","element":"span"},{"text":"and a transition kernel as depicted in Figure ","element":"span"},{"href":"#id-42","text":"4. ","element":"a"},{"text":"Starting in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"text":", the chain ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"can either transition back to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1 ","element":"span"},{"text":"in one step with probability 1 ","element":"span"},{"style":{"height":19.38},"width":83.89,"height":48.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-8.png","element":"img","alt":" − 1k−1","inline":true,"padRight":true},{"text":"or to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"2 ","element":"span"},{"text":"with probability ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"style":{"height":7.6},"width":31.74,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-9.png","element":"img","alt":"−1","inline":true},{"text":". 
Thus, ","element":"span"},{"text":"there are only two possible values for the first return time to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"text":": 1 by the self-transition, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"by going through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"3","element":"span"},{"text":", . . . , ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"1","element":"span"},{"text":". We can calculate the expected recurrence time as","element":"span"}],[{"style":{"width":"61%"},"width":583,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-10.png","element":"img"}],[{"text":"Suppose that there is an exponential concentration of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"height":17.4},"width":20.85,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-11.png","element":"img","alt":"+s1","inline":true,"padRight":true},{"text":"given ","element":"span"},{"style":{"height":10.87},"width":42.49,"height":27.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-12.png","element":"img","alt":" ρs1","inline":true},{"text":". Then we could upper-bound ","element":"span"},{"style":{"height":23.2},"width":165.07,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-13.png","element":"img","alt":" P(H+s1 ≥ t)","inline":true},{"text":"by some exponentially decaying function of 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". However, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"height":17.4},"width":60.87,"height":43.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-14.png","element":"img","alt":"+s1 =","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"with probability ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"style":{"height":7.6},"width":31.74,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-15.png","element":"img","alt":"−1","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", which decays only polynomially in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". This makes such an exponential bound impossible, as ","element":"span"},{"id":"id-42","text":"the upper bound has to work for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"39%"},"width":373,"height":1238,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.06299/images/9-16.png","element":"img"}],[{"text":"Figure 4: Diagram of Markov chain ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"M","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"with transition probabilities labeled on the edges.","element":"figcaption","subtype":"caption"}]]}],"_version":"3.3.2"},"paperNode":"$28:props:children:props:children:0:props:product"}]]