28:["$","$L31",null,{"isWhiteLabelled":false,"children":["$","$Lc",null,{"pt":{"compact":0,"expanded":3},"children":[["$","$L32",null,{"noStar":true,"publisher":true,"task":true,"params":true,"size":"xl","product":{"id":"eyJwYXBlcklEIjoiMjAwMS4xMDc0MiIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","updated":"2020-01-29T09:56:26.000Z","paperID":"2001.10742","published":"2020-01-29T09:56:26.000Z","authors":"[\"Ming Yin\",\"Yu-Xiang Wang\"]","title":"Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning","scoreTrending":null,"summary":"$33","lastCheckedForCode":"2022-09-02T02:59:04.271Z","links":[],"reposConnection":{"edges":[]},"models":[],"tags":[],"summaries":[],"emailsConnection":{"edges":[]},"__typename":"paper","authorArray":["Ming Yin","Yu-Xiang Wang"]}}],["$","$L25",null,{"container":true,"columns":100,"spacing":{"compact":0,"expanded":2,"large":3},"children":[["$","$L25",null,{"size":{"compact":100,"expanded":100,"large":68},"children":[["$","$8",null,{"children":["$","$L34",null,{"publisher":"arxiv","paperID":"2001.10742","product":{"paper":"$28:props:children:props:children:0:props:product","models":"$28:props:children:props:children:0:props:product:models"},"isWhiteLabelled":false}]}],["$","$8",null,{"children":["$","$L35",null,{"article":"$L36","model":"$undefined"}]}]]}],["$","$L25",null,{"size":"grow","children":["$","$L37",null,{}]}]]}],["$","$8",null,{"children":null}],[["$","audio",null,{"id":"tts"}],["$","$L38",null,{"paperID":"2001.10742","publisher":"arxiv","paperJSON":{"title":"Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning","paperID":"2001.10742","avgLineHeight":13.55,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy 
","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/0-0.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"using offline data collected by running a logging policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/0-1.png","element":"img","alt":" µ","inline":true},{"text":". Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with time horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", which motivates a splurge of recent interest in alternatives that break the “Curse of Horizon” (","element":"span"},{"href":"#id-0","referenceIndex":22,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":22,"text":"2018a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":"). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order ","element":"span"},{"style":{"height":17.38},"width":144.86,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/0-2.png","element":"img","alt":" O(H3/n","inline":true},{"text":") in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. 
The MSE bound, however, is still a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"away from a Cramer-Rao lower bound of order Ω(","element":"span"},{"style":{"height":17.39},"width":98.17,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/0-3.png","element":"img","alt":"H2/n","inline":true},{"text":"). In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"Off-policy evaluation ","element":"span"},{"text":"(OPE), which predicts the performance of a policy with data only sampled by a logging/behavior policy (","element":"span"},{"href":"#id-2","referenceIndex":31,"text":"Sutton & Barto","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":31,"text":"2018","element":"a"},{"text":"), plays a key role in using reinforcement learning (RL) algorithms responsibly in many real-world decision-making problems such as marketing, finance, robotics, and healthcare. Deploying a policy without an accurate evaluation of its performance could be costly or illegal, and can even break down the machine learning system. There is a large body of literature that has studied the off-policy evaluation problem from both theoretical and application-oriented perspectives. 
From the theoretical perspective, OPE problem is extensively studied in contextual bandits (","element":"span"},{"href":"#id-3","referenceIndex":20,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":20,"text":"2011","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"Dud´ık et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"2011","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":32,"text":"Swaminathan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":32,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":37,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":37,"text":"2017","element":"a"},{"text":") and reinforcement learning (RL) (","element":"span"},{"href":"#id-7","referenceIndex":21,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":21,"text":"2015","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"Jiang & Li","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":34,"text":"Thomas & Brunskill","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":34,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":7,"text":"Farajtabar ","element":"a"},{"href":"#id-10","referenceIndex":7,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":7,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") and the 
results of OPE studies have been applied to real-world applications including marketing (","element":"span"},{"href":"#id-11","referenceIndex":33,"text":"Theocharous et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":33,"text":"2015","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":35,"text":"Thomas et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":35,"text":"2017","element":"a"},{"text":") and education (","element":"span"},{"href":"#id-13","referenceIndex":25,"text":"Mandel et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":25,"text":"2014","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Problem setup. ","element":"span"},{"text":"In the reinforcement learning (RL) problem the agent interacts with an underlying unknown dynamics which is modeled as a Markov decision process (MDP). An MDP is defined by a tuple ","element":"span"},{"style":{"height":17.2},"width":742.05,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-0.png","element":"img","alt":" M = (S, A, r, P, d1, H), where S and A","inline":true,"padRight":true},{"text":"are the state and action spaces, ","element":"span"},{"style":{"height":15.42},"width":355.52,"height":38.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-1.png","element":"img","alt":" Pt : S × A × S →","inline":true,"padRight":true},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1] is the transition kernel with ","element":"span"},{"style":{"height":17.6},"width":166.18,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-2.png","element":"img","alt":" Pt(s′|s, a","inline":true},{"text":") representing the probability of seeing state 
","element":"span"},{"style":{"height":8.4},"width":36.45,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-3.png","element":"img","alt":" s′","inline":true,"padRight":true},{"text":"after taking action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"at state ","element":"span"},{"style":{"height":16},"width":335.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-4.png","element":"img","alt":" s, rt : S × A → R","inline":true,"padRight":true},{"text":"is the mean reward function with ","element":"span"},{"style":{"height":17.2},"width":113.42,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-5.png","element":"img","alt":" rt(s, a","inline":true},{"text":") being the average immediate goodness of (","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". Also, ","element":"span"},{"style":{"height":15.02},"width":39.71,"height":37.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-6.png","element":"img","alt":"d1","inline":true,"padRight":true},{"text":"is denoted as the initial state distribution and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"is the time horizon. The subscript ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"in ","element":"span"},{"style":{"height":14.62},"width":40.02,"height":36.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-7.png","element":"img","alt":"Pt","inline":true,"padRight":true},{"text":"means the transition dynamics are non-stationary and could be different at each time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
A (non-stationary) policy ","element":"span"},{"href":"#id-14","style":{"height":21.11},"width":236.12,"height":52.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-8.png","element":"img","alt":" π : S → PHA1","inline":true,"padRight":true},{"text":"assigns each state ","element":"span"},{"style":{"height":15.02},"width":117.04,"height":37.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-9.png","element":"img","alt":" st ∈ S","inline":true,"padRight":true},{"text":"a distribution over actions at each time ","element":"span"},{"style":{"height":17.6},"width":245,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-10.png","element":"img","alt":" t, i.e. πt(·|st","inline":true},{"text":") is a probability simplex with dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S|","element":"span"},{"text":". For brevity we suppress the subscript ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":10.22},"width":36.88,"height":25.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-11.png","element":"img","alt":" πt","inline":true,"padRight":true},{"text":"and denote ","element":"span"},{"style":{"height":17.6},"width":125.45,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-12.png","element":"img","alt":" π(at|st","inline":true},{"text":") the p.m.f of actions given state at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"text":"Given a target policy of interest ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-13.png","element":"img","alt":" π","inline":true},{"text":", then the distribution of one H-step trajectory 
","element":"span"},{"style":{"height":8},"width":23,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-14.png","element":"img","alt":" τ","inline":true,"padRight":true},{"text":"= (","element":"span"},{"style":{"height":11.9},"width":546.22,"height":29.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-15.png","element":"img","alt":"s1, a1, r1, ..., sH, aH, rH, sH+1","inline":true},{"text":") is specified by ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-16.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":":= (","element":"span"},{"href":"#id-14","style":{"height":19.13},"width":121.79,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-17.png","element":"img","alt":"d1, π)2","inline":true,"padRight":true},{"text":"as follows: ","element":"span"},{"style":{"height":17.08},"width":152,"height":42.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-18.png","element":"img","alt":" s1 ∼ dπ1","inline":true},{"text":", for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= ","element":"span"},{"text":"1","element":"span"},{"style":{"height":17.6},"width":352.66,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-19.png","element":"img","alt":", ..., H, at ∼ πt(·|st","inline":true},{"text":") and random reward ","element":"span"},{"style":{"height":10.62},"width":31.69,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-20.png","element":"img","alt":" rt","inline":true,"padRight":true},{"text":"has mean ","element":"span"},{"style":{"height":17.6},"width":140.36,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-21.png","element":"img","alt":" rt(st, at","inline":true},{"text":"). 
Then the value function under policy ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-22.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is defined as:","element":"span"}],[{"style":{"width":"19%"},"width":333,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-23.png","element":"img"}],[{"text":"The OPE problem aims at estimating ","element":"span"},{"style":{"height":12.33},"width":42.71,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-24.png","element":"img","alt":" vπ ","inline":true,"padRight":true},{"text":"while given that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"episodic data","element":"span"},{"href":"#id-15","style":{"height":39.97},"width":519.45,"height":99.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-25.png","element":"img","alt":"3 D =�(s(i)t , a(i)t , r(i)t )�t∈[H]i∈[n]","inline":true}],[{"id":"id-14","style":{"width":"99%"},"width":1713,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-26.png","element":"img"}],[{"style":{"height":6.4},"width":15,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-27.png","element":"img","alt":"2","inline":true},{"text":"For brevity, ","element":"span"},{"style":{"height":10.4},"width":194.13,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-28.png","element":"img","alt":" ∀π we use π","inline":true,"padRight":true},{"text":"to denote the pair (","element":"span"},{"style":{"height":12.8},"width":73.09,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-29.png","element":"img","alt":"d1, π","inline":true},{"text":"). 
This can be understood as: ","element":"span"},{"style":{"height":13.29},"width":203.66,"height":33.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-30.png","element":"img","alt":" ∀π, dπ1 = d1.","inline":true},{"style":{"height":6.4},"width":15,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/2-31.png","element":"img","alt":"3","inline":true},{"id":"id-15","text":"To distinguish the data from different episodes, we use superscript to denote which episode they belong","element":"span"}],[{"text":"are actually coming from a different logging policy ","element":"span"},{"style":{"height":12},"width":38.29,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-0.png","element":"img","alt":" µ.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Existing methods. ","element":"span"},{"text":"The classical way to tackle the problem of OPE relies on incorporating importance sampling weights (IS), which corrects the mismatch in the distributions under the behavior policy and target policy. 
Specifically, define the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"-step importance ratio as ","element":"span"},{"style":{"height":12},"width":34.56,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-1.png","element":"img","alt":"ρt","inline":true,"padRight":true},{"text":":= ","element":"span"},{"style":{"height":17.6},"width":319.32,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-2.png","element":"img","alt":" πt(at|st)/µt(at|st","inline":true},{"text":"), then it uses the cumulative importance ratio ","element":"span"},{"style":{"height":12},"width":60.91,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-3.png","element":"img","alt":" ρ1:t","inline":true,"padRight":true},{"text":":= ","element":"span"},{"style":{"height":21.2},"width":163.52,"height":53.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-4.png","element":"img","alt":"�tt′=1 ρt′","inline":true,"padRight":true},{"text":"to ","element":"span"},{"text":"create IS based estimators:","element":"span"}],[{"style":{"width":"52%"},"width":902,"height":283,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":23.8},"width":60.9,"height":59.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-6.png","element":"img","alt":" ρ(i)1:t","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":24.52},"width":560.17,"height":61.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-7.png","element":"img","alt":" �tt′=1 πt′(a(i)t′ |s(i)t′ )/µt′(a(i)t′ |s(i)t′","inline":true,"padRight":true},{"text":"). 
There are different versions of IS estimators ","element":"span"},{"text":"including weighted IS estimators and doubly robust estimators (","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"Murphy et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"2001","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":11,"text":"Hirano ","element":"a"},{"href":"#id-17","referenceIndex":11,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":11,"text":"2003","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"Dud´ık et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"2011","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"Jiang & Li","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"2016","element":"a"},{"text":").","element":"span"}],[{"text":"Even though IS-based off-policy evaluation methods possess a lot of advantages (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g. ","element":"span"},{"text":"unbiasedness), the variance of the cumulative importance ratios ","element":"span"},{"style":{"height":12},"width":60.9,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/3-8.png","element":"img","alt":" ρ1:t","inline":true,"padRight":true},{"text":"may grow exponentially as the horizon goes long. 
Attempts to break the barriers of horizon have been tried using model-based approaches (","element":"span"},{"href":"#id-18","referenceIndex":23,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":23,"text":"2018b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-19","referenceIndex":9,"text":"Gottesman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":9,"text":"2019","element":"a"},{"text":"), which builds the whole MDP using either parametric or nonparametric models for estimating the value of target policy. (","element":"span"},{"href":"#id-0","referenceIndex":22,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":22,"text":"2018a","element":"a"},{"text":") considers breaking the curse of horizon of time-invariant MDPs by deploying importance sampling on the average visitation distribution of state-action pairs, (","element":"span"},{"href":"#id-20","referenceIndex":10,"text":"Hallak & Mannor","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":10,"text":"2017","element":"a"},{"text":") considers leveraging the stationary ratio of state-action pairs to replace the trajectory weights in an online fashion and (","element":"span"},{"href":"#id-21","referenceIndex":8,"text":"Gelada & Bellemare","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":8,"text":"2019","element":"a"},{"text":") further applies the same idea in the deep reinforcement learning regime. 
Recently, (","element":"span"},{"href":"#id-22","referenceIndex":16,"text":"Kallus ","element":"a"},{"href":"#id-22","referenceIndex":16,"text":"& Uehara","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":16,"text":"2019a","element":"a"},{"text":",","element":"span"},{"href":"#id-23","referenceIndex":17,"text":"b","element":"a"},{"text":") propose double reinforcement learning (DRL), which is based on doubly robust estimator with cross-fold estimation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"-functions and marginalized density ratios. It was shown that DRL is asymptotically efficient when both components are estimated at fourth-root rates, however no finite sample error bounds are given.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Our goal. ","element":"span"},{"text":"In this paper, our goal is to obtain the optimality of IS-based methods through marginalized importance sampling (MIS). As an earlier attempt, ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") constructs MIS estimator by aggregating all trajectories that share the same state transition patterns to directly estimate the state distribution shifts after the change of policies from the behavioral to the target. However, as pointed out by ","element":"span"},{"href":"#id-22","referenceIndex":16,"text":"Kallus & Uehara ","element":"a"},{"text":"(","element":"span"},{"href":"#id-22","referenceIndex":16,"text":"2019a","element":"a"},{"text":") and Remark 4 in ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":"), the MSE upper bound of MIS estimator is asymptotically inefficient by a multiplicative factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") conjectures that the lower bound is not achievable in their infinite action setting. To bridge the gap and ultimately achieve the ","element":"span"},{"text":"optimality, we consider the Tabular MDPs, where both the state space and action space are finite (","element":"span"},{"style":{"height":17.6},"width":655.92,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/4-0.png","element":"img","alt":"i.e. S = |S| < ∞, A = |A| < ∞","inline":true},{"text":") and each state-action pair can be visited frequently as long as the logging policy ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/4-1.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"does sufficient exploration (which corresponds to Assumption ","element":"span"},{"href":"#id-24","text":"2.2","element":"a"},{"text":"). Under the Tabular MDP setting, we can show the MSE upper bound of MIS estimator matches the Cramer-Rao lower bound provided by ","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"Jiang & Li ","element":"a"},{"text":"(","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"2016","element":"a"},{"text":") by incorporating frequent action observability. 
To distinguish the two, throughout the rest of the paper we call the modified MIS estimator Tabular-MIS (TMIS) and the MIS estimator in ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") State-MIS (SMIS).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Summary of results.","element":"span"}],[{"text":"This work considers the problem of off-policy evaluation for a finite horizon, nonstationary, episodic MDP under the tabular MDP setting. We propose and analyze the Tabular-MIS estimator, which closes the gap between the Cramer-Rao lower bound provided by ","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"Jiang & Li ","element":"a"},{"text":"(","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"2016","element":"a"},{"text":") (on the variance of any unbiased estimator for a simplified setting of a nonstationary episodic MDP) and the MSE upper bound of the State-MIS estimator (","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":"). We also provide a high probability result by introducing a data-splitting type Tabular-MIS estimator, which retains the asymptotic efficiency while having an exponential tail. 
To the best of our knowledge, Split-TMIS is the first IS-based estimator in OPE that achieves asymptotic sample efficiency while having finite sample guarantees in high probability.","element":"span"}],[{"text":"Moreover, the calculation of the Tabular-MIS and Split-TMIS estimators does not explicitly incorporate the importance weights, which in turn implies our off-policy evaluation algorithm can be implemented without needing to know logging probabilities ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/4-2.png","element":"img","alt":" µ","inline":true},{"text":". This logging-policy-free feature makes our TMIS estimator more practical in real-world applications.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Key proof ingredients. ","element":"span"},{"text":"We use a modified version of the fictitious estimator of ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") as the bridge to connect our estimator ","element":"span"},{"style":{"height":17.11},"width":260.94,"height":42.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/4-3.png","element":"img","alt":" v̂πTMIS with vπ","inline":true},{"text":". Different from ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") ","element":"span"},{"text":"who directly analyzes transition dynamic ","element":"span"},{"style":{"height":19.08},"width":224.7,"height":47.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/4-4.png","element":"img","alt":"�P πt+1(st+1|st","inline":true},{"text":"), we need to do a finer decomposition ","element":"span"},{"style":{"height":19.08},"width":225.37,"height":47.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/4-5.png","element":"img","alt":"�P πt+1(st+1|st","inline":true},{"text":") = ","element":"span"},{"style":{"height":19.95},"width":513.92,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/4-6.png","element":"img","alt":" �at �Pt+1(st+1|st, at)π(at|st","inline":true},{"text":") and analyze ","element":"span"},{"style":{"height":17.6},"width":282.05,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/4-7.png","element":"img","alt":" �Pt+1(st+1|st, at","inline":true},{"text":"). Also, Bellman ","element":"span"},{"text":"equations are leveraged for expressing the variance of TMIS recursively. 
To derive the high probability bound, we design the data-splitting TMIS, which not only matches perfectly with the standard concentration inequalities but also maintains an MSE of the same order as TMIS for an appropriately chosen data-splitting batch size.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Other related work","element":"span"}],[{"text":"Markov Decision Processes have a long history of associated research (","element":"span"},{"href":"#id-25","referenceIndex":27,"text":"Puterman","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":27,"text":"1994","element":"a"},{"text":"; ","element":"span"},{"href":"#id-26","referenceIndex":30,"text":"Sutton & Barto","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":30,"text":"1998","element":"a"},{"text":"), but many theoretical problems in the basic tabular setting remain ","element":"span"},{"text":"an active area of research as of today. We briefly review other settings of this problem and connect them to our results.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Regret bound and sample complexity in the online setting. ","element":"span"},{"text":"The bulk of existing work focuses on online learning, where the agent interacts with the MDP with the goal of identifying the optimal policy or minimizing the regret against the optimal policy. 
The optimal regret is obtained by (","element":"span"},{"href":"#id-27","referenceIndex":3,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":3,"text":"2017","element":"a"},{"text":") using a model-based approach, which translates into a sample complexity bound of ","element":"span"},{"style":{"height":19.13},"width":228.41,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-0.png","element":"img","alt":" O(H3SA/ϵ2","inline":true},{"text":"), which matches the lower bound of Ω(","element":"span"},{"style":{"height":19.13},"width":213.22,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-1.png","element":"img","alt":"H3SA/ϵ2)(","inline":true},{"href":"#id-28","referenceIndex":2,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":2,"text":"2013","element":"a"},{"text":"). The method is, however, not “uniform PAC”, for which the state-of-the-art sample complexity remains ","element":"span"},{"style":{"height":19.13},"width":229.04,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-2.png","element":"img","alt":" O(H4SA/ϵ2","inline":true},{"text":") (","element":"span"},{"href":"#id-29","referenceIndex":5,"text":"Dann et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":5,"text":"2017","element":"a"},{"text":"). Model-free approaches that require a space constraint of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"HSA","element":"span"},{"text":") were studied by ","element":"span"},{"href":"#id-30","referenceIndex":15,"text":"Jin et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-30","referenceIndex":15,"text":"2018","element":"a"},{"text":"), which implies a sample complexity bound of ","element":"span"},{"style":{"height":19.13},"width":259.65,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-3.png","element":"img","alt":" O(H4SA/ϵ2).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Sample complexity with a generative model. ","element":"span"},{"text":"Another line of work assumes access to a generative model where one can sample from ","element":"span"},{"style":{"height":15.42},"width":650.02,"height":38.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-4.png","element":"img","alt":" st+1 and rt given any st, at in time","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) (","element":"span"},{"href":"#id-31","referenceIndex":19,"text":"Kearns & Singh","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":19,"text":"1999","element":"a"},{"text":"). ","element":"span"},{"href":"#id-32","referenceIndex":28,"text":"Sidford et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-32","referenceIndex":28,"text":"2018","element":"a"},{"text":") are the first to establish the optimal sample complexity of ","element":"span"},{"text":"˜","element":"span"},{"text":"Θ(","element":"span"},{"style":{"height":19.14},"width":177.28,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-5.png","element":"img","alt":"H3SA/ϵ2","inline":true},{"text":") under this setting (counting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"generative model calls as one episode). ","element":"span"},{"href":"#id-33","referenceIndex":1,"text":"Agarwal et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-33","referenceIndex":1,"text":"2019","element":"a"},{"text":") establish similar results by estimating the parameters of the MDP model using maximum-likelihood estimation.","element":"span"}],[{"text":"Our setting is different in two ways. First, we consider a fixed pair of logging and target policies ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-6.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-7.png","element":"img","alt":" π","inline":true},{"text":", so our bounds can depend explicitly on ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-8.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-9.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"instead of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S, A","element":"span"},{"text":". Second, we do not have either online access to the environment (to change policies) or a generative model. 
Our high-probability bound, combined with a direct union bound argument, implies a sample complexity of ","element":"span"},{"text":"˜","element":"span"},{"style":{"height":19.13},"width":247.77,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-10.png","element":"img","alt":"O(H3S2A/ϵ2","inline":true},{"text":") for identifying the optimal policy, which is suboptimal up to a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":", but notably has the optimal dependence on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". We remark that achieving the optimal dependence on the planning horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"(or the discounting factor (1 ","element":"span"},{"style":{"height":19.13},"width":181.43,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/5-11.png","element":"img","alt":" − γ)−1 in","inline":true,"padRight":true},{"text":"the infinite horizon case) is generally tricky (see, e.g., the COLT open problem (","element":"span"},{"href":"#id-34","referenceIndex":13,"text":"Jiang & ","element":"a"},{"href":"#id-34","referenceIndex":13,"text":"Agarwal","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":13,"text":"2018","element":"a"},{"text":") for more details). The current paper is among the few instances where we know how to obtain the optimal parameters.","element":"span"}],[{"text":"Finally, we acknowledge that tabular RL is a basic abstraction that is relatively far from real applications, which might have unobserved states, continuous states, or non-zero Bellman error in the value function approximation. 
We leave generalization of the techniques in this paper to these more practical settings as future work.","element":"span"}]]},{"heading":"2 Method","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Problem description","element":"span"}],[{"text":"In addition to the non-stationary, finite horizon tabular MDP ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= (","element":"span"},{"style":{"height":16},"width":289.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-0.png","element":"img","alt":"S, A, r, P, d1, H","inline":true},{"text":") (where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":":= ","element":"span"},{"style":{"height":17.6},"width":168.34,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-1.png","element":"img","alt":" |S| < ∞","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":":= ","element":"span"},{"style":{"height":17.6},"width":173.49,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-2.png","element":"img","alt":" |A| < ∞","inline":true},{"text":"), non-stationary logging policy ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-3.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and target policy ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-4.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"in Section ","element":"span"},{"text":"1","element":"span"},{"text":", we denote 
","element":"span"},{"style":{"height":18.72},"width":42.71,"height":46.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-5.png","element":"img","alt":" dµt","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":11.2},"width":89.14,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-6.png","element":"img","alt":"st, at","inline":true},{"text":") and ","element":"span"},{"style":{"height":16.72},"width":42.71,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-7.png","element":"img","alt":" dπt","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":11.2},"width":89.14,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-8.png","element":"img","alt":"st, at","inline":true},{"text":") the induced joint state-action ","element":"span"},{"text":"distribution at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and the state distribution counterparts ","element":"span"},{"style":{"height":18.72},"width":307.88,"height":46.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-9.png","element":"img","alt":" dµt (st) and dπt (st","inline":true},{"text":"), satisfying ","element":"span"},{"style":{"height":19.13},"width":537.87,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-10.png","element":"img","alt":"dπt (st, at) = dπt (st)·π(at|st).4","inline":true,"padRight":true},{"text":"The initial distributions are identical ","element":"span"},{"style":{"height":19.09},"width":245.1,"height":47.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-11.png","element":"img","alt":" dµ1 = dπ1 = d1","inline":true},{"text":". 
Moreover, ","element":"span"},{"text":"we use ","element":"span"},{"style":{"height":22.42},"width":394.4,"height":56.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-12.png","element":"img","alt":" P πi,j ∈ RS×S, ∀j < i","inline":true,"padRight":true},{"text":"to represent the state transition probability from step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"to ","element":"span"},{"text":"step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"under policy ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-13.png","element":"img","alt":" π","inline":true},{"text":", where ","element":"span"},{"style":{"height":20.27},"width":189.23,"height":50.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-14.png","element":"img","alt":" P πt+1,t(s′|s","inline":true},{"text":") = ","element":"span"},{"style":{"height":18.36},"width":434.36,"height":45.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-15.png","element":"img","alt":" ∑a Pt+1,t(s′|s, a)πt(a|s","inline":true},{"text":"). 
The marginal state ","element":"span"},{"text":"distribution vector ","element":"span"},{"style":{"height":20.28},"width":569.5,"height":50.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-16.png","element":"img","alt":" dπt (·) satisfies dπt = P πt,t−1dπt−1.","inline":true}],[{"text":"Historical data ","element":"span"},{"style":{"height":39.97},"width":490.82,"height":99.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-17.png","element":"img","alt":" D = {(s(i)t , a(i)t , r(i)t )}t∈[H]i∈[n] ","inline":true,"padRight":true},{"text":"was obtained by logging policy ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-18.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and we can only ","element":"span"},{"text":"use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"to estimate the value of target policy ","element":"span"},{"style":{"height":15.13},"width":174.35,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-19.png","element":"img","alt":" π, i.e. vπ","inline":true},{"text":". 
Suppose we only assume knowledge about ","element":"span"},{"style":{"height":17.6},"width":99.54,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-20.png","element":"img","alt":" π(a|s","inline":true},{"text":") for all (","element":"span"},{"style":{"height":17.6},"width":783.42,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-21.png","element":"img","alt":"s, a) ∈ S × A and do not observe rt(st, at","inline":true},{"text":") for any actions other than the noisy immediate reward ","element":"span"},{"style":{"height":23.42},"width":58.6,"height":58.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-22.png","element":"img","alt":" r(i)t","inline":true,"padRight":true},{"text":"after observing ","element":"span"},{"style":{"height":23.42},"width":140.49,"height":58.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-23.png","element":"img","alt":" s(i)t , a(i)t","inline":true,"padRight":true},{"text":". The goal is to find an estimator ","element":"span"},{"text":"to minimize the mean-square error (MSE):","element":"span"}],[{"style":{"width":"35%"},"width":605,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-24.png","element":"img"}],[{"id":"id-41","style":{"fontWeight":"bold"},"text":"Assumption 2.1 ","element":"span"},{"text":"(Bounded rewards)","element":"span"},{"style":{"height":23.42},"width":916.81,"height":58.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-25.png","element":"img","alt":". 
∀ t = 1, ..., H and i = 1, ..., n, 0 ≤ r(i)t ≤ Rmax.","inline":true}],[{"text":"The bounded reward assumption can be relaxed to: ","element":"span"},{"style":{"height":15.6},"width":322.62,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-26.png","element":"img","alt":" ∃Rmax, σ < +∞","inline":true,"padRight":true},{"text":"such that 0 ","element":"span"},{"style":{"height":13.6},"width":34,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-27.png","element":"img","alt":" ≤","inline":true},{"style":{"height":17.6},"width":456.09,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-28.png","element":"img","alt":"E[rt|st, at, st+1] ≤ Rmax","inline":true},{"text":", Var[","element":"span"},{"style":{"height":19.13},"width":361.86,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-29.png","element":"img","alt":"rt|st, at, st+1] ≤ σ2","inline":true,"padRight":true},{"text":"(as in ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":")), for achieving the Cramer-Rao lower bound. However, boundedness becomes essential for applying concentration inequalities when deriving high-probability bounds.","element":"span"}],[{"id":"id-24","style":{"fontWeight":"bold"},"text":"Assumption 2.2 ","element":"span"},{"text":"(Sufficient exploration)","element":"span"},{"style":{"height":19.11},"width":943.5,"height":47.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-30.png","element":"img","alt":". 
Logging policy µ obeys that dm := mint,st dµt (st) >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-35","style":{"width":"102%"},"width":1764,"height":316,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-31.png","element":"img"}],[{"text":"Assumption ","element":"span"},{"href":"#id-35","text":"2.3 ","element":"a"},{"text":"is also necessary for discrete states and actions, as otherwise the second moment of the importance weights would be unbounded and the MSE of the estimators would become intractable. The bound on ","element":"span"},{"style":{"height":10.22},"width":35.08,"height":25.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-32.png","element":"img","alt":" τs","inline":true,"padRight":true},{"text":"is natural since ","element":"span"},{"style":{"height":26.51},"width":715.89,"height":66.26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/6-33.png","element":"img","alt":" τs ≤ maxt,st 1/dµt (st) = 1/mint,st dµt (st) = 1/dm","inline":true,"padRight":true},{"text":"and it is finite by Assumption ","element":"span"},{"href":"#id-24","text":"2.2","element":"a"},{"text":"; similarly, ","element":"span"},{"style":{"height":12.22},"width":141.37,"height":30.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-0.png","element":"img","alt":" τa < ∞","inline":true,"padRight":true},{"text":"is also automatically satisfied if min","element":"span"},{"style":{"height":18.22},"width":293.24,"height":45.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-1.png","element":"img","alt":"t,st,at µ(at|st) >","inline":true,"padRight":true},{"text":"0. 
Finally, as we will see in the results, explicit dependence on ","element":"span"},{"style":{"height":10.8},"width":93.2,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-2.png","element":"img","alt":" τs, τa","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.02},"width":52.71,"height":37.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-3.png","element":"img","alt":" dm","inline":true,"padRight":true},{"text":"appears only in the low-order terms of the error bound.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Tabular-MIS estimator","element":"span"}],[{"text":"To overcome the barrier caused by cumulative importance weights in IS-type estimators, marginalized importance sampling directly estimates the marginalized state visitation distribution ","element":"span"},{"style":{"height":15.02},"width":34.71,"height":37.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-4.png","element":"img","alt":"d̂t","inline":true,"padRight":true},{"text":"and defines the MIS estimator:","element":"span"}],[{"id":"id-36","style":{"width":"99%"},"width":1711,"height":220,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-5.png","element":"img"}],[{"style":{"height":5.6},"width":21,"height":14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-6.png","element":"img","alt":"n","inline":true,"padRight":true},{"text":"whenever ","element":"span"},{"style":{"height":13.81},"width":106,"height":34.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-7.png","element":"img","alt":" nst >","inline":true,"padRight":true},{"text":"0 and 
","element":"span"},{"style":{"height":16.71},"width":42.71,"height":41.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-8.png","element":"img","alt":"d̂πt","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":18.72},"width":116.52,"height":46.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-9.png","element":"img","alt":"st)/d̂µt","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":10.62},"width":32.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-10.png","element":"img","alt":"st","inline":true},{"text":") = 0 when ","element":"span"},{"style":{"height":12.21},"width":52.86,"height":30.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-11.png","element":"img","alt":" nst","inline":true,"padRight":true},{"text":"= 0. Then the MIS estimator (","element":"span"},{"href":"#id-36","text":"1","element":"a"},{"text":") ","element":"span"},{"text":"becomes:","element":"span"}],[{"id":"id-37","style":{"width":"64%"},"width":1108,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-12.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Construction of State-MIS estimator. ","element":"span"},{"text":"Based on the estimated marginal state transition ","element":"span"},{"style":{"height":16.71},"width":42.71,"height":41.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-13.png","element":"img","alt":"d̂πt","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.08},"width":134.96,"height":42.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-14.png","element":"img","alt":" P̂ πt d̂πt−1","inline":true},{"text":", the State-MIS estimator in ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") directly estimates the state ","element":"span"},{"text":"transition ","element":"span"},{"style":{"height":17.6},"width":196.47,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-15.png","element":"img","alt":" P πt (st|st−1","inline":true},{"text":") and state reward ","element":"span"},{"style":{"height":17.6},"width":177.65,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-16.png","element":"img","alt":" rπt (st) as:","inline":true}],[{"style":{"width":"77%"},"width":1337,"height":285,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-17.png","element":"img"}],[{"text":"State-MIS estimator directly constructs state transitions ","element":"span"},{"style":{"height":16.32},"width":54.08,"height":40.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-18.png","element":"img","alt":"�P πt","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":17.6},"width":122.83,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-19.png","element":"img","alt":"st|st−1","inline":true},{"text":") without explicitly ","element":"span"},{"text":"modeling actions. Therefore, it is still valid when action space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is unbounded. 
However, importance weights must be explicitly used to compensate for the discrepancy between ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-20.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-21.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"and knowledge of ","element":"span"},{"style":{"height":17.6},"width":99.45,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/7-22.png","element":"img","alt":" µ(a|s","inline":true},{"text":") at each state-action pair (","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") is required.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Construction of Tabular-MIS estimator. ","element":"span"},{"text":"Since the tabular MDP setting assumes finite states and actions, we can go beyond importance weights and construct empirical estimates for ","element":"span"},{"style":{"height":17.6},"width":625.06,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-0.png","element":"img","alt":"P̂t+1(st+1|st, at) and r̂t(st, at) as:","inline":true}],[{"id":"id-38","style":{"width":"82%"},"width":1422,"height":280,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-1.png","element":"img"}],[{"text":"where we set ","element":"span"},{"style":{"height":18.22},"width":1171.78,"height":45.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-2.png","element":"img","alt":"P̂t+1(st+1|st, at) = 0 and r̂t(st, at) = 0 if nst,at = 0, with nst,at","inline":true,"padRight":true},{"text":"the empirical visitation frequency of the state-action pair 
(","element":"span"},{"style":{"height":11.2},"width":89.14,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-3.png","element":"img","alt":"st, at","inline":true},{"text":") at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". The corresponding estimators of ","element":"span"},{"style":{"height":17.6},"width":407.67,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-4.png","element":"img","alt":"P̂ πt (st|st−1) and r̂πt (st","inline":true},{"text":") are defined as:","element":"span"}],[{"id":"id-39","style":{"width":"75%"},"width":1295,"height":223,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-5.png","element":"img"}],[{"text":"In conclusion, by using the same estimator for ","element":"span"},{"style":{"height":19.27},"width":381.6,"height":48.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-6.png","element":"img","alt":"d̂µt , v̂πTMIS and v̂πSMIS ","inline":true,"padRight":true},{"text":"share the same form as ","element":"span"},{"text":"(","element":"span"},{"href":"#id-37","text":"2","element":"a"},{"text":"). 
However, the Tabular-MIS estimator constructs a different estimate of the component ","element":"span"},{"style":{"height":16.72},"width":42.72,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-7.png","element":"img","alt":"d̂πt","inline":true,"padRight":true},{"text":"through (","element":"span"},{"href":"#id-38","text":"3","element":"a"},{"text":")-(","element":"span"},{"href":"#id-39","text":"4","element":"a"},{"text":") by leveraging the fact that each state-action pair is visited frequently under the tabular setting.","element":"span"}],[{"text":"The motivation of MIS-type estimators comes from the fact that we have a nonstationary MDP model and its underlying state marginal transition follows ","element":"span"},{"style":{"height":16.71},"width":42.72,"height":41.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-8.png","element":"img","alt":" dπt","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.08},"width":134.95,"height":42.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-9.png","element":"img","alt":" P πt dπt−1","inline":true},{"text":". ","element":"span"},{"text":"The MIS estimators are then obtained by using the corresponding plug-in estimator for each component (","element":"span"},{"style":{"height":16.72},"width":124.27,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-10.png","element":"img","alt":"i.e. 
d̂πt","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16.72},"width":42.71,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-11.png","element":"img","alt":" dπt","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":16.32},"width":54.08,"height":40.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-12.png","element":"img","alt":" P̂ πt","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16.32},"width":54.08,"height":40.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-13.png","element":"img","alt":" P πt","inline":true,"padRight":true},{"text":"). On the other hand, IS-type estimators ","element":"span"},{"text":"construct the value estimate in a more straightforward way without needing to estimate the transition dynamics (","element":"span"},{"href":"#id-40","referenceIndex":24,"text":"Mahmood et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":24,"text":"2014","element":"a"},{"text":"). Therefore, in this sense, MIS-type estimators are essentially model-based estimators with the model of the interactive environment ","element":"span"},{"style":{"height":17.6},"width":441.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-14.png","element":"img","alt":"M = (S, A, r, P, d1, H).","inline":true}]]},{"heading":"3 Main Results","paragraphs":[[{"text":"We now show that our Tabular-MIS estimator achieves the asymptotic Cramer-Rao lower bound for DAG-MDP (","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"Jiang & Li","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":14,"text":"2016","element":"a"},{"text":") and therefore is asymptotically sample efficient. 
To formalize our statement, we pre-specify the following boundary conditions: ","element":"span"},{"style":{"height":17.6},"width":158.23,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-15.png","element":"img","alt":" r0(s0) ≡","inline":true,"padRight":true},{"text":"0, ","element":"span"},{"style":{"height":30.09},"width":865.16,"height":75.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-16.png","element":"img","alt":"σ0(s0, a0) ≡ 0, dπ0 (s0)dµ0 (s0) ≡ 1, π(a0|s0)µ(a0|s0) ≡ 1, V πH+1 ≡","inline":true,"padRight":true},{"text":"0, and, as a reminder, ","element":"span"},{"style":{"height":27.65},"width":411.81,"height":69.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-17.png","element":"img","alt":" τa := maxt,st,atπ(at|st)µ(at|st)","inline":true,"padRight":true},{"id":"id-55","text":"and ","element":"span"},{"style":{"height":29.53},"width":362.84,"height":73.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/8-18.png","element":"img","alt":" τs := maxt,stdπt (st)dµt (st).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"episodic historical data ","element":"span"},{"style":{"height":38.37},"width":561.38,"height":95.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-0.png","element":"img","alt":" D = �(s(i)t , a(i)t , r(i)t )�t=1,...,Hi=1,...,n","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"obtained by running a logging policy ","element":"span"},{"style":{"height":15.6},"width":150.32,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-1.png","element":"img","alt":" µ and π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the new target policy which we want to test. If the number of episodes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfies","element":"span"}],[{"style":{"width":"60%"},"width":1043,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"then under Assumption ","element":"span"},{"href":"#id-41","style":{"fontStyle":"italic"},"text":"2.1","element":"a"},{"style":{"fontStyle":"italic"},"text":"-","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"2.3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"our Tabular-MIS estimator ","element":"span"},{"style":{"height":17.1},"width":107.83,"height":42.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-3.png","element":"img","alt":" �vπTMIS ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"has the following Mean- ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Square-Error upper 
bound:","element":"span"}],[{"style":{"width":"109%"},"width":1873,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-4.png","element":"img"}],[{"style":{"height":35.78},"width":64.72,"height":89.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-5.png","element":"img","alt":"≤ 1n","inline":true}],[{"text":"+","element":"span"},{"style":{"height":30.94},"width":192.74,"height":77.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-6.png","element":"img","alt":"O(τ 2aτsH3","inline":true}],[{"style":{"width":"93%"},"width":1606,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where the value function under ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-8.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is defined as: ","element":"span"},{"style":{"height":17.31},"width":55.15,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-9.png","element":"img","alt":" V πh","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":10.84},"width":40.46,"height":27.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-10.png","element":"img","alt":"sh","inline":true},{"text":") := ","element":"span"},{"style":{"height":32.7},"width":589.48,"height":81.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-11.png","element":"img","alt":" Eπ��Ht=h r(1)t ��s(1)h = sh�, ∀h ∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., 
H","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The proof of this theorem, and all the other technical results we present in this section, are deferred to the appendix due to the space constraint. We summarize the novel ingredients in the proof in Section ","element":"span"},{"href":"#id-42","text":"3.1","element":"a"},{"text":". Before that, we make a few remarks about a few interesting aspects of this result.","element":"span"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"Remark 3.2 ","element":"span"},{"text":"(Asymptotic efficiency and local minimaxity)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The error bound implies that ","element":"span"},{"text":"lim","element":"span"},{"style":{"height":19.91},"width":469.63,"height":49.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-12.png","element":"img","alt":"n→∞ n · E[(�vπTMIS − vπ)2]","inline":true}],[{"style":{"width":"56%"},"width":971,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"This exactly matches the CR-lower bound in ","element":"span"},{"href":"#id-8","referenceIndex":14,"style":{"fontStyle":"italic"},"text":"Jiang & Li ","element":"a"},{"style":{"fontStyle":"italic"},"text":"(","element":"span"},{"href":"#id-8","referenceIndex":14,"style":{"fontStyle":"italic"},"text":"2016","element":"a"},{"style":{"fontStyle":"italic"},"text":", Proposition 3) for DAGMDP","element":"span"},{"style":{"height":8.4},"width":17,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-14.png","element":"img","alt":"5","inline":true},{"style":{"fontStyle":"italic"},"text":". 
In contrast, the State-MIS estimator in (","element":"span"},{"href":"#id-1","referenceIndex":38,"style":{"fontStyle":"italic"},"text":"Xie et al.","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-1","referenceIndex":38,"style":{"fontStyle":"italic"},"text":"2019","element":"a"},{"style":{"fontStyle":"italic"},"text":") achieves an asymptotic MSE of","element":"span"}],[{"style":{"width":"79%"},"width":1371,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/9-15.png","element":"img"}],[{"text":"We note that while in classical literature CR-lower bound is often used to lower bound the variance of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"unbiased ","element":"span"},{"text":"estimators, the modern theory of estimation establishes that it is also the correct asymptotic minimax lower bound for the MSE of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"text":"estimators in every local neighborhood of the parameter space (see, e.g., ","element":"span"},{"href":"#id-43","referenceIndex":36,"text":"Van der Vaart","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":36,"text":"2000","element":"a"},{"text":", Chapter 8). 
In other ","element":"span"},{"text":"words, our results imply that the Tabular-MIS estimator is asymptotically, locally, uniformly minimax optimal, namely, optimal for every problem instance separately.","element":"span"}],[{"text":"It is worth pointing out that while asymptotically efficient estimators for this problem in related settings have been proposed in independent recent work (","element":"span"},{"href":"#id-22","referenceIndex":16,"text":"Kallus & Uehara","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":16,"text":"2019a","element":"a"},{"text":",","element":"span"},{"href":"#id-23","referenceIndex":17,"text":"b","element":"a"},{"text":"), our estimator is the first that comes with finite-sample guarantees with an explicit expression for the low-order terms. Moreover, our estimator demonstrates that doubly robust estimation techniques are not essential for achieving asymptotic efficiency.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 3.3 ","element":"span"},{"text":"(Simplified finite sample error bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The theory implies that there are universal constants ","element":"span"},{"style":{"height":15.6},"width":117.7,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-0.png","element":"img","alt":" C1, C2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":20.22},"width":222.52,"height":50.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-1.png","element":"img","alt":" n ≥ C1H τadm","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", i.e., when we have just visited ","element":"span"},{"style":{"fontStyle":"italic"},"text":"every state-action pair for ","element":"span"},{"text":"Ω(","element":"span"},{"style":{"height":19.91},"width":918.36,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-2.png","element":"img","alt":"H) times, E[(�vπTMIS − vπ)2] = C2H2τaτsR2max/n.","inline":true}],[{"text":"In deriving the above remark, we used the somewhat surprising observation that","element":"span"}],[{"style":{"width":"57%"},"width":984,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-3.png","element":"img"}],[{"text":"Note that we are summing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"quantities that are potentially on the order of ","element":"span"},{"style":{"height":19.05},"width":155.02,"height":47.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-4.png","element":"img","alt":" H2R2max","inline":true},{"text":", yet ","element":"span"},{"text":"no additional factors of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"show up. 
This observation is folklore and has been used in deriving tight results for tabular RL in (e.g., ","element":"span"},{"href":"#id-27","referenceIndex":3,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":3,"text":"2017","element":"a"},{"text":"). It can be proven using the following decomposition of the variance of the empirical mean estimator and the fact that it is bounded by ","element":"span"},{"style":{"height":19.14},"width":212.55,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-5.png","element":"img","alt":" H2R2max/4.","inline":true}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Lemma 3.4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any policy ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-6.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and any MDP.","element":"span"}],[{"style":{"width":"60%"},"width":1042,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-7.png","element":"img"}],[{"text":"The proof, which applies the law-of-total-variance recursively, is deferred to the appendix.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 3.5 ","element":"span"},{"text":"(When ","element":"span"},{"style":{"height":17.6},"width":145.24,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-8.png","element":"img","alt":" π = µ).","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"One surprising observation is that Tabular-MIS estimator improves the efficiency even for the on-policy evaluation problem when ","element":"span"},{"style":{"height":12},"width":116.66,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-9.png","element":"img","alt":" π = 
µ","inline":true},{"style":{"fontStyle":"italic"},"text":". In other words, the natural Monte Carlo estimator of the reward in the on-policy evaluation problem is in fact asymptotically inefficient.","element":"span"}],[{"id":"id-42","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Building blocks of the analysis","element":"span"}],[{"text":"At a high level, the techniques we use, including the idea of a fictitious estimator and peeling the variance (expectation) of the fictitious estimator ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/10-10.png","element":"img","alt":" �vπ ","inline":true,"padRight":true},{"text":"from behind by repeatedly applying the law of total variance (expectation), are consistent with ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":").","element":"span"}],[{"text":"In addition to the above techniques, we leverage frequent state-action visitations in the design of the TMIS estimator, and on that basis we are able to achieve a lower asymptotic Mean Square Error (MSE) bound. 
The main components are the following.","element":"span"}],[{"id":"id-49","style":{"width":"101%"},"width":1733,"height":446,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-0.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-50","style":{"width":"80%"},"width":1374,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":12.8},"width":21,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-2.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is a parameter constrained by 0 ","element":"span"},{"style":{"height":13.2},"width":125.72,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-3.png","element":"img","alt":" < θ <","inline":true,"padRight":true},{"text":"1, which we will choose later in the proof.","element":"span"}],[{"text":"This slight modification makes ","element":"span"},{"style":{"height":17.11},"width":107.83,"height":42.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-4.png","element":"img","alt":" �vπTMIS ","inline":true,"padRight":true},{"text":"no longer implementable using the logging data ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", but ","element":"span"},{"text":"it does provide an unbiased estimator of ","element":"span"},{"href":"#id-44","style":{"height":17.2},"width":287.7,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-5.png","element":"img","alt":" vπ (Lemma B.5","inline":true,"padRight":true},{"text":"in the appendix) and, most importantly, it is easier to carry out theoretical analysis on ","element":"span"},{"style":{"height":17.11},"width":107.83,"height":42.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-6.png","element":"img","alt":" 
�vπTMIS","inline":true,"padRight":true},{"text":"than on ","element":"span"},{"style":{"height":17.11},"width":107.84,"height":42.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-7.png","element":"img","alt":" �vπTMIS","inline":true},{"text":". Moreover, the multiplicative ","element":"span"},{"text":"Chernoff bound (Lemma ","element":"span"},{"href":"#id-45","text":"A.2 ","element":"a"},{"text":"in the appendix) helps establish the connection between ","element":"span"},{"style":{"height":17.11},"width":193.66,"height":42.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-8.png","element":"img","alt":" �vπTMIS and","inline":true},{"style":{"height":17.11},"width":121.64,"height":42.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-9.png","element":"img","alt":"�vπTMIS.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Peeling arguments using the law of total variance (expectation). ","element":"span"},{"text":"The core idea in analyzing the variance of ","element":"span"},{"style":{"height":12.34},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-10.png","element":"img","alt":" �vπ ","inline":true,"padRight":true},{"text":"is to peel the variance from behind (from time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"down to 1), and the peeling tool we use marries the standard Bellman equations with the law of total variance. Lemma ","element":"span"},{"text":"B.2 ","element":"span"},{"text":"(in the appendix) illustrates this idea and is used repeatedly throughout the analysis. 
Beyond that, the peeling argument can be used to prove that the dependence on ","element":"span"},{"style":{"height":18.73},"width":250.54,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-11.png","element":"img","alt":" H is only H2 ","inline":true,"padRight":true},{"text":"for our Tabular-MIS estimator. This result shows that ","element":"span"},{"style":{"height":14.74},"width":56.82,"height":36.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-12.png","element":"img","alt":" H2 ","inline":true,"padRight":true},{"text":"is enough for TMIS to evaluate a particular policy, in contrast to SMIS, which in general requires a dependence of ","element":"span"},{"style":{"height":14.73},"width":56.82,"height":36.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-13.png","element":"img","alt":" H3 ","inline":true,"padRight":true},{"text":"for off-policy evaluation.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"A high-probability bound with data-splitting TMIS.","element":"span"}],[{"text":"The Tabular-MIS estimator provides the asymptotically optimal variance bound of order ","element":"span"},{"style":{"height":19.13},"width":236.88,"height":47.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/11-14.png","element":"img","alt":" O(H2SA/n)","inline":true,"padRight":true},{"text":"so it is natural to ask the related learning question: can TMIS further achieve a high-probability bound with the same sample complexity? We find that standard concentration inequalities (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g. ","element":"span"},{"text":"Hoeffding’s inequality, Bernstein’s inequality) cannot be applied directly because of the highly correlated structure of the ","element":"span"},{"text":"Tabular-MIS estimator. 
To address this problem we design the following data split version of TMIS and as we will see, the original TMIS is essentially a special case of data-splitting TMIS.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Data splitting Tabular-MIS estimator. ","element":"span"},{"text":"Assume the total number of episodes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"can be factorized as ","element":"span"},{"style":{"height":12.4},"width":203.49,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-0.png","element":"img","alt":" n = M · N","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M, N > ","element":"span"},{"text":"1 are two integers,","element":"span"},{"style":{"height":8.4},"width":17,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-1.png","element":"img","alt":"6","inline":true,"padRight":true},{"text":"and we can partition the data ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"folds with each fold ","element":"span"},{"style":{"height":20.33},"width":470.22,"height":50.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-2.png","element":"img","alt":" D(i) (i = 1, ..., N) has M","inline":true,"padRight":true},{"text":"different episodes, or in other words, we split the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"episodes evenly. Then by the i.i.d. 
nature of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"episodes, we have ","element":"span"},{"style":{"height":19.13},"width":346.17,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-3.png","element":"img","alt":"D(1), D(2), ..., D(N) ","inline":true,"padRight":true},{"text":"are independent collections.","element":"span"}],[{"text":"For each ","element":"span"},{"style":{"height":15.93},"width":72.58,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-4.png","element":"img","alt":" D(i)","inline":true},{"text":", we can create a Tabular-MIS estimator ","element":"span"},{"style":{"height":24.29},"width":107.84,"height":60.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-5.png","element":"img","alt":" �vπ(i)TMIS","inline":true,"padRight":true},{"text":"(for notation simplicity ","element":"span"},{"text":"we use ","element":"span"},{"style":{"height":21.16},"width":58.86,"height":52.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-6.png","element":"img","alt":" �vπ(i)","inline":true,"padRight":true},{"text":"to denote ","element":"span"},{"style":{"height":24.29},"width":107.84,"height":60.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-7.png","element":"img","alt":" �vπ(i)TMIS","inline":true,"padRight":true},{"text":"in the future discussions) using its own ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"episodes. 
Then ","element":"span"},{"style":{"height":21.16},"width":305,"height":52.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-8.png","element":"img","alt":"�vπ(1), �vπ(2), ..., �vπ(N) ","inline":true,"padRight":true},{"text":"are independent of each other and we can use the empirical mean to define ","element":"span"},{"text":"the data splitting Tabular-MIS estimator and the corresponding fictitious version:","element":"span"}],[{"style":{"width":"99%"},"width":1711,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-9.png","element":"img"}],[{"text":"The data splitting TMIS estimator explicitly characterizes the independence of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"different episodes by grouping them into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"chunks. The chunks are independent of each other, and averaging over all ","element":"span"},{"style":{"height":21.16},"width":58.86,"height":52.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-10.png","element":"img","alt":" �vπ(i)","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., N","element":"span"},{"text":") makes it valid to apply concentration ","element":"span"},{"text":"inequalities.","element":"span"}],[{"text":"More importantly, the data splitting TMIS estimator attains the same information-theoretic variance lower bound as the non-data-splitting TMIS estimator, which is not surprising since the non-data-splitting TMIS estimator is just the special case of the data splitting Tabular-MIS estimator with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= 1. 
This idea is summarized in the following theorem:","element":"span"}],[{"id":"id-46","style":{"width":"115%"},"width":1975,"height":294,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-11.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Remark 3.7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The condition in Theorem ","element":"span"},{"href":"#id-46","style":{"fontStyle":"italic"},"text":"3.6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"is achievable. For example, if we choose ","element":"span"},{"style":{"height":17.6},"width":180.83,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-12.png","element":"img","alt":" M ≈ √n,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then the condition holds when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is sufficiently large.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"High probability bound. ","element":"span"},{"text":"By coupling the data splitting technique with the boundedness of the Tabular-MIS estimator (","element":"span"},{"style":{"height":15.2},"width":577.45,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/12-13.png","element":"img","alt":"i.e. 
�vπ ≤ HRmax, �vπ ≤ HRmax","inline":true},{"text":", see Lemma ","element":"span"},{"href":"#id-47","text":"B.3 ","element":"a"},{"text":"in appendix), ","element":"span"},{"text":"we can apply concentration inequalities to show the difference between ","element":"span"},{"style":{"height":19.65},"width":85.13,"height":49.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-0.png","element":"img","alt":" �vπsplit","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-1.png","element":"img","alt":" vπ","inline":true,"padRight":true},{"text":"is ","element":"span"},{"text":"bounded by order ","element":"span"},{"style":{"height":20.8},"width":263.68,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-2.png","element":"img","alt":"�O(�H2SA/n","inline":true},{"text":"), which is summarized into the following theorem.","element":"span"}],[{"id":"id-48","style":{"fontWeight":"bold"},"text":"Theorem 3.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i.i.d. 
episodic historical data comes from a near-uniform logging policy ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-3.png","element":"img","alt":" µ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"style":{"fontStyle":"italic"},"text":", the number of episodes in each ","element":"span"},{"style":{"height":15.93},"width":72.58,"height":39.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-4.png","element":"img","alt":" D(i)","inline":true},{"style":{"fontStyle":"italic"},"text":", satisfies: ","element":"span"},{"style":{"height":17.6},"width":299.99,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-5.png","element":"img","alt":"�O(n · SA) ≥ M","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":17.6},"width":1024.45,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-6.png","element":"img","alt":" M > max [O(SA · Polylog(S, H, A, n, 1/δ)), O(Hτaτs)]","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then, with probability ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":12.8},"width":63.64,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-7.png","element":"img","alt":" − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", the data splitting Tabular-MIS estimator obeys:","element":"span"}],[{"style":{"width":"30%"},"width":515,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-8.png","element":"img"}],[{"text":"The proof of Theorem ","element":"span"},{"href":"#id-48","text":"3.8 ","element":"a"},{"text":"relies on bounding the difference between ","element":"span"},{"style":{"height":19.65},"width":85.13,"height":49.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-9.png","element":"img","alt":" �vπsplit","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.65},"width":85.13,"height":49.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-10.png","element":"img","alt":" �vπsplit","inline":true,"padRight":true},{"text":"using ","element":"span"},{"text":"the multiplicative Chernoff bound and bounding the difference between ","element":"span"},{"style":{"height":19.65},"width":85.13,"height":49.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-11.png","element":"img","alt":" �vπsplit","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-12.png","element":"img","alt":" vπ","inline":true,"padRight":true},{"text":"using ","element":"span"},{"text":"Bernstein’s inequality. 
While bounding ","element":"span"},{"style":{"height":20.92},"width":257.42,"height":52.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-13.png","element":"img","alt":" |�vπsplit − �vπsplit|","inline":true,"padRight":true},{"text":"we observe that a ","element":"span"},{"text":"stronger uniform bound can be derived. In fact, this bound is 0. We formalize it in the following lemma.","element":"span"}],[{"id":"id-74","style":{"fontWeight":"bold"},"text":"Lemma 3.9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i.i.d. episodic historical data comes from a near-uniform logging policy ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-14.png","element":"img","alt":" µ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"style":{"fontStyle":"italic"},"text":", the number of episodes in each ","element":"span"},{"style":{"height":15.93},"width":72.58,"height":39.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-15.png","element":"img","alt":" D(i)","inline":true},{"style":{"fontStyle":"italic"},"text":", satisfies: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M > ","element":"span"},{"text":"max [","element":"span"},{"style":{"height":17.6},"width":829.78,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-16.png","element":"img","alt":"O(SA · Polylog(S, H, A, N, 1/δ)), O(Hτaτs)]","inline":true},{"style":{"fontStyle":"italic"},"text":". Then we have, with probability ","element":"span"},{"text":"1 
","element":"span"},{"style":{"height":15.6},"width":77.68,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-17.png","element":"img","alt":" − δ,","inline":true}],[{"style":{"width":"24%"},"width":413,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Since ","element":"span"},{"style":{"height":12.4},"width":181.64,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-19.png","element":"img","alt":" n = N·M","inline":true},{"style":{"fontStyle":"italic"},"text":", therefore let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":", we have: if ","element":"span"},{"style":{"height":17.6},"width":991.98,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-20.png","element":"img","alt":" M > max [O(SA · Polylog(S, H, A, 1/δ)), O(Hτaτs)],","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then we have with probability ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":15.6},"width":77.68,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-21.png","element":"img","alt":" − δ,","inline":true}],[{"style":{"width":"26%"},"width":458,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-22.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":17.6},"width":42,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-23.png","element":"img","alt":"� 
","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"consists of all the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":"-step nonstationary policies.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 3.10. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The uniform difference bound between ","element":"span"},{"style":{"height":17.11},"width":107.84,"height":42.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-24.png","element":"img","alt":" �vπTMIS","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":17.11},"width":107.84,"height":42.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-25.png","element":"img","alt":" �vπTMIS","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is obtained by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"observing the construction of fictitious estimator ","element":"span"},{"text":"(","element":"span"},{"href":"#id-49","text":"7","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"text":"(","element":"span"},{"href":"#id-50","text":"8","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"are independent of the specific target policy ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-26.png","element":"img","alt":" π","inline":true},{"style":{"fontStyle":"italic"},"text":". 
This result shows that ","element":"span"},{"text":"sup","element":"span"},{"style":{"height":20.63},"width":381.81,"height":51.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/13-27.png","element":"img","alt":"π∈� |�vπTMIS − �vπTMIS|","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"can be arbitrarily small with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"high probability and therefore does not depend on the horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":". This fact helps us derive the correct dependence on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for the uniform convergence problem; see Section ","element":"span"},{"style":{"fontStyle":"italic"},"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-51","style":{"width":"100%"},"width":1720,"height":1529,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/14-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"3.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Some interpretations.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Logging policy free algorithm. 
","element":"span"},{"text":"We point out the implementation of Tabular-MIS estimator does not require the knowledge of logging policy ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/14-1.png","element":"img","alt":" µ","inline":true},{"text":", as shown in Algorithm ","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":",","element":"span"},{"href":"#id-52","text":"2","element":"a"},{"text":".This is critical in the sense that in the real-world sequential decision making problems, it is very likely the complete information about logging policy is not provided. This may happen due to mis-records or the lack of maintainance. By only using the historical data, tabular MIS off-policy evaluation is able to achieve the asymptotic efficiency. In contrast, the state MIS estimator always requires the full information about the logging policy.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Connection to approximate MDP estimation. 
","element":"span"},{"text":"Our TMIS is essentially an approxi-","element":"span"}],[{"text":"mate MDP estimator (with the non-stationary dynamic transitions ","element":"span"},{"style":{"height":14.62},"width":40.02,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-0.png","element":"img","alt":" Pt","inline":true,"padRight":true},{"text":"estimated by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"maximum likelihood estimator ","element":"span"},{"text":"(MLE)) except that we marginalize out the action in both ","element":"span"},{"style":{"height":17.6},"width":97.88,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-1.png","element":"img","alt":" �rπt (s)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.2},"width":82.94,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-2.png","element":"img","alt":"�dπt (s","inline":true},{"text":") and provide an importance sampling interpretation. To the best of our knowledge, ","element":"span"},{"text":"existing analysis of the fully model-based approach does not provide tight bounds. We give two examples. The seminal simulation lemma in ","element":"span"},{"href":"#id-53","referenceIndex":18,"text":"Kearns & Singh ","element":"a"},{"text":"(","element":"span"},{"href":"#id-53","referenceIndex":18,"text":"2002","element":"a"},{"text":") together with a naive concentration-type analysis gives only an ","element":"span"},{"style":{"height":20.8},"width":282.43,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-3.png","element":"img","alt":"�O(�H4S3A/n","inline":true},{"text":") bound in our setting. 
In a very recent compilation of improvements over this bound (","element":"span"},{"href":"#id-54","referenceIndex":12,"text":"Jiang","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":12,"text":"2018","element":"a"},{"text":"), this bound can be improved to either ","element":"span"},{"style":{"height":20.8},"width":631.14,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-4.png","element":"img","alt":"�O(�H4S2A/n) or �O(�H6SA/n","inline":true},{"text":"). Our result is the first that achieves the optimal ","element":"span"},{"style":{"height":20.8},"width":263.78,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-5.png","element":"img","alt":"�O(�H2SA/n","inline":true},{"text":") rate regardless of whether it is the model-based or model-free approach.","element":"span"}]]},{"heading":"4 Experiments","paragraphs":[[{"text":"In this section, we present some empirical studies to demonstrate that our main theoretical results about Tabular-MIS estimator in Theorem ","element":"span"},{"href":"#id-55","text":"3.1 ","element":"a"},{"text":"are empirically verified.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Time-varying, non-mixing Tabular MDP","element":"span"},{"text":". We test our approach in simulated MDP environment where both the states and the actions are binary. 
","element":"span"},{"text":"Concretely, there are two states ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-6.png","element":"img","alt":" s0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-7.png","element":"img","alt":" s1","inline":true,"padRight":true},{"text":"and two actions ","element":"span"},{"style":{"height":10.62},"width":40.06,"height":26.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-8.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.62},"width":40.06,"height":26.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-9.png","element":"img","alt":" a2","inline":true},{"text":". State ","element":"span"},{"style":{"height":10.62},"width":37.45,"height":26.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-10.png","element":"img","alt":" s0","inline":true,"padRight":true},{"text":"always has probability 1 going back to itself, regardless of the actions, ","element":"span"},{"style":{"height":17.6},"width":296.44,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-11.png","element":"img","alt":" i.e. Pt(s0|s0, a1","inline":true},{"text":") = 1 and ","element":"span"},{"style":{"height":17.6},"width":209.9,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-12.png","element":"img","alt":" Pt(s0|s0, a2","inline":true},{"text":") = 1. 
For state ","element":"span"},{"style":{"height":10.62},"width":37.45,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-13.png","element":"img","alt":" s1","inline":true},{"text":", at each time step there is one action (we call it ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") that has probability 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"/H ","element":"span"},{"text":"going to ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-14.png","element":"img","alt":" s0","inline":true,"padRight":true},{"text":"and the other action (we call it ","element":"span"},{"style":{"height":8.4},"width":39.06,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-15.png","element":"img","alt":" a′","inline":true},{"text":") has probability 1 going back to ","element":"span"},{"style":{"height":15.2},"width":131.25,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-16.png","element":"img","alt":" s1, i.e.","inline":true},{"style":{"height":17.6},"width":192.85,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-17.png","element":"img","alt":"Pt(s0|s1, a","inline":true},{"text":") = 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"/H ","element":"span"},{"text":"= 1 ","element":"span"},{"style":{"height":17.6},"width":237.98,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-18.png","element":"img","alt":" − Pt(s1|s1, a","inline":true},{"text":") and ","element":"span"},{"style":{"height":17.6},"width":208.9,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-19.png","element":"img","alt":" Pt(s1|s1, a′","inline":true},{"text":") = 1. 
Moreover, which action will make state ","element":"span"},{"style":{"height":10.62},"width":37.45,"height":26.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-20.png","element":"img","alt":" s1","inline":true,"padRight":true},{"text":"go to state ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-21.png","element":"img","alt":" s0","inline":true,"padRight":true},{"text":"with probability 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"/H ","element":"span"},{"text":"is decided by a random parameter ","element":"span"},{"style":{"height":13.2},"width":84.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-22.png","element":"img","alt":"pt ∈","inline":true,"padRight":true},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1]. If ","element":"span"},{"style":{"height":15.6},"width":332.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-23.png","element":"img","alt":" pt < 0.5, a = a1","inline":true,"padRight":true},{"text":"and if ","element":"span"},{"style":{"height":15.6},"width":332.27,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-24.png","element":"img","alt":" pt ≥ 0.5, a = a2","inline":true},{"text":". One can receive reward 1 at each time step if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > H/","element":"span"},{"text":"2 and is in state ","element":"span"},{"style":{"height":10.62},"width":37.45,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-25.png","element":"img","alt":" s0","inline":true},{"text":", and will receive reward 0 otherwise. 
Lastly, for state ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-26.png","element":"img","alt":" s0","inline":true},{"text":", we set ","element":"span"},{"style":{"height":17.6},"width":105.3,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-27.png","element":"img","alt":" µ(·|s0","inline":true},{"text":") = ","element":"span"},{"style":{"height":17.6},"width":105.45,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-28.png","element":"img","alt":" π(·|s0","inline":true},{"text":"); for state ","element":"span"},{"style":{"height":10.62},"width":37.45,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-29.png","element":"img","alt":" s1","inline":true},{"text":", we set ","element":"span"},{"style":{"height":17.6},"width":135.17,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-30.png","element":"img","alt":" µ(a1|s1","inline":true},{"text":") = ","element":"span"},{"style":{"height":17.6},"width":135.18,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-31.png","element":"img","alt":" µ(a2|s1","inline":true},{"text":") = 1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2 and ","element":"span"},{"style":{"height":17.6},"width":576.73,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-32.png","element":"img","alt":"π(a1|s1) = 1/4 = 1 − π(a2|s1).","inline":true}],[{"text":"Figure ","element":"span"},{"href":"#id-56","text":"1(a) ","element":"a"},{"text":"shows the asymptotic convergence rates of relative RMSE with respect to the number of episodes, given fixed horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"= 100. 
Both SMIS and TMIS have a ","element":"span"},{"style":{"height":17.77},"width":158.24,"height":44.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-33.png","element":"img","alt":" O(1/√n","inline":true},{"text":") convergence rate. The saving of a factor of ","element":"span"},{"style":{"height":17.6},"width":75.36,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-34.png","element":"img","alt":"√H","inline":true,"padRight":true},{"text":"for TMIS over SMIS in this log-log plot is reflected in the intercept. Figure ","element":"span"},{"href":"#id-56","text":"1(b) ","element":"a"},{"text":"fixes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 1024 and varies the horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". Since ","element":"span"},{"style":{"height":17.6},"width":194.12,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-35.png","element":"img","alt":"vπ ≈ O(H","inline":true},{"text":"), our theoretical result for TMIS implies ","element":"span"},{"style":{"height":19.98},"width":194.83,"height":49.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-36.png","element":"img","alt":"√MSE/vπ","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.64},"width":207.54,"height":51.61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-37.png","element":"img","alt":" O(√H²/H","inline":true},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1), which is consistent with the horizontal line when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"is large. 
Moreover, for SMIS ","element":"span"},{"style":{"height":19.98},"width":194.84,"height":49.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-38.png","element":"img","alt":"√MSE/vπ","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":20.64},"width":207.52,"height":51.61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-39.png","element":"img","alt":" O(√H³/H","inline":true},{"text":") = ","element":"span"},{"style":{"height":19.98},"width":126.95,"height":49.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-40.png","element":"img","alt":" O(√H","inline":true},{"text":"), so after taking log(","element":"span"},{"style":{"height":5.6},"width":12,"height":14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/15-41.png","element":"img","alt":"·","inline":true},{"text":") we should see an asymptotically linear trend with slope 1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2. The red line in Figure ","element":"span"},{"href":"#id-56","text":"1(b) ","element":"a"},{"text":"empirically verifies this result. Further discussion of the empirical study is deferred to Appendix ","element":"span"},{"href":"#id-57","text":"D","element":"a"},{"text":".","element":"span"}],[{"id":"id-56","style":{"width":"93%"},"width":1602,"height":584,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-0.png","element":"img"}],[{"text":"Figure 1: Relative RMSE (","element":"figcaption","subtype":"caption"},{"style":{"height":19.98},"width":194.84,"height":49.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-1.png","element":"img","alt":"√MSE/vπ","inline":true},{"text":") on Non-stationary Non-mixing MDP","element":"figcaption","subtype":"caption"}]]},{"heading":"5 Discussion","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"From off-policy evaluation to offline learning. 
","element":"span"},{"text":"A real offline reinforcement learning system is equipped with both offline learning algorithms and off-policy evaluation algorithms. The decision maker should first run the offline learning algorithm to find a near optimal policy and then use off-policy evaluation methods to check if the obtained policy is good enough. Under our tabular MDP setting, we point out it is possible to find a ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-2.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policy in near optimal time and sample complexity ","element":"span"},{"style":{"height":19.14},"width":466.44,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-3.png","element":"img","alt":" O(H3SA/ϵ2)using the Q","inline":true},{"text":"-value iteration (QVI) based algorithm designed by ","element":"span"},{"href":"#id-32","referenceIndex":28,"text":"Sidford et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-32","referenceIndex":28,"text":"2018","element":"a"},{"text":"). Their QVI algorithm assumes a generative model which can provide independent sample of the next state ","element":"span"},{"style":{"height":8.4},"width":36.46,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-4.png","element":"img","alt":" s′","inline":true,"padRight":true},{"text":"given any current state-action (","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":"). At a first glance, this assumption seems too strong for offline learning since we cannot force the agent to stay in any arbitrary location. 
In fact, Assumption ","element":"span"},{"href":"#id-24","text":"2.2 ","element":"a"},{"text":"on ","element":"span"},{"style":{"height":12},"width":26,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-5.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"reveals that the underlying logging policy can serve as a surrogate for the generative model. As ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"grows large, the visitation frequency of any (","element":"span"},{"style":{"height":11.2},"width":89.14,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-6.png","element":"img","alt":"st, at","inline":true},{"text":") will be large enough with high probability, as guaranteed by the multiplicative Chernoff bound.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"From off-policy evaluation to uniform off-policy evaluation. ","element":"span"},{"text":"The high probability result achieves ","element":"span"},{"style":{"height":20.8},"width":263.71,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-7.png","element":"img","alt":"�O(�H2SA/n","inline":true},{"text":") complexity. Following this line of inquiry, it is natural to ask whether uniform convergence over a class of policies (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g. ","element":"span"},{"text":"all deterministic policies) can be achieved with optimal sample complexity. This problem is interesting because such a result would guarantee strong performance of off-policy evaluation methods over all policies in a given policy class ","element":"span"},{"style":{"height":17.6},"width":42,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-8.png","element":"img","alt":"Π","inline":true},{"text":". 
By a direct application of union bound, we can obtain the following result:","element":"span"}],[{"id":"id-78","style":{"height":17.6},"width":427.19,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/16-9.png","element":"img","alt":"Theorem 5.1. Let � ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"contains all the deterministic ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":"-step policies. Then under the same","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"condition as Theorem ","element":"span"},{"href":"#id-48","style":{"fontStyle":"italic"},"text":"3.8","element":"a"},{"style":{"fontStyle":"italic"},"text":", the data splitting Tabular-MIS estimator satisfies:","element":"span"}],[{"style":{"width":"36%"},"width":627,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/17-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with probability ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":12.8},"width":77.68,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/17-1.png","element":"img","alt":" − δ.","inline":true}],[{"text":"The uniform convergence bound implies that the empirical best policy ˆ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/17-2.png","element":"img","alt":"π","inline":true,"padRight":true},{"text":"= argmax","element":"span"},{"style":{"height":19.65},"width":114.99,"height":49.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/17-3.png","element":"img","alt":"π �vπsplit","inline":true,"padRight":true},{"text":"is within ","element":"span"},{"style":{"height":8},"width":18,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/17-4.png","element":"img","alt":" 
ϵ","inline":true,"padRight":true},{"text":"=","element":"span"}],[{"text":"bound for learning the optimal policy (","element":"span"},{"href":"#id-28","referenceIndex":2,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":2,"text":"2013","element":"a"},{"text":") in all parameters except a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":".","element":"span"}],[{"style":{"height":18.73},"width":1167.36,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/17-5.png","element":"img","alt":"Open problem: H³ vs H² in the infinite A setting.","inline":true,"padRight":true},{"text":"Finally, we note that the conjecture posed by ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") remains unsolved. The results presented in this paper leverage the fact that we can estimate the parameters of the MDP model. In the infinite ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"case, we can never observe any (","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") pair more than once, and hence cannot estimate the transition dynamics or the expected reward. The minimax lower bound in (","element":"span"},{"href":"#id-6","referenceIndex":37,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":37,"text":"2017","element":"a"},{"text":") (for the contextual bandit setting) already establishes that the Cramér-Rao lower bound is not achievable in this setting even if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"= 1 and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= 1. 
It remains open whether ","element":"span"},{"style":{"height":14.73},"width":56.82,"height":36.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/17-6.png","element":"img","alt":" H³ ","inline":true,"padRight":true},{"text":"is required.","element":"span"}]]},{"heading":"6 Conclusion","paragraphs":[[{"text":"In this paper, we propose and analyze a new marginalized importance sampling estimator for the off-policy evaluation (OPE) problem under the episodic tabular Markov decision process model. We show that the estimator has a finite-sample error bound that matches the exact Cramér-Rao lower bound up to low-order factors. We also provide an extension with a high-probability error bound. To the best of our knowledge, these results are the first of their kind. Future work includes resolving the open problems mentioned above and generalizing the results to more practical settings.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-33","text":"Agarwal, A., Kakade, S., & Yang, L. F. (2019). On the optimality of sparse model-based ","element":"span"},{"text":"planning for Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.03804","element":"span"},{"text":".","element":"span"}],[{"id":"id-28","text":"Azar, M. G., Munos, R., & Kappen, H. J. (2013). Minimax PAC bounds on the sample ","element":"span"},{"text":"complexity of reinforcement learning with a generative model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"91","element":"span"},{"text":"(3), 325–349.","element":"span"}],[{"id":"id-27","text":"Azar, M. G., Osband, I., & Munos, R. (2017). Minimax regret bounds for reinforcement ","element":"span"},{"text":"learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning-Volume 70","element":"span"},{"text":", (pp. 263–272). JMLR.org.","element":"span"}],[{"id":"id-59","text":"Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis ","element":"span"},{"text":"based on the sum of observations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Mathematical Statistics","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"23","element":"span"},{"text":"(4), 493–507.","element":"span"}],[{"id":"id-29","text":"Dann, C., Lattimore, T., & Brunskill, E. (2017). Unifying PAC and regret: Uniform PAC ","element":"span"},{"text":"bounds for episodic reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 5713–5723).","element":"span"}],[{"id":"id-4","text":"Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1103.4601","element":"span"},{"text":".","element":"span"}],[{"id":"id-10","text":"Farajtabar, M., Chow, Y., & Ghavamzadeh, M. (2018). More robust doubly robust off-policy ","element":"span"},{"text":"evaluation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1802.03493","element":"span"},{"text":".","element":"span"}],[{"id":"id-21","text":"Gelada, C., & Bellemare, M. G. (2019). Off-policy deep reinforcement learning by ","element":"span"},{"text":"bootstrapping the covariate shift. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI Conference on Artificial Intelligence","element":"span"},{"text":", vol. 33, (pp. 
3647–3655).","element":"span"}],[{"id":"id-19","text":"Gottesman, O., Liu, Y., Sussex, S., Brunskill, E., & Doshi-Velez, F. (2019). ","element":"span"},{"text":"Combining parametric and nonparametric models for off-policy evaluation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1905.05787","element":"span"},{"text":".","element":"span"}],[{"id":"id-20","text":"Hallak, A., & Mannor, S. (2017). Consistent on-line off-policy evaluation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning-Volume 70","element":"span"},{"text":", (pp. 1372–1383). JMLR.org.","element":"span"}],[{"id":"id-17","text":"Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment ","element":"span"},{"text":"effects using the estimated propensity score. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Econometrica","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"71","element":"span"},{"text":"(4), 1161–1189.","element":"span"}],[{"id":"id-54","text":"Jiang, N. (2018). Notes on tabular methods.","element":"span"}],[{"id":"id-34","text":"Jiang, N., & Agarwal, A. (2018). Open problem: The dependence of sample complexity ","element":"span"},{"text":"lower bounds on planning horizon. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", (pp. 3395–3398).","element":"span"}],[{"id":"id-8","text":"Jiang, N., & Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement ","element":"span"},{"text":"learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 33rd International Conference on Machine Learning-Volume 48","element":"span"},{"text":", (pp. 652–661). JMLR.org.","element":"span"}],[{"id":"id-30","text":"Jin, C., Allen-Zhu, Z., Bubeck, S., & Jordan, M. I. 
(2018). Is Q-learning provably efficient? ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 4863–4873).","element":"span"}],[{"id":"id-22","text":"Kallus, N., & Uehara, M. (2019a). Double reinforcement learning for efficient off-policy ","element":"span"},{"text":"evaluation in Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1908.08526","element":"span"},{"text":".","element":"span"}],[{"id":"id-23","text":"Kallus, N., & Uehara, M. (2019b). Efficiently breaking the curse of horizon: Double ","element":"span"},{"text":"reinforcement learning in infinite-horizon processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1909.05850","element":"span"},{"text":".","element":"span"}],[{"id":"id-53","text":"Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"49","element":"span"},{"text":"(2-3), 209–232.","element":"span"}],[{"id":"id-31","text":"Kearns, M. J., & Singh, S. P. (1999). ","element":"span"},{"text":"Finite-sample convergence rates for Q-learning and indirect algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 996–1002).","element":"span"}],[{"id":"id-3","text":"Li, L., Chu, W., Langford, J., & Wang, X. (2011). Unbiased offline evaluation of contextual-","element":"span"},{"text":"bandit-based news article recommendation algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Fourth ACM International Conference on Web Search and Data Mining","element":"span"},{"text":", (pp. 297–306). 
ACM.","element":"span"}],[{"id":"id-7","text":"Li, L., Munos, R., & Szepesvári, C. (2015). Toward minimax off-policy value estimation.","element":"span"}],[{"id":"id-0","text":"Liu, Q., Li, L., Tang, Z., & Zhou, D. (2018a). Breaking the curse of horizon: Infinite-","element":"span"},{"text":"horizon off-policy estimation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 5361–5371).","element":"span"}],[{"id":"id-18","text":"Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A. A., Doshi-Velez, F., & ","element":"span"},{"text":"Brunskill, E. (2018b). Representation balancing mdps for off-policy policy evaluation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 2644–2653).","element":"span"}],[{"id":"id-40","text":"Mahmood, A. R., van Hasselt, H. P., & Sutton, R. S. (2014). ","element":"span"},{"text":"Weighted importance sampling for off-policy learning with linear function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 3014–3022).","element":"span"}],[{"id":"id-13","text":"Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E., & Popovic, Z. (2014). Offline policy ","element":"span"},{"text":"evaluation across representations with applications to educational games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems","element":"span"},{"text":", (pp. 1077–1084). International Foundation for Autonomous Agents and Multiagent Systems.","element":"span"}],[{"id":"id-16","text":"Murphy, S. A., van der Laan, M. J., Robins, J. M., & Group, C. P. P. R. (2001). Marginal ","element":"span"},{"text":"mean models for dynamic regimes. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the American Statistical Association","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"96","element":"span"},{"text":"(456), 1410–1423.","element":"span"}],[{"id":"id-25","text":"Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic program- ","element":"span"},{"text":"ming.","element":"span"}],[{"id":"id-32","text":"Sidford, A., Wang, M., Wu, X., Yang, L., & Ye, Y. (2018). Near-optimal time and sample ","element":"span"},{"text":"complexities for solving markov decision processes with a generative model. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 5186–5196).","element":"span"}],[{"id":"id-58","text":"Sridharan, K. (2002). A gentle introduction to concentration inequalities. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dept. Comput. Sci., Cornell Univ., Tech. Rep","element":"span"},{"text":".","element":"span"}],[{"id":"id-26","text":"Sutton, R. S., & Barto, A. G. (1998). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement learning: An introduction","element":"span"},{"text":", vol. 1. MIT press Cambridge.","element":"span"}],[{"id":"id-2","text":"Sutton, R. S., & Barto, A. G. (2018). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement learning: An introduction","element":"span"},{"text":". MIT press.","element":"span"}],[{"id":"id-5","text":"Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D., & ","element":"span"},{"text":"Zitouni, I. (2017). Off-policy evaluation for slate recommendation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 3632–3642).","element":"span"}],[{"id":"id-11","text":"Theocharous, G., Thomas, P. S., & Ghavamzadeh, M. 
(2015). Personalized ad recom- ","element":"span"},{"text":"mendation systems for life-time value optimization with guarantees. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Twenty-Fourth International Joint Conference on Artificial Intelligence","element":"span"},{"text":".","element":"span"}],[{"id":"id-9","text":"Thomas, P., & Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforce- ","element":"span"},{"text":"ment learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", (pp. 2139–2148).","element":"span"}],[{"id":"id-12","text":"Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., & Brunskill, E. (2017). ","element":"span"},{"text":"Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Twenty-Ninth IAAI Conference","element":"span"},{"text":".","element":"span"}],[{"id":"id-43","text":"Van der Vaart, A. W. (2000). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Asymptotic statistics","element":"span"},{"text":", vol. 3. Cambridge university press.","element":"span"}],[{"id":"id-6","text":"Wang, Y.-X., Agarwal, A., & Dudik, M. (2017). Optimal and adaptive off-policy evaluation ","element":"span"},{"text":"in contextual bandits. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning-Volume 70","element":"span"},{"text":", (pp. 3589–3597). JMLR. org.","element":"span"}],[{"id":"id-1","text":"Xie, T., Ma, Y., & Wang, Y.-X. (2019). Towards optimal off-policy evaluation for re- ","element":"span"},{"text":"inforcement learning with marginalized importance sampling. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", (pp. 
9665–9675).","element":"span"}]]},{"heading":"Appendix","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"A ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Concentration inequalities and other technical lemmas","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma A.1 ","element":"span"},{"text":"(Bernstein's Inequality (","element":"span"},{"href":"#id-58","referenceIndex":29,"text":"Sridharan","element":"a"},{"text":", ","element":"span"},{"href":"#id-58","referenceIndex":29,"text":"2002","element":"a"},{"text":"))","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":11.2},"width":164.96,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-0.png","element":"img","alt":" x1, ..., xn","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be independent bounded random variables such that ","element":"span"},{"style":{"height":17.6},"width":78.39,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-1.png","element":"img","alt":" E[xi","inline":true},{"text":"] = 0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":17.6},"width":147.45,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-2.png","element":"img","alt":" |xi| ≤ ξ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with probability ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Let ","element":"span"},{"style":{"height":15.13},"width":43.5,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-3.png","element":"img","alt":" σ2","inline":true,"padRight":true},{"text":"=","element":"span"}],[{"style":{"width":"67%"},"width":1151,"height":217,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-4.png","element":"img"}],[{"id":"id-45","style":{"fontWeight":"bold"},"text":"Lemma A.2 ","element":"span"},{"text":"(Multiplicative Chernoff bound (","element":"span"},{"href":"#id-59","referenceIndex":4,"text":"Chernoff et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-59","referenceIndex":4,"text":"1952","element":"a"},{"text":") )","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a Binomial random variable with parameter ","element":"span"},{"style":{"height":16.4},"width":367.32,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-5.png","element":"img","alt":" p, n. 
For any δ > 0","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that","element":"span"}],[{"style":{"width":"41%"},"width":706,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-6.png","element":"img"}],[{"text":"A slightly weaker bound that suffices for our purposes is the following:","element":"span"}],[{"style":{"width":"30%"},"width":517,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-7.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"B ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of the main theorem","element":"span"}],[{"text":"To analyze the MSE upper bound ","element":"span"},{"style":{"height":20.15},"width":320.64,"height":50.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-8.png","element":"img","alt":" Eµ[(�vπTMIS − vπ)2","inline":true},{"text":"], we create a fictitious surrogate ","element":"span"},{"style":{"height":17.11},"width":121.64,"height":42.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-9.png","element":"img","alt":" �vπTMIS,","inline":true,"padRight":true},{"text":"which is an unbiased version of ","element":"span"},{"style":{"height":17.11},"width":107.84,"height":42.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-10.png","element":"img","alt":" �vπTMIS","inline":true},{"text":". A few auxiliary lemmas are first presented and ","element":"span"},{"text":"Bellman equations are used for deriving variance decomposition in a recursive way. 
Second order moment of marginalized state distribution ","element":"span"},{"style":{"height":16.71},"width":42.72,"height":41.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-11.png","element":"img","alt":"�dπt ","inline":true,"padRight":true},{"text":"can then be bounded by analyzing its ","element":"span"},{"text":"variance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Fictitious tabular MIS estimator.","element":"span"}],[{"text":"The fictitious estimator","element":"span"},{"style":{"height":15.14},"width":80.67,"height":37.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-12.png","element":"img","alt":"8 �vπ","inline":true,"padRight":true},{"text":"fills in the gap of state-action location (","element":"span"},{"style":{"height":11.2},"width":89.14,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-13.png","element":"img","alt":"st, at","inline":true},{"text":") of the true estimator ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-14.png","element":"img","alt":" �vπ","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":13.02},"width":93.66,"height":32.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-15.png","element":"img","alt":" nst,at","inline":true,"padRight":true},{"text":"= 0. 
Specifically, it replaces every component in ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-16.png","element":"img","alt":" �vπ","inline":true,"padRight":true},{"text":"with a fictitious counterpart, ","element":"span"},{"style":{"height":12.4},"width":140.61,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-17.png","element":"img","alt":" i.e. �vπ","inline":true,"padRight":true},{"text":":= ","element":"span"},{"style":{"height":22},"width":245.72,"height":55.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-18.png","element":"img","alt":"�Ht=1⟨�dπt , �rπt ⟩","inline":true},{"text":", with ","element":"span"},{"style":{"height":16.72},"width":42.71,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-19.png","element":"img","alt":" �dπt","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":17.08},"width":134.96,"height":42.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-20.png","element":"img","alt":" �P πt �dπt−1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.32},"width":54.08,"height":40.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-21.png","element":"img","alt":" �P πt","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":17.6},"width":122.84,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/21-22.png","element":"img","alt":"st|st−1","inline":true},{"text":") =","element":"span"}],[{"style":{"height":20.13},"width":731.72,"height":50.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-0.png","element":"img","alt":"�at−1 �Pt(st|st−1, at−1)π(at−1|st−1), 
�rπt","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":10.62},"width":32.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-1.png","element":"img","alt":"st","inline":true},{"text":") = ","element":"span"},{"style":{"height":19.95},"width":403.76,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-2.png","element":"img","alt":" �at �rt(st, at)π(at|st).","inline":true,"padRight":true},{"text":"In particular, let ","element":"span"},{"style":{"height":14.62},"width":44.22,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-3.png","element":"img","alt":" Et","inline":true,"padRight":true},{"text":"denote the event ","element":"span"},{"style":{"height":19.75},"width":657.43,"height":49.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-4.png","element":"img","alt":" {nst,at ≥ ndµt (st, at)(1 − θ)}9, then","inline":true}],[{"style":{"width":"60%"},"width":1040,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-5.png","element":"img"}],[{"text":"where 0 ","element":"span"},{"style":{"height":13.2},"width":113.85,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-6.png","element":"img","alt":" < θ <","inline":true,"padRight":true},{"text":"1 is a parameter that we will choose later.","element":"span"}],[{"text":"The name “fictitious” comes from the fact that ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-7.png","element":"img","alt":" �vπ","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not implementable ","element":"span"},{"text":"using the 
data","element":"span"},{"style":{"height":8.4},"width":33.93,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-8.png","element":"img","alt":"10","inline":true},{"text":", but it creates a bridge between ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-9.png","element":"img","alt":" �vπ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-10.png","element":"img","alt":" vπ","inline":true,"padRight":true},{"text":"because of its unbiasedness, see Lemma ","element":"span"},{"href":"#id-44","text":"B.5","element":"a"},{"text":". Also, for simplicity of the proof, throughout the rest of the paper we denote: ","element":"span"},{"style":{"height":14.62},"width":45.66,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-11.png","element":"img","alt":" Dt","inline":true,"padRight":true},{"text":":= ","element":"span"},{"style":{"height":32.77},"width":401.25,"height":81.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-12.png","element":"img","alt":"�s(i)1:t, a(i)1:t, r(i)1:t−1�ni=1 .","inline":true,"padRight":true},{"text":"Also, in the base case, we denote ","element":"span"},{"style":{"height":14.62},"width":50.67,"height":36.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-13.png","element":"img","alt":" D1","inline":true,"padRight":true},{"text":":= ","element":"span"},{"style":{"height":32.77},"width":255.7,"height":81.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-14.png","element":"img","alt":"�s(i)1 , a(i)1 �ni=1","inline":true,"padRight":true},{"text":"and that 
","element":"span"},{"style":{"height":25.56},"width":1722.46,"height":63.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-15.png","element":"img","alt":"rπt (st) := Eπ[r(1)t |s(1)t = st] = �at E[r(1)t |s(1)t = st, a(1)t = at]π(at|st) := �at rt(st, at)π(at|st).","inline":true,"padRight":true},{"text":"Then we have the following preliminary auxiliary lemmas.","element":"span"}],[{"id":"id-60","style":{"height":16.72},"width":339.97,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-16.png","element":"img","alt":"Lemma B.1. �dπt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.61},"width":75.27,"height":41.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-17.png","element":"img","alt":" �rπt−1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are deterministic given ","element":"span"},{"style":{"height":14.62},"width":45.66,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-18.png","element":"img","alt":" Dt","inline":true},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Moreover, given ","element":"span"},{"style":{"height":14.62},"width":45.66,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-19.png","element":"img","alt":" Dt","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":19.08},"width":104.94,"height":47.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-20.png","element":"img","alt":"�P πt+1,t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"unbiased of ","element":"span"},{"style":{"height":19.48},"width":248.43,"height":48.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-21.png","element":"img","alt":" P πt+1,t and �rπt ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is unbiased of ","element":"span"},{"style":{"height":16.25},"width":56.49,"height":40.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-22.png","element":"img","alt":" rπt .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"B.1","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"By construction of the estimator, ","element":"span"},{"style":{"height":16.72},"width":42.71,"height":41.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-23.png","element":"img","alt":"�dπt","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.61},"width":75.26,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-24.png","element":"img","alt":" �rπt−1","inline":true,"padRight":true},{"text":"only depend on ","element":"span"},{"style":{"height":14.62},"width":45.66,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-25.png","element":"img","alt":" Dt","inline":true},{"text":", ","element":"span"},{"text":"therefore ","element":"span"},{"style":{"height":17.08},"width":396.04,"height":42.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-26.png","element":"img","alt":"�dπt and �rπt−1 given Dt","inline":true,"padRight":true},{"text":"are constants. For the second argument, we have ","element":"span"},{"style":{"height":16.22},"width":168.28,"height":40.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-27.png","element":"img","alt":" ∀st, st+1,","inline":true}],[{"style":{"width":"79%"},"width":1361,"height":493,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-28.png","element":"img"}],[{"text":"where the third equal sign comes from the fact that conditional on ","element":"span"},{"style":{"height":14.62},"width":44.22,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-29.png","element":"img","alt":" Et","inline":true},{"text":", ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":17.6},"width":230.6,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-30.png","element":"img","alt":"P(st+1|st, at","inline":true},{"text":") — the empirical mean — is unbiased. 
The result about ˜","element":"span"},{"style":{"height":16.25},"width":40.9,"height":40.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/22-31.png","element":"img","alt":"rπt","inline":true,"padRight":true},{"text":"can be derived in a similar ","element":"span"},{"text":"fashion.","element":"span"}],[{"text":"Using Lemma ","element":"span"},{"href":"#id-60","text":"B.1","element":"a"},{"text":", we can derive the following recursions for expectation and variance:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma B.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., H","element":"span"},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"style":{"width":"113%"},"width":1946,"height":191,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-0.png","element":"img"}],[{"text":"Var","element":"span"}],[{"id":"id-65","style":{"width":"108%"},"width":1860,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof of Lemma ","element":"span"},{"text":"B.2 ","element":"span"},{"text":"follows that of Lemma B.2 and Lemma 4.1 in ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") by coupling the standard Bellman equation:","element":"span"}],[{"style":{"width":"63%"},"width":1089,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-2.png","element":"img"}],[{"text":"with the law of total expectation and the law of total variance.","element":"span"}],[{"id":"id-47","style":{"fontWeight":"bold"},"text":"Lemma B.3 ","element":"span"},{"text":"(Boundedness of Tabular MIS estimators)","element":"span"},{"style":{"height":15.2},"width":605.72,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-3.png","element":"img","alt":". 0 ≤ �vπ ≤ HRmax, 0 ≤ �vπ ≤","inline":true},{"style":{"height":14.62},"width":150.98,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-4.png","element":"img","alt":"HRmax.","inline":true}],[{"id":"id-61","style":{"width":"99%"},"width":1708,"height":732,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-5.png","element":"img"}],[{"text":"The last line is an inequality since ","element":"span"},{"style":{"height":17.6},"width":282.35,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-6.png","element":"img","alt":"�Pt(st|st−1, at−1","inline":true},{"text":") = 0 when ","element":"span"},{"style":{"height":13.02},"width":169.72,"height":32.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-7.png","element":"img","alt":" nst−1,at−1","inline":true,"padRight":true},{"text":"= 0. 
Following the same logic, it is easy to show ","element":"span"},{"style":{"height":17.6},"width":173.91,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-8.png","element":"img","alt":"�P πt (·|st−1","inline":true},{"text":") is a non-degenerated probability distribution.","element":"span"}],[{"text":"Next note ","element":"span"},{"style":{"height":20.13},"width":187.66,"height":50.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-9.png","element":"img","alt":"�s1 �dπ1(s1","inline":true},{"text":") = ","element":"span"},{"style":{"height":21.02},"width":187.44,"height":52.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-10.png","element":"img","alt":" �s1 �dµ1(s1","inline":true},{"text":") = ","element":"span"},{"style":{"height":22.41},"width":141.9,"height":56.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-11.png","element":"img","alt":" �s1ns1n","inline":true,"padRight":true},{"text":"= 1. 
Suppose ","element":"span"},{"style":{"height":17.88},"width":109.53,"height":44.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-12.png","element":"img","alt":"�dπt−1(·","inline":true},{"text":") is a (degenerated) ","element":"span"},{"text":"probability distribution, then from ","element":"span"},{"href":"#id-61","style":{"height":17.88},"width":416.52,"height":44.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-13.png","element":"img","alt":"�dπt = �P πt �dπt−1 and (13)","inline":true},{"text":", by induction we know ","element":"span"},{"style":{"height":17.2},"width":171.23,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-14.png","element":"img","alt":" �dπt (·) is a","inline":true,"padRight":true},{"text":"(degenerated) probability distribution for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"text":"Using Assumption ","element":"span"},{"href":"#id-41","text":"2.1","element":"a"},{"text":", it is easy to show ˆ","element":"span"},{"style":{"height":17.2},"width":436.42,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-15.png","element":"img","alt":"rπt (st) ≤ Rmax for all st","inline":true},{"text":", then combining all results ","element":"span"},{"text":"above we have ","element":"span"},{"style":{"height":22},"width":555.57,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-16.png","element":"img","alt":" �vπ := �Ht=1⟨�dπt , �rπt ⟩ ≤ HRmax","inline":true},{"text":". 
Similarly, ","element":"span"},{"style":{"height":14.62},"width":253.47,"height":36.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-17.png","element":"img","alt":" �vπ ≤ HRmax.","inline":true}],[{"style":{"width":"1%"},"width":23,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/23-18.png","element":"img"}],[{"text":"The boundedness of Tabular-MIS estimator cannot be inherited by the State-MIS estimator since ","element":"span"},{"style":{"height":17.11},"width":102.19,"height":42.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-0.png","element":"img","alt":" �vπSMIS ","inline":true,"padRight":true},{"text":"explicitly uses importance weights and there is no reason for it to be less than ","element":"span"},{"style":{"height":14.62},"width":136.09,"height":36.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-1.png","element":"img","alt":"HRmax","inline":true},{"text":". As a result, we do not need an extra projection step for our estimation to be valid (see ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") Lemma B.1). Thanks to the following lemma, throughout the rest of the analysis we only need to consider ","element":"span"},{"style":{"height":12.33},"width":57.31,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-2.png","element":"img","alt":" �vπ.","inline":true}],[{"id":"id-62","style":{"height":12.4},"width":396.87,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-3.png","element":"img","alt":"Lemma B.4. 
Let �vπ ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the Tabular-MIS estimator and ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-4.png","element":"img","alt":" �vπ ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the fictitious version of TMIS we described above with parameter ","element":"span"},{"style":{"height":12.8},"width":21,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-5.png","element":"img","alt":" θ","inline":true},{"style":{"fontStyle":"italic"},"text":". Then the MSE of the TMIS and fictitious TMIS satisfies","element":"span"}],[{"style":{"width":"71%"},"width":1226,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-62","style":{"fontStyle":"italic"},"text":"B.4","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Define ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":":= ","element":"span"},{"style":{"height":19.11},"width":489.66,"height":47.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-7.png","element":"img","alt":" {∃t, st, at s.t. nst,at < ndµt","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":11.2},"width":89.14,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-8.png","element":"img","alt":"st, at","inline":true},{"text":")(1 ","element":"span"},{"style":{"height":17.6},"width":105.57,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-9.png","element":"img","alt":" − θ)}","inline":true},{"text":". 
Similarly to ","element":"span"},{"text":"Lemma B.1 in the appendix of ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":"), we have","element":"span"}],[{"style":{"width":"97%"},"width":1681,"height":193,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-10.png","element":"img"}],[{"text":"where the last inequality uses Lemma ","element":"span"},{"href":"#id-47","text":"B.3","element":"a"},{"text":". Then combining the multiplicative Chernoff bound (Lemma ","element":"span"},{"href":"#id-45","text":"A.2 ","element":"a"},{"text":"in the Appendix) and a union bound over each ","element":"span"},{"style":{"height":14.8},"width":195.36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-11.png","element":"img","alt":" t,st and at","inline":true},{"text":", we get that","element":"span"}],[{"style":{"width":"83%"},"width":1438,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-12.png","element":"img"}],[{"text":"which provides the stated result.","element":"span"}],[{"text":"Lemma ","element":"span"},{"href":"#id-62","text":"B.4 ","element":"a"},{"text":"shows that the MSEs of the two TMIS estimators differ by at most 3","element":"span"},{"style":{"height":30.83},"width":592.15,"height":77.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-13.png","element":"img","alt":"H3SAR2maxe−θ2n mint,st,at dµt (st,at)2","inline":true}],[{"text":"and this shows that the gap between the two MSEs is sufficiently small as long as ","element":"span"},{"style":{"height":29.44},"width":390.04,"height":73.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-14.png","element":"img","alt":"n ≥ polylog(S,A,H,n)mint,st,at dµt (st,at).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"B.2 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Variance and Bias of Fictitious tabular MIS estimator.","element":"span"}],[{"id":"id-44","style":{"fontWeight":"bold"},"text":"Lemma B.5 ","element":"span"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") Lemma B.2)","element":"span"},{"style":{"height":17.2},"width":898.35,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-15.png","element":"img","alt":". Tabular-MIS estimator is unbiased: E[�vπ] = vπ","inline":true},{"style":{"height":16.4},"width":326.84,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-16.png","element":"img","alt":"for all 0 < θ < 1.","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Lemma B.6 ","element":"span"},{"text":"(Variance decomposition)","element":"span"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"text":"Var[","element":"span"},{"id":"id-63","style":{"height":42.98},"width":357.8,"height":107.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-17.png","element":"img","alt":"�vπ] =Var[V π1 (s(1)1 )]n","inline":true}],[{"style":{"width":"96%"},"width":1654,"height":185,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/24-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":17.6},"width":126.39,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-0.png","element":"img","alt":" V πt (st)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denotes the value function under ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-1.png","element":"img","alt":" 
π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"which satisfies the Bellman equation","element":"span"}],[{"style":{"width":"48%"},"width":840,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Remark B.7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Note that even though the constructions of TMIS and SMIS are different, both fictitious estimators are unbiased for ","element":"span"},{"style":{"height":12.33},"width":42.72,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-3.png","element":"img","alt":" vπ","inline":true},{"style":{"fontStyle":"italic"},"text":". Therefore the MSE of each MIS estimator is dominated by the variance of its fictitious estimator. Comparing Lemma ","element":"span"},{"href":"#id-63","style":{"fontStyle":"italic"},"text":"B.6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"with Lemma 4.1 in ","element":"span"},{"href":"#id-1","referenceIndex":38,"style":{"fontStyle":"italic"},"text":"Xie et al. ","element":"a"},{"style":{"fontStyle":"italic"},"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"style":{"fontStyle":"italic"},"text":"2019","element":"a"},{"style":{"fontStyle":"italic"},"text":") we can see that our Tabular-MIS estimator achieves a lower bound, and it is essentially asymptotically optimal, as explained by Remark ","element":"span"},{"href":"#id-64","style":{"fontStyle":"italic"},"text":"3.2","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-63","style":{"fontStyle":"italic"},"text":"B.6","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"The proof relies on applying Lemma ","element":"span"},{"text":"B.2 ","element":"span"},{"text":"in a recursive way. 
One key observation is","element":"span"}],[{"text":"We begin with the following variance decomposition, which applies (","element":"span"},{"href":"#id-65","text":"11","element":"a"},{"text":") recursively.","element":"span"}],[{"text":"Var[","element":"span"},{"style":{"height":17.6},"width":670.34,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-4.png","element":"img","alt":"�vπ] =EVar[�vπ|DH] + Var[E[�vπ|DH]]","inline":true}],[{"style":{"width":"92%"},"width":1581,"height":831,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-5.png","element":"img"}],[{"text":"Now let us analyze ","element":"span"},{"style":{"height":32.7},"width":1340.92,"height":81.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-6.png","element":"img","alt":" E�Var�⟨�dπh+1, V πh+1⟩ + ⟨�dπh, �rπh⟩��Dh��. Note �P πh+1,h(·, sh) and �rπh(sh) for","inline":true,"padRight":true},{"text":"each ","element":"span"},{"style":{"height":10.84},"width":40.46,"height":27.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-7.png","element":"img","alt":" sh","inline":true,"padRight":true},{"text":"are conditionally independent given ","element":"span"},{"style":{"height":15.2},"width":242.56,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-8.png","element":"img","alt":" Dh, since Dh","inline":true,"padRight":true},{"text":"partitions the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"episodes into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"disjoint sets according to the states ","element":"span"},{"style":{"height":24.44},"width":58.16,"height":61.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-9.png","element":"img","alt":" s(i)h","inline":true,"padRight":true},{"text":"at time 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":". Similarly, ˜","element":"span"},{"style":{"height":17.6},"width":238.62,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-10.png","element":"img","alt":"Ph+1(·|sh, ah","inline":true},{"text":") and ˜","element":"span"},{"style":{"height":18.52},"width":165.06,"height":46.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-11.png","element":"img","alt":"rπh(sh, ah","inline":true},{"text":") ","element":"span"},{"text":"for each (","element":"span"},{"style":{"height":11.2},"width":104.42,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-12.png","element":"img","alt":"sh, ah","inline":true},{"text":") are also conditionally independent given ","element":"span"},{"style":{"height":14.84},"width":53.66,"height":37.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/25-13.png","element":"img","alt":" Dh","inline":true},{"text":". These observations imply:","element":"span"}],[{"id":"id-66","style":{"width":"96%"},"width":1661,"height":1039,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-0.png","element":"img"}],[{"text":"(15) The second line and the fourth line use the conditional independence for ","element":"span"},{"style":{"height":10.62},"width":32.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-1.png","element":"img","alt":" st","inline":true,"padRight":true},{"text":"and (","element":"span"},{"style":{"height":11.2},"width":89.14,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-2.png","element":"img","alt":"st, at","inline":true},{"text":") respectively. 
The fifth line uses the fact that when ","element":"span"},{"style":{"height":19.73},"width":371.86,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-3.png","element":"img","alt":" nsh,ah < ndµh(sh, ah","inline":true},{"text":")(1 ","element":"span"},{"style":{"height":12.8},"width":66.78,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-4.png","element":"img","alt":" − θ","inline":true},{"text":"), the conditional ","element":"span"},{"text":"variance is 0. The sixth line uses the fact that episodes are i.i.d.","element":"span"}],[{"text":"Plugging (","element":"span"},{"href":"#id-66","text":"15","element":"a"},{"text":") into the above variance decomposition and using ","element":"span"},{"style":{"height":15.9},"width":99.07,"height":39.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-5.png","element":"img","alt":" VH+1","inline":true,"padRight":true},{"text":"= 0, we finally get","element":"span"}],[{"text":"Var[","element":"span"},{"style":{"height":42.98},"width":357.8,"height":107.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-6.png","element":"img","alt":"�vπ] =Var[V π1 (s(1)1 )]n","inline":true}],[{"style":{"width":"96%"},"width":1654,"height":206,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-7.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"B.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bounding the variance of ","element":"span"},{"style":{"height":19.6},"width":126.7,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-8.png","element":"img","alt":"�dπh(sh)","inline":true},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"text":"Applying the definition of variance, we directly 
have","element":"span"}],[{"style":{"width":"97%"},"width":1681,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/26-9.png","element":"img"}],[{"text":"where we use the fact that ","element":"span"},{"style":{"height":18.51},"width":102.79,"height":46.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-0.png","element":"img","alt":"�dπh(sh","inline":true},{"text":") is unbiased (which can be proved by induction through ","element":"span"},{"text":"applying the law of total expectation and the recursive relationship ","element":"span"},{"style":{"height":17.08},"width":238.52,"height":42.69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-1.png","element":"img","alt":"�dπt = �P πt �dπt−1","inline":true},{"text":"). Therefore ","element":"span"},{"text":"the only thing left is to bound the variance of ","element":"span"},{"style":{"height":18.12},"width":102.39,"height":45.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-2.png","element":"img","alt":"�dπh(sh","inline":true},{"text":"). To tackle it, we consider bounding ","element":"span"},{"text":"the covariance matrix of ","element":"span"},{"style":{"height":18.12},"width":102.5,"height":45.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-3.png","element":"img","alt":"�dπh(sh","inline":true},{"text":"). As we shall see in Lemma ","element":"span"},{"href":"#id-67","text":"B.8","element":"a"},{"text":", fortunately, we are able to ","element":"span"},{"text":"derive a result identical to Lemma B.4 in ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") for our Tabular-MIS estimator, which helps greatly in bounding the variance of ","element":"span"},{"style":{"height":18.51},"width":133.21,"height":46.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-4.png","element":"img","alt":"�dπh(sh).","inline":true}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"Lemma B.8 ","element":"span"},{"text":"(Covariance of ","element":"span"},{"style":{"height":18.51},"width":301.72,"height":46.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-5.png","element":"img","alt":"�dπh with TMIS).","inline":true}],[{"text":"Cov( ","element":"span"},{"style":{"height":38.92},"width":287.24,"height":97.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-6.png","element":"img","alt":"�dπh) ⪯(1 − θ)−1n","inline":true}],[{"style":{"width":"84%"},"width":1452,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":19.72},"width":67.58,"height":49.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-8.png","element":"img","alt":" Pπh,t","inline":true,"padRight":true},{"text":"= ","element":"span"},{"style":{"height":19.72},"width":525.14,"height":49.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-9.png","element":"img","alt":" Pπh,h−1 · Pπh−1,h−2 · ... 
· Pπt+1,t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"— the transition matrices under policy ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-10.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"time ","element":"span"},{"style":{"height":20.91},"width":470.37,"height":52.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-11.png","element":"img","alt":" t to h (define Pπh,h := I).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-67","style":{"fontStyle":"italic"},"text":"B.8","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"We start by applying the law of total variance to obtain the following recursive equation","element":"span"}],[{"id":"id-69","style":{"width":"98%"},"width":1691,"height":608,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-12.png","element":"img"}],[{"text":"The decomposition of the covariance in the third line uses that Cov(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":") = Cov(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":") + Cov(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":") when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"are statistically independent and the columns of 
","element":"span"},{"style":{"height":17.24},"width":118.46,"height":43.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-13.png","element":"img","alt":"�Ph,h−1","inline":true,"padRight":true},{"text":"are independent when conditioning on ","element":"span"},{"style":{"height":14.84},"width":110.46,"height":37.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/27-14.png","element":"img","alt":" Dh−1.","inline":true}],[{"style":{"width":"74%"},"width":1287,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-0.png","element":"img"}],[{"text":"(","element":"span"},{"style":{"height":17.6},"width":113.85,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-1.png","element":"img","alt":"∗) =E","inline":true}],[{"id":"id-68","style":{"width":"97%"},"width":1668,"height":985,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-2.png","element":"img"}],[{"text":"The second line uses the fact that conditional on ","element":"span"},{"style":{"height":17.31},"width":95.08,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-3.png","element":"img","alt":" Ech−1","inline":true},{"text":", the variance of ","element":"span"},{"style":{"height":17.6},"width":258.79,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-4.png","element":"img","alt":" �P(·|sh−1, ah−1","inline":true},{"text":") is ","element":"span"},{"text":"zero given Data","element":"span"},{"style":{"height":8.8},"width":20,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-5.png","element":"img","alt":"h","inline":true},{"text":". 
The third line uses a basic property of the empirical average, and the fourth line comes from the fact","element":"span"}],[{"style":{"width":"84%"},"width":1450,"height":379,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-6.png","element":"img"}],[{"text":"The last line (","element":"span"},{"href":"#id-68","text":"25","element":"a"},{"text":") uses the fact that ","element":"span"},{"style":{"height":22.85},"width":577.73,"height":57.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-7.png","element":"img","alt":" Pπh,h−1(·|sh−1)[Pπh,h−1(·|sh−1)]T","inline":true,"padRight":true},{"text":"is positive ","element":"span"},{"text":"semidefinite, ","element":"span"},{"style":{"height":19.73},"width":569.61,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-8.png","element":"img","alt":" nsh−1,ah−1 ≥ ndµh−1(sh−1, ah−1","inline":true},{"text":")(1 ","element":"span"},{"style":{"height":12.8},"width":65.56,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-9.png","element":"img","alt":" − θ","inline":true},{"text":") and the definition of variance for ","element":"span"},{"style":{"height":18.51},"width":188.14,"height":46.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-10.png","element":"img","alt":" �dπh−1(sh−1","inline":true},{"text":"). 
","element":"span"},{"text":"Combining (","element":"span"},{"href":"#id-69","text":"19","element":"a"},{"text":") and (","element":"span"},{"href":"#id-68","text":"25","element":"a"},{"text":") and applying them recursively, we get the stated result.","element":"span"}],[{"text":"Benefitting from the identical semidefinite ordering bound on Cov( ","element":"span"},{"style":{"height":17.72},"width":42.71,"height":44.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/28-11.png","element":"img","alt":"�dπh","inline":true},{"text":") for TMIS and ","element":"span"},{"text":"SMIS, we can borrow the following results from ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":") for our Tabular-MIS estimator.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma B.9 ","element":"span"},{"text":"(Corollary 2 of ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":"))","element":"span"},{"style":{"height":21.29},"width":861.25,"height":53.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-0.png","element":"img","alt":". 
For h = 1, we have Var[�dπ1(s1)] = 1n(dπh(s1)−","inline":true}],[{"style":{"width":"94%"},"width":1618,"height":285,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":20.13},"width":272.63,"height":50.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-2.png","element":"img","alt":" ϱ(st) := �st−1","inline":true}],[{"id":"id-71","style":{"width":"103%"},"width":1776,"height":335,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-3.png","element":"img"}],[{"text":"Before giving the proof of Theorem ","element":"span"},{"href":"#id-55","text":"3.1","element":"a"},{"text":", we first prove Lemma ","element":"span"},{"href":"#id-70","text":"3.4","element":"a"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-70","style":{"fontStyle":"italic"},"text":"3.4","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Let value function ","element":"span"},{"style":{"height":17.31},"width":55.15,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-4.png","element":"img","alt":" V πh","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":10.84},"width":40.46,"height":27.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-5.png","element":"img","alt":"sh","inline":true},{"text":") = ","element":"span"},{"style":{"height":24.44},"width":420.26,"height":61.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-6.png","element":"img","alt":" Eπ[�Ht=h r(1)t |s(1)h = sh","inline":true},{"text":"] and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function ","element":"span"},{"style":{"height":18.52},"width":178.82,"height":46.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-7.png","element":"img","alt":"Qπh(sh, ah","inline":true},{"text":") = ","element":"span"},{"style":{"height":24.44},"width":627.34,"height":61.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/29-8.png","element":"img","alt":" Eπ[�Ht=h r(1)t |s(1)h = sh, a(1)h = ah","inline":true},{"text":"], then by total law of variance we obtain","element":"span"}],[{"text":"(let’s suppress the policy ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/30-0.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"for 
simplicity):","element":"span"}],[{"style":{"width":"81%"},"width":1396,"height":191,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/30-1.png","element":"img"}],[{"text":"=","element":"span"},{"text":"E","element":"span"}],[{"style":{"width":"96%"},"width":1656,"height":551,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/30-2.png","element":"img"}],[{"text":"+Var","element":"span"}],[{"style":{"width":"82%"},"width":1419,"height":215,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/30-3.png","element":"img"}],[{"text":"+Var","element":"span"}],[{"style":{"width":"106%"},"width":1832,"height":187,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/30-4.png","element":"img"}],[{"text":"(26) where we use the Markov property that (","element":"span"},{"style":{"height":25.64},"width":275.9,"height":64.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/30-5.png","element":"img","alt":"Vh+1(s(1)h+1)|Dh","inline":true},{"text":") equals (","element":"span"},{"style":{"height":25.64},"width":373.52,"height":64.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/30-6.png","element":"img","alt":"Vh+1(s(1)h+1)|s(1)h , a(1)h","inline":true,"padRight":true},{"text":") in ","element":"span"},{"text":"distribution and ","element":"span"},{"text":"E","element":"span"}],[{"text":"recursively and letting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", we get the stated result.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark B.11. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"A straightforward implication of Lemma ","element":"span"},{"href":"#id-70","style":{"fontStyle":"italic"},"text":"3.4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"is the following:","element":"span"}],[{"style":{"width":"57%"},"width":984,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/30-7.png","element":"img"}],[{"text":"Combining Lemma ","element":"span"},{"href":"#id-63","text":"B.6 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-71","text":"B.10","element":"a"},{"text":", we are now ready to prove the main Theorem ","element":"span"},{"href":"#id-55","text":"3.1","element":"a"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-55","style":{"fontStyle":"italic"},"text":"3.1","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Plugging the result of Lemma ","element":"span"},{"href":"#id-71","text":"B.10 ","element":"a"},{"text":"into Lemma ","element":"span"},{"href":"#id-63","text":"B.6 ","element":"a"},{"text":"and using the unbi-","element":"span"}],[{"id":"id-72","style":{"width":"111%"},"width":1917,"height":482,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-0.png","element":"img"}],[{"text":"Choose ","element":"span"},{"style":{"height":20.82},"width":711.16,"height":52.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-1.png","element":"img","alt":" θ = √(4 log(n)/(n mint,st,at dµt (st, at)))","inline":true},{"text":". 
Then by the assumption ","element":"span"},{"style":{"height":27.93},"width":381.34,"height":69.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-2.png","element":"img","alt":" n > 16 log n / mint,st,at dµt (st,at)","inline":true,"padRight":true},{"text":"we have ","element":"span"},{"style":{"height":17.6},"width":128.11,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-3.png","element":"img","alt":" θ < 1/","inline":true},{"text":"2, which allows us to write (1 ","element":"span"},{"style":{"height":19.13},"width":176.82,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-4.png","element":"img","alt":" − θ)−1 ≤","inline":true,"padRight":true},{"text":"(1 + 2","element":"span"},{"style":{"height":12.8},"width":21,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-5.png","element":"img","alt":"θ","inline":true},{"text":") in the leading term and (1 ","element":"span"},{"style":{"height":19.14},"width":173.96,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-6.png","element":"img","alt":" − θ)−1 ≤","inline":true,"padRight":true},{"text":"2 in the subsequent terms. The condition of Lemma ","element":"span"},{"href":"#id-71","text":"B.10 ","element":"a"},{"text":"is satisfied by the second assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":". 
Then, combining (","element":"span"},{"href":"#id-72","text":"27","element":"a"},{"text":") with Lemma ","element":"span"},{"href":"#id-62","text":"B.4 ","element":"a"},{"text":"we get:","element":"span"}],[{"style":{"width":"102%"},"width":1762,"height":315,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-7.png","element":"img"}],[{"text":"+","element":"span"},{"id":"id-73","text":"8","element":"span"},{"style":{"fontStyle":"italic"},"text":"τ","element":"span"},{"style":{"height":31.78},"width":69.7,"height":79.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-8.png","element":"img","alt":"aτsn2","inline":true}],[{"text":"(28) Now, using Lemma ","element":"span"},{"href":"#id-70","text":"3.4","element":"a"},{"text":", we can bound the last term in (","element":"span"},{"href":"#id-73","text":"28","element":"a"},{"text":") by","element":"span"}],[{"style":{"width":"72%"},"width":1251,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-9.png","element":"img"}],[{"text":"Combining this term with ","element":"span"},{"style":{"height":21.75},"width":258.96,"height":54.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-10.png","element":"img","alt":"3n2 H3SAR2max","inline":true,"padRight":true},{"text":"we obtain the higher order term ","element":"span"},{"style":{"height":27.57},"width":248.82,"height":68.93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-11.png","element":"img","alt":" O( τ 2aτsH3R2maxn2·dm","inline":true,"padRight":true},{"text":"), where we use the fact that the pigeonhole principle implies ","element":"span"},{"style":{"height":16},"width":285.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-12.png","element":"img","alt":" S < τs, A < τa.","inline":true}],[{"text":"This completes the 
proof.","element":"span"}],[{"style":{"width":"1%"},"width":23,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-13.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proofs of data splitting Tabular-MIS estimator.","element":"span"}],[{"text":"We define the fictitious data splitting Tabular-MIS estimator as:","element":"span"}],[{"style":{"width":"19%"},"width":343,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/31-14.png","element":"img"}],[{"text":"where each ","element":"span"},{"style":{"height":21.16},"width":58.86,"height":52.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-0.png","element":"img","alt":" �vπ(i)","inline":true,"padRight":true},{"text":"is the fictitious Tabular-MIS estimator of ","element":"span"},{"style":{"height":21.16},"width":58.86,"height":52.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-1.png","element":"img","alt":" �vπ(i)","inline":true},{"text":". ","element":"span"},{"text":"Moreover, we set all ","element":"span"},{"style":{"height":21.16},"width":305,"height":52.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-2.png","element":"img","alt":"�vπ(1), �vπ(2), ..., �vπ(N) ","inline":true,"padRight":true},{"text":"jointly share the same fictitious parameter ","element":"span"},{"style":{"height":15.1},"width":70.4,"height":37.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-3.png","element":"img","alt":" θM.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-46","style":{"fontStyle":"italic"},"text":"3.6","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":12},"width":50.73,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-4.png","element":"img","alt":" E′","inline":true,"padRight":true},{"text":":= ","element":"span"},{"style":{"height":22.42},"width":124.59,"height":56.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-5.png","element":"img","alt":" {∃ �vπ(i)","inline":true,"padRight":true},{"text":": ","element":"span"},{"style":{"height":22.42},"width":278.47,"height":56.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-6.png","element":"img","alt":" s.t.�vπ(i) ̸= �vπ(i)}","inline":true},{"text":", then an argument similar to ","element":"span"},{"text":"Lemma ","element":"span"},{"href":"#id-62","text":"B.4 ","element":"a"},{"text":"can be derived:","element":"span"}],[{"style":{"width":"57%"},"width":980,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-7.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"96%"},"width":1652,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-8.png","element":"img"}],[{"text":"therefore ","element":"span"},{"style":{"height":17.6},"width":89.47,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-9.png","element":"img","alt":" P[E′","inline":true},{"text":"] will be sufficiently small if ","element":"span"},{"style":{"height":19.11},"width":915.41,"height":47.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-10.png","element":"img","alt":" M ≥ O(Polylog(H, S, A, n)/ mint,st,at dµt (st, at)).","inline":true,"padRight":true},{"text":"By near-uniformity we ","element":"span"},{"style":{"height":19.11},"width":1458.1,"height":47.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-11.png","element":"img","alt":" M ≥ O(Polylog(H, S, A, n)SA) ≥ O(Polylog(H, S, A, 
n)/ mint,st,at dµt (st, at)).","inline":true}],[{"text":"Moreover, by the i.i.d. property and unbiasedness of ","element":"span"},{"style":{"height":21.62},"width":238.59,"height":54.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-12.png","element":"img","alt":" �vπ(i), we have","inline":true}],[{"style":{"width":"74%"},"width":1275,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-13.png","element":"img"}],[{"text":"by the second assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"and Theorem ","element":"span"},{"href":"#id-55","text":"3.1","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"1%"},"width":23,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-14.png","element":"img"}],[{"text":"We now prove Lemma ","element":"span"},{"href":"#id-74","text":"3.9","element":"a"},{"text":", since it serves as an intermediate step in proving Theorem ","element":"span"},{"href":"#id-48","text":"3.8","element":"a"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-74","style":{"fontStyle":"italic"},"text":"3.9","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Note that","element":"span"}],[{"style":{"width":"77%"},"width":1326,"height":265,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-15.png","element":"img"}],[{"text":"therefore by near-uniformity ","element":"span"},{"style":{"height":17.6},"width":1035.66,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-16.png","element":"img","alt":" M > max [O(SA · Polylog(S, H, A, N, 1/δ)), O(Hτaτs)]","inline":true,"padRight":true},{"text":"is sufficient to guarantee the stated result.","element":"span"}],[{"text":"Now we can prove Theorem ","element":"span"},{"href":"#id-48","text":"3.8","element":"a"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-48","style":{"fontStyle":"italic"},"text":"3.8","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"First of all, we have","element":"span"}],[{"style":{"width":"86%"},"width":1481,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/32-17.png","element":"img"}],[{"text":"Now, by Bernstein's inequality, we have","element":"span"}],[{"style":{"width":"67%"},"width":1166,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-0.png","element":"img"}],[{"id":"id-75","style":{"height":21.78},"width":445.21,"height":54.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-1.png","element":"img","alt":"P�|�vπsplit − vπ| > ϵ�= P","inline":true}],[{"text":"(30) Solving (","element":"span"},{"href":"#id-75","text":"30","element":"a"},{"text":") and applying Theorem ","element":"span"},{"href":"#id-55","text":"3.1","element":"a"},{"text":", we 
obtain","element":"span"}],[{"style":{"width":"92%"},"width":1584,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-2.png","element":"img"}],[{"id":"id-76","style":{"height":13.6},"width":63.83,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-3.png","element":"img","alt":"ϵ ≤","inline":true}],[{"text":"(31) As ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"grows large, the square root term in (","element":"span"},{"href":"#id-76","text":"31","element":"a"},{"text":") will dominate, and it seems we only need to consider the square root term in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"and treat the second term as a higher-order term. However, since ","element":"span"},{"style":{"height":17.6},"width":1302.11,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-4.png","element":"img","alt":" M > max [O(SA · Polylog(S, H, A, N, 1/δ)), O(Hτaτs)], N cannot be","inline":true,"padRight":true},{"text":"arbitrarily large given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":". For example, when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n/N ","element":"span"},{"text":"= 1 does not satisfy the condition. 
Therefore, to make the square root term dominate, we need","element":"span"}],[{"id":"id-77","style":{"width":"45%"},"width":777,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-5.png","element":"img"}],[{"text":"This translates to","element":"span"}],[{"style":{"width":"91%"},"width":1575,"height":70,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":12.8},"width":34,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-7.png","element":"img","alt":"Õ","inline":true,"padRight":true},{"text":"absorbs all the Polylog terms.","element":"span"}],[{"text":"Therefore, under condition (","element":"span"},{"href":"#id-77","text":"32","element":"a"},{"text":"), we can indeed absorb the second term in (","element":"span"},{"href":"#id-76","text":"31","element":"a"},{"text":") (as a higher-order term) and combine it with Lemma ","element":"span"},{"href":"#id-74","text":"3.9 ","element":"a"},{"text":"to get that, with probability 1 ","element":"span"},{"style":{"height":15.6},"width":76.68,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-8.png","element":"img","alt":" − δ,","inline":true}],[{"style":{"width":"75%"},"width":1302,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-78","style":{"fontStyle":"italic"},"text":"5.1","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"The non-uniform result of Theorem ","element":"span"},{"href":"#id-48","text":"3.8 ","element":"a"},{"text":"gives:","element":"span"}],[{"style":{"width":"29%"},"width":503,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-10.png","element":"img"}],[{"text":"Note that the class of all nonstationary deterministic policies has cardinality ","element":"span"},{"style":{"height":19.64},"width":361.09,"height":49.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-11.png","element":"img","alt":" |Π| = A^{HS}, which","inline":true,"padRight":true},{"text":"implies log ","element":"span"},{"style":{"height":17.71},"width":311.15,"height":44.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-12.png","element":"img","alt":" |Π| = HS log A","inline":true},{"text":". Therefore, combining Lemma ","element":"span"},{"href":"#id-74","text":"3.9 ","element":"a"},{"text":"with a direct union bound and the multiplicative Chernoff bound, we obtain","element":"span"}],[{"style":{"width":"67%"},"width":1163,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/33-13.png","element":"img"}],[{"id":"id-57","style":{"fontWeight":"bold"},"text":"D ","element":"span"},{"style":{"fontWeight":"bold"},"text":"More details about Empirical Results.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Restate Time-varying, non-mixing Tabular MDP in Section ","element":"span"},{"style":{"fontWeight":"bold"},"text":"4","element":"span"},{"text":".","element":"span"}],[{"text":"There are two states ","element":"span"},{"style":{"height":14.22},"width":173.28,"height":35.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-0.png","element":"img","alt":" s0 and s1","inline":true,"padRight":true},{"text":"and two actions 
","element":"span"},{"style":{"height":14.22},"width":360.09,"height":35.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-1.png","element":"img","alt":" a1 and a2. State s0","inline":true,"padRight":true},{"text":"always has probability 1 going back to itself, regardless of the actions, ","element":"span"},{"style":{"height":17.6},"width":293.1,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-2.png","element":"img","alt":" i.e. Pt(s0|s0, a1","inline":true},{"text":") = 1 and ","element":"span"},{"style":{"height":17.6},"width":209.9,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-3.png","element":"img","alt":" Pt(s0|s0, a2","inline":true},{"text":") = 1. For state ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-4.png","element":"img","alt":" s1","inline":true},{"text":", at each time step there is one action (we call it ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") that has probability 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"/H ","element":"span"},{"text":"going to ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-5.png","element":"img","alt":" s0","inline":true,"padRight":true},{"text":"and the other action (we call it ","element":"span"},{"style":{"height":8.4},"width":39.07,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-6.png","element":"img","alt":" a′","inline":true},{"text":") has probability 1 going back to ","element":"span"},{"style":{"height":11.2},"width":51.38,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-7.png","element":"img","alt":" 
s1,","inline":true}],[{"style":{"width":"72%"},"width":1239,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-8.png","element":"img"}],[{"text":"and which action makes state ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-9.png","element":"img","alt":" s1","inline":true,"padRight":true},{"text":"go to state ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-10.png","element":"img","alt":" s0","inline":true,"padRight":true},{"text":"with probability 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"/H ","element":"span"},{"text":"is decided by a random parameter ","element":"span"},{"style":{"height":11.6},"width":33.96,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-11.png","element":"img","alt":" pt","inline":true,"padRight":true},{"text":"sampled uniformly from [0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1]. If ","element":"span"},{"style":{"height":15.6},"width":298.42,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-12.png","element":"img","alt":" pt < 0.5, a = a1","inline":true,"padRight":true},{"text":"and if ","element":"span"},{"style":{"height":15.6},"width":298.43,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-13.png","element":"img","alt":" pt ≥ 0.5, a = a2","inline":true},{"text":". These ","element":"span"},{"style":{"height":11.6},"width":168,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-14.png","element":"img","alt":" p1, ..., pH","inline":true,"padRight":true},{"text":"are generated by a sequence of pseudo-random numbers. 
Moreover, the agent receives reward 1 at each time step if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > H/","element":"span"},{"text":"2 and is in state ","element":"span"},{"style":{"height":10.62},"width":37.46,"height":26.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-15.png","element":"img","alt":" s0","inline":true},{"text":", and receives reward 0 otherwise. Lastly, we define the logging policy to be uniform:","element":"span"}],[{"style":{"width":"65%"},"width":1125,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-16.png","element":"img"}],[{"text":"We define the target policy ","element":"span"},{"style":{"height":8},"width":25,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-17.png","element":"img","alt":" π","inline":true},{"text":" as:","element":"span"}],[{"style":{"width":"65%"},"width":1125,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-18.png","element":"img"}],[{"text":"We simulate this non-stationary MDP model in the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"Python ","element":"span"},{"text":"environment, and the pseudo-random numbers ","element":"span"},{"style":{"height":11.6},"width":33.96,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-19.png","element":"img","alt":" pt","inline":true},{"text":"’s are generated with the seed fixed via ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"numpy.random.seed(100)","element":"span"},{"text":".","element":"span"}],[{"text":"We run each method under ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"= 100 macro-replications with data ","element":"span"},{"style":{"height":39.97},"width":633.73,"height":99.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-20.png","element":"img","alt":" D^{(k)} = {(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})}_{i∈[n], t∈[H]}^{(k)},","inline":true,"padRight":true},{"text":"and use each ","element":"span"},{"style":{"height":19.95},"width":135.04,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-21.png","element":"img","alt":" D(k) (k","inline":true,"padRight":true},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., K","element":"span"},{"text":") to construct an estimator ","element":"span"},{"style":{"height":21.16},"width":58.04,"height":52.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-22.png","element":"img","alt":" �vπ[k]","inline":true},{"text":". The (empirical) RMSE ","element":"span"},{"text":"is then computed as:","element":"span"}],[{"style":{"width":"109%"},"width":1871,"height":318,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/34-23.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Other generic IS-based estimators. ","element":"span"},{"text":"There are other importance-sampling-based estimators, including ","element":"span"},{"style":{"fontStyle":"italic"},"text":"weighted importance sampling ","element":"span"},{"text":"(WIS) and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"importance sampling with stationary state distribution ","element":"span"},{"text":"(SSD-IS, ","element":"span"},{"href":"#id-0","referenceIndex":22,"text":"Liu et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-0","referenceIndex":22,"text":"2018a","element":"a"},{"text":")). Empirical comparisons ","element":"span"},{"text":"including these methods are reported in ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"Xie et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"2019","element":"a"},{"text":"), where they were shown empirically to perform worse than SMIS. 
Because of that, we only focus on comparing SMIS and TMIS in our simulation study.","element":"span"}],[{"id":"id-52","style":{"width":"100%"},"width":1719,"height":545,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.10742/images/35-0.png","element":"img"}]]}],"_version":"3.3.2"},"paperNode":"$28:props:children:props:children:0:props:product"}]]]}]}]