28:["$","$L31",null,{"isWhiteLabelled":false,"children":["$","$Lc",null,{"pt":{"compact":0,"expanded":3},"children":[["$","$L32",null,{"noStar":true,"publisher":true,"task":true,"params":true,"size":"xl","product":{"id":"eyJwYXBlcklEIjoiMTYwNC4wMDkyMyIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","updated":"2016-04-04T15:56:52.000Z","paperID":"1604.00923","published":"2016-04-04T15:56:52.000Z","authors":"[\"Philip S. Thomas\",\"Emma Brunskill\"]","title":"Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning","scoreTrending":null,"summary":"In this paper we present a new way of predicting the performance of a\nreinforcement learning policy given historical data that may have been\ngenerated by a different policy. The ability to evaluate a policy from\nhistorical data is important for applications where the deployment of a bad\npolicy can be dangerous or costly. We show empirically that our algorithm\nproduces estimates that often have orders of magnitude lower mean squared error\nthan existing methods---it makes more efficient use of the available data. 
Our\nnew estimator is based on two advances: an extension of the doubly robust\nestimator (Jiang and Li, 2015), and a new way to mix between model based\nestimates and importance sampling based estimates.","lastCheckedForCode":"2022-09-02T11:52:29.065Z","links":[{"id":"eyJ1cmwiOiJodHRwczovL3BhcGVyc3dpdGhjb2RlLmNvbS9wYXBlci9kYXRhLWVmZmljaWVudC1vZmYtcG9saWN5LXBvbGljeS1ldmFsdWF0aW9uIn0=","type":"pwc","url":"https://paperswithcode.com/paper/data-efficient-off-policy-policy-evaluation","data":null}],"reposConnection":{"edges":[{"official":null,"node":{"id":"eyJyZXBvSUQiOiIzMzk1NzQzNDAiLCJzb3VyY2UiOiJnaXRodWIifQ==","source":"github","repoID":"339574340","url":"https://github.com/adityamodi/magic-oppe","title":"magic-oppe","language":"python","stars":0,"forks":0,"framework":null,"scoreTrending":null,"updated":null,"created":null,"downloads":null,"likes":null,"owner":[{"username":"adityamodi","avatar":"https://avatars.githubusercontent.com/u/4456757?v=4"}]}}]},"models":[],"tags":[],"summaries":[{"model":"gpt-4o-mini","header":"paper.summary.expertise.beginner","summary":"This paper introduces a new method for predicting how well a reinforcement learning policy will perform using historical data from a different policy. The method, called MAGIC, combines advantages from existing techniques to provide more accurate predictions with less error. The researchers show that MAGIC often does much better than older methods when estimating performance, which is important to avoid costly mistakes when using AI in real-world situations like recommending ads or medical treatments."}],"emailsConnection":{"edges":[]},"__typename":"paper","authorArray":["Philip S. 
Thomas","Emma Brunskill"]}}],["$","$L25",null,{"container":true,"columns":100,"spacing":{"compact":0,"expanded":2,"large":3},"children":[["$","$L25",null,{"size":{"compact":100,"expanded":100,"large":68},"children":[["$","$8",null,{"children":["$","$L33",null,{"publisher":"arxiv","paperID":"1604.00923","product":{"paper":"$28:props:children:props:children:0:props:product","models":"$28:props:children:props:children:0:props:product:models"},"isWhiteLabelled":false}]}],["$","$8",null,{"children":["$","$L34",null,{"article":"$L35","model":"$undefined"}]}]]}],["$","$L25",null,{"size":"grow","children":["$","$L36",null,{}]}]]}],["$","$8",null,{"children":null}],[["$","audio",null,{"id":"tts"}],["$","$L37",null,{"paperID":"1604.00923","publisher":"arxiv","paperJSON":{"title":"Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning","paperID":"1604.00923","avgLineHeight":11.63,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have orders of magnitude lower mean squared error than existing methods—it makes more efficient use of the available data. Our new estimator is based on two advances: an extension of the doubly robust estimator (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":"), and a new way to mix between model based estimates and importance sampling based estimates.","element":"span"}]]},{"heading":"1. 
Introduction","paragraphs":[[{"text":"The ability to predict the performance of a policy without actually having to use it is crucial to the responsible use of reinforcement learning algorithms. ","element":"span"},{"text":"Consider the setting where the user of a reinforcement learning algorithm has already deployed some policy, e.g., for determining which advertisement to show a user visiting a website (","element":"span"},{"text":"Theocharous et al.","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":"), for determining which medical treatment to suggest for a patient (","element":"span"},{"text":"Thapa et al.","element":"span"},{"text":", ","element":"span"},{"text":"2005","element":"span"},{"text":"), or for suggesting a personalized curriculum for a student (","element":"span"},{"text":"Mandel et al.","element":"span"},{"text":", ","element":"span"},{"text":"2014","element":"span"},{"text":"). In these examples, using a bad policy can be costly or dangerous, so it is important that the user of a reinforcement learning algorithm be able to accurately predict how well a new policy will perform without having to deploy it.","element":"span"}],[{"text":"In this paper we propose a new algorithm for tackling this performance prediction problem, which is called the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"off-policy policy evaluation ","element":"span"},{"text":"(OPE) problem. The primary objective in OPE problems is to produce estimates that minimize some notion of error. We select mean squared error, a popular notion of error for estimators, as our loss function. 
This is in line with previous works that all use (root) mean squared error when empirically validating their methods (","element":"span"},{"text":"Precup et al.","element":"span"},{"text":", ","element":"span"},{"text":"2000","element":"span"},{"text":"; ","element":"span"},{"text":"Dudík et al.","element":"span"},{"text":", ","element":"span"},{"text":"2011","element":"span"},{"text":"; ","element":"span"},{"text":"Mahmood et al.","element":"span"},{"text":", ","element":"span"},{"text":"2014","element":"span"},{"text":"; ","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":"; ","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":").","element":"span"}],[{"text":"Given ","element":"span"},{"text":"this ","element":"span"},{"text":"goal, ","element":"span"},{"text":"an ","element":"span"},{"text":"estimator ","element":"span"},{"text":"should ","element":"span"},{"text":"be ","element":"span"},{"text":"strongly consistent: its mean squared error should converge almost surely to zero as the amount of available data increases.","element":"span"},{"text":"1 ","element":"span"},{"text":"In this paper we introduce a new strongly consistent estimator, MAGIC, that directly optimizes mean squared error. Our empirical results show that MAGIC can produce estimates with orders of magnitude lower mean squared error than the estimates produced by existing algorithms.","element":"span"}],[{"text":"Our new algorithm comes from the synthesis of two new contributions. ","element":"span"},{"text":"The first contribution is an extension of the recently proposed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"doubly robust ","element":"span"},{"text":"(DR) OPE algorithm (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":"). 
We present a novel derivation of their algorithm that removes the assumption that the horizon is finite and known. We also give conditions under which the DR estimator is strongly consistent. We then show how we can significantly reduce the variance of the DR estimator by introducing a small amount of bias—an effective trade-off when attempting to minimize the mean squared error of the estimates. We call our extension of the DR estimator the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"weighted doubly robust ","element":"span"},{"text":"(WDR) estimator.","element":"span"}],[{"text":"Our second major contribution is a new estimator, which we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"blending IS and model ","element":"span"},{"text":"(BIM) estimator, that combines two different OPE estimators not just by selecting between them, but by blending them together in a way that minimizes the mean squared error. The combination of these two contributions results in a particularly powerful new OPE algorithm that we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"model and guided importance sampling combining ","element":"span"},{"text":"(MAGIC) estimator, which uses BIM to combine a purely model-based estimator with WDR. In our simulations, MAGIC has the best general performance, often exhibiting orders of magnitude lower mean squared error than prior state-of-the-art estimators.","element":"span"}]]},{"heading":"2. 
Notation","paragraphs":[[{"text":"We assume that the reader is familiar with reinforcement learning (","element":"span"},{"text":"Sutton & Barto","element":"span"},{"text":", ","element":"span"},{"text":"1998","element":"span"},{"text":") and adopt notational standard MDPNv1 for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov decision processes ","element":"span"},{"text":"(","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015a","element":"span"},{"text":", MDPs). ","element":"span"},{"text":"For simplicity, our notation assumes that the state, action, and reward sets are finite, although our results carry over to more general settings.","element":"span"},{"text":"2 ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":15.6},"width":440.56,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-0.png","element":"img","alt":" H := (S_0, A_0, R_0, S_1, . . . )","inline":true,"padRight":true},{"text":"be a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"trajectory","element":"span"},{"text":",","element":"span"},{"text":"3 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":17.65},"width":346.34,"height":44.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-1.png","element":"img","alt":" g(H) := ∑_{t=0}^{∞} γ^t R_t","inline":true,"padRight":true},{"text":"denote the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"return ","element":"span"},{"text":"of a trajectory. ","element":"span"},{"text":"
We assume that the (possibly unknown) minimum and maximum rewards, ","element":"span"},{"style":{"height":9.13},"width":60.16,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-2.png","element":"img","alt":" r_min","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":64.66,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-3.png","element":"img","alt":" r_max","inline":true},{"text":", are finite and that ","element":"span"},{"style":{"height":15.6},"width":172.51,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-4.png","element":"img","alt":"γ ∈ [0, 1]","inline":true,"padRight":true},{"text":"for the finite-horizon setting and ","element":"span"},{"style":{"height":15.6},"width":176.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-5.png","element":"img","alt":" γ ∈ [0, 1)","inline":true,"padRight":true},{"text":"for the indefinite and infinite horizon settings so that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":") ","element":"span"},{"text":"is bounded. 
","element":"span"},{"text":"We use the discounted objective function, ","element":"span"},{"style":{"height":15.6},"width":434.54,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-6.png","element":"img","alt":"v(π) := E[g(H)|H ∼ π]","inline":true},{"text":", where ","element":"span"},{"style":{"height":10.8},"width":131.56,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-7.png","element":"img","alt":" H ∼ π","inline":true,"padRight":true},{"text":"denotes that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"was generated using the policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-8.png","element":"img","alt":" π","inline":true},{"text":". When dealing with multiple trajectories, we use superscripts to denote which trajectory a term comes from. For example, we write ","element":"span"},{"style":{"height":17.06},"width":52.02,"height":42.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-9.png","element":"img","alt":" SHt","inline":true,"padRight":true},{"text":"to denote the state at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"during trajectory ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". 
Let ","element":"span"},{"style":{"height":10.43},"width":38.19,"height":26.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-10.png","element":"img","alt":" v^π","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.63},"width":36.7,"height":34.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-11.png","element":"img","alt":" q^π","inline":true,"padRight":true},{"text":"be the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"state value function ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"state-action value function ","element":"span"},{"text":"for policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-12.png","element":"img","alt":" π","inline":true},{"text":": for all ","element":"span"},{"style":{"height":15.6},"width":403.49,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-13.png","element":"img","alt":" (π, s, a) ∈ Π × S × A","inline":true},{"text":", let ","element":"span"},{"style":{"height":17.65},"width":576.32,"height":44.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-14.png","element":"img","alt":" v^π(s) := E[∑_{t=0}^{∞} γ^t R_t | S_0 = s, π]","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":188.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-15.png","element":"img","alt":" q^π(s, a) :=","inline":true},{"style":{"height":17.65},"width":535.32,"height":44.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-16.png","element":"img","alt":"E[∑_{t=0}^{∞} γ^t R_t | S_0 = s, A_0 = a, π]","inline":true},{"text":". 
Notice that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v ","element":"span"},{"text":"without ","element":"span"},{"text":"a superscript denotes the objective function, while ","element":"span"},{"style":{"height":10.43},"width":38.19,"height":26.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-17.png","element":"img","alt":" v^π","inline":true,"padRight":true},{"text":"denotes a value function.","element":"span"}],[{"text":"We will assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"historical data","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", is provided. Formally, this historical data is a set of ","element":"span"},{"style":{"height":14.33},"width":183.89,"height":35.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-18.png","element":"img","alt":" n ∈ ℕ_{>0}","inline":true,"padRight":true},{"text":"trajectories and the known policies, called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"behavior policies","element":"span"},{"text":", that were used to generate them. ","element":"span"},{"text":"That is, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":":= ","element":"span"},{"style":{"height":15.64},"width":216.12,"height":39.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-19.png","element":"img","alt":"{(H_i, π_i)}_{i=1}^{n}","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.13},"width":177.49,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-20.png","element":"img","alt":" H_i ∼ π_i","inline":true},{"text":". 
","element":"span"},{"text":"Importantly, when we write ","element":"span"},{"style":{"height":13.13},"width":43.24,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-21.png","element":"img","alt":" H_i","inline":true},{"text":", we ","element":"span"},{"style":{"fontStyle":"italic"},"text":"always ","element":"span"},{"text":"mean that ","element":"span"},{"style":{"height":13.13},"width":173.18,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-22.png","element":"img","alt":" H_i ∼ π_i","inline":true},{"text":". ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":20.32},"width":820.66,"height":50.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-23.png","element":"img","alt":"ρ_t(H, π_e, π_b) := ∏_{i=0}^{t} π_e(A_i^H | S_i^H) / π_b(A_i^H | S_i^H)","inline":true,"padRight":true},{"text":"be an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"importance weight","element":"span"},{"text":", which is the probability of the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"steps of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"under the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"evaluation policy","element":"span"},{"text":", ","element":"span"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-24.png","element":"img","alt":" π_e","inline":true},{"text":", divided by its probability under the behavior policy ","element":"span"},{"style":{"height":9.13},"width":36.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-25.png","element":"img","alt":" π_b","inline":true,"padRight":true},{"text":"(","element":"span"},{"text":"Precup et al.","element":"span"},{"text":", ","element":"span"},{"text":"2000","element":"span"},{"text":", Section 2). 
For brevity, we write ","element":"span"},{"style":{"height":16.66},"width":32.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-26.png","element":"img","alt":" ρ_t^i","inline":true,"padRight":true},{"text":"as shorthand for ","element":"span"},{"style":{"height":15.6},"width":217.18,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-27.png","element":"img","alt":"ρ_t(H_i, π_e, π_i)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-28.png","element":"img","alt":" ρ_t","inline":true,"padRight":true},{"text":"as shorthand for ","element":"span"},{"style":{"height":15.6},"width":207.93,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-29.png","element":"img","alt":" ρ_t(H, π_e, π_b)","inline":true},{"text":". To simplify later expressions, let ","element":"span"},{"style":{"height":16.68},"width":141.59,"height":41.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-30.png","element":"img","alt":" ρ_{−1}^i := 1","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". One of the ","element":"span"},{"text":"primary challenges will be to combat the high variance and large range of the importance weights, ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-31.png","element":"img","alt":" ρ_t","inline":true},{"text":".","element":"span"}],[{"text":"Some of the methods that we describe use an approximate model of an MDP. 
Let ","element":"span"},{"style":{"height":15.6},"width":156.85,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-32.png","element":"img","alt":" ˆrπ(s, a, t)","inline":true,"padRight":true},{"text":"denote the model’s prediction of the reward ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"steps later, ","element":"span"},{"style":{"height":13.13},"width":41.44,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-33.png","element":"img","alt":" Rt","inline":true},{"text":", if ","element":"span"},{"style":{"height":13.6},"width":262.01,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-34.png","element":"img","alt":" S0 = s, A0 = a","inline":true},{"text":", and the policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-35.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is used to generate the subsequent actions, ","element":"span"},{"style":{"height":14.4},"width":196.6,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-36.png","element":"img","alt":"A1, A2, . . . 
.","inline":true,"padRight":true},{"text":"For example, ","element":"span"},{"style":{"height":15.6},"width":162.23,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-37.png","element":"img","alt":" ˆrπ(s, a, 0)","inline":true,"padRight":true},{"text":"is a prediction of the immediate reward after taking action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"in state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"and is thus the same for all policies, ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-38.png","element":"img","alt":" π","inline":true},{"text":". We assume that these predictions are bounded by finite (possibly unknown) constants ","element":"span"},{"style":{"height":17.61},"width":86.89,"height":44.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-39.png","element":"img","alt":"rmodelmin","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.06},"width":86.89,"height":42.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-40.png","element":"img","alt":" rmodelmax","inline":true,"padRight":true},{"text":", i.e., ","element":"span"},{"style":{"height":17.61},"width":420.08,"height":44.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-41.png","element":"img","alt":" ˆrπ(s, a, t) ∈ [rmodelmin , rmodelmax ]","inline":true},{"text":". 
Let","element":"span"}],[{"style":{"width":"78%"},"width":711,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-42.png","element":"img"}],[{"text":"be a prediction of ","element":"span"},{"style":{"height":13.13},"width":41.44,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-43.png","element":"img","alt":" Rt","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":13.13},"width":123.4,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-44.png","element":"img","alt":" S0 = s","inline":true,"padRight":true},{"text":"and the policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-45.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is used to generate actions ","element":"span"},{"style":{"height":14.4},"width":172.9,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-46.png","element":"img","alt":" A0, A1, . . .","inline":true,"padRight":true},{"text":", for all ","element":"span"},{"style":{"height":15.6},"width":264.12,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-47.png","element":"img","alt":" (s, t, π) ∈ S ×","inline":true},{"style":{"height":15.13},"width":149.04,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-48.png","element":"img","alt":"N≥0 × Π","inline":true},{"text":". 
Let ","element":"span"},{"style":{"height":17.65},"width":412.29,"height":44.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-49.png","element":"img","alt":" v̂^π(s) := ∑_{t=0}^{∞} γ^t r̂^π(s, t)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":178.54,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-50.png","element":"img","alt":" q̂^π(s, a) :=","inline":true},{"style":{"height":17.65},"width":293.06,"height":44.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-51.png","element":"img","alt":"∑_{t=0}^{∞} γ^t r̂^π(s, a, t)","inline":true,"padRight":true},{"text":"be the model’s estimates of ","element":"span"},{"style":{"height":15.6},"width":89.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-52.png","element":"img","alt":" v^π(s)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":125.75,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-53.png","element":"img","alt":"q^π(s, a)","inline":true},{"text":". 
We assume that if a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"terminal absorbing state","element":"span"},{"text":",","element":"span"},{"style":{"height":14.98},"width":31,"height":37.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-54.png","element":"img","alt":"s_∞","inline":true,"padRight":true},{"text":", is reached, the model’s predictions of rewards that occur thereafter are always zero: ","element":"span"},{"style":{"height":18.98},"width":245.47,"height":47.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-55.png","element":"img","alt":" r̂^π(s_∞, a, t) = 0","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":15.6},"width":122.55,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-56.png","element":"img","alt":" (π, a, t)","inline":true},{"style":{"height":15.53},"width":262.85,"height":38.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-57.png","element":"img","alt":"∈ Π × A × ℕ_{≥0}","inline":true},{"text":". Although better models will tend to improve our estimates, we make no assumptions about the veracity of the approximate model’s predictions, ","element":"span"},{"style":{"height":15.6},"width":156.85,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-58.png","element":"img","alt":" r̂^π(s, a, t)","inline":true},{"text":".","element":"span"}]]},{"heading":"3. Off-Policy Policy Evaluation (OPE)","paragraphs":[[{"text":"The problem of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"off-policy policy evaluation ","element":"span"},{"text":"(OPE) is defined as follows. 
","element":"span"},{"text":"We are given an evaluation policy, ","element":"span"},{"style":{"height":9.12},"width":37.11,"height":22.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-59.png","element":"img","alt":"π_e","inline":true},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"historical data","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", and an approximate model. Our goal is to produce an estimator, ","element":"span"},{"style":{"height":15.6},"width":83.45,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-60.png","element":"img","alt":" v̂(D)","inline":true},{"text":", of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-61.png","element":"img","alt":" v(π_e)","inline":true,"padRight":true},{"text":"that has low ","element":"span"},{"style":{"fontStyle":"italic"},"text":"mean squared error ","element":"span"},{"text":"(MSE): ","element":"span"},{"style":{"height":15.6},"width":365.37,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-62.png","element":"img","alt":" MSE(v̂(D), v(π_e)) :=","inline":true},{"style":{"height":28},"width":357.45,"height":70,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-63.png","element":"img","alt":"E[(v̂(D) − v(π_e))^2].","inline":true,"padRight":true},{"text":"We use capital letters to denote random variables, and so the random terms in expected values are always the capitalized letters (e.g., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"is a random variable). We assume that the process producing states, actions, and rewards is an MDP with an unknown initial state distribution, transition function, and reward function. 
We assume that the evaluation policy, ","element":"span"},{"style":{"height":9.12},"width":37.1,"height":22.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-64.png","element":"img","alt":" πe","inline":true},{"text":", the behavior policies, ","element":"span"},{"style":{"height":15.6},"width":284.64,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-65.png","element":"img","alt":" πi, i ∈ {1, . . . , n}","inline":true},{"text":", and the discount parameter, ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-66.png","element":"img","alt":" γ","inline":true},{"text":", are known. For a review of OPE methods, see the works of ","element":"span"},{"text":"Precup et al. ","element":"span"},{"text":"(","element":"span"},{"text":"2000","element":"span"},{"text":") or ","element":"span"},{"text":"Thomas ","element":"span"},{"text":"(","element":"span"},{"text":"2015b","element":"span"},{"text":", Chapter 3). More recent methods can be found in the works of ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":") and ","element":"span"},{"text":"Mandel et al. ","element":"span"},{"text":"(","element":"span"},{"text":"2016","element":"span"},{"text":").","element":"span"}]]},{"heading":"4. 
Doubly Robust (DR) Estimator","paragraphs":[[{"text":"The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"doubly robust ","element":"span"},{"text":"(DR) estimator (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":") is a new unbiased estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/1-67.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"that achieves promising empirical and theoretical results by leveraging an approximate model of an MDP to decrease the variance of the unbiased estimates produced by ordinary importance sampling (","element":"span"},{"text":"Precup et al.","element":"span"},{"text":", ","element":"span"},{"text":"2000","element":"span"},{"text":"). It is doubly robust in that it will provide “good” estimates if either ","element":"span"},{"style":{"fontWeight":"bold"},"text":"1) ","element":"span"},{"text":"the model is accurate or ","element":"span"},{"style":{"fontWeight":"bold"},"text":"2) ","element":"span"},{"text":"the behavior policies are known. By “good” it is meant that if the former does not hold then the estimator will remain unbiased (although it might have high variance and thus high mean squared error), and if the latter does not hold then if the model has low error the doubly robust estimator will also tend to have low error. 
Doubly robust estimators were introduced and remain popular in the statistics community (","element":"span"},{"text":"Rotnitzky & Robins","element":"span"},{"text":", ","element":"span"},{"text":"1995","element":"span"},{"text":"; ","element":"span"},{"text":"Heejung & Robins","element":"span"},{"text":", ","element":"span"},{"text":"2005","element":"span"},{"text":").","element":"span"}],[{"text":"The work that introduced the DR estimator for MDPs (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":") derived it as a generalization of a doubly robust estimator for bandits (","element":"span"},{"text":"Dudík et al.","element":"span"},{"text":", ","element":"span"},{"text":"2011","element":"span"},{"text":"). This may be why the DR estimator was derived only for the finite horizon setting where the horizon is known (every trajectory must terminate within ","element":"span"},{"style":{"height":11.6},"width":132.32,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-0.png","element":"img","alt":" L < ∞","inline":true,"padRight":true},{"text":"time steps, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"must be known). It also resulted in a recursive definition of the DR estimator that can be difficult to interpret. In Appendix ","element":"span"},{"text":"B ","element":"span"},{"text":"we instead derive the DR estimator for MDPs as an application of control variates. 
Our new derivation holds without assumptions on the horizon and gives the intuitive non-recursive definition, where ","element":"span"},{"style":{"height":16.83},"width":169.52,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-1.png","element":"img","alt":" wit = ρit/n","inline":true},{"text":":","element":"span"}],[{"style":{"width":"97%"},"width":887,"height":237,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-2.png","element":"img"}],[{"text":"In Appendix ","element":"span"},{"text":"B ","element":"span"},{"text":"we show that this definition is equivalent to that of ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":") when the horizon is finite and known, and we provide several new theoretical results pertaining to the DR estimator. Specifically, we give conditions for DR to be an unbiased estimator without assumptions on the horizon, and we give the first proofs that it is a strongly consistent estimator. Although these are important properties to establish, we relegate them to an appendix due to space limitations.","element":"span"}],[{"text":"The non-recursive definition of the DR estimator presented in (","element":"span"},{"text":"2","element":"span"},{"text":") also reveals the close relationship of the DR estimator to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"advantage sum ","element":"span"},{"text":"estimators. Advantage sum estimators were introduced as a way to lower the variance of on-policy Monte Carlo performance estimates for a setting that is a generalization of the (partially observable) MDP setting (","element":"span"},{"text":"Zinkevich et al.","element":"span"},{"text":", ","element":"span"},{"text":"2006","element":"span"},{"text":"; ","element":"span"},{"text":"White & Bowling","element":"span"},{"text":", ","element":"span"},{"text":"2009","element":"span"},{"text":"). 
The DR estimator for the on-policy setting can be found in the work of ","element":"span"},{"text":"Zinkevich et al. ","element":"span"},{"text":"(","element":"span"},{"text":"2006","element":"span"},{"text":", Equation 8). One may therefore view the DR estimator (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":") as the extension of the advantage sum estimator (","element":"span"},{"text":"Zinkevich et al.","element":"span"},{"text":", ","element":"span"},{"text":"2006","element":"span"},{"text":") to the off-policy setting or as the extension of the doubly robust estimator for bandits (","element":"span"},{"text":"Dudík et al.","element":"span"},{"text":", ","element":"span"},{"text":"2011","element":"span"},{"text":") to the sequential setting. We are therefore not the first to show that the DR estimator can be viewed as an application of control variates, since ","element":"span"},{"text":"Veness et al. ","element":"span"},{"text":"(","element":"span"},{"text":"2011","element":"span"},{"text":", Section 3.1) point out that the advantage sum estimator is an application of control variates. Still, our derivation in Appendix ","element":"span"},{"text":"B ","element":"span"},{"text":"of the DR estimator is novel.","element":"span"}],[{"text":"The DR estimator is not purely model based, since it uses importance weights. However, it is also not a model-free importance sampling method, since it uses an approximate model to decrease the variance of its estimates. We therefore refer to it as a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"guided importance sampling ","element":"span"},{"text":"method, since the approximate model is used to guide, but not completely replace, the importance sampling estimates.","element":"span"}]]},{"heading":"5. 
Weighted Doubly Robust (WDR) Estimator","paragraphs":[[{"text":"Empirical and theoretical results show that the DR estimator developed, analyzed, and tested by ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":") can significantly reduce the variance of ordinary importance sampling without introducing bias. ","element":"span"},{"text":"The fact that it does not introduce bias can be particularly important when the estimator is used to produce confidence bounds on ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-3.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"(","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":"). ","element":"span"},{"text":"However, in practice these confidence bounds often require an impractical amount of data before they are tight enough to be useful, and so approximate confidence bounds (e.g., bootstrap confidence bounds) are used instead (","element":"span"},{"text":"Theocharous et al.","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":"; ","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":"). ","element":"span"},{"text":"When using these approximate confidence bounds, the strict requirement that an OPE estimator be an unbiased estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-4.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"is not necessary. 
Furthermore, often the goal of OPE is not to produce confidence bounds, but to produce the best estimate of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-5.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"possible, in order to determine whether ","element":"span"},{"style":{"height":9.13},"width":37.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-6.png","element":"img","alt":" πe","inline":true,"padRight":true},{"text":"should be used instead of the current behavior policy or as an internal mechanism in a policy search algorithm (","element":"span"},{"text":"Levine & Koltun","element":"span"},{"text":", ","element":"span"},{"text":"2013","element":"span"},{"text":"). In these cases, the “best” estimator is typically defined as the one that has the lowest ","element":"span"},{"style":{"fontStyle":"italic"},"text":"mean squared error ","element":"span"},{"text":"(MSE). For example, in their experiments, ","element":"span"},{"text":"Precup et al. ","element":"span"},{"text":"(","element":"span"},{"text":"2000","element":"span"},{"text":"), ","element":"span"},{"text":"Dudík et al. ","element":"span"},{"text":"(","element":"span"},{"text":"2011","element":"span"},{"text":"), ","element":"span"},{"text":"Mahmood et al. ","element":"span"},{"text":"(","element":"span"},{"text":"2014","element":"span"},{"text":"), ","element":"span"},{"text":"Thomas ","element":"span"},{"text":"(","element":"span"},{"text":"2015b","element":"span"},{"text":"), and ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":") all use the (root) MSE when evaluating OPE methods.","element":"span"}],[{"text":"Although unbiasedness might seem like a desirable property of an estimator, when the goal is to minimize MSE, it often is not. 
In general, the MSE of an estimator, ","element":"span"},{"style":{"height":14.89},"width":23.18,"height":37.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-7.png","element":"img","alt":"ˆθ","inline":true},{"text":", of a statistic, ","element":"span"},{"style":{"height":10.8},"width":18,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-8.png","element":"img","alt":" θ","inline":true},{"text":", can be decomposed into its variance and its squared bias: ","element":"span"},{"style":{"height":18.89},"width":636.15,"height":47.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-9.png","element":"img","alt":" MSE(ˆθ, θ) = E[(θ − ˆθ)2] = Var(ˆθ) +","inline":true},{"style":{"height":18.89},"width":137.37,"height":47.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-10.png","element":"img","alt":"Bias(ˆθ)2","inline":true},{"text":", where ","element":"span"},{"style":{"height":18.89},"width":335.09,"height":47.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/2-11.png","element":"img","alt":" Bias(ˆθ) := E[ˆθ] − θ","inline":true},{"text":". The optimal estimator in terms of MSE is typically one that balances this bias-variance trade-off, not one with zero bias. Therefore, in the context of minimizing MSE, strong asymptotic consistency, which requires the MSE of an estimator to almost surely converge to zero as the amount of available data increases, is a more desirable property than unbiasedness.","element":"span"}],[{"text":"In this section we propose a new OPE estimator that we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"weighted doubly robust ","element":"span"},{"text":"(WDR) estimator. The WDR estimator comes from applying a simple well-known extension to importance sampling estimators to the DR estimator to produce a new guided importance sampling method. 
This extension does not directly optimize the bias-variance trade-off, but it does tend to balance it significantly better while maintaining asymptotic consistency. More specifically, WDR is based on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"weighted importance sampling ","element":"span"},{"text":"(","element":"span"},{"text":"Powell & Swann","element":"span"},{"text":", ","element":"span"},{"text":"1966","element":"span"},{"text":") as opposed to ordinary importance sampling (","element":"span"},{"text":"Hammersley & Handscomb","element":"span"},{"text":", ","element":"span"},{"text":"1964","element":"span"},{"text":"). For further discussion of the benefits of weighted importance sampling over ordinary importance sampling, see the work of ","element":"span"},{"text":"Thomas ","element":"span"},{"text":"(","element":"span"},{"text":"2015b","element":"span"},{"text":", Section 3.8). ","element":"span"},{"text":"Weighted importance sampling has been used before for OPE (","element":"span"},{"text":"Precup et al.","element":"span"},{"text":", ","element":"span"},{"text":"2000","element":"span"},{"text":"), but not in conjunction with the DR estimator.","element":"span"}],[{"text":"Our WDR estimator is defined as the DR estimator in (","element":"span"},{"text":"2","element":"span"},{"text":"), except where ","element":"span"},{"style":{"height":21.31},"width":317.93,"height":53.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-0.png","element":"img","alt":" wit := ρit/ Σnj=1 ρjt","inline":true},{"text":".","element":"span"},{"text":"4 ","element":"span"},{"text":"Intuitively it is ","element":"span"},{"text":"clear that this estimator is asymptotically correct because ","element":"span"},{"style":{"height":18.65},"width":160.36,"height":46.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-1.png","element":"img","alt":"E[ρjt] = 1","inline":true},{"text":", and so by the law of large numbers the 
","element":"span"},{"text":"denominator of ","element":"span"},{"style":{"height":16.66},"width":39.8,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-2.png","element":"img","alt":" wit","inline":true,"padRight":true},{"text":"will converge to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":". Although WDR is not an ","element":"span"},{"text":"unbiased estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-3.png","element":"img","alt":" v(πe)","inline":true},{"text":", its bias follows a pattern that is both predictable and also sometimes desirable. When there is only a single trajectory, i.e., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 1","element":"span"},{"text":", ","element":"span"},{"text":"WDR(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"is an unbiased estimator of the performance of the behavior policy, since ","element":"span"},{"style":{"height":16.66},"width":122.2,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-4.png","element":"img","alt":" w1t = 1","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
If there is a single behavior ","element":"span"},{"text":"policy, ","element":"span"},{"style":{"height":9.13},"width":36.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-5.png","element":"img","alt":" πb","inline":true},{"text":", as the number of trajectories increases, the expected value of ","element":"span"},{"text":"WDR(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"shifts away from ","element":"span"},{"style":{"height":15.6},"width":87.95,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-6.png","element":"img","alt":" v(πb)","inline":true,"padRight":true},{"text":"towards ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-7.png","element":"img","alt":"v(πe)","inline":true},{"text":".","element":"span"}],[{"text":"Before presenting theoretical results about the WDR estimator, we introduce assumptions that they will require. These assumptions are only included if they are explicitly mentioned in a theorem; most theorems rely on only a few of these assumptions. Even when these assumptions are not satisfied, it does ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"mean that the result does not hold or that the WDR estimator will perform poorly—it merely means that the theoretical results that we provide are not guaranteed by our proofs.","element":"span"}],[{"text":"Assumption ","element":"span"},{"text":"1 ","element":"span"},{"text":"ensures that all trajectories of interest when evaluating ","element":"span"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-8.png","element":"img","alt":" πe","inline":true,"padRight":true},{"text":"will be produced by all of the behavior policies. 
This is a standard assumption in OPE and typically precludes the use of deterministic behavior policies.","element":"span"},{"text":"5","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 1 ","element":"span"},{"text":"(Absolute continuity)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":15.6},"width":164.32,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-9.png","element":"img","alt":" (s, a, i) ∈","inline":true}],[{"style":{"height":15.6},"width":191.54,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-10.png","element":"img","alt":"S × A × {1","inline":true},{"style":{"fontStyle":"italic"},"text":", . . . , n","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"style":{"fontStyle":"italic"},"text":", if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"π","element":"span"},{"style":{"height":15.6},"width":164.25,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-11.png","element":"img","alt":"i(a|s) = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"π","element":"span"},{"style":{"height":15.6},"width":167.99,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-12.png","element":"img","alt":"e(a|s) = 0","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Assumption ","element":"span"},{"text":"2 ","element":"span"},{"text":"requires all of the behavior policies to be identical. This assumption is trivially satisfied if data is collected from one behavior policy. Also, this assumption is often satisfied for applications where data is abundant, so that evaluation can be performed using only the data from the most recent behavior policy. 
Also, Assumption ","element":"span"},{"text":"3 ","element":"span"},{"text":"requires the horizon, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", to be finite.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 2 ","element":"span"},{"text":"(Single behavior policy)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":15.6},"width":124.22,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-13.png","element":"img","alt":" (i, j) ∈","inline":true},{"style":{"height":17.16},"width":326.24,"height":42.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-14.png","element":"img","alt":"{1, . . . , n}2, πi = πj","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is finite.","element":"span"}],[{"text":"Assumption ","element":"span"},{"text":"4 ","element":"span"},{"text":"requires the importance weights, ","element":"span"},{"style":{"height":16.66},"width":32.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-15.png","element":"img","alt":" ρit","inline":true},{"text":", to be ","element":"span"},{"text":"bounded above by a finite constant, ","element":"span"},{"style":{"height":14.4},"width":99.38,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-16.png","element":"img","alt":" β ∈ R","inline":true,"padRight":true},{"text":"(they are always bounded below by zero). It is trivially satisfied in the common setting where the horizon is finite and the state and action sets are finite. 
Although Assumption ","element":"span"},{"text":"4 ","element":"span"},{"text":"requires ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-17.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"to exist, none of our results depend on how large ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-18.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"is. So, in the non-finite state, action, and horizon settings one may ensure that evaluation policies are only considered if they satisfy Assumption ","element":"span"},{"text":"4 ","element":"span"},{"text":"for some arbitrarily large ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-19.png","element":"img","alt":" β","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 4 ","element":"span"},{"text":"(Bounded importance weight)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists a constant ","element":"span"},{"style":{"height":14.4},"width":130.43,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-20.png","element":"img","alt":" β < ∞","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":15.93},"width":249.32,"height":39.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-21.png","element":"img","alt":" (t, i) ∈ N≥0 ×","inline":true},{"style":{"height":16.83},"width":295.42,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-22.png","element":"img","alt":"{1, . . . 
, n}, ρit ≤ β","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"surely.","element":"span"}],[{"text":"In the following theorems, Theorems ","element":"span"},{"text":"1 ","element":"span"},{"text":"and ","element":"span"},{"text":"2","element":"span"},{"text":", we give two different sets of assumptions that are sufficient to show that WDR is a strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.02,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-23.png","element":"img","alt":" v(πe)","inline":true},{"text":". Notice that if the sets of states and actions are finite and the horizon is finite, then Assumption ","element":"span"},{"text":"4 ","element":"span"},{"text":"holds, and so Theorem ","element":"span"},{"text":"2 ","element":"span"},{"text":"means that WDR will be strongly consistent given only Assumption ","element":"span"},{"text":"1","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 1 ","element":"span"},{"text":"(WDR – strongly consistent estimator for one behavior policy, finite horizon)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumptions ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"text":"3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hold then ","element":"span"},{"style":{"height":17.59},"width":345.5,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-24.png","element":"img","alt":" WDR(D) a.s.−→ v(πe).","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"style":{"fontStyle":"italic"},"text":"See Appendix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C.1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 2 ","element":"span"},{"text":"(WDR – strongly consistent estimator for many behavior policies)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumptions ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hold then ","element":"span"},{"style":{"height":17.59},"width":345.5,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/3-25.png","element":"img","alt":"WDR(D) a.s.−→ v(πe).","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"style":{"fontStyle":"italic"},"text":"See Appendix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C.2","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}]]},{"heading":"6. 
Empirical Studies (WDR)","paragraphs":[[{"text":"In order to both show the empirical benefits of WDR over existing importance sampling estimators and better motivate our second major contribution, in this section we present an empirical comparison of different OPE methods.","element":"span"},{"text":"6 ","element":"span"},{"text":"We compare to a broad sampling of model-free importance sampling estimators, definitions of which can be found in the work of ","element":"span"},{"text":"Thomas ","element":"span"},{"text":"(","element":"span"},{"text":"2015b","element":"span"},{"text":", Chapter 3): ","element":"span"},{"style":{"fontStyle":"italic"},"text":"importance sampling ","element":"span"},{"text":"(IS), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"per-decision importance sampling","element":"span"}],[{"style":{"width":"99%"},"width":1876,"height":393,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-0.png","element":"img"}],[{"text":"Figure 1: Empirical results for three different experimental setups. All plots in this paper have the same format: they show the mean squared error of different estimators as ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n","element":"figcaption","subtype":"caption"},{"text":", the number of episodes in ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"D","element":"figcaption","subtype":"caption"},{"text":", increases. Both axes always use a logarithmic scale and standard error bars are included from ","element":"figcaption","subtype":"caption"},{"text":"128 ","element":"figcaption","subtype":"caption"},{"text":"trials. 
All plots use the following legend:","element":"figcaption","subtype":"caption"}],[{"style":{"width":"63%"},"width":1196,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-1.png","element":"img"}],[{"text":"(PDIS), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"weighted importance sampling ","element":"span"},{"text":"(WIS), and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"consistent weighted per-decision importance sampling ","element":"span"},{"text":"(CWPDIS). We also compare to the guided importance sampling ","element":"span"},{"style":{"fontStyle":"italic"},"text":"doubly robust ","element":"span"},{"text":"(DR) estimator (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":").","element":"span"}],[{"text":"Lastly, we compare to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"approximate model ","element":"span"},{"text":"(AM) estimator, which uses all of the available data to construct an approximate model of the MDP.","element":"span"},{"text":"7 ","element":"span"},{"text":"The performance of the evaluation policy on the approximate model is typically easy to compute and can be used as an estimate of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-2.png","element":"img","alt":" v(πe)","inline":true},{"text":". 
For example, in our experiments the approximate model maintains an estimate, ","element":"span"},{"style":{"height":13.13},"width":35.18,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-3.png","element":"img","alt":"ˆd0","inline":true},{"text":", of the initial state distribution, and so we define ","element":"span"},{"style":{"height":17.09},"width":428.99,"height":42.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-4.png","element":"img","alt":" AM := Σs∈S ˆd0(s)ˆvπe(s)","inline":true},{"text":". Notice that ","element":"span"},{"text":"unlike the importance sampling based methods, AM does not include any importance weights (","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-5.png","element":"img","alt":"ρt","inline":true,"padRight":true},{"text":"terms).","element":"span"}],[{"text":"In Appendix ","element":"span"},{"text":"D ","element":"span"},{"text":"we provide detailed descriptions of the experimental setup and results. ","element":"span"},{"text":"Here we provide only an overview. We used three domains: a ","element":"span"},{"style":{"height":10.4},"width":89.08,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-6.png","element":"img","alt":" 4 × 4","inline":true,"padRight":true},{"text":"gridworld previously constructed specifically for evaluating OPE methods (","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":", Section 2.5), as well as two simple domains that we developed to exemplify the settings where different methods excel and fail. ","element":"span"},{"text":"In our simulations, WDR dominated the other importance sampling and guided importance sampling estimators (but not AM). 
Not only did WDR always achieve the lowest mean squared error of these estimators, but no other single (guided) importance sampling estimator was able to always achieve mean squared errors within an order of magnitude of WDR’s. Figure ","element":"span"},{"text":"1a ","element":"span"},{"text":"is an example using the gridworld where no other method achieved mean squared errors within an order of magnitude of WDR’s.","element":"span"}],[{"text":"The second notable trend is that WDR often significantly outperformed AM. We constructed a simple MDP and experimental setup that we call ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ModelFail","element":"span"},{"text":", which exemplifies this. In the ModelFail experiments, the approximate model uses function approximation, which causes it to converge to the wrong MDP. This results in AM’s mean squared error plateauing at a non-zero value, as shown in Figure ","element":"span"},{"text":"1b","element":"span"},{"text":". Since WDR remains strongly consistent in this setting, its MSE converges almost surely to zero.","element":"span"}],[{"text":"The third notable trend is that when an accurate approximate model is available, WDR does not always outperform AM. We constructed a simple MDP that we call ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ModelWin","element":"span"},{"text":", which exemplifies this. In the ModelWin MDP, the approximate model quickly converges to the correct MDP, and so AM outperforms WDR. This is depicted in Figure ","element":"span"},{"text":"1c","element":"span"},{"text":".","element":"span"}],[{"text":"One might wonder why DR and WDR can do worse than AM even though they incorporate the approximate model. 
Notice that we can write the DR and WDR estimators as:","element":"span"}],[{"style":{"width":"97%"},"width":890,"height":298,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-7.png","element":"img"}],[{"text":"If the approximate model is perfect, then ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"is both a low variance and unbiased estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-8.png","element":"img","alt":" v(πe)","inline":true},{"text":". If the approximate model is perfect and ","element":"span"},{"style":{"height":13.13},"width":41.44,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-9.png","element":"img","alt":" Rt","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.33},"width":74.3,"height":35.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-10.png","element":"img","alt":" St+1","inline":true,"padRight":true},{"text":"are deterministic functions of ","element":"span"},{"style":{"height":13.13},"width":35.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-11.png","element":"img","alt":" St","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.53},"width":41.08,"height":33.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-12.png","element":"img","alt":" At","inline":true},{"text":", then ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"is zero, and so the second term is always zero and WDR is an excellent estimator. 
However, if ","element":"span"},{"style":{"height":13.12},"width":41.45,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-13.png","element":"img","alt":" Rt","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":14.32},"width":74.3,"height":35.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-14.png","element":"img","alt":" St+1","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"a deterministic function of ","element":"span"},{"style":{"height":13.12},"width":35.78,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-15.png","element":"img","alt":" St","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.53},"width":41.09,"height":33.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-16.png","element":"img","alt":" At","inline":true},{"text":"—if the state transitions or rewards are stochastic— then ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"is not necessarily zero. If the importance weights, ","element":"span"},{"style":{"height":16.66},"width":39.81,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/4-17.png","element":"img","alt":"wit","inline":true},{"text":", have high variance, then even slightly non-zero values ","element":"span"},{"text":"of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"can cause DR and WDR to have high variance.","element":"span"}],[{"text":"In summary, in our experiments, WDR dominated the other (guided) importance sampling estimators, sometimes achieving orders of magnitude lower MSE. However, the experiments also show that WDR is not always the best estimator—sometimes AM can produce estimates with an order lower MSE. 
This trend is also visible in the results of ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":"), where AM performs better than DR (although they did not compare to WDR, since it had not yet been introduced). Ideally we would like an estimator that combines WDR and AM or switches between them automatically, to always achieve the performance of the better estimator. In the following sections we show how this can be done.","element":"span"}]]},{"heading":"7. Blending IS and Model (BIM) Estimator","paragraphs":[[{"text":"In this section we show how two OPE estimators can be merged into a single estimator that exhibits the desirable properties of both. Before doing so, we establish some terminology. We divide OPE estimators into three classes. The first class we call ","element":"span"},{"style":{"fontStyle":"italic"},"text":"importance sampling estimators","element":"span"},{"text":". We define this class to include all estimators that, when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is finite, are defined using all of the importance weights ","element":"span"},{"style":{"height":10},"width":258.87,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-0.png","element":"img","alt":"ρ0, ρ1, . . . , ρL−1","inline":true},{"text":". Notice that this includes IS, PDIS, WIS, and CWPDIS, as well as the guided importance sampling methods, DR and WDR.","element":"span"}],[{"text":"The second class we call ","element":"span"},{"style":{"fontStyle":"italic"},"text":"purely model-based estimators","element":"span"},{"text":". 
We define this class to include all estimators that do not contain any ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-1.png","element":"img","alt":" ρt","inline":true,"padRight":true},{"text":"terms for ","element":"span"},{"style":{"height":12.8},"width":99.21,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-2.png","element":"img","alt":" t ≥ 0","inline":true},{"text":". The only purely model-based estimator in this paper is AM. Finally, we call the third class ","element":"span"},{"style":{"fontStyle":"italic"},"text":"partial importance sampling estimators","element":"span"},{"text":". These estimators are those that do not fall into either of the other two classes—estimators that use importance weights, ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-3.png","element":"img","alt":" ρt","inline":true},{"text":", but only for ","element":"span"},{"style":{"height":11.6},"width":155.24,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-4.png","element":"img","alt":" t < L − 1","inline":true},{"text":". We will introduce one such estimator later in this section.","element":"span"}],[{"text":"We contend that importance sampling estimators and purely model-based estimators are two extremes on a spectrum of estimators. Importance sampling estimators tend to be strongly consistent. That is, as more historical data becomes available, their estimates become increasingly accurate. However, their use of importance weights means that they ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all ","element":"span"},{"text":"(including DR and WDR) also can have high variance relative to purely model-based estimators. 
This is evident in the results on the ModelWin domain.","element":"span"}],[{"text":"On the other end of the spectrum, purely model-based estimators like AM are often ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"strongly consistent. If the approximate model uses function approximation or if there is some partial observability, then the approximate model may not converge to the true MDP. So, as more historical data becomes available, the estimates of AM may converge to a value other than ","element":"span"},{"style":{"height":15.6},"width":89.02,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-5.png","element":"img","alt":" v(πe)","inline":true},{"text":". Thus, purely model-based estimators tend to have high bias, even asymptotically, as evidenced by the AM curve in Figure ","element":"span"},{"text":"1b","element":"span"},{"text":". However, purely model-based methods also tend to have low variance because they do not contain any ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-6.png","element":"img","alt":" ρt","inline":true,"padRight":true},{"text":"terms.","element":"span"}],[{"text":"Between these two extremes lies a range of partial importance sampling estimators. 
","element":"span"},{"text":"Estimators that are close to the purely model-based estimators use ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-7.png","element":"img","alt":" ρt","inline":true,"padRight":true},{"text":"terms only for small ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", while estimators that are close to importance sampling estimators use ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-8.png","element":"img","alt":" ρt","inline":true,"padRight":true},{"text":"terms with large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"approaching ","element":"span"},{"style":{"height":10.8},"width":107.34,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-9.png","element":"img","alt":" L − 1","inline":true},{"text":". ","element":"span"},{"text":"Before formally defining one such partial importance sampling estimator, we present a few additional definitions. 
First, let ","element":"span"},{"style":{"height":18.76},"width":139.39,"height":46.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-10.png","element":"img","alt":" IS(j)(D)","inline":true,"padRight":true},{"text":"denote an estimate of ","element":"span"},{"style":{"height":19.65},"width":372.34,"height":49.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-11.png","element":"img","alt":"E[∑_{t=0}^{j} γ^t R_t | H ∼ πe]","inline":true},{"text":", constructed from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"using an importance sampling method like PDIS or WDR, which uses importance weights up to and including ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-12.png","element":"img","alt":" ρt","inline":true},{"text":". Similarly, let ","element":"span"},{"style":{"height":18.76},"width":168.48,"height":46.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-13.png","element":"img","alt":"AM(j)(D)","inline":true,"padRight":true},{"text":"denote a primarily model-based prediction from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":19.65},"width":383.67,"height":49.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-14.png","element":"img","alt":" E[∑_{t=j}^{∞} γ^t R_t | H ∼ πe]","inline":true,"padRight":true},{"text":"that may not use ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-15.png","element":"img","alt":" ρt","inline":true,"padRight":true},{"text":"terms ","element":"span"},{"text":"with 
","element":"span"},{"style":{"height":13.6},"width":82.71,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-16.png","element":"img","alt":" t ≥ j","inline":true},{"text":".","element":"span"}],[{"text":"We can now define a partial importance sampling estimator that we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"style":{"fontStyle":"italic"},"text":"-step return","element":"span"},{"text":", ","element":"span"},{"style":{"height":17.63},"width":123.73,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-17.png","element":"img","alt":" g(j)(D)","inline":true},{"text":", which uses an importance sampling based method to predict the outcome of using ","element":"span"},{"style":{"height":9.13},"width":37.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-18.png","element":"img","alt":" πe","inline":true,"padRight":true},{"text":"up until ","element":"span"},{"style":{"height":15.13},"width":42.44,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-19.png","element":"img","alt":" Rj","inline":true,"padRight":true},{"text":"is generated, and the approximate model estimator to predict the outcomes thereafter. 
That is, for all ","element":"span"},{"style":{"height":15.13},"width":157.07,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-20.png","element":"img","alt":" j ∈ N≥−1","inline":true},{"text":", let","element":"span"},{"text":"8","element":"span"}],[{"style":{"width":"82%"},"width":749,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-21.png","element":"img"}],[{"text":"We refer to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"length ","element":"span"},{"text":"of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return.","element":"span"}],[{"text":"Notice that ","element":"span"},{"style":{"height":17.63},"width":149.02,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-22.png","element":"img","alt":" g(−1)(D)","inline":true,"padRight":true},{"text":"is a purely model-based estimator, ","element":"span"},{"style":{"height":17.63},"width":140.24,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-23.png","element":"img","alt":"g(∞)(D)","inline":true,"padRight":true},{"text":"is an importance sampling estimator, and the other off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns are partial importance sampling estimators that blend between these two extremes. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"is small, the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return is similar to AM, using importance sampling to predict only a few early rewards. 
When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"is large, it uses importance sampling to predict most of the rewards and the model only for a few rewards at the end of a trajectory. So, as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"increases, we expect the variance of the return to increase, but the bias to decrease.","element":"span"}],[{"text":"We propose a new estimator, which we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"blending IS and model ","element":"span"},{"text":"(BIM) estimator, that leverages this spectrum of estimators to blend together the IS and AM estimators in a way that minimizes MSE. It does this by computing a weighted average of the different length returns: ","element":"span"},{"style":{"height":15.6},"width":340.55,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-24.png","element":"img","alt":"BIM(D) := x⊺g(D),","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":15.6},"width":376.92,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-25.png","element":"img","alt":" x := (x−1, x0, x1, . . . )⊺","inline":true,"padRight":true},{"text":"is an infinite-dimensional weight vector and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"is an infinite-dimensional vector of different length returns, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") := ","element":"span"},{"style":{"height":17.63},"width":424.55,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/5-26.png","element":"img","alt":"(g(−1)(D), g(0)(D), . 
. . , )⊺","inline":true},{"text":". Although in theory ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"can be infinite, in practice there is always finite data, so ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"is finite. The remaining question is then: how should we select the weights, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":"?","element":"span"}],[{"text":"A similar question has been studied before in reinforcement learning research when deciding how to weight ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns (not off-policy), as reviewed by ","element":"span"},{"text":"Sutton & Barto ","element":"span"},{"text":"(","element":"span"},{"text":"1998","element":"span"},{"text":", Section 7.2). The most common solution, a complex return called the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-0.png","element":"img","alt":" λ","inline":true},{"style":{"fontStyle":"italic"},"text":"-return","element":"span"},{"text":", uses ","element":"span"},{"style":{"height":12.73},"width":150.72,"height":31.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-1.png","element":"img","alt":" x−1 = 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.13},"width":86.88,"height":27.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-2.png","element":"img","alt":" xj =","inline":true},{"style":{"height":16.83},"width":154.94,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-3.png","element":"img","alt":"(1 − λ)λj","inline":true,"padRight":true},{"text":"for all other ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":". 
The ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-4.png","element":"img","alt":" λ","inline":true},{"text":"-return is the foundation of the entire TD","element":"span"},{"style":{"height":15.6},"width":52.7,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-5.png","element":"img","alt":"(λ)","inline":true,"padRight":true},{"text":"family of algorithms, which includes the original linear-time algorithm (","element":"span"},{"text":"Sutton","element":"span"},{"text":", ","element":"span"},{"text":"1988","element":"span"},{"text":"), least-squares formulations (","element":"span"},{"text":"Bradtke & Barto","element":"span"},{"text":", ","element":"span"},{"text":"1996","element":"span"},{"text":"; ","element":"span"},{"text":"Mahmood et al.","element":"span"},{"text":", ","element":"span"},{"text":"2014","element":"span"},{"text":"), methods for adapting ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-6.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"(","element":"span"},{"text":"Downey & Sanner","element":"span"},{"text":", ","element":"span"},{"text":"2010","element":"span"},{"text":"), true-online methods (","element":"span"},{"text":"van Hasselt et al.","element":"span"},{"text":", ","element":"span"},{"text":"2014","element":"span"},{"text":"), and the recent emphatic methods (","element":"span"},{"text":"Mahmood et al.","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":").","element":"span"}],[{"text":"Recent work has suggested that the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-7.png","element":"img","alt":" λ","inline":true},{"text":"-return could be replaced by more statistically 
principled complex returns like the ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-8.png","element":"img","alt":" γ","inline":true},{"text":"-return (","element":"span"},{"text":"Konidaris et al.","element":"span"},{"text":", ","element":"span"},{"text":"2011","element":"span"},{"text":") or ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-9.png","element":"img","alt":" Ω","inline":true},{"text":"-return (","element":"span"},{"text":"Thomas et al.","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":"). For the finite-horizon setting and for ","element":"span"},{"style":{"height":15.6},"width":371.45,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-10.png","element":"img","alt":" j ∈ {0, . . . , L − 1}","inline":true,"padRight":true},{"text":"the ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-11.png","element":"img","alt":" γ","inline":true},{"text":"-return uses ","element":"span"},{"style":{"height":25.93},"width":718.65,"height":64.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-12.png","element":"img","alt":"x_j := (∑_{i=0}^{j} γ^{2i})^{−1} / ∑_{ĵ=0}^{L−1} (∑_{i=0}^{ĵ} γ^{2i})^{−1},","inline":true,"padRight":true},{"text":"and the ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-13.png","element":"img","alt":" Ω","inline":true},{"text":"-return uses ","element":"span"},{"style":{"height":23.42},"width":705.4,"height":58.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-14.png","element":"img","alt":" x_j = ∑_{i=0}^{L−1} Ω_n^{−1}(j, i) / ∑_{ĵ,i=0}^{L−1} Ω_n^{−1}(ĵ, 
i),","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-15.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"height":10.8},"width":93.39,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-16.png","element":"img","alt":" L×L","inline":true,"padRight":true},{"text":"covariance matrix where ","element":"span"},{"style":{"height":15.6},"width":168.84,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-17.png","element":"img","alt":" Ωn(i, j) =","inline":true},{"style":{"height":17.63},"width":369.92,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-18.png","element":"img","alt":"Cov(g(i)(D), g(j)(D)),","inline":true,"padRight":true},{"text":"and where both the ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-19.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-20.png","element":"img","alt":" Ω","inline":true},{"text":"-returns use ","element":"span"},{"style":{"height":14.72},"width":110.22,"height":36.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-21.png","element":"img","alt":" xj = 0","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":15.6},"width":303.72,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-22.png","element":"img","alt":" j ̸∈ {0, . . . 
, L − 1}","inline":true},{"text":".","element":"span"}],[{"text":"The advantage of the ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-23.png","element":"img","alt":" γ","inline":true},{"text":"-return over the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-24.png","element":"img","alt":" λ","inline":true},{"text":"-return is that it uses a more accurate model of how variance increases with the length of a return, which also eliminates the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-25.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"hyperparameter used by the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-26.png","element":"img","alt":" λ","inline":true},{"text":"-return. 
The advantages of the ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-27.png","element":"img","alt":" Ω","inline":true},{"text":"-return over the ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-28.png","element":"img","alt":" γ","inline":true},{"text":"-return are that it uses a yet more accurate estimate of how variance grows with the length of the return, which is computed from historical data, and that it better accounts for the fact that different length returns are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"independent, i.e., ","element":"span"},{"style":{"height":17.63},"width":120.3,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-29.png","element":"img","alt":" g(i)(D)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.63},"width":123.73,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-30.png","element":"img","alt":" g(j)(D)","inline":true,"padRight":true},{"text":"are not independent even if ","element":"span"},{"style":{"height":14.8},"width":82.06,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-31.png","element":"img","alt":" i ≠ j","inline":true},{"text":".","element":"span"}],[{"text":"However, none of these weighting schemes are sufficient for our needs because they do not cause BIM to necessarily be a strongly consistent estimator.","element":"span"},{"text":"9 ","element":"span"},{"text":"This is likely because they were all designed for the setting where only one trajectory is available, i.e., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 1","element":"span"},{"text":", while strong consistency is a property that deals with 
performance as ","element":"span"},{"style":{"height":8.4},"width":122.6,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-32.png","element":"img","alt":" n → ∞","inline":true},{"text":". Furthermore, they were designed for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"on-policy ","element":"span"},{"text":"policy evaluation.","element":"span"}],[{"text":"We therefore propose a new weighting scheme (a new complex return for multiple trajectories) that directly ","element":"span"},{"text":"optimizes ","element":"span"},{"text":"our ","element":"span"},{"text":"primary ","element":"span"},{"text":"objective: ","element":"span"},{"text":"the ","element":"span"},{"text":"mean squared error. ","element":"span"},{"text":"This new weighting scheme is ","element":"span"},{"style":{"height":7.34},"width":110.26,"height":18.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-33.png","element":"img","alt":" x⋆ :=","inline":true},{"style":{"height":15.6},"width":581.66,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-34.png","element":"img","alt":"arg minx∈R∞ MSE(x⊺g(D), v(πe)).","inline":true,"padRight":true},{"text":"Unfortunately, ","element":"span"},{"text":"we typically cannot compute ","element":"span"},{"style":{"height":6.8},"width":37.54,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-35.png","element":"img","alt":" x⋆","inline":true},{"text":", because we do not know ","element":"span"},{"style":{"height":15.6},"width":349.21,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-36.png","element":"img","alt":"MSE(x⊺g(D), v(πe))","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":". 
","element":"span"},{"text":"Instead, we propose estimating ","element":"span"},{"style":{"height":6.8},"width":37.54,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-37.png","element":"img","alt":" x⋆","inline":true,"padRight":true},{"text":"by minimizing an approximation of ","element":"span"},{"style":{"height":15.6},"width":349.21,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-38.png","element":"img","alt":"MSE(x⊺g(D), v(πe))","inline":true},{"text":". First, dealing with an infinite number of different return lengths is challenging. To avoid this, we propose only using a subset of the returns, ","element":"span"},{"style":{"height":17.63},"width":166.23,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-39.png","element":"img","alt":" {g(j)(D)}","inline":true},{"text":", for ","element":"span"},{"style":{"height":14},"width":113.02,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-40.png","element":"img","alt":" j ∈ J","inline":true,"padRight":true},{"text":", where ","element":"span"},{"style":{"height":15.6},"width":160.12,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-41.png","element":"img","alt":" |J | < ∞","inline":true},{"text":". For all ","element":"span"},{"style":{"height":14.8},"width":113.02,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-42.png","element":"img","alt":" j ̸∈ J","inline":true,"padRight":true},{"text":", we assign ","element":"span"},{"style":{"height":14.73},"width":111.59,"height":36.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-43.png","element":"img","alt":"xj = 0","inline":true},{"text":". 
We suggest including ","element":"span"},{"style":{"height":10.4},"width":50.16,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-44.png","element":"img","alt":" −1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":7.2},"width":39,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-45.png","element":"img","alt":" ∞","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":".","element":"span"}],[{"text":"To simplify later notation, let ","element":"span"},{"style":{"height":17.63},"width":246.62,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-46.png","element":"img","alt":" gJ (D) ∈ R|J |","inline":true,"padRight":true},{"text":"be the elements of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"whose indexes are in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"—the returns that will not necessarily be given weights of zero. Also let ","element":"span"},{"style":{"height":15.13},"width":39.29,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-47.png","element":"img","alt":" Jj","inline":true,"padRight":true},{"text":"denote the ","element":"span"},{"style":{"height":16.43},"width":39.74,"height":41.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-48.png","element":"img","alt":" jth","inline":true,"padRight":true},{"text":"element in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":". 
We can then estimate ","element":"span"},{"style":{"height":6.8},"width":37.54,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-49.png","element":"img","alt":" x⋆","inline":true,"padRight":true},{"text":"by:","element":"span"}],[{"style":{"width":"69%"},"width":637,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-50.png","element":"img"}],[{"text":"where our estimate of ","element":"span"},{"style":{"height":12.84},"width":36.16,"height":32.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-51.png","element":"img","alt":" x⋆j","inline":true,"padRight":true},{"text":"is zero if ","element":"span"},{"style":{"height":14.8},"width":99.6,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-52.png","element":"img","alt":" j ∉ J","inline":true,"padRight":true},{"text":"and our estimate ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":14.22},"width":54.39,"height":35.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-53.png","element":"img","alt":" x⋆Jj","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":12.84},"width":36.16,"height":32.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-54.png","element":"img","alt":" x̂⋆j","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":15.6},"width":265.55,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-55.png","element":"img","alt":" j ∈ {1, . . . , |J |}","inline":true},{"text":".","element":"span"}],[{"text":"Next, to avoid searching all of ","element":"span"},{"style":{"height":13.63},"width":72.5,"height":34.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-56.png","element":"img","alt":" R|J |","inline":true,"padRight":true},{"text":"and also to serve as a form of regularization on ","element":"span"},{"style":{"height":6.8},"width":37.53,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-57.png","element":"img","alt":" x̂⋆","inline":true},{"text":", we limit the set of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"that we consider to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|J |","element":"span"},{"text":"-simplex, i.e., we require ","element":"span"},{"style":{"height":14.73},"width":112.76,"height":36.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-58.png","element":"img","alt":" xj ≥ 0","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":15.6},"width":281.8,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-59.png","element":"img","alt":"j ∈ {1, . . . , |J |}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.45},"width":229.46,"height":56.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-60.png","element":"img","alt":"∑_{j=1}^{|J |} xj = 1","inline":true},{"text":". 
We write ","element":"span"},{"style":{"height":13.63},"width":76.81,"height":34.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-61.png","element":"img","alt":" ∆|J |","inline":true,"padRight":true},{"text":"to ","element":"span"},{"text":"denote this set of weight vectors, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|J |","element":"span"},{"text":"-simplex.","element":"span"}],[{"text":"Using the bias-variance decomposition of MSE, we therefore have that","element":"span"}],[{"style":{"width":"90%"},"width":828,"height":148,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-62.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"remains the number of trajectories in ","element":"span"},{"style":{"height":13.2},"width":111.77,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-63.png","element":"img","alt":" D, Ωn","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"height":15.6},"width":172,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-64.png","element":"img","alt":" |J | × |J |","inline":true,"padRight":true},{"text":"covariance matrix where ","element":"span"},{"style":{"height":15.6},"width":186.84,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-65.png","element":"img","alt":" Ωn(i, j) =","inline":true},{"style":{"height":17.63},"width":408.76,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-66.png","element":"img","alt":"Cov(g(Ji)(D), g(Jj)(D))","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.12},"width":43.78,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-67.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"is the 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"|J |","element":"span"},{"text":"-dimensional vector with ","element":"span"},{"style":{"height":17.63},"width":514.47,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-68.png","element":"img","alt":" bn(j) = E[g(Jj)(D)] − v(πe)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":13.6},"width":66.78,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-69.png","element":"img","alt":" j ∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|J |}","element":"span"},{"text":".","element":"span"},{"text":"10 ","element":"span"},{"text":"This simplifies the problem of estimating the MSE for all possible ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"into estimating two terms: the bias vector, ","element":"span"},{"style":{"height":13.13},"width":43.77,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-70.png","element":"img","alt":" bn","inline":true},{"text":", and the covariance matrix, ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-71.png","element":"img","alt":" Ωn","inline":true},{"text":".","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-72.png","element":"img","alt":"b̂n","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":13.13},"width":47,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-73.png","element":"img","alt":"Ω̂n","inline":true,"padRight":true},{"text":"be the estimates of ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-74.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-75.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"when there are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"trajectories in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". The exact scheme used to estimate ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-76.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-77.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"depends on the definitions of ","element":"span"},{"style":{"height":18.76},"width":139.39,"height":46.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-78.png","element":"img","alt":" IS(j)(D)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.76},"width":168.48,"height":46.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-79.png","element":"img","alt":" AM(j)(D)","inline":true},{"text":". 
In general, both terms are easier to estimate for unweighted importance sampling estimators like PDIS and DR than for weighted estimators like CWPDIS or WDR.","element":"span"}],[{"text":"To make the dependence of BIM on the estimates of ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-80.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-81.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"explicit, and to summarize the approximations we have made, we redefine the BIM estimator as:","element":"span"}],[{"style":{"width":"59%"},"width":542,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-82.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"59%"},"width":543,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-83.png","element":"img"}],[{"text":"In the next section we propose using WDR as the importance sampling method, IS, and show how ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-84.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/6-85.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"can be approximated in this setting. 
First we show that if at least one of the returns included in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"is a strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-0.png","element":"img","alt":" v(πe)","inline":true},{"text":", and if the estimates of ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-1.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-2.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"are themselves strongly consistent, then BIM is a strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-3.png","element":"img","alt":" v(πe)","inline":true},{"text":":","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumption ","element":"span"},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds, there exists at least one ","element":"span"},{"style":{"height":14},"width":104.82,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-4.png","element":"img","alt":" j ∈ J","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":17.63},"width":123.73,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-5.png","element":"img","alt":" g(j)(D)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-6.png","element":"img","alt":" v(πe)","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":21.8},"width":241.4,"height":54.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-7.png","element":"img","alt":"b̂n − bn a.s.−→ 0","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":15.92},"width":247.86,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-8.png","element":"img","alt":"Ω̂n − Ωn a.s.−→ 0","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":17.59},"width":453.98,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-9.png","element":"img","alt":"BIM(D, Ω̂n, b̂n) a.s.−→ v(πe).","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"style":{"fontStyle":"italic"},"text":"See Appendix 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"E","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}]]},{"heading":"8. Model and Guided Importance Sampling Combining (MAGIC) Estimator","paragraphs":[[{"text":"In this section we propose using the BIM estimator with WDR as the importance sampling estimator. The resulting estimator combines purely model based estimates with the estimates of the guided importance sampling algorithm WDR, and so we call it the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"m","element":"span"},{"style":{"fontStyle":"italic"},"text":"odel ","element":"span"},{"style":{"fontWeight":"bold"},"text":"a","element":"span"},{"style":{"fontStyle":"italic"},"text":"nd ","element":"span"},{"style":{"fontWeight":"bold"},"text":"g","element":"span"},{"style":{"fontStyle":"italic"},"text":"uided ","element":"span"},{"style":{"fontWeight":"bold"},"text":"i","element":"span"},{"style":{"fontStyle":"italic"},"text":"mportance sampling ","element":"span"},{"style":{"fontWeight":"bold"},"text":"c","element":"span"},{"style":{"fontStyle":"italic"},"text":"ombining ","element":"span"},{"text":"(MAGIC) estimator.","element":"span"}],[{"text":"Although the derivation of how to properly define ","element":"span"},{"style":{"height":18.77},"width":139.39,"height":46.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-10.png","element":"img","alt":" IS(j)(D)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.76},"width":168.48,"height":46.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-11.png","element":"img","alt":" AM(j)(D)","inline":true,"padRight":true},{"text":"in order to blend WDR with the approximate model is less obvious than one might expect and therefore an important technical detail, we relegate it to Appendix ","element":"span"},{"text":"F ","element":"span"},{"text":"due to space restrictions. 
The resulting definition of an off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return is, for all ","element":"span"},{"style":{"height":15.13},"width":157.07,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-12.png","element":"img","alt":" j ∈ N≥−1","inline":true},{"text":":","element":"span"}],[{"style":{"width":"89%"},"width":817,"height":331,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(c) ","element":"span"},{"text":"is the combined control variate for both the importance sampling based term, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a)","element":"span"},{"text":", and the model-based term, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b)","element":"span"},{"text":", and where we use WDR’s definition of ","element":"span"},{"style":{"height":16.66},"width":39.81,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-14.png","element":"img","alt":" wit","inline":true},{"text":".","element":"span"}],[{"text":"We estimate ","element":"span"},{"style":{"height":13.12},"width":47,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-15.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"from the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"trajectories in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"using a sample covariance matrix, ","element":"span"},{"style":{"height":13.13},"width":47,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-16.png","element":"img","alt":"Ω̂n","inline":true},{"text":". 
See Appendix ","element":"span"},{"text":"G ","element":"span"},{"text":"for details and pseudocode for MAGIC.","element":"span"}],[{"text":"Estimating the bias vector, ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-17.png","element":"img","alt":" bn","inline":true},{"text":", is challenging because it has a strong dependence on the value that we wish we knew, ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-18.png","element":"img","alt":"v(πe)","inline":true},{"text":". ","element":"span"},{"text":"We cannot use AM’s estimate as a stand-in for ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-19.png","element":"img","alt":"v(πe)","inline":true,"padRight":true},{"text":"because it would cause us to assume that AM’s greatest weakness, its high bias, is negligible. We cannot use WDR’s estimate (or any other importance sampling estimator’s estimate) because our estimate of ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-20.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"would then conflate the high variance of importance sampling estimates with the bias that we wish to estimate.","element":"span"}],[{"text":"When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", the number of trajectories in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", is small, variance tends to be the root cause of high MSE. 
We therefore propose using an estimate of ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-21.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"that is initially conservative (it initially underestimates the bias) but becomes correct as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"increases. Let ","element":"span"},{"style":{"height":17.63},"width":248.36,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-22.png","element":"img","alt":" CI(g(∞)(D), δ)","inline":true,"padRight":true},{"text":"be a ","element":"span"},{"style":{"height":11.6},"width":73.64,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-23.png","element":"img","alt":" 1−δ","inline":true,"padRight":true},{"text":"confidence interval on the expected value of the random variable ","element":"span"},{"style":{"height":17.63},"width":362.05,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-24.png","element":"img","alt":"g(∞)(D) = WDR(D)","inline":true},{"text":". Intuitively, as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"increases we expect that this confidence interval will converge to ","element":"span"},{"style":{"height":17.63},"width":140.25,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-25.png","element":"img","alt":" g(∞)(D)","inline":true},{"text":", which in turn converges to ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-26.png","element":"img","alt":" v(πe)","inline":true},{"text":". 
So, we estimate ","element":"span"},{"style":{"height":15.6},"width":94.16,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-27.png","element":"img","alt":" bn(j)","inline":true},{"text":", the bias of the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return, by its distance from the ","element":"span"},{"text":"10% ","element":"span"},{"text":"confidence interval. That is, we estimate ","element":"span"},{"style":{"height":15.6},"width":94.16,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-28.png","element":"img","alt":" bn(j)","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"style":{"width":"79%"},"width":725,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-29.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"dist(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":") ","element":"span"},{"text":"is the distance between ","element":"span"},{"style":{"height":14},"width":95.8,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-30.png","element":"img","alt":" y ∈ R","inline":true,"padRight":true},{"text":"and the set ","element":"span"},{"style":{"height":15.6},"width":589.7,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-31.png","element":"img","alt":"Z ⊆ R: dist(y, Z) = minz∈Z |y−z|.","inline":true,"padRight":true},{"text":"We use both the percentile bootstrap confidence interval (","element":"span"},{"text":"Efron & Tibshirani","element":"span"},{"text":", ","element":"span"},{"text":"1993","element":"span"},{"text":") and Chernoff-Hoeffding’s inequality—whichever is tighter—for ","element":"span"},{"text":"CI ","element":"span"},{"text":"in our experiments.","element":"span"}],[{"text":"In Theorem 
","element":"span"},{"text":"4 ","element":"span"},{"text":"we show that the MAGIC estimator is a strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-32.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"given one set of assumptions that we used to show that WDR is strongly consistent and that WDR is included as one of the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 4 ","element":"span"},{"text":"(MAGIC - strongly consistent)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumptions ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hold and ","element":"span"},{"style":{"height":12.8},"width":131.84,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-33.png","element":"img","alt":" ∞ ∈ J","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"then ","element":"span"},{"style":{"height":17.59},"width":396.14,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-34.png","element":"img","alt":" MAGIC(D) a.s.−→ v(πe).","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Proof ","element":"span"},{"style":{"fontStyle":"italic"},"text":"See Appendix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}]]},{"heading":"9. Empirical Studies (MAGIC)","paragraphs":[[{"text":"Appendix ","element":"span"},{"text":"I ","element":"span"},{"text":"provides detailed experiments using MAGIC. 
In this section we provide an overview of these results. The first three plots in Figure ","element":"span"},{"text":"2 ","element":"span"},{"text":"correspond to those in Figure ","element":"span"},{"text":"1","element":"span"},{"text":", but include MAGIC. In general MAGIC does very well, tracking or exceeding the best performance of WDR and AM. However, in Figure ","element":"span"},{"text":"2c ","element":"span"},{"text":"MAGIC does not perfectly track AM. The scale is logarithmic, so the difference between MAGIC and AM is small in comparison to the benefit of MAGIC over WDR. We hypothesize that MAGIC does not match AM because of error in our estimates of ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-35.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-36.png","element":"img","alt":" bn","inline":true},{"text":".","element":"span"}],[{"text":"Figure ","element":"span"},{"text":"2d ","element":"span"},{"text":"is for an experimental setup that we call ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Hybrid","element":"span"},{"text":", where early in trajectories there is partial observability (e.g., initial uncertainty about a student’s knowledge in an intelligent tutoring system, or uncertainty about the state of the world in a robotic application). In these settings MAGIC outperforms all other estimators, even AM and WDR, by automatically leveraging WDR for the parts of trajectories where partial observability causes the model to be inaccurate, and AM for parts of the trajectories where the model is accurate. 
","element":"span"},{"text":"To emphasize this, we include MAGIC-B (B for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"binary","element":"span"},{"text":") where ","element":"span"},{"style":{"height":15.6},"width":247.62,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/7-37.png","element":"img","alt":" J = {−1, ∞}","inline":true},{"text":", so that BIM can only blend AM and WDR by placing weights on them. The relatively poor performance of MAGIC-B supports our use of off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns.","element":"span"}],[{"style":{"width":"99%"},"width":1884,"height":319,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/8-0.png","element":"img"}],[{"text":"Figure 2: Empirical comparison of MAGIC to other estimators using the legend from Figure ","element":"figcaption","subtype":"caption"},{"text":"1","element":"figcaption","subtype":"caption"},{"text":". All plots use the following legend (although only Figure ","element":"figcaption","subtype":"caption"},{"text":"2d ","element":"figcaption","subtype":"caption"},{"text":"includes MAGIC-B):","element":"figcaption","subtype":"caption"}],[{"style":{"width":"48%"},"width":918,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/8-1.png","element":"img"}]]},{"heading":"10. Conclusion","paragraphs":[[{"text":"We have proposed several new OPE estimators and showed empirically that they outperform existing estimators. While previous OPE estimators that use importance sampling often failed to outperform the approximate model estimator (which does not use importance sampling), our new estimators often do, frequently by orders of magnitude. In cases where the approximate model estimator remains the best estimator, one of our new estimators, MAGIC, performs similarly. 
In other cases, MAGIC meets or exceeds the performance of state-of-the-art prior estimators.","element":"span"}]]},{"heading":"References","paragraphs":[[{"text":"Bartle, Robert G. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The elements of integration and Lebesgue measure","element":"span"},{"text":". John Wiley & Sons, 2014.","element":"span"}],[{"text":"Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor-critic algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Automatica","element":"span"},{"text":", 45(11): 2471–2482, 2009.","element":"span"}],[{"text":"Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 22(1–3):33–57, March 1996.","element":"span"}],[{"text":"Davison, A. C. and Hinkley, D. V. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bootstrap Methods and their Application","element":"span"},{"text":". ","element":"span"},{"text":"Cambridge University Press, Cambridge, 1997.","element":"span"}],[{"text":"Downey, C. and Sanner, S. Temporal difference Bayesian model averaging: A Bayesian perspective on adapting lambda. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 27th International Conference on Machine Learning","element":"span"},{"text":", pp. 311–318, 2010.","element":"span"}],[{"text":"Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twenty-Eighth International Conference on Machine Learning","element":"span"},{"text":", pp. 1097–1104, 2011.","element":"span"}],[{"text":"Efron, B. and Tibshirani, R. J. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"An Introduction to the Bootstrap","element":"span"},{"text":". 
Chapman and Hall, London, 1993.","element":"span"}],[{"text":"Hammersley, J. M. and Handscomb, D. C. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Monte Carlo Methods","element":"span"},{"text":". Methuen & Co. Ltd., London, p. 40, 1964.","element":"span"}],[{"style":{"width":"45%"},"width":416,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/8-2.png","element":"img"}],[{"text":"Heejung, H. and Robins, J. M. Doubly robust estimation in missing data and causal inference models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Biometrics","element":"span"},{"text":", 61(4):962–973, 2005.","element":"span"}],[{"text":"Jiang, N. and Li, L. Doubly robust off-policy evaluation for reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ArXiv","element":"span"},{"text":", arXiv:1511.03722v1, 2015.","element":"span"}],[{"text":"Konidaris, G. D., Niekum, S., and Thomas, P. S. TD","element":"span"},{"style":{"height":7.2},"width":17,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/8-3.png","element":"img","alt":"γ","inline":true},{"text":": Re-evaluating complex backups in temporal difference learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 24","element":"span"},{"text":", pp. 2402–2410, 2011.","element":"span"}],[{"text":"Levine, S. and Koltun, V. Guided policy search. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 30th International Conference on Machine Learning","element":"span"},{"text":", pp. 1–9, 2013.","element":"span"}],[{"text":"Mahmood, A. R., Hasselt, H., and Sutton, R. S. Weighted importance sampling for off-policy learning with linear function approximation. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 27","element":"span"},{"text":", 2014.","element":"span"}],[{"text":"Mahmood, A. R., Yu, H., White, M., and Sutton, R. S. ","element":"span"},{"text":"Emphatic temporal-difference learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ArXiv","element":"span"},{"text":", arXiv:1507.01569, 2015.","element":"span"}],[{"text":"Mandel, T., Liu, Y., Levine, S., Brunskill, E., and Popović, Z. Offline policy evaluation across representations with applications to educational games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems","element":"span"},{"text":", 2014.","element":"span"}],[{"text":"Mandel, T., Liu, Y., Brunskill, E., and Popović, Z. Offline evaluation of online reinforcement learning algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Thirtieth Conference on Artificial Intelligence","element":"span"},{"text":", 2016.","element":"span"}],[{"text":"Mittelhammer, R. C. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical statistics for economics and business","element":"span"},{"text":", volume 78. Springer, 1996.","element":"span"}],[{"text":"Powell, M. J. D. and Swann, J. Weighted uniform sampling: a Monte Carlo technique for reducing variance. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the Institute of Mathematics and its Applications","element":"span"},{"text":", 2(3):228–236, 1966.","element":"span"}],[{"text":"Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 17th International Conference on Machine Learning","element":"span"},{"text":", pp. 759–766, 2000.","element":"span"}],[{"text":"Rotnitzky, A. and Robins, J. 
M. Semiparametric regression estimation in the presence of dependent censoring. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Biometrika","element":"span"},{"text":", 82(4):805–820, 1995.","element":"span"}],[{"text":"Sen, P. K. and Singer, J. M. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Large Sample Methods in Statistics An Introduction With Applications","element":"span"},{"text":". Chapman & Hall, 1993.","element":"span"}],[{"text":"Sutton, R. S. and Barto, A. G. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement Learning: An Introduction","element":"span"},{"text":". MIT Press, Cambridge, MA, 1998.","element":"span"}],[{"text":"Sutton, R.S. Learning to predict by the methods of temporal differences. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 3(1):9–44, 1988.","element":"span"}],[{"text":"Thapa, D., Jung, I., and Wang, G. ","element":"span"},{"text":"Agent based decision support system using reinforcement learning under emergency circumstances. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Natural Computation","element":"span"},{"text":", 3610:888–892, 2005.","element":"span"}],[{"text":"Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. Personalized ad recommendation systems for life-time value optimization with guarantees. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the International Joint Conference on Artificial Intelligence","element":"span"},{"text":", 2015.","element":"span"}],[{"text":"Thomas, P. S. A notation for Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ArXiv","element":"span"},{"text":", arXiv:1512.09075v1, 2015a.","element":"span"}],[{"text":"Thomas, P. S. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Safe Reinforcement Learning","element":"span"},{"text":". 
PhD thesis, University of Massachusetts Amherst, 2015b.","element":"span"}],[{"text":"Thomas, ","element":"span"},{"text":"P. S., ","element":"span"},{"text":"Niekum, ","element":"span"},{"text":"S., ","element":"span"},{"text":"Theocharous, ","element":"span"},{"text":"G., ","element":"span"},{"text":"and Konidaris, G. D. Policy evaluation using the ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/9-0.png","element":"img","alt":" Ω","inline":true},{"text":"-return. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 2015.","element":"span"}],[{"text":"van Hasselt, H., Mahmood, A. R., and Sutton, R. S. Off-policy TD","element":"span"},{"style":{"height":15.6},"width":52.7,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/9-1.png","element":"img","alt":"(λ)","inline":true,"padRight":true},{"text":"with true online equivalence. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence","element":"span"},{"text":", 2014.","element":"span"}],[{"text":"Veness, J., Lanctot, M., and Bowling, M. Variance reduction in Monte-Carlo tree search. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pp. 1836–1844, 2011.","element":"span"}],[{"text":"White, M. and Bowling, M. ","element":"span"},{"text":"Learning a value analysis tool for agent evaluation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the International Joint Conference on Artificial Intelligence","element":"span"},{"text":", pp. 1976–1981, 2009.","element":"span"}],[{"text":"Zinkevich, M., Bowling, M., Bard, N., Kan, M., and Billings, D. Optimal unbiased estimators for evaluating agent performance. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI)","element":"span"},{"text":", pp. 573–578, 2006.","element":"span"}]]},{"heading":"A. Preliminaries","paragraphs":[[{"text":"In this section we present additional notation, definitions, properties, (known) theorems, corollaries, and lemmas that are useful when we prove theorems later.","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":16.43},"width":796.42,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-0.png","element":"img","alt":" Ht := (S0, A0, R0, S1, . . . , St−1, At−1, Rt−1, St)","inline":true,"padRight":true},{"text":"be the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"transitions in the episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". We call ","element":"span"},{"style":{"height":12.43},"width":47.38,"height":31.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-1.png","element":"img","alt":" Ht","inline":true,"padRight":true},{"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"partial trajectory ","element":"span"},{"text":"of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Notice that we use subscripts on trajectories to denote the trajectory’s index in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"and superscripts to denote partial trajectories—","element":"span"},{"style":{"height":16.47},"width":47.39,"height":41.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-2.png","element":"img","alt":"Hti","inline":true,"padRight":true},{"text":"is the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"transitions of the ","element":"span"},{"style":{"height":13.23},"width":34.91,"height":33.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-3.png","element":"img","alt":" ith","inline":true,"padRight":true},{"text":"trajectory in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". Let ","element":"span"},{"style":{"height":13.63},"width":45.12,"height":34.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-4.png","element":"img","alt":" Ht","inline":true,"padRight":true},{"text":"be the set of all possible partial trajectories of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"text":"For all ","element":"span"},{"style":{"height":15.6},"width":247.67,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-5.png","element":"img","alt":" (π, s) ∈ Π × S","inline":true},{"text":", let ","element":"span"},{"style":{"height":15.6},"width":150.08,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-6.png","element":"img","alt":" supps(π)","inline":true,"padRight":true},{"text":"be the set of actions that have non-zero probability when the policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-7.png","element":"img","alt":" 
π","inline":true,"padRight":true},{"text":"is used to select an action in state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", i.e., ","element":"span"},{"style":{"height":15.6},"width":374.78,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-8.png","element":"img","alt":" supps(π) := {a ∈ A :","inline":true},{"style":{"height":15.6},"width":212.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-9.png","element":"img","alt":"π(a|s) ≠ 0}","inline":true},{"text":". Similarly, let ","element":"span"},{"style":{"height":16.43},"width":441.81,"height":41.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-10.png","element":"img","alt":" supp(π, t) := {ht ∈ Ht :","inline":true},{"style":{"height":16.43},"width":333.85,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-11.png","element":"img","alt":"Pr(Ht = ht|π) ≠ 0}","inline":true},{"text":".","element":"span"}],[{"text":"Later we will need to bound terms like ","element":"span"},{"style":{"height":16.67},"width":75.15,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-12.png","element":"img","alt":" ρitRit","inline":true,"padRight":true},{"text":"for some ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". 
","element":"span"},{"text":"Notice that even if ","element":"span"},{"style":{"height":16.66},"width":133.1,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-13.png","element":"img","alt":" ρit < β","inline":true},{"text":", it is possible for ","element":"span"},{"style":{"height":16.66},"width":227.98,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-14.png","element":"img","alt":"ρitRit > βrmax","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":9.13},"width":64.67,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-15.png","element":"img","alt":" rmax","inline":true,"padRight":true},{"text":"is negative, since ","element":"span"},{"style":{"height":16.66},"width":32.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-16.png","element":"img","alt":" ρit","inline":true,"padRight":true},{"text":"could be zero. ","element":"span"},{"text":"Additionally, sometimes we may deal with ","element":"span"},{"style":{"height":9.13},"width":64.67,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-17.png","element":"img","alt":" rmax","inline":true,"padRight":true},{"text":"terms and other times ","element":"span"},{"style":{"height":17.06},"width":86.89,"height":42.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-18.png","element":"img","alt":" rmodelmax","inline":true,"padRight":true},{"text":". 
To avoid explicitly handling these cases, ","element":"span"},{"text":"we will bound terms using loose bounds that depend on a new term: ","element":"span"},{"style":{"height":17.61},"width":680.12,"height":44.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-19.png","element":"img","alt":" r⋆max := max{|rmin|, |rmax|, |rmodelmin |, |rmodelmax |}","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 1 ","element":"span"},{"text":"(Almost Sure Convergence)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A sequence of random variables, ","element":"span"},{"style":{"height":15.6},"width":141.37,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-20.png","element":"img","alt":" (Xn)∞n=1","inline":true},{"style":{"fontStyle":"italic"},"text":", converges almost surely to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"random variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"style":{"fontStyle":"italic"},"text":"if","element":"span"}],[{"style":{"width":"43%"},"width":400,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-21.png","element":"img"}],[{"text":"We write ","element":"span"},{"style":{"height":15.92},"width":172.98,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-22.png","element":"img","alt":" Xn a.s.−→ X","inline":true,"padRight":true},{"text":"to denote that the sequence ","element":"span"},{"style":{"height":15.6},"width":141.38,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-23.png","element":"img","alt":" (Xn)∞n=1","inline":true,"padRight":true},{"text":"converges almost surely to 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":10.8},"width":18,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-24.png","element":"img","alt":" θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a real number and ","element":"span"},{"style":{"height":18.89},"width":127.45,"height":47.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-25.png","element":"img","alt":" (ˆθn)∞n=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"an infinite sequence of random variables. We call ","element":"span"},{"style":{"height":17.22},"width":37.2,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-26.png","element":"img","alt":"ˆθn","inline":true},{"style":{"fontStyle":"italic"},"text":", a (strongly) ","element":"span"},{"style":{"fontWeight":"bold"},"text":"consistent estimator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of ","element":"span"},{"style":{"height":10.8},"width":18,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-27.png","element":"img","alt":" θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if and only if ","element":"span"},{"style":{"height":17.22},"width":141.34,"height":43.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-28.png","element":"img","alt":"ˆθn a.s.−→ θ","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Notice that an estimator being unbiased does ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"mean that it is also strongly consistent—estimators 
can be any combination of biased/unbiased and consistent/inconsistent. Next we present several known properties of almost sure convergence (","element":"span"},{"text":"Mittelhammer","element":"span"},{"text":", ","element":"span"},{"text":"1996","element":"span"},{"text":", Section 5.5).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Property 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"[Continuous mapping theorem] ","element":"span"},{"style":{"height":15.92},"width":185.8,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-29.png","element":"img","alt":" Xn a.s.−→ X","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"implies that ","element":"span"},{"style":{"height":17.59},"width":287.66,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-30.png","element":"img","alt":" f(Xn) a.s.−→ f(X)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for every continuous function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Property 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":13.13},"width":51.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-31.png","element":"img","alt":" Xn","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.13},"width":41.51,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-32.png","element":"img","alt":" Yn","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be sequences of random variables and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be random variables. If ","element":"span"},{"style":{"height":15.92},"width":181.4,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-33.png","element":"img","alt":" Xn a.s.−→ X","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":15.92},"width":157.65,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-34.png","element":"img","alt":"Yn a.s.−→ Y","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", and if ","element":"span"},{"text":"Pr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"= 0) = 0","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":20.86},"width":163.78,"height":52.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-35.png","element":"img","alt":"Xn/Yn a.s.−→ X/Y","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Property 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":16.87},"width":141.8,"height":42.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-36.png","element":"img","alt":" {Xin}mi=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are ","element":"span"},{"style":{"height":9.6},"width":124.76,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-37.png","element":"img","alt":" m < ∞","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"sequences of random ","element":"span"},{"style":{"fontStyle":"italic"},"text":"variables such that ","element":"span"},{"style":{"height":17.43},"width":199.21,"height":43.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-38.png","element":"img","alt":" Xin a.s.−→ Xi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.6},"width":255.54,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-39.png","element":"img","alt":" i ∈ {1, . . . , m}","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":18.25},"width":382.58,"height":45.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-40.png","element":"img","alt":"∑mi=1 Xin a.s.−→ ∑mi=1 Xi","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"We will require an additional property of almost sure convergence that is similar to Property ","element":"span"},{"text":"3","element":"span"},{"text":", but which allows for the sum over a countably infinite number of sequences of random variables, i.e., ","element":"span"},{"style":{"height":7.2},"width":134.25,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-41.png","element":"img","alt":" m = ∞","inline":true},{"text":". 
In order to establish this property we begin with Lebesgue’s dominated convergence theorem:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 5 ","element":"span"},{"text":"(Lebesgue’s Dominated Convergence Theorem)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":15.6},"width":128.24,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-42.png","element":"img","alt":" (fn)∞n=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a sequence of integrable functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"that converges almost everywhere to a real-valued measurable function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"style":{"fontStyle":"italic"},"text":". If there exists an integrable function","element":"span"},{"text":"11 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":15.6},"width":132.35,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-43.png","element":"img","alt":" |fn| ≤ g","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":", then","element":"span"}],[{"style":{"width":"43%"},"width":400,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-44.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"See ","element":"span"},{"text":"Bartle","element":"span"},{"text":" (","element":"span"},{"text":"2014","element":"span"},{"text":", Theorem 5.6).","element":"span"}],[{"text":"Next we use Lebesgue’s dominated convergence theorem to show conditions under which we can reverse the order of a limit and an infinite summation:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.87},"width":131.83,"height":42.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-45.png","element":"img","alt":" {xin}∞i=0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a countably infinite number ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of real-valued sequences indexed by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"style":{"fontStyle":"italic"},"text":", such that ","element":"span"},{"style":{"height":16.66},"width":271.41,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-46.png","element":"img","alt":"limn→∞ xin = xi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.13},"width":128,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-47.png","element":"img","alt":" i ∈ N≥0","inline":true},{"style":{"fontStyle":"italic"},"text":". 
If there exists a function ","element":"span"},{"style":{"height":15.12},"width":222.85,"height":37.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-48.png","element":"img","alt":"g : N≥0 → R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16.83},"width":186.2,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-49.png","element":"img","alt":" |xin| ≤ g(i)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":14.32},"width":144.26,"height":35.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-50.png","element":"img","alt":" n ∈ N>0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":15.13},"width":128,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-51.png","element":"img","alt":"i ∈ N≥0","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":17.65},"width":253.7,"height":44.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-52.png","element":"img","alt":"∑∞i=0 g(i) < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", then","element":"span"}],[{"style":{"width":"48%"},"width":445,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-53.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"We apply Lebesgue’s dominated convergence theorem (Theorem ","element":"span"},{"text":"5","element":"span"},{"text":"), where for all ","element":"span"},{"style":{"height":15.93},"width":348.15,"height":39.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-54.png","element":"img","alt":" (n, i) ∈ N>0 × N≥0","inline":true},{"text":", ","element":"span"},{"style":{"height":16.83},"width":396.68,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-55.png","element":"img","alt":"fn(i) = xin, f(i) = xi","inline":true},{"text":", and ","element":"span"},{"style":{"height":10},"width":23,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-56.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"is the counting measure ","element":"span"},{"text":"on the measurable space ","element":"span"},{"style":{"height":15.93},"width":246.92,"height":39.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-57.png","element":"img","alt":" (N≥0, P(N≥0))","inline":true},{"text":", where ","element":"span"},{"style":{"height":15.93},"width":129.88,"height":39.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-58.png","element":"img","alt":" P(N≥0)","inline":true,"padRight":true},{"text":"is the power set of ","element":"span"},{"style":{"height":15.13},"width":67.24,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-59.png","element":"img","alt":" N≥0","inline":true},{"text":".","element":"span"}],[{"text":"We can now establish our desired property about almost sure convergence:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Property 4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.87},"width":141.79,"height":42.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-60.png","element":"img","alt":" {Xin}∞i=0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a countably infinite number ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of sequences of random variables such that ","element":"span"},{"style":{"height":17.43},"width":183.43,"height":43.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-61.png","element":"img","alt":" Xin a.s.−→ Xi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.13},"width":134.21,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-62.png","element":"img","alt":" i ∈ N≥0","inline":true},{"style":{"fontStyle":"italic"},"text":". If there exists a function ","element":"span"},{"style":{"height":15.13},"width":222.59,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-63.png","element":"img","alt":" g : N≥0 → R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16.83},"width":198.85,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-64.png","element":"img","alt":" |Xin| ≤ g(i)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"surely for all ","element":"span"},{"style":{"height":15.93},"width":328.36,"height":39.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-65.png","element":"img","alt":" (n, i) ∈ N>0 × N≥0","inline":true},{"style":{"fontStyle":"italic"},"text":", and 
","element":"span"},{"style":{"height":17.65},"width":253.69,"height":44.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-66.png","element":"img","alt":"∑∞i=0 g(i) < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":18.25},"width":382.58,"height":45.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/10-67.png","element":"img","alt":" ∑∞i=0 Xin a.s.−→ ∑∞i=0 Xi","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"width":"94%"},"width":861,"height":169,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-0.png","element":"img"}],[{"text":"    ","element":"span"}],[{"style":{"width":"97%"},"width":890,"height":734,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"comes from Lemma ","element":"span"},{"text":"1 ","element":"span"},{"text":"which ensures that","element":"span"}],[{"style":{"width":"99%"},"width":909,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"(c) ","element":"span"},{"text":"holds because ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(d) ","element":"span"},{"style":{"height":9.2},"width":62.7,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-3.png","element":"img","alt":" =⇒","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"(b)","element":"span"},{"text":", and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(e) ","element":"span"},{"text":"has zero measure because it is the countable union of zero measure sets by the assumption that 
","element":"span"},{"style":{"height":17.43},"width":183.43,"height":43.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-4.png","element":"img","alt":" Xin a.s.−→ Xi","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":15.13},"width":128,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-5.png","element":"img","alt":" i ∈ N≥0","inline":true},{"text":".","element":"span"}],[{"text":"Next we show that if a sequence of random variables, ","element":"span"},{"style":{"height":13.13},"width":51.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-6.png","element":"img","alt":" Xn","inline":true},{"text":", converges almost surely to a random variable, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", then the expected value of ","element":"span"},{"style":{"height":13.12},"width":51.13,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-7.png","element":"img","alt":" Xn","inline":true,"padRight":true},{"text":"converges to the expected value of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":15.64},"width":124.98,"height":39.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-8.png","element":"img","alt":" (Xi)∞i=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a sequence of uniformly bounded ","element":"span"},{"style":{"fontStyle":"italic"},"text":"real-valued random variables and if ","element":"span"},{"style":{"height":15.92},"width":201.88,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-9.png","element":"img","alt":" Xn a.s.−→ X","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":15.6},"width":396.07,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-10.png","element":"img","alt":"limn→∞ E[Xn] = E[X].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":13.13},"width":51.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-11.png","element":"img","alt":" Xn","inline":true,"padRight":true},{"text":"(for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"be random variables on the probability space ","element":"span"},{"style":{"height":15.6},"width":150.85,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-12.png","element":"img","alt":" (Ω, Σ, P)","inline":true,"padRight":true},{"text":"and let ","element":"span"},{"style":{"height":15.6},"width":272.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-13.png","element":"img","alt":" A = {ω ∈ Ω 
:","inline":true},{"style":{"height":15.6},"width":303.38,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-14.png","element":"img","alt":"limn→∞ Xn = X}","inline":true},{"text":". Then:","element":"span"}],[{"style":{"width":"90%"},"width":824,"height":357,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"comes from the bounded convergence theorem. For term ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b)","element":"span"},{"text":", notice that for all ","element":"span"},{"style":{"height":13.93},"width":409.98,"height":34.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-16.png","element":"img","alt":" ω ∈ A, limn→∞ Xn = X","inline":true},{"text":". For term ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(c)","element":"span"},{"text":", notice that by the assumption that ","element":"span"},{"style":{"height":15.92},"width":171.26,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-17.png","element":"img","alt":" Xn a.s.−→ X","inline":true},{"text":", we have that ","element":"span"},{"style":{"height":15.6},"width":96.62,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-18.png","element":"img","alt":" Ω \\ A","inline":true,"padRight":true},{"text":"has measure zero. So:","element":"span"}],[{"style":{"width":"83%"},"width":759,"height":359,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-19.png","element":"img"}],[{"text":"Next we present a lemma that relates almost sure convergence of estimators to mean squared error. 
Let ","element":"span"},{"style":{"height":14.89},"width":23.18,"height":37.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-20.png","element":"img","alt":"ˆθ","inline":true,"padRight":true},{"text":"be an estimator of ","element":"span"},{"style":{"height":10.8},"width":18,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-21.png","element":"img","alt":" θ","inline":true},{"text":". Recall that:","element":"span"}],[{"style":{"width":"49%"},"width":449,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-22.png","element":"img"}],[{"text":"We show that a sequence, ","element":"span"},{"style":{"height":15.6},"width":141.38,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-23.png","element":"img","alt":" (Xn)∞n=1","inline":true,"padRight":true},{"text":"converges almost ","element":"span"},{"text":"surely to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"if and only if ","element":"span"},{"style":{"height":15.6},"width":434.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-24.png","element":"img","alt":" limn→∞ MSE(Xn, X) = 0","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":15.64},"width":124.98,"height":39.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-25.png","element":"img","alt":" (Xi)∞i=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a sequence of uniformly bounded ","element":"span"},{"style":{"fontStyle":"italic"},"text":"real-valued random variables, then ","element":"span"},{"style":{"height":15.92},"width":173.23,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-26.png","element":"img","alt":" Xn a.s.−→ X","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if and only if ","element":"span"},{"style":{"height":15.6},"width":434.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-27.png","element":"img","alt":" limn→∞ MSE(Xn, X) = 0","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We show each direction separately. 
First we show that ","element":"span"},{"style":{"height":15.92},"width":171.26,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-28.png","element":"img","alt":" Xn a.s.−→ X","inline":true,"padRight":true},{"text":"implies ","element":"span"},{"style":{"height":15.6},"width":434.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-29.png","element":"img","alt":" limn→∞ MSE(Xn, X) = 0","inline":true},{"text":".","element":"span"}],[{"style":{"width":"53%"},"width":490,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-30.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.83},"width":314.34,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-31.png","element":"img","alt":" Yn := (Xn − X)2","inline":true},{"text":". By the continuous mapping theorem we have that ","element":"span"},{"style":{"height":17.59},"width":408.66,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-32.png","element":"img","alt":" Yn a.s.−→ (X − X)2 = 0","inline":true},{"text":". 
So, by Lemma ","element":"span"},{"text":"2 ","element":"span"},{"text":"(applied to ","element":"span"},{"style":{"height":15.6},"width":94.69,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-33.png","element":"img","alt":" E[Yn]","inline":true},{"text":") we have that","element":"span"}],[{"style":{"width":"45%"},"width":414,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-34.png","element":"img"}],[{"text":"Next ","element":"span"},{"text":"we ","element":"span"},{"text":"show ","element":"span"},{"text":"the ","element":"span"},{"text":"other ","element":"span"},{"text":"direction: ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":15.6},"width":463.44,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-35.png","element":"img","alt":"limn→∞ MSE(Xn, X) = 0","inline":true,"padRight":true},{"text":"implies ","element":"span"},{"style":{"height":15.92},"width":200.44,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-36.png","element":"img","alt":" Xn a.s.−→ X","inline":true},{"text":". 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"and all ","element":"span"},{"style":{"height":13.13},"width":51.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-37.png","element":"img","alt":" Xn","inline":true,"padRight":true},{"text":"be random variables on the probability space ","element":"span"},{"style":{"height":15.6},"width":901.76,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-38.png","element":"img","alt":"(Ω, Σ, P), A = {ω ∈ Ω : limn→∞ MSE(Xn, X) = 0}","inline":true},{"text":", and ","element":"span"},{"style":{"height":15.6},"width":742.76,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-39.png","element":"img","alt":" B = {ω ∈ A : limn→∞ Xn ̸= X}","inline":true},{"text":". ","element":"span"},{"text":"If ","element":"span"},{"style":{"height":15.6},"width":469.46,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/11-40.png","element":"img","alt":"limn→∞ MSE(Xn, X) = 0","inline":true},{"text":", then by the definition of","element":"span"}],[{"text":"MSE we have that:","element":"span"}],[{"style":{"width":"60%"},"width":550,"height":688,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-0.png","element":"img"}],[{"text":"where we get ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"by using the bounded convergence theorem to pass the limit inside the integral and the fact that ","element":"span"},{"style":{"height":16.83},"width":185.24,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-1.png","element":"img","alt":"(Xn − X)2","inline":true,"padRight":true},{"text":"is a continuous function of 
","element":"span"},{"style":{"height":13.13},"width":51.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-2.png","element":"img","alt":" Xn","inline":true,"padRight":true},{"text":"to then move the limit to the ","element":"span"},{"style":{"height":13.13},"width":51.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-3.png","element":"img","alt":" Xn","inline":true,"padRight":true},{"text":"term. Notice that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b)","element":"span"},{"text":", ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(c)","element":"span"},{"text":", and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(d) ","element":"span"},{"text":"are all non-negative, and so they must all be zero for the equality with zero to hold. ","element":"span"},{"text":"We have that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(d) ","element":"span"},{"text":"is necessarily zero due to the definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"and our assumption that ","element":"span"},{"style":{"height":15.6},"width":434.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-4.png","element":"img","alt":"limn→∞ MSE(Xn, X) = 0","inline":true},{"text":". Similarly, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(c) ","element":"span"},{"text":"is zero because, from the definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":", on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A \\ B ","element":"span"},{"text":"we have ","element":"span"},{"style":{"height":13.52},"width":286.48,"height":33.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-5.png","element":"img","alt":" limn→∞ Xn = X","inline":true},{"text":". 
However, in ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b)","element":"span"},{"text":", by the definition of ","element":"span"},{"style":{"height":13.53},"width":313.64,"height":33.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-6.png","element":"img","alt":" B, limn→∞ Xn −X","inline":true,"padRight":true},{"text":"is non-zero, and so for the equality with zero to hold, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"must have measure zero. That is, ","element":"span"},{"style":{"height":15.6},"width":447.04,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-7.png","element":"img","alt":" Pr(limn→∞ Xn ̸= X) = 0","inline":true},{"text":", and thus ","element":"span"},{"style":{"height":15.6},"width":426.81,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-8.png","element":"img","alt":" Pr(limn→∞ Xn = X) = 1","inline":true},{"text":".","element":"span"}],[{"text":"Next we show that if two sequences of random variables converge to the same random variable, then any sequence of random variables bounded between the two sequences must also converge to the same random variable.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":15.99},"width":359.24,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-9.png","element":"img","alt":" Xn a.s.−→ X, Zn a.s.−→ X","inline":true},{"style":{"fontStyle":"italic"},"text":", and for all ","element":"span"},{"style":{"height":13.2},"width":137.33,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-10.png","element":"img","alt":" n, Xn ≤","inline":true},{"style":{"height":13.2},"width":140.8,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-11.png","element":"img","alt":"Yn ≤ Zn","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":15.92},"width":161.64,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-12.png","element":"img","alt":" Yn a.s.−→ X","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"text":"Pr","element":"span"},{"style":{"height":28.4},"width":23,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-13.png","element":"img","alt":"�","inline":true},{"text":"lim","element":"span"},{"style":{"height":28.4},"width":393.51,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-14.png","element":"img","alt":"n→∞ Yn = X�= Pr� �","inline":true},{"text":"lim","element":"span"},{"style":{"height":28.4},"width":240.96,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-15.png","element":"img","alt":"n→∞ Yn ≤ X�","inline":true,"padRight":true},{"text":"(5) ","element":"span"},{"style":{"height":28.4},"width":72.55,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-16.png","element":"img","alt":"� 
�","inline":true},{"text":"lim","element":"span"},{"style":{"height":28.4},"width":270.6,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-17.png","element":"img","alt":"n→∞ Yn ≥ X� �","inline":true}],[{"text":"Since","element":"span"}],[{"style":{"width":"74%"},"width":674,"height":216,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-18.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"73%"},"width":669,"height":215,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-19.png","element":"img"}],[{"text":"we have that (","element":"span"},{"text":"5","element":"span"},{"text":") is the probability of the joint occurrence of two probability one events, and so","element":"span"}],[{"style":{"width":"71%"},"width":651,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-20.png","element":"img"}],[{"text":"Next we show that if the difference between two sequences converges almost surely to zero, then we can substitute one sequence for the other as an input to a continuous function without changing the almost sure convergence properties of the function:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a continuous function, ","element":"span"},{"style":{"height":17.6},"width":235.05,"height":43.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-21.png","element":"img","alt":" f(Xn) a.s.−→ X","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":15.92},"width":248.28,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-22.png","element":"img","alt":" Yn − Xn a.s.−→ 0","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":17.59},"width":214.96,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-23.png","element":"img","alt":" f(Yn) a.s.−→ X","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"width":"99%"},"width":904,"height":795,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-24.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"holds because ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is a continuous function, and where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"holds because it gives sufficient conditions for the event in the line above to hold, and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(c) ","element":"span"},{"text":"holds because under our assumptions the two events both occur with probability one. 
So we can conclude that ","element":"span"},{"style":{"height":17.59},"width":214.96,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-25.png","element":"img","alt":" f(Yn) a.s.−→ X","inline":true},{"text":".","element":"span"}],[{"text":"Next we review two standard forms of the strong law of large numbers.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 6 ","element":"span"},{"text":"(Khintchine Strong Law of Large Numbers)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":15.64},"width":133.6,"height":39.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-26.png","element":"img","alt":" {Xi}_{i=1}^{∞}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be independent and identically distributed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"random variables. Then ","element":"span"},{"style":{"height":18.66},"width":267.7,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-27.png","element":"img","alt":" ((1/n) ∑_{i=1}^{n} Xi)_{n=1}^{∞}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a sequence of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"random variables that converges almost surely to ","element":"span"},{"style":{"height":15.6},"width":100.6,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/12-28.png","element":"img","alt":" E[X1]","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"See the work of ","element":"span"},{"text":"Sen & Singer ","element":"span"},{"text":"(","element":"span"},{"text":"1993","element":"span"},{"text":", Theorem 2.3.13).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 7 ","element":"span"},{"text":"(Kolmogorov Strong Law of Large Numbers)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":15.64},"width":133.6,"height":39.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-0.png","element":"img","alt":" {Xi}∞i=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be independent (not necessarily identically ","element":"span"},{"style":{"fontStyle":"italic"},"text":"distributed) random variables. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If all ","element":"span"},{"style":{"height":13.12},"width":43.13,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-1.png","element":"img","alt":" Xi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"have the same mean and bounded variance (i.e., there is a finite constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":15.6},"width":382.56,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-2.png","element":"img","alt":" i ≥ 1, Var(Xi) ≤ b","inline":true},{"style":{"fontStyle":"italic"},"text":"), then ","element":"span"},{"style":{"height":18.66},"width":267.69,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-3.png","element":"img","alt":"( 1n�ni=1 Xi)∞n=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a sequence of random variables that 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"converges almost surely to ","element":"span"},{"style":{"height":15.6},"width":100.6,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-4.png","element":"img","alt":" E[X1]","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"See the work of ","element":"span"},{"text":"Sen & Singer ","element":"span"},{"text":"(","element":"span"},{"text":"1993","element":"span"},{"text":", Theorem 2.3.10 with Proposition 2.3.10).","element":"span"}],[{"text":"In Corollary ","element":"span"},{"text":"1 ","element":"span"},{"text":"we present a simple extension of Kolmogorov’s strong law of large numbers that we often still refer to as Kolmogorov’s strong law of large numbers:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Corollary 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":15.64},"width":133.6,"height":39.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-5.png","element":"img","alt":" {Xi}∞i=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be independent (not necessarily ","element":"span"},{"style":{"fontStyle":"italic"},"text":"identically distributed) random variables. 
If all ","element":"span"},{"style":{"height":13.12},"width":43.12,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-6.png","element":"img","alt":" Xi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"have the same mean and are uniformly bounded by a finite constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":18.66},"width":267.7,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-7.png","element":"img","alt":" ((1/n) ∑_{i=1}^{n} Xi)_{n=1}^{∞}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a sequence of random ","element":"span"},{"style":{"fontStyle":"italic"},"text":"variables that converges almost surely to ","element":"span"},{"style":{"height":15.6},"width":100.6,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-8.png","element":"img","alt":" E[X1]","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"For all ","element":"span"},{"style":{"height":14.33},"width":139.49,"height":35.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-9.png","element":"img","alt":" i ∈ N>0","inline":true,"padRight":true},{"text":"we have that ","element":"span"},{"style":{"height":15.6},"width":146.77,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-10.png","element":"img","alt":" |Xi| ≤ b","inline":true,"padRight":true},{"text":"surely, so from Popoviciu’s inequality, ","element":"span"},{"style":{"height":16.83},"width":219.69,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-11.png","element":"img","alt":" Var(Xi) ≤ b2","inline":true},{"text":", and so we can apply Theorem ","element":"span"},{"text":"7","element":"span"},{"text":".","element":"span"}],[{"text":"We now turn to results that are more specific to reinforcement learning and off-policy policy evaluation. Lemma ","element":"span"},{"text":"6 ","element":"span"},{"text":"establishes a relationship between the expected values of ","element":"span"},{"style":{"height":15.6},"width":132.57,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-12.png","element":"img","alt":"ˆrπe(s, i)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":178.89,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-13.png","element":"img","alt":" ˆrπe(s, A, i)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is generated by some policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-14.png","element":"img","alt":" 
π","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.83},"width":213.15,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-15.png","element":"img","alt":" (πe, π) ∈ Π2","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":15.6},"width":312.64,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-16.png","element":"img","alt":" (π(a|s) = 0) =⇒","inline":true},{"style":{"height":15.6},"width":220.96,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-17.png","element":"img","alt":"(πe(a|s) = 0)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.6},"width":239.82,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-18.png","element":"img","alt":" (a, s) ∈ A × S","inline":true},{"style":{"fontStyle":"italic"},"text":". Then for all ","element":"span"},{"style":{"height":15.6},"width":116.42,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-19.png","element":"img","alt":" (s, i) ∈","inline":true},{"style":{"height":15.13},"width":141.04,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-20.png","element":"img","alt":"S × N≥0","inline":true},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"77%"},"width":707,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"First, recall from (","element":"span"},{"text":"1","element":"span"},{"text":") that for all ","element":"span"},{"style":{"height":15.6},"width":231.64,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-22.png","element":"img","alt":" (s, i) ∈ S ×","inline":true}],[{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , n","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":":","element":"span"}],[{"style":{"width":"82%"},"width":751,"height":695,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-23.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"holds by the assumption that ","element":"span"},{"style":{"height":15.6},"width":293.54,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-24.png","element":"img","alt":" (π(a|s) = 0) =⇒","inline":true},{"style":{"height":15.6},"width":219.53,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-25.png","element":"img","alt":"(πe(a|s) = 0)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":15.6},"width":237.83,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-26.png","element":"img","alt":" (a, s) ∈ A × S","inline":true},{"text":".","element":"span"}],[{"text":"Corollary ","element":"span"},{"text":"2 ","element":"span"},{"text":"extends Lemma ","element":"span"},{"text":"6 ","element":"span"},{"text":"to show a relationship between ","element":"span"},{"style":{"height":15.6},"width":110.05,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-27.png","element":"img","alt":" ˆvπe (s)","inline":true,"padRight":true},{"text":"and the expected value of 
","element":"span"},{"style":{"height":15.6},"width":185.48,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-28.png","element":"img","alt":" ˆqπe (s, A, i)","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is generated by some policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-29.png","element":"img","alt":" π","inline":true},{"text":":","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Corollary 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.83},"width":205.34,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-30.png","element":"img","alt":" (πe, π) ∈ Π2","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":15.6},"width":297.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-31.png","element":"img","alt":" (π(a|s) = 0) =⇒","inline":true},{"style":{"height":15.6},"width":219.53,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-32.png","element":"img","alt":"(πe(a|s) = 0)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.6},"width":237.83,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-33.png","element":"img","alt":" (a, s) ∈ A × S","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then for all ","element":"span"},{"style":{"height":11.6},"width":91.59,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-34.png","element":"img","alt":" s ∈ S","inline":true},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"71%"},"width":654,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-35.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We have from Lemma ","element":"span"},{"text":"6 ","element":"span"},{"text":"that for all ","element":"span"},{"style":{"height":15.12},"width":128,"height":37.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-36.png","element":"img","alt":" i ∈ N≥0","inline":true},{"text":",","element":"span"}],[{"style":{"width":"77%"},"width":707,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-37.png","element":"img"}],[{"text":"Summing both sides over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and multiplying by ","element":"span"},{"style":{"height":15.63},"width":34.23,"height":39.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-38.png","element":"img","alt":" γt","inline":true,"padRight":true},{"text":"we have that:","element":"span"}],[{"style":{"width":"98%"},"width":901,"height":644,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-39.png","element":"img"}],[{"text":"Before presenting the next theorem, notice that we can express the DR estimator, (","element":"span"},{"text":"2","element":"span"},{"text":"), as ","element":"span"},{"style":{"height":18.66},"width":442.08,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/13-40.png","element":"img","alt":" DR(D) = (1/n) ∑ni=1 
DRi(D)","inline":true,"padRight":true},{"text":"if","element":"span"}],[{"style":{"width":"94%"},"width":857,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-0.png","element":"img"}],[{"text":"Lemma ","element":"span"},{"text":"7 ","element":"span"},{"text":"gives conditions under which the DR estimator is an unbiased estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-1.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"when using only one trajectory. This lemma makes up the bulk of the proof that the full DR estimator is unbiased; we have placed it in a separate lemma because it is also useful when showing that the DR estimator is strongly consistent.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumption ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds then ","element":"span"},{"style":{"height":15.6},"width":242.57,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-2.png","element":"img","alt":" E[DRi(D)] =","inline":true},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-3.png","element":"img","alt":"v(πe)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.6},"width":228.99,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-4.png","element":"img","alt":" i ∈ {1, . . . , n}","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Recall that","element":"span"}],[{"style":{"width":"94%"},"width":857,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-5.png","element":"img"}],[{"text":"First, notice that ","element":"span"},{"style":{"height":19.15},"width":261.02,"height":47.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-6.png","element":"img","alt":"∑∞t=0 γtρHit RHit","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"per-decision importance sampling ","element":"span"},{"text":"(PDIS) estimator, which is known to be an unbiased estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-7.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"(","element":"span"},{"text":"Precup et al.","element":"span"},{"text":", ","element":"span"},{"text":"2000","element":"span"},{"text":"; ","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":"). 
So, we need only show that the remaining terms in the definition of ","element":"span"},{"style":{"height":15.6},"width":134.34,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-8.png","element":"img","alt":" DRi(D)","inline":true,"padRight":true},{"text":"have expected value zero, i.e., that","element":"span"}],[{"style":{"width":"98%"},"width":897,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-9.png","element":"img"}],[{"text":"By Corollary ","element":"span"},{"text":"2 ","element":"span"},{"text":"(which requires Assumption ","element":"span"},{"text":"1","element":"span"},{"text":") we have that","element":"span"}],[{"style":{"width":"97%"},"width":892,"height":537,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-10.png","element":"img"}],[{"text":"For completeness, next we show formally the obvious result that Assumption ","element":"span"},{"text":"1 ","element":"span"},{"text":"implies that partial trajectories that occur under the evaluation policy must occur under the behavior policy.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 8. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assumption ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"implies that if ","element":"span"},{"style":{"height":16.43},"width":282.37,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-11.png","element":"img","alt":" Pr(Ht =ht|πi) =","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":16.43},"width":373.85,"height":41.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-12.png","element":"img","alt":" Pr(Ht = ht|πe) = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.6},"width":257.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-13.png","element":"img","alt":" i ∈ {1, . . . , n}","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":16.43},"width":827.09,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-14.png","element":"img","alt":"ht := (s0, a0, r0, s1, . . . , st−1, at−1, rt−1, st) ∈ Ht","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":12.8},"width":175.8,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-15.png","element":"img","alt":"0 ≤ t < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 ","element":"span"},{"text":"then ","element":"span"},{"style":{"height":16.43},"width":179.34,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-16.png","element":"img","alt":" ht = (s0),","inline":true,"padRight":true},{"text":"which does not depend on the policy, so clearly if ","element":"span"},{"style":{"height":16.83},"width":378.94,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-17.png","element":"img","alt":" Pr(H0 = h0|πi) = 0","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":16.83},"width":348.73,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-18.png","element":"img","alt":"Pr(H0 = h0|πe) = 0","inline":true},{"text":". Hereafter we assume ","element":"span"},{"style":{"height":12.8},"width":187.33,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-19.png","element":"img","alt":" 1 ≤ t < ∞","inline":true},{"text":". 
Notice that for any ","element":"span"},{"style":{"height":11.6},"width":99.9,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-20.png","element":"img","alt":" π ∈ Π","inline":true},{"text":",","element":"span"}],[{"style":{"width":"95%"},"width":869,"height":658,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-21.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"comes from repeated application of the rule that, for any random variables ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":", ","element":"span"},{"text":"Pr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, Y ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":") = Pr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") Pr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"and the Markov property for state transitions, actions, and rewards, and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"comes from the definitions of 
","element":"span"},{"style":{"height":14},"width":125.54,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-22.png","element":"img","alt":" d0, π, R","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"in MDPNv1.","element":"span"}],[{"text":"So, if ","element":"span"},{"style":{"height":15.6},"width":345.22,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-23.png","element":"img","alt":" Pr(Ht = ht|πi) = 0","inline":true},{"text":", then one of the terms in the product above (using ","element":"span"},{"style":{"height":9.13},"width":33.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-24.png","element":"img","alt":" πi","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-25.png","element":"img","alt":" π","inline":true},{"text":") must be zero. If that term is not a ","element":"span"},{"style":{"height":9.13},"width":33.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-26.png","element":"img","alt":" πi","inline":true,"padRight":true},{"text":"term, then it also shows up in ","element":"span"},{"style":{"height":15.6},"width":258.62,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-27.png","element":"img","alt":" Pr(Ht = ht|πe)","inline":true},{"text":", and so ","element":"span"},{"style":{"height":15.6},"width":332.42,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-28.png","element":"img","alt":" Pr(Ht = ht|πe) = 0","inline":true},{"text":". 
If the term is a ","element":"span"},{"style":{"height":9.12},"width":33.11,"height":22.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-29.png","element":"img","alt":" πi","inline":true,"padRight":true},{"text":"term, then by Assumption ","element":"span"},{"text":"1","element":"span"},{"text":", the corresponding ","element":"span"},{"style":{"height":9.12},"width":37.11,"height":22.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-30.png","element":"img","alt":" πe","inline":true,"padRight":true},{"text":"term must also be zero, and so ","element":"span"},{"style":{"height":15.6},"width":322.84,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-31.png","element":"img","alt":" Pr(Ht = ht|πe) = 0","inline":true},{"text":".","element":"span"}],[{"text":"Next, recall the known result that the ratio of partial trajectory probabilities under two different policies can be written in terms of the two policies:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":9.13},"width":37.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-32.png","element":"img","alt":" πe","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.13},"width":36.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-33.png","element":"img","alt":" πb","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be any two policies and ","element":"span"},{"style":{"height":14.33},"width":128.64,"height":35.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-34.png","element":"img","alt":" t ∈ N>0","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Let ","element":"span"},{"style":{"height":13.13},"width":34.34,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-35.png","element":"img","alt":" ht","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be any history of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"style":{"fontStyle":"italic"},"text":"that has non-zero probability under ","element":"span"},{"style":{"height":9.12},"width":36.1,"height":22.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-36.png","element":"img","alt":" πb","inline":true},{"style":{"fontStyle":"italic"},"text":", i.e., ","element":"span"},{"style":{"height":15.6},"width":312.58,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-37.png","element":"img","alt":" Pr(Ht =ht|πb) ̸= 0","inline":true},{"style":{"fontStyle":"italic"},"text":". Then","element":"span"}],[{"style":{"width":"57%"},"width":524,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/14-38.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"See the works of (","element":"span"},{"text":"Precup et al.","element":"span"},{"text":", ","element":"span"},{"text":"2000","element":"span"},{"text":") or (","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":", Lemma 1).","element":"span"}],[{"text":"Next we establish Lemma ","element":"span"},{"text":"10","element":"span"},{"text":", which states that we can use importance sampling to generate unbiased estimates of any function of partial trajectories in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". 
Recall that whenever we write ","element":"span"},{"style":{"height":13.13},"width":43.24,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-0.png","element":"img","alt":" Hi","inline":true,"padRight":true},{"text":"(or ","element":"span"},{"style":{"height":16.47},"width":47.39,"height":41.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-1.png","element":"img","alt":" Hti","inline":true,"padRight":true},{"text":") we always mean a trajectory generated ","element":"span"},{"text":"by ","element":"span"},{"style":{"height":9.13},"width":33.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-2.png","element":"img","alt":" πi","inline":true},{"text":", so ","element":"span"},{"style":{"height":13.13},"width":129.96,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-3.png","element":"img","alt":" Hi ∼ πi","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 10. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumption ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds, then for all ","element":"span"},{"style":{"height":15.6},"width":121.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-4.png","element":"img","alt":" (t, i) ∈","inline":true},{"style":{"height":15.93},"width":309.51,"height":39.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-5.png","element":"img","alt":"N≥−1 × {1, . . . 
, n}","inline":true},{"style":{"fontStyle":"italic"},"text":":","element":"span"}],[{"style":{"width":"73%"},"width":666,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for any real-valued function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"If ","element":"span"},{"style":{"height":10.4},"width":124.22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-7.png","element":"img","alt":" t = −1","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":16.83},"width":220.05,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-8.png","element":"img","alt":" Ht−1 = (S0)","inline":true},{"text":", which does not depend on the policy, so the result is immediate. 
If ","element":"span"},{"style":{"height":12.8},"width":85.71,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-9.png","element":"img","alt":" t ≥ 0","inline":true},{"text":":","element":"span"}],[{"style":{"width":"85%"},"width":777,"height":734,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"comes from Lemma ","element":"span"},{"text":"9 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"comes from Lemma ","element":"span"},{"text":"8","element":"span"},{"text":", which requires Assumption ","element":"span"},{"text":"1","element":"span"},{"text":".","element":"span"}],[{"text":"We can use Lemma ","element":"span"},{"text":"10 ","element":"span"},{"text":"to show the well-known result that the expected value of an importance weight is one:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 11. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":9.13},"width":33.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-11.png","element":"img","alt":" πi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":15.13},"width":171.43,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-12.png","element":"img","alt":" t ∈ N≥−1","inline":true},{"style":{"fontStyle":"italic"},"text":", if Assumption ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds, then ","element":"span"},{"style":{"height":16.83},"width":156.23,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-13.png","element":"img","alt":" E[ρit] = 1","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"This follows from Lemma ","element":"span"},{"text":"10 ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":16.83},"width":229.36,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-14.png","element":"img","alt":" f(Ht+1) := 1","inline":true},{"text":".","element":"span"}],[{"style":{"width":"3%"},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-15.png","element":"img"}],[{"text":"Next we establish a lemma that will be crucial to showing that the WDR estimator is strongly consistent:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 12. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":15.13},"width":148.64,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-16.png","element":"img","alt":" t ∈ N≥0","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":16.03},"width":279.31,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-17.png","element":"img","alt":" ft : Ht+1 → R","inline":true},{"style":{"fontStyle":"italic"},"text":". If Assumption ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds, ","element":"span"},{"style":{"height":14},"width":104.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-18.png","element":"img","alt":" ft = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.13},"width":134.64,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-19.png","element":"img","alt":" t ∈ N≥L","inline":true},{"style":{"fontStyle":"italic"},"text":", and either:","element":"span"}],[{"style":{"width":"63%"},"width":577,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case 2: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assumption ","element":"span"},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds and there is a finite ","element":"span"},{"style":{"height":14},"width":66.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-21.png","element":"img","alt":"fmax","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all 
","element":"span"},{"style":{"height":15.13},"width":147.83,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-22.png","element":"img","alt":" t ∈ N≥0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.03},"width":225.48,"height":35.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-23.png","element":"img","alt":" ht+1 ∈ Ht+1","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":16.83},"width":277.47,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-24.png","element":"img","alt":"|ft(ht+1)| < fmax","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"then","element":"span"}],[{"style":{"width":"89%"},"width":818,"height":244,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-25.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let","element":"span"}],[{"style":{"width":"56%"},"width":519,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-26.png","element":"img"}],[{"text":"so that the left side of (","element":"span"},{"text":"6","element":"span"},{"text":") can be written as ","element":"span"},{"style":{"height":17.65},"width":151.44,"height":44.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-27.png","element":"img","alt":"�∞t=0 Xtn","inline":true},{"text":". 
First ","element":"span"},{"text":"we multiply the numerator and denominator of ","element":"span"},{"style":{"height":16.26},"width":51.13,"height":40.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-28.png","element":"img","alt":" Xtn","inline":true,"padRight":true},{"text":"by ","element":"span"},{"style":{"height":18.66},"width":19,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-29.png","element":"img","alt":" 1n","inline":true,"padRight":true},{"text":"to ","element":"span"},{"text":"get:","element":"span"}],[{"style":{"width":"76%"},"width":694,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-30.png","element":"img"}],[{"text":"We will show that the numerator of (","element":"span"},{"text":"7","element":"span"},{"text":") converges almost surely to the desired value:","element":"span"}],[{"style":{"width":"95%"},"width":872,"height":148,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-31.png","element":"img"}],[{"text":"By Lemma ","element":"span"},{"text":"10","element":"span"},{"text":", which relies on Assumption ","element":"span"},{"text":"1","element":"span"},{"text":", we have that ","element":"span"},{"style":{"height":18.16},"width":794.24,"height":45.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-32.png","element":"img","alt":"E[ρitγtft(Ht+1i )] = E[γtft(Ht+1)|Ht+1 ∼ πe]","inline":true},{"text":". Consider the two cases from the statement of the lemma:","element":"span"}],[{"text":"1. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case 1: ","element":"span"},{"style":{"height":18.16},"width":85.9,"height":45.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-33.png","element":"img","alt":" Ht+1i","inline":true,"padRight":true},{"text":"is independent and identically distributed for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", so ","element":"span"},{"style":{"height":18.16},"width":220.61,"height":45.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-34.png","element":"img","alt":" ρitγtft(Ht+1i )","inline":true,"padRight":true},{"text":"is also independent and identically distributed for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". Therefore by Khintchine’s strong law of large numbers, Theorem ","element":"span"},{"text":"6","element":"span"},{"text":", we have (","element":"span"},{"text":"8","element":"span"},{"text":").","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case 2: ","element":"span"},{"style":{"height":18.16},"width":85.91,"height":45.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-35.png","element":"img","alt":" Ht+1i","inline":true,"padRight":true},{"text":"are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"necessarily identically distributed since there may be multiple behavior policies, so we cannot directly apply Khintchine’s strong law of large numbers. 
Instead notice that ","element":"span"},{"style":{"height":16.66},"width":32.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-36.png","element":"img","alt":" ρit","inline":true,"padRight":true},{"text":"is bounded ","element":"span"},{"text":"by ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-37.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"due to Assumption ","element":"span"},{"text":"4","element":"span"},{"text":", and so ","element":"span"},{"style":{"height":18.16},"width":286.05,"height":45.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-38.png","element":"img","alt":" |ρitγtft(Ht+1i )| ≤","inline":true},{"style":{"height":15.63},"width":126.02,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-39.png","element":"img","alt":"βγtfmax","inline":true},{"text":". So, we can apply Kolmogorov’s strong law of large numbers, Corollary ","element":"span"},{"text":"1","element":"span"},{"text":", to get (","element":"span"},{"text":"8","element":"span"},{"text":").","element":"span"}],[{"text":"Next we show that the denominator of (","element":"span"},{"text":"7","element":"span"},{"text":") converges almost surely to one:","element":"span"}],[{"style":{"width":"63%"},"width":575,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-40.png","element":"img"}],[{"text":"By Lemma ","element":"span"},{"text":"11","element":"span"},{"text":", which relies on Assumption ","element":"span"},{"text":"1","element":"span"},{"text":", we have that ","element":"span"},{"style":{"height":16.83},"width":156.27,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-41.png","element":"img","alt":"E[ρit] = 1","inline":true},{"text":". 
Again consider the two possible settings:","element":"span"}],[{"text":"1. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case 1: ","element":"span"},{"style":{"height":18.16},"width":85.91,"height":45.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-42.png","element":"img","alt":" Ht+1i","inline":true,"padRight":true},{"text":"is independent and identically distributed for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", so ","element":"span"},{"style":{"height":16.66},"width":32.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-43.png","element":"img","alt":" ρit","inline":true,"padRight":true},{"text":"is also independent and identically distributed for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". Therefore by Khintchine’s strong law of large numbers we have (","element":"span"},{"text":"9","element":"span"},{"text":").","element":"span"}],[{"text":"2. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case 2: ","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":16.66},"width":120.34,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/15-44.png","element":"img","alt":" ρit ≤ β","inline":true},{"text":", we can apply Kolmogorov’s strong law of large numbers to get (","element":"span"},{"text":"9","element":"span"},{"text":").","element":"span"}],[{"text":"By applying Property ","element":"span"},{"text":"2 ","element":"span"},{"text":"to (","element":"span"},{"text":"8","element":"span"},{"text":") and (","element":"span"},{"text":"9","element":"span"},{"text":") we have that for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", ","element":"span"},{"style":{"height":19.71},"width":583.87,"height":49.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-0.png","element":"img","alt":"Xtn a.s.−→ E[γtft(Ht+1)|Ht+1 ∼ πe]","inline":true},{"text":". So,","element":"span"}],[{"style":{"width":"93%"},"width":850,"height":464,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-1.png","element":"img"}],[{"text":"2. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case 2: ","element":"span"},{"text":"In order to apply Property ","element":"span"},{"text":"4 ","element":"span"},{"text":"we must show that there exists a function ","element":"span"},{"style":{"height":15.13},"width":239,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-2.png","element":"img","alt":" g : N≥0 → R","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.65},"width":264.42,"height":44.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-3.png","element":"img","alt":"∑∞t=0 g(t) < ∞","inline":true,"padRight":true},{"text":"and for all ","element":"span"},{"style":{"height":14.33},"width":147.28,"height":35.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-4.png","element":"img","alt":" n ∈ N>0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.13},"width":138,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-5.png","element":"img","alt":" t ∈ N≥0","inline":true},{"text":", ","element":"span"},{"style":{"height":16.43},"width":202.71,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-6.png","element":"img","alt":"|Xtn| ≤ g(t)","inline":true},{"text":". 
The following definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"satisfies ","element":"span"},{"text":"these requirements:","element":"span"}],[{"style":{"width":"72%"},"width":662,"height":389,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-7.png","element":"img"}],[{"text":"since we have assumed that ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-8.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"can only be ","element":"span"},{"text":"1 ","element":"span"},{"text":"in the finite-horizon setting, where ","element":"span"},{"style":{"height":14.8},"width":127.96,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-9.png","element":"img","alt":" L ̸= ∞","inline":true},{"text":". Also, ","element":"span"},{"style":{"height":16.43},"width":120.98,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-10.png","element":"img","alt":" |Xtn| =","inline":true,"padRight":true},{"text":"0 = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"by definition if ","element":"span"},{"style":{"height":13.2},"width":92.71,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-11.png","element":"img","alt":" t ≥ L","inline":true,"padRight":true},{"text":"and if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t < L ","element":"span"},{"text":"then:","element":"span"}],[{"style":{"width":"91%"},"width":831,"height":501,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-12.png","element":"img"}],[{"text":"Finally, we establish an extension of Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"that 
will facilitate its use with sequences that are not quite in the form that it is defined for:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 13. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":15.13},"width":129.38,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-13.png","element":"img","alt":" t ∈ N≥0","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":15.63},"width":201.52,"height":39.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-14.png","element":"img","alt":" ft : Ht → R","inline":true},{"style":{"fontStyle":"italic"},"text":". If Assumption ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds, ","element":"span"},{"style":{"height":14},"width":104.35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-15.png","element":"img","alt":" ft = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":15.13},"width":134.64,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-16.png","element":"img","alt":" t ∈ N≥L","inline":true},{"style":{"fontStyle":"italic"},"text":", and either:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case 1: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assumptions ","element":"span"},{"text":"2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"text":"3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hold. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"or","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case 2: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assumption ","element":"span"},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds and there is a finite ","element":"span"},{"style":{"height":14},"width":66.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-17.png","element":"img","alt":" fmax","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":15.13},"width":141.66,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-18.png","element":"img","alt":" t ∈ N≥0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.43},"width":334.6,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-19.png","element":"img","alt":" ht ∈ Ht, |ft(ht)| <","inline":true},{"style":{"height":14},"width":66.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-20.png","element":"img","alt":"fmax","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"then","element":"span"}],[{"style":{"width":"99%"},"width":909,"height":141,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"By removing the first term of the sum and shifting the variable that the sum uses by one, we can rewrite the left side of (","element":"span"},{"text":"10","element":"span"},{"text":") as","element":"span"}],[{"style":{"width":"87%"},"width":800,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-22.png","element":"img"}],[{"text":"We have that","element":"span"}],[{"style":{"width":"76%"},"width":693,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-23.png","element":"img"}],[{"text":"by Khintchine’s strong law of large numbers in Case 1, and Kolmogorov’s strong law of large numbers in Case 2 (since ","element":"span"},{"style":{"height":14},"width":33.99,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-24.png","element":"img","alt":"f0","inline":true,"padRight":true},{"text":"is bounded). Also, by Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"(where the definition of ","element":"span"},{"style":{"height":14.33},"width":69.51,"height":35.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-25.png","element":"img","alt":" ft+1","inline":true,"padRight":true},{"text":"in this lemma is used for ","element":"span"},{"style":{"height":14},"width":30.98,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-26.png","element":"img","alt":" ft","inline":true,"padRight":true},{"text":"in our application of Lemma ","element":"span"},{"text":"12","element":"span"},{"text":") we have that","element":"span"}],[{"style":{"width":"90%"},"width":826,"height":244,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-27.png","element":"img"}],[{"text":"So by applying Property ","element":"span"},{"text":"3 ","element":"span"},{"text":"to (","element":"span"},{"text":"11","element":"span"},{"text":") and 
(","element":"span"},{"text":"12","element":"span"},{"text":") we have:","element":"span"}],[{"style":{"width":"96%"},"width":881,"height":579,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-28.png","element":"img"}]]},{"heading":"B. Doubly Robust Derivation and Proofs","paragraphs":[[{"text":"In this appendix we provide an alternate derivation of the DR estimator using control variates. The idea behind control variates is as follows. Suppose that we would like to estimate ","element":"span"},{"style":{"height":15.6},"width":167.99,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-29.png","element":"img","alt":" θ := E[X]","inline":true,"padRight":true},{"text":"given a sample of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":". The obvious estimator would be ","element":"span"},{"style":{"height":17.22},"width":134.47,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-30.png","element":"img","alt":"ˆθ1 := X","inline":true},{"text":". However, if we have a sample of another random variable, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":", with known expected value, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"]","element":"span"},{"text":", then the estimator ","element":"span"},{"style":{"height":18.89},"width":351.86,"height":47.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/16-31.png","element":"img","alt":"ˆθ2 := X − Y + E[Y ]","inline":true,"padRight":true},{"text":"may have lower variance. 
Specifically, while ","element":"span"},{"style":{"height":18.89},"width":303.58,"height":47.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-0.png","element":"img","alt":" Var(ˆθ1) = Var(X)","inline":true},{"text":", we have that ","element":"span"},{"style":{"height":18.89},"width":693.87,"height":47.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-1.png","element":"img","alt":" Var(ˆθ2) = Var(X)+Var(Y )−2 Cov(X, Y )","inline":true},{"text":". So, ","element":"span"},{"style":{"height":17.22},"width":33.21,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-2.png","element":"img","alt":"ˆθ2","inline":true,"padRight":true},{"text":"has lower variance than ","element":"span"},{"style":{"height":17.22},"width":33.2,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-3.png","element":"img","alt":" ˆθ1","inline":true,"padRight":true},{"text":"if ","element":"span"},{"text":"2 Cov(","element":"span"},{"style":{"fontStyle":"italic"},"text":"X, Y ","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"Var(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":")","element":"span"},{"text":". Often ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"is referred to as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"control variate","element":"span"},{"text":". 
Notice that the optimal control variate is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":":= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", since then ","element":"span"},{"style":{"height":18.89},"width":197.92,"height":47.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-4.png","element":"img","alt":" Var(ˆθ2) = 0","inline":true},{"text":". Furthermore, notice that ","element":"span"},{"style":{"height":17.22},"width":33.21,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-5.png","element":"img","alt":"ˆθ2","inline":true,"padRight":true},{"text":"remains an unbiased estimator of ","element":"span"},{"style":{"height":10.8},"width":18,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-6.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as long as the expected value of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"exists—","element":"span"},{"style":{"height":18.89},"width":134.05,"height":47.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-7.png","element":"img","alt":"E[ˆθ2] =","inline":true},{"style":{"height":15.6},"width":899.88,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-8.png","element":"img","alt":"E[X − Y + E[Y ]] = E[X] − E[Y ] + E[Y ] = E[X] = θ","inline":true},{"text":". 
Control variates have been used before in reinforcement learning to reduce the variance of policy gradient estimates (","element":"span"},{"text":"Bhatnagar et al.","element":"span"},{"text":", ","element":"span"},{"text":"2009","element":"span"},{"text":"), where the control variate was referred to as a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"baseline","element":"span"},{"text":".","element":"span"}],[{"text":"Recall that we have defined the DR estimator in (","element":"span"},{"text":"2","element":"span"},{"text":") as","element":"span"}],[{"style":{"width":"95%"},"width":870,"height":343,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-9.png","element":"img"}],[{"text":"In this definition the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"term is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"per-decision importance sampling ","element":"span"},{"text":"(PDIS) estimator, which is known to be an unbiased and strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-10.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"(","element":"span"},{"text":"Precup et al.","element":"span"},{"text":", ","element":"span"},{"text":"2000","element":"span"},{"text":"; ","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":"). Also, the control variate, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":", is mean zero, i.e., ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"] = 0","element":"span"},{"text":". 
To see why this control variate is reasonable, notice that all of the terms that are multiplied by ","element":"span"},{"style":{"height":16.66},"width":75.69,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-11.png","element":"img","alt":" γtwit","inline":true,"padRight":true},{"text":"approximately cancel:","element":"span"}],[{"style":{"width":"71%"},"width":647,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-12.png","element":"img"}],[{"text":"So, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"is a decent approximation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", and therefore ","element":"span"},{"text":"DR(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"will have low variance.","element":"span"}],[{"text":"Our derivation of the control variate used by the DR estimator is based on an alternate view of control variates. 
If we do not know the expected value of the control variate, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":", but we have another random variable, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":", such that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":"] = ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"]","element":"span"},{"text":", then we can use the unbiased estimator ","element":"span"},{"style":{"height":17.22},"width":285.69,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-13.png","element":"img","alt":"ˆθ3 = X − Y + Z","inline":true},{"text":". The variance of this estimator is given by ","element":"span"},{"style":{"height":18.89},"width":853.43,"height":47.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-14.png","element":"img","alt":" Var(ˆθ3) = Var(X)+Var(Y −Z)−2 Cov(X, Y −Z)","inline":true},{"text":". So, if ","element":"span"},{"style":{"height":10.8},"width":122.1,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-15.png","element":"img","alt":" Y ≈ X","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"has low variance, then this estimator may have lower variance than ","element":"span"},{"style":{"height":13.13},"width":33.2,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-16.png","element":"img","alt":" θ1","inline":true},{"text":". 
Technically, this is an ordinary application of control variates using ","element":"span"},{"style":{"height":10.8},"width":105.74,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-17.png","element":"img","alt":" Y − Z","inline":true,"padRight":true},{"text":"as the mean-zero control variate. We derive DR using this alternate view.","element":"span"}],[{"text":"We begin with the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"per-decision importance sampling ","element":"span"},{"text":"(PDIS) estimator, which is known to be an unbiased and strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-18.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"(","element":"span"},{"text":"Precup et al.","element":"span"},{"text":", ","element":"span"},{"text":"2000","element":"span"},{"text":";","element":"span"}],[{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":"). 
The PDIS estimator is given by:","element":"span"}],[{"style":{"width":"57%"},"width":525,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-19.png","element":"img"}],[{"text":"In order to reduce the variance of this estimator we will subtract a control variate that we expect to be highly correlated with the PDIS estimator, and then add back in the expected value of the control variate:","element":"span"}],[{"style":{"width":"96%"},"width":882,"height":360,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-20.png","element":"img"}],[{"text":"Here we expect the control variate to be similar to the PDIS estimator if the model’s reward predictions are accurate, i.e., if ","element":"span"},{"style":{"height":18.49},"width":393.07,"height":46.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-21.png","element":"img","alt":" RHit ≈ ˆrπe(SHit , AHit , 0)","inline":true},{"text":".","element":"span"}],[{"text":"If it could be used, (","element":"span"},{"text":"13","element":"span"},{"text":") would be an extremely low-variance estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-22.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"since ","element":"span"},{"style":{"height":10.8},"width":111.88,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-23.png","element":"img","alt":" X − Y","inline":true,"padRight":true},{"text":"would usually be near-zero and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"] ","element":"span"},{"text":"is a constant that is near 
","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-24.png","element":"img","alt":" v(πe)","inline":true},{"text":". However, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"text":"control variate","element":"span"},{"text":"] ","element":"span"},{"text":"is not known, and so we cannot use (","element":"span"},{"text":"13","element":"span"},{"text":") directly. Although estimating ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"] ","element":"span"},{"text":"is nearly as hard as estimating ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-25.png","element":"img","alt":" v(πe)","inline":true},{"text":", it is marginally easier. It is easier because ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-26.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"uses the unknown transition and reward functions of the MDP to produce the distribution of rewards at each time step, while ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"] ","element":"span"},{"text":"uses the known approximate model’s transition and reward function for the last transition before each reward occurs. 
We can therefore estimate ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"] ","element":"span"},{"text":"using an unbiased estimator that typically has lower variance than the control variate. In the alternate view of control variates this new term will be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":":","element":"span"}],[{"style":{"width":"98%"},"width":893,"height":398,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-27.png","element":"img"}],[{"text":"Here we expect the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"term to have lower variance than the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"term because for each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"it only depends on actions ","element":"span"},{"style":{"height":18.63},"width":234.92,"height":46.58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-28.png","element":"img","alt":"AHi1 , . . . , AHit−1","inline":true,"padRight":true},{"text":"and not ","element":"span"},{"style":{"height":18.31},"width":64.48,"height":45.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-29.png","element":"img","alt":" AHit","inline":true,"padRight":true},{"text":". 
This is reflected in its use of ","element":"span"},{"style":{"height":16.68},"width":71,"height":41.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-30.png","element":"img","alt":"ρit−1","inline":true,"padRight":true},{"text":"rather than ","element":"span"},{"style":{"height":16.67},"width":32.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/17-31.png","element":"img","alt":" ρit","inline":true},{"text":". Before continuing our derivation we","element":"span"}],[{"text":"verify that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"] = ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":"] ","element":"span"},{"text":"if Assumption ","element":"span"},{"text":"1 ","element":"span"},{"text":"holds:","element":"span"}],[{"style":{"width":"100%"},"width":911,"height":472,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"comes from Lemma ","element":"span"},{"text":"6","element":"span"},{"text":".","element":"span"}],[{"text":"So far, in (","element":"span"},{"text":"14","element":"span"},{"text":"), we have introduced a control variate into PDIS that we expect might reduce the variance of the estimator a little without introducing bias. 
However, it will still have high variance because ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"is a high-variance estimator of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"]","element":"span"},{"text":". To overcome this, we can introduce another control variate into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"to make it a lower-variance estimator of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"]","element":"span"},{"text":". So, we introduce another control variate:","element":"span"}],[{"style":{"width":"96%"},"width":877,"height":764,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-1.png","element":"img"}],[{"text":"Here ","element":"span"},{"style":{"height":15.6},"width":253.14,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-2.png","element":"img","alt":" E[Z′] = E[Y ′]","inline":true,"padRight":true},{"text":"(although we omit the proof of this claim), ","element":"span"},{"style":{"height":10.8},"width":45.13,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-3.png","element":"img","alt":" Y ′","inline":true,"padRight":true},{"text":"is similar to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"and so it serves as a good control variate for it, and ","element":"span"},{"style":{"height":11.13},"width":38.18,"height":27.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-4.png","element":"img","alt":" Z′","inline":true,"padRight":true},{"text":"will usually have lower variance than 
","element":"span"},{"style":{"height":10.8},"width":45.13,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-5.png","element":"img","alt":" Y ′","inline":true,"padRight":true},{"text":"because it uses ","element":"span"},{"style":{"height":16.68},"width":71,"height":41.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-6.png","element":"img","alt":" ρit−2","inline":true,"padRight":true},{"text":"rather than ","element":"span"},{"style":{"height":16.68},"width":71,"height":41.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-7.png","element":"img","alt":" ρit−1","inline":true},{"text":". However, ","element":"span"},{"text":"now ","element":"span"},{"style":{"height":10.8},"width":43.24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-8.png","element":"img","alt":" Z′","inline":true,"padRight":true},{"text":"is a high-variance estimator of ","element":"span"},{"style":{"height":15.6},"width":93.08,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-9.png","element":"img","alt":" E[Y ′]","inline":true},{"text":". We therefore introduce a control variate for ","element":"span"},{"style":{"height":10.8},"width":43.24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-10.png","element":"img","alt":" Z′","inline":true},{"text":", and this process repeats. This process of introducing control variates eventually terminates when the new control variate is not random. 
The resulting estimator is (we call this estimator ","element":"span"},{"text":"DR(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"because ","element":"span"}],[{"text":"we will show that it is equivalent to (","element":"span"},{"text":"2","element":"span"},{"text":")):","element":"span"}],[{"style":{"width":"97%"},"width":887,"height":376,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-11.png","element":"img"}],[{"text":"Next we will combine the ","element":"span"},{"style":{"height":10.8},"width":21.75,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-12.png","element":"img","alt":" ˆr","inline":true,"padRight":true},{"text":"terms into ","element":"span"},{"style":{"height":10.8},"width":21.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-13.png","element":"img","alt":" ˆv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":22.89,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-14.png","element":"img","alt":" ˆq","inline":true,"padRight":true},{"text":"terms to get a more succinct expression. To this end, we will use the property that ","element":"span"},{"style":{"height":21.65},"width":661.05,"height":54.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-15.png","element":"img","alt":"∑∞i=0 ∑ij=0 f(i, j) = ∑∞j=0 ∑∞i=j f(i, j)","inline":true,"padRight":true},{"text":"to ","element":"span"},{"text":"change the order of the sums over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":6.8},"width":20,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-16.png","element":"img","alt":" τ","inline":true},{"text":". 
We also split ","element":"span"},{"style":{"height":15.63},"width":34.23,"height":39.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-17.png","element":"img","alt":" γt","inline":true,"padRight":true},{"text":"into ","element":"span"},{"style":{"height":15.63},"width":115.79,"height":39.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-18.png","element":"img","alt":" γτγt−τ","inline":true},{"text":":","element":"span"}],[{"style":{"width":"98%"},"width":898,"height":416,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-19.png","element":"img"}],[{"text":"Next we perform a change of variable using ","element":"span"},{"style":{"height":13.6},"width":161.73,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-20.png","element":"img","alt":" j = t − τ","inline":true,"padRight":true},{"text":"to replace ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":":","element":"span"}],[{"style":{"width":"92%"},"width":840,"height":764,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-21.png","element":"img"}],[{"text":"Replacing the variable ","element":"span"},{"style":{"height":6.8},"width":20,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-22.png","element":"img","alt":" τ","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and using ","element":"span"},{"style":{"height":23.14},"width":129.68,"height":57.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/18-23.png","element":"img","alt":" wit = ρitn","inline":true,"padRight":true},{"text":"we get","element":"span"}],[{"text":"that:","element":"span"}],[{"style":{"width":"95%"},"width":870,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-0.png","element":"img"}],[{"text":"which is 
(","element":"span"},{"text":"2","element":"span"},{"text":").","element":"span"}],[{"text":"The original derivation of the DR estimator (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":") required the horizon to be finite and known. Our derivation makes neither of these assumptions. That is, it allows for infinite or indefinite horizons and for finite horizons where the horizon is not known. If the horizon, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", is finite and known, then one should ensure that the model uses all of the available information, including the known horizon and time step. In the next section we show that if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is finite and known, then our non-recursive definition of the DR estimator is equivalent to the recursive form of (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.1. Equivalence of DR Definitions","element":"span"}],[{"text":"In this section we show that our non-recursive definition of the DR estimator is equivalent to the recursive definition provided by ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":") when the horizon is finite and known.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 8. 
","element":"span"},{"text":"(","element":"span"},{"text":"2","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is equivalent to the DR estimator presented by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Jiang & Li ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"2015","element":"span"},{"style":{"fontStyle":"italic"},"text":") if the finite horizon, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":", of the MDP is known.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":") define the ","element":"span"},{"text":"DR ","element":"span"},{"text":"estimator for a single trajectory (i.e., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 1","element":"span"},{"text":") as the last element, ","element":"span"},{"style":{"height":13.13},"width":53.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-1.png","element":"img","alt":" XL","inline":true},{"text":", of a sequence, ","element":"span"},{"style":{"height":17.27},"width":124.98,"height":43.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-2.png","element":"img","alt":" (Xi)Li=0","inline":true},{"text":". This sequence is defined by the following ","element":"span"},{"text":"recurrence relation. 
Let ","element":"span"},{"style":{"height":13.13},"width":129.43,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-3.png","element":"img","alt":" X0 := 0","inline":true,"padRight":true},{"text":"and for all ","element":"span"},{"style":{"height":15.6},"width":240.16,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-4.png","element":"img","alt":" k ∈ {1, . . . , L}","inline":true,"padRight":true},{"text":"let","element":"span"}],[{"style":{"width":"90%"},"width":822,"height":226,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-5.png","element":"img"}],[{"text":"As in the definition of ","element":"span"},{"text":"DR(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"in (","element":"span"},{"text":"2","element":"span"},{"text":"), ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":") define the ","element":"span"},{"text":"DR ","element":"span"},{"text":"estimator for multiple trajectories to be the average of the estimator for each trajectory individually. So, to show that their recursive definition and our definition are equivalent, we need only show that they are equivalent when there is a single trajectory.","element":"span"}],[{"text":"Since hereafter in this proof we deal with only a single trajectory, we drop the superscripts that we use to specify the trajectory, i.e., we write ","element":"span"},{"style":{"height":10},"width":32.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-6.png","element":"img","alt":" ρt","inline":true,"padRight":true},{"text":"rather than ","element":"span"},{"style":{"height":16.66},"width":35.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-7.png","element":"img","alt":" ρ1t","inline":true},{"text":". 
Also let ","element":"span"},{"style":{"height":9.66},"width":134.68,"height":24.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-8.png","element":"img","alt":" πb := π1","inline":true,"padRight":true},{"text":"denote the single behavior policy. For further brevity, let","element":"span"}],[{"style":{"width":"35%"},"width":327,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-9.png","element":"img"}],[{"text":"First, notice that we can rewrite (","element":"span"},{"text":"2","element":"span"},{"text":") for the single-trajectory finite-horizon setting as:","element":"span"}],[{"style":{"width":"89%"},"width":814,"height":249,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-10.png","element":"img"}],[{"text":"since ","element":"span"},{"style":{"height":13.13},"width":44.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-11.png","element":"img","alt":" SL","inline":true,"padRight":true},{"text":"is surely the absorbing state and so ","element":"span"},{"style":{"height":13.13},"width":41.44,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-12.png","element":"img","alt":" Rt","inline":true},{"text":", ","element":"span"},{"style":{"height":15.6},"width":187.8,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-13.png","element":"img","alt":"ˆqπe (St, At)","inline":true},{"text":", and ","element":"span"},{"style":{"height":15.6},"width":129.31,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-14.png","element":"img","alt":" ˆvπe (St)","inline":true,"padRight":true},{"text":"are all zero for ","element":"span"},{"style":{"height":13.2},"width":121.42,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-15.png","element":"img","alt":" t ≥ L","inline":true},{"text":". 
","element":"span"},{"text":"To verify that this definition is equivalent to ","element":"span"},{"style":{"height":13.13},"width":53.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-16.png","element":"img","alt":" XL","inline":true},{"text":", we will define another sequence, ","element":"span"},{"style":{"height":17.27},"width":115.36,"height":43.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-17.png","element":"img","alt":" (Yi)Li=1","inline":true},{"text":", such that ","element":"span"},{"style":{"height":13.12},"width":140.62,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-18.png","element":"img","alt":" Xi = Yi","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":15.6},"width":232.11,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-19.png","element":"img","alt":"i ∈ {1, . . . , L}","inline":true,"padRight":true},{"text":"and such that ","element":"span"},{"style":{"height":15.6},"width":218.87,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-20.png","element":"img","alt":" YL = DR(D)","inline":true,"padRight":true},{"text":"trivially.","element":"span"}],[{"text":"Let","element":"span"}],[{"style":{"width":"96%"},"width":877,"height":151,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-21.png","element":"img"}],[{"text":"Notice that ","element":"span"},{"style":{"height":13.13},"width":43.51,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-22.png","element":"img","alt":" YL","inline":true,"padRight":true},{"text":"is identical to (","element":"span"},{"text":"16","element":"span"},{"text":") since ","element":"span"},{"style":{"height":16.43},"width":291.14,"height":41.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-23.png","element":"img","alt":" γL−LρL−L−1 = 
1","inline":true},{"text":". So, all that remains is to show that ","element":"span"},{"style":{"height":13.13},"width":152.93,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-24.png","element":"img","alt":" Yk = Xk","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":11.6},"width":63.93,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-25.png","element":"img","alt":" k ∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , L","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". We will show this using a proof by induction.","element":"span"}],[{"text":"For the base case, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"text":", it is straightforward to verify that ","element":"span"},{"style":{"height":13.13},"width":145.41,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-26.png","element":"img","alt":"X1 = Y1","inline":true},{"text":". 
For the inductive step we assume the inductive hypothesis that ","element":"span"},{"style":{"height":13.13},"width":233.66,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-27.png","element":"img","alt":" Xk−1 = Yk−1","inline":true,"padRight":true},{"text":"and show that then ","element":"span"},{"style":{"height":13.13},"width":98.06,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-28.png","element":"img","alt":" Xk =","inline":true},{"style":{"height":13.13},"width":38.52,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-29.png","element":"img","alt":"Yk","inline":true},{"text":":","element":"span"}],[{"style":{"width":"84%"},"width":774,"height":513,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-30.png","element":"img"}],[{"text":"Substituting in the definition of ","element":"span"},{"style":{"height":13.13},"width":78.89,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-31.png","element":"img","alt":" Yk−1","inline":true,"padRight":true},{"text":"and performing algebraic manipulations we have that:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"height":32.11},"width":712.69,"height":80.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-32.png","element":"img","alt":"k =ˆvπe (SL−k) + πeb(L − k)RL−k + πeb(L − k)γL−kρL−k","inline":true}],[{"style":{"width":"87%"},"width":796,"height":204,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-33.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":7.6},"width":30,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-34.png","element":"img","alt":" ×","inline":true,"padRight":true},{"text":"denotes that a line was split into multiple lines (we do not use cross-products anywhere in this paper). 
Since","element":"span"}],[{"style":{"width":"40%"},"width":365,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/19-35.png","element":"img"}],[{"text":"and by reordering terms, we have that","element":"span"}],[{"style":{"width":"100%"},"width":911,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-0.png","element":"img"}],[{"text":"Adding one more element to the summation so that it starts at ","element":"span"},{"style":{"height":10.8},"width":164.78,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-1.png","element":"img","alt":" t = L − k","inline":true},{"text":", and then explicitly subtracting off this additional term we have that:","element":"span"}],[{"style":{"width":"99%"},"width":910,"height":562,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-2.png","element":"img"}],[{"text":"Canceling several ","element":"span"},{"style":{"height":10},"width":22,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-3.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10},"width":20,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-4.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"terms, we have that:","element":"span"}],[{"style":{"width":"97%"},"width":889,"height":260,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-5.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"B.2. 
DR is Unbiased","element":"span"}],[{"text":"While ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":") showed that the DR estimator (with finite horizon) is an unbiased estimator of ","element":"span"},{"style":{"height":15.6},"width":89.02,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-6.png","element":"img","alt":" v(πe)","inline":true},{"text":", in this section we show that the DR estimator (without assumptions about the horizon) is an unbiased estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-7.png","element":"img","alt":"v(πe)","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 9 ","element":"span"},{"text":"(DR – unbiased estimator)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumption ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds, then ","element":"span"},{"style":{"height":15.6},"width":313.09,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-8.png","element":"img","alt":" E[DR(D)] = v(πe)","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"This result was shown previously for the known finite horizon setting (","element":"span"},{"text":"Jiang & Li","element":"span"},{"text":", ","element":"span"},{"text":"2015","element":"span"},{"text":"), but has not been shown before for the other settings. 
Because we will use some steps of this proof in later proofs, the majority of this proof is relegated to a lemma.","element":"span"}],[{"style":{"width":"57%"},"width":525,"height":306,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-9.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"comes from Lemma ","element":"span"},{"text":"7","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.3. Conditions for Consistency of DR","element":"span"}],[{"text":"In this section we show that the DR estimator is a strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-10.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"given mild technical assumptions and that there is only one behavior policy (Theorem ","element":"span"},{"text":"10","element":"span"},{"text":") or that the importance weights are bounded (Theorem ","element":"span"},{"text":"11","element":"span"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 10 ","element":"span"},{"text":"(DR – strongly consistent estimator for one behavior policy)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumptions ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"text":"2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hold then ","element":"span"},{"style":{"height":17.59},"width":305.64,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-11.png","element":"img","alt":"DR(D) a.s.−→ v(πe).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"This proof is a relatively straightforward application of the law of large numbers.","element":"span"}],[{"text":"We have from Lemma ","element":"span"},{"text":"7 ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":15.6},"width":335.25,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-12.png","element":"img","alt":" E[DRi(D)] = v(πe)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":15.6},"width":228.99,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-13.png","element":"img","alt":"i ∈ {1, . . . , n}","inline":true},{"text":". By Assumption ","element":"span"},{"style":{"height":15.64},"width":260.33,"height":39.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-14.png","element":"img","alt":" 2, {DRi(D)}ni=1","inline":true,"padRight":true},{"text":"is a set of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"independent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and identically distributed ","element":"span"},{"text":"random variables (since ","element":"span"},{"style":{"height":13.13},"width":133.96,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-15.png","element":"img","alt":" Hi ∼ π1","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", and ","element":"span"},{"style":{"height":15.6},"width":134.35,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-16.png","element":"img","alt":" DRi(D)","inline":true,"padRight":true},{"text":"only depends on ","element":"span"},{"style":{"height":13.13},"width":43.24,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-17.png","element":"img","alt":" Hi","inline":true},{"text":"). 
We can therefore conclude by Khintchine’s strong law of large numbers, Theorem ","element":"span"},{"text":"6","element":"span"},{"text":", that ","element":"span"},{"style":{"height":17.59},"width":294.56,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-18.png","element":"img","alt":" DR(D) a.s.−→ v(πe)","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 11 ","element":"span"},{"text":"(DR – strongly consistent estimator for many behavior policies)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumptions ","element":"span"},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hold then ","element":"span"},{"style":{"height":17.59},"width":305.64,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-19.png","element":"img","alt":"DR(D) a.s.−→ v(πe).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We have from Lemma ","element":"span"},{"text":"7 ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":15.6},"width":332.46,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-20.png","element":"img","alt":" E[DRi(D)] = v(πe)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":15.6},"width":245.94,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-21.png","element":"img","alt":" i ∈ {1, . . . , n}","inline":true},{"text":". 
However, ","element":"span"},{"style":{"height":15.64},"width":222.98,"height":39.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-22.png","element":"img","alt":" {DRi(D)}ni=1","inline":true,"padRight":true},{"text":"is a set ","element":"span"},{"text":"of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"independent but not necessarily identically distributed random variables, so we cannot apply Khintchine’s strong law of large numbers. Instead, we will apply Kolmogorov’s strong law of large numbers, which requires each random variable, ","element":"span"},{"style":{"height":15.6},"width":134.34,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-23.png","element":"img","alt":" DRi(D)","inline":true},{"text":", to be bounded.","element":"span"}],[{"text":"We have that:","element":"span"}],[{"style":{"width":"89%"},"width":814,"height":789,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/20-24.png","element":"img"}],[{"text":"So,","element":"span"}],[{"style":{"width":"56%"},"width":515,"height":173,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-0.png","element":"img"}],[{"text":"since either ","element":"span"},{"style":{"height":11.6},"width":118.8,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-1.png","element":"img","alt":" L < ∞","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":15.6},"width":153.12,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-2.png","element":"img","alt":" γ ∈ [0, 1)","inline":true},{"text":". 
So, ","element":"span"},{"style":{"height":15.6},"width":134.34,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-3.png","element":"img","alt":" DRi(D)","inline":true,"padRight":true},{"text":"is bounded above and below and thus we can apply Kolmogorov’s strong law of large numbers (Corollary ","element":"span"},{"text":"1","element":"span"},{"text":") to conclude that ","element":"span"},{"style":{"height":17.59},"width":294.57,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-4.png","element":"img","alt":"DR(D) a.s.−→ v(πe)","inline":true},{"text":".","element":"span"}]]},{"heading":"C. Weighted Doubly Robust Proofs","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"C.1. Proof of Theorem ","element":"span"},{"text":"1","element":"span"}],[{"text":"In this section we prove Theorem ","element":"span"},{"text":"1","element":"span"},{"text":", which states that ","element":"span"},{"text":"WDR(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"is a strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-5.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"if Assumptions ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", and ","element":"span"},{"text":"3 ","element":"span"},{"text":"hold.","element":"span"}],[{"text":"First, notice that we can rewrite the WDR estimator as:","element":"span"}],[{"style":{"width":"95%"},"width":870,"height":542,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-6.png","element":"img"}],[{"text":"We have from Lemma ","element":"span"},{"text":"12 
","element":"span"},{"text":"that","element":"span"}],[{"style":{"width":"86%"},"width":788,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-7.png","element":"img"}],[{"text":"which has been shown before (","element":"span"},{"text":"Thomas","element":"span"},{"text":", ","element":"span"},{"text":"2015b","element":"span"},{"text":", Theorem 13). Also by Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"we have that","element":"span"}],[{"style":{"width":"87%"},"width":799,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-8.png","element":"img"}],[{"text":"and by Lemma ","element":"span"},{"text":"13 ","element":"span"},{"text":"we have that","element":"span"}],[{"style":{"width":"99%"},"width":910,"height":815,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-9.png","element":"img"}],[{"text":"So, by applying Property ","element":"span"},{"text":"3 ","element":"span"},{"text":"to (","element":"span"},{"text":"18","element":"span"},{"text":"), (","element":"span"},{"text":"19","element":"span"},{"text":"), and (","element":"span"},{"text":"20","element":"span"},{"text":") we have that ","element":"span"},{"style":{"height":17.6},"width":334.42,"height":43.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-10.png","element":"img","alt":" WDR(D) a.s.−→ v(πe)","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.2. 
Proof of Theorem ","element":"span"},{"text":"2","element":"span"}],[{"text":"In this section we prove Theorem ","element":"span"},{"text":"2","element":"span"},{"text":", which states that if Assumptions ","element":"span"},{"text":"1 ","element":"span"},{"text":"and ","element":"span"},{"text":"4 ","element":"span"},{"text":"hold then","element":"span"}],[{"style":{"width":"37%"},"width":342,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-11.png","element":"img"}],[{"text":"Recall that WDR can be defined as in (","element":"span"},{"text":"17","element":"span"},{"text":"). ","element":"span"},{"text":"First we apply Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"to the ","element":"span"},{"text":"CWPDIS(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") ","element":"span"},{"text":"term, which uses ","element":"span"},{"style":{"height":18.82},"width":276.18,"height":47.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-12.png","element":"img","alt":"ft(Ht+1i ) = RHit","inline":true,"padRight":true},{"text":", which is bounded since ","element":"span"},{"style":{"height":18.49},"width":215.52,"height":46.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-13.png","element":"img","alt":" |RHit | ≤ r⋆max","inline":true},{"text":". 
","element":"span"},{"text":"The result of Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"is that","element":"span"}],[{"style":{"width":"86%"},"width":788,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-14.png","element":"img"}],[{"text":"Next we apply Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"to the ","element":"span"},{"style":{"height":13.13},"width":51.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-15.png","element":"img","alt":" Xn","inline":true,"padRight":true},{"text":"term, which uses ","element":"span"},{"style":{"height":28.4},"width":460.86,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-16.png","element":"img","alt":"ft(Ht+1i ) = ˆqπe(SHit , AHit)","inline":true},{"text":", which is bounded since","element":"span"}],[{"style":{"width":"71%"},"width":651,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-17.png","element":"img"}],[{"text":"The result of applying Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"to ","element":"span"},{"style":{"height":13.13},"width":51.13,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-18.png","element":"img","alt":" Xn","inline":true,"padRight":true},{"text":"is that","element":"span"}],[{"style":{"width":"87%"},"width":799,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-19.png","element":"img"}],[{"text":"Lastly, we apply Lemma ","element":"span"},{"text":"13 ","element":"span"},{"text":"to the ","element":"span"},{"style":{"height":13.12},"width":41.52,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-20.png","element":"img","alt":" Yn","inline":true,"padRight":true},{"text":"term, which uses 
","element":"span"},{"style":{"height":28.4},"width":337.14,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-21.png","element":"img","alt":"ft(Hti ) = ˆvπe(SHit)","inline":true},{"text":", which is bounded since","element":"span"}],[{"style":{"width":"62%"},"width":567,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/21-22.png","element":"img"}],[{"text":"The result of applying Lemma ","element":"span"},{"text":"13 ","element":"span"},{"text":"to ","element":"span"},{"style":{"height":13.13},"width":41.51,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-0.png","element":"img","alt":" Yn","inline":true,"padRight":true},{"text":"is that","element":"span"}],[{"style":{"width":"84%"},"width":769,"height":213,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"comes from the same derivation that was used in (","element":"span"},{"text":"20","element":"span"},{"text":"). So, by applying Property ","element":"span"},{"text":"3 ","element":"span"},{"text":"to (","element":"span"},{"text":"21","element":"span"},{"text":"), (","element":"span"},{"text":"22","element":"span"},{"text":"), and (","element":"span"},{"text":"23","element":"span"},{"text":") we have that ","element":"span"},{"style":{"height":17.59},"width":334.42,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-2.png","element":"img","alt":" WDR(D) a.s.−→ v(πe)","inline":true},{"text":".","element":"span"}]]},{"heading":"D. Extended Empirical Studies (WDR)","paragraphs":[[{"text":"In this section we provide a detailed description of our experiments comparing the WDR estimator to various importance sampling estimators (IS, PDIS, WIS, CWPDIS), as well as DR and AM. 
We performed experiments using three domains: ModelFail, ModelWin, and a gridworld. We will describe each domain, then describe the experimental setup, and then present empirical results. All three domains have a finite horizon and use ","element":"span"},{"style":{"height":13.6},"width":124.1,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-3.png","element":"img","alt":" γ = 1.0","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.1. The ModelFail Domain","element":"span"}],[{"text":"The ModelFail domain was constructed so that the model would fail to converge to the true MDP. One way that this can happen is if the model uses function approximation, so that it cannot represent the true MDP. Another way that this can happen is if there is some partial observability, which is common in real applications. We therefore construct a domain where the true underlying MDP has three states (plus the terminal absorbing state), but where the agent cannot tell the difference between any of the states.","element":"span"}],[{"text":"The MDP used by ModelFail is depicted in Figure ","element":"span"},{"text":"3","element":"span"},{"text":". Although the MDP has three states (denoted by circles) plus the terminal absorbing state (denoted by the double-circle), the agent does not observe which state it is in; it only sees a single state. The agent begins in the left-most state, where it has two actions available. The first action always takes it to the upper state, while the second always takes it to the lower state. In both cases, the agent receives no reward.","element":"span"}],[{"text":"At time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1","element":"span"},{"text":", the agent is always in the upper or lower state (although it cannot tell the difference between them and the initial state), and it must select between two possible actions. 
Both actions always have the same effect: the agent transitions to the terminal absorbing state. However, if the agent was in the upper state, ","element":"span"},{"style":{"height":13.13},"width":118.56,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-4.png","element":"img","alt":" R1 = 1","inline":true},{"text":", while ","element":"span"},{"style":{"height":13.13},"width":148.71,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-5.png","element":"img","alt":" R1 = −1","inline":true,"padRight":true},{"text":"if the agent was in the lower state. The horizon is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 2 ","element":"span"},{"text":"since ","element":"span"},{"style":{"height":17.31},"width":123.88,"height":43.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-6.png","element":"img","alt":" S2 =∞s","inline":true,"padRight":true},{"text":"always.","element":"span"}],[{"text":"The behavior policy selects ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-7.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"with probability approximately ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"88 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-8.png","element":"img","alt":" a2","inline":true,"padRight":true},{"text":"with probability approximately ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"12 ","element":"span"},{"text":"(these probabilities were chosen arbitrarily by using weights of ","element":"span"},{"text":"1 
","element":"span"},{"text":"and ","element":"span"},{"style":{"height":10.4},"width":50.16,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-9.png","element":"img","alt":" −1","inline":true,"padRight":true},{"text":"with softmax action selection, and were not optimized). The evaluation policy does the opposite: it selects ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-11.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"with probability approximately ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"12 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-12.png","element":"img","alt":" a2","inline":true,"padRight":true},{"text":"with probability approximately ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"88","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"54%"},"width":493,"height":229,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-10.png","element":"img"}],[{"text":"Figure 3: ModelFail MDP.","element":"figcaption","subtype":"caption"}],[{"text":"Consider what happens when we try to model this MDP based on the observations produced by running the behavior policy for an infinite number of trajectories (without trying to infer anything about the true underlying structure of the MDP). Recall that we observe only a single state. First consider the transition dynamics: half of the time either action causes a transition back to the single state, while half of the time the agent transitions to the absorbing state. 
Next consider the rewards: half of the time the agent receives no reward, with probability ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"88","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2 ","element":"span"},{"text":"it receives a reward of ","element":"span"},{"text":"1","element":"span"},{"text":", and with probability ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"12","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"2 ","element":"span"},{"text":"it receives a reward of ","element":"span"},{"style":{"height":10.4},"width":50.16,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-13.png","element":"img","alt":" −1","inline":true},{"text":", and these rewards appear completely uncorrelated with the action that was selected (since non-zero rewards occur at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":13.53},"width":44.09,"height":33.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-14.png","element":"img","alt":" A1","inline":true,"padRight":true},{"text":"has no bearing on rewards or state transitions). 
So, from the model’s point of view, the actions have no impact on state transitions or rewards, and so every policy is equally good and will produce an expected return of ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"38","element":"span"},{"text":", while in reality an optimal policy will produce an expected return of ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5 ","element":"span"},{"text":"and a pessimal policy will produce an expected return of ","element":"span"},{"style":{"height":10.8},"width":80.32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-15.png","element":"img","alt":" −0.5","inline":true},{"text":".","element":"span"}],[{"text":"We provided the model with the true horizon, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 2","element":"span"},{"text":", so that its predictions of ","element":"span"},{"style":{"height":13.13},"width":41.45,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-16.png","element":"img","alt":" Rt","inline":true,"padRight":true},{"text":"are zero for ","element":"span"},{"style":{"height":12.8},"width":85.71,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-17.png","element":"img","alt":" t ≥ 2","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.2. The ModelWin Domain","element":"span"}],[{"text":"This domain was constructed so that the approximate model of the MDP would quickly converge to the true MDP, while importance sampling based approaches like DR and WDR would continue to have high variance. 
Recall from our discussion in Section ","element":"span"},{"text":"6 ","element":"span"},{"text":"that DR and WDR will be equal to a simple model-based approach if the approximate MDP is perfect and state transitions and rewards are deterministic. To avoid this, the ModelWin domain has stochastic state transitions that cause the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"term in (","element":"span"},{"text":"3","element":"span"},{"text":") to not necessarily be zero.","element":"span"}],[{"text":"The ModelWin MDP is depicted in Figure ","element":"span"},{"text":"4","element":"span"},{"text":". Unlike in the ModelFail domain, the agent observes the true underlying states of the ModelWin MDP, of which there are three, plus a terminal absorbing state (not pictured). The agent always begins in ","element":"span"},{"style":{"height":9.12},"width":33.18,"height":22.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/22-18.png","element":"img","alt":" s1","inline":true},{"text":", where it must select between two actions. 
The first action, ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-0.png","element":"img","alt":" a1","inline":true},{"text":", causes the agent to transition to ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-1.png","element":"img","alt":" s2","inline":true,"padRight":true},{"text":"with probability ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"4 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-2.png","element":"img","alt":" s3","inline":true,"padRight":true},{"text":"with probability ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"6","element":"span"},{"text":". 
The second action, ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-3.png","element":"img","alt":" a2","inline":true},{"text":", does the opposite: the agent transitions to ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-4.png","element":"img","alt":" s2","inline":true,"padRight":true},{"text":"with probability ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"6 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-5.png","element":"img","alt":" s3","inline":true,"padRight":true},{"text":"with probability ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"4","element":"span"},{"text":". If the agent transitions to ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-6.png","element":"img","alt":" s2","inline":true},{"text":", then it receives a reward of ","element":"span"},{"text":"1","element":"span"},{"text":", and if it transitions to ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-7.png","element":"img","alt":" s3","inline":true,"padRight":true},{"text":"it receives a reward of ","element":"span"},{"style":{"height":10.4},"width":50.16,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-8.png","element":"img","alt":" −1","inline":true},{"text":". 
In states ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-9.png","element":"img","alt":" s2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-10.png","element":"img","alt":"s3","inline":true},{"text":", the agent has two possible actions, but both always produce a reward of zero and a deterministic transition back to ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-11.png","element":"img","alt":"s1","inline":true},{"text":". The horizon is set to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 20","element":"span"},{"text":", so, ","element":"span"},{"style":{"height":17.31},"width":139.34,"height":43.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-12.png","element":"img","alt":" S20 =∞s","inline":true,"padRight":true},{"text":"always.","element":"span"},{"text":"12","element":"span"}],[{"style":{"width":"46%"},"width":425,"height":332,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-13.png","element":"img"}],[{"text":"Figure 4: ModelWin MDP.","element":"figcaption","subtype":"caption"}],[{"text":"To see why DR and WDR struggle on this domain, consider what happens if the approximate model is perfect and the agent takes action ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-14.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"in state ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-15.png","element":"img","alt":" 
s1","inline":true},{"text":". In our discussion of (","element":"span"},{"text":"3","element":"span"},{"text":") we concluded that DR and WDR will perform well if ","element":"span"},{"style":{"height":15.6},"width":484.95,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-16.png","element":"img","alt":"R1 = qπe(s1, a1) − γˆvπe(S′)","inline":true},{"text":", where ","element":"span"},{"style":{"height":10.8},"width":40.01,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-17.png","element":"img","alt":" S′","inline":true,"padRight":true},{"text":"is the state that the agent transitions to after taking action ","element":"span"},{"style":{"height":9.12},"width":35.5,"height":22.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-18.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"in state ","element":"span"},{"style":{"height":9.12},"width":33.18,"height":22.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-19.png","element":"img","alt":" s1","inline":true},{"text":", which is a random variable. Consider the two values that the right side can take, depending on whether ","element":"span"},{"style":{"height":13.13},"width":144.4,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-20.png","element":"img","alt":" S′ = s2","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":13.13},"width":141.19,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-21.png","element":"img","alt":" S′ = s3","inline":true},{"text":". 
It can be either ","element":"span"},{"style":{"height":15.6},"width":373.09,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-22.png","element":"img","alt":" ˆqπe(s1, a1) − γˆvπe(s2)","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":15.6},"width":370.29,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-23.png","element":"img","alt":"ˆqπe(s1, a1) − γˆvπe(s3)","inline":true},{"text":". Since ","element":"span"},{"style":{"height":15.6},"width":306.16,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-24.png","element":"img","alt":" ˆvπe(s2) = ˆvπe(s3)","inline":true},{"text":", these two statements are equal—the prediction of ","element":"span"},{"style":{"height":13.13},"width":44.45,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-25.png","element":"img","alt":" R1","inline":true,"padRight":true},{"text":"will be the same regardless of whether the agent transitions to ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-26.png","element":"img","alt":" s2","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-27.png","element":"img","alt":"s3","inline":true},{"text":", and so its prediction must sometimes be wrong (since the rewards differ depending on whether the agent transitions to ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-28.png","element":"img","alt":" s2","inline":true,"padRight":true},{"text":"or 
","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-29.png","element":"img","alt":" s3","inline":true},{"text":"). So, term ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"in (","element":"span"},{"text":"3","element":"span"},{"text":") will not be zero—the control variate used by DR and WDR does not perfectly cancel with the PDIS (or CWPDIS) term. If ","element":"span"},{"style":{"height":16.66},"width":39.8,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-30.png","element":"img","alt":" wit","inline":true,"padRight":true},{"text":"is large, ","element":"span"},{"text":"then this will produce high variance. In order to make ","element":"span"},{"style":{"height":16.67},"width":39.81,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-31.png","element":"img","alt":" wit","inline":true,"padRight":true},{"text":"large, we need only make the horizon long and the behavior and evaluation policies dissimilar.","element":"span"}],[{"text":"The behavior and evaluation policies both select actions uniformly randomly in states ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-32.png","element":"img","alt":" s2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-33.png","element":"img","alt":" s3","inline":true},{"text":". 
However, in ","element":"span"},{"style":{"height":9.13},"width":33.18,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-34.png","element":"img","alt":" s1","inline":true,"padRight":true},{"text":"the behavior policy takes action ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-35.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"with probability approximately ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"73 ","element":"span"},{"text":"and action ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-36.png","element":"img","alt":" a2","inline":true,"padRight":true},{"text":"with probability approximately ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"27","element":"span"},{"text":", while the evaluation policy does the opposite—it takes action ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-37.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"with probability approximately ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"27 ","element":"span"},{"text":"and action ","element":"span"},{"style":{"height":9.13},"width":35.5,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-38.png","element":"img","alt":" a2","inline":true,"padRight":true},{"text":"with probability approximately ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"73 ","element":"span"},{"text":"(these probabilities come from 
using softmax action selection with weights of ","element":"span"},{"text":"1 ","element":"span"},{"text":"and ","element":"span"},{"text":"0","element":"span"},{"text":").","element":"span"}],[{"text":"As in the ModelFail domain, for the ModelWin domain we provided the approximate model with the true horizon of the MDP, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 20","element":"span"},{"text":", so that its predictions of ","element":"span"},{"style":{"height":13.13},"width":41.44,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-39.png","element":"img","alt":" Rt","inline":true,"padRight":true},{"text":"were zero for ","element":"span"},{"style":{"height":12.8},"width":105.1,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-40.png","element":"img","alt":" t ≥ 20","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.3. The Gridworld Domain","element":"span"}],[{"text":"The third domain that we used was the gridworld domain developed by ","element":"span"},{"text":"Thomas ","element":"span"},{"text":"(","element":"span"},{"text":"2015b","element":"span"},{"text":", Section 2.5) for evaluating OPE algorithms. It is a ","element":"span"},{"style":{"height":10.4},"width":88.42,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-41.png","element":"img","alt":" 4 × 4","inline":true,"padRight":true},{"text":"gridworld with four actions, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 100","element":"span"},{"text":", and deterministic transition and reward functions. 
","element":"span"},{"text":"Thomas ","element":"span"},{"text":"(","element":"span"},{"text":"2015b","element":"span"},{"text":") proposed five policies, ","element":"span"},{"style":{"height":10},"width":162.79,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/23-42.png","element":"img","alt":" π1, . . . , π5","inline":true},{"text":", that can serve as the behavior and evaluation policies.","element":"span"}],[{"text":"Although this setup was developed for evaluating OPE methods, it was not developed with DR and WDR in mind (since they were introduced later). Specifically, its use of deterministic state-transition and reward functions means that when the model is accurate, AM, DR, and WDR will all perform similarly (due to the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"term in (","element":"span"},{"text":"3","element":"span"},{"text":") being near-zero).","element":"span"}],[{"text":"We therefore performed experiments with two variants of this gridworld. In the first variant the approximate model was provided with the true horizon, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 100","element":"span"},{"text":". However, in the second variant we introduced some partial observability by providing the model with the incorrect horizon: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 101","element":"span"},{"text":". This has a significant impact on value predictions close to the end of a trajectory because the model incorrectly predicts when the rewards will necessarily be zero. 
We write ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Gridworld-TH ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Gridworld-FH ","element":"span"},{"text":"to denote the gridworld where the approximate model is provided with the true horizon and false horizon, respectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.4. Experimental Setup","element":"span"}],[{"text":"For each domain we generated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"trajectories (for various ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") and computed the sample mean squared error between the predictions of the various OPE methods and the true performance of the evaluation policy (estimated using a large number of on-policy Monte-Carlo rollouts). For each value of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and each OPE algorithm, we performed this experiment ","element":"span"},{"text":"128 ","element":"span"},{"text":"times and report the average sample mean squared error over these ","element":"span"},{"text":"128 ","element":"span"},{"text":"trials. All plots include standard error bars and use logarithmic scales for both the horizontal and vertical axes.","element":"span"}],[{"text":"Perhaps surprisingly, it is not obvious how to fairly compare the different OPE algorithms. Clearly IS, PDIS, WIS, and CWPDIS should use all of the trajectories in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", since they do not require an approximate model. Similarly, AM should use all of the data to construct an approximate model. However, how should the available data be split for DR, WDR, and the MAGIC estimators? We believe that there are at least three reasonable answers:","element":"span"}],[{"text":"1. 
DR, WDR, and MAGIC should be provided with additional trajectories not available to IS, PDIS, WIS, and CWPDIS, and these trajectories should be used to construct an approximate model. This setup would emulate the setting where prior domain knowledge (not necessarily trajectories) can be used to construct an approximate model, which IS, PDIS, WIS, and CWPDIS ignore.","element":"span"}],[{"text":"2. DR, WDR, and MAGIC should use all of the available data, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", to construct an approximate model. They should then reuse this same data to compute their estimates. This approach is reasonable, but the reuse of data invalidates our theoretical guarantees. Still, empirically we find that this approach causes DR, WDR, and MAGIC to perform at their best.","element":"span"}],[{"text":"3. DR, WDR, and MAGIC should partition ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"into two sets. The first set should be used to construct the approximate model, and the second set should be used to compute the DR, WDR, and MAGIC estimates using the approximate model.","element":"span"}],[{"text":"Since there is not necessarily a “correct” answer to which way of performing experiments is best, we show our results using both the second and third approach. For each domain, the “full-data” variant uses the second approach while the “half-data” variant uses the third approach, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"is partitioned into two sets of equal size.","element":"span"}],[{"text":"Since all of the domains that we use have finite state and action sets, we use a simple maximum-likelihood approximate model. 
That is, we predict that the probability of transitioning from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"to ","element":"span"},{"style":{"height":6.8},"width":32.18,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/24-0.png","element":"img","alt":" s′","inline":true,"padRight":true},{"text":"given action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is the number of times this transition was observed divided by the number of times action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"was taken in state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"contains no examples of action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"being taken in state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", then we assume that taking action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"in state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"always causes a transition to the terminal absorbing state.","element":"span"}],[{"text":"In this appendix, we present empirical results from four previous importance sampling methods, definitions of which can be found in the work of ","element":"span"},{"text":"Thomas ","element":"span"},{"text":"(","element":"span"},{"text":"2015b","element":"span"},{"text":", Chapter 3): ","element":"span"},{"style":{"fontStyle":"italic"},"text":"importance sampling ","element":"span"},{"text":"(IS), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"per-decision importance sampling ","element":"span"},{"text":"(PDIS), 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"weighted importance sampling ","element":"span"},{"text":"(WIS), and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"consistent weighted per-decision importance sampling ","element":"span"},{"text":"(CWPDIS). We also show results for the guided importance sampling methods DR and WDR and the purely model-based method, AM. The legend used by all of the plots in this appendix is provided in Figure ","element":"span"},{"text":"5","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"88%"},"width":810,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/24-1.png","element":"img"}],[{"text":"Figure 5: The legend used by all plots in Appendix ","element":"figcaption","subtype":"caption"},{"text":"D","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"D.5. ModelFail Results","element":"span"}],[{"text":"Figure ","element":"span"},{"text":"1b ","element":"span"},{"text":"in Section ","element":"span"},{"text":"6 ","element":"span"},{"text":"depicts the result on the ModelFail domain in the full-data setting. We reproduce this plot in Figure ","element":"span"},{"text":"6","element":"span"},{"text":". Here the weighted importance sampling methods, WIS and CWPDIS, are obscured by the curve for WDR, while the unweighted importance sampling methods, IS and PDIS, are obscured by the curve for DR. Notice that WDR outperforms AM by orders of magnitude and DR by approximately an order of magnitude. Also notice that even though the approximate model is not accurate, which means that the control variates used by DR and WDR may be poor, the DR and WDR estimators do not perform worse than PDIS and CWPDIS, respectively.","element":"span"}],[{"text":"In Figure ","element":"span"},{"text":"7 ","element":"span"},{"text":"we reproduce this experiment in the half-data setting. 
Since AM does not use any data for importance sampling, it is identical in both settings (half-data and full-data). Similarly, IS, PDIS, WIS, and CWPDIS do not use an approximate model, so they always use all of the data and are therefore also identical in both settings. However, DR and WDR are not the same: in the half-data setting they use half of the data to construct the approximate model and the other half to compute their estimates. This means that, for DR and WDR, the approximate model tends to be worse, and the importance sampling estimate also tends to be worse. As a result, the DR and WDR curves are shifted up slightly. Still, the same general trends are evident: WDR outperforms AM by orders of magnitude and DR by approximately an order of magnitude.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.6. ModelWin Results","element":"span"}],[{"text":"Figure ","element":"span"},{"text":"1c ","element":"span"},{"text":"in Section ","element":"span"},{"text":"6 ","element":"span"},{"text":"depicts the result of running importance sampling and guided importance sampling methods as well as the approximate model estimator on the ModelWin experimental setup in the full-data setting. We reproduce this plot in Figure ","element":"span"},{"text":"8","element":"span"},{"text":". Here AM has approximately an order of magnitude lower MSE than all of the other methods, including WDR; this result was our motivation for combining AM and WDR using BIM.","element":"span"}],[{"text":"In Figure ","element":"span"},{"text":"9 ","element":"span"},{"text":"we reproduce this experiment in the half-data setting. As with the ModelFail setup, this only hurts DR and WDR. When there are few trajectories, it appears to impact DR more than WDR, although this may be due to noise (notice the large standard error bars on the DR curve when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"is small).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.7. 
Gridworld Results","element":"span"}],[{"text":"Figure ","element":"span"},{"text":"1a ","element":"span"},{"text":"in Section ","element":"span"},{"text":"6 ","element":"span"},{"text":"depicts the results of using the fourth gridworld policy, ","element":"span"},{"style":{"height":9.13},"width":37.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/24-2.png","element":"img","alt":" π4","inline":true},{"text":", as the behavior policy and the fifth, ","element":"span"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/24-3.png","element":"img","alt":"π5","inline":true},{"text":", as the evaluation policy for the Gridworld-FH domain in the full-data setting. We reproduce it in Figure ","element":"span"},{"text":"10","element":"span"},{"text":". Notice that WDR outperforms all other methods by at least an order of magnitude.","element":"span"}],[{"style":{"width":"98%"},"width":900,"height":501,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-0.png","element":"img"}],[{"text":"Figure 6: ModelFail, full-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"94%"},"width":865,"height":482,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-1.png","element":"img"}],[{"text":"Figure 7: ModelFail, half-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":897,"height":499,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-2.png","element":"img"}],[{"text":"Figure 8: ModelWin, full-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"96%"},"width":877,"height":488,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-3.png","element":"img"}],[{"text":"Figure 9: ModelWin, half-data.","element":"figcaption","subtype":"caption"}],[{"text":"In Figure ","element":"span"},{"text":"11 ","element":"span"},{"text":"we reproduce 
this experiment in the half-data setting. As before there is little change, except that the DR and WDR curves shift up. WDR remains the best-performing estimator, by approximately an order of magnitude.","element":"span"}],[{"text":"Next we reproduced Figures ","element":"span"},{"text":"10 ","element":"span"},{"text":"and ","element":"span"},{"text":"11 ","element":"span"},{"text":"for Gridworld-TH as opposed to Gridworld-FH. The results are in Figures ","element":"span"},{"text":"12 ","element":"span"},{"text":"and ","element":"span"},{"text":"13 ","element":"span"},{"text":"respectively. Notice that, when given the true horizon, AM excels. In the full-data setting DR and WDR both lie directly on top of the curve for AM. This makes sense because the transition function and reward function are deterministic, and so, given the way that we constructed our approximate model, both methods degenerate to exactly AM. In the half-data setting DR and WDR lag slightly behind the curve for AM since they can only use half as much data.","element":"span"}],[{"text":"Next we reproduced these four figures using the first gridworld policy, ","element":"span"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-4.png","element":"img","alt":" π1","inline":true},{"text":", as the behavior policy and the second, ","element":"span"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-5.png","element":"img","alt":" π2","inline":true},{"text":", as the evaluation policy. 
Whereas ","element":"span"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-6.png","element":"img","alt":" π4","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-7.png","element":"img","alt":" π5","inline":true,"padRight":true},{"text":"are nearly deterministic and produce long trajectories, ","element":"span"},{"style":{"height":9.13},"width":37.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-8.png","element":"img","alt":" π1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-9.png","element":"img","alt":" π2","inline":true,"padRight":true},{"text":"are far from deterministic and tend to produce shorter trajectories. Notably, the behavior policy, ","element":"span"},{"style":{"height":9.13},"width":37.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/25-10.png","element":"img","alt":" π1","inline":true},{"text":", selects actions uniformly randomly, and so this presents a very different setting for OPE. The results are provided in Figures ","element":"span"},{"text":"14","element":"span"},{"text":"–","element":"span"},{"text":"17","element":"span"},{"text":". In this example, DR and WDR perform similarly—significantly better than the importance sampling algorithms IS, PDIS, WIS, and CWPDIS, and marginally better than AM given enough data. Also, when the true horizon is provided to the model, DR and WDR again degenerate to AM.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.8. 
Summary","element":"span"}],[{"text":"The key takeaways from these experiments are that WDR tends to outperform the other importance sampling estimators, IS, PDIS, WIS, and CWPDIS, as well as the guided importance sampling method, DR. None of these methods achieved mean squared errors within an order of magnitude of WDR’s across all of our experiments. This shows the power of WDR as a guided importance sampling method.","element":"span"}],[{"text":"However, WDR did not always win: in the ModelWin setting, AM outperformed WDR by approximately an order of magnitude. Similar results have been observed by others. For example, in the experiments of ","element":"span"},{"text":"Jiang & Li ","element":"span"},{"text":"(","element":"span"},{"text":"2015","element":"span"},{"text":"), AM tended to outperform DR (although they did not compare to WDR, since it had not yet been introduced). This motivated our introduction of the BIM estimator as a way to blend together WDR and AM.","element":"span"}],[{"text":"Notice that, if the transition function and reward function are deterministic and there is no partial observability (as in the gridworld experiments using the true horizon), then, given the way that we constructed our approximate model, DR and WDR degenerate to AM. 
This degeneration (which","element":"span"}],[{"style":{"width":"95%"},"width":871,"height":479,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-0.png","element":"img"}],[{"text":"Figure 10: Gridworld-FH, full-data, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-1.png","element":"img","alt":" π4","inline":true,"padRight":true},{"text":"behavior policy, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-2.png","element":"img","alt":" π5","inline":true,"padRight":true},{"text":"evaluation policy.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"97%"},"width":888,"height":488,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-3.png","element":"img"}],[{"text":"Figure 11: Gridworld-FH, half-data, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-4.png","element":"img","alt":" π4","inline":true,"padRight":true},{"text":"behavior policy, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-5.png","element":"img","alt":" π5","inline":true,"padRight":true},{"text":"evaluation policy.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"96%"},"width":877,"height":482,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-6.png","element":"img"}],[{"text":"Figure 12: Gridworld-TH, full-data, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-7.png","element":"img","alt":" 
π4","inline":true,"padRight":true},{"text":"behavior policy, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-8.png","element":"img","alt":" π5","inline":true,"padRight":true},{"text":"evaluation policy.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"96%"},"width":882,"height":484,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-9.png","element":"img"}],[{"text":"Figure 13: Gridworld-TH, half-data, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-10.png","element":"img","alt":" π4","inline":true,"padRight":true},{"text":"behavior policy, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-11.png","element":"img","alt":" π5","inline":true,"padRight":true},{"text":"evaluation policy.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"97%"},"width":891,"height":489,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-12.png","element":"img"}],[{"text":"Figure 14: Gridworld-FH, full-data, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-13.png","element":"img","alt":" π1","inline":true,"padRight":true},{"text":"behavior policy, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-14.png","element":"img","alt":" π2","inline":true,"padRight":true},{"text":"evaluation 
policy.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"97%"},"width":889,"height":489,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-15.png","element":"img"}],[{"text":"Figure 15: Gridworld-FH, half-data, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.11,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-16.png","element":"img","alt":" π1","inline":true,"padRight":true},{"text":"behavior policy, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/26-17.png","element":"img","alt":" π2","inline":true,"padRight":true},{"text":"evaluation policy.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"96%"},"width":882,"height":485,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-0.png","element":"img"}],[{"text":"Figure 16: Gridworld-TH, full-data, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-1.png","element":"img","alt":" π1","inline":true,"padRight":true},{"text":"behavior policy, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-2.png","element":"img","alt":" π2","inline":true,"padRight":true},{"text":"evaluation policy.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"97%"},"width":884,"height":486,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-3.png","element":"img"}],[{"text":"Figure 17: Gridworld-TH, half-data, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-4.png","element":"img","alt":" 
π1","inline":true,"padRight":true},{"text":"behavior policy, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.13},"width":37.1,"height":22.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-5.png","element":"img","alt":" π2","inline":true,"padRight":true},{"text":"evaluation policy.","element":"figcaption","subtype":"caption"}],[{"text":"is not bad, but suggests that importance sampling methods are not necessary) would also not occur if the approximate model used function approximation.","element":"span"}],[{"text":"Lastly, notice that DR and WDR performed better in the full-data setting than in the half-data setting. This suggests that, in practice, one should use all of the available data both to produce an approximate model and to compute the DR and WDR estimates. Although doing so violates the assumptions used by our theoretical guarantees, that violation does not by itself imply that MAGIC, for example, fails to be a strongly consistent estimator for the application at hand.","element":"span"}]]},{"heading":"E. 
Consistency of BIM","paragraphs":[[{"text":"In this appendix we prove Theorem ","element":"span"},{"text":"3","element":"span"},{"text":", which states that if Assumption ","element":"span"},{"text":"4 ","element":"span"},{"text":"holds, there exists at least one ","element":"span"},{"style":{"height":14},"width":121.43,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-6.png","element":"img","alt":" j ∈ J","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.63},"width":123.73,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-7.png","element":"img","alt":" g(j)(D)","inline":true,"padRight":true},{"text":"is a strongly consistent estimator of ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-8.png","element":"img","alt":"v(πe)","inline":true},{"text":", and ","element":"span"},{"style":{"height":15.92},"width":266.93,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-9.png","element":"img","alt":"b̂n − bn a.s.−→ 0","inline":true},{"text":", and ","element":"span"},{"style":{"height":15.92},"width":273.38,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-10.png","element":"img","alt":"Ω̂n − Ωn a.s.−→ 0","inline":true},{"text":", then ","element":"span"},{"style":{"height":17.59},"width":453.97,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-11.png","element":"img","alt":"BIM(D, Ω̂n, b̂n) a.s.−→ v(πe).","inline":true}],[{"text":"We begin by showing that BIM converges almost surely to ","element":"span"},{"style":{"height":15.6},"width":89.02,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-12.png","element":"img","alt":" v(πe)","inline":true,"padRight":true},{"text":"if it were to use the true 
","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-13.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-14.png","element":"img","alt":" bn","inline":true},{"text":", rather than estimates thereof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":14},"width":146.28,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-15.png","element":"img","alt":" j⋆ ∈ J","inline":true,"padRight":true},{"text":"be an index such that ","element":"span"},{"style":{"height":17.63},"width":339.39,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-16.png","element":"img","alt":"g(j⋆)(D) a.s.−→ v(πe)","inline":true},{"text":", which exists by assumption. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16.83},"width":151.43,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-17.png","element":"img","alt":"y ∈ ∆|J |","inline":true,"padRight":true},{"text":"be the weight vector that places a weight of one on ","element":"span"},{"style":{"height":17.63},"width":139.95,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-18.png","element":"img","alt":" g(j⋆)(D)","inline":true,"padRight":true},{"text":"and a weight of zero on the other returns, such that ","element":"span"},{"style":{"height":17.63},"width":543.78,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-19.png","element":"img","alt":" y⊺gJ (D) = g(j⋆)(D) a.s.−→ v(πe)","inline":true},{"text":". 
So, by Lemma ","element":"span"},{"text":"3 ","element":"span"},{"text":"(which requires that ","element":"span"},{"style":{"height":17.63},"width":123.73,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-20.png","element":"img","alt":" g(j)(D)","inline":true,"padRight":true},{"text":"is uniformly bounded for all ","element":"span"},{"style":{"height":14},"width":103.48,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-21.png","element":"img","alt":"j ∈ J","inline":true,"padRight":true},{"text":", which holds by Assumption ","element":"span"},{"text":"4 ","element":"span"},{"text":"and the fact that rewards and reward predictions are bounded), we have that ","element":"span"},{"style":{"height":15.6},"width":564.86,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-22.png","element":"img","alt":"limn→∞ MSE(y⊺g(D), v(πe)) = 0","inline":true},{"text":".","element":"span"}],[{"text":"Recall that ","element":"span"},{"style":{"height":15.6},"width":269.76,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-23.png","element":"img","alt":" BIM(D, Ωn, bn)","inline":true,"padRight":true},{"text":"uses the weight vector, ","element":"span"},{"style":{"height":6.8},"width":37.54,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-24.png","element":"img","alt":" x⋆","inline":true,"padRight":true},{"text":"that minimizes the MSE:","element":"span"}],[{"style":{"width":"73%"},"width":670,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-25.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16.83},"width":148.36,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-26.png","element":"img","alt":" y ∈ ∆|J |","inline":true},{"text":", we have that for all 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"}],[{"style":{"width":"97%"},"width":892,"height":215,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-27.png","element":"img"}],[{"text":"and since MSE is always greater than or equal to zero, we can replace the ","element":"span"},{"style":{"height":12.4},"width":30,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-28.png","element":"img","alt":" ≤","inline":true,"padRight":true},{"text":"above with an equality. ","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":15.6},"width":526.36,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-29.png","element":"img","alt":"(x⋆)⊺gJ (D) = BIM(D, Ωn, bn)","inline":true,"padRight":true},{"text":"this can be rewritten as","element":"span"}],[{"style":{"width":"71%"},"width":655,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-30.png","element":"img"}],[{"text":"By ","element":"span"},{"text":"Lemma ","element":"span"},{"text":"3 ","element":"span"},{"text":"we ","element":"span"},{"text":"have ","element":"span"},{"text":"that ","element":"span"},{"text":"this ","element":"span"},{"text":"implies ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":17.59},"width":436.74,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/27-31.png","element":"img","alt":"BIM(D, Ωnbn) a.s.−→ v(πe).","inline":true}],[{"text":"So far we have shown that BIM, when using the true covariance matrix and bias vector, converges almost surely 
to ","element":"span"},{"style":{"height":15.6},"width":89.03,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-1.png","element":"img","alt":"v(πe)","inline":true},{"text":". By Lemma ","element":"span"},{"text":"5 ","element":"span"},{"text":"we can therefore conclude that if ","element":"span"},{"style":{"height":13.13},"width":80,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-2.png","element":"img","alt":"b̂n −","inline":true},{"style":{"height":15.92},"width":153.72,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-3.png","element":"img","alt":"bn a.s.−→ 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.92},"width":254.98,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-4.png","element":"img","alt":"Ω̂n − Ωn a.s.−→ 0","inline":true},{"text":", then ","element":"span"},{"style":{"height":17.6},"width":327.97,"height":43.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-5.png","element":"img","alt":" BIM(D, Ω̂n, b̂n) a.s.−→","inline":true},{"style":{"height":15.6},"width":100.11,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-6.png","element":"img","alt":"v(πe).","inline":true}]]},{"heading":"F. 
Derivation of g(j)(D) using WDR","paragraphs":[[{"text":"In this appendix we derive a reasonable definition for ","element":"span"},{"style":{"height":17.63},"width":123.74,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-7.png","element":"img","alt":"g(j)(D)","inline":true},{"text":", the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return, when using WDR for the importance sampling estimator. We assume that the reader is familiar with our use of control variates in Appendix ","element":"span"},{"text":"B","element":"span"},{"text":". First, consider what control variate should be added to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step PDIS or CWPDIS estimator:","element":"span"}],[{"style":{"width":"30%"},"width":278,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-8.png","element":"img"}],[{"text":"where the definition of ","element":"span"},{"style":{"height":16.66},"width":39.81,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-9.png","element":"img","alt":" wit","inline":true,"padRight":true},{"text":"determines whether this is PDIS ","element":"span"},{"text":"or CWPDIS. 
Reproducing our arguments from Appendix ","element":"span"},{"text":"B","element":"span"},{"text":", we find that a reasonable definition for ","element":"span"},{"style":{"height":18.76},"width":139.39,"height":46.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-10.png","element":"img","alt":" IS(j)(D)","inline":true,"padRight":true},{"text":"is similar to (","element":"span"},{"text":"15","element":"span"},{"text":"), but with the time index, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", summing only to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"and using ","element":"span"},{"style":{"height":16.66},"width":39.8,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-11.png","element":"img","alt":" wit","inline":true,"padRight":true},{"text":"terms rather than ","element":"span"},{"style":{"height":16.66},"width":32.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-12.png","element":"img","alt":" ρit","inline":true,"padRight":true},{"text":"terms for generality:","element":"span"}],[{"style":{"width":"93%"},"width":855,"height":389,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-13.png","element":"img"}],[{"text":"Notice that this definition is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"equivalent to what one would get if (","element":"span"},{"text":"2","element":"span"},{"text":") were modified only so that the sum goes from time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 0 ","element":"span"},{"text":"to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":", since that definition would include reward predictions beyond ","element":"span"},{"style":{"height":15.13},"width":42.44,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-14.png","element":"img","alt":" Rj","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":10.8},"width":21.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-15.png","element":"img","alt":" ˆv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":22.89,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-16.png","element":"img","alt":" ˆq","inline":true,"padRight":true},{"text":"terms. Instead, this definition is equivalent to the definition of (","element":"span"},{"text":"2","element":"span"},{"text":") if it were applied to a modified MDP where every episode terminates after ","element":"span"},{"style":{"height":15.13},"width":42.45,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-17.png","element":"img","alt":" Rj","inline":true,"padRight":true},{"text":"is produced.","element":"span"}],[{"text":"Next, consider the definition of ","element":"span"},{"style":{"height":18.76},"width":168.47,"height":46.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-18.png","element":"img","alt":" AM(j)(D)","inline":true},{"text":". 
We might use importance sampling to correct for the distribution of ","element":"span"},{"style":{"height":15.13},"width":36.78,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-19.png","element":"img","alt":" Sj","inline":true},{"text":", and the model to predict the remaining rewards:","element":"span"},{"text":"13","element":"span"}],[{"style":{"width":"88%"},"width":808,"height":265,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-20.png","element":"img"}],[{"text":"Notice that ","element":"span"},{"style":{"height":14.76},"width":103.16,"height":36.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-21.png","element":"img","alt":" AM(j)","inline":true,"padRight":true},{"text":"is not a purely model-based estimator if ","element":"span"},{"style":{"height":13.6},"width":95.32,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-22.png","element":"img","alt":"j ≥ 0","inline":true,"padRight":true},{"text":"since it uses importance weights. Furthermore, this use of importance sampling can result in high variance. 
To partially mitigate this variance, we can introduce a control variate to get a new definition:","element":"span"}],[{"style":{"width":"99%"},"width":910,"height":416,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-23.png","element":"img"}],[{"text":"As in our derivation of the DR estimator in Appendix ","element":"span"},{"text":"B","element":"span"},{"text":", we can repeat this process, continuing to add control variates until the control variate is no longer random, which gives our final definition of ","element":"span"},{"style":{"height":18.76},"width":168.47,"height":46.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-24.png","element":"img","alt":" AM(j)(D)","inline":true},{"text":":","element":"span"}],[{"style":{"width":"99%"},"width":910,"height":388,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-25.png","element":"img"}],[{"text":"Combining the ","element":"span"},{"text":"IS ","element":"span"},{"text":"and ","element":"span"},{"text":"AM ","element":"span"},{"text":"definitions to produce an off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return as defined in (","element":"span"},{"text":"4","element":"span"},{"text":"), we have:","element":"span"}],[{"style":{"width":"99%"},"width":905,"height":916,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/28-26.png","element":"img"}],[{"text":"Notice that the terms ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"use predictions of rewards up until and including ","element":"span"},{"style":{"height":15.13},"width":42.45,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-0.png","element":"img","alt":" Rj","inline":true},{"text":", while the terms 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"(c) ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(d) ","element":"span"},{"text":"use predictions of rewards beginning with ","element":"span"},{"style":{"height":15.13},"width":82.65,"height":37.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-1.png","element":"img","alt":" Rj+1","inline":true,"padRight":true},{"text":"and going to infinity. So, with algebraic manipulations we can combine ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(c) ","element":"span"},{"text":"to get","element":"span"}],[{"style":{"width":"49%"},"width":453,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-2.png","element":"img"}],[{"text":"and we can combine ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(d) ","element":"span"},{"text":"to get:","element":"span"}],[{"style":{"width":"46%"},"width":428,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-3.png","element":"img"}],[{"text":"So, we have that","element":"span"}],[{"style":{"width":"96%"},"width":877,"height":252,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-4.png","element":"img"}]]},{"heading":"G. MAGIC Details","paragraphs":[[{"text":"In this section we provide additional details about the MAGIC algorithm. 
Specifically, we describe exactly how we estimate ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-5.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-6.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"before presenting pseudocode for MAGIC.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.1. Estimating ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-7.png","element":"img","alt":" Ωn","inline":true}],[{"text":"We can write ","element":"span"},{"style":{"height":17.63},"width":123.73,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-8.png","element":"img","alt":" g(j)(D)","inline":true,"padRight":true},{"text":"as the sum of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"terms:","element":"span"}],[{"style":{"width":"70%"},"width":640,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-9.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"89%"},"width":815,"height":252,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-10.png","element":"img"}],[{"text":"So,","element":"span"}],[{"style":{"width":"95%"},"width":870,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-11.png","element":"img"}],[{"text":"Notice that ","element":"span"},{"style":{"height":20.38},"width":123.73,"height":50.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-12.png","element":"img","alt":" g(j)i (D)","inline":true,"padRight":true},{"text":"really is a function of all of 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", not ","element":"span"},{"text":"just ","element":"span"},{"style":{"height":13.13},"width":43.24,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-13.png","element":"img","alt":" Hi","inline":true},{"text":", since ","element":"span"},{"style":{"height":21.31},"width":309.01,"height":53.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-14.png","element":"img","alt":" wit = ρit/ ∑nj=1 ρjt","inline":true},{"text":". This means that, although ","element":"span"},{"text":"the terms in the sum, ","element":"span"},{"style":{"height":20.74},"width":226.04,"height":51.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-15.png","element":"img","alt":"∑nk=1 g(i)k (D)","inline":true},{"text":", are identically ","element":"span"},{"text":"distributed, they are not independent, due to their shared reliance on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". 
However, notice that the ","element":"span"},{"style":{"height":20.74},"width":120.3,"height":51.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-16.png","element":"img","alt":" g(i)k (D)","inline":true,"padRight":true},{"text":"terms, ","element":"span"},{"text":"for various ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", become less dependent as ","element":"span"},{"style":{"height":8.4},"width":131.32,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-17.png","element":"img","alt":" n → ∞","inline":true,"padRight":true},{"text":"because the only dependence of ","element":"span"},{"style":{"height":20.74},"width":120.3,"height":51.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-18.png","element":"img","alt":" g(i)k (D)","inline":true,"padRight":true},{"text":"on trajectories other than ","element":"span"},{"style":{"height":13.13},"width":48.24,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-19.png","element":"img","alt":"Hk","inline":true,"padRight":true},{"text":"comes from the denominator of ","element":"span"},{"style":{"height":16.66},"width":39.81,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-20.png","element":"img","alt":" wit","inline":true},{"text":", which converges ","element":"span"},{"text":"almost surely to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"(we established this in our proofs that WDR is strongly consistent).","element":"span"}],[{"text":"We therefore propose an approximation of ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-21.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"that comes from the assumption that the 
","element":"span"},{"style":{"height":20.74},"width":120.3,"height":51.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-22.png","element":"img","alt":" g(i)k (D)","inline":true,"padRight":true},{"text":"terms, for various ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", ","element":"span"},{"text":"are independent:","element":"span"}],[{"style":{"width":"82%"},"width":754,"height":382,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-23.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"span"},{"text":"comes from the assumption that ","element":"span"},{"style":{"height":20.74},"width":120.31,"height":51.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-24.png","element":"img","alt":" g(i)k (D)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.74},"width":123.74,"height":51.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-25.png","element":"img","alt":"g(j)l (D)","inline":true,"padRight":true},{"text":"are independent for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j, k, ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"where ","element":"span"},{"style":{"height":14.8},"width":92.3,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-26.png","element":"img","alt":" k ̸= l","inline":true},{"text":", ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"span"},{"text":"comes from the assumption that they are identically distributed, and where ","element":"span"},{"style":{"height":24.14},"width":120.3,"height":60.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-27.png","element":"img","alt":" g(i)(·)(D)","inline":true,"padRight":true},{"text":"uses 
","element":"span"},{"style":{"height":15.6},"width":40.85,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-28.png","element":"img","alt":" (·)","inline":true,"padRight":true},{"text":"to denote that any subscript ","element":"span"},{"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , n","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"could be used since the random variables are independent and identically distributed.","element":"span"}],[{"text":"We therefore approximate ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-29.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"using the sample covariance:","element":"span"}],[{"style":{"width":"90%"},"width":824,"height":178,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-30.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"49%"},"width":455,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-31.png","element":"img"}],[{"text":"The above scheme for estimating ","element":"span"},{"style":{"height":13.12},"width":47.01,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-32.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"is the one that we use in our pseudocode and experiments. However, we also experimented with bootstrap estimates of ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-33.png","element":"img","alt":" Ωn","inline":true},{"text":". 
They yielded similar performance at significantly higher computational cost.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.2. Estimating ","element":"span"},{"style":{"height":13.13},"width":43.77,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-34.png","element":"img","alt":" bn","inline":true}],[{"text":"As described previously, we use a confidence interval, ","element":"span"},{"style":{"height":17.63},"width":248.36,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-35.png","element":"img","alt":"CI(g(∞)(D), δ)","inline":true},{"text":", when computing ","element":"span"},{"style":{"height":13.13},"width":35.64,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/29-36.png","element":"img","alt":" bn","inline":true},{"text":". We stated that the confidence interval that we use is a combination of the percentile bootstrap and the Chernoff-Hoeffding inequality. Specifically, we compute the confidence interval produced by both methods, and return the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"tighter ","element":"span"},{"text":"of the two. In practice, this is nearly always the confidence interval produced by the percentile bootstrap, and so practical implementations of MAGIC may just use the percentile bootstrap. We include the loose Chernoff-Hoeffding bound because it allows for easier theoretical analysis of the MAGIC algorithm.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.3. Pseudocode","element":"span"}],[{"text":"Pseudocode for the MAGIC algorithm is provided in Algorithm ","element":"span"},{"text":"1","element":"span"},{"text":". 
It takes as input ","element":"span"},{"style":{"height":13.2},"width":88.82,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-0.png","element":"img","alt":" D, πe","inline":true},{"text":", and an approximate model, all of which are defined in Section ","element":"span"},{"text":"2","element":"span"},{"text":". It also takes as input ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J","element":"span"},{"text":", which is defined in Section ","element":"span"},{"text":"7","element":"span"},{"text":", and a positive integer ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-1.png","element":"img","alt":" κ","inline":true},{"text":", which we have not defined previously. We use ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-2.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"to denote the number of times the bootstrap algorithm should resample the trajectories. In our experiments we used ","element":"span"},{"style":{"height":10.4},"width":139.05,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-3.png","element":"img","alt":" κ = 200","inline":true},{"text":". In general, it should be as large as runtime constraints allow. 
Other literature has suggested that it should be chosen to be approximately ","element":"span"},{"style":{"height":10.4},"width":155.95,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-4.png","element":"img","alt":" κ = 2000","inline":true,"padRight":true},{"text":"(","element":"span"},{"text":"Efron & Tibshirani","element":"span"},{"text":", ","element":"span"},{"text":"1993","element":"span"},{"text":"; ","element":"span"},{"text":"Davison & Hinkley","element":"span"},{"text":", ","element":"span"},{"text":"1997","element":"span"},{"text":").","element":"span"}],[{"text":"Line 2 calls for the ","element":"span"},{"style":{"height":15.6},"width":159.21,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-5.png","element":"img","alt":" |J | × |J |","inline":true,"padRight":true},{"text":"matrix, ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-6.png","element":"img","alt":"�Ωn","inline":true},{"text":", to be computed according to (","element":"span"},{"text":"25","element":"span"},{"text":").","element":"span"}],[{"text":"Line 3 specifies that a structure, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", should be created. This structure will be used to store the bootstrap resamplings, such that ","element":"span"},{"style":{"height":13.13},"width":43.11,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-7.png","element":"img","alt":" Di","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"height":13.23},"width":34.9,"height":33.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-8.png","element":"img","alt":" ith","inline":true,"padRight":true},{"text":"resampling of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". 
That is, ","element":"span"},{"style":{"height":13.13},"width":43.1,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-9.png","element":"img","alt":" Di","inline":true,"padRight":true},{"text":"is a set of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"trajectories and the behavior policies that generated them, sampled with replacement from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"(this resampling is done on lines 4–6).","element":"span"}],[{"text":"Line 7 calls for the creation of a vector, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"v","element":"span"},{"text":", to store the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return for ","element":"span"},{"style":{"height":13.6},"width":118.33,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-10.png","element":"img","alt":" j = ∞","inline":true,"padRight":true},{"text":"(recall that this is just the WDR estimator) for each bootstrap sample, sorted into ascending order. Lines 8 and 9 then compute the percentile bootstrap ","element":"span"},{"text":"10% ","element":"span"},{"text":"confidence interval, ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"l, u","element":"span"},{"text":"]","element":"span"},{"text":", for the mean of ","element":"span"},{"style":{"height":17.63},"width":140.24,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-11.png","element":"img","alt":"g(∞)(D)","inline":true},{"text":", which we ensure includes ","element":"span"},{"text":"WDR(","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":")","element":"span"},{"text":". 
For our theoretical analysis, we add a line after this that sets","element":"span"}],[{"style":{"width":"84%"},"width":773,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-12.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"89%"},"width":815,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-14.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"is a bound on the range of ","element":"span"},{"style":{"height":17.63},"width":120.31,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-15.png","element":"img","alt":" g(i)(D)","inline":true},{"text":". In practice, these lines almost never change the values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"and can be ignored.","element":"span"}],[{"text":"Lines 10–12 then show how the bias vector can be computed from the already defined terms. Notice that the order of ","element":"span"},{"style":{"height":17.63},"width":145.05,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-16.png","element":"img","alt":" g(Jj)(D)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"does not matter since the bias term in the decomposition of mean squared error is squared. The order that we use facilitates a simple consistency proof for MAGIC. 
Given that the covariance matrix and bias vector have been approximated, Line 13 sets ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"to be the solution of a constrained quadratic program (in our experiments we solved this quadratic program using the Gurobi library). Finally, line 14 returns the weighted combination of the different off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns (recall that ","element":"span"},{"style":{"height":15.6},"width":113.79,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-17.png","element":"img","alt":" gJ (D)","inline":true,"padRight":true},{"text":"is defined in Section ","element":"span"},{"text":"7","element":"span"},{"text":").","element":"span"}],[{"style":{"width":"100%"},"width":912,"height":1349,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-18.png","element":"img"}]]},{"heading":"H. Consistency of MAGIC","paragraphs":[[{"text":"In this section we prove Theorem ","element":"span"},{"text":"4","element":"span"},{"text":", which states that if Assumptions ","element":"span"},{"text":"1 ","element":"span"},{"text":"and ","element":"span"},{"text":"4 ","element":"span"},{"text":"hold and ","element":"span"},{"style":{"height":12.8},"width":119.19,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-19.png","element":"img","alt":" ∞ ∈ J","inline":true,"padRight":true},{"text":", then ","element":"span"},{"style":{"height":17.59},"width":272.8,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-20.png","element":"img","alt":" MAGIC(D) a.s.−→","inline":true},{"style":{"height":15.6},"width":100.11,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-21.png","element":"img","alt":"v(πe).","inline":true,"padRight":true},{"text":"This result follows immediately from Theorem 
","element":"span"},{"text":"3 ","element":"span"},{"text":"if ","element":"span"},{"style":{"height":15.92},"width":270.12,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-22.png","element":"img","alt":"�Ωn − Ωn a.s.−→ 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.92},"width":263.65,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-23.png","element":"img","alt":"�bn − bn a.s.−→ 0","inline":true},{"text":", since Assumptions ","element":"span"},{"text":"1 ","element":"span"},{"text":"and ","element":"span"},{"text":"4 ","element":"span"},{"text":"are sufficient to ensure that ","element":"span"},{"style":{"height":17.63},"width":196.78,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-24.png","element":"img","alt":" g(∞)(D) =","inline":true},{"style":{"height":17.59},"width":359.6,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-25.png","element":"img","alt":"WDR(D) a.s.−→ v(πe)","inline":true},{"text":". ","element":"span"},{"text":"In Appendix ","element":"span"},{"text":"H.3 ","element":"span"},{"text":"we show that ","element":"span"},{"style":{"height":15.92},"width":260.19,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-26.png","element":"img","alt":"�Ωn − Ωn a.s.−→ 0","inline":true},{"text":", and then in Appendix ","element":"span"},{"text":"H.4 ","element":"span"},{"text":"we show that ","element":"span"},{"style":{"height":15.92},"width":264.99,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/30-27.png","element":"img","alt":"�bn − bn a.s.−→ 0","inline":true},{"text":". 
However, first we establish two useful properties of the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"H.1. Convergence of Off-Policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"style":{"fontWeight":"bold"},"text":"-Step Return","element":"span"}],[{"text":"Recall that the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return used by MAGIC is given by:","element":"span"}],[{"style":{"width":"96%"},"width":877,"height":252,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-0.png","element":"img"}],[{"text":"which can be written as:","element":"span"}],[{"style":{"width":"80%"},"width":729,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-1.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"90%"},"width":821,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-2.png","element":"img"}],[{"text":"Notice that ","element":"span"},{"style":{"height":16.66},"width":46.17,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-3.png","element":"img","alt":" Xit","inline":true,"padRight":true},{"text":"is a bounded random variable since rewards ","element":"span"},{"text":"and reward predictions are bounded. 
So, by Lemma ","element":"span"},{"text":"12 ","element":"span"},{"text":"we have that","element":"span"}],[{"style":{"width":"92%"},"width":843,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-4.png","element":"img"}],[{"text":"Also, since ","element":"span"},{"style":{"height":18.63},"width":151.04,"height":46.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-5.png","element":"img","alt":" ˆvπe(SHi0 )","inline":true,"padRight":true},{"text":"is bounded, we have from the Kolmogorov ","element":"span"},{"text":"strong law of large numbers that","element":"span"}],[{"style":{"width":"78%"},"width":713,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-6.png","element":"img"}],[{"text":"So, combining (","element":"span"},{"text":"29","element":"span"},{"text":") and (","element":"span"},{"text":"30","element":"span"},{"text":") we have from Property ","element":"span"},{"text":"3 ","element":"span"},{"text":"that","element":"span"}],[{"style":{"width":"84%"},"width":774,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-7.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":29.01},"width":674.16,"height":72.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-8.png","element":"img","alt":" cj := E[ˆvπe(SH0 ) + ∑jt=0 γtXt | H ∼ πe]","inline":true},{"text":"denote the constant to which ","element":"span"},{"style":{"height":17.63},"width":123.73,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-9.png","element":"img","alt":" g(j)(D)","inline":true,"padRight":true},{"text":"converges.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"H.2. 
Convergence of Component of Off-Policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"style":{"fontWeight":"bold"},"text":"-Step Return","element":"span"}],[{"text":"Recall from (","element":"span"},{"text":"24","element":"span"},{"text":") that the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step return can be written as:","element":"span"}],[{"style":{"width":"40%"},"width":369,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-10.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"89%"},"width":815,"height":252,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-11.png","element":"img"}],[{"text":"Here we will show that for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":20.38},"width":265.41,"height":50.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-12.png","element":"img","alt":" j, g(j)i (D) a.s.−→ 0","inline":true},{"text":".","element":"span"}],[{"style":{"width":"72%"},"width":663,"height":661,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-13.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16.66},"width":46.17,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-14.png","element":"img","alt":"Xit","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.66},"width":32.05,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-15.png","element":"img","alt":"ρit","inline":true,"padRight":true},{"text":"are ","element":"span"},{"text":"bounded, ","element":"span"},{"text":"we ","element":"span"},{"text":"have ","element":"span"},{"text":"that 
","element":"span"},{"style":{"height":18.66},"width":359.71,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-16.png","element":"img","alt":"limn→∞ 1nρitXit = 0","inline":true},{"text":". ","element":"span"},{"text":"Also, by Lemma ","element":"span"},{"text":"11 ","element":"span"},{"text":"and Kolmogorov’s strong law of large numbers, we have that","element":"span"}],[{"style":{"height":18.25},"width":255.35,"height":45.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-17.png","element":"img","alt":"�nk=1 ρkt a.s−→ 1","inline":true},{"text":". So, ","element":"span"},{"style":{"height":17.43},"width":154.52,"height":43.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-18.png","element":"img","alt":" Y it a.s.−→ 0","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". Furthermore, ","element":"span"},{"style":{"height":16.66},"width":42.13,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-19.png","element":"img","alt":" Y it","inline":true,"padRight":true},{"text":"is bounded since ","element":"span"},{"style":{"height":27.46},"width":300.69,"height":68.65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-20.png","element":"img","alt":" 0 ≤ ρit�nk=1 ρkt ≤ 1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.66},"width":46.17,"height":41.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-21.png","element":"img","alt":" Xit","inline":true,"padRight":true},{"text":"is ","element":"span"},{"text":"bounded. 
So, by Property ","element":"span"},{"text":"4","element":"span"},{"text":", we have that ","element":"span"},{"style":{"height":20.38},"width":227.84,"height":50.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-22.png","element":"img","alt":" g(j)i (D) a.s.−→ 0","inline":true},{"text":".","element":"span"}],[{"style":{"width":"0%"},"width":7,"height":2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-23.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"H.3. Consistency of ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-24.png","element":"img","alt":"�Ωn","inline":true}],[{"text":"Here we establish that ","element":"span"},{"style":{"height":15.92},"width":275.93,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-25.png","element":"img","alt":"�Ωn − Ωn a.s.−→ 0","inline":true},{"text":". There are two steps to this result. First we will show that ","element":"span"},{"style":{"height":13.53},"width":234.97,"height":33.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-26.png","element":"img","alt":" limn→∞ Ωn =","inline":true,"padRight":true},{"text":"0","element":"span"},{"text":"—the true covariance matrix converges to the zero matrix. 
We then show that ","element":"span"},{"style":{"height":15.92},"width":168.49,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-27.png","element":"img","alt":"�Ωn a.s.−→ 0","inline":true,"padRight":true},{"text":"as well, which means that ","element":"span"},{"style":{"height":15.92},"width":249.66,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-28.png","element":"img","alt":"�Ωn − Ωn a.s.−→ 0","inline":true},{"text":".","element":"span"}],[{"text":"Recall from Appendix ","element":"span"},{"text":"H.1 ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":17.96},"width":250.09,"height":44.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-29.png","element":"img","alt":" g(j)(D) a.s.−→ cj","inline":true},{"text":". We can write","element":"span"}],[{"style":{"width":"99%"},"width":909,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-30.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"97%"},"width":888,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-31.png","element":"img"}],[{"text":"Recall that ","element":"span"},{"style":{"height":17.96},"width":256.41,"height":44.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-32.png","element":"img","alt":" g(j)(D) a.s.−→ cj","inline":true},{"text":". By Lemma ","element":"span"},{"text":"2 ","element":"span"},{"text":"we therefore have that for all ","element":"span"},{"style":{"height":17.96},"width":450.87,"height":44.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-33.png","element":"img","alt":" j, limn→∞ E[g(j)(D)] = cj","inline":true},{"text":". 
So, by the continuous mapping theorem,","element":"span"}],[{"style":{"width":"43%"},"width":393,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-34.png","element":"img"}],[{"text":"So, ","element":"span"},{"text":"by applying Lemma ","element":"span"},{"text":"2 ","element":"span"},{"text":"to (","element":"span"},{"text":"31","element":"span"},{"text":") we have that ","element":"span"},{"style":{"height":15.6},"width":632.46,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-35.png","element":"img","alt":"limn→∞ Ωn(i, j) = limn→∞ E[Yn] = 0","inline":true},{"text":".","element":"span"}],[{"text":"Next we show that ","element":"span"},{"style":{"height":15.92},"width":156.96,"height":39.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-36.png","element":"img","alt":"�Ωn a.s.−→ 0","inline":true},{"text":". First, recall from Appendix ","element":"span"},{"text":"H.2 ","element":"span"},{"text":"that for all ","element":"span"},{"style":{"height":14},"width":98.6,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-37.png","element":"img","alt":" j ∈ J","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":237.04,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-38.png","element":"img","alt":" k ∈ {1, . . . , n}","inline":true},{"text":",","element":"span"}],[{"style":{"width":"25%"},"width":235,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/31-39.png","element":"img"}],[{"text":"So, by Property ","element":"span"},{"text":"3 ","element":"span"},{"text":"we have that ","element":"span"},{"style":{"height":20.74},"width":229.54,"height":51.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-0.png","element":"img","alt":" ¯g(j)k (D) a.s.−→ 0","inline":true,"padRight":true},{"text":"as well. 
So, ","element":"span"},{"style":{"height":20.74},"width":419.59,"height":51.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-1.png","element":"img","alt":"g(j)k (D) − ¯g(j)k (D) a.s.−→ 0","inline":true},{"text":", and so by Property ","element":"span"},{"text":"3 ","element":"span"},{"text":"and the definition of ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-2.png","element":"img","alt":"�Ωn","inline":true},{"text":", we have that","element":"span"}],[{"style":{"width":"25%"},"width":229,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-3.png","element":"img"}],[{"text":"for all ","element":"span"},{"style":{"height":16.83},"width":174.79,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-4.png","element":"img","alt":" (i, j) ∈ J 2","inline":true},{"text":".","element":"span"}],[{"style":{"width":"100%"},"width":911,"height":318,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-5.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"98%"},"width":895,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-6.png","element":"img"}],[{"text":"We will show that both of the right hand sides above converge almost surely to zero, which, by Lemma ","element":"span"},{"text":"4","element":"span"},{"text":", implies that ","element":"span"},{"style":{"height":15.6},"width":226.64,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-7.png","element":"img","alt":"�bn(j)−bn(j)","inline":true,"padRight":true},{"text":"converges almost surely to zero as well.","element":"span"}],[{"text":"First consider (","element":"span"},{"text":"32","element":"span"},{"text":"). 
We have from Appendix ","element":"span"},{"text":"H.1 ","element":"span"},{"text":"that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"1) ","element":"span"},{"style":{"height":19.11},"width":305.46,"height":47.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-8.png","element":"img","alt":"g(Jj)(D) a.s.−→ cJj","inline":true},{"text":". ","element":"span"},{"text":"So, by Lemma ","element":"span"},{"text":"2 ","element":"span"},{"text":"we have that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"2) ","element":"span"},{"style":{"height":19.11},"width":646.09,"height":47.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-9.png","element":"img","alt":"limn→∞ E[g(Jj)(D)] = E[cJj] = cJj","inline":true},{"text":". ","element":"span"},{"text":"We also have that ","element":"span"},{"style":{"height":22.57},"width":429.21,"height":56.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-10.png","element":"img","alt":" u − l ≤ 1√n�2ξ2 ln(2/δ)","inline":true},{"text":", by (","element":"span"},{"text":"26","element":"span"},{"text":") and (","element":"span"},{"text":"27","element":"span"},{"text":"). Since ","element":"span"},{"style":{"height":15.6},"width":282.31,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-11.png","element":"img","alt":"WDR(D) ∈ [l, u]","inline":true},{"text":", we have that","element":"span"}],[{"style":{"width":"64%"},"width":590,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-12.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-13.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"is a constant, the right side is a sequence of constants (not random variables) that converges to zero. 
The left side is positive and less than the right, and so it too must converge (surely, not just almost surely) to zero: ","element":"span"},{"style":{"height":15.6},"width":464.09,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-14.png","element":"img","alt":"limn→∞ | WDR(D) − l| = 0","inline":true},{"text":". So,","element":"span"}],[{"style":{"width":"99%"},"width":907,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-15.png","element":"img"}],[{"text":"where the last step comes from Theorem ","element":"span"},{"text":"2","element":"span"},{"text":". This means that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"3) ","element":"span"},{"style":{"height":17.59},"width":185.38,"height":43.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-16.png","element":"img","alt":" l a.s.−→ v(πe)","inline":true},{"text":".","element":"span"}],[{"text":"Combining ","element":"span"},{"style":{"fontWeight":"bold"},"text":"1)","element":"span"},{"text":", ","element":"span"},{"style":{"fontWeight":"bold"},"text":"2)","element":"span"},{"text":", and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"3)","element":"span"},{"text":", we have that the right side of (","element":"span"},{"text":"32","element":"span"},{"text":") converges almost surely to zero. This same argument, using the upper bound, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":", rather than the lower bound, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":", shows that the right side of (","element":"span"},{"text":"33","element":"span"},{"text":") converges almost surely to zero as well, and so we can conclude.","element":"span"}]]},{"heading":"I. Extended Empirical Studies (MAGIC)","paragraphs":[[{"text":"Here we present detailed results concerning the MAGIC estimator. 
These results use the same three domains and two experimental setups (full-data and half-data) that were introduced in Appendix ","element":"span"},{"text":"D","element":"span"},{"text":", as well as one additional domain, which we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Hybrid ","element":"span"},{"text":"domain. We begin by introducing the Hybrid domain, then discuss minor changes to the experimental setup, and finally present results.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"I.1. The Hybrid Domain","element":"span"}],[{"text":"The purpose of this domain is to showcase a common problem type: domains where early in a trajectory there is partial observability, but as time passes within each trajectory, the partial observability decays. This happens, for example, in robotics applications where there may be some uncertainty about the position or pose of a robot. However, as the trajectory progresses the robot may be able to better localize itself, removing or diminishing the uncertainty.","element":"span"}],[{"text":"We emulate this setting by concatenating the ModelFail and ModelWin domains. That is, the agent begins in the ModelFail domain. Whenever it would transition to the absorbing state, it instead transitions to the initial state of the ModelWin domain.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"I.2. Experimental Setup","element":"span"}],[{"text":"We performed these experiments in the same way as those in Appendix ","element":"span"},{"text":"D","element":"span"},{"text":", except that we compared different estimators. Specifically, we introduce curves for the MAGIC estimator, but remove the curves for the poorly-performing importance sampling estimators, IS, PDIS, WIS, and CWPDIS. So, the plots contain curves for DR, WDR, AM, and MAGIC. 
The legend used by all of the plots in this appendix is provided in Figure ","element":"span"},{"text":"18","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"90%"},"width":827,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-17.png","element":"img"}],[{"text":"Figure 18: The legend used by all plots in Appendix ","element":"figcaption","subtype":"caption"},{"text":"I","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"Also, for the Hybrid domain we included a curve for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"binary MAGIC ","element":"span"},{"text":"(MAGIC-B), which uses ","element":"span"},{"style":{"height":15.6},"width":230.12,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-18.png","element":"img","alt":" J = {−1, ∞}","inline":true},{"text":". Whereas MAGIC blends between AM and WDR using off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns of various lengths, binary MAGIC only places weights on AM and WDR. Our comparison to MAGIC-B shows the importance of including the off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns rather than merely trying to switch between, or directly weight, AM and WDR.","element":"span"}],[{"text":"Lastly, since all of the domains have finite horizons, we used ","element":"span"},{"style":{"height":15.6},"width":305.86,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/32-19.png","element":"img","alt":" J = {−1, . . . , L}","inline":true,"padRight":true},{"text":"for MAGIC. This means that it uses all of the possible off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"I.3. 
ModelFail Results","element":"span"}],[{"text":"Figure ","element":"span"},{"text":"2b ","element":"span"},{"text":"in Section ","element":"span"},{"text":"9 ","element":"span"},{"text":"depicts the results for the ModelFail domain in the full-data setting. We reproduce this plot in Figure ","element":"span"},{"text":"19","element":"span"},{"text":". In Figure ","element":"span"},{"text":"20 ","element":"span"},{"text":"we show the results for ModelFail in the half-data setting. There is little difference between the plots—in both cases MAGIC properly tracks WDR, so that both WDR and MAGIC outperform AM and DR by at least an order of magnitude for most ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"I.4. ModelWin Results","element":"span"}],[{"text":"Figure ","element":"span"},{"text":"2c ","element":"span"},{"text":"in Section ","element":"span"},{"text":"9 ","element":"span"},{"text":"depicts the results for the ModelWin domain in the full-data setting. We reproduce this plot in Figure ","element":"span"},{"text":"21","element":"span"},{"text":". In Figure ","element":"span"},{"text":"22 ","element":"span"},{"text":"we show the results for ModelWin in the half-data setting. In both cases MAGIC tracks AM, although it drifts away a little as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"increases. This suggests that there may be room for improvement in our estimates of ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/33-0.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.13},"width":43.78,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/33-1.png","element":"img","alt":" bn","inline":true},{"text":". 
However, also notice that due to the logarithmic scale, the difference between MAGIC and AM is small in comparison to the distance between MAGIC and DR.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"I.5. Gridworld Results","element":"span"}],[{"text":"Figures ","element":"span"},{"text":"23 ","element":"span"},{"text":"through ","element":"span"},{"text":"30 ","element":"span"},{"text":"depict the results for the Gridworld-FH and Gridworld-TH domains in both the full and half-data settings. The same general trends are visible. First, WDR tends to outperform DR, sometimes by an order of magnitude. Also, MAGIC tends to track WDR, since in these experiments it is usually the best-performing algorithm. Lastly, for the Gridworld-TH, full-data setting, DR, WDR, and MAGIC all degenerate to AM, while in the Gridworld-TH, half-data setting they degenerate to approximately AM using half as much data.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"I.6. Hybrid Results","element":"span"}],[{"text":"Last, but not least, Figures ","element":"span"},{"text":"31 ","element":"span"},{"text":"and ","element":"span"},{"text":"32 ","element":"span"},{"text":"show the results on the Hybrid domain in the full-data and half-data settings, respectively. Notice that MAGIC significantly outperforms all other methods, including WDR and AM. MAGIC also outperforms MAGIC-B, which shows the importance of using off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns for various values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"I.7. Summary","element":"span"}],[{"text":"Overall, MAGIC acts as desired—it tracks WDR or AM, whichever is better for the application at hand. However, notice that it does not do this perfectly, particularly when there is little data available. 
This is likely because when there is little data it is difficult to estimate ","element":"span"},{"style":{"height":13.13},"width":47.01,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/33-2.png","element":"img","alt":" Ωn","inline":true},{"text":", and the confidence interval used when estimating ","element":"span"},{"style":{"height":13.13},"width":43.77,"height":32.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/33-3.png","element":"img","alt":" bn","inline":true,"padRight":true},{"text":"will be loose. In some cases, even when there is a large amount of data, MAGIC struggles to properly track AM. However, this tends to be when both methods perform well, and may be","element":"span"}],[{"style":{"width":"98%"},"width":899,"height":515,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/33-4.png","element":"img"}],[{"text":"Figure 19: ModelFail, full-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":900,"height":514,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/33-5.png","element":"img"}],[{"text":"Figure 20: ModelFail, half-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":905,"height":519,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/33-6.png","element":"img"}],[{"text":"Figure 21: ModelWin, full-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1880,"height":520,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/34-0.png","element":"img"}],[{"text":"Figure 22: ModelWin, half-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":908,"height":514,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/34-1.png","element":"img"}],[{"text":"Figure 23: Gridworld-FH, 
full-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":903,"height":509,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/34-2.png","element":"img"}],[{"text":"Figure 24: Gridworld-FH, half-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":901,"height":508,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/34-3.png","element":"img"}],[{"text":"Figure 25: Gridworld-TH, full-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":895,"height":498,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/34-4.png","element":"img"}],[{"text":"Figure 26: Gridworld-TH, half-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":901,"height":512,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/34-5.png","element":"img"}],[{"text":"Figure 27: Gridworld-FH, full-data. p1p2","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":902,"height":506,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/35-0.png","element":"img"}],[{"text":"Figure 28: Gridworld-FH, half-data. p1p2","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":899,"height":509,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/35-1.png","element":"img"}],[{"text":"Figure 29: Gridworld-TH, full-data. p1p2","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":898,"height":499,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/35-2.png","element":"img"}],[{"text":"Figure 30: Gridworld-TH, half-data. 
p1p2","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":903,"height":536,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/35-3.png","element":"img"}],[{"text":"Figure 31: Hybrid, full-data.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":902,"height":533,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/35-4.png","element":"img"}],[{"text":"Figure 32: Hybrid, half-data.","element":"figcaption","subtype":"caption"}],[{"text":"due to an increased difficulty of determining which method to favor when they both are improving rapidly with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":".","element":"span"}],[{"text":"We also showed in Figures ","element":"span"},{"text":"31 ","element":"span"},{"text":"and ","element":"span"},{"text":"32 ","element":"span"},{"text":"an example where MAGIC outperformed MAGIC-B by an order of magnitude, and all previous methods (including DR) by 2–3 orders of magnitude. ","element":"span"},{"text":"This exemplifies ","element":"span"},{"style":{"fontWeight":"bold"},"text":"1) ","element":"span"},{"text":"the importance of blending between importance sampling methods and purely model-based estimators using off-policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-step returns, as opposed to selecting between or directly weighting WDR and AM and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"2) ","element":"span"},{"text":"the power of MAGIC relative to existing estimators.","element":"span"}]]},{"heading":"J. Future Work","paragraphs":[[{"text":"Several avenues of future work remain. 
Good performance of MAGIC is contingent on our ability to efficiently estimate ","element":"span"},{"style":{"height":13.12},"width":47.01,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/36-0.png","element":"img","alt":" Ωn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.12},"width":43.78,"height":32.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1604.00923/images/36-1.png","element":"img","alt":" bn","inline":true},{"text":", and so improved estimators for these terms could yield even better performance. For instance, if the sample mean importance weight is near zero, then the importance sampling estimators have high variance that is not captured by the sample covariance matrix that we use.","element":"span"}],[{"text":"Another possible avenue of future work would be to consider how MAGIC could be applied when our fundamental assumptions are violated. For example, what should be done if the transition and reward functions of the MDP are nonstationary? Can our estimators be extended to the average reward setting? What should be done if the behavior policies are not known exactly? If the approximate model is not provided initially, but constructed from the same data that is used to produce the DR, WDR, or MAGIC estimates, will DR, WDR, and MAGIC remain strongly consistent estimators? If there are multiple approximate models available, is there a way to detect which one will work best with DR, WDR, and MAGIC?","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]]}]}]