3a:[["$","audio",null,{"id":"tts"}],["$","$L3f",null,{"paperID":"96191","publisher":"neurips","paperJSON":{"title":"Action Gaps and Advantages in Continuous-Time Distributional Reinforcement Learning","paperID":"96191","avgLineHeight":10.93,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"When decisions are made at high frequency, traditional reinforcement learning (RL) methods struggle to accurately estimate action values. In turn, their performance is inconsistent and often poor. Whether the performance of distributional RL (DRL) agents suffers similarly, however, is unknown. In this work, we establish that DRL agents ","element":"span"},{"style":{"fontStyle":"italic"},"text":"are ","element":"span"},{"text":"sensitive to the decision frequency. We prove that action-conditioned return distributions collapse to their underlying policy’s return distribution as the decision frequency increases. We quantify the rate of collapse of these return distributions and exhibit that their statistics collapse at different rates. Moreover, we define distributional perspectives on action gaps and advantages. In particular, we introduce the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"superiority ","element":"span"},{"text":"as a probabilistic generalization of the advantage— the core object of approaches to mitigating performance issues in high-frequency value-based RL. In addition, we build a superiority-based DRL algorithm. 
Through simulations in an option-trading domain, we validate that proper modeling of the superiority distribution produces improved controllers at high decision frequencies.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"In many real-time deployments of reinforcement learning (RL)—quantitative finance, robotics, and autonomous driving, for instance—the state of the environment evolves continuously in time, but policies make decisions at discrete timesteps (","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"units of time apart) [","element":"span"},{"href":"#id-0","referenceIndex":28,"text":"28","element":"a"},{"text":"]. In such systems, the performance of value-based agents is sensitive to the frequency ","element":"span"},{"style":{"height":16},"width":144.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/0-0.png","element":"img","alt":" ω := 1/h","inline":true,"padRight":true},{"text":"with which actions are taken. In particular, action values become indistinguishable as the time between actions decreases. In turn, in high-frequency settings, Baird demonstrated that action value estimates are susceptible to noise and approximation error [","element":"span"},{"href":"#id-1","referenceIndex":20,"text":"20","element":"a"},{"text":"]. Moreover, Tallec et al. 
exhibited that the performance of popular deep ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning agents is inconsistent and often poor [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"].","element":"span"}],[{"text":"In order to remedy this sensitivity, Baird proposed the advantage function and advantage-based variants of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning, Advantage Updating (AU) [","element":"span"},{"href":"#id-1","referenceIndex":20,"text":"20","element":"a"},{"text":"] and Advantage Learning (AL) [","element":"span"},{"href":"#id-3","referenceIndex":2,"text":"2","element":"a"},{"text":"]. Unlike action values, advantages (appropriately rescaled) do not become indistinguishable as decision frequency increases. As a result, Baird, in [","element":"span"},{"href":"#id-1","referenceIndex":20,"text":"20","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":2,"text":"2","element":"a"},{"text":"], demonstrated that advantage-based agents can learn faster and be more resilient to noise than their action value-based counterparts. Furthermore, Tallec et al., in [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"], exhibited that their extension of AU, Deep Advantage Updating (DAU), works efficiently over a wide range of timesteps and environments, unlike standard deep ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning approaches.","element":"span"}],[{"text":"While advantage-based approaches to RL have demonstrated robustness to decision frequency, in this work, we establish that they are nevertheless sensitive to the frequency with which actions are taken. 
This discovery arises as we answer the question: to what extent is the performance of distributional RL (DRL) agents sensitive to decision frequency? To this end, we build theory within the formalism of continuous-time RL where environmental dynamics are governed by SDEs, as in [","element":"span"},{"href":"#id-4","referenceIndex":25,"text":"25","element":"a"},{"text":"]. Additionally, we validate our theory empirically through simulations. Specifically, we make the following four contributions:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Distributional Action Gap. ","element":"span"},{"text":"First, we extend notions of action gap to the realm of DRL. Precisely, we consider the minimal distance between pairs of action-conditioned distributions under metrics on the space of probability measures on ","element":"span"},{"text":"R","element":"span"},{"text":". We observe that some metrics are viable for this extension, while others are not. This formalism sets the stage for analyzing the influence of individual actions as well as decision frequency on, for example, an agent’s return distributions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Collapse of Distributional Control at High Frequency. ","element":"span"},{"text":"Second, we establish tight bounds on the distributional action gaps of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"style":{"fontStyle":"italic"},"text":"-dependent action-conditioned return distributions","element":"span"},{"text":"—return distributions induced by applying a specific initial action for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"units of time. 
We prove that these distributional action gaps not only collapse, as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"tends to zero, but do so at a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"slower rate ","element":"span"},{"text":"than action-value gaps. On one hand, therefore, distributional ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning algorithms are susceptible to the same failures as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning in continuous-time RL. On the other hand, however, remedies to these failures transliterated to distributional ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning algorithms are unlikely to succeed, because the means of these return distributions collapse ","element":"span"},{"style":{"fontStyle":"italic"},"text":"faster ","element":"span"},{"text":"than their other statistics.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Distributional Superiority. ","element":"span"},{"text":"Third, we propose an axiomatic construction of a distributional analogue of the advantage, which we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"superiority","element":"span"},{"text":". Leveraging our analysis of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"-dependent action-conditioned returns and their distributional action gaps, we present a frequency-scaled superiority distribution that enables greedy action selection at any fixed decision frequency.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A Distributional Action Gap-Preserving Algorithm. ","element":"span"},{"text":"Fourth, we propose an algorithm that learns the superiority distribution from data. 
Empirically, we demonstrate that our algorithm maintains the ability to perform policy optimization at high frequencies more reliably than existing methods.","element":"span"}]]},{"heading":"2 Setting","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Notation: ","element":"span"},{"text":"Spaces will either be subsets of Euclidean space or discrete. Measurability, in the former case, will be with respect to the Borel sigma algebra; in the latter case, it will be with respect to the power set. The set of probability measures over a space ","element":"span"},{"text":"Y ","element":"span"},{"text":"will be denoted by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"text":"Y","element":"span"},{"text":")","element":"span"},{"text":". Functions on spaces are assumed to be measurable. For ","element":"span"},{"style":{"height":14.8},"width":179,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-0.png","element":"img","alt":" f : Y → Z","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":176,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-1.png","element":"img","alt":" µ ∈ P(Y)","inline":true},{"text":", the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"push forward ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-2.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"through/by ","element":"span"},{"style":{"height":16.6},"width":271.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-3.png","element":"img","alt":" f, f#µ ∈ P(Z)","inline":true},{"text":", is defined by 
","element":"span"},{"style":{"height":17.8},"width":272,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-4.png","element":"img","alt":" f#µ := µ ◦ f −1","inline":true},{"text":". For a random variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", defined implicitly on some probability space ","element":"span"},{"style":{"height":16},"width":146,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-5.png","element":"img","alt":" (Ω, F, P)","inline":true},{"text":", we write law","element":"span"},{"style":{"height":16.6},"width":217.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-6.png","element":"img","alt":"(X) := X#P","inline":true,"padRight":true},{"text":"to denote the law of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":"; the notation ","element":"span"},{"style":{"height":13.6},"width":160,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-7.png","element":"img","alt":" X =law Y","inline":true,"padRight":true},{"text":"is shorthand for law","element":"span"},{"style":{"height":16},"width":629.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-8.png","element":"img","alt":"(X) = law(Y ). 
For any µ ∈ P(R), the","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"quantile function of ","element":"span"},{"style":{"height":19.6},"width":112.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-9.png","element":"img","alt":" µ, F −1µ ","inline":true,"padRight":true},{"text":", is defined by ","element":"span"},{"style":{"height":18.6},"width":692,"height":46.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-10.png","element":"img","alt":" F −1µ (τ) := infz{Fµ(z) ≥ τ}, where Fµ(z)","inline":true,"padRight":true},{"text":"is the CDF of ","element":"span"},{"style":{"height":10.8},"width":30.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-11.png","element":"img","alt":" µ.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Continuous-Time RL","element":"span"}],[{"text":"Here we give a brief introduction to the technical aspects of continuous-time RL, à la [","element":"span"},{"href":"#id-4","referenceIndex":25,"text":"25","element":"a"},{"text":"]. We provide additional exposition and references in Appendix ","element":"span"},{"text":"A","element":"span"},{"text":". 
For any reader looking to defer some of this technical introduction, we summarize the core objects of interest at ","element":"span"},{"href":"#id-5","text":"the end of Section 2.1.1","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"MDPs","element":"span"}],[{"text":"Continuous-time Markov Decision Processes (MDPs) are defined by three spaces and four measurable functions: a time interval ","element":"span"},{"text":"T ","element":"span"},{"text":":= [0","element":"span"},{"style":{"fontStyle":"italic"},"text":", T","element":"span"},{"text":"] ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":16},"width":186,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-12.png","element":"img","alt":" T ∈ (0, ∞)","inline":true,"padRight":true},{"text":"or ","element":"span"},{"text":"T ","element":"span"},{"text":":= [0","element":"span"},{"style":{"fontStyle":"italic"},"text":", T","element":"span"},{"text":") ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":11},"width":123.5,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-13.png","element":"img","alt":" T = ∞","inline":true},{"text":", a state space ","element":"span"},{"style":{"height":12.8},"width":131.5,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-14.png","element":"img","alt":"X ⊂ Rn","inline":true},{"text":", an action space ","element":"span"},{"text":"A","element":"span"},{"text":", a drift ","element":"span"},{"style":{"height":12.2},"width":349.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-15.png","element":"img","alt":" b : T × X × A → Rn","inline":true},{"text":", a diffusion 
","element":"span"},{"style":{"height":13},"width":401.5,"height":32.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-16.png","element":"img","alt":" σ : T × X × A → Rn×n","inline":true},{"text":", a reward ","element":"span"},{"style":{"height":11.4},"width":244.5,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-17.png","element":"img","alt":" r : T × X → R","inline":true},{"text":", and a terminal reward ","element":"span"},{"style":{"height":17.4},"width":438.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-18.png","element":"img","alt":" f : X → R.3 The pair (b, σ)","inline":true,"padRight":true},{"text":"govern the environment’s dynamics by a family of SDEs parameterized by ","element":"span"},{"style":{"height":13.6},"width":103,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-19.png","element":"img","alt":" a ∈ A,","inline":true}],[{"id":"id-6","style":{"width":"71%"},"width":1132,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-20.png","element":"img"}],[{"text":"Here ","element":"span"},{"style":{"height":16.6},"width":123,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-21.png","element":"img","alt":" (Bt)t≥0","inline":true,"padRight":true},{"text":"is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"-dimensional Brownian motion. 
In turn, any solution to (","element":"span"},{"href":"#id-6","text":"2.1","element":"a"},{"text":") collects the state paths of an agent that chooses action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"at every time, regardless of the state they are in.","element":"span"}],[{"text":"As is done in discrete-time RL, an agent might consider the induced Markov Reward Process (MRP) derived from a policy ","element":"span"},{"style":{"height":17.4},"width":345,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-22.png","element":"img","alt":" π : T × X → P(A).4 ","inline":true,"padRight":true},{"text":"The dynamics of a policy-induced MRP (with policy ","element":"span"},{"style":{"height":13.6},"width":35,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/1-23.png","element":"img","alt":" π)","inline":true}],[{"text":"are governed by the SDE","element":"span"}],[{"id":"id-10","style":{"width":"70%"},"width":1118,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-0.png","element":"img"}],[{"text":"Here, following [","element":"span"},{"href":"#id-7","referenceIndex":38,"text":"38","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":15,"text":"15","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":16,"text":"16","element":"a"},{"text":"], the policy-averaged coefficients ","element":"span"},{"style":{"height":11.4},"width":157.5,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-1.png","element":"img","alt":" bπ and σπ ","inline":true,"padRight":true},{"text":"are defined by","element":"span"}],[{"id":"id-68","style":{"width":"98%"},"width":1560,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-2.png","element":"img"}],[{"text":"Thus, solutions to (","element":"span"},{"href":"#id-10","text":"2.2","element":"a"},{"text":") 
collect the paths of an agent following policy ","element":"span"},{"style":{"height":7.2},"width":31,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-3.png","element":"img","alt":" π.","inline":true}],[{"text":"A class of policies central to our study consists of those that fix an action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"from some time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"for a given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"persistence horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 2.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h > ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":12},"width":98,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-4.png","element":"img","alt":" a ∈ A","inline":true},{"style":{"fontStyle":"italic"},"text":", a policy ","element":"span"},{"style":{"height":7.2},"width":22.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-5.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is said to be ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h, a","element":"span"},{"text":")","element":"span"},{"text":"-persistent at time ","element":"span"},{"style":{"height":11.8},"width":92,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-6.png","element":"img","alt":" t ∈ T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if 
","element":"span"},{"style":{"height":16},"width":719,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-7.png","element":"img","alt":"π(· | s, y) = δa for all (s, y) ∈ [t, t + h) × X.","inline":true}],[{"text":"In particular, given a policy ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-8.png","element":"img","alt":" π","inline":true},{"text":", we will consider ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h, a","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"-persistent modifications of ","element":"span"},{"style":{"height":13.4},"width":202,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-9.png","element":"img","alt":" π: for t ∈ T,","inline":true}],[{"style":{"width":"48%"},"width":772,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-10.png","element":"img"}],[{"text":"These policies will help us understand the influence of taking actions relative to others as well as to those taken by ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-11.png","element":"img","alt":" π","inline":true},{"text":". 
We assume ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"is small enough so that ","element":"span"},{"style":{"height":12.6},"width":168.5,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-12.png","element":"img","alt":" t + h ∈ T.","inline":true}],[{"text":"In order to guarantee the global-in-time existence and uniqueness of solutions to our SDEs (","element":"span"},{"href":"#id-6","text":"2.1","element":"a"},{"text":") and (","element":"span"},{"href":"#id-10","text":"2.2","element":"a"},{"text":"), we make two sets of assumptions.","element":"span"}],[{"id":"id-11","style":{"fontWeight":"bold"},"text":"Assumption 2.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-13.png","element":"img","alt":" σ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"have linear growth and are Lipschitz in state, uniformly in time and action: a finite, positive constant ","element":"span"},{"href":"#id-11","style":{"height":13.8},"width":66.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-14.png","element":"img","alt":" C2.2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"exists such that","element":"span"}],[{"style":{"width":"84%"},"width":1346,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-15.png","element":"img"}],[{"id":"id-5","style":{"fontWeight":"bold"},"text":"Assumption 2.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The averaged coefficient functions ","element":"span"},{"style":{"height":11.4},"width":33.5,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-16.png","element":"img","alt":" bπ","inline":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":10.8},"width":41.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-17.png","element":"img","alt":" σπ","inline":true},{"style":{"fontStyle":"italic"},"text":"are Lipschitz in state, uniformly in time: a finite, positive constant ","element":"span"},{"href":"#id-5","style":{"height":14.2},"width":66.5,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-18.png","element":"img","alt":" C2.3","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"exists such that","element":"span"}],[{"style":{"width":"81%"},"width":1286,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-19.png","element":"img"}],[{"text":"These assumptions are standard in the analysis of continuous-time RL, optimal control, and SDEs [","element":"span"},{"href":"#id-12","referenceIndex":12,"text":"12","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":26,"text":"26","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":38,"text":"38","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":16,"text":"16","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":42,"text":"42","element":"a"},{"text":"]. 
Since ","element":"span"},{"style":{"height":7.2},"width":22.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-20.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is a function of state, we note that Assumption ","element":"span"},{"href":"#id-5","text":"2.3 ","element":"a"},{"text":"is not a direct consequence of Assumption ","element":"span"},{"href":"#id-11","text":"2.2","element":"a"},{"text":". The coefficients ","element":"span"},{"style":{"height":11.4},"width":156,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-21.png","element":"img","alt":" bπ and σπ ","inline":true,"padRight":true},{"text":"satisfy the conditions of Assumption ","element":"span"},{"href":"#id-5","text":"2.3 ","element":"a"},{"text":"provided ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-22.png","element":"img","alt":" σ","inline":true,"padRight":true},{"text":"satisfy some (also standard) additional regularity conditions and ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-23.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"satisfies some regularity conditions. 
These technical details are discussed in Appendix ","element":"span"},{"href":"#id-15","text":"A.2","element":"a"},{"text":".","element":"span"}],[{"text":"In summary, in continuous-time RL, there are three stochastic processes of interest: ","element":"span"},{"style":{"height":16.6},"width":133,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-24.png","element":"img","alt":" (X•s )s≥t","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":16.6},"width":288.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-25.png","element":"img","alt":"• ∈ {a, π, π|h,a,t}","inline":true},{"text":", all beginning at some time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". These processes collect the state paths of an agent in one of three scenarios: 1. choosing action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"at every state and time; 2. following a policy ","element":"span"},{"style":{"height":7.2},"width":22.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-26.png","element":"img","alt":" π","inline":true},{"text":"; or 3. 
choosing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"at every state and time for the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"units of time and following ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-27.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"thereafter.","element":"span"}],[{"id":"id-25","style":{"fontWeight":"bold"},"text":"2.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Value Functions and their Distributions","element":"span"}],[{"text":"Given a policy-induced state process ","element":"span"},{"style":{"height":16.4},"width":136,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-28.png","element":"img","alt":" (Xπs )s≥t","inline":true},{"text":", the (discounted) random ","element":"span"},{"style":{"fontStyle":"italic"},"text":"return ","element":"span"},{"style":{"height":16},"width":334,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-29.png","element":"img","alt":" Gπ(t, x) earned by π","inline":true,"padRight":true},{"text":"starting from state ","element":"span"},{"style":{"height":12},"width":315.5,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-30.png","element":"img","alt":" x ∈ X at time t ∈ T","inline":true,"padRight":true},{"text":"is defined [","element":"span"},{"href":"#id-16","referenceIndex":41,"text":"41","element":"a"},{"text":"] by","element":"span"}],[{"style":{"width":"84%"},"width":1342,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-31.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.8},"width":99,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-32.png","element":"img","alt":" f ≡ 
0","inline":true,"padRight":true},{"text":"when ","element":"span"},{"style":{"height":16},"width":185,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-33.png","element":"img","alt":" T = [0, ∞)","inline":true},{"text":". We distinguish returns earned by ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h, a","element":"span"},{"text":")","element":"span"},{"text":"-persistent modifications of policies. We call these ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"style":{"fontStyle":"italic"},"text":"-dependent action-conditioned returns ","element":"span"},{"text":"and denote them by ","element":"span"},{"style":{"height":16.6},"width":171,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-34.png","element":"img","alt":" Zπh(t, x, a)","inline":true},{"text":". ","element":"span"},{"text":"Given ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-35.png","element":"img","alt":" π","inline":true},{"text":", they are defined by","element":"span"}],[{"style":{"width":"94%"},"width":1504,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/2-36.png","element":"img"}],[{"text":"Value-based approaches in RL estimate either the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"value function ","element":"span"},{"style":{"height":16},"width":397,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-0.png","element":"img","alt":" V π(t, x) := E[Gπ(t, x)]","inline":true,"padRight":true},{"text":"or the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"style":{"fontStyle":"italic"},"text":"-dependent action-value function 
","element":"span"},{"style":{"height":16.8},"width":464.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-1.png","element":"img","alt":" Qπh(t, x, a) := E[Zπh(t, x, a)]","inline":true},{"text":".","element":"span"},{"text":"5 ","element":"span"},{"text":"As distributional approaches in ","element":"span"},{"text":"RL estimate the laws of returns, following [","element":"span"},{"href":"#id-16","referenceIndex":41,"text":"41","element":"a"},{"text":"], we define","element":"span"}],[{"style":{"width":"67%"},"width":1066,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-2.png","element":"img"}],[{"text":"It is important to note that only the laws of random returns (and not their representations as random variables) are observable and modeled in practice.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"fontWeight":"bold"},"text":"-Learning in Continuous Time","element":"span"}],[{"text":"The failure of action-value-based RL in continuous-time stems from the collapse of action values at a given state to the value of that state. Precisely, Tallec et al. 
and Jia and Zhou established that","element":"span"}],[{"id":"id-17","style":{"width":"84%"},"width":1344,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.8},"width":133.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-4.png","element":"img","alt":" Hπ ∈ R","inline":true,"padRight":true},{"text":"is independent of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"(see [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"] and [","element":"span"},{"href":"#id-9","referenceIndex":16,"text":"16","element":"a"},{"text":"] respectively).","element":"span"},{"text":"6 ","element":"span"},{"text":"In a discrete action space, given a state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", a concise way to capture the asymptotic information of (","element":"span"},{"href":"#id-17","text":"2.6","element":"a"},{"text":") is to consider the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"action gap ","element":"span"},{"text":"[","element":"span"},{"href":"#id-18","referenceIndex":11,"text":"11","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":5,"text":"5","element":"a"},{"text":"] of the associated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"-dependent action values","element":"span"}],[{"style":{"width":"53%"},"width":856,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-5.png","element":"img"}],[{"text":"Anticipating (","element":"span"},{"href":"#id-17","text":"2.6","element":"a"},{"text":"), which implies that 
","element":"span"},{"style":{"height":16.8},"width":353.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-6.png","element":"img","alt":" gap(Qπh, t, x) = O(h)","inline":true},{"text":", Baird proposed AU wherein he estimated ","element":"span"},{"text":"the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"rescaled advantage function ","element":"span"},{"style":{"height":16.2},"width":47,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-7.png","element":"img","alt":" Aπh ","inline":true,"padRight":true},{"text":"in place of ","element":"span"},{"style":{"height":16},"width":59.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-8.png","element":"img","alt":" Qπh:","inline":true}],[{"id":"id-45","style":{"width":"82%"},"width":1312,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-9.png","element":"img"}],[{"text":"Note that ","element":"span"},{"style":{"height":16.8},"width":355,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-10.png","element":"img","alt":" gap(Aπh, t, x) = O(1)","inline":true},{"text":". Tallec et al. 
[","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"] and Jia and Zhou [","element":"span"},{"href":"#id-9","referenceIndex":16,"text":"16","element":"a"},{"text":"], following Baird, also ","element":"span"},{"text":"estimated ","element":"span"},{"style":{"height":16.4},"width":47,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-11.png","element":"img","alt":" Aπh ","inline":true,"padRight":true},{"text":"to ameliorate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning in continuous time.","element":"span"}]]},{"heading":"3 The Distributional Action Gap","paragraphs":[[{"text":"In this section, we define a distributional notion of action gap; we prove that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"-dependent action-conditioned return distributions collapse to their underlying policy’s return distribution as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"vanishes; and we quantify the rate of collapse of these return distributions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 3.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider an MDP with discrete action space and let ","element":"span"},{"style":{"height":16},"width":465,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-12.png","element":"img","alt":" µ : T × X × A → (P(R), d)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for a metric ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"action gap of ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-13.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"at a state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is given by","element":"span"}],[{"style":{"width":"52%"},"width":836,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-14.png","element":"img"}],[{"text":"While ","element":"span"},{"text":"R ","element":"span"},{"text":"has a canonical metric, induced by ","element":"span"},{"style":{"height":16},"width":38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-15.png","element":"img","alt":" | · |","inline":true},{"text":", the space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"text":"R","element":"span"},{"text":") ","element":"span"},{"text":"does not. So a choice must be made, and some metrics are unsuitable. 
For example, in deterministic MDPs with deterministic policies, return distributions are identified by expected returns: ","element":"span"},{"style":{"height":16.6},"width":161.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-16.png","element":"img","alt":" ζπh(t, x, a)","inline":true,"padRight":true},{"text":"is the delta at ","element":"span"},{"style":{"height":16.6},"width":172.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-17.png","element":"img","alt":" Qπh(t, x, a)","inline":true},{"text":", for all ","element":"span"},{"style":{"height":16.8},"width":746.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-18.png","element":"img","alt":"(t, x, a) ∈ T × X × A. Thus, distgapd(ζπh, t, x)","inline":true,"padRight":true},{"text":"should vanish as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"decreases to zero if ","element":"span"},{"style":{"height":16.8},"width":212,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-19.png","element":"img","alt":" gap(Qπh, t, x)","inline":true,"padRight":true},{"text":"vanishes as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"decreases to zero. With the total variation metric ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= TV","element":"span"},{"text":", for instance, this is not the case, making ","element":"span"},{"text":"TV ","element":"span"},{"text":"unsuitable. 
Indeed, suppose we have a deterministic MDP with ","element":"span"},{"style":{"height":16},"width":281.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-20.png","element":"img","alt":" A = {a1, a2} and","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16.6},"width":622,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-21.png","element":"img","alt":" ζπh(t, x, a1) = δh and ζπh(t, x, a2) = δ0","inline":true},{"text":", for some state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"(see, e.g., [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"]). Then ","element":"span"},{"style":{"height":16.8},"width":986.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-22.png","element":"img","alt":"distgapTV(ζπh, t, x) = 1, for all h > 0, yet gap(Qπh, t, x) = h.","inline":true}],[{"text":"The ","element":"span"},{"style":{"height":15.6},"width":211,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-23.png","element":"img","alt":" Wp distances","inline":true,"padRight":true},{"text":"from the theory of Optimal Transportation (see [","element":"span"},{"href":"#id-20","referenceIndex":36,"text":"36","element":"a"},{"text":"]), however, are suitable. They are defined via ","element":"span"},{"style":{"fontStyle":"italic"},"text":"couplings ","element":"span"},{"text":"of distributions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 3.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":218.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-24.png","element":"img","alt":" µ, ν ∈ P(R)","inline":true},{"style":{"fontStyle":"italic"},"text":". A ","element":"span"},{"style":{"height":18.4},"width":194.5,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-25.png","element":"img","alt":" κ ∈ P(R2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a ","element":"span"},{"text":"coupling of ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-26.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":7.2},"width":19.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-27.png","element":"img","alt":" ν","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if its first and second marginals are ","element":"span"},{"style":{"height":14.6},"width":124,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-28.png","element":"img","alt":" µ and ν","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"respectively. We denote the set of these couplings by ","element":"span"},{"style":{"height":16},"width":133.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-29.png","element":"img","alt":" C (µ, ν).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Definition 3.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":17.6},"width":629,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-30.png","element":"img","alt":" µ, ν ∈ P(R)7 and p ∈ [1, ∞). 
The Wp","inline":true,"padRight":true},{"text":"distance between ","element":"span"},{"style":{"height":14.6},"width":159,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-31.png","element":"img","alt":" µ and ν is","inline":true}],[{"id":"id-21","style":{"width":"77%"},"width":1222,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/3-32.png","element":"img"}],[{"text":"Any coupling attaining the infimum in (","element":"span"},{"href":"#id-21","text":"3.1","element":"a"},{"text":") is called a ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-0.png","element":"img","alt":" Wp","inline":true},{"style":{"fontStyle":"italic"},"text":"-optimal coupling","element":"span"},{"text":". Henceforth, we write ","element":"span"},{"style":{"height":17.2},"width":134.5,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-1.png","element":"img","alt":"distgapp","inline":true,"padRight":true},{"text":"when considering ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-2.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gaps. 
If ","element":"span"},{"style":{"height":10.8},"width":22.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-3.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":7.2},"width":19.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-4.png","element":"img","alt":" ν","inline":true,"padRight":true},{"text":"are deltas at ","element":"span"},{"style":{"height":16.8},"width":190.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-5.png","element":"img","alt":" Qπh(t, x, a1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.8},"width":190.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-6.png","element":"img","alt":" Qπh(t, x, a2)","inline":true,"padRight":true},{"text":"respectively, then the right-hand side of (","element":"span"},{"href":"#id-21","text":"3.1","element":"a"},{"text":") is equal to ","element":"span"},{"style":{"height":16.8},"width":456,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-7.png","element":"img","alt":" |Qπh(t, x, a1) − Qπh(t, x, a2)|","inline":true},{"text":". 
Hence, in ","element":"span"},{"text":"deterministic MDPs with deterministic policies, ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-8.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gaps of ","element":"span"},{"style":{"height":16},"width":37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-9.png","element":"img","alt":" ζπh ","inline":true,"padRight":true},{"text":"are identical to action gaps of ","element":"span"},{"style":{"height":16},"width":48.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-10.png","element":"img","alt":"Qπh","inline":true},{"text":", making the ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-11.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"distances suitable in the above sense. 
In non-deterministic MDPs, the relationship ","element":"span"},{"text":"between ","element":"span"},{"style":{"height":18},"width":572.5,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-12.png","element":"img","alt":" distgapp(ζπh, t, x) and gap(Qπh, t, x)","inline":true,"padRight":true},{"text":"is opaque.","element":"span"}],[{"text":"The following results study the ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-13.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gap of ","element":"span"},{"style":{"height":16},"width":37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-14.png","element":"img","alt":" ζπh","inline":true},{"text":"as a function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":", shedding light on the ","element":"span"},{"text":"relationship between ","element":"span"},{"style":{"height":17.8},"width":580.5,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-15.png","element":"img","alt":" distgapp(ζπh, t, x) and gap(Qπh, t, x)","inline":true},{"text":". These results all hold under Assumptions ","element":"span"},{"href":"#id-11","text":"2.2 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-5","text":"2.3","element":"a"},{"text":". Henceforth, we assume them implicitly and do not restate them. 
First, we observe that ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-16.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gaps of ","element":"span"},{"style":{"height":16},"width":37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-17.png","element":"img","alt":" ζπh ","inline":true,"padRight":true},{"text":"are bounded from below by action gaps of ","element":"span"},{"style":{"height":16.2},"width":59,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-18.png","element":"img","alt":" Qπh.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Proposition 3.4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":16},"width":232.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-19.png","element":"img","alt":" (t, x) ∈ T × X","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that ","element":"span"},{"style":{"height":18},"width":557.5,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-20.png","element":"img","alt":" distgapp(ζπh, t, x) ≥ gap(Qπh, t, x).","inline":true}],[{"text":"For a proof of this statement and any other made in this work, see Appendix ","element":"span"},{"text":"B","element":"span"},{"text":". 
Our next result establishes that ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-21.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gaps of ","element":"span"},{"style":{"height":16},"width":37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-22.png","element":"img","alt":" ζπh","inline":true},{"text":", like action gaps of ","element":"span"},{"style":{"height":16.2},"width":48,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-23.png","element":"img","alt":" Qπh","inline":true},{"text":", vanish for a large class of MDPs.","element":"span"}],[{"id":"id-22","style":{"fontWeight":"bold"},"text":"Theorem 3.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"are bounded, then ","element":"span"},{"style":{"height":16.6},"width":874.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-24.png","element":"img","alt":" limh↓0 Wp(ζπh(t, x, a), ηπ(t, x)) = 0, for all (t, x, a) ∈","inline":true},{"style":{"height":18.2},"width":789,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-25.png","element":"img","alt":"T × X × A; hence, limh↓0 distgapp(ζπh, t, x) = 0.","inline":true}],[{"text":"While Theorem ","element":"span"},{"href":"#id-22","text":"3.5 ","element":"a"},{"text":"shows that the ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-26.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"distance between 
","element":"span"},{"style":{"height":16.8},"width":682.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-27.png","element":"img","alt":" ζπh(t, x, a) and ηπ(t, x) (and the Wp action","inline":true,"padRight":true},{"text":"gap of ","element":"span"},{"style":{"height":16.6},"width":325.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-28.png","element":"img","alt":" ζπh at (t, x) ∈ T × X","inline":true},{"text":") does indeed vanish as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"decreases, it does not identify the rate at which ","element":"span"},{"text":"it does so. Our next two theorems establish this rate.","element":"span"}],[{"id":"id-23","style":{"fontWeight":"bold"},"text":"Theorem 3.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"MDPs and policies exist in and under which, for all ","element":"span"},{"style":{"height":16},"width":346,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-29.png","element":"img","alt":" (t, x, a) ∈ T × X × A","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that ","element":"span"},{"style":{"height":20.4},"width":978.5,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-30.png","element":"img","alt":" Wp(ζπh(t, x, a), ηπ(t, x)) ≳ h1/2 and distgapp(ζπh, t, x) ≳ h1/2.","inline":true}],[{"text":"Finally, we prove that for a large class of MDPs (different from but overlapping with the class of MDPs captured in Theorem ","element":"span"},{"href":"#id-22","text":"3.5","element":"a"},{"text":"), the lower bound found in Theorem ","element":"span"},{"href":"#id-23","text":"3.6 ","element":"a"},{"text":"is an upper bound.","element":"span"}],[{"id":"id-24","style":{"fontWeight":"bold"},"text":"Theorem 3.7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is Lipschitz in state, uniformly in time, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is Lipschitz, and ","element":"span"},{"style":{"height":11.6},"width":141.5,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-31.png","element":"img","alt":" T < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":20.4},"width":1514,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-32.png","element":"img","alt":"Wp(ζπh(t, x, a), ηπ(t, x)) ≲ h1/2, for all (t, x, a) ∈ T × X × A; hence, distgapp(ζπh, t, x) ≲ h1/2.","inline":true}],[{"text":"Theorems ","element":"span"},{"href":"#id-23","text":"3.6 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-24","text":"3.7 ","element":"a"},{"text":"demonstrate that the ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-33.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"distance between ","element":"span"},{"style":{"height":16.6},"width":162,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-34.png","element":"img","alt":" ζπh(t, x, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":124.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-35.png","element":"img","alt":" ηπ(t, x)","inline":true,"padRight":true},{"text":"and the ","element":"span"},{"text":"distance between ","element":"span"},{"style":{"height":16.8},"width":172.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-36.png","element":"img","alt":" Qπh(t, 
x, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":134.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-37.png","element":"img","alt":" V π(t, x)","inline":true,"padRight":true},{"text":"are of different orders in terms of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":". Thus, we see that ","element":"span"},{"style":{"height":18.2},"width":572.5,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-38.png","element":"img","alt":"distgapp(ζπh, t, x) and gap(Qπh, t, x)","inline":true,"padRight":true},{"text":"in stochastic MDPs are fundamentally different.","element":"span"}]]},{"heading":"4 Distributional Superiority","paragraphs":[[{"text":"In this section, we introduce a probabilistic generalization of the advantage. We define this random variable—which we call the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"superiority ","element":"span"},{"text":"and denote by ","element":"span"},{"style":{"height":16.2},"width":43,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-39.png","element":"img","alt":" Sπh","inline":true},{"text":"—via a pair of axioms.","element":"span"}],[{"text":"A natural construction of the superiority at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t, x, a","element":"span"},{"text":") ","element":"span"},{"text":"is given by ","element":"span"},{"style":{"height":16.8},"width":358,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-40.png","element":"img","alt":" Zπh(t, x, a) − Gπ(t, x)","inline":true},{"text":". 
The law of ","element":"span"},{"text":"this difference, however, depends on the joint law of ","element":"span"},{"style":{"height":16.6},"width":356.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-41.png","element":"img","alt":" (Zπh(t, x, a), Gπ(t, x))","inline":true},{"text":", which is unobservable ","element":"span"},{"text":"in practice and ill-defined (cf. Section ","element":"span"},{"href":"#id-25","text":"2.1.2","element":"a"},{"text":"). Yet, the set of all possible laws of this difference is easily characterized; it is the set of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"coupled difference representations of ","element":"span"},{"style":{"height":16.8},"width":381,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-42.png","element":"img","alt":" ζπh(t, x, a) and ηπ(t, x).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Definition 4.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":259.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-43.png","element":"img","alt":" µ, ν ∈ P(R). 
A","inline":true,"padRight":true},{"text":"coupled difference representation (CDR) ","element":"span"},{"style":{"height":16},"width":351.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-44.png","element":"img","alt":" ψ ∈ P(R) of µ and ν","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"takes the form ","element":"span"},{"style":{"height":16.2},"width":163.5,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-45.png","element":"img","alt":" ψ = ∆#κ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":16},"width":195,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-46.png","element":"img","alt":" κ ∈ C (µ, ν)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.6},"width":203,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-47.png","element":"img","alt":" ∆ : R2 → R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is given by ","element":"span"},{"style":{"height":16},"width":292.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-48.png","element":"img","alt":" ∆(z, w) := z − w","inline":true},{"style":{"fontStyle":"italic"},"text":". 
The set of all coupled difference representations of ","element":"span"},{"style":{"height":14.6},"width":124,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-49.png","element":"img","alt":" µ and ν","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is denoted by ","element":"span"},{"style":{"height":16},"width":133,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-50.png","element":"img","alt":" D(µ, ν).","inline":true}],[{"text":"Our first axiom places the superiority’s law in this set, ","element":"span"},{"style":{"height":16.6},"width":384.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-51.png","element":"img","alt":" D(ζπh(t, x, a), ηπ(t, x)).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Axiom 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The law of ","element":"span"},{"style":{"height":16.8},"width":167,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-52.png","element":"img","alt":" Sπh(t, x, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a coupled difference representation of ","element":"span"},{"style":{"height":16.8},"width":379.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-53.png","element":"img","alt":" ζπh(t, x, a) and ηπ(t, x).","inline":true}],[{"text":"Our second axiom encodes a type of consistency for deterministic policy behavior.","element":"span"}],[{"id":"id-26","style":{"fontWeight":"bold"},"text":"Axiom 2. 
","element":"span"},{"style":{"height":16.6},"width":167.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-54.png","element":"img","alt":" Sπh(t, x, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is deterministic whenever ","element":"span"},{"style":{"height":16},"width":159,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-55.png","element":"img","alt":" π is (h, a)","inline":true},{"style":{"fontStyle":"italic"},"text":"-persistent at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"To see how Axiom ","element":"span"},{"href":"#id-26","text":"2 ","element":"a"},{"text":"encodes a notion of deterministic consistency, first consider its discrete-time analogue: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the superiority at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for a policy ","element":"span"},{"style":{"height":7.2},"width":22.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-56.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is deterministic if ","element":"span"},{"style":{"height":9},"width":94.5,"height":22.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-57.png","element":"img","alt":" π at x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"deterministically chooses ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"In this situation, our ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":"-following agent makes the same choices as a ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-58.png","element":"img","alt":" π","inline":true},{"text":"-following agent (both take action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"in state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"initially and then follow ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-59.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"thereafter), and we posit that the superiority should not be random. The continuous-time analogue of the situation just described occurs precisely when a policy ","element":"span"},{"style":{"height":16},"width":155.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-60.png","element":"img","alt":" π is (h, a)","inline":true},{"text":"-persistent at starting time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Given a starting time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", an agent that chooses action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"between ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"and then follows ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-61.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is following the ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h, a","element":"span"},{"text":")","element":"span"},{"text":"-persistent modification of ","element":"span"},{"style":{"height":9.4},"width":61.5,"height":23.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/4-62.png","element":"img","alt":" π at","inline":true,"padRight":true},{"text":"time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". By definition, they make the same choices as a ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-0.png","element":"img","alt":" π","inline":true},{"text":"-following agent when ","element":"span"},{"style":{"height":16},"width":159,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-1.png","element":"img","alt":" π is (h, a)","inline":true},{"text":"-persistent starting at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Axiom ","element":"span"},{"href":"#id-26","text":"2 ","element":"a"},{"text":"stipulates that, in this case, ","element":"span"},{"style":{"height":16.8},"width":167.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-2.png","element":"img","alt":" Sπh(t, x, a)","inline":true,"padRight":true},{"text":"should be deterministic.","element":"span"}],[{"text":"By construction, if ","element":"span"},{"style":{"height":16.8},"width":603.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-3.png","element":"img","alt":" ψπh(t, x, a) ∈ D(ζπh(t, x, a), ηπ(t, x))","inline":true},{"text":", then its mean is ","element":"span"},{"style":{"height":16.8},"width":363.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-4.png","element":"img","alt":" Qπh(t, x, a) − V π(t, x)","inline":true,"padRight":true},{"text":"(see Appendix ","element":"span"},{"href":"#id-27","text":"B.2 ","element":"a"},{"text":"for a proof of this claim). 
Axiom ","element":"span"},{"href":"#id-26","text":"2 ","element":"a"},{"text":"then says that any determining coupling ","element":"span"},{"style":{"height":16.8},"width":430,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-5.png","element":"img","alt":"κπh(t, x, a) when π is (h, a)","inline":true},{"text":"-persistent at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"must be such that ","element":"span"},{"style":{"height":18.2},"width":340.5,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-6.png","element":"img","alt":" ∆#κπh(t, x, a) = δ0.8","inline":true,"padRight":true},{"text":"In particular, ","element":"span"},{"text":"Axiom ","element":"span"},{"href":"#id-26","text":"2 ","element":"a"},{"text":"nontrivially restricts ","element":"span"},{"style":{"height":16.8},"width":384,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-7.png","element":"img","alt":" D(ζπh(t, x, a), ηπ(t, x)).","inline":true}],[{"id":"id-28","style":{"fontWeight":"bold"},"text":"Example 4.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.6},"width":463.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-8.png","element":"img","alt":" ιπh(t, x, a) := ∆#κπh(t, x, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":16.6},"width":580.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-9.png","element":"img","alt":" κπh(t, x, a) = ζπh(t, x, a) ⊗ ηπ(t, x)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
If ","element":"span"},{"style":{"height":7.2},"width":22.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-10.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h, a","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"-persistent at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":16.4},"width":574.5,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-11.png","element":"img","alt":" Var(ιπh(t, x, a)) = 2Var(ηπ(t, x))","inline":true},{"style":{"fontStyle":"italic"},"text":". This variance is ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"only when ","element":"span"},{"style":{"height":7.2},"width":22.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-12.png","element":"img","alt":"π","inline":true},{"style":{"fontStyle":"italic"},"text":"’s return is deterministic. Hence, ","element":"span"},{"style":{"height":16.8},"width":155,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-13.png","element":"img","alt":" ιπh(t, x, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"may be nontrivial even when conditioning on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reflects ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the policy’s behavior exactly. 
We posit, via Axiom ","element":"span"},{"href":"#id-26","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", that this should be prohibited.","element":"span"}],[{"text":"In fact, Axiom ","element":"span"},{"href":"#id-26","text":"2 ","element":"a"},{"text":"determines a single coupling, if we want a consistent choice across all time-state-action triplets and all MDPs.","element":"span"}],[{"id":"id-84","style":{"fontWeight":"bold"},"text":"Theorem 4.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":200.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-14.png","element":"img","alt":" κ ∈ C (µ, µ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":16},"width":176,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-15.png","element":"img","alt":" µ ∈ P(R)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
The push-forward of ","element":"span"},{"style":{"height":7.4},"width":20,"height":18.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-16.png","element":"img","alt":" κ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by ","element":"span"},{"style":{"height":11.6},"width":29.5,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-17.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the delta at zero, ","element":"span"},{"style":{"height":16.2},"width":168,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-18.png","element":"img","alt":" ∆#κ = δ0","inline":true},{"style":{"fontStyle":"italic"},"text":", if and only if ","element":"span"},{"style":{"height":15.6},"width":151,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-19.png","element":"img","alt":" κ is a Wp","inline":true},{"style":{"fontStyle":"italic"},"text":"-optimal coupling, for some ","element":"span"},{"style":{"height":16},"width":170.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-20.png","element":"img","alt":" p ∈ [1, ∞)","inline":true},{"style":{"fontStyle":"italic"},"text":". Moreover, there is only one such coupling. 
It is given by ","element":"span"},{"style":{"height":16.8},"width":267,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-21.png","element":"img","alt":" κµ := (id, id)#µ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"or, equivalently, ","element":"span"},{"style":{"height":19.4},"width":454,"height":48.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-22.png","element":"img","alt":" κµ := (F −1µ , F −1µ )#U(0, 1).","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Here ","element":"span"},{"text":"U","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the uniform distribution on ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1]","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The second definition of ","element":"span"},{"style":{"height":11.8},"width":39.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-23.png","element":"img","alt":" κµ","inline":true,"padRight":true},{"text":"generalizes to the ","element":"span"},{"style":{"height":15.4},"width":51.5,"height":38.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-24.png","element":"img","alt":" Wp","inline":true},{"text":"-optimal coupling, for all ","element":"span"},{"style":{"height":14},"width":103,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-25.png","element":"img","alt":" p ≥ 1,","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":10.8},"width":22.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-26.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":7.2},"width":19.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-27.png","element":"img","alt":" ν","inline":true,"padRight":true},{"text":"given by ","element":"span"},{"style":{"height":20.8},"width":473.5,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-28.png","element":"img","alt":" κµ,ν := (F −1µ , F −1ν )#U(0, 1)","inline":true},{"text":". As ","element":"span"},{"style":{"height":16.4},"width":128.5,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-29.png","element":"img","alt":" ∆#κµ,ν","inline":true},{"text":"’s quantile function is ","element":"span"},{"style":{"height":19.4},"width":191.5,"height":48.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-30.png","element":"img","alt":" F −1µ − F −1ν","inline":true,"padRight":true},{"text":", we have in hand everything we need to define the superiority distribution (via its quantile function).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 4.4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The ","element":"span"},{"text":"superiority distribution ","element":"span"},{"style":{"height":16.8},"width":484,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-31.png","element":"img","alt":" ψπh at (t, x, a) ∈ T × X × A is","inline":true}],[{"style":{"width":"47%"},"width":760,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-32.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":16.6},"width":169.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-33.png","element":"img","alt":" ψπh(t, x, a)","inline":true,"padRight":true},{"text":"has the smallest possible central absolute ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"th moments among all CDRs of ","element":"span"},{"style":{"height":16.6},"width":161.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-34.png","element":"img","alt":" ζπh(t, x, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":124.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-35.png","element":"img","alt":" ηπ(t, x)","inline":true},{"text":", heuristically, it captures more of the individual features of both return distributions than other such CDRs (like ","element":"span"},{"style":{"height":16.8},"width":154.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-36.png","element":"img","alt":" ιπh(t, x, a)","inline":true,"padRight":true},{"text":"in Example ","element":"span"},{"href":"#id-28","text":"4.2","element":"a"},{"text":"). 
We illustrate this by example in Figure ","element":"span"},{"href":"#id-29","text":"4.1","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"96%"},"width":1524,"height":223,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-37.png","element":"img"}],[{"text":"Figure 4.1: PDFs of Return Distributions and Two Candidate CDRs.","element":"figcaption","subtype":"caption"}],[{"id":"id-29","style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"The Rescaled Superiority Distribution","element":"span"}],[{"text":"From Section ","element":"span"},{"text":"3","element":"span"},{"text":", we know that the ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-38.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"distance between ","element":"span"},{"style":{"height":16.8},"width":162,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-39.png","element":"img","alt":" ζπh(t, x, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":124,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-40.png","element":"img","alt":" ηπ(t, x)","inline":true},{"text":", for every ","element":"span"},{"style":{"height":14},"width":96.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-41.png","element":"img","alt":" p ≥ 1","inline":true},{"text":", ","element":"span"},{"text":"vanishes as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"vanishes. Moreover, we know the rate at which this distance disappears. 
As a result, by construction, the central absolute ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"th moments of ","element":"span"},{"style":{"height":16.8},"width":169,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-42.png","element":"img","alt":" ψπh(t, x, a)","inline":true,"padRight":true},{"text":"collapse as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"vanishes, and, for a ","element":"span"},{"text":"large class of MDPs, we understand the rate at which these moments collapse. More generally and precisely, if we consider the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"style":{"fontStyle":"italic"},"text":"-rescaled superiority distribution ","element":"span"},{"text":"defined by","element":"span"}],[{"style":{"width":"22%"},"width":362,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-43.png","element":"img"}],[{"text":"we see that Theorems ","element":"span"},{"href":"#id-23","text":"3.6 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-24","text":"3.7 ","element":"a"},{"text":"translate to the following statements on ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-44.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gaps of ","element":"span"},{"style":{"height":17.8},"width":78,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-45.png","element":"img","alt":" ψπh;q:","inline":true}],[{"id":"id-47","style":{"fontWeight":"bold"},"text":"Theorem 4.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"MDPs and policies exist satisfying Assumptions ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"2.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-5","style":{"fontStyle":"italic"},"text":"2.3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"in and under which, for all ","element":"span"},{"style":{"height":16},"width":233,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-46.png","element":"img","alt":" (t, x) ∈ T × X","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that ","element":"span"},{"style":{"height":20.4},"width":463.5,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-47.png","element":"img","alt":" distgapp(ψπh;q, t, x) ≳ h1/2−q.","inline":true}],[{"id":"id-48","style":{"fontWeight":"bold"},"text":"Theorem 4.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"2.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-5","style":{"fontStyle":"italic"},"text":"2.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is Lipschitz in state, uniformly in time, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is Lipschitz, and ","element":"span"},{"style":{"height":19.6},"width":1049.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-48.png","element":"img","alt":" T < ∞, then distgapp(ψπh;q, t, x) ≲ h1/2−q, for all (t, x) ∈ T × X.","inline":true}],[{"text":"These two theorems tell us how to preserve the 
","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-49.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gaps of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"-rescaled superiority distributions (as a function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"). They single out ","element":"span"},{"style":{"height":16},"width":126,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-50.png","element":"img","alt":" q = 1/2","inline":true},{"text":" as the only rescaling exponent that does so. For ","element":"span"},{"style":{"height":16.6},"width":205,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-51.png","element":"img","alt":" q < 1/2, Wp","inline":true,"padRight":true},{"text":"action gaps vanish as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"vanishes, whereas for ","element":"span"},{"style":{"height":16.6},"width":189,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/5-52.png","element":"img","alt":" q > 1/2, Wp","inline":true,"padRight":true},{"text":"action gaps blow up as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"vanishes. These behaviors are undesirable. When ","element":"span"},{"style":{"height":16},"width":114.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-0.png","element":"img","alt":"q < 1/2","inline":true},{"text":", the influence of an action on an agent’s superiority becomes indistinguishable from that of any other action. 
For ","element":"span"},{"style":{"height":16},"width":114.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-1.png","element":"img","alt":" q > 1/2","inline":true},{"text":", ever larger sample sizes are needed to obtain any statistical estimate of an agent’s superiority with the same level of accuracy. These scenarios are untenable.","element":"span"}],[{"text":"Another consideration regarding rescalings of ","element":"span"},{"style":{"height":16},"width":44.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-2.png","element":"img","alt":" ψπh","inline":true},{"text":"is whether they upend rankings of actions ","element":"span"},{"text":"determined by some given measure of utility. This would be counterproductive. In DRL, agents often use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"distortion risk measures ","element":"span"},{"text":"[","element":"span"},{"href":"#id-30","referenceIndex":1,"text":"1","element":"a"},{"text":"] to rank actions ([","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":17,"text":"17","element":"a"},{"text":"]).","element":"span"}],[{"id":"id-88","style":{"fontWeight":"bold"},"text":"Definition 4.7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given ","element":"span"},{"style":{"height":16},"width":230.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-3.png","element":"img","alt":" β ∈ P([0, 1])","inline":true},{"style":{"fontStyle":"italic"},"text":", the ","element":"span"},{"text":"distortion risk measure ","element":"span"},{"style":{"height":16.6},"width":282.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-4.png","element":"img","alt":" ρβ : P(R) → R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is defined by ","element":"span"},{"style":{"height":19.6},"width":553,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-5.png","element":"img","alt":"ρβ(µ) := ⟨β, F −1µ ⟩; on µ ∈ P(R)","inline":true},{"style":{"fontStyle":"italic"},"text":", its value is given by the integral of ","element":"span"},{"style":{"height":19.6},"width":68.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-6.png","element":"img","alt":" F −1µ","inline":true},{"style":{"fontStyle":"italic"},"text":"with respect to ","element":"span"},{"style":{"height":14.6},"width":29.5,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-7.png","element":"img","alt":" β.","inline":true}],[{"text":"A family of ","element":"span"},{"style":{"height":11.8},"width":38,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-8.png","element":"img","alt":" ρβ","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"height":7.4},"width":23,"height":18.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-9.png","element":"img","alt":" α","inline":true},{"text":"-conditional value-at-risk measures 
(","element":"span"},{"style":{"height":7.4},"width":23,"height":18.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-10.png","element":"img","alt":"α","inline":true},{"text":"-CVaR) [","element":"span"},{"href":"#id-36","referenceIndex":29,"text":"29","element":"a"},{"text":"], where ","element":"span"},{"style":{"height":16},"width":216.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-11.png","element":"img","alt":" βα = U(0, α)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":274,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-12.png","element":"img","alt":"α ∈ (0, 1]; α = 1","inline":true,"padRight":true},{"text":"is the expected-value utility measure. Crucially, ","element":"span"},{"style":{"height":18},"width":278.5,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-13.png","element":"img","alt":" ψπh;q preserves ρβ","inline":true},{"text":"-valued utility.","element":"span"}],[{"id":"id-50","style":{"fontWeight":"bold"},"text":"Theorem 4.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":11.8},"width":37.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-14.png","element":"img","alt":" ρβ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a distortion risk measure, ","element":"span"},{"style":{"height":13.8},"width":93.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-15.png","element":"img","alt":" q ≥ 0","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h > ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
If ","element":"span"},{"style":{"height":16.6},"width":295.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-16.png","element":"img","alt":" ρβ(ηπ(t, x)) < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":18.6},"width":974,"height":46.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-17.png","element":"img","alt":"arg maxa∈A ρβ(ψπh;q(t, x, a)) = arg maxa∈A ρβ(ζπh(t, x, a)).","inline":true}],[{"text":"In turn, the ","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-18.png","element":"img","alt":"1/2","inline":true},{"text":"-rescaled superiority distribution is not only ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-19.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action-gap preserving but also matches ","element":"span"},{"style":{"height":16},"width":37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-20.png","element":"img","alt":" ζπh","inline":true,"padRight":true},{"text":"in its greedy choice of action as measured by a distortion risk measure.","element":"span"}],[{"id":"id-87","style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Algorithmic Considerations","element":"span"}],[{"text":"We now turn to building DRL algorithms based on our theory. 
Our algorithms leverage the quantile TD-learning framework [","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"10","element":"a"},{"text":"] to learn ","element":"span"},{"style":{"height":11.8},"width":37.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-21.png","element":"img","alt":" ρβ","inline":true},{"text":"-greedy policies, for a given distortion risk measure ","element":"span"},{"style":{"height":15.6},"width":119,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-22.png","element":"img","alt":" ρβ, just","inline":true,"padRight":true},{"text":"as DAU [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"] leverages the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning framework to learn greedy policies. Full pseudocode and implementation details are given in Appendix ","element":"span"},{"text":"C","element":"span"},{"text":".","element":"span"}],[{"text":"At the heart of our algorithms is an equality of quantile functions, which holds by construction,","element":"span"}],[{"id":"id-37","style":{"width":"69%"},"width":1108,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-23.png","element":"img"}],[{"text":"Indeed, given ","element":"span"},{"style":{"height":17.8},"width":162.5,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-24.png","element":"img","alt":" η and ψh;q","inline":true},{"text":", as models of ","element":"span"},{"style":{"height":17.8},"width":182,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-25.png","element":"img","alt":" ηπ and ψπh;q ","inline":true,"padRight":true},{"text":"respectively, equation (","element":"span"},{"href":"#id-37","text":"4.1","element":"a"},{"text":") justifies the application ","element":"span"},{"text":"of quantile TD-learning to 
","element":"span"},{"style":{"height":16},"width":33.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-26.png","element":"img","alt":" ζh","inline":true},{"text":", as a model for ","element":"span"},{"style":{"height":16},"width":37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-27.png","element":"img","alt":" ζπh","inline":true},{"text":", defined via the quantile function","element":"span"}],[{"id":"id-38","style":{"width":"69%"},"width":1096,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-28.png","element":"img"}],[{"text":"That said, we cannot realize quantile TD-learning without defining ","element":"span"},{"text":"predictions ","element":"span"},{"text":"and bootstrap ","element":"span"},{"text":"targets ","element":"span"},{"text":"in terms of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"-quantile representations","element":"span"},{"text":"9 ","element":"span"},{"text":"[","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":4,"text":"4","element":"a"},{"text":"] of ","element":"span"},{"style":{"height":16},"width":33.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-29.png","element":"img","alt":" ζh","inline":true},{"text":", via those of ","element":"span"},{"style":{"height":18},"width":178.5,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-30.png","element":"img","alt":" η and ψh;q.","inline":true}],[{"text":"While we may freely parameterize the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"-quantile representation of ","element":"span"},{"style":{"height":10.8},"width":19.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-31.png","element":"img","alt":" 
η","inline":true,"padRight":true},{"text":"with a neural network (with interface) ","element":"span"},{"style":{"height":12.2},"width":265.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-32.png","element":"img","alt":" θ : T×X → Rm","inline":true},{"text":", we have to be careful when parameterizing the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"-quantile representation of ","element":"span"},{"style":{"height":17.8},"width":67.5,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-33.png","element":"img","alt":" ψh;q","inline":true},{"text":". Given a neural network ","element":"span"},{"style":{"height":15.4},"width":475,"height":38.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-34.png","element":"img","alt":" ϕ : T × X × A → Rm, we set","inline":true}],[{"id":"id-39","style":{"width":"88%"},"width":1403,"height":65,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-35.png","element":"img"}],[{"text":"This ensures we identify a ","element":"span"},{"style":{"height":11.8},"width":37.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-36.png","element":"img","alt":" ρβ","inline":true},{"text":"-greedy policy; it is ","element":"span"},{"style":{"height":15.6},"width":164.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-37.png","element":"img","alt":" 0 at the ρβ","inline":true},{"text":"-greedy action ","element":"span"},{"href":"#id-2","referenceIndex":34,"style":{"height":14.2},"width":166.5,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-38.png","element":"img","alt":" a⋆ (cf. [34","inline":true},{"text":", Eq. 
27]).","element":"span"}],[{"text":"With appropriate parameterized ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"-quantile representations of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-39.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.8},"width":67.5,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-40.png","element":"img","alt":" ψh;q","inline":true,"padRight":true},{"text":"in hand, we derive our ","element":"span"},{"text":"predictions ","element":"span"},{"text":"and bootstrap ","element":"span"},{"text":"targets","element":"span"},{"text":". By (","element":"span"},{"href":"#id-38","text":"4.2","element":"a"},{"text":"), recalling ","element":"span"},{"href":"#id-39","style":{"height":14.2},"width":156.5,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-41.png","element":"img","alt":" θ and (4.3","inline":true},{"text":"), we compute our ","element":"span"},{"text":"predictions ","element":"span"},{"text":"via","element":"span"}],[{"id":"id-42","style":{"width":"75%"},"width":1200,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-42.png","element":"img"}],[{"text":"By Wiltzer [","element":"span"},{"href":"#id-40","referenceIndex":40,"text":"40","element":"a"},{"text":"], as ","element":"span"},{"style":{"height":23.4},"width":554,"height":58.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-43.png","element":"img","alt":" Xπ|h,a,tt = x and Xπ|h,a,tt+h = Xat+h","inline":true},{"text":", observe that","element":"span"}],[{"style":{"width":"70%"},"width":1118,"height":172,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-44.png","element":"img"}],[{"text":"where 
","element":"span"},{"style":{"height":17.4},"width":504.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-45.png","element":"img","alt":" E[|Yh|p] = o(h)10 for all p ∈ N","inline":true},{"text":". So upon getting a sample state/realization ","element":"span"},{"style":{"height":16.8},"width":274,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-46.png","element":"img","alt":" xt+h of Xat+h, as","inline":true,"padRight":true},{"text":"in [","element":"span"},{"href":"#id-41","referenceIndex":30,"text":"30","element":"a"},{"text":"], we compute our bootstrap ","element":"span"},{"text":"targets ","element":"span"},{"text":"via","element":"span"}],[{"id":"id-43","style":{"width":"72%"},"width":1146,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/6-47.png","element":"img"}],[{"text":"In summary, the ","element":"span"},{"text":"predictions ","element":"span"},{"text":"(","element":"span"},{"href":"#id-42","text":"4.4","element":"a"},{"text":") and the bootstrap ","element":"span"},{"text":"targets ","element":"span"},{"text":"(","element":"span"},{"href":"#id-43","text":"4.5","element":"a"},{"text":") together characterize a family of QR-DQN-based algorithms called DSUP(","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"), whose core update is outlined in Algorithm ","element":"span"},{"href":"#id-44","text":"1","element":"a"},{"text":".","element":"span"}],[{"id":"id-44","style":{"width":"100%"},"width":1587,"height":480,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-0.png","element":"img"}],[{"text":"One theoretical drawback of DSUP(","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":") for mean-return control is that the mean of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"-rescaled superiority distribution is 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"only when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"= 1","element":"span"},{"text":", by (","element":"span"},{"href":"#id-17","text":"2.6","element":"a"},{"text":") and (","element":"span"},{"href":"#id-45","text":"2.7","element":"a"},{"text":"). Thus, we propose modeling ","element":"span"},{"style":{"height":16.4},"width":47,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-1.png","element":"img","alt":" Aπh","inline":true,"padRight":true},{"text":"simultaneously. This yields a novel form of a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"two-timescale ","element":"span"},{"text":"approach to value-based RL (see, e.g., [","element":"span"},{"href":"#id-46","referenceIndex":8,"text":"8","element":"a"},{"text":"]). In particular, we estimate ","element":"span"},{"style":{"height":18},"width":244.5,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-2.png","element":"img","alt":" ϑπh;q defined by","inline":true}],[{"style":{"width":"52%"},"width":836,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-3.png","element":"img"}],[{"text":"We call ","element":"span"},{"style":{"height":18.2},"width":124,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-4.png","element":"img","alt":" ϑπh;q the","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"advantage-shifted ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"style":{"fontStyle":"italic"},"text":"-rescaled superiority","element":"span"},{"text":". Note that its mean is ","element":"span"},{"style":{"height":16.8},"width":350.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-5.png","element":"img","alt":" Aπh, which is O(1). 
To","inline":true,"padRight":true},{"text":"realize this, we approximate ","element":"span"},{"style":{"height":16.2},"width":47.5,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-6.png","element":"img","alt":" Aπh ","inline":true,"padRight":true},{"text":"using DAU and employ parameter sharing between the approximators ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":18.4},"width":195.5,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-7.png","element":"img","alt":" Aπh and ψπh;q","inline":true},{"text":". We call this family of algorithms DAU+DSUP(","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"). We note that ","element":"span"},{"style":{"height":16.4},"width":47,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-8.png","element":"img","alt":" Aπh ","inline":true,"padRight":true},{"text":"is used only for ","element":"span"},{"text":"increasing action gaps; it does not change the training loss for ","element":"span"},{"style":{"height":17.8},"width":178.5,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-9.png","element":"img","alt":" η and ψh;q.","inline":true}]]},{"heading":"5 Simulations","paragraphs":[[{"text":"The empirical work herein is two-fold in nature: illustrative and comparative. First, we simulate an example that illustrates Theorems ","element":"span"},{"href":"#id-23","text":"3.6","element":"a"},{"text":"/","element":"span"},{"href":"#id-47","text":"4.5 ","element":"a"},{"text":"and Theorem ","element":"span"},{"href":"#id-24","text":"3.7","element":"a"},{"text":"/","element":"span"},{"href":"#id-48","text":"4.6 ","element":"a"},{"text":"and their consequences. 
Second, in an option-trading environment, we compare the performance of ","element":"span"},{"style":{"height":16},"width":44.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-10.png","element":"img","alt":" ψπh","inline":true},{"text":"-based agent(s) against QR-DQN ","element":"span"},{"text":"[","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"10","element":"a"},{"text":"] and DAU [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"] in the risk-neutral setting and against QR-DQN in a risk-sensitive setting.","element":"span"}],[{"id":"id-49","style":{"fontWeight":"bold"},"text":"5.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"The Rescaled Superiority Distribution Revisited","element":"span"}],[{"text":"Consider an MDP with time horizon ","element":"span"},{"text":"10","element":"span"},{"text":", a two-element action space, ","element":"span"},{"text":"0 ","element":"span"},{"text":"and ","element":"span"},{"text":"1","element":"span"},{"text":" (when action ","element":"span"},{"text":"1 ","element":"span"},{"text":"is executed, the system follows ","element":"span"},{"text":"1","element":"span"},{"text":"-dimensional Brownian dynamics with a constant drift of ","element":"span"},{"text":"10","element":"span"},{"text":", and otherwise the state is fixed), a reward that equals the agent’s signed distance to ","element":"span"},{"text":"0","element":"span"},{"text":", and a trivial terminal reward. 
We estimate four distributions at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t, x, a","element":"span"},{"text":") = (0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"text":"for the policy that always selects ","element":"span"},{"text":"0","element":"span"},{"text":". Figure ","element":"span"},{"href":"#id-49","text":"5.1 ","element":"a"},{"text":"shows these estimated distributions for a sample of frequencies (kHz), ","element":"span"},{"style":{"height":16},"width":134.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-11.png","element":"img","alt":" ω = 1/h.","inline":true}],[{"style":{"width":"97%"},"width":1539,"height":342,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-12.png","element":"img"}],[{"text":"Figure 5.1: Monte-Carlo estimates of ","element":"figcaption","subtype":"caption"},{"style":{"height":19.8},"width":514.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-13.png","element":"img","alt":" ψπh;q, for q = 0, 1, 1/2, and ϑπh;1/2 ","inline":true,"padRight":true},{"text":"as a function of ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":135,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-14.png","element":"img","alt":" ω = 1/h.","inline":true}],[{"text":"First (from the left), we see that ","element":"span"},{"style":{"height":15.8},"width":45,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-15.png","element":"img","alt":" ψπh","inline":true},{"text":"collapses to ","element":"span"},{"style":{"height":14.2},"width":31,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/7-16.png","element":"img","alt":" δ0","inline":true},{"text":", as 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"tends to ","element":"span"},{"text":"0","element":"span"},{"text":". Thus, accurate action ranking, ","element":"span"},{"text":"distributional or otherwise, becomes impossible in the vanishing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"limit. Second, we see that rescaling by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"produces distributions with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"mean but infinite non-mean statistics in the vanishing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"limit. Here the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"means are imperceptible in the face of the large variances. So while this rescaling permits ranking actions by action values, it does so at the expense of producing high-variance distributions. ","element":"span"},{"text":"Third, we see that rescaling by ","element":"span"},{"style":{"height":14.4},"width":57.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-0.png","element":"img","alt":" h1/2","inline":true,"padRight":true},{"text":"yields distributions with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"non-mean statistics but vanishingly small means, ","element":"span"},{"style":{"height":18.2},"width":120,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-1.png","element":"img","alt":" O(h1/2)","inline":true},{"text":". 
Hence, this rescaling permits ranking actions by non-mean statistics, even if action values again become indistinguishable in the vanishing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"limit. That said, the vanishing rate of the means here is slower than when no rescaling is considered, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":")","element":"span"},{"text":". Fourth, we see that rescaling by ","element":"span"},{"style":{"height":14.4},"width":57.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-2.png","element":"img","alt":" h1/2","inline":true,"padRight":true},{"text":"and then shifting by ","element":"span"},{"style":{"height":18.8},"width":208.5,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-3.png","element":"img","alt":" (1 − h1/2)Aπh","inline":true},{"text":"produces distributions with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"mean and ","element":"span"},{"text":"non-mean statistics. In turn, this two-timescale approach permits ranking actions by either action values or non-mean statistics (but not both, by Theorem ","element":"span"},{"href":"#id-50","text":"4.8","element":"a"},{"text":"). However, the mean estimates here are inaccurate and imprecise: rather than uniformly being ","element":"span"},{"text":"100","element":"span"},{"text":", they oscillate substantially.","element":"span"}],[{"text":"In risk-neutral control, we are left with a number of questions. What effect do the high-variance distributions in DAU/DSUP(","element":"span"},{"text":"1","element":"span"},{"text":") have on performance? 
What effect do the ","element":"span"},{"style":{"height":18.2},"width":120,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-4.png","element":"img","alt":" O(h1/2)","inline":true,"padRight":true},{"text":"means have on the performance of DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-5.png","element":"img","alt":"1/2","inline":true},{"text":")? What effect does the instability of the mean estimates in DAU+DSUP(","element":"span"},{"style":{"height":16},"width":53.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-6.png","element":"img","alt":"1/2)","inline":true,"padRight":true},{"text":"have on performance? In Section ","element":"span"},{"href":"#id-51","text":"5.2","element":"a"},{"text":", we begin to answer these questions and others by testing our superiority-based algorithms against appropriate benchmarks in an option-trading environment.","element":"span"}],[{"id":"id-51","style":{"fontWeight":"bold"},"text":"5.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"High-Frequency Option Trading","element":"span"}],[{"text":"The option-trading environment in which we run our comparative experiments is a commonly used benchmark (see, e.g., [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":17,"text":"17","element":"a"},{"text":"]). We use an Euler–Maruyama discretization scheme [","element":"span"},{"href":"#id-52","referenceIndex":23,"text":"23","element":"a"},{"text":"] at high resolution to simulate high-frequency trading. Returns are averaged over ","element":"span"},{"text":"10 ","element":"span"},{"text":"seeds and ","element":"span"},{"text":"10 ","element":"span"},{"text":"different dynamics models (corresponding to data from different stocks). 
Additionally, following [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":"], we use disjoint datasets to estimate the dynamics parameters for simulation during training and evaluation.","element":"span"},{"text":"11","element":"span"}],[{"text":"First, we consider the risk-neutral setting. Here we compare QR-DQN, DAU, and three algorithms based on the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"-rescaled superiority distribution with ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-7.png","element":"img","alt":" q = 1, 1/2","inline":true},{"text":": DSUP(","element":"span"},{"text":"1","element":"span"},{"text":"), DAU+DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-8.png","element":"img","alt":"1/2","inline":true},{"text":"), and DSUP(","element":"span"},{"href":"#id-53","style":{"height":16},"width":241,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-9.png","element":"img","alt":"1/2). 
Figure 5.2","inline":true,"padRight":true},{"text":"summarizes their performance at a sample of frequencies (Hz).","element":"span"}],[{"style":{"width":"97%"},"width":1546,"height":583,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-10.png","element":"img"}],[{"text":"Figure 5.2: Risk-neutral algorithms on high-frequency option-trading as a function of ","element":"figcaption","subtype":"caption"},{"id":"id-53","style":{"height":7.4},"width":33.5,"height":18.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-11.png","element":"img","alt":" ω.","inline":true}],[{"text":"We see that DSUP(","element":"span"},{"style":{"height":16},"width":40.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-12.png","element":"img","alt":"1/2","inline":true},{"text":") is not only the most consistent performer, but outperforms every competitor at all but the two lowest frequencies. Even then, its performance is very close to the best performer. We also see that DAU+DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-13.png","element":"img","alt":"1/2","inline":true},{"text":")’s preservation of both action gaps and ","element":"span"},{"style":{"height":15.6},"width":51.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-14.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gaps does not lead to the strongest performance. In particular, its performance is inconsistent and sometimes poor. 
We believe this is because the tested frequencies are low enough that DSUP(","element":"span"},{"style":{"height":16},"width":40.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-15.png","element":"img","alt":"1/2","inline":true},{"text":") maintains large enough action gaps to learn performant policies, but high enough that the variances of the distributions underlying ","element":"span"},{"style":{"height":16.4},"width":47,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-16.png","element":"img","alt":" Aπh ","inline":true,"padRight":true},{"text":"cause estimation difficulty. Indeed, the three methods that estimate ","element":"span"},{"style":{"height":16.4},"width":47,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-17.png","element":"img","alt":" Aπh ","inline":true,"padRight":true},{"text":"(explicitly in ","element":"span"},{"text":"DAU and DAU+DSUP(","element":"span"},{"style":{"height":16},"width":40.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-18.png","element":"img","alt":"1/2","inline":true},{"text":") or implicitly in DSUP(","element":"span"},{"text":"1","element":"span"},{"text":")) exhibit almost identical behavior.","element":"span"}],[{"text":"Our results highlight a dichotomy in existing (ours included) methods for value-based, high-frequency, risk-neutral control. They can either maintain ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"expected return estimates or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"return variance estimates, but not both. 
We observe better performance in estimating small means from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"variance distributions than in estimating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"means from reciprocally large variance distributions.","element":"span"}],[{"text":"To qualitatively illustrate the appeal of the ","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-19.png","element":"img","alt":"1/2","inline":true},{"text":"-rescaled superiority, Figure ","element":"span"},{"href":"#id-54","text":"5.3 ","element":"a"},{"text":"presents examples of learned action-conditioned distributions used by DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/8-20.png","element":"img","alt":"1/2","inline":true},{"text":") and QR-DQN agents to make decisions.","element":"span"}],[{"style":{"width":"56%"},"width":895,"height":336,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-0.png","element":"img"}],[{"id":"id-54","text":"Figure 5.3: CDFs of ","element":"figcaption","subtype":"caption"},{"style":{"height":19},"width":89,"height":47.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-1.png","element":"img","alt":" ψπh;1/2","inline":true},{"text":"from DSUP(","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-2.png","element":"img","alt":"1/2","inline":true},{"text":") (left) and ","element":"figcaption","subtype":"caption"},{"style":{"height":15.8},"width":37,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-3.png","element":"img","alt":" 
ζπh","inline":true,"padRight":true},{"text":"from QR-DQN (right) at the start state at ","element":"figcaption","subtype":"caption"},{"id":"id-55","style":{"height":11.2},"width":173,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-4.png","element":"img","alt":" ω = 35Hz.","inline":true}],[{"text":"In this environment, action ","element":"span"},{"text":"1 ","element":"span"},{"text":"taken in the start state terminates the episode, yielding the smallest return, ","element":"span"},{"text":"0","element":"span"},{"text":", making this action inferior to its alternative action, ","element":"span"},{"text":"1","element":"span"},{"text":". We see that DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-5.png","element":"img","alt":"1/2","inline":true},{"text":") infers this fact. QR-DQN, on the other hand, has difficulty distinguishing these actions. This is because the ","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-6.png","element":"img","alt":"1/2","inline":true},{"text":"-rescaled superiority preserves ","element":"span"},{"style":{"height":15.6},"width":52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-7.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"action gaps, while ","element":"span"},{"style":{"height":16},"width":192,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-8.png","element":"img","alt":" ζπh does not.","inline":true}],[{"text":"Second, we consider a risk-sensitive setting. 
Here we compare QR-DQN and DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-9.png","element":"img","alt":"1/2","inline":true},{"text":") using ","element":"span"},{"style":{"height":7.4},"width":23,"height":18.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-10.png","element":"img","alt":" α","inline":true},{"text":"-CVaR for greedy action selection. We do this because Theorem ","element":"span"},{"href":"#id-50","text":"4.8 ","element":"a"},{"text":"does not hold with ","element":"span"},{"style":{"height":18.2},"width":65,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-11.png","element":"img","alt":" ϑπh;q","inline":true},{"text":", and preserving means is less critical in risk-","element":"span"},{"text":"sensitive control than it is in risk-neutral control. Figure ","element":"span"},{"href":"#id-55","text":"5.4 ","element":"a"},{"text":"depicts our results at ","element":"span"},{"style":{"height":11.4},"width":168.5,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-12.png","element":"img","alt":" ω = 35Hz","inline":true,"padRight":true},{"text":"(see Appendix ","element":"span"},{"text":"D ","element":"span"},{"text":"for results across a range of ","element":"span"},{"style":{"height":13.6},"width":47,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-13.png","element":"img","alt":" ω).","inline":true}],[{"style":{"width":"97%"},"width":1546,"height":365,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-14.png","element":"img"}],[{"text":"Figure 5.4: Risk-sensitive algorithms on high-frequency option-trading at ","element":"figcaption","subtype":"caption"},{"style":{"height":11.2},"width":173,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-15.png","element":"img","alt":" ω = 
35Hz.","inline":true}],[{"text":"Again, we see that DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-16.png","element":"img","alt":"1/2","inline":true},{"text":") is conclusively the best performer.","element":"span"}]]},{"heading":"6 Related Work","paragraphs":[[{"text":"Notions of action gap and ranking have long been of interest in RL (see, e.g., [","element":"span"},{"href":"#id-18","referenceIndex":11,"text":"11","element":"a"},{"text":"]). Action gaps are related to sample complexity in RL—indeed, instance-dependent sample complexity rates are inversely proportional to the divergence between action-conditioned return distributions ([","element":"span"},{"href":"#id-56","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-57","referenceIndex":19,"text":"19","element":"a"},{"text":", ","element":"span"},{"href":"#id-58","referenceIndex":37,"text":"37","element":"a"},{"text":"]). Bellemare et al. [","element":"span"},{"href":"#id-19","referenceIndex":5,"text":"5","element":"a"},{"text":"] argue for the consideration of alternatives to the Bellman operator that explicitly devalue suboptimal actions, and they show that Baird’s AL [","element":"span"},{"href":"#id-3","referenceIndex":2,"text":"2","element":"a"},{"text":"] operator falls within this class of operators. On the other hand, Schaul et al. [","element":"span"},{"href":"#id-59","referenceIndex":31,"text":"31","element":"a"},{"text":"] implicitly question Bellemare et al.’s position. 
They demonstrate that stochastic gradient updates in deep value-based RL algorithms induce frequent changes in relative action values, which in turn is a mechanism for exploration.","element":"span"}],[{"text":"The advantage function is commonplace in RL (see, e.g., [","element":"span"},{"href":"#id-60","referenceIndex":39,"text":"39","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":32,"text":"32","element":"a"},{"text":", ","element":"span"},{"href":"#id-62","referenceIndex":27,"text":"27","element":"a"},{"text":", ","element":"span"},{"href":"#id-63","referenceIndex":35,"text":"35","element":"a"},{"text":", ","element":"span"},{"href":"#id-64","referenceIndex":24,"text":"24","element":"a"},{"text":"]). In [","element":"span"},{"href":"#id-64","referenceIndex":24,"text":"24","element":"a"},{"text":"], Mésnard et al. employ a distributional critic that is closely related to our (unscaled) distributional superiority. Their choice of critic stems from a desire to minimize variance. We note that the distributional superiority is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a posteriori ","element":"span"},{"text":"characterized as a minimal-variance coupled difference representation of action-conditioned return distributions and policy-induced return distributions.","element":"span"}],[{"text":"Lastly, DRL in continuous-time MDPs is in its infancy. There are only three works to mention. Wiltzer et al. [","element":"span"},{"href":"#id-40","referenceIndex":40,"text":"40","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":41,"text":"41","element":"a"},{"text":"] give a characterization of return distributions for policy evaluation, and Halperin [","element":"span"},{"href":"#id-65","referenceIndex":14,"text":"14","element":"a"},{"text":"] studies algorithms for control. That said, none of these works considers distributional notions of action gaps or advantages. 
Moreover, Halperin does not consider any of the challenges of estimating the influence of actions in high-decision-frequency settings.","element":"span"}]]},{"heading":"7 Conclusion","paragraphs":[[{"text":"We establish that DRL agents are sensitive to decision frequency through analysis and simulation. In experiments, DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-17.png","element":"img","alt":"1/2","inline":true},{"text":") learns well-performing policies across a range of high decision frequencies, unlike prior approaches. DSUP(","element":"span"},{"text":"1","element":"span"},{"text":") and DAU+DSUP(","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-18.png","element":"img","alt":"1/2","inline":true},{"text":") are less robust. Given our analysis, the performance of DSUP(","element":"span"},{"text":"1","element":"span"},{"text":") is expected. Building an alternative algorithm to DAU+DSUP(","element":"span"},{"style":{"height":16},"width":40.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/9-19.png","element":"img","alt":"1/2","inline":true},{"text":") that is both tailored to risk-neutral control and robust to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"is an important avenue for future work.","element":"span"}]]},{"heading":"Acknowledgments and Disclosure of Funding","paragraphs":[[{"text":"The authors are very grateful to Yunhao Tang for fruitful correspondence about distributional analogues to the advantage. Additionally, we thank Mark Rowland, Jesse Farebrother, Tyler Kastner, Pierluca D’Oro, Nate Rahn, and Arnav Jain for helpful discussions. HW was supported by the Fonds de Recherche du Québec and the Natural Sciences and Engineering Research Council of Canada (NSERC). 
MGB was supported by the Canada CIFAR AI Chair program and NSERC. This work was supported in part by DARPA HR0011-23-9-0050 to PS. YJ was supported in part by NSF Grant 2243869. This research was enabled in part by support provided by Calcul Québec, the Digital Research Alliance of Canada (","element":"span"},{"style":{"fontFamily":"monospace"},"text":"alliancecan.ca","element":"span"},{"text":"), and the compute resources provided by Mila (","element":"span"},{"style":{"fontFamily":"monospace"},"text":"mila.quebec","element":"span"},{"text":").","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-30","text":"[1] Carlo Acerbi. Spectral measures of risk: A coherent representation of subjective risk aversion. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Banking & Finance","element":"span"},{"text":", 26(7):1505–1518, July 2002.","element":"span"}],[{"id":"id-3","text":"[2] L. Baird. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement Learning Through Gradient Descent","element":"span"},{"text":". PhD thesis, May 1999.","element":"span"}],[{"id":"id-90","text":"[3] ","element":"span"},{"text":"Marc G. Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C. Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nature","element":"span"},{"text":", 588(7836):77–82, December 2020.","element":"span"}],[{"id":"id-34","text":"[4] ","element":"span"},{"text":"Marc G. Bellemare, Will Dabney, and Mark Rowland. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Distributional Reinforcement Learning","element":"span"},{"text":". The MIT Press, May 2023.","element":"span"}],[{"id":"id-19","text":"[5] ","element":"span"},{"text":"Marc G Bellemare, Georg Ostrovski, Arthur Guez, Philip Thomas, and Rémi Munos. 
Increasing the action gap: New operators for reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI Conference on Artificial Intelligence","element":"span"},{"text":", volume 30, 2016.","element":"span"}],[{"id":"id-94","text":"[6] ","element":"span"},{"text":"James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.","element":"span"}],[{"id":"id-69","text":"[7] ","element":"span"},{"text":"Gerard Brunick and Steven Shreve. Mimicking an Itô process by a solution of a stochastic differential equation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Applied Probability","element":"span"},{"text":", 23(4):1584–1628, August 2013.","element":"span"}],[{"id":"id-46","text":"[8] ","element":"span"},{"text":"Wesley Chung, Somjit Nath, Ajin Joseph, and Martha White. Two-timescale networks for nonlinear value function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-32","text":"[9] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for ","element":"span"},{"text":"distributional reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1096–1105. PMLR, 2018.","element":"span"}],[{"id":"id-31","text":"[10] ","element":"span"},{"text":"Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional Reinforcement Learning with Quantile Regression. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AAAI","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-18","text":"[11] ","element":"span"},{"text":"Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 24, 2011.","element":"span"}],[{"id":"id-12","text":"[12] ","element":"span"},{"text":"Wendell H Fleming and Halil Mete Soner. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Controlled Markov processes and viscosity solutions","element":"span"},{"text":", volume 25. Springer Science & Business Media, 2006.","element":"span"}],[{"id":"id-56","text":"[13] ","element":"span"},{"text":"Todd L. Graves and Tze Leung Lai. Asymptotically Efficient Adaptive Choice of Control Laws in Controlled Markov Chains. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 35(3):715–743, May 1997.","element":"span"}],[{"id":"id-65","text":"[14] ","element":"span"},{"text":"Igor Halperin. Distributional offline continuous-time reinforcement learning with neural physics-informed PDEs (SciPhy RL for DOCTR-L). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Computing and Applications","element":"span"},{"text":", 36(9):4643–4659, 2024.","element":"span"}],[{"id":"id-8","text":"[15] ","element":"span"},{"text":"Yanwei Jia and Xun Yu Zhou. Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 23(275):1–50, 2022.","element":"span"}],[{"id":"id-9","text":"[16] ","element":"span"},{"text":"Yanwei Jia and Xun Yu Zhou. q-learning in continuous time. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 24(161):1–61, 2023.","element":"span"}],[{"id":"id-35","text":"[17] ","element":"span"},{"text":"Tyler Kastner, Murat A Erdogdu, and Amir-massoud Farahmand. Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", 2023.","element":"span"}],[{"text":"[18] ","element":"span"},{"text":"Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations (ICLR)","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-57","text":"[19] Tor Lattimore and Csaba Szepesvári. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bandit algorithms","element":"span"},{"text":". Cambridge University Press, 2020.","element":"span"}],[{"id":"id-1","text":"[20] ","element":"span"},{"text":"Leemon C. Baird. Advantage Updating. Technical report, Defense Technical Information Center, Fort Belvoir, VA, November 1993.","element":"span"}],[{"id":"id-86","text":"[21] ","element":"span"},{"text":"Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations (ICLR)","element":"span"},{"text":", 2016.","element":"span"}],[{"id":"id-33","text":"[22] ","element":"span"},{"text":"Shiau Hong Lim and Ilyas Malik. Distributional Reinforcement Learning for Risk-Sensitive Policies. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", October 2022.","element":"span"}],[{"id":"id-52","text":"[23] ","element":"span"},{"text":"Gisiro Maruyama. Continuous markov processes and stochastic equations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Rendiconti del Circolo Matematico di Palermo","element":"span"},{"text":", 4:48–90, 1955.","element":"span"}],[{"id":"id-64","text":"[24] Thomas Mesnard, Wenqi Chen, Alaa Saade, Yunhao Tang, Mark Rowland, Theophane Weber, ","element":"span"},{"text":"Clare Lyle, Audrunas Gruslys, Michal Valko, Will Dabney, et al. Quantile credit assignment. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 24517–24531. PMLR, 2023.","element":"span"}],[{"id":"id-4","text":"[25] ","element":"span"},{"text":"Rémi Munos and Paul Bourgine. Reinforcement Learning for Continuous Stochastic Control Problems. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", 1997.","element":"span"}],[{"id":"id-13","text":"[26] ","element":"span"},{"text":"Bernt Oksendal. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic differential equations: an introduction with applications","element":"span"},{"text":". Springer Science & Business Media, 2013.","element":"span"}],[{"id":"id-62","text":"[27] ","element":"span"},{"text":"Hsiao-Ru Pan, Nico Gürtler, Alexander Neitz, and Bernhard Schölkopf. Direct Advantage Estimation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-0","text":"[28] ","element":"span"},{"text":"Simon Ramstedt and Christopher J. Pal. Real-time reinforcement learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-36","text":"[29] ","element":"span"},{"text":"R Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Banking & Finance","element":"span"},{"text":", 26(7):1443–1471, 2002.","element":"span"}],[{"id":"id-41","text":"[30] ","element":"span"},{"text":"Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, and Will Dabney. An Analysis of Quantile Temporal-Difference Learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research (JMLR)","element":"span"},{"text":", 25:1–47, 2023.","element":"span"}],[{"id":"id-59","text":"[31] ","element":"span"},{"text":"Tom Schaul, André Barreto, John Quan, and Georg Ostrovski. The phenomenon of policy churn. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:2537–2549, 2022.","element":"span"}],[{"id":"id-61","text":"[32] ","element":"span"},{"text":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1707.06347, 2017.","element":"span"}],[{"id":"id-66","text":"[33] ","element":"span"},{"text":"Daniel W. Stroock and S. R. Srinivasa Varadhan. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Multidimensional Diffusion Processes","element":"span"},{"text":". Springer, 2006.","element":"span"}],[{"id":"id-2","text":"[34] ","element":"span"},{"text":"Corentin Tallec, Léonard Blier, and Yann Ollivier. Making Deep Q-learning methods robust to time discretization. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-63","text":"[35] ","element":"span"},{"text":"Yunhao Tang, Rémi Munos, Mark Rowland, and Michal Valko. VA-learning as a more efficient alternative to Q-learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-20","text":"[36] Cédric Villani et al. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Optimal transport: old and new","element":"span"},{"text":", volume 338. Springer, 2009.","element":"span"}],[{"id":"id-58","text":"[37] ","element":"span"},{"text":"Andrew J. Wagenmaker and Dylan J. Foster. Instance-Optimality in Interactive Decision Making: Toward a Non-Asymptotic Theory. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Thirty Sixth Conference on Learning Theory","element":"span"},{"text":", pages 1322–1472. PMLR, July 2023.","element":"span"}],[{"id":"id-7","text":"[38] ","element":"span"},{"text":"Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 21(198):1–34, 2020.","element":"span"}],[{"id":"id-60","text":"[39] ","element":"span"},{"text":"Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2016.","element":"span"}],[{"id":"id-40","text":"[40] ","element":"span"},{"text":"Harley Wiltzer. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"On the Evolution of Return Distributions in Continuous-Time Reinforcement Learning","element":"span"},{"text":". McGill University (Canada), 2021.","element":"span"}],[{"id":"id-16","text":"[41] ","element":"span"},{"text":"Harley Wiltzer, David Meger, and Marc G. Bellemare. Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-14","text":"[42] ","element":"span"},{"text":"Hanyang Zhao, Wenpin Tang, and David Yao. Policy optimization for continuous reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}]]},{"heading":"A Formalism of Continuous-Time RL Controlled Markov Processes","paragraphs":[[{"text":"Expected-value RL is a data-driven approach to solving the (classic) optimal control problem: find an action (control) process ","element":"span"},{"style":{"height":16.8},"width":129.84,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-0.png","element":"img","alt":" (As)s≥t","inline":true,"padRight":true},{"text":"and an associated state process (then determined by the environment) ","element":"span"},{"style":{"height":16.8},"width":349.12,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-1.png","element":"img","alt":"(Xs)s≥t with Xt = x","inline":true},{"text":", for a given ","element":"span"},{"style":{"height":13.2},"width":87.56,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-2.png","element":"img","alt":" t ≥ 0","inline":true},{"text":", that maximize the expected return earned by following the state-action process 
","element":"span"},{"style":{"height":16.8},"width":197.64,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-3.png","element":"img","alt":" (Xs, As)s≥t","inline":true},{"text":". In particular, RL agents search the space of state-action processes via policies ","element":"span"},{"style":{"height":16},"width":347.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-4.png","element":"img","alt":" π : T × X → P(A)","inline":true},{"text":". Policies prescribe the conditional probabilities of the laws of state-action processes. Indeed, ","element":"span"},{"style":{"height":16.78},"width":197.64,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-5.png","element":"img","alt":" (Xs, As)s≥t","inline":true,"padRight":true},{"text":"is the state-action process of an agent following ","element":"span"},{"style":{"height":11.6},"width":125.2,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-6.png","element":"img","alt":" π if and","inline":true,"padRight":true},{"text":"only if, for each ","element":"span"},{"style":{"height":16},"width":465,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-7.png","element":"img","alt":" s ≥ t, the set {π(· | s, x)}x∈X","inline":true,"padRight":true},{"text":"is the set of conditional probabilities of law","element":"span"},{"style":{"height":16},"width":177.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-8.png","element":"img","alt":"((Xs, As))","inline":true,"padRight":true},{"text":"with respect to law","element":"span"},{"style":{"height":16},"width":91.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-9.png","element":"img","alt":"(Xs).","inline":true}],[{"text":"Continuous-time RL is a data-driven approach to stochastic optimal control. 
Hence, environmental dynamics are assumed to arise from an action-parameterized family of SDEs determined by a drift ","element":"span"},{"style":{"height":12.24},"width":338.72,"height":30.6,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-10.png","element":"img","alt":"b : T × X × A → Rn ","inline":true,"padRight":true},{"text":"and diffusion ","element":"span"},{"style":{"height":13.78},"width":429.92,"height":34.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-11.png","element":"img","alt":" σ : T × X × A → Rn×n.12 ","inline":true,"padRight":true},{"text":"Thus, the goal of expected-value RL (and stochastic optimal control) is to find an expected-return maximizing state-action process among state-action processes ","element":"span"},{"style":{"height":16.78},"width":197.64,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-12.png","element":"img","alt":" (Xs, As)s≥t","inline":true,"padRight":true},{"text":"that satisfy","element":"span"}],[{"id":"id-67","style":{"width":"81%"},"width":1286,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-13.png","element":"img"}],[{"text":"We note that the MDPs defined in Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"have equivalent formulations in terms of transition kernels, exactly as they are formulated in discrete-time RL. 
We refer the reader to [","element":"span"},{"href":"#id-66","referenceIndex":33,"text":"33","element":"a"},{"text":"] for an in-depth discussion regarding this fact.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Justification of Random Returns ","element":"span"},{"text":"Given the above formalism, the “true” distribution of returns of an agent following a policy ","element":"span"},{"style":{"height":11.6},"width":119.72,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-14.png","element":"img","alt":" π is the","inline":true,"padRight":true},{"text":"law of","element":"span"}],[{"style":{"width":"36%"},"width":574,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.8},"width":197.64,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-16.png","element":"img","alt":" (Xs, As)s≥t","inline":true,"padRight":true},{"text":"is a state-action process associated to ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-17.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"and solves (","element":"span"},{"href":"#id-67","text":"A.1","element":"a"},{"text":"). 
That said, by the definition of ","element":"span"},{"style":{"height":9.2},"width":34.16,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-18.png","element":"img","alt":" π,","inline":true}],[{"style":{"width":"87%"},"width":1384,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-19.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.6},"width":158.64,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-20.png","element":"img","alt":" bπ and σπ ","inline":true,"padRight":true},{"text":"are exactly the coefficients defined in (","element":"span"},{"href":"#id-68","text":"2.3","element":"a"},{"text":"). Hence, by [","element":"span"},{"href":"#id-69","referenceIndex":7,"text":"7","element":"a"},{"text":"], provided that ","element":"span"},{"style":{"height":11.6},"width":158.6,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-21.png","element":"img","alt":" bπ and σπ","inline":true,"padRight":true},{"text":"are regular enough to guarantee that (","element":"span"},{"href":"#id-10","text":"2.2","element":"a"},{"text":") is well-posed in law, the processes ","element":"span"},{"style":{"height":16.8},"width":254.32,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-22.png","element":"img","alt":" Xπ = (Xπs )s≥t","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.78},"width":222.28,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-23.png","element":"img","alt":"X = (Xs)s≥t","inline":true,"padRight":true},{"text":"are equal in law.","element":"span"},{"text":"13 ","element":"span"},{"text":"Here ","element":"span"},{"style":{"height":16.78},"width":140.72,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-24.png","element":"img","alt":" (Xπs )s≥t 
","inline":true,"padRight":true},{"text":"satisfies (","element":"span"},{"href":"#id-10","text":"2.2","element":"a"},{"text":") with ","element":"span"},{"style":{"height":14.74},"width":133.92,"height":36.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-25.png","element":"img","alt":" Xπt = x","inline":true},{"text":". Consequently,","element":"span"}],[{"style":{"width":"75%"},"width":1201,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-26.png","element":"img"}],[{"id":"id-15","style":{"fontWeight":"bold"},"text":"A.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"On Assumption ","element":"span"},{"href":"#id-5","style":{"fontWeight":"bold"},"text":"2.3","element":"a"}],[{"text":"Here we provide some conditions under which Assumption ","element":"span"},{"href":"#id-5","text":"2.3 ","element":"a"},{"text":"is established. These conditions are presented as additional assumptions. Assumptions ","element":"span"},{"href":"#id-70","text":"A.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-71","text":"A.2 ","element":"a"},{"text":"are common in stochastic control theory (see, e.g., [","element":"span"},{"href":"#id-12","referenceIndex":12,"text":"12","element":"a"},{"text":"]) and SDE theory in general (see, e.g., [","element":"span"},{"href":"#id-13","referenceIndex":26,"text":"26","element":"a"},{"text":"]). Assumption ","element":"span"},{"href":"#id-72","text":"A.3 ","element":"a"},{"text":"is ubiquitous in the continuous-time RL literature [","element":"span"},{"href":"#id-8","referenceIndex":15,"text":"15","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":16,"text":"16","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":42,"text":"42","element":"a"},{"text":"]. 
Notably, these conditions together guarantee the existence of transition probabilities for policy-induced state processes arising from (","element":"span"},{"href":"#id-10","text":"2.2","element":"a"},{"text":").","element":"span"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Assumption A.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The coefficients ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-27.png","element":"img","alt":" σ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are uniformly bounded: a finite, positive constant ","element":"span"},{"href":"#id-70","style":{"height":13.2},"width":77.88,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-28.png","element":"img","alt":" CA.1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"exists such that","element":"span"}],[{"style":{"width":"42%"},"width":670,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-29.png","element":"img"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"Assumption A.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The matrix ","element":"span"},{"style":{"height":6.8},"width":62.44,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-30.png","element":"img","alt":" σσ⊤","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is uniformly elliptic: a positive, finite constant ","element":"span"},{"href":"#id-71","style":{"height":13.2},"width":72.68,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-31.png","element":"img","alt":" λA.2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"exists such","element":"span"}],[{"style":{"width":"68%"},"width":1080,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/13-32.png","element":"img"}],[{"text":"A consequence of Assumption ","element":"span"},{"href":"#id-71","text":"A.2 ","element":"a"},{"text":"is","element":"span"}],[{"id":"id-73","style":{"width":"65%"},"width":1046,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-0.png","element":"img"}],[{"text":"In other words, ","element":"span"},{"style":{"height":10.58},"width":43.2,"height":26.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-1.png","element":"img","alt":" σπ ","inline":true,"padRight":true},{"text":"is also uniformly elliptic.","element":"span"}],[{"id":"id-72","style":{"fontWeight":"bold"},"text":"Assumption A.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"A finite, positive constant ","element":"span"},{"href":"#id-72","style":{"height":13.18},"width":77.92,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-2.png","element":"img","alt":" CA.3","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"exists for which","element":"span"}],[{"style":{"width":"57%"},"width":918,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"text":"TV ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the total variation metric on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"text":"A","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Observe if ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-4.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"satisfies Assumption ","element":"span"},{"href":"#id-72","text":"A.3","element":"a"},{"text":", then ","element":"span"},{"style":{"height":16.8},"width":102.12,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-5.png","element":"img","alt":" π|h,a,t","inline":true,"padRight":true},{"text":"also satisfies Assumption ","element":"span"},{"href":"#id-72","text":"A.3","element":"a"},{"text":". 
Indeed,","element":"span"}],[{"style":{"width":"81%"},"width":1299,"height":195,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-6.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"π","element":"span"},{"style":{"height":16.8},"width":811.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-7.png","element":"img","alt":"|h,a,t(· | s, x) = π(· | s, x) for all s ∈ T \\ [t, t + h).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Proposition A.4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumptions ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"2.2","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-70","style":{"fontStyle":"italic"},"text":"A.1","element":"a"},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"href":"#id-71","style":{"fontStyle":"italic"},"text":"A.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold and ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-8.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfies Assumption ","element":"span"},{"href":"#id-72","style":{"fontStyle":"italic"},"text":"A.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", then Assumption ","element":"span"},{"href":"#id-5","style":{"fontStyle":"italic"},"text":"2.3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Observe that","element":"span"}],[{"style":{"width":"81%"},"width":1285,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-9.png","element":"img"}],[{"text":"by Assumptions ","element":"span"},{"href":"#id-11","text":"2.2","element":"a"},{"text":", ","element":"span"},{"href":"#id-70","text":"A.1","element":"a"},{"text":", and ","element":"span"},{"href":"#id-72","text":"A.3","element":"a"},{"text":". Here we have also used Kantorovich duality to compute ","element":"span"},{"text":"TV ","element":"span"},{"text":"and invoke Assumption ","element":"span"},{"href":"#id-72","text":"A.3","element":"a"},{"text":".","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-10.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"be an eigenvalue of ","element":"span"},{"style":{"height":16},"width":310.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-11.png","element":"img","alt":" σπ(t, x) − σπ(t, y)","inline":true,"padRight":true},{"text":"with unit eigenvector ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":". 
Observe that","element":"span"}],[{"style":{"width":"98%"},"width":1557,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-12.png","element":"img"}],[{"text":"Hence, by (","element":"span"},{"href":"#id-73","text":"A.2","element":"a"},{"text":"),","element":"span"}],[{"style":{"width":"75%"},"width":1204,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-13.png","element":"img"}],[{"text":"In turn, by Assumptions ","element":"span"},{"href":"#id-11","text":"2.2","element":"a"},{"text":", ","element":"span"},{"href":"#id-70","text":"A.1","element":"a"},{"text":", and ","element":"span"},{"href":"#id-72","text":"A.3","element":"a"},{"text":", as done to prove that ","element":"span"},{"style":{"height":10.8},"width":36.12,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-14.png","element":"img","alt":" bπ ","inline":true,"padRight":true},{"text":"was Lipschitz above,","element":"span"}],[{"style":{"width":"38%"},"width":612,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-15.png","element":"img"}],[{"text":"Assumption ","element":"span"},{"href":"#id-5","text":"2.3 ","element":"a"},{"text":"follows, since ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-16.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"was an arbitrary eigenvalue and all norms on finite-dimensional spaces are equivalent.","element":"span"}],[{"text":"We conclude this section with one final fact: under Assumption ","element":"span"},{"href":"#id-11","text":"2.2","element":"a"},{"text":", the policy-averaged coefficients (","element":"span"},{"href":"#id-68","text":"2.3","element":"a"},{"text":") have linear growth. 
Indeed,","element":"span"}],[{"style":{"width":"54%"},"width":856,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-17.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"59%"},"width":936,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/14-18.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"A.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Action-Independent Rewards","element":"span"}],[{"text":"In this paper, we assume that the rewards do not depend on actions. This is a theoretical limitation of not just our work, but continuous-time DRL in general (see [","element":"span"},{"href":"#id-16","referenceIndex":41,"text":"41","element":"a"},{"text":", ","element":"span"},{"href":"#id-65","referenceIndex":14,"text":"14","element":"a"},{"text":"]). In the following sections, we discuss the nature of this theoretical limitation. However, many MDPs have action-independent reward functions. 
For example, MDPs encoding goal-reaching problems, tracking problems, and commodity-trading problems all have action-independent rewards.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Continuous-Time, Expected Return ","element":"span"},{"text":"In continuous-time, expected-value RL, when the reward function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"depends on actions, the standard approach to analysis involves considering the averaged reward function ","element":"span"},{"style":{"height":14.8},"width":413.92,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-0.png","element":"img","alt":" rπ : T × X → R given by","inline":true}],[{"style":{"width":"36%"},"width":576,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-1.png","element":"img"}],[{"text":"In continuous-time RL specifically, the averaged reward ","element":"span"},{"style":{"height":10.58},"width":38.08,"height":26.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-2.png","element":"img","alt":" rπ ","inline":true,"padRight":true},{"text":"is justified exactly as the coefficients ","element":"span"},{"style":{"height":10.8},"width":36.08,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-3.png","element":"img","alt":" bπ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.59},"width":43.2,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-4.png","element":"img","alt":" σπ ","inline":true,"padRight":true},{"text":"are justified. 
However, the “true” return distribution is not equal to the law of","element":"span"}],[{"id":"id-74","style":{"width":"69%"},"width":1101,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-5.png","element":"img"}],[{"text":"To see this, it suffices to consider an MDP with a single state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". In this case, the expression in (","element":"span"},{"href":"#id-74","text":"A.3","element":"a"},{"text":") is deterministic. However, if ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-6.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is nondeterministic and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"is dependent on actions, then the “true” return distribution is nondeterministic. Hence, the law of the expression in (","element":"span"},{"href":"#id-74","text":"A.3","element":"a"},{"text":") cannot be the “true” return distribution associated to ","element":"span"},{"style":{"height":7.2},"width":34.12,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-7.png","element":"img","alt":" π.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"A.3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Discrete-Time, Random Return","element":"span"}],[{"text":"In discrete-time RL, the distribution of returns given an action-dependent reward is analyzed through the state-action process induced by a policy. 
This process is defined by extending the action-parameterized family of transition probability kernels on ","element":"span"},{"text":"X","element":"span"},{"text":", which define the dynamics of a given MDP, to a single transition probability kernel on ","element":"span"},{"style":{"height":11.2},"width":92.36,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-8.png","element":"img","alt":" X×A","inline":true},{"text":". In the time-homogeneous setting, for instance, with transition kernels ","element":"span"},{"style":{"height":16},"width":289.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-9.png","element":"img","alt":" {P(dy | x, a)}a∈A","inline":true},{"text":", this amounts to constructing","element":"span"}],[{"style":{"width":"44%"},"width":703,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-10.png","element":"img"}],[{"text":"provided that the map ","element":"span"},{"style":{"height":16},"width":215.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-11.png","element":"img","alt":" y ↦ π(E | y)","inline":true,"padRight":true},{"text":"is measurable for all ","element":"span"},{"style":{"height":14.4},"width":96.76,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-12.png","element":"img","alt":" y ∈ X","inline":true},{"text":". 
In continuous-time environments, such a construction has yet to be discovered.","element":"span"}],[{"text":"We note that trying to analogously extend the action-parameterized family of transition semigroups on ","element":"span"},{"text":"X","element":"span"},{"text":", which define the dynamics of a given time-homogeneous MDP in continuous time, to a single transition semigroup on ","element":"span"},{"style":{"height":11.2},"width":102.28,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-13.png","element":"img","alt":" X × A","inline":true,"padRight":true},{"text":"by defining","element":"span"}],[{"style":{"width":"44%"},"width":711,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":161.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-15.png","element":"img","alt":" Pt(dy | x)","inline":true,"padRight":true},{"text":"is a transition semigroup, may fail to satisfy the Chapman–Kolmogorov identity. Indeed, suppose ","element":"span"},{"text":"A ","element":"span"},{"text":"has two elements and ","element":"span"},{"text":"X ","element":"span"},{"text":"= ","element":"span"},{"text":"R","element":"span"},{"text":". Let ","element":"span"},{"style":{"height":16},"width":146,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-16.png","element":"img","alt":" π(da | x)","inline":true,"padRight":true},{"text":"be the uniform measure on ","element":"span"},{"text":"A ","element":"span"},{"text":"for all ","element":"span"},{"style":{"height":12},"width":100.52,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-17.png","element":"img","alt":" x ∈ X","inline":true},{"text":". 
If ","element":"span"},{"style":{"height":16},"width":421.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-18.png","element":"img","alt":" Pt(dy | x, aδ) = δx+t(dy)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.98},"width":778.88,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-19.png","element":"img","alt":" Pt(dy | x, ag) = (2π)−1/2 exp(−|y − x|2/2t) dy","inline":true},{"text":", then the Chapman–Kolmogorov identity fails, for example, on any tuple ","element":"span"},{"style":{"height":16},"width":572.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-20.png","element":"img","alt":" (s, t, x, aδ, E × F) where E ⊂ X is","inline":true,"padRight":true},{"text":"open, ","element":"span"},{"style":{"height":16},"width":480.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-21.png","element":"img","alt":" x + t + s ∉ E, and π(F) ≠ 0","inline":true},{"text":". On one hand,","element":"span"}],[{"style":{"width":"92%"},"width":1472,"height":379,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-22.png","element":"img"}],[{"text":"At present, the question of how to generate a well-defined (even in law) state-action process in any continuous-time MDP framework given a stochastic policy is generally open. Of course, if ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-23.png","element":"img","alt":"π","inline":true,"padRight":true},{"text":"is deterministic, then the state-action process is ","element":"span"},{"style":{"height":16.8},"width":292.28,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-24.png","element":"img","alt":" (Xs, π(s, Xs))s≥t","inline":true},{"text":". 
If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-25.png","element":"img","alt":" σ","inline":true,"padRight":true},{"text":"are Lipschitz in state and action, uniformly in time, and ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-26.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is Lipschitz in state, uniformly in time, then (","element":"span"},{"href":"#id-67","text":"A.1","element":"a"},{"text":") with ","element":"span"},{"style":{"height":16},"width":242.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/15-27.png","element":"img","alt":"As = π(s, Xs)","inline":true,"padRight":true},{"text":"is well-posed.","element":"span"}]]},{"heading":"B Proofs","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"B.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"The Distributional Action Gap","element":"span"}],[{"text":"In this section, we prove the statements made in Section ","element":"span"},{"text":"3","element":"span"},{"text":".","element":"span"}],[{"text":"Before proving any of the statements made in Section ","element":"span"},{"text":"3","element":"span"},{"text":", we recall an identity that relates the ","element":"span"},{"style":{"height":15.58},"width":54.6,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-0.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"distance between two probability measures ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-1.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-2.png","element":"img","alt":" ν","inline":true,"padRight":true},{"text":"and the absolute central ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"th moments of the differences of random variables distributed according to ","element":"span"},{"style":{"height":14.4},"width":134.72,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-3.png","element":"img","alt":" µ and ν:","inline":true}],[{"id":"id-75","style":{"width":"83%"},"width":1330,"height":65,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-4.png","element":"img"}],[{"text":"This identity will be used a number of times, including in the proof of Section ","element":"span"},{"text":"3","element":"span"},{"text":"’s first result, which we restate here for the reader’s convenience.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 3.4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":16},"width":237.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-5.png","element":"img","alt":" (t, x) ∈ T × X","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that ","element":"span"},{"style":{"height":18.29},"width":562.96,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-6.png","element":"img","alt":" distgapp(ζπh, t, x) ≥ gap(Qπh, t, x).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":139.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-7.png","element":"img","alt":" (Z1, Z2)","inline":true,"padRight":true},{"text":"be any random vector such that law","element":"span"},{"style":{"height":16.51},"width":305.04,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-8.png","element":"img","alt":"(Zi) = ζπh(t, x, ai)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"text":". Then, ","element":"span"},{"text":"by Jensen’s inequality,","element":"span"}],[{"style":{"width":"84%"},"width":1337,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-9.png","element":"img"}],[{"text":"Hence, since ","element":"span"},{"style":{"height":16},"width":139.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-10.png","element":"img","alt":" (Z1, Z2)","inline":true,"padRight":true},{"text":"was arbitrary, by (","element":"span"},{"href":"#id-75","text":"B.1","element":"a"},{"text":"),","element":"span"}],[{"style":{"width":"62%"},"width":995,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-11.png","element":"img"}],[{"text":"Finally, taking the minimum over pairs of actions ","element":"span"},{"style":{"height":16},"width":412.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-12.png","element":"img","alt":" (a1, a2) such that a1 ≠ a2","inline":true,"padRight":true},{"text":"concludes the proof.","element":"span"}],[{"text":"Now we move on to the proofs of Theorems ","element":"span"},{"href":"#id-22","text":"3.5 ","element":"a"},{"text":"and 
","element":"span"},{"href":"#id-24","text":"3.7","element":"a"},{"text":". We defer the proof of Theorem ","element":"span"},{"href":"#id-23","text":"3.6 ","element":"a"},{"text":"until after the proof of Theorem ","element":"span"},{"href":"#id-24","text":"3.7 ","element":"a"},{"text":"as the proofs of Theorems ","element":"span"},{"href":"#id-22","text":"3.5 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-24","text":"3.7 ","element":"a"},{"text":"are similar. For clarity’s sake, we first prove a collection of lemmas.","element":"span"}],[{"id":"id-76","style":{"fontWeight":"bold"},"text":"Lemma B.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"2.2","element":"a"},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":16.78},"width":138.4,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-13.png","element":"img","alt":" (Xas )s≥t ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the unique strong solution to ","element":"span"},{"text":"(","element":"span"},{"href":"#id-11","text":"2.2","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":14.74},"width":131.56,"height":36.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-14.png","element":"img","alt":" Xat = x","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"P","element":"span"},{"style":{"fontStyle":"italic"},"text":"-a.s. 
Then, for all ","element":"span"},{"style":{"height":14},"width":92.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-15.png","element":"img","alt":" q ≥ 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and for all ","element":"span"},{"style":{"height":12.8},"width":97.2,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-16.png","element":"img","alt":" s ≥ t,","inline":true}],[{"style":{"width":"82%"},"width":1308,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":18.18},"width":203.4,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-18.png","element":"img","alt":" Cq = 42q−1.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":18.18},"width":191.52,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-19.png","element":"img","alt":" Cq = 42q−1","inline":true},{"text":". By Jensen’s inequality and Itô’s isometry, observe that","element":"span"}],[{"style":{"width":"89%"},"width":1418,"height":486,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-20.png","element":"img"}],[{"text":"Thus, the lemma follows after taking expectation and applying Gronwall’s inequality.","element":"span"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"Lemma B.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"2.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-5","style":{"fontStyle":"italic"},"text":"2.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":16.78},"width":479.68,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-21.png","element":"img","alt":" (X•s )s≥t with • ∈ {π, π|h,a,t}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the unique strong ","element":"span"},{"style":{"fontStyle":"italic"},"text":"solution to ","element":"span"},{"text":"(","element":"span"},{"href":"#id-10","text":"2.2","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":14.93},"width":171.32,"height":37.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-22.png","element":"img","alt":" X•t = x P","inline":true},{"style":{"fontStyle":"italic"},"text":"-a.s. 
Then, for all ","element":"span"},{"style":{"height":13.2},"width":168.88,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-23.png","element":"img","alt":" s ≤ t + h,","inline":true}],[{"style":{"width":"79%"},"width":1258,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is some finite positive constant depending on ","element":"span"},{"href":"#id-11","style":{"height":14.4},"width":282.24,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/16-25.png","element":"img","alt":" q, C2.2, and C2.3.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":18.18},"width":198.2,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-0.png","element":"img","alt":" Cq = 82q−1","inline":true},{"text":". Note that ","element":"span"},{"style":{"height":21.81},"width":294.08,"height":54.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-1.png","element":"img","alt":" Xπ|h,a,ts′ = Xas′ P","inline":true},{"text":"-a.s. for all ","element":"span"},{"style":{"height":13.2},"width":178.44,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-2.png","element":"img","alt":" s′ ≤ t + h","inline":true},{"text":", by the definition of ","element":"span"},{"style":{"height":16.78},"width":102.08,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-3.png","element":"img","alt":"π|h,a,t","inline":true,"padRight":true},{"text":"and the uniqueness of strong solutions to (","element":"span"},{"href":"#id-6","text":"2.1","element":"a"},{"text":"). 
So, by Jensen’s inequality and Itô’s isometry, observe that","element":"span"}],[{"style":{"width":"87%"},"width":1386,"height":316,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-4.png","element":"img"}],[{"text":"Thus, after taking expectation, we deduce that","element":"span"}],[{"style":{"width":"96%"},"width":1531,"height":318,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-5.png","element":"img"}],[{"text":"And so, the lemma follows after applying Lemma ","element":"span"},{"href":"#id-76","text":"B.1 ","element":"a"},{"text":"and Gronwall’s inequality.","element":"span"}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"Lemma B.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"2.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-5","style":{"fontStyle":"italic"},"text":"2.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":16.8},"width":479.68,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-6.png","element":"img","alt":" (X•s )s≥t with • ∈ {π, π|h,a,t}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the unique strong ","element":"span"},{"style":{"fontStyle":"italic"},"text":"solution to ","element":"span"},{"text":"(","element":"span"},{"href":"#id-10","text":"2.2","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":14.91},"width":171.32,"height":37.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-7.png","element":"img","alt":" X•t = x P","inline":true},{"style":{"fontStyle":"italic"},"text":"-a.s. 
Then, for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s > t ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"height":21.14},"width":1166.6,"height":52.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-8.png","element":"img","alt":"E[|Xπ|h,a,ts − Xπs |2q] ≤ C(1 + |x|)2q(hq + 1)hqeC((s−t−h)q+1)(s−t−h)q","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is some finite positive constant depending on ","element":"span"},{"href":"#id-11","style":{"height":14.4},"width":282.24,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-9.png","element":"img","alt":" q, C2.2, and C2.3.","inline":true}],[{"style":{"width":"80%"},"width":1272,"height":163,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-10.png","element":"img"}],[{"text":"So, as ","element":"span"},{"style":{"height":16.78},"width":456.56,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-11.png","element":"img","alt":" π|h,a,t(· | s′, y) = π(· | s′, y)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":443.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-12.png","element":"img","alt":" (s′, y) ∈ T \\ [t, t + h) × X","inline":true},{"text":", by Jensen’s inequality and Itô’s isometry,","element":"span"}],[{"style":{"width":"78%"},"width":1244,"height":185,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-13.png","element":"img"}],[{"text":"Thus, after taking expectation, applying Gronwall’s inequality, and considering Lemma ","element":"span"},{"href":"#id-77","text":"B.2","element":"a"},{"text":", the lemma 
follows.","element":"span"}],[{"text":"One consequence of Lemmas ","element":"span"},{"href":"#id-77","text":"B.2 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-78","text":"B.3 ","element":"a"},{"text":"is","element":"span"}],[{"style":{"width":"38%"},"width":617,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-14.png","element":"img"}],[{"text":"for all ","element":"span"},{"style":{"height":12},"width":94.4,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-15.png","element":"img","alt":" s ∈ T","inline":true,"padRight":true},{"text":"and for all ","element":"span"},{"style":{"height":16},"width":173.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-16.png","element":"img","alt":" p ∈ [1, ∞)","inline":true},{"text":". And so, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is bounded, then ","element":"span"},{"style":{"height":21.81},"width":347.6,"height":54.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-17.png","element":"img","alt":" f(Xπ|h,a,tT ) − f(XπT )","inline":true,"padRight":true},{"text":"is bounded ","element":"span"},{"text":"and converges to zero ","element":"span"},{"style":{"fontWeight":"bold"},"text":"P","element":"span"},{"text":"-a.s. as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"converges to zero. 
The dominated convergence theorem implies that","element":"span"}],[{"id":"id-80","style":{"width":"73%"},"width":1173,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-18.png","element":"img"}],[{"text":"Similarly, we see that the functions ","element":"span"},{"style":{"height":16},"width":266.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-19.png","element":"img","alt":" gh : T → [0, ∞)","inline":true,"padRight":true},{"text":"defined by","element":"span"}],[{"style":{"width":"41%"},"width":652,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-20.png","element":"img"}],[{"text":"are uniformly bounded (in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":") and converge to zero as ","element":"span"},{"style":{"height":14.8},"width":337.4,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-21.png","element":"img","alt":" h ↓ 0 for every s ∈ T","inline":true},{"text":". Hence, by the dominated convergence theorem, again,","element":"span"}],[{"id":"id-81","style":{"width":"80%"},"width":1270,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-22.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":18.18},"width":960.52,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/17-23.png","element":"img","alt":" Γ(ds) := (γs−t/CT,t,γ) ds for CT,t,γ := (γT −t − 1)/ log γ","inline":true},{"text":". We now prove Theorem ","element":"span"},{"href":"#id-22","text":"3.5","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"are bounded, then ","element":"span"},{"style":{"height":16.8},"width":878.96,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-0.png","element":"img","alt":" limh↓0 Wp(ζπh(t, x, a), ηπ(t, x)) = 0, for all (t, x, a) ∈","inline":true},{"style":{"height":18.3},"width":794.72,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-1.png","element":"img","alt":"T × X × A; hence, limh↓0 distgapp(ζπh, t, x) = 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Observe that","element":"span"}],[{"style":{"width":"89%"},"width":1411,"height":219,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-2.png","element":"img"}],[{"text":"Thus, as claimed, it suffices to show that","element":"span"}],[{"style":{"width":"43%"},"width":687,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-3.png","element":"img"}],[{"text":"for all ","element":"span"},{"href":"#id-75","style":{"height":16},"width":499.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-4.png","element":"img","alt":" (t, x, a) ∈ T × X × A. 
By (B.1","inline":true},{"text":"), it suffices to show that","element":"span"}],[{"id":"id-79","style":{"width":"73%"},"width":1166,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-5.png","element":"img"}],[{"text":"for all ","element":"span"},{"style":{"height":16},"width":361,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-6.png","element":"img","alt":" (t, x, a) ∈ T × X × A.","inline":true}],[{"text":"Since ","element":"span"},{"style":{"height":16},"width":282.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-7.png","element":"img","alt":" (s, ω) ↦ X•s (ω)","inline":true,"padRight":true},{"text":"is measurable for ","element":"span"},{"style":{"height":16.8},"width":269.6,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-8.png","element":"img","alt":" • ∈ {π, π|h,a,t}","inline":true},{"text":", by Jensen’s inequality and Fubini’s ","element":"span"},{"text":"theorem, we see that","element":"span"}],[{"id":"id-83","style":{"width":"95%"},"width":1519,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-9.png","element":"img"}],[{"text":"In turn, (","element":"span"},{"href":"#id-79","text":"B.4","element":"a"},{"text":") follows from (","element":"span"},{"href":"#id-80","text":"B.2","element":"a"},{"text":") and (","element":"span"},{"href":"#id-81","text":"B.3","element":"a"},{"text":").","element":"span"}],[{"text":"If ","element":"span"},{"style":{"height":11.6},"width":121.96,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-10.png","element":"img","alt":" T < ∞","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h < ","element":"span"},{"text":"1","element":"span"},{"text":", the inequalities in the statements of Lemmas ","element":"span"},{"href":"#id-77","text":"B.2 ","element":"a"},{"text":"and 
","element":"span"},{"href":"#id-78","text":"B.3 ","element":"a"},{"text":"yield the following inequality: for all ","element":"span"},{"style":{"height":16},"width":182.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-11.png","element":"img","alt":" p ∈ [1, ∞),","inline":true}],[{"id":"id-82","style":{"width":"72%"},"width":1148,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-12.png","element":"img"}],[{"text":"for some finite positive constant ","element":"span"},{"href":"#id-82","style":{"height":15.2},"width":86.32,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-13.png","element":"img","alt":" C(B.6)","inline":true,"padRight":true},{"text":"depending on ","element":"span"},{"href":"#id-11","style":{"height":16},"width":415.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-14.png","element":"img","alt":" p, |x|, t, T, C2.2, and C2.3","inline":true},{"text":". With this inequality in hand, we now restate and prove Theorem ","element":"span"},{"href":"#id-24","text":"3.7","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3.7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is Lipschitz in state, uniformly in time, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is Lipschitz, and ","element":"span"},{"style":{"height":11.6},"width":144.24,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-15.png","element":"img","alt":" T < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":20.48},"width":1519.96,"height":51.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-16.png","element":"img","alt":"Wp(ζπh(t, x, a), ηπ(t, x)) ≲ h1/2, for all (t, x, a) ∈ T × X × A; hence, distgapp(ζπh, t, x) ≲ h1/2.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Arguing as in the proof of Theorem ","element":"span"},{"href":"#id-22","text":"3.5","element":"a"},{"text":", we see that","element":"span"}],[{"style":{"width":"62%"},"width":987,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-17.png","element":"img"}],[{"text":"with","element":"span"}],[{"style":{"width":"92%"},"width":1466,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-18.png","element":"img"}],[{"text":"This is (","element":"span"},{"href":"#id-83","text":"B.5","element":"a"},{"text":"). 
In turn, by (","element":"span"},{"href":"#id-82","text":"B.6","element":"a"},{"text":") and the facts that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"is Lipschitz in state, uniformly in time, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is Lipschitz, we deduce that","element":"span"}],[{"style":{"width":"31%"},"width":496,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-19.png","element":"img"}],[{"text":"as desired, where ","element":"span"},{"style":{"height":16},"width":169.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-20.png","element":"img","alt":" Cr and Cf","inline":true,"padRight":true},{"text":"are the Lipschitz constants of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"respectively.","element":"span"}],[{"text":"Recall that ","element":"span"},{"style":{"height":11.6},"width":245.76,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-21.png","element":"img","alt":" r : T × X → R","inline":true,"padRight":true},{"text":"is Lipschitz in state, uniformly in time, if a finite positive constant ","element":"span"},{"style":{"height":13.2},"width":43.48,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-22.png","element":"img","alt":" Cr","inline":true,"padRight":true},{"text":"exists such that","element":"span"}],[{"style":{"width":"48%"},"width":766,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-23.png","element":"img"}],[{"text":"As promised, we now prove Theorem ","element":"span"},{"href":"#id-23","text":"3.6","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3.6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"MDPs and policies exist in and under which, for all ","element":"span"},{"style":{"height":16},"width":351.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-24.png","element":"img","alt":" (t, x, a) ∈ T × X × A","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that ","element":"span"},{"style":{"height":20.48},"width":984.52,"height":51.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/18-25.png","element":"img","alt":" Wp(ζπh(t, x, a), ηπ(t, x)) ≳ h1/2 and distgapp(ζπh, t, x) ≳ h1/2.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":17.68},"width":1407.32,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-0.png","element":"img","alt":" X = R and A = {0, 1}; set b ≡ 0 and σ = 1{a=1}; for all (s, y) ∈ T×X, let r(s, y) = y;","inline":true,"padRight":true},{"text":"and set ","element":"span"},{"style":{"height":14},"width":104.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-1.png","element":"img","alt":" f ≡ 0","inline":true},{"text":". In words, our action space has two elements, and when action ","element":"span"},{"text":"1 ","element":"span"},{"text":"is executed, the system follows Brownian dynamics, and otherwise, the state is fixed. 
Now consider the policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-2.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"which always selects the action ","element":"span"},{"style":{"height":16},"width":655.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-3.png","element":"img","alt":" 0: π(· | s, y) = δ0, for all (s, y) ∈ T × X.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Case 1: ","element":"span"},{"style":{"height":14.4},"width":311.88,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-4.png","element":"img","alt":" γ = 1 and T < ∞.","inline":true,"padRight":true},{"text":"Observe that","element":"span"}],[{"style":{"width":"60%"},"width":956,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.63},"width":134.2,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-6.png","element":"img","alt":" ( ˜Bs)s≥0","inline":true,"padRight":true},{"text":"is a Brownian motion. 
Hence, ","element":"span"},{"style":{"height":16.51},"width":400.56,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-7.png","element":"img","alt":" Zπh(t, x, 1) − Zπh(t, x, 0)","inline":true,"padRight":true},{"text":"is equal in law to the sum ","element":"span"},{"text":"of two zero mean Gaussian random variables with variances ","element":"span"},{"style":{"height":17.39},"width":178.24,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-8.png","element":"img","alt":" σ21 = h3/3","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.39},"width":334.76,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-9.png","element":"img","alt":" σ22 = (T − t − h)2h","inline":true,"padRight":true},{"text":"respectively. And so, it is also Gaussian, and its variance is ","element":"span"},{"style":{"height":17.9},"width":415.28,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-10.png","element":"img","alt":" σ2h = σ21 + σ22 + 2cσ1σ2","inline":true,"padRight":true},{"text":"for some ","element":"span"},{"style":{"height":13.2},"width":194.48,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-11.png","element":"img","alt":"−1 ≤ c ≤ 1","inline":true},{"text":". In particular, ","element":"span"},{"style":{"height":18.99},"width":157.84,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-12.png","element":"img","alt":" σph ≳ hp/2","inline":true},{"text":". 
Recall that the central absolute ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"th moment of a Gaussian random ","element":"span"},{"text":"variable is proportional to its standard deviation to the power ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", for all ","element":"span"},{"style":{"height":14},"width":93.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-13.png","element":"img","alt":" p ≥ 1","inline":true},{"text":". Therefore, by (","element":"span"},{"href":"#id-75","text":"B.1","element":"a"},{"text":"), we deduce that ","element":"span"},{"style":{"height":20.48},"width":561.48,"height":51.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-14.png","element":"img","alt":" distgapp(ζπh, t, x) ≳ h1/2, for h < 1","inline":true},{"text":", as desired.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Case 2: ","element":"span"},{"style":{"height":16},"width":431.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-15.png","element":"img","alt":" T ∈ [0, ∞] and γ ∈ (0, 1).","inline":true,"padRight":true},{"text":"Observe that","element":"span"}],[{"style":{"width":"87%"},"width":1383,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-16.png","element":"img"}],[{"text":"for some Brownian motion ","element":"span"},{"style":{"height":19.62},"width":134.2,"height":49.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-17.png","element":"img","alt":" ( ˜Bs)s≥0","inline":true},{"text":". 
We claim that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":") ","element":"span"},{"text":"is a mean zero Gaussian random variable with variance ","element":"span"},{"style":{"height":17.9},"width":177.36,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-18.png","element":"img","alt":" σ2h ≈ h3/3","inline":true},{"text":". Then the concluding argument in the proof of Case 1 also concludes the ","element":"span"},{"text":"proof of this case.","element":"span"}],[{"text":"To prove our claim, first note that","element":"span"}],[{"style":{"width":"85%"},"width":1362,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-19.png","element":"img"}],[{"text":"As","element":"span"}],[{"style":{"width":"89%"},"width":1420,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-20.png","element":"img"}],[{"text":"we see that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":") ","element":"span"},{"text":"has the claimed statistics. Second, observe that ","element":"span"},{"style":{"height":16},"width":505.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-21.png","element":"img","alt":" N(h) = limn→∞ Nn(h) where","inline":true}],[{"style":{"width":"69%"},"width":1096,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-22.png","element":"img"}],[{"text":"(This is simply a Riemann sum approximation of the integral that defines ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":")","element":"span"},{"text":".) 
As the sum of any finite number of jointly Gaussian random variables is a Gaussian random variable, ","element":"span"},{"style":{"height":16},"width":108.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-23.png","element":"img","alt":" Nn(h)","inline":true,"padRight":true},{"text":"is Gaussian. Furthermore, as the limit of a sequence of Gaussian random variables whose sequences of means and variances converge (to finite values) is Gaussian, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":") ","element":"span"},{"text":"is Gaussian, which proves our claim.","element":"span"}],[{"id":"id-27","style":{"width":"70%"},"width":1112,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-24.png","element":"img"}],[{"text":"First, we prove that the mean of every ","element":"span"},{"style":{"height":16.51},"width":1004.08,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-25.png","element":"img","alt":" ψπh(t, x, a) ∈ D(ζπh(t, x, a), ηπ(t, x)) is Qπh(t, x, a)−V π(t, x).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Lemma B.4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ψ","element":"span"},{"style":{"height":16.53},"width":575.6,"height":41.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-26.png","element":"img","alt":"πh(t, x, a) ∈ D(ζπh(t, x, a), ηπ(t, x))","inline":true},{"style":{"fontStyle":"italic"},"text":", then its mean is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":16.51},"width":344.6,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-27.png","element":"img","alt":"πh(t, x, a) − V π(t, x).","inline":true}],[{"style":{"width":"82%"},"width":1303,"height":321,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/19-28.png","element":"img"}],[{"text":"as desired.","element":"span"}],[{"text":"Second, we prove Theorem ","element":"span"},{"href":"#id-84","text":"4.3","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 4.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":206.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-0.png","element":"img","alt":" κ ∈ C (µ, µ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":16},"width":180.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-1.png","element":"img","alt":" µ ∈ P(R)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
The push-forward of ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-2.png","element":"img","alt":" κ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-3.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the delta at zero, ","element":"span"},{"style":{"height":16.4},"width":171.2,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-4.png","element":"img","alt":" ∆#κ = δ0","inline":true},{"style":{"fontStyle":"italic"},"text":", if and only if ","element":"span"},{"style":{"height":15.6},"width":154.04,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-5.png","element":"img","alt":" κ is a Wp","inline":true},{"style":{"fontStyle":"italic"},"text":"-optimal coupling, for some ","element":"span"},{"style":{"height":16},"width":173.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-6.png","element":"img","alt":" p ∈ [1, ∞)","inline":true},{"style":{"fontStyle":"italic"},"text":". Moreover, there is only one such coupling. 
It is given by ","element":"span"},{"style":{"height":16.8},"width":269.64,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-7.png","element":"img","alt":" κµ := (id, id)#µ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"or, equivalently, ","element":"span"},{"style":{"height":19.71},"width":460.32,"height":49.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-8.png","element":"img","alt":" κµ := (F −1µ , F −1µ )#U(0, 1).","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Here ","element":"span"},{"text":"U","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the uniform distribution on ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1]","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"First, we establish that there is only one ","element":"span"},{"style":{"height":15.6},"width":54.64,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-9.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"optimal coupling between a given ","element":"span"},{"style":{"height":16},"width":176.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-10.png","element":"img","alt":" µ ∈ P(R)","inline":true,"padRight":true},{"text":"and itself, for every ","element":"span"},{"style":{"height":16},"width":182.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-11.png","element":"img","alt":" p ∈ [1, ∞).","inline":true}],[{"id":"id-85","style":{"fontWeight":"bold"},"text":"Lemma B.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":176.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-12.png","element":"img","alt":" µ ∈ P(R)","inline":true},{"style":{"fontStyle":"italic"},"text":". There is only one ","element":"span"},{"style":{"height":15.6},"width":54.64,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-13.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"optimal coupling between ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-14.png","element":"img","alt":" µ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and itself, for every ","element":"span"},{"style":{"height":16.8},"width":724.32,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-15.png","element":"img","alt":"p ∈ [1, ∞). It is κµ := (id, id)#µ ∈ C (µ, µ).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":202.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-16.png","element":"img","alt":" κ ∈ C (µ, µ)","inline":true},{"text":", and suppose there exists ","element":"span"},{"style":{"height":12},"width":253.08,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-17.png","element":"img","alt":" ϵ > 0 for which","inline":true}],[{"style":{"width":"97%"},"width":1547,"height":231,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-18.png","element":"img"}],[{"text":"Hence, ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-19.png","element":"img","alt":" κ","inline":true},{"text":", as considered, is not optimal. Since ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-20.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"was arbitrary, it follows that a ","element":"span"},{"style":{"height":15.58},"width":54.64,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-21.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"optimal coupling is concentrated on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". 
Therefore, every optimal coupling is of the form ","element":"span"},{"style":{"height":16.78},"width":158.08,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-22.png","element":"img","alt":" (id, id)#ν","inline":true,"padRight":true},{"text":"for some ","element":"span"},{"style":{"height":16},"width":174.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-23.png","element":"img","alt":"ν ∈ P(R)","inline":true},{"text":". As the marginals of such a coupling are ","element":"span"},{"style":{"height":11.6},"width":120.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-24.png","element":"img","alt":" ν and ν","inline":true},{"text":", we deduce that ","element":"span"},{"style":{"height":10},"width":99.36,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-25.png","element":"img","alt":" ν = µ","inline":true},{"text":", as desired.","element":"span"}],[{"style":{"width":"84%"},"width":1335,"height":163,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-26.png","element":"img"}],[{"text":"where, again, ","element":"span"},{"style":{"height":16.78},"width":267.04,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-27.png","element":"img","alt":" κµ = (id, id)#µ","inline":true},{"text":". Hence, ","element":"span"},{"style":{"height":11.98},"width":126.48,"height":29.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-28.png","element":"img","alt":" κ = κµ","inline":true},{"text":", by Lemma ","element":"span"},{"href":"#id-85","text":"B.5","element":"a"},{"text":". 
On the other hand, since ","element":"span"},{"style":{"height":11.98},"width":41.96,"height":29.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-29.png","element":"img","alt":" κµ","inline":true,"padRight":true},{"text":"is concentrated on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", for any bounded, continuous function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":", we see that","element":"span"}],[{"style":{"width":"86%"},"width":1365,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-30.png","element":"img"}],[{"text":"In turn, ","element":"span"},{"style":{"height":16.38},"width":192.56,"height":40.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-31.png","element":"img","alt":" ∆#κµ = δ0","inline":true},{"text":", as desired.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark B.6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the hypotheses of Theorem ","element":"span"},{"href":"#id-24","style":{"fontStyle":"italic"},"text":"3.7","element":"a"},{"style":{"fontStyle":"italic"},"text":", we see that the ","element":"span"},{"style":{"height":16},"width":45.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-32.png","element":"img","alt":"1/2","inline":true},{"style":{"fontStyle":"italic"},"text":"-rescaled superiority distributions at any ","element":"span"},{"style":{"height":16},"width":344.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-33.png","element":"img","alt":" (t, x, a) for h ∈ (0, 1]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"form a family of probability measures with uniformly bounded second moments. Hence, this family is tight. So, up to subsequences, these rescalings converge to a limiting probability measure as ","element":"span"},{"style":{"height":14},"width":86.6,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-34.png","element":"img","alt":" h ↓ 0","inline":true},{"style":{"fontStyle":"italic"},"text":". An interesting open question is whether or not these subsequential limits are the same.","element":"span"}],[{"text":"Third, we prove Theorems ","element":"span"},{"href":"#id-47","text":"4.5 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-48","text":"4.6","element":"a"},{"text":". 
The proofs of these theorems are a consequence of the following expression for the ","element":"span"},{"style":{"height":15.6},"width":54.64,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-35.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"distance between ","element":"span"},{"style":{"height":16},"width":456.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-36.png","element":"img","alt":" µ and ν when µ, ν ∈ P(R):","inline":true}],[{"style":{"width":"43%"},"width":685,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-37.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 4.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"MDPs and policies exist satisfying Assumptions ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"2.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-5","style":{"fontStyle":"italic"},"text":"2.3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"in and under which, for all ","element":"span"},{"style":{"height":16},"width":237.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-38.png","element":"img","alt":" (t, x) ∈ T × X","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that ","element":"span"},{"style":{"height":21.1},"width":468.8,"height":52.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-39.png","element":"img","alt":" distgapp(ψπh;q, t, x) ≳ h1/2−q.","inline":true}],[{"style":{"width":"100%"},"width":1587,"height":370,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/20-40.png","element":"img"}],[{"text":"Thus, taking the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"th root of both sides of the above equality, we deduce 
that","element":"span"}],[{"style":{"width":"71%"},"width":1131,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-0.png","element":"img"}],[{"text":"The example presented in Theorem ","element":"span"},{"href":"#id-23","text":"3.6 ","element":"a"},{"text":"is such that ","element":"span"},{"style":{"height":18.98},"width":578.68,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-1.png","element":"img","alt":" Wp(ζπh(t, x, a1), ζπh(t, x, a2)) ≳ h1/2","inline":true},{"text":". Whence, ","element":"span"},{"style":{"height":21.1},"width":688.4,"height":52.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-2.png","element":"img","alt":"Wp(ψπh;q(t, x, a1), ψπh;q(t, x, a2)) ≳ h1/2−q.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Theorem 4.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"2.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-5","style":{"fontStyle":"italic"},"text":"2.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is Lipschitz in state, uniformly in time, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is Lipschitz, and ","element":"span"},{"style":{"height":21.1},"width":1054.56,"height":52.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-3.png","element":"img","alt":" T < ∞, then distgapp(ψπh;q, t, x) ≲ h1/2−q, for all (t, x) ∈ T × X.","inline":true}],[{"text":"The proof of Theorem ","element":"span"},{"href":"#id-48","text":"4.6 ","element":"a"},{"text":"is almost identical to the proof of Theorem 
","element":"span"},{"href":"#id-47","text":"4.5","element":"a"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Arguing as in the proof of Theorem ","element":"span"},{"href":"#id-47","text":"4.5","element":"a"},{"text":", we see that","element":"span"}],[{"style":{"width":"71%"},"width":1131,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-4.png","element":"img"}],[{"text":"By Theorem ","element":"span"},{"href":"#id-24","text":"3.7","element":"a"},{"text":", it follows that ","element":"span"},{"style":{"height":21.1},"width":688.4,"height":52.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-5.png","element":"img","alt":" Wp(ψπh;q(t, x, a1), ψπh;q(t, x, a2)) ≲ h1/2−q.","inline":true}],[{"text":"Finally, we prove Theorem ","element":"span"},{"href":"#id-50","text":"4.8","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 4.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":11.58},"width":38.6,"height":28.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-6.png","element":"img","alt":" ρβ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a distortion risk measure, ","element":"span"},{"style":{"height":14},"width":96.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-7.png","element":"img","alt":" q ≥ 0","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h > ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
If ","element":"span"},{"style":{"height":16.78},"width":298.72,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-8.png","element":"img","alt":" ρβ(ηπ(t, x)) < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":18.93},"width":979.8,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-9.png","element":"img","alt":"arg maxa∈A ρβ(ψπh;q(t, x, a)) = arg maxa∈A ρβ(ζπh(t, x, a)).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Observe that","element":"span"}],[{"style":{"width":"64%"},"width":1023,"height":390,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-10.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16.78},"width":202.32,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-11.png","element":"img","alt":" ρβ(ηπ(t, x))","inline":true,"padRight":true},{"text":"is independent of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h > ","element":"span"},{"text":"0","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"76%"},"width":1219,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-12.png","element":"img"}]]},{"heading":"C Algorithms and Pseudocode","paragraphs":[[{"text":"Here we discuss methods for policy optimization via distributional superiority. 
In practice, computers operate at a finite frequency—as such, all policies we consider here will be assumed to apply each action for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"units of time, as in the settings of [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":16,"text":"16","element":"a"},{"text":"].","element":"span"}],[{"text":"Before describing the superiority learning algorithms, we first remark on the form of exploration policies used in our approaches. We consider policies of the form","element":"span"}],[{"style":{"width":"24%"},"width":384,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-13.png","element":"img"}],[{"text":"The first argument to such a policy is a vector of “action-values”. In our case, given a superiority distribution ","element":"span"},{"style":{"height":18.91},"width":186.4,"height":47.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-14.png","element":"img","alt":" ψπh;q(t, x, ·)","inline":true},{"text":", the action values may be ","element":"span"},{"style":{"height":18.91},"width":358.72,"height":47.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-15.png","element":"img","alt":" (ρβ(ψπh;q(t, x, a)))a∈A","inline":true,"padRight":true},{"text":"for a distortion risk measure ","element":"span"},{"style":{"height":11.6},"width":38.6,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-16.png","element":"img","alt":"ρβ","inline":true},{"text":". This generalizes the notion of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-values for distributional learning.","element":"span"}],[{"text":"The second argument represents a noise variable, in order to support stochastic policies. 
As input to our algorithms, we require a probability measure ","element":"span"},{"style":{"height":13.18},"width":66.4,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-17.png","element":"img","alt":" Pact","inline":true,"padRight":true},{"text":"on processes ","element":"span"},{"style":{"height":16.78},"width":114.16,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-18.png","element":"img","alt":" (ϵt)t≥0","inline":true},{"text":". This framework generalizes common exploration methods in deep RL. For example, to recover ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-19.png","element":"img","alt":" ϵ","inline":true},{"text":"-greedy policies, let ","element":"span"},{"style":{"height":13.2},"width":66.44,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-20.png","element":"img","alt":" Pact","inline":true,"padRight":true},{"text":"represent a 2-dimensional white noise, and","element":"span"}],[{"style":{"width":"72%"},"width":1147,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/21-21.png","element":"img"}],[{"text":"Alternatively, one might choose to correlate the action noise. Approaches such as DDPG [","element":"span"},{"href":"#id-86","referenceIndex":21,"text":"21","element":"a"},{"text":"] and DAU present such examples, where action noise evolves over time as an Ornstein-Uhlenbeck process. 
This can be implemented in the framework above by choosing ","element":"span"},{"style":{"height":13.2},"width":66.4,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-0.png","element":"img","alt":" Pact","inline":true,"padRight":true},{"text":"as the distribution of an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"text":"A","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"text":"-dimensional Ornstein-Uhlenbeck process, and defining","element":"span"}],[{"style":{"width":"67%"},"width":1067,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-1.png","element":"img"}],[{"text":"In our experiments, we found that ϵ","element":"span"},{"text":"-greedy exploration was sufficient for approximate policy optimization. To stay as close as possible to the QR-DQN baseline, our implementations therefore use the definition of ","element":"span"},{"style":{"height":13.38},"width":124,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-2.png","element":"img","alt":" πexplore ","inline":true,"padRight":true},{"text":"from equation","element":"span"}],[{"text":"The remainder of this section details the implementation of DSUP(","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":") and DAU+DSUP(","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":") as introduced in Section ","element":"span"},{"href":"#id-87","text":"4.2","element":"a"},{"text":". 
","element":"span"},{"text":"Additionally, we provide source code for our implementations at","element":"span"}],[{"style":{"width":"71%"},"width":1139,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Distributional Superiority ","element":"span"},{"text":"Generally, our goal is to learn a ","element":"span"},{"style":{"height":11.6},"width":38.6,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-4.png","element":"img","alt":" ρβ","inline":true},{"text":"-greedy policy (see Definition ","element":"span"},{"href":"#id-88","text":"4.7","element":"a"},{"text":"). Since all policies apply actions for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"units of time, in order to satisfy Axiom ","element":"span"},{"href":"#id-26","text":"2","element":"a"},{"text":", we want","element":"span"}],[{"style":{"width":"19%"},"width":310,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.78},"width":574.32,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-6.png","element":"img","alt":" π : x ↦ arg maxa ρβ(Sπh(t, x, a))","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"height":11.58},"width":38.6,"height":28.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-7.png","element":"img","alt":" ρβ","inline":true},{"text":"-greedy policy. 
As such, following the DAU ","element":"span"},{"text":"algorithm of [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"], our algorithms will model quantile functions ","element":"span"},{"style":{"height":16},"width":253.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-8.png","element":"img","alt":" ϕ(t, x, a) ∈ Rm ","inline":true,"padRight":true},{"text":"that aim to satisfy","element":"span"}],[{"style":{"width":"82%"},"width":1305,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-9.png","element":"img"}],[{"text":"Our primary algorithmic contribution integrates such a model, with proper superiority rescaling, into the QR-DQN framework of [","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"10","element":"a"},{"text":"] for estimating action-conditioned return distributions. It is outlined in Algorithm ","element":"span"},{"href":"#id-89","text":"2","element":"a"},{"text":".","element":"span"}],[{"text":"To deal with the increased time-resolution of transitions, [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"] modified the step size by a factor of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":". More recently, [","element":"span"},{"href":"#id-90","referenceIndex":3,"text":"3","element":"a"},{"text":"] found that subsampling transitions before storing them in the replay buffer was most effective in their high-decision-frequency domain. Thus, we opt for such a strategy here. 
Rather than only storing every ","element":"span"},{"style":{"height":13.39},"width":63.88,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-10.png","element":"img","alt":" h−1","inline":true,"padRight":true},{"text":"transitions, we randomly select transitions to store according to independent ","element":"span"},{"text":"Bernoulli","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":") ","element":"span"},{"text":"draws to avoid capturing only cyclic phenomena in the replay buffer. We found that this strategy performed similarly to that of [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"] but was far less computationally expensive. Likewise, as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"decreases, we extend the number of training interactions by a factor of ","element":"span"},{"style":{"height":13.39},"width":63.84,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-11.png","element":"img","alt":" h−1","inline":true},{"text":". 
This corresponds to training for a constant amount of time units across decision frequencies, and likewise, a constant number of gradient updates across decision frequencies.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Two-Timescale Advantage-Shifted Distributional Superiority","element":"span"}],[{"text":"An astute reader might recognize that while ","element":"span"},{"style":{"height":17.71},"width":68.64,"height":44.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-12.png","element":"img","alt":" ψπh;q","inline":true,"padRight":true},{"text":"may have nonzero distributional action gap, since ","element":"span"},{"style":{"height":10.8},"width":37.96,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-13.png","element":"img","alt":"hq","inline":true,"padRight":true},{"text":"is asymptotically larger than ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q < ","element":"span"},{"text":"1","element":"span"},{"text":", the works of [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":16,"text":"16","element":"a"},{"text":"] would suggest that the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"expected ","element":"span"},{"text":"action gap under ","element":"span"},{"style":{"height":17.71},"width":68.64,"height":44.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-14.png","element":"img","alt":" ψπh;q ","inline":true,"padRight":true},{"text":"should vanish. To account for this, we propose shifting the rescaled superiority ","element":"span"},{"text":"quantiles by the advantage function, as estimated e.g. 
in DAU [","element":"span"},{"href":"#id-2","referenceIndex":34,"text":"34","element":"a"},{"text":"]. It is clear that such a procedure cannot cause the distributional action gap to vanish. The resulting procedure is depicted in Algorithm ","element":"span"},{"href":"#id-91","text":"3","element":"a"},{"text":", with the modifications relative to Algorithm ","element":"span"},{"href":"#id-89","text":"2 ","element":"a"},{"text":"highlighted in ","element":"span"},{"text":"blue","element":"span"},{"text":". In practice, we employ a shared feature extractor in the representations of ","element":"span"},{"style":{"height":13.98},"width":48.88,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-15.png","element":"img","alt":" Ah","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-16.png","element":"img","alt":" ϕ","inline":true,"padRight":true},{"text":"to reap the representation learning benefits of DRL [","element":"span"},{"href":"#id-34","referenceIndex":4,"text":"4","element":"a"},{"text":"] when approximating the advantage.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Influence of the Rescaling Factor","element":"span"}],[{"text":"The algorithms ","element":"span"},{"href":"#id-89","text":"2 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-91","text":"3 ","element":"a"},{"text":"are parameterized by a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"rescaling factor ","element":"span"},{"style":{"height":16},"width":170.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-17.png","element":"img","alt":" q ∈ (0, 1]","inline":true},{"text":", which is meant to compensate for the collapse of the distributional action gap. 
Larger values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"correspond to larger compensation. In this work, we argue that the distributional action gap collapses at rate ","element":"span"},{"style":{"height":17.78},"width":199.08,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-18.png","element":"img","alt":" h1/2, leading","inline":true,"padRight":true},{"text":"to a natural choice of ","element":"span"},{"style":{"height":16},"width":117.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-19.png","element":"img","alt":" q = 1/2","inline":true},{"text":", which theoretically preserves constant order action gaps with respect to the decision frequency. We also test the approach with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"= 1","element":"span"},{"text":", which corresponds to the well-known scaling rate for preserving expected value action gaps.","element":"span"}],[{"text":"For any ","element":"span"},{"style":{"height":16},"width":119.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-20.png","element":"img","alt":" q > 1/2","inline":true},{"text":", the distributional action gap theoretically grows without bound as ","element":"span"},{"style":{"height":14},"width":87.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-21.png","element":"img","alt":" h ↓ 0","inline":true},{"text":", leading to distributional estimates with arbitrarily large variance. 
On the other hand, for any ","element":"span"},{"style":{"height":16},"width":122.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-22.png","element":"img","alt":" q < 1/2","inline":true},{"text":", the distributional action gap decays to ","element":"span"},{"style":{"height":14},"width":158.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/22-23.png","element":"img","alt":" 0 as h ↓ 0","inline":true},{"text":", which makes it difficult to identify the best actions in the presence of approximation error.","element":"span"}],[{"id":"id-89","style":{"width":"100%"},"width":1591,"height":2092,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/23-0.png","element":"img"}],[{"id":"id-91","style":{"width":"100%"},"width":1591,"height":2654,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/24-0.png","element":"img"}]]},{"heading":"D Additional Experimental Results","paragraphs":[[{"text":"Figure ","element":"span"},{"href":"#id-92","text":"D.1 ","element":"a"},{"text":"includes results comparing the performance of DSUP(","element":"span"},{"style":{"height":16},"width":45.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/25-0.png","element":"img","alt":"1/2","inline":true},{"text":") and QR-DQN for risk-sensitive option trading across a variety of decision frequencies.","element":"span"}],[{"id":"id-92","style":{"width":"99%"},"width":1584,"height":1674,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/25-1.png","element":"img"}],[{"text":"Figure D.1: Risk-sensitive option trading performance for various decision frequencies ","element":"figcaption","subtype":"caption"},{"style":{"height":7.2},"width":36.24,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/25-2.png","element":"img","alt":" ω.","inline":true}]]},{"heading":"E Simulation Details","paragraphs":[[{"text":"Here we collect further information 
about the setup for the simulations described in Section ","element":"span"},{"href":"#id-51","text":"5.2","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Option Trading Environment","element":"span"}],[{"text":"The environment used for the high-frequency option-trading setup is identical to that of Lim and Malik [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":"]. The environment emulates policies that decide when to exercise American call options. The state space is modeled as ","element":"span"},{"text":"X ","element":"span"},{"text":"= ","element":"span"},{"text":"R ","element":"span"},{"text":"and with ","element":"span"},{"text":"T ","element":"span"},{"text":"= [0","element":"span"},{"style":{"fontStyle":"italic"},"text":", T","element":"span"},{"text":"]","element":"span"},{"text":". Notably, existing works such as [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":"] and [","element":"span"},{"href":"#id-35","referenceIndex":17,"text":"17","element":"a"},{"text":"] describe the state space by ","element":"span"},{"style":{"height":14.64},"width":127.8,"height":36.6,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/25-3.png","element":"img","alt":" X = R2","inline":true},{"text":", where one dimension represents time—in our setup, we generally condition policies and returns on time, so we indeed model policies and returns as functions on ","element":"span"},{"style":{"height":14.64},"width":44.76,"height":36.6,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/25-4.png","element":"img","alt":" R2","inline":true},{"text":". 
The state ","element":"span"},{"style":{"height":13.2},"width":45,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/25-5.png","element":"img","alt":" Xt","inline":true,"padRight":true},{"text":"represents the price of the option at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", and evolves according to a geometric Brownian motion.","element":"span"}],[{"text":"There are two actions in the environment. Action ","element":"span"},{"text":"0 ","element":"span"},{"text":"“holds” the option, while action ","element":"span"},{"text":"1 ","element":"span"},{"text":"represents “execute”. Upon taking action ","element":"span"},{"text":"1 ","element":"span"},{"text":"(or equivalently, once the time reaches ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"), the option is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"executed","element":"span"},{"text":", and the agent receives a reward ","element":"span"},{"style":{"height":16},"width":365.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/26-0.png","element":"img","alt":" f(x) = max(0, 1 − x)","inline":true,"padRight":true},{"text":"and the episode terminates; here ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"represents the price at the time of execution. No rewards are incurred otherwise.","element":"span"}],[{"text":"Following the setup of [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":"], the dynamics of the prices are simulated based on data collected between years 2016 and 2019 from 10 commodities on the DOW market. Lim and Malik proposed a method for estimating the most likely parameters of geometric Brownian motion to fit the data for each commodity, which is then used to simulate many environment rollouts for training and evaluation. 
This is particularly convenient for our setup, where we additionally scale the decision frequency, corresponding to finer time discretizations of the Euler-Maruyama scheme for the estimated geometric Brownian motion. As in [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":"], separate dynamics parameters are estimated (for each commodity) for training and evaluation: the dynamics used for training are estimated on prefixes of the data, and those for testing (post-training) are estimated on suffixes of the data. Results are reported on the testing dynamics, averaged over the 10 commodities.","element":"span"}],[{"text":"As is standard [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":17,"text":"17","element":"a"},{"text":"], we simulate the environment with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= 100 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":13.18},"width":125.68,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/26-1.png","element":"img","alt":" X0 = 1","inline":true},{"text":". The simulations from [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":"] correspond to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"in this setting. 
In our high-frequency simulations, we discretize the dynamics with timestep ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h < ","element":"span"},{"text":"1","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Hyperparameters","element":"span"}],[{"text":"In Table ","element":"span"},{"href":"#id-93","text":"1","element":"a"},{"text":", we list the hyperparameters used in the simulations for the tested algorithms. We note that although the original DAU implementation scaled the learning rate with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":", we take an alternative approach by updating every ","element":"span"},{"style":{"height":13.39},"width":63.88,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/26-2.png","element":"img","alt":" h−1","inline":true,"padRight":true},{"text":"environment steps akin to [","element":"span"},{"href":"#id-90","referenceIndex":3,"text":"3","element":"a"},{"text":"]. This is discussed in more detail in Appendix ","element":"span"},{"text":"C","element":"span"},{"text":".","element":"span"}],[{"text":"Table 1: Hyperparameters","element":"figcaption","subtype":"caption"}],[{"id":"id-93","style":{"width":"97%"},"width":1542,"height":836,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96191/images/26-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"E.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Compute Resources","element":"span"}],[{"text":"Our implementations are written in JAX [","element":"span"},{"href":"#id-94","referenceIndex":6,"text":"6","element":"a"},{"text":"] and executed with a single NVIDIA V100 GPU. At the highest decision frequencies, experiments took longer to execute, averaging at most roughly four hours.","element":"span"}]]},{"heading":"NeurIPS Paper Checklist","paragraphs":[[{"text":"1. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Claims","element":"span"}],[{"text":"Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We believe our abstract and introduction outline and faithfully summarize the content of our work.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the abstract and introduction do not include the claims made in the paper.","element":"span"}],[{"text":"• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.","element":"span"}],[{"text":"• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.","element":"span"}],[{"text":"• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Limitations ","element":"span"},{"text":"Question: Does the paper discuss the limitations of the work performed by the authors? Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We have included discussions of the limitations and scope of our results throughout our work, rather than within a specific “Limitations” section. 
See, for example, Section ","element":"span"},{"text":"5","element":"span"},{"text":".","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.","element":"span"}],[{"text":"• The authors are encouraged to create a separate \"Limitations\" section in their paper.","element":"span"}],[{"text":"• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.","element":"span"}],[{"text":"• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.","element":"span"}],[{"text":"• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. 
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.","element":"span"}],[{"text":"• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.","element":"span"}],[{"text":"• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.","element":"span"}],[{"text":"• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.","element":"span"}],[{"text":"3. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Theory Assumptions and Proofs","element":"span"}],[{"text":"Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We have provided complete statements, proofs, and references to the results used in the proofs of our theoretical results. Please see the Appendix for restatements and proofs. 
Statements can be found in the main body of our work.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include theoretical results.","element":"span"}],[{"text":"• All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.","element":"span"}],[{"text":"• All assumptions should be clearly stated or referenced in the statement of any theorems.","element":"span"}],[{"text":"• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.","element":"span"}],[{"text":"• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.","element":"span"}],[{"text":"• Theorems and Lemmas that the proof relies upon should be properly referenced.","element":"span"}],[{"text":"4. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Result Reproducibility","element":"span"}],[{"text":"Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We believe we have included enough information that others can faithfully reproduce our main experimental results. 
Please see Sections ","element":"span"},{"text":"4 ","element":"span"},{"text":"and ","element":"span"},{"text":"5 ","element":"span"},{"text":"and the Appendix.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.","element":"span"}],[{"text":"• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.","element":"span"}],[{"text":"• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.","element":"span"}],[{"text":"• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. 
For example","element":"span"}],[{"text":"(a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.","element":"span"}],[{"text":"(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.","element":"span"}],[{"text":"(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).","element":"span"}],[{"text":"(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.","element":"span"}],[{"text":"5. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Open access to data and code","element":"span"}],[{"text":"Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We have provided data, code, and experimental details. We believe we have included enough information that others can faithfully reproduce our main experimental results. 
Please see Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"and the Appendix.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments requiring code.","element":"span"}],[{"text":"• Please see the NeurIPS code and data submission guidelines (","element":"span"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"https://nips.cc/public/guides/CodeSubmissionPolicy","element":"a"},{"text":") for more details.","element":"span"}],[{"text":"• While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).","element":"span"}],[{"text":"• The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (","element":"span"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"https://nips.cc/public/guides/CodeSubmissionPolicy","element":"a"},{"text":") for more details.","element":"span"}],[{"text":"• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.","element":"span"}],[{"text":"• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. 
If only a subset of experiments is reproducible, they should state which ones are omitted from the script and why.","element":"span"}],[{"text":"• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).","element":"span"}],[{"text":"• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.","element":"span"}],[{"text":"6. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Setting/Details","element":"span"}],[{"text":"Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We provided these details in Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"and the Appendix.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.","element":"span"}],[{"text":"• The full details can be provided either with the code, in the appendix, or as supplemental material.","element":"span"}],[{"text":"7. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiment Statistical Significance","element":"span"}],[{"text":"Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: When appropriate, we have included error bars and information regarding the statistical significance of our experiments. Please see Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"and the Appendix.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• The authors should answer \"Yes\" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.","element":"span"}],[{"text":"• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).","element":"span"}],[{"text":"• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)","element":"span"}],[{"text":"• The assumptions made should be given (e.g., Normally distributed errors).","element":"span"}],[{"text":"• It should be clear whether the error bar is the standard deviation or the standard error of the mean.","element":"span"}],[{"text":"• It is OK to report 1-sigma error bars, but one should state it. 
The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.","element":"span"}],[{"text":"• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).","element":"span"}],[{"text":"• If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.","element":"span"}],[{"text":"8. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiments Compute Resources","element":"span"}],[{"text":"Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We have included these details in Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"and the Appendix.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.","element":"span"}],[{"text":"• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.","element":"span"}],[{"text":"• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).","element":"span"}],[{"text":"9. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Code Of Ethics","element":"span"}],[{"text":"Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics ","element":"span"},{"href":"https://neurips.cc/public/EthicsGuidelines","style":{"fontFamily":"monospace"},"text":"https://neurips.cc/public/EthicsGuidelines","element":"a"},{"text":"?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We have reviewed the NeurIPS Code of Ethics and believe our work conforms to it in every respect.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.","element":"span"}],[{"text":"• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.","element":"span"}],[{"text":"• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).","element":"span"}],[{"text":"10. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Broader Impacts","element":"span"}],[{"text":"Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA]","element":"span"}],[{"text":"Justification: We believe our work is foundational research without a direct path to negative societal impact.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that there is no societal impact of the work performed.","element":"span"}],[{"text":"• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.","element":"span"}],[{"text":"• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.","element":"span"}],[{"text":"• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. 
On the other hand, it is not necessary to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.","element":"span"}],[{"text":"• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.","element":"span"}],[{"text":"• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).","element":"span"}],[{"text":"11. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Safeguards","element":"span"}],[{"text":"Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA]","element":"span"}],[{"text":"Justification: We believe our work poses no risk of misuse.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper poses no such risks.","element":"span"}],[{"text":"• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.","element":"span"}],[{"text":"• Datasets that have been scraped from the Internet could pose safety risks. 
The authors should describe how they avoided releasing unsafe images.","element":"span"}],[{"text":"• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a good-faith effort.","element":"span"}],[{"text":"12. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Licenses for existing assets","element":"span"}],[{"text":"Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: Data for the option-trading environment was downloaded from an open-source GitHub repository, and we cited the work that introduced the dataset [","element":"span"},{"href":"#id-33","referenceIndex":22,"text":"22","element":"a"},{"text":"].","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not use existing assets.","element":"span"}],[{"text":"• The authors should cite the original paper that produced the code package or dataset.","element":"span"}],[{"text":"• The authors should state which version of the asset is used and, if possible, include a URL.","element":"span"}],[{"text":"• The name of the license (e.g., CC-BY 4.0) should be included for each asset.","element":"span"}],[{"text":"• For scraped data from a particular source (e.g., a website), the copyright and terms of service of that source should be provided.","element":"span"}],[{"text":"• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"paperswithcode.com/datasets ","element":"span"},{"text":"has curated licenses for some datasets. 
Their licensing guide can help determine the license of a dataset.","element":"span"}],[{"text":"• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.","element":"span"}],[{"text":"• If this information is not available online, the authors are encouraged to reach out to the asset’s creators.","element":"span"}],[{"text":"13. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"New Assets","element":"span"}],[{"text":"Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA]","element":"span"}],[{"text":"Justification: Our work does not release new assets.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not release new assets.","element":"span"}],[{"text":"• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.","element":"span"}],[{"text":"• The paper should discuss whether and how consent was obtained from people whose asset is used.","element":"span"}],[{"text":"• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.","element":"span"}],[{"text":"14. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Crowdsourcing and Research with Human Subjects","element":"span"}],[{"text":"Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA]","element":"span"}],[{"text":"Justification: Our work involves neither crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper involves neither crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.","element":"span"}],[{"text":"• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.","element":"span"}],[{"text":"15. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects","element":"span"}],[{"text":"Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA]","element":"span"}],[{"text":"Justification: Our work involves neither crowdsourcing nor research with human subjects.","element":"span"}],
[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper involves neither crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.","element":"span"}],[{"text":"• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.","element":"span"}],[{"text":"• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]