36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"1812.01552","publisher":"arxiv","paperJSON":{"title":"Exploration versus exploitation in reinforcement learning: a stochastic control approach","paperID":"1812.01552","avgLineHeight":12,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a revitalization of the classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear–quadratic (LQ) setting and deduce that the optimal feedback control distribution for balancing exploitation and exploration is Gaussian. ","element":"span"},{"text":"This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, the exploitation and exploration are captured, respectively and mutually exclusively, by the mean and variance of the Gaussian distribution. We also find that a more random environment contains more learning opportunities in the sense that less exploration is needed. We characterize the cost of exploration, which, for the LQ case, is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate. Finally, as the weight of exploration decays to zero, we prove the convergence of the solution of the entropy-regularized LQ problem to that of the classical LQ problem.","element":"span"}],[{"text":"Key words. 
","element":"span"},{"text":"Reinforcement learning, exploration, exploitation, entropy regularization, stochastic control, relaxed control, linear–quadratic, Gaussian distribution.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Reinforcement learning (RL) is currently one of the most active and fastest-developing subareas in machine learning. In recent years, it has been successfully applied to solve large-scale, complex, real-world decision-making problems, including playing perfect-information board games such as Go (AlphaGo/AlphaGo Zero, ","element":"span"},{"href":"#id-0","referenceIndex":30,"text":"Silver et al. ","element":"a"},{"href":"#id-0","referenceIndex":30,"text":"(2016)","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":31,"text":"Silver et al. ","element":"a"},{"href":"#id-1","referenceIndex":31,"text":"(2017)","element":"a"},{"text":"), achieving human-level performance in video games ","element":"span"},{"href":"#id-2","referenceIndex":22,"text":"(Mnih et al. ","element":"a"},{"href":"#id-2","referenceIndex":22,"text":"(2015)","element":"a"},{"text":"), and driving autonomously ","element":"span"},{"href":"#id-3","referenceIndex":17,"text":"(Levine et al. ","element":"a"},{"href":"#id-3","referenceIndex":17,"text":"(2016)","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":21,"text":"Mirowski et al. ","element":"a"},{"href":"#id-4","referenceIndex":21,"text":"(2016)","element":"a"},{"text":"). An RL agent does not pre-specify a structural model or a family of models but, instead, gradually learns the best (or near-best) strategies based on trial and error, through interactions with the random (black box) environment and incorporation of the responses of these interactions, in order to improve the overall performance. 
This is a case of “kill two birds with one stone”: ","element":"span"},{"style":{"height":16},"width":963.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/1-0.png","element":"img","alt":" the agent’s actions (controls) serve both as a means to","inline":true,"padRight":true},{"text":"explore (learn) and a way to exploit (optimize)","element":"span"},{"text":".","element":"span"}],[{"text":"Since exploration is inherently costly in terms of resources, time and opportunity, a natural and crucial question in RL is to address the dichotomy between exploration of uncharted territory and exploitation of existing knowledge. Such a question exists in both the stateless RL settings (e.g. the multi-armed bandit problem) and the more general multi-state RL settings (e.g. ","element":"span"},{"href":"#id-5","referenceIndex":35,"text":"Sutton and Barto ","element":"a"},{"href":"#id-5","referenceIndex":35,"text":"(2018)","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":12,"text":"Kaelbling et al. ","element":"a"},{"href":"#id-6","referenceIndex":12,"text":"(1996)","element":"a"},{"text":"). Specifically, the agent must balance greedily exploiting what has been learned so far to choose actions that yield higher near-term rewards against continuously exploring the environment to acquire more information and potentially achieve long-term benefits. 
","element":"span"},{"text":"Extensive studies have been conducted to find strategies for the best trade-off between exploitation and exploration.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/1-1.png","element":"img","alt":"1","inline":true}],[{"text":"However, most of the contributions to balancing exploitation and exploration do not include exploration explicitly as a part of the optimization objective; attention has mainly focused on solving the classical optimization problem maximizing the accumulated rewards, while exploration is typically treated separately as an ","element":"span"},{"text":"ad-hoc ","element":"span"},{"text":"chosen exogenous component, rather than being endogenously ","element":"span"},{"text":"derived ","element":"span"},{"text":"as a part of the solution to the overall RL problem. The recently proposed discrete-time entropy-regularized (also termed “entropy-augmented” or “softmax”) RL formulation, on the other hand, explicitly incorporates exploration into the optimization objective as a regularization term, with a trade-off weight imposed on the entropy of the exploration strategy ","element":"span"},{"href":"#id-7","referenceIndex":39,"text":"(Ziebart et al. ","element":"a"},{"href":"#id-7","referenceIndex":39,"text":"(2008)","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":23,"text":"Nachum et al. ","element":"a"},{"href":"#id-8","referenceIndex":23,"text":"(2017)","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":7,"text":"Fox et al. ","element":"a"},{"href":"#id-9","referenceIndex":7,"text":"(2016)","element":"a"},{"text":"). An exploratory distribution with a greater entropy signifies a higher level of exploration, reflecting a bigger weight on the exploration front. 
At the other extreme, a Dirac measure has minimal entropy and implies no exploration, reducing to the case of classical optimization with complete knowledge of the underlying model. Recent work has been devoted to designing various algorithms to solve the entropy-regularized RL problem, where numerical experiments have demonstrated remarkable robustness and multi-modal policy learning ","element":"span"},{"href":"#id-10","referenceIndex":9,"text":"(Haarnoja et al. ","element":"a"},{"href":"#id-10","referenceIndex":9,"text":"(2017)","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":10,"text":"Haarnoja et al. ","element":"a"},{"href":"#id-11","referenceIndex":10,"text":"(2018)","element":"a"},{"text":").","element":"span"}],[{"text":"In this paper, we study the trade-off between exploration and exploitation for RL in a continuous-time setting with both continuous control (action) and state (feature) spaces.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/2-0.png","element":"img","alt":"2","inline":true,"padRight":true},{"text":"Such a continuous-time formulation is especially appealing if the agent can interact with the environment at ultra-high frequency, examples including high-frequency stock trading, autonomous driving and snowboard riding. 
More importantly, once cast in continuous time, it is possible, thanks in no small measure to the tools of stochastic calculus and differential equations, to derive elegant and insightful results which, in turn, lead to theoretical understanding of some of the fundamental issues in RL, give guidance to algorithm design and provide ","element":"span"},{"text":"interpretability ","element":"span"},{"text":"to the underlying learning technologies.","element":"span"}],[{"text":"Our first main contribution is to propose an ","element":"span"},{"text":"entropy-regularized reward function ","element":"span"},{"text":"involving the differential entropy for exploratory probability distributions over the continuous action space, and motivate and devise an “exploratory formulation” for the state dynamics that captures repetitive learning under exploration in the continuous time limit. Existing theoretical works on exploration mainly concentrate on the analysis at the algorithmic level, including proving convergence of the proposed exploration algorithms to the solutions of the classical optimization problems (see, for example, ","element":"span"},{"href":"#id-12","referenceIndex":32,"text":"Singh et al. ","element":"a"},{"href":"#id-12","referenceIndex":32,"text":"(2000)","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":11,"text":"Jaakkola et al. ","element":"a"},{"href":"#id-13","referenceIndex":11,"text":"(1994)","element":"a"},{"text":"). However, they rarely look into the impact of the exploration on changing significantly the underlying dynamics (e.g. the transition probabilities in the discrete time context). Indeed, exploration not only substantially enriches the space of control strategies (from that of Dirac measures to that of all possible probability distributions) but also, as a result, enormously expands the reachable space of states. 
This, in turn, sets out to change both the underlying state transitions and the system dynamics.","element":"span"}],[{"text":"We show that our exploratory formulation can account for the effects of learning in both the rewards received and the state transitions observed from the interactions with the environment. It, thus, unearths the important characteristics of learning at a more refined and in-depth level, beyond merely devising and analyzing learning algorithms. Intriguingly, the proposed formulation of the state dynamics coincides with that in the ","element":"span"},{"text":"relaxed control ","element":"span"},{"text":"framework in classical control theory (see, for example, ","element":"span"},{"href":"#id-14","referenceIndex":6,"text":"Fleming and Nisio ","element":"a"},{"href":"#id-14","referenceIndex":6,"text":"(1984)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-15","referenceIndex":5,"text":"El Karoui et al. ","element":"a"},{"href":"#id-15","referenceIndex":5,"text":"(1987)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-16","referenceIndex":38,"text":"Zhou ","element":"a"},{"href":"#id-16","referenceIndex":38,"text":"(1992)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":15,"text":"Kurtz and Stockbridge ","element":"a"},{"href":"#id-17","referenceIndex":15,"text":"(1998, ","element":"a"},{"href":"#id-18","referenceIndex":16,"text":"2001)","element":"a"},{"text":"), which was motivated by entirely different reasons. Specifically, relaxed controls were introduced to mainly deal with the ","element":"span"},{"text":"theoretical ","element":"span"},{"text":"question of whether an optimal control exists. The approach essentially entails randomization to convexify the universe of control strategies. 
To the best of our knowledge, the present paper is the first to bring back the formulation of relaxed control, guided by a practical motivation: exploration and learning.","element":"span"}],[{"text":"We then carry out a complete analysis on the continuous-time entropy-regularized RL problem, assuming that the original system dynamics is linear in both the control and the state, and that the original reward function is quadratic in the two. This type of linear–quadratic (LQ) problems has occupied the center stage for research in classical control theory for its elegant solutions and its ability to approximate more general nonlinear problems. One of the most important, conceptual contributions of this paper is to show that the optimal feedback control distribution for balancing exploitation and exploration is ","element":"span"},{"text":"Gaussian","element":"span"},{"text":". Precisely speaking, if, at any given state, the agent sets out to engage in exploration then she needs look no further than Gaussian distributions. As is well known, a pure exploitation optimal distribution is Dirac, and a pure exploration optimal distribution is uniform. Our results reveal that Gaussian is the right choice if one seeks a balance between those two extremes. Moreover, we find that the mean of this optimal exploratory distribution is a function of the current state ","element":"span"},{"text":"independent ","element":"span"},{"text":"of the intended exploration level, whereas the variance is a linear function of the entropy regularizing weight (also called the “temperature parameter” or “exploration weight”) ","element":"span"},{"text":"irrespective ","element":"span"},{"text":"of the current state. 
This result highlights a ","element":"span"},{"text":"separation ","element":"span"},{"text":"between exploitation and exploration: the former is reflected in the mean and the latter in the variance of the optimal Gaussian distribution.","element":"span"}],[{"text":"There is yet another intriguing result. The greater the impact that actions have on the volatility of the underlying dynamic system, the smaller the variance of the optimal Gaussian distribution needs to be. Conceptually, this implies that a more random environment in fact contains more learning opportunities and, hence, is less costly for learning. This theoretical finding provides an interpretation of the recent RL heuristics where injecting noise leads to more effective exploration; see, for example, ","element":"span"},{"href":"#id-19","referenceIndex":19,"text":"Lillicrap et al. ","element":"a"},{"href":"#id-19","referenceIndex":19,"text":"(2016)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-20","referenceIndex":25,"text":"Plappert et al. ","element":"a"},{"href":"#id-20","referenceIndex":25,"text":"(2018)","element":"a"},{"text":".","element":"span"}],[{"text":"Another contribution of the paper is that we establish a direct connection between the solvability of the exploratory LQ problem and that of the classical LQ problem. We prove that as the exploration weight in the former decays to zero, the optimal Gaussian control distribution and its value function converge respectively to the optimal Dirac measure and the value function of the classical LQ problem, a desirable result for practical learning purposes.","element":"span"}],[{"text":"Finally, we observe that, beyond the LQ problems and under proper conditions, the Gaussian distribution remains optimal for a much larger class of control problems, namely, problems with drift and volatility linear in control and reward functions linear or quadratic in control even if the dependence on state is nonlinear. 
Such a family of problems can be seen as the local-linear-quadratic approximation to more general stochastic control problems whose state dynamics are linearized in the control variables and the reward functions are locally approximated by quadratic control functions ","element":"span"},{"href":"#id-21","referenceIndex":37,"text":"(Todorov and Li ","element":"a"},{"href":"#id-21","referenceIndex":37,"text":"(2005)","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":18,"text":"Li and Todorov ","element":"a"},{"href":"#id-22","referenceIndex":18,"text":"(2007)","element":"a"},{"text":"). Note also that although such iterative LQ approximation generally has different parameters at different local state-action pairs, our result on the optimality of Gaussian distribution under the exploratory LQ framework still holds at any local point, and therefore justifies, from a stochastic control perspective, why Gaussian distribution is commonly used in the RL practice for exploration (see, among others, ","element":"span"},{"href":"#id-10","referenceIndex":9,"text":"Haarnoja et al. ","element":"a"},{"href":"#id-10","referenceIndex":9,"text":"(2017)","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":10,"text":"Haarnoja et al. ","element":"a"},{"href":"#id-11","referenceIndex":10,"text":"(2018)","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":24,"text":"Nachum et al. ","element":"a"},{"href":"#id-23","referenceIndex":24,"text":"(2018)","element":"a"},{"text":"), beyond its simplicity for sampling.","element":"span"}],[{"text":"The rest of the paper is organized as follows. In section 2, we motivate and propose the relaxed stochastic control formulation involving an exploratory state dynamics and an entropy-regularized reward function for our RL problem. 
We then present the associated Hamilton-Jacobi-Bellman (HJB) equation and the optimal control distribution for general entropy-regularized stochastic control problems in section 3. In section 4, we study the special LQ problem in both the state-independent and state-dependent reward cases, corresponding respectively to the multi-armed bandit problem and the general RL problem in discrete time, and derive the optimality of Gaussian exploration. We discuss the connections between the exploratory LQ problem and the classical LQ problem in section 5, establish the solvability equivalence of the two and the convergence result for vanishing exploration, and finally characterize the cost of exploration. We conclude in section 6. ","element":"span"},{"text":"Some technical contents and proofs are relegated to Appendices.","element":"span"}]]},{"heading":"2 An Entropy-Regularized Relaxed Stochastic Control Problem","paragraphs":[[{"text":"We introduce an entropy-regularized relaxed stochastic control problem and provide its motivation in the context of RL.","element":"span"}],[{"text":"Consider a filtered probability space (Ω","element":"span"},{"style":{"height":16.7},"width":245.92,"height":41.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-0.png","element":"img","alt":", F, P; {Ft}t≥0","inline":true},{"text":") in which we define an ","element":"span"},{"style":{"height":16.7},"width":136,"height":41.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-1.png","element":"img","alt":" {Ft}t≥0","inline":true},{"text":"-adapted Brownian motion ","element":"span"},{"style":{"height":16},"width":342.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-2.png","element":"img","alt":" W = {Wt, t ≥ 0}.","inline":true,"padRight":true},{"text":"An “action space” ","element":"span"},{"text":"U ","element":"span"},{"text":"is given, representing the constraint on an agent’s decisions (“controls” or 
“actions”). An admissible (","element":"span"},{"text":"open-loop","element":"span"},{"text":") control ","element":"span"},{"style":{"height":16},"width":284.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-3.png","element":"img","alt":" u = {ut, t ≥ 0}","inline":true,"padRight":true},{"text":"is an ","element":"span"},{"style":{"height":16.7},"width":136,"height":41.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-4.png","element":"img","alt":" {Ft}t≥0","inline":true},{"text":"-adapted measurable process taking values in ","element":"span"},{"text":"U","element":"span"},{"text":".","element":"span"}],[{"text":"The classical stochastic control problem is to control the state (or “feature”) dynamics","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-5.png","element":"img","alt":"3","inline":true}],[{"id":"id-24","style":{"width":"84%"},"width":1161,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-6.png","element":"img"}],[{"text":"where (and throughout this paper) ","element":"span"},{"text":"x ","element":"span"},{"text":"is a generic variable representing a current state of the system dynamics. 
The aim of the control is to achieve the maximum expected total discounted reward represented by the value function","element":"span"}],[{"id":"id-25","style":{"width":"82%"},"width":1137,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-7.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"r ","element":"span"},{"text":"is the reward function, ","element":"span"},{"style":{"height":12},"width":64.6,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-8.png","element":"img","alt":" ρ >","inline":true,"padRight":true},{"text":"0 is the discount rate, and ","element":"span"},{"style":{"height":17.55},"width":95,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/4-9.png","element":"img","alt":" Acl(x","inline":true},{"text":") denotes the set of all admissible controls which in general may depend on ","element":"span"},{"text":"x","element":"span"},{"text":".","element":"span"}],[{"text":"In the classical setting, where the model is fully known (namely, when the functions ","element":"span"},{"style":{"height":14},"width":58.04,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-0.png","element":"img","alt":" b, σ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"text":"r ","element":"span"},{"text":"are fully specified) and the dynamic programming is applicable, the optimal control can be derived and represented as a ","element":"span"},{"text":"deterministic ","element":"span"},{"text":"mapping from the current state to the action space ","element":"span"},{"style":{"height":16},"width":249.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-1.png","element":"img","alt":" U, u∗t = u∗(x∗t","inline":true,"padRight":true},{"text":"). 
The mapping ","element":"span"},{"style":{"height":11.15},"width":43.36,"height":27.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-2.png","element":"img","alt":" u∗","inline":true,"padRight":true},{"text":"is called an optimal ","element":"span"},{"text":"feedback ","element":"span"},{"text":"control (or “policy” or “law”); this feedback control is derived at ","element":"span"},{"text":"t ","element":"span"},{"text":"= 0 and ","element":"span"},{"text":"will ","element":"span"},{"text":"be carried out through [0","element":"span"},{"style":{"height":17.36},"width":100,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-3.png","element":"img","alt":", ∞).4","inline":true}],[{"text":"In contrast, in the RL setting, where the underlying model is not known and therefore dynamic learning is needed, the agent employs exploration to interact with and learn the unknown environment through trial and error. 
The key idea is to model exploration by a ","element":"span"},{"text":"distribution ","element":"span"},{"text":"of controls ","element":"span"},{"style":{"height":16},"width":313.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-4.png","element":"img","alt":" π = {πt(u), t ≥ 0}","inline":true,"padRight":true},{"text":"over the control space ","element":"span"},{"text":"U ","element":"span"},{"text":"from which each “trial” is sampled.","element":"span"},{"style":{"height":8},"width":16,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-5.png","element":"img","alt":"5","inline":true,"padRight":true},{"text":"We can therefore extend the notion of controls to distributions.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-6.png","element":"img","alt":"6","inline":true,"padRight":true},{"text":"The agent executes a control for ","element":"span"},{"text":"N ","element":"span"},{"text":"rounds over the same time horizon, while at each round, a classical control is sampled from the distribution ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-7.png","element":"img","alt":" π","inline":true},{"text":". The reward of such a policy becomes accurate enough when ","element":"span"},{"text":"N ","element":"span"},{"text":"is large. ","element":"span"},{"text":"This procedure, known as ","element":"span"},{"text":"policy evaluation","element":"span"},{"text":", is considered as a fundamental element of most RL algorithms in practice ","element":"span"},{"href":"#id-5","referenceIndex":35,"text":"(Sutton and Barto ","element":"a"},{"href":"#id-5","referenceIndex":35,"text":"(2018)","element":"a"},{"text":"). 
Hence, for evaluating such a policy distribution in our continuous time setting, it is necessary to consider the limiting situation as ","element":"span"},{"style":{"height":11.2},"width":138.4,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-8.png","element":"img","alt":"N → ∞","inline":true},{"text":".","element":"span"}],[{"text":"In order to capture the essential idea for doing this, let us first examine the special case when the reward depends only on the control, namely, ","element":"span"},{"style":{"height":16},"width":191.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-9.png","element":"img","alt":" r(xut , ut) =","inline":true},{"style":{"height":16},"width":97.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-10.png","element":"img","alt":"r(ut).","inline":true,"padRight":true},{"text":"One then considers ","element":"span"},{"text":"N ","element":"span"},{"text":"identical independent copies of the control problem in the following way: at round ","element":"span"},{"text":"i","element":"span"},{"text":", ","element":"span"},{"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", . . 
., N, ","element":"span"},{"text":"a control ","element":"span"},{"style":{"height":12.96},"width":34.04,"height":32.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-11.png","element":"img","alt":" ui","inline":true,"padRight":true},{"text":"is sampled under the (possibly random) control distribution ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-12.png","element":"img","alt":" π","inline":true},{"text":", and executed for its corresponding copy of the control problem ","element":"span"},{"href":"#id-24","text":"(1)","element":"a"},{"text":"–","element":"span"},{"href":"#id-25","text":"(2)","element":"a"},{"text":". Then, at each fixed time ","element":"span"},{"text":"t","element":"span"},{"text":", it follows, from the law of large numbers (and under certain mild technical conditions), that the average reward over [","element":"span"},{"text":"t, t ","element":"span"},{"text":"+ ∆","element":"span"},{"text":"t","element":"span"},{"text":"], with ∆","element":"span"},{"text":"t ","element":"span"},{"text":"small enough, should satisfy that as ","element":"span"},{"style":{"height":11.2},"width":138.4,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-13.png","element":"img","alt":"N → ∞","inline":true},{"text":",","element":"span"}],[{"style":{"width":"67%"},"width":922,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-14.png","element":"img"}],[{"text":"For a general reward ","element":"span"},{"style":{"height":16.03},"width":131.04,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-15.png","element":"img","alt":" r(xut , ut","inline":true},{"text":") which depends on the state, we first need to ","element":"span"},{"text":"describe how exploration might alter the state dynamics ","element":"span"},{"href":"#id-24","text":"(1) 
","element":"a"},{"text":"by defining appropriately its “exploratory” version. For this, we look at the effect of repetitive learning under a given control distribution, say ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-16.png","element":"img","alt":" π","inline":true},{"text":", for ","element":"span"},{"text":"N ","element":"span"},{"text":"rounds. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16.8},"width":54.2,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-17.png","element":"img","alt":" W it","inline":true,"padRight":true},{"text":", ","element":"span"},{"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", . . . , N","element":"span"},{"text":", be ","element":"span"},{"text":"N ","element":"span"},{"text":"independent sample paths of the Brownian motion ","element":"span"},{"style":{"height":13.1},"width":49.45,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/5-18.png","element":"img","alt":" Wt","inline":true},{"text":", and ","element":"span"},{"style":{"height":16.99},"width":309.1,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-0.png","element":"img","alt":" xit, i = 1, 2, . . . , N","inline":true},{"text":", be the copies of the state process respectively under the ","element":"span"},{"text":"controls ","element":"span"},{"style":{"height":16.16},"width":307.66,"height":40.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-1.png","element":"img","alt":" ui, i = 1, 2, . . . , N","inline":true},{"text":", each sampled from ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-2.png","element":"img","alt":" π","inline":true},{"text":". 
Then, the increments of these state process copies are, for ","element":"span"},{"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", . . ., N","element":"span"},{"text":",","element":"span"}],[{"id":"id-26","style":{"width":"94%"},"width":1306,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-3.png","element":"img"}],[{"text":"Each such process ","element":"span"},{"style":{"height":16.35},"width":311.03,"height":40.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-4.png","element":"img","alt":" xi, i = 1, 2, . . . , N","inline":true},{"text":", can be viewed as an independent sample from the exploratory state dynamics ","element":"span"},{"style":{"height":10.8},"width":55.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-5.png","element":"img","alt":" Xπ","inline":true},{"text":". The superscript ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-6.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":10.8},"width":55.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-7.png","element":"img","alt":" Xπ","inline":true,"padRight":true},{"text":"indicates that each ","element":"span"},{"style":{"height":12.96},"width":33.56,"height":32.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-8.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"is generated according to the classical dynamics ","element":"span"},{"href":"#id-26","text":"(3)","element":"a"},{"text":", with the corresponding ","element":"span"},{"style":{"height":12.96},"width":34.04,"height":32.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-9.png","element":"img","alt":" 
ui","inline":true,"padRight":true},{"text":"sampled independently under this policy ","element":"span"},{"style":{"height":6.8},"width":35,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-10.png","element":"img","alt":" π.","inline":true}],[{"text":"It then follows from ","element":"span"},{"href":"#id-26","text":"(3) ","element":"a"},{"text":"and the law of large numbers that, as ","element":"span"},{"style":{"height":11.2},"width":138.4,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-11.png","element":"img","alt":" N → ∞","inline":true},{"text":",","element":"span"}],[{"id":"id-27","style":{"width":"98%"},"width":1357,"height":295,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-12.png","element":"img"}],[{"text":"In the above, we have implicitly applied the (reasonable) assumption that both ","element":"span"},{"style":{"height":9.1},"width":34.56,"height":22.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-13.png","element":"img","alt":"πt","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.64},"width":55.48,"height":36.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-14.png","element":"img","alt":" Xπt","inline":true,"padRight":true},{"text":"are independent of the increments of the Brownian motion sample ","element":"span"},{"text":"paths, which are identically distributed over [","element":"span"},{"text":"t, t ","element":"span"},{"text":"+ ∆","element":"span"},{"text":"t","element":"span"},{"text":"].","element":"span"}],[{"text":"Similarly, as ","element":"span"},{"style":{"height":11.2},"width":138.4,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-15.png","element":"img","alt":" N → 
∞","inline":true},{"text":",","element":"span"}],[{"id":"id-28","style":{"width":"97%"},"width":1341,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-16.png","element":"img"}],[{"text":"As we see, not only ∆","element":"span"},{"style":{"height":16.99},"width":34.56,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-17.png","element":"img","alt":"xit","inline":true,"padRight":true},{"text":"but also (∆","element":"span"},{"style":{"height":17.39},"width":68.32,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-18.png","element":"img","alt":"xit)2","inline":true,"padRight":true},{"text":"are affected by repetitive learning ","element":"span"},{"text":"under the given policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-19.png","element":"img","alt":" π","inline":true},{"text":".","element":"span"}],[{"text":"Finally, as the individual state ","element":"span"},{"style":{"height":16.99},"width":34.56,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-20.png","element":"img","alt":" xit","inline":true,"padRight":true},{"text":"is an independent sample from ","element":"span"},{"style":{"height":14.64},"width":55.48,"height":36.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-21.png","element":"img","alt":" Xπt","inline":true,"padRight":true},{"text":", we ","element":"span"},{"text":"have that ∆","element":"span"},{"style":{"height":16.99},"width":34.56,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-22.png","element":"img","alt":"xit","inline":true,"padRight":true},{"text":"and (∆","element":"span"},{"style":{"height":17.39},"width":353.27,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-23.png","element":"img","alt":"xit)2, i = 1, 2, . . . 
, N","inline":true},{"text":", are the independent samples from ","element":"span"},{"text":"∆","element":"span"},{"style":{"height":14.64},"width":55.48,"height":36.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-24.png","element":"img","alt":"Xπt","inline":true,"padRight":true},{"text":"and (∆","element":"span"},{"style":{"height":14.64},"width":55.48,"height":36.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-25.png","element":"img","alt":"Xπt","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-26.png","element":"img","alt":"2","inline":true},{"text":", respectively. As a result, the law of large numbers gives that ","element":"span"},{"text":"as ","element":"span"},{"style":{"height":11.2},"width":138.4,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-27.png","element":"img","alt":" N → ∞","inline":true},{"text":",","element":"span"}],[{"style":{"width":"89%"},"width":1224,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-28.png","element":"img"}],[{"text":"This interpretation, together with ","element":"span"},{"href":"#id-27","text":"(4) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","text":"(5)","element":"a"},{"text":", ","element":"span"},{"text":"motivates ","element":"span"},{"text":"us to propose the ","element":"span"},{"text":"exploratory version ","element":"span"},{"text":"of the state dynamics, namely,","element":"span"}],[{"id":"id-29","style":{"width":"86%"},"width":1188,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-29.png","element":"img"}],[{"text":"where the coefficients 
","element":"span"},{"text":"˜","element":"span"},{"style":{"height":16},"width":72.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-30.png","element":"img","alt":"b(·, ·","inline":true},{"text":") and ˜","element":"span"},{"style":{"height":16},"width":79.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-31.png","element":"img","alt":"σ(·, ·","inline":true},{"text":") are defined as","element":"span"}],[{"id":"id-30","style":{"width":"80%"},"width":1100,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-32.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-31","style":{"width":"82%"},"width":1135,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/6-33.png","element":"img"}],[{"text":"with ","element":"span"},{"text":"P ","element":"span"},{"text":"(","element":"span"},{"text":"U","element":"span"},{"text":") being the set of density functions of probability measures on ","element":"span"},{"text":"U ","element":"span"},{"text":"that are absolutely continuous with respect to the Lebesgue measure.","element":"span"}],[{"text":"We will call ","element":"span"},{"href":"#id-29","text":"(6) ","element":"a"},{"text":"the ","element":"span"},{"text":"exploratory formulation ","element":"span"},{"text":"of the controlled state dynamics, and ","element":"span"},{"text":"˜","element":"span"},{"style":{"height":16},"width":71.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-0.png","element":"img","alt":"b(·, ·","inline":true},{"text":") and ˜","element":"span"},{"style":{"height":16},"width":79.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-1.png","element":"img","alt":"σ(·, ·","inline":true},{"text":") ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-30","text":"(7) ","element":"a"},{"text":"and 
","element":"span"},{"href":"#id-31","text":"(8)","element":"a"},{"text":", respectively, the ","element":"span"},{"text":"exploratory drift ","element":"span"},{"text":"and the","element":"span"}],[{"style":{"width":"27%"},"width":372,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-2.png","element":"img"}],[{"text":"In a similar fashion, as ","element":"span"},{"style":{"height":11.2},"width":138.4,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-3.png","element":"img","alt":" N → ∞","inline":true},{"text":",","element":"span"}],[{"style":{"width":"88%"},"width":1210,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-4.png","element":"img"}],[{"text":"Hence, the reward function ","element":"span"},{"text":"r ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-25","text":"(2) ","element":"a"},{"text":"needs to be modified to the ","element":"span"},{"text":"exploratory","element":"span"}],[{"style":{"width":"99%"},"width":1368,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-5.png","element":"img"}],[{"text":"If, on the other hand, the model is fully known, exploration would not be needed at all and the control distributions would all degenerate to the Dirac measures, and we would then be in the realm of the classical stochastic control. Thus, in the RL context, we need to add a “regularization term” to account for model uncertainty and to encourage exploration. 
We use Shannon’s ","element":"span"},{"text":"differential entropy ","element":"span"},{"text":"to measure the level of exploration:","element":"span"}],[{"style":{"width":"52%"},"width":720,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-6.png","element":"img"}],[{"text":"We therefore introduce the following entropy-regularized relaxed stochastic control problem","element":"span"}],[{"id":"id-32","style":{"width":"102%"},"width":1414,"height":134,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.6},"width":65.56,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/7-8.png","element":"img","alt":" λ >","inline":true,"padRight":true},{"text":"0 is an exogenous exploration weight parameter capturing the trade-off between exploitation (the original reward function) and exploration (the entropy), ","element":"span"},{"text":"A","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":") is the set of admissible control distributions (which may in general depend on ","element":"span"},{"text":"x","element":"span"},{"text":"), and ","element":"span"},{"text":"V ","element":"span"},{"text":"is the value function.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-0.png","element":"img","alt":"8","inline":true}],[{"text":"The precise definition of ","element":"span"},{"text":"A","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":") depends on the specific dynamic model under consideration and the specific problems one wants to solve, which may vary from case to case. Here, we first provide some of the “minimal” requirements for ","element":"span"},{"text":"A","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":"). 
","element":"span"},{"text":"Denote by ","element":"span"},{"text":"B","element":"span"},{"text":"(","element":"span"},{"text":"U","element":"span"},{"text":") the Borel algebra on ","element":"span"},{"text":"U","element":"span"},{"text":". ","element":"span"},{"text":"An admissible control distribution is a measure-valued (or precisely a density-function-valued) process ","element":"span"},{"style":{"height":16},"width":265.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-1.png","element":"img","alt":"π = {πt, t ≥ 0}","inline":true,"padRight":true},{"text":"satisfying at least the following properties:","element":"span"}],[{"text":"(i) for each ","element":"span"},{"style":{"height":12.8},"width":56.44,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-2.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"0, ","element":"span"},{"style":{"height":16},"width":163,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-3.png","element":"img","alt":" πt ∈ P(U","inline":true},{"text":") a.s.;","element":"span"}],[{"style":{"width":"93%"},"width":1287,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-4.png","element":"img"}],[{"text":"(iii) the stochastic differential equation (SDE) ","element":"span"},{"href":"#id-29","text":"(6) ","element":"a"},{"text":"has a unique strong solution ","element":"span"},{"style":{"height":16},"width":314.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-5.png","element":"img","alt":" Xπ = {Xπt , t ≥ 0}","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-6.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is applied;","element":"span"}],[{"text":"(iv) the expectation on the right hand side of 
","element":"span"},{"href":"#id-32","text":"(12) ","element":"a"},{"text":"is finite.","element":"span"}],[{"text":"Naturally, there could be additional requirements depending on specific problems. For the linear–quadratic control case, which will be the main focus of the paper, we define ","element":"span"},{"text":"A","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":") precisely in section 4.","element":"span"}],[{"text":"Finally, analogous to the classical control formulation, ","element":"span"},{"text":"A","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":") contains ","element":"span"},{"text":"open-loop ","element":"span"},{"text":"control distributions that are measure-valued ","element":"span"},{"text":"stochastic processes","element":"span"},{"text":". We will also consider ","element":"span"},{"text":"feedback ","element":"span"},{"text":"control distributions. Specifically, a ","element":"span"},{"text":"deterministic ","element":"span"},{"text":"mapping ","element":"span"},{"style":{"height":16},"width":83.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-7.png","element":"img","alt":" π(·; ·","inline":true},{"text":") is called a feedback control (distribution) if i) ","element":"span"},{"style":{"height":16},"width":95.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-8.png","element":"img","alt":" π(·; x","inline":true},{"text":") is a density function for each ","element":"span"},{"style":{"height":11.6},"width":106.76,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-9.png","element":"img","alt":" x ∈ R","inline":true},{"text":"; ii) the following SDE (which is the system dynamics after the feedback law ","element":"span"},{"style":{"height":16},"width":83.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-10.png","element":"img","alt":" 
π(·; ·","inline":true},{"text":") is applied)","element":"span"}],[{"style":{"width":"95%"},"width":1309,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-11.png","element":"img"}],[{"text":"has a unique strong solution ","element":"span"},{"style":{"height":16},"width":192.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-12.png","element":"img","alt":" {Xt; t ≥ 0}","inline":true},{"text":"; and iii) the open-loop control ","element":"span"},{"style":{"height":16},"width":144.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-13.png","element":"img","alt":" π = {πt,","inline":true},{"style":{"height":16},"width":226.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-14.png","element":"img","alt":"t ≥ 0} ∈ A(x","inline":true},{"text":") where ","element":"span"},{"style":{"height":16},"width":218.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-15.png","element":"img","alt":" πt := π(·; Xt","inline":true},{"text":"). 
In this case, the open-loop control ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-16.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is said to be ","element":"span"},{"text":"generated ","element":"span"},{"text":"from the feedback control law ","element":"span"},{"style":{"height":16},"width":83.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-17.png","element":"img","alt":" π(·; ·","inline":true},{"text":") with respect to ","element":"span"},{"text":"x","element":"span"},{"text":".","element":"span"}]]},{"heading":"3 HJB Equation and Optimal Control Distributions","paragraphs":[[{"text":"We present the general procedure for solving the optimization problem ","element":"span"},{"href":"#id-32","text":"(12)","element":"a"},{"text":". The arguments are informal and a rigorous analysis will be carried out in the next section.","element":"span"}],[{"text":"To this end, applying Bellman’s classical principle of optimality, we have","element":"span"}],[{"style":{"width":"72%"},"width":1001,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-18.png","element":"img"}],[{"text":"V ","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":") = ","element":"span"},{"text":"sup","element":"span"}],[{"style":{"width":"84%"},"width":1159,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-19.png","element":"img"}],[{"text":"Proceeding with standard arguments, we deduce that ","element":"span"},{"text":"V ","element":"span"},{"text":"satisfies the Hamilton–Jacobi–Bellman (HJB) 
equation","element":"span"}],[{"style":{"width":"89%"},"width":1234,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/8-20.png","element":"img"}],[{"id":"id-40","style":{"width":"64%"},"width":881,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-0.png","element":"img"}],[{"text":"or","element":"span"}],[{"id":"id-33","style":{"width":"102%"},"width":1406,"height":141,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-1.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"v ","element":"span"},{"text":"denotes the generic unknown function of the equation. Recalling that ","element":"span"},{"style":{"height":16},"width":156.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-2.png","element":"img","alt":"π ∈ P (U","inline":true},{"text":") if and only if","element":"span"}],[{"style":{"width":"76%"},"width":1055,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-3.png","element":"img"}],[{"text":"we can solve the (constrained) maximization problem on the right hand side of ","element":"span"},{"href":"#id-33","text":"(15) ","element":"a"},{"text":"to get a feedback control:","element":"span"}],[{"id":"id-34","style":{"width":"95%"},"width":1312,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-4.png","element":"img"}],[{"text":"For each given initial state ","element":"span"},{"style":{"height":11.6},"width":110.12,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-5.png","element":"img","alt":" x ∈ R","inline":true},{"text":", this feedback control in turn generates an optimal open-loop control","element":"span"}],[{"style":{"width":"105%"},"width":1451,"height":143,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-6.png","element":"img"}],[{"text":"where 
","element":"span"},{"style":{"height":16},"width":72.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-7.png","element":"img","alt":" {X∗t","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":16},"width":112.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-8.png","element":"img","alt":" t ≥ 0}","inline":true,"padRight":true},{"text":"solves ","element":"span"},{"href":"#id-29","text":"(6) ","element":"a"},{"text":"when the feedback control law ","element":"span"},{"style":{"height":16},"width":102.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-9.png","element":"img","alt":" π∗(·; ·","inline":true},{"text":") is applied ","element":"span"},{"text":"and assuming that ","element":"span"},{"style":{"height":17.55},"width":348.16,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-10.png","element":"img","alt":" {π∗t , t ≥ 0} ∈ A(x).9","inline":true}],[{"text":"Formula ","element":"span"},{"href":"#id-34","text":"(17) ","element":"a"},{"text":"above elicits qualitative understanding about optimal explorations. We further investigate this in the next section.","element":"span"}]]},{"heading":"4 The Linear–Quadratic Case","paragraphs":[[{"text":"We now focus on the family of entropy-regularized (relaxed) stochastic control problems with linear state dynamics and quadratic rewards, in which","element":"span"}],[{"id":"id-36","style":{"width":"99%"},"width":1372,"height":401,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/9-11.png","element":"img"}],[{"text":"In the classical control literature, this type of linear–quadratic (LQ) control problems is one of the most important, not only because it admits elegant and simple solutions but also because more complex, nonlinear problems can be approximated by LQ problems. 
As is standard with LQ control, we assume that the control set is unconstrained, namely, ","element":"span"},{"text":"U ","element":"span"},{"text":"= ","element":"span"},{"text":"R","element":"span"},{"text":".","element":"span"}],[{"text":"Fix an initial state ","element":"span"},{"style":{"height":11.6},"width":102.44,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-0.png","element":"img","alt":" x ∈ R","inline":true},{"text":". For each open-loop control ","element":"span"},{"style":{"height":16},"width":170.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-1.png","element":"img","alt":" π ∈ A(x),","inline":true,"padRight":true},{"text":"denote its mean and variance processes ","element":"span"},{"style":{"height":17.2},"width":220.76,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-2.png","element":"img","alt":" µt, σ2t , t ≥ 0,","inline":true,"padRight":true},{"text":"by","element":"span"}],[{"style":{"width":"84%"},"width":1160,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-3.png","element":"img"}],[{"text":"Then, the state SDE ","element":"span"},{"href":"#id-29","text":"(6) ","element":"a"},{"text":"becomes","element":"span"}],[{"id":"id-35","style":{"width":"98%"},"width":1349,"height":216,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-4.png","element":"img"}],[{"text":"Further, denote","element":"span"}],[{"style":{"width":"97%"},"width":1347,"height":472,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-5.png","element":"img"}],[{"text":"In the above, condition (iii) is to ensure that for any ","element":"span"},{"style":{"height":16},"width":153.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-6.png","element":"img","alt":" π ∈ A(x","inline":true},{"text":"), both the drift and volatility terms of 
","element":"span"},{"href":"#id-35","text":"(22) ","element":"a"},{"text":"satisfy a global Lipschitz condition and a type of linear growth condition in the state variable and, hence, the SDE ","element":"span"},{"href":"#id-35","text":"(22) ","element":"a"},{"text":"admits a unique strong solution ","element":"span"},{"style":{"height":10.8},"width":55.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-7.png","element":"img","alt":" Xπ","inline":true},{"text":". Condition (iv) renders dynamic programming and verification technique applicable for the model, as will be evident in the sequel. Finally, the reward is finite under condition (v).","element":"span"}],[{"text":"We are now ready to introduce the entropy-regularized relaxed stochastic LQ problem","element":"span"}],[{"style":{"width":"84%"},"width":1165,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-8.png","element":"img"}],[{"id":"id-47","text":"V ","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":") = ","element":"span"},{"text":"sup","element":"span"}],[{"style":{"width":"96%"},"width":1323,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-9.png","element":"img"}],[{"text":"with ","element":"span"},{"text":"r ","element":"span"},{"text":"as in ","element":"span"},{"href":"#id-36","text":"(20) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":55.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/10-10.png","element":"img","alt":" Xπ","inline":true,"padRight":true},{"text":"as in ","element":"span"},{"href":"#id-35","text":"(22)","element":"a"},{"text":".","element":"span"}],[{"text":"In the following two subsections, we derive explicit solutions for both cases of state-independent and state-dependent rewards.","element":"span"}],[{"text":"4.1 ","element":"span"},{"text":"The case of 
state-independent reward","element":"span"}],[{"text":"We start with the technically less challenging case ","element":"span"},{"style":{"height":19.31},"width":438.48,"height":48.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-0.png","element":"img","alt":" r(x, u) = −� N2 u2 + Qu�","inline":true},{"text":", namely, the reward is state (feature) independent. ","element":"span"},{"text":"In this case, the system dynamics becomes irrelevant. However, the problem is still interesting in its own right as it corresponds to the state-independent RL problem, which is known as the continuous-armed bandit problem in the continuous time setting ","element":"span"},{"href":"#id-37","referenceIndex":20,"text":"(Mandelbaum ","element":"a"},{"href":"#id-37","referenceIndex":20,"text":"(1987)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-38","referenceIndex":14,"text":"Kaspi and Mandelbaum ","element":"a"},{"href":"#id-38","referenceIndex":14,"text":"(1998)","element":"a"},{"text":").","element":"span"}],[{"text":"Following the derivation in the previous section, the optimal feedback control in ","element":"span"},{"href":"#id-34","text":"(17) ","element":"a"},{"text":"reduces to","element":"span"}],[{"id":"id-39","style":{"width":"103%"},"width":1428,"height":343,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-1.png","element":"img"}],[{"text":"Therefore, the optimal feedback control distribution appears to be ","element":"span"},{"text":"Gaussian","element":"span"},{"text":". 
More specifically, at any present state ","element":"span"},{"text":"x","element":"span"},{"text":", the agent should embark on exploration according to the Gaussian distribution with mean and variance given, respectively, by ","element":"span"},{"style":{"height":24.45},"width":319.72,"height":61.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-2.png","element":"img","alt":"CDxv′′(x)+Bv′(x)−QN−D2v′′(x)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.11},"width":175.2,"height":55.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-3.png","element":"img","alt":"λN−D2v′′(x)","inline":true},{"text":". Note that in deriving the above,","element":"span"}],[{"text":"we have used that ","element":"span"},{"style":{"height":17.36},"width":278.68,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-4.png","element":"img","alt":" N − D2v′′(x) >","inline":true,"padRight":true},{"text":"0, ","element":"span"},{"style":{"height":11.6},"width":105.8,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-5.png","element":"img","alt":" x ∈ R","inline":true},{"text":", a condition that will be justified ","element":"span"},{"id":"id-54","text":"and discussed later on.","element":"span"}],[{"text":"Remark 1 ","element":"span"},{"text":"If we examine the derivation of ","element":"span"},{"href":"#id-39","text":"(24) ","element":"a"},{"text":"more closely, we easily see that the optimality of the Gaussian distribution still holds as long as the state dynamics is linear in control and the reward is quadratic in control, whereas the dependence of both on the state can be generally nonlinear.","element":"span"}],[{"text":"Substituting ","element":"span"},{"href":"#id-39","text":"(24) ","element":"a"},{"text":"back to ","element":"span"},{"href":"#id-40","text":"(14)","element":"a"},{"text":", the HJB equation becomes, after straightforward 
calculations,","element":"span"}],[{"id":"id-42","style":{"width":"86%"},"width":1195,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-6.png","element":"img"}],[{"text":"In general, this nonlinear equation has ","element":"span"},{"text":"multiple ","element":"span"},{"text":"smooth solutions, even among quadratic polynomials that satisfy ","element":"span"},{"style":{"height":17.36},"width":283,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-7.png","element":"img","alt":" N − D2v′′(x) >","inline":true,"padRight":true},{"text":"0. One such solution is a constant, given by","element":"span"}],[{"id":"id-43","style":{"width":"74%"},"width":1028,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-8.png","element":"img"}],[{"text":"with the corresponding optimal feedback control distribution ","element":"span"},{"href":"#id-39","text":"(24) ","element":"a"},{"text":"being","element":"span"}],[{"id":"id-41","style":{"width":"69%"},"width":951,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/11-9.png","element":"img"}],[{"text":"It turns out the right hand side of the above is independent of the current state ","element":"span"},{"text":"x","element":"span"},{"text":". So the optimal feedback control distribution is the same across different states. 
Note that the classical LQ problem with the state-independent reward function ","element":"span"},{"style":{"height":19.5},"width":437.52,"height":48.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-0.png","element":"img","alt":" r(x, u) = −� N2 u2 + Qu�","inline":true},{"text":"clearly has the optimal control ","element":"span"},{"style":{"height":20.27},"width":171.96,"height":50.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-1.png","element":"img","alt":" u∗ = − QN","inline":true,"padRight":true},{"text":", ","element":"span"},{"text":"which is also state-independent and is nothing else than the mean of the optimal Gaussian feedback control ","element":"span"},{"style":{"height":10.96},"width":44.32,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-2.png","element":"img","alt":" π∗","inline":true},{"text":".","element":"span"}],[{"text":"The following result establishes that the constant ","element":"span"},{"text":"v ","element":"span"},{"text":"is indeed the value function ","element":"span"},{"text":"V ","element":"span"},{"text":"and that the feedback control ","element":"span"},{"style":{"height":10.96},"width":44.32,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-3.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"defined by ","element":"span"},{"href":"#id-41","text":"(27) ","element":"a"},{"text":"is optimal. 
Henceforth, for notational convenience, we denote by ","element":"span"},{"style":{"height":17.36},"width":157.6,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-4.png","element":"img","alt":" N(·|µ, σ^2","inline":true},{"text":") the density function of a Gaussian random variable with mean ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-5.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and variance ","element":"span"},{"style":{"height":13.36},"width":40,"height":33.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-6.png","element":"img","alt":" σ^2","inline":true},{"text":".","element":"span"}],[{"id":"id-50","style":{"width":"100%"},"width":1375,"height":395,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-7.png","element":"img"}],[{"text":"Moreover, the associated optimal state process, ","element":"span"},{"style":{"height":16.03},"width":200,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-8.png","element":"img","alt":" {X∗t , t ≥ 0}","inline":true},{"text":", under ","element":"span"},{"style":{"height":16},"width":102.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-9.png","element":"img","alt":" π∗(·; ·","inline":true},{"text":") ","element":"span"},{"text":"is the ","element":"span"},{"text":"unique solution of the SDE","element":"span"}],[{"id":"id-44","style":{"width":"95%"},"width":1319,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-10.png","element":"img"}],[{"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":146.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-11.png","element":"img","alt":" v(x) ≡ v","inline":true,"padRight":true},{"text":"be the constant solution to the HJB equation ","element":"span"},{"href":"#id-42","text":"(25) ","element":"a"},{"text":"defined by ","element":"span"},{"href":"#id-43","text":"(26)","element":"a"},{"text":". Then the corresponding feedback optimizer ","element":"span"},{"style":{"height":28.8},"width":471.84,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-12.png","element":"img","alt":" π∗(u; x) = N(u | −Q/N, λ/N)","inline":true,"padRight":true},{"text":"follows immediately from ","element":"span"},{"href":"#id-39","text":"(24)","element":"a"},{"text":". Let ","element":"span"},{"style":{"height":16.03},"width":294.08,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-13.png","element":"img","alt":" π∗ = {π∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"be the open-loop control ","element":"span"},{"text":"generated from ","element":"span"},{"style":{"height":16},"width":102.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-14.png","element":"img","alt":" π∗(·; ·","inline":true},{"text":"). 
It is straightforward to verify that ","element":"span"},{"style":{"height":17.36},"width":219.04,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-15.png","element":"img","alt":" π∗ ∈ A(x).10","inline":true}],[{"text":"Now, for any ","element":"span"},{"style":{"height":16},"width":150.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-16.png","element":"img","alt":" π ∈ A(x","inline":true},{"text":") and ","element":"span"},{"style":{"height":13.2},"width":74.68,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-17.png","element":"img","alt":" T ≥","inline":true,"padRight":true},{"text":"0, it follows from the HJB equation ","element":"span"},{"href":"#id-40","text":"(14) ","element":"a"},{"text":"that","element":"span"}],[{"style":{"width":"94%"},"width":1297,"height":234,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-18.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16},"width":142.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-19.png","element":"img","alt":" π ∈ A(x","inline":true},{"text":"), the dominated convergence theorem yields that, as ","element":"span"},{"style":{"height":11.2},"width":131.2,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-20.png","element":"img","alt":" T → ∞","inline":true},{"text":",","element":"span"}],[{"style":{"width":"96%"},"width":1322,"height":206,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/12-21.png","element":"img"}],[{"text":"and, thus, ","element":"span"},{"style":{"height":16},"width":150.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-0.png","element":"img","alt":" v ≥ V (x","inline":true},{"text":"), for 
","element":"span"},{"style":{"height":11.6},"width":128.36,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-1.png","element":"img","alt":" ∀x ∈ R","inline":true},{"text":". On the other hand, ","element":"span"},{"style":{"height":10.96},"width":44.8,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-2.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"has been derived as the maximizer for the right hand side of ","element":"span"},{"href":"#id-40","text":"(14)","element":"a"},{"text":"; hence","element":"span"}],[{"style":{"width":"72%"},"width":997,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-3.png","element":"img"}],[{"text":"Replacing the inequalities by equalities in the above argument and sending ","element":"span"},{"text":"T ","element":"span"},{"text":"to infinity, we conclude that","element":"span"}],[{"style":{"width":"99%"},"width":1373,"height":278,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-4.png","element":"img"}],[{"text":"It is possible to obtain ","element":"span"},{"text":"explicit solutions ","element":"span"},{"text":"to ","element":"span"},{"href":"#id-44","text":"(28) ","element":"a"},{"text":"for most cases, which may be useful in designing exploration algorithms based on the theoretical results derived in this paper. We relegate this discussion about solving ","element":"span"},{"href":"#id-44","text":"(28) ","element":"a"},{"text":"explicitly to Appendix A.","element":"span"}],[{"text":"The above solution suggests that when the reward is independent of the state, so is the optimal feedback control distribution with density ","element":"span"},{"style":{"height":20.27},"width":209.88,"height":50.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-5.png","element":"img","alt":" N(· |− QN , λN","inline":true,"padRight":true},{"text":"). 
","element":"span"},{"text":"This is intuitive since objective ","element":"span"},{"href":"#id-32","text":"(12) ","element":"a"},{"text":"in this case does not explicitly distinguish between states.","element":"span"},{"style":{"height":7.6},"width":31.84,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-6.png","element":"img","alt":"11","inline":true}],[{"text":"A remarkable feature of the derived optimal distribution ","element":"span"},{"style":{"height":20.27},"width":223.32,"height":50.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-7.png","element":"img","alt":" N(· | − QN , λN","inline":true,"padRight":true},{"text":") is ","element":"span"},{"text":"that its mean coincides with the optimal control of the original, non-exploratory LQ problem, whereas the variance is determined by the temperature parameter ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-8.png","element":"img","alt":"λ","inline":true},{"text":". In the context of continuous-armed bandit problem, this result stipulates that the mean is concentrated on the current incumbent of the best arm and the variance is determined by the temperature parameter. The more weight put on the level of exploration, the more spread out the exploration becomes around the current best arm. 
This type of exploration/exploitation strategy is clearly intuitive and, in turn, gives guidance on how to choose the temperature parameter in practice: it is nothing other than the variance of","element":"span"}],[{"style":{"width":"23%"},"width":318,"height":15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/13-9.png","element":"img"}],[{"text":"the exploration the agent wishes to engage in (up to a scaling factor, namely the quadratic coefficient of the control in the reward function).","element":"span"}],[{"text":"However, we shall see in the next section that when the reward depends on the local state, the optimal feedback control distribution genuinely depends on the state.","element":"span"}],[{"text":"4.2 ","element":"span"},{"text":"The case of state-dependent reward","element":"span"}],[{"text":"We now consider the general case with the reward depending on both the control and the state, namely,","element":"span"}],[{"id":"id-45","style":{"width":"75%"},"width":1035,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/14-0.png","element":"img"}],[{"text":"We will be working with the following assumption.","element":"span"}],[{"style":{"width":"107%"},"width":1482,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/14-1.png","element":"img"}],[{"text":"This assumption requires a sufficiently large discount rate, or (implicitly) a sufficiently short planning horizon. 
Such an assumption is standard in infinite horizon problems with running rewards.","element":"span"}],[{"text":"Following an analogous argument as for ","element":"span"},{"href":"#id-39","text":"(24)","element":"a"},{"text":", we deduce that a candidate optimal feedback control is given by","element":"span"}],[{"style":{"width":"96%"},"width":1325,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/14-2.png","element":"img"}],[{"text":"In turn, denoting by ","element":"span"},{"style":{"height":16},"width":80.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/14-3.png","element":"img","alt":" µ∗(x","inline":true},{"text":") and (","element":"span"},{"style":{"height":17.36},"width":127.36,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/14-4.png","element":"img","alt":"σ∗(x))2","inline":true,"padRight":true},{"text":"the mean and variance of ","element":"span"},{"style":{"height":16},"width":114.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/14-5.png","element":"img","alt":" π∗(·; x","inline":true},{"text":") given above, the HJB equation ","element":"span"},{"href":"#id-40","text":"(14) ","element":"a"},{"text":"becomes","element":"span"}],[{"style":{"width":"97%"},"width":1337,"height":636,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/14-6.png","element":"img"}],[{"text":"Reorganizing, thus, the above reduces to","element":"span"}],[{"id":"id-46","style":{"width":"92%"},"width":1265,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/14-7.png","element":"img"}],[{"text":"Under Assumption ","element":"span"},{"href":"#id-45","text":"3 ","element":"a"},{"text":"and the additional condition ","element":"span"},{"style":{"height":14.16},"width":201.6,"height":35.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-0.png","element":"img","alt":" R2 < 
MN","inline":true,"padRight":true},{"text":"(which holds automatically if ","element":"span"},{"text":"R ","element":"span"},{"text":"= 0, ","element":"span"},{"text":"M > ","element":"span"},{"text":"0 and ","element":"span"},{"text":"N > ","element":"span"},{"text":"0, a standard case in the classical LQ problems), one smooth solution to the HJB equation ","element":"span"},{"href":"#id-46","text":"(32) ","element":"a"},{"text":"is given by","element":"span"}],[{"style":{"width":"32%"},"width":442,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-1.png","element":"img"}],[{"text":"where","element":"span"},{"style":{"height":7.6},"width":31.84,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-2.png","element":"img","alt":"12","inline":true}],[{"id":"id-48","style":{"width":"96%"},"width":1326,"height":333,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-3.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-49","style":{"width":"82%"},"width":1129,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-4.png","element":"img"}],[{"text":"For this particular solution, given by ","element":"span"},{"text":"v","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":") above, we can verify that ","element":"span"},{"style":{"height":13.11},"width":80.92,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-5.png","element":"img","alt":" k2 <","inline":true,"padRight":true},{"text":"0, due to Assumption ","element":"span"},{"href":"#id-45","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":14.16},"width":187.68,"height":35.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-6.png","element":"img","alt":" R2 < MN","inline":true},{"text":". 
Hence, ","element":"span"},{"text":"v ","element":"span"},{"text":"is concave, a property that is essential in proving that it is actually the value function.","element":"span"},{"style":{"height":7.6},"width":31.84,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-7.png","element":"img","alt":"13","inline":true,"padRight":true},{"text":"On the other hand, ","element":"span"},{"style":{"height":17.36},"width":502.84,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-8.png","element":"img","alt":"N − D2v′′(x) = N − k2D2 >","inline":true,"padRight":true},{"text":"0, ensuring that ","element":"span"},{"style":{"height":13.11},"width":36.64,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-9.png","element":"img","alt":" k0","inline":true,"padRight":true},{"text":"is well defined.","element":"span"}],[{"text":"Next, we state one of the main results of this paper.","element":"span"}],[{"style":{"width":"80%"},"width":1104,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-10.png","element":"img"}],[{"text":"with ","element":"span"},{"style":{"height":13.2},"width":88.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-11.png","element":"img","alt":" M ≥","inline":true,"padRight":true},{"text":"0","element":"span"},{"text":", ","element":"span"},{"text":"N > ","element":"span"},{"text":"0","element":"span"},{"text":", ","element":"span"},{"style":{"height":14},"width":212.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-12.png","element":"img","alt":" R, Q, P ∈ R","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.35},"width":187.2,"height":35.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-13.png","element":"img","alt":" R2 < MN","inline":true},{"text":". 
Furthermore, suppose that Assumption ","element":"span"},{"href":"#id-45","text":"3 ","element":"a"},{"text":"holds. Then, the value function in ","element":"span"},{"href":"#id-47","text":"(23) ","element":"a"},{"text":"is given by","element":"span"}],[{"style":{"width":"99%"},"width":1372,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-14.png","element":"img"}],[{"style":{"width":"44%"},"width":616,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/15-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14},"width":103.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-0.png","element":"img","alt":" k2, k1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.1},"width":36.64,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-1.png","element":"img","alt":" k0","inline":true,"padRight":true},{"text":"are as in ","element":"span"},{"href":"#id-48","text":"(36)","element":"a"},{"text":", ","element":"span"},{"href":"#id-48","text":"(37) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-49","text":"(38)","element":"a"},{"text":", respectively. 
Moreover, the optimal feedback control is Gaussian, with its density function given by","element":"span"}],[{"id":"id-52","style":{"width":"94%"},"width":1302,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-2.png","element":"img"}],[{"text":"Finally, the associated optimal state process ","element":"span"},{"style":{"height":16.03},"width":211.52,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-3.png","element":"img","alt":" {X∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"under ","element":"span"},{"style":{"height":16},"width":102.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-4.png","element":"img","alt":" π∗(·; ·","inline":true},{"text":") ","element":"span"},{"text":"is the ","element":"span"},{"text":"unique solution of the SDE","element":"span"}],[{"style":{"width":"99%"},"width":1366,"height":268,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-5.png","element":"img"}],[{"id":"id-51","text":"A proof of this theorem follows essentially the same idea as that of Theorem ","element":"span"},{"href":"#id-50","text":"2, ","element":"a"},{"text":"but it is more technically involved, mainly for verifying the admissibility of the candidate optimal control. 
To ease the presentation, we defer it to Appendix B.","element":"span"}],[{"text":"Remark 5 ","element":"span"},{"text":"As in the state-independent case (see Appendix A), the solution to the SDE ","element":"span"},{"href":"#id-51","text":"(41) ","element":"a"},{"text":"can be expressed through the Doss-Sussmann transformation if ","element":"span"},{"style":{"height":15.2},"width":106.88,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-6.png","element":"img","alt":"D ≠ 0","inline":true},{"text":".","element":"span"}],[{"style":{"width":"98%"},"width":1352,"height":801,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-7.png","element":"img"}],[{"text":"If ","element":"span"},{"text":"C","element":"span"},{"text":"+ ","element":"span"},{"style":{"height":23.18},"width":337.76,"height":57.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-8.png","element":"img","alt":"D(k_2(B+CD)−R)/(N−k_2 D^2) = 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"text":"˜","element":"span"},{"style":{"height":15.2},"width":103.04,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-9.png","element":"img","alt":"A ≠ 0","inline":true},{"text":", then it follows from direct computation that","element":"span"}],[{"style":{"width":"88%"},"width":1222,"height":167,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/16-10.png","element":"img"}],[{"text":"The above results demonstrate that, for the general case of a state- and control-dependent reward, the optimal actions over ","element":"span"},{"text":"R ","element":"span"},{"text":"also depend on the current state ","element":"span"},{"text":"x","element":"span"},{"text":" and are selected according to a state-dependent Gaussian distribution ","element":"span"},{"href":"#id-52","text":"(40) ","element":"a"},{"text":"with a state-independent variance 
","element":"span"},{"style":{"height":21.04},"width":126.8,"height":52.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-0.png","element":"img","alt":"λ/(N−k_2 D^2)","inline":true,"padRight":true},{"text":". Note that if ","element":"span"},{"style":{"height":15.2},"width":211.6,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-1.png","element":"img","alt":" D ≠ 0, then","inline":true},{"style":{"height":20.85},"width":217.56,"height":52.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-2.png","element":"img","alt":"λ/(N−k_2 D^2) < λ/N","inline":true,"padRight":true},{"text":"(since ","element":"span"},{"style":{"height":13.1},"width":80.44,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-3.png","element":"img","alt":" k2 <","inline":true,"padRight":true},{"text":"0). Therefore, the exploration variance in the general ","element":"span"},{"text":"state-dependent case is ","element":"span"},{"text":"strictly smaller ","element":"span"},{"text":"than ","element":"span"},{"style":{"height":19.31},"width":27,"height":48.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-4.png","element":"img","alt":"λ/N","inline":true,"padRight":true},{"text":", the one in the state-independent ","element":"span"},{"text":"case. Recall that ","element":"span"},{"text":"D ","element":"span"},{"text":"is the coefficient of the control in the diffusion term of the state dynamics, generally representing the level of randomness of the environment.","element":"span"},{"style":{"height":7.6},"width":31.84,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-5.png","element":"img","alt":"14","inline":true,"padRight":true},{"text":"Therefore, volatility-impacting actions reduce the need for exploration. 
Moreover, the greater ","element":"span"},{"text":"D ","element":"span"},{"text":"is, the smaller the exploration variance becomes, indicating that even less exploration is required. As a result, the need for exploration is further reduced if an action has a greater impact on the volatility of the system dynamics. This hints that a more volatile environment renders more learning opportunities.","element":"span"}],[{"text":"On the other hand, the mean of the Gaussian distribution does not explicitly depend on ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-6.png","element":"img","alt":" λ","inline":true},{"text":". The implication is that the agent should concentrate on the most promising region in the action space while randomly selecting actions to interact with the unknown environment. It is intriguing that the entropy-regularized RL formulation separates the exploitation from exploration, respectively through the mean and variance of the resulting optimal Gaussian distribution.","element":"span"}],[{"text":"Remark 6 ","element":"span"},{"text":"It should be noted that it is the optimal ","element":"span"},{"text":"feedback ","element":"span"},{"text":"control distribution, not the open-loop control generated from the feedback control, that has the Gaussian distribution. 
More precisely, ","element":"span"},{"style":{"height":16},"width":114.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-7.png","element":"img","alt":" π∗(·; x","inline":true},{"text":") ","element":"span"},{"text":"defined by ","element":"span"},{"href":"#id-52","text":"(40) ","element":"a"},{"text":"is Gaussian for each and every ","element":"span"},{"text":"x","element":"span"},{"text":", but the measure-valued process with the density function","element":"span"}],[{"id":"id-53","style":{"width":"96%"},"width":1329,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":200,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-9.png","element":"img","alt":" {X∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"is the solution of the exploratory dynamics under the feedback ","element":"span"},{"text":"control ","element":"span"},{"style":{"height":16},"width":102.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-10.png","element":"img","alt":" π∗(·; ·","inline":true},{"text":") ","element":"span"},{"text":"with any fixed initial state, say, ","element":"span"},{"style":{"height":14.99},"width":165.76,"height":37.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-11.png","element":"img","alt":" X∗0 = x0","inline":true},{"text":", is in general not ","element":"span"},{"text":"Gaussian for any ","element":"span"},{"text":"t > ","element":"span"},{"text":"0","element":"span"},{"text":". 
The reason is that for each ","element":"span"},{"text":"t > ","element":"span"},{"text":"0","element":"span"},{"text":", the right-hand side of ","element":"span"},{"href":"#id-53","text":"(42) ","element":"a"},{"text":"is a composition of the Gaussian density function and a random variable ","element":"span"},{"style":{"height":14.99},"width":52.48,"height":37.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-12.png","element":"img","alt":"X∗t","inline":true,"padRight":true},{"text":"whose distribution is unknown. We stress that the Gaussian property for ","element":"span"},{"text":"feedback control is more important and relevant in the RL context, as it stipulates that at any given state, if one undertakes exploration, then she should follow a Gaussian distribution. ","element":"span"},{"text":"The open-loop control ","element":"span"},{"style":{"height":16.03},"width":209.12,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-13.png","element":"img","alt":" {π∗t , t ≥ 0}","inline":true},{"text":", generated from the Gaussian ","element":"span"},{"text":"feedback control, is just what the agent would end up with if she follows Gaussian exploration at every state.","element":"span"}],[{"text":"Finally, as noted earlier (see Remark ","element":"span"},{"href":"#id-54","text":"1)","element":"a"},{"text":", the optimality of the Gaussian distribution is still valid for problems with dynamics","element":"span"}],[{"style":{"width":"83%"},"width":1153,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/17-14.png","element":"img"}],[{"text":"and reward function in the form ","element":"span"},{"style":{"height":17.36},"width":296.8,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-0.png","element":"img","alt":" r(x, u) = r2(x)u2","inline":true,"padRight":true},{"text":"+ 
","element":"span"},{"style":{"height":16},"width":236.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-1.png","element":"img","alt":" r1(x)u + r0(x","inline":true},{"text":"), where the functions ","element":"span"},{"style":{"height":14.8},"width":283.36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-2.png","element":"img","alt":" A, B, C, D, r2, r1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.1},"width":33.76,"height":22.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-3.png","element":"img","alt":" r0","inline":true,"padRight":true},{"text":"are possibly nonlinear (pending some additional assumptions for the verification arguments to hold).","element":"span"}]]},{"heading":"5 The Cost and Eﬀect of Exploration","paragraphs":[[{"text":"Motivated by the necessity of exploration facing the typically unknown environment in an RL setting, we have formulated and analyzed a new class of stochastic control problems that combine entropy-regularized criteria and relaxed controls. We have also derived closed-form solutions and presented verification results for the important class of LQ problems. A natural question arises, namely, how to quantify the cost and effect of the exploration. This can be done by comparing our results to the ones for the classical stochastic LQ problems, which have neither entropy regularization nor control relaxation.","element":"span"}],[{"text":"We carry out this comparison analysis next.","element":"span"}],[{"text":"5.1 ","element":"span"},{"text":"The classical LQ problem","element":"span"}],[{"text":"We first briefly recall the classical stochastic LQ control problem in an infinite horizon with discounted reward. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":229.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-4.png","element":"img","alt":" {Wt, t ≥ 0}","inline":true,"padRight":true},{"text":"be a standard Brownian motion defined on the filtered probability space (Ω","element":"span"},{"style":{"height":16.7},"width":247.2,"height":41.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-5.png","element":"img","alt":", F, {Ft}t≥0, P","inline":true},{"text":") that satisfies the usual conditions. The controlled state process ","element":"span"},{"style":{"height":16.03},"width":188.96,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-6.png","element":"img","alt":" {xut , t ≥ 0}","inline":true,"padRight":true},{"text":"solves","element":"span"}],[{"style":{"width":"88%"},"width":1214,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-7.png","element":"img"}],[{"text":"with given constants ","element":"span"},{"text":"A, B, C ","element":"span"},{"text":"and ","element":"span"},{"text":"D, ","element":"span"},{"text":"and the process ","element":"span"},{"style":{"height":16},"width":189.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-8.png","element":"img","alt":" {ut, t ≥ 0}","inline":true,"padRight":true},{"text":"being a (classical, non-relaxed) control. 
The value function is defined as in ","element":"span"},{"href":"#id-25","text":"(2)","element":"a"},{"text":",","element":"span"}],[{"id":"id-55","style":{"width":"82%"},"width":1139,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-9.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":11.6},"width":100.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-10.png","element":"img","alt":" x ∈ R","inline":true},{"text":", where the reward function ","element":"span"},{"style":{"height":16},"width":73.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-11.png","element":"img","alt":" r(·, ·","inline":true},{"text":") is given by ","element":"span"},{"href":"#id-36","text":"(20)","element":"a"},{"text":". Here, the admissible set ","element":"span"},{"style":{"height":17.55},"width":95.48,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-12.png","element":"img","alt":" Acl(x","inline":true},{"text":") is defined as follows: ","element":"span"},{"style":{"height":17.55},"width":167,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-13.png","element":"img","alt":" u ∈ Acl(x","inline":true},{"text":") if","element":"span"}],[{"style":{"width":"81%"},"width":1125,"height":221,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-14.png","element":"img"}],[{"text":"The associated HJB equation is","element":"span"}],[{"style":{"width":"89%"},"width":1227,"height":235,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/18-15.png","element":"img"}],[{"style":{"width":"98%"},"width":1359,"height":242,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-0.png","element":"img"}],[{"id":"id-69","text":"with the maximizer being, provided that 
","element":"span"},{"style":{"height":17.36},"width":283,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-1.png","element":"img","alt":" N − D2w′′(x) >","inline":true,"padRight":true},{"text":"0,","element":"span"}],[{"style":{"width":"81%"},"width":1115,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-2.png","element":"img"}],[{"text":"The standard verification argument then deduces that ","element":"span"},{"text":"u ","element":"span"},{"text":"is the optimal feedback control.","element":"span"}],[{"text":"In the next section, we will establish a solvability equivalence between the entropy-regularized relaxed LQ problem and the classical one.","element":"span"}],[{"text":"5.2 ","element":"span"},{"text":"Solvability equivalence of classical and exploratory problems","element":"span"}],[{"text":"Given a reward function ","element":"span"},{"style":{"height":16},"width":73.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-3.png","element":"img","alt":" r(·, ·","inline":true},{"text":") and a classical controlled process ","element":"span"},{"href":"#id-24","text":"(1)","element":"a"},{"text":", the relaxed formulation ","element":"span"},{"href":"#id-29","text":"(6) ","element":"a"},{"text":"under the entropy-regularized objective is, naturally, a technically more challenging problem, compared to its classical counterpart.","element":"span"}],[{"text":"In this section, we show that there is actually a solvability equivalence between the exploratory and the classical stochastic LQ problems, in the sense that the value function and optimal control of one problem lead directly to those of the other. 
Such equivalence enables us to readily establish the convergence result as the exploration weight ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-4.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"decays to zero. Furthermore, it makes it possible to quantify the exploration cost, which we introduce in the sequel.","element":"span"}],[{"id":"id-56","style":{"width":"88%"},"width":1218,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-5.png","element":"img"}],[{"text":"(a) ","element":"span"},{"text":"The function ","element":"span"},{"style":{"height":28.8},"width":1028.84,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-6.png","element":"img","alt":" v(x) = 12α2x2 + α1x + α0 + λ2ρ�ln� 2πeλN−α2D2�− 1�, x ∈ R","inline":true},{"text":", with ","element":"span"},{"style":{"height":14},"width":202.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-7.png","element":"img","alt":" α0, α1 ∈ R","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.1},"width":95.32,"height":27.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-8.png","element":"img","alt":" α2 <","inline":true,"padRight":true},{"text":"0","element":"span"},{"text":", is the value function of the exploratory problem ","element":"span"},{"href":"#id-47","text":"(23) ","element":"a"},{"text":"and the corresponding optimal feedback control is","element":"span"}],[{"style":{"width":"85%"},"width":1179,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-9.png","element":"img"}],[{"text":"(b) ","element":"span"},{"text":"The function ","element":"span"},{"style":{"height":19.31},"width":548.84,"height":48.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-10.png","element":"img","alt":" 
w(x) = 12α2x2+α1x+α0, x ∈ R","inline":true},{"text":", with ","element":"span"},{"style":{"height":14},"width":182.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-11.png","element":"img","alt":" α0, α1 ∈ R","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.1},"width":85.24,"height":27.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-12.png","element":"img","alt":" α2 <","inline":true,"padRight":true},{"text":"0","element":"span"},{"text":", ","element":"span"},{"text":"is the value function of the classical problem ","element":"span"},{"href":"#id-55","text":"(44) ","element":"a"},{"text":"and the corresponding optimal feedback control is","element":"span"}],[{"style":{"width":"52%"},"width":728,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/19-13.png","element":"img"}],[{"text":"Proof. ","element":"span"},{"text":"See Appendix C.","element":"span"}],[{"text":"The above equivalence between statements (a) and (b) yields that if one problem is solvable, so is the other; and conversely, if one is not solvable, neither is the other.","element":"span"}],[{"text":"5.3 ","element":"span"},{"text":"Cost of exploration","element":"span"}],[{"text":"We define the exploration cost for a general RL problem to be the difference between the discounted accumulated rewards following the corresponding optimal ","element":"span"},{"text":"open-loop ","element":"span"},{"text":"controls under the classical objective ","element":"span"},{"href":"#id-25","text":"(2) ","element":"a"},{"text":"and the exploratory objective ","element":"span"},{"href":"#id-32","text":"(12)","element":"a"},{"text":", net of the value of the entropy. 
Note that the solvability equivalence established in the previous subsection is important for this definition, not least because the cost is well defined only if both the classical and the exploratory problems are solvable.","element":"span"}],[{"text":"Specifically, let the classical maximization problem ","element":"span"},{"href":"#id-25","text":"(2) ","element":"a"},{"text":"with the state dynamics ","element":"span"},{"href":"#id-24","text":"(1) ","element":"a"},{"text":"have the value function ","element":"span"},{"style":{"height":17.36},"width":83.48,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-0.png","element":"img","alt":" V cl(·","inline":true},{"text":") and optimal strategy ","element":"span"},{"style":{"height":16.03},"width":186.56,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-1.png","element":"img","alt":" {u∗t, t ≥ 0}","inline":true},{"text":", and ","element":"span"},{"text":"the corresponding exploratory problem have the value function ","element":"span"},{"style":{"height":16},"width":58.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-2.png","element":"img","alt":" V (·","inline":true},{"text":") and optimal control distribution ","element":"span"},{"style":{"height":16},"width":187.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-3.png","element":"img","alt":" {π∗t , t ≥ 0}","inline":true},{"text":". 
Then, we define the ","element":"span"},{"text":"exploration cost ","element":"span"},{"text":"as","element":"span"}],[{"id":"id-57","style":{"height":38.4},"width":466.04,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-4.png","element":"img","alt":"Cu∗,π∗(x) := V cl(x)−�V (x","inline":true},{"text":") + ","element":"span"},{"style":{"height":38.8},"width":153.92,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-5.png","element":"img","alt":" λE�� ∞","inline":true}],[{"style":{"width":"75%"},"width":1036,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-6.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":11.6},"width":100.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-7.png","element":"img","alt":" x ∈ R","inline":true},{"text":".","element":"span"}],[{"text":"The term in the parenthesis represents the total discounted rewards incurred by ","element":"span"},{"style":{"height":10.96},"width":40,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-8.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"after taking out the contribution of the entropy term to the value function ","element":"span"},{"style":{"height":16},"width":58.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-9.png","element":"img","alt":"V (·","inline":true},{"text":") of the exploratory problem. 
The exploration cost hence measures the best outcome due to the explicit inclusion of exploratory strategies in the entropy-regularized objective, relative to the benchmark ","element":"span"},{"style":{"height":17.36},"width":83.48,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-10.png","element":"img","alt":" V cl(·","inline":true},{"text":") which is the best possible objective value should the model be ","element":"span"},{"text":"a priori ","element":"span"},{"text":"fully known.","element":"span"}],[{"text":"We next compute the exploration cost for the LQ case. As we show, this cost is surprisingly simple: it depends only on two “agent-specific” parameters: the temperature parameter ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-11.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"and the discounting parameter ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-12.png","element":"img","alt":" ρ","inline":true},{"text":".","element":"span"}],[{"id":"id-67","text":"Theorem 8 ","element":"span"},{"text":"Assume that statement (a) (or equivalently, (b)) of Theorem ","element":"span"},{"href":"#id-56","text":"7 ","element":"a"},{"text":"holds. Then, the exploration cost for the stochastic LQ problem is","element":"span"}],[{"id":"id-58","style":{"width":"67%"},"width":923,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-13.png","element":"img"}],[{"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":201.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-14.png","element":"img","alt":" {π∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"be the open-loop control generated by the feedback ","element":"span"},{"text":"control ","element":"span"},{"style":{"height":11.15},"width":44.8,"height":27.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-15.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"given in statement (a) with respect to the initial state ","element":"span"},{"text":"x","element":"span"},{"text":", namely,","element":"span"}],[{"style":{"width":"82%"},"width":1141,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-16.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":207.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-17.png","element":"img","alt":" {X∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"is the associated state process of the exploratory problem, ","element":"span"},{"text":"starting from the state ","element":"span"},{"text":"x","element":"span"},{"text":", when ","element":"span"},{"style":{"height":10.96},"width":44.8,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-18.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"is applied. 
Then, it is straightforward to calculate","element":"span"}],[{"style":{"width":"54%"},"width":755,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-19.png","element":"img"}],[{"text":"The desired result now follows immediately from the general definition in ","element":"span"},{"href":"#id-57","text":"(47) ","element":"a"},{"text":"and the expressions of ","element":"span"},{"style":{"height":16},"width":58.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-20.png","element":"img","alt":" V (·","inline":true},{"text":") in (a) and ","element":"span"},{"style":{"height":17.36},"width":83.48,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/20-21.png","element":"img","alt":" V cl(·","inline":true},{"text":") in (b).","element":"span"}],[{"text":"In other words, the exploration cost for stochastic LQ problems can be completely pre-determined by the learning agent through choosing her individual parameters ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-0.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-1.png","element":"img","alt":" ρ","inline":true},{"text":", since the cost relies neither on the specific (unknown) linear state dynamics, nor on the quadratic reward structure.","element":"span"}],[{"text":"Moreover, the exploration cost ","element":"span"},{"href":"#id-58","text":"(48) ","element":"a"},{"text":"depends on ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-2.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-3.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"in a rather intuitive way: it increases as ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-4.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"increases, due to more emphasis placed on exploration, or as ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-5.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"decreases, indicating an effectively longer horizon for exploration.","element":"span"},{"style":{"height":8},"width":31.84,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-6.png","element":"img","alt":"15","inline":true}],[{"text":"5.4 ","element":"span"},{"text":"Vanishing exploration","element":"span"}],[{"text":"Herein, the exploration weight ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-7.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"has been taken as an exogenous parameter reflecting the level of exploration desired by the learning agent. The smaller this parameter is, the more emphasis is placed on exploitation. ","element":"span"},{"text":"When this parameter is sufficiently close to zero, the exploratory formulation is sufficiently close to the problem without exploration. 
Naturally, a desirable result is that if the exploration weight ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-8.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"goes to zero, then the entropy-regularized LQ problem would converge to its classical counterpart. ","element":"span"},{"text":"The following result makes this precise.","element":"span"}],[{"text":"Theorem 9 ","element":"span"},{"text":"Assume that statement (a) (or equivalently, (b)) of Theorem ","element":"span"},{"href":"#id-56","text":"7 ","element":"a"},{"text":"holds. Then, for each ","element":"span"},{"style":{"height":11.6},"width":100.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-9.png","element":"img","alt":" x ∈ R","inline":true},{"text":",","element":"span"}],[{"style":{"width":"69%"},"width":961,"height":246,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-10.png","element":"img"}],[{"text":"Proof. 
","element":"span"},{"text":"The weak convergence of the feedback controls is due to the explicit forms of ","element":"span"},{"style":{"height":10.96},"width":44.8,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-11.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.96},"width":42.88,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-12.png","element":"img","alt":" u∗","inline":true,"padRight":true},{"text":"in statements (a) and (b), and the fact that ","element":"span"},{"style":{"height":10},"width":113.92,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-13.png","element":"img","alt":" α1, α2","inline":true,"padRight":true},{"text":"are independent of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-14.png","element":"img","alt":" λ","inline":true},{"text":". ","element":"span"},{"text":"The pointwise convergence of the value functions follows easily from the forms of ","element":"span"},{"style":{"height":16},"width":58.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-15.png","element":"img","alt":" V (·","inline":true},{"text":") and ","element":"span"},{"style":{"height":17.36},"width":83.48,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-16.png","element":"img","alt":" V cl(·","inline":true},{"text":"), together with the fact that","element":"span"}],[{"style":{"width":"72%"},"width":1000,"height":284,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/21-17.png","element":"img"}]]},{"heading":"6 Conclusions","paragraphs":[[{"text":"This paper approaches RL from a stochastic control perspective. 
Indeed, control and RL both deal with the problem of managing dynamic and stochastic systems by making the best use of available information. However, as a recent survey paper ","element":"span"},{"href":"#id-59","referenceIndex":26,"text":"Recht ","element":"a"},{"href":"#id-59","referenceIndex":26,"text":"(2018) ","element":"a"},{"text":"points out, “...","element":"span"},{"text":"That the RL and control communities remain practically disjoint has led to the co-development of vastly different approaches to the same problems","element":"span"},{"text":"....” It is our view that communication and exchange of ideas between the two fields are of paramount importance to the progress of both fields, for an old idea from one field may well be a fresh one to the other. The continuous-time relaxed stochastic control formulation employed in this paper exemplifies such a vision.","element":"span"}],[{"text":"The main contributions of this paper are ","element":"span"},{"text":"conceptual ","element":"span"},{"text":"rather than ","element":"span"},{"text":"algorithmic","element":"span"},{"text":": casting the RL problem in a continuous-time setting and with the aid of stochastic control and stochastic calculus, we interpret and explain why the Gaussian distribution is best for exploration in RL. This finding is independent of the specific parameters of the underlying dynamics and reward function structure, as long as the dependence on actions is linear in the former and quadratic in the latter. The same can be said about other main results of the paper, such as the separation between exploration and exploitation in the mean and variance of the resulting Gaussian distribution, and the cost of exploration. The explicit forms of the derived optimal Gaussian distributions do indeed depend on the model specifications which are unknown in the RL context. 
With regard to implementing RL algorithms based on our results for LQ problems, we can either do it in continuous time and space directly following, for example, ","element":"span"},{"href":"#id-60","referenceIndex":4,"text":"Doya ","element":"a"},{"href":"#id-60","referenceIndex":4,"text":"(2000)","element":"a"},{"text":", or modify the problem into an MDP one by discretizing the time, and then learn the parameters of the optimal Gaussian distribution following standard RL procedures (e.g. the so-called ","element":"span"},{"text":"Q","element":"span"},{"text":"-learning). For that, our results may again be useful: they suggest that we only need to learn among the class of simpler Gaussian policies, i.e., ","element":"span"},{"style":{"height":16.4},"width":355.68,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/22-0.png","element":"img","alt":" π = N(· |θ1x + θ2, φ","inline":true},{"text":") (cf. ","element":"span"},{"href":"#id-52","text":"(40)","element":"a"},{"text":"), rather than a generic (nonlinear) parametrized Gaussian policy ","element":"span"},{"style":{"height":17.1},"width":366.68,"height":42.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/22-1.png","element":"img","alt":" πθ,φ = N(· |θ(x), φ(x","inline":true},{"text":")). We expect that this simpler functional form can considerably increase the learning speed.","element":"span"}]]},{"heading":"Appendix A: Explicit Solutions to (28)","paragraphs":[[{"text":"For a range of parameters, we derive explicit solutions to SDE ","element":"span"},{"href":"#id-44","text":"(28) ","element":"a"},{"text":"satisfied by the optimal state process ","element":"span"},{"style":{"height":16},"width":200,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/22-2.png","element":"img","alt":" {X∗t , t ≥ 0}","inline":true},{"text":". 
","element":"span"},{"text":"If ","element":"span"},{"text":"D ","element":"span"},{"text":"= 0, the SDE ","element":"span"},{"href":"#id-44","text":"(28) ","element":"a"},{"text":"reduces to","element":"span"}],[{"style":{"width":"65%"},"width":905,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/22-3.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":12.8},"width":64.6,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/22-4.png","element":"img","alt":" x ≥","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":14},"width":105.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/22-5.png","element":"img","alt":" BQ ≤","inline":true,"padRight":true},{"text":"0, the above equation has a nonnegative solution given by","element":"span"}],[{"style":{"width":"78%"},"width":1083,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/22-6.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":12.8},"width":64.6,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-0.png","element":"img","alt":" x ≤","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"style":{"height":14},"width":105.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-1.png","element":"img","alt":" BQ ≥","inline":true,"padRight":true},{"text":"0, it has a nonpositive solution","element":"span"}],[{"style":{"width":"78%"},"width":1083,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-2.png","element":"img"}],[{"text":"These two cases cover the special case when ","element":"span"},{"text":"Q ","element":"span"},{"text":"= 0 which is standard in the LQ control formulation. We are unsure if there is an explicit solution when neither of these assumptions is satisfied (e.g. 
when ","element":"span"},{"style":{"height":12.8},"width":64.6,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-3.png","element":"img","alt":" x ≥","inline":true,"padRight":true},{"text":"0 and ","element":"span"},{"text":"BQ > ","element":"span"},{"text":"0).","element":"span"}],[{"text":"If ","element":"span"},{"text":"C ","element":"span"},{"text":"= 0, the SDE ","element":"span"},{"href":"#id-44","text":"(28) ","element":"a"},{"text":"becomes","element":"span"}],[{"style":{"width":"61%"},"width":840,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-4.png","element":"img"}],[{"text":"and its unique solution is given by","element":"span"}],[{"style":{"width":"87%"},"width":1197,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-5.png","element":"img"}],[{"text":"if ","element":"span"},{"style":{"height":15.2},"width":191.44,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-6.png","element":"img","alt":" A ̸= 0, and","inline":true}],[{"style":{"width":"57%"},"width":784,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-7.png","element":"img"}],[{"text":"if ","element":"span"},{"text":"A ","element":"span"},{"text":"= 0.","element":"span"}],[{"text":"If ","element":"span"},{"href":"#id-44","style":{"height":17.36},"width":1156,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-8.png","element":"img","alt":" C ̸= 0 and D ̸= 0, then the diffusion coefficient of SDE (28) is C2","inline":true,"padRight":true},{"text":"in the unknown, with the first and second order derivatives being bounded. 
Hence, ","element":"span"},{"href":"#id-44","text":"(28) ","element":"a"},{"text":"can be solved explicitly using the Doss–Sussmann transformation (see, for example, ","element":"span"},{"href":"#id-61","referenceIndex":13,"text":"Karatzas and Shreve ","element":"a"},{"href":"#id-61","referenceIndex":13,"text":"(1991)","element":"a"},{"text":", pp. 295–297). This transformation uses the ansatz","element":"span"}],[{"id":"id-62","style":{"width":"76%"},"width":1046,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-9.png","element":"img"}],[{"text":"for some deterministic function ","element":"span"},{"text":"F ","element":"span"},{"text":"and an adapted process ","element":"span"},{"style":{"height":14},"width":122.2,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-10.png","element":"img","alt":" Yt, t ≥","inline":true,"padRight":true},{"text":"0, solving a random ODE. Applying Itô’s formula to ","element":"span"},{"href":"#id-62","text":"(49) ","element":"a"},{"text":"and using the dynamics in ","element":"span"},{"href":"#id-44","text":"(28)","element":"a"},{"text":", we deduce that ","element":"span"},{"text":"F ","element":"span"},{"text":"solves, for each fixed ","element":"span"},{"text":"y","element":"span"},{"text":", the ODE","element":"span"}],[{"id":"id-63","style":{"width":"82%"},"width":1133,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-11.png","element":"img"}],[{"text":"Moreover, ","element":"span"},{"style":{"height":14},"width":117.88,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-12.png","element":"img","alt":" Yt, t ≥","inline":true,"padRight":true},{"text":"0, is the unique pathwise solution to the random 
ODE","element":"span"}],[{"id":"id-64","style":{"width":"75%"},"width":1039,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-13.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"61%"},"width":850,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-14.png","element":"img"}],[{"text":"It is easy to verify that both equations ","element":"span"},{"href":"#id-63","text":"(50) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-64","text":"(51) ","element":"a"},{"text":"have a unique solution. Solving ","element":"span"},{"href":"#id-63","text":"(50)","element":"a"},{"text":", we obtain","element":"span"}],[{"style":{"width":"96%"},"width":1323,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/23-15.png","element":"img"}],[{"text":"This, in turn, leads to the explicit expression of the function ","element":"span"},{"text":"G","element":"span"},{"text":"(","element":"span"},{"text":"z, y","element":"span"},{"text":").","element":"span"}]]},{"heading":"Appendix B: Proof of Theorem 4","paragraphs":[[{"text":"Recall that the function ","element":"span"},{"text":"v","element":"span"},{"text":", where ","element":"span"},{"style":{"height":19.51},"width":625.16,"height":48.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-0.png","element":"img","alt":" v(x) = 12k2x2 + k1x + k0, x ∈ R","inline":true},{"text":", where ","element":"span"},{"style":{"height":14},"width":99.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-1.png","element":"img","alt":"k2, k1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.1},"width":36.64,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-2.png","element":"img","alt":" k0","inline":true,"padRight":true},{"text":"are defined by 
","element":"span"},{"href":"#id-48","text":"(36)","element":"a"},{"text":", ","element":"span"},{"href":"#id-48","text":"(37) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-49","text":"(38)","element":"a"},{"text":", respectively, satisfies the HJB equation ","element":"span"},{"href":"#id-40","text":"(14)","element":"a"},{"text":".","element":"span"}],[{"text":"Throughout this proof we fix the initial state ","element":"span"},{"style":{"height":11.6},"width":100.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-3.png","element":"img","alt":" x ∈ R","inline":true},{"text":". Let ","element":"span"},{"style":{"height":16},"width":143,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-4.png","element":"img","alt":" π ∈ A(x","inline":true},{"text":") and ","element":"span"},{"style":{"height":10.8},"width":55.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-5.png","element":"img","alt":" Xπ","inline":true,"padRight":true},{"text":"be the associated state process solving ","element":"span"},{"href":"#id-35","text":"(22) ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-6.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"being used. Let ","element":"span"},{"text":"T > ","element":"span"},{"text":"0 be arbitrary. 
Define the stopping times ","element":"span"},{"style":{"height":16.03},"width":187,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-7.png","element":"img","alt":" τ πn := {t ≥","inline":true,"padRight":true},{"text":"0 : ","element":"span"},{"style":{"height":21.79},"width":38.4,"height":54.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-8.png","element":"img","alt":" � t0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.99},"width":176.92,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-9.png","element":"img","alt":"e−ρtv′(Xπt","inline":true,"padRight":true},{"text":")˜","element":"span"},{"style":{"height":20.27},"width":285.4,"height":50.68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-10.png","element":"img","alt":"σ(Xπt , πt))2 dt ≥","inline":true,"padRight":true},{"text":"n","element":"span"},{"text":"}","element":"span"},{"text":", for ","element":"span"},{"style":{"height":12.8},"width":66.04,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-11.png","element":"img","alt":" n ≥","inline":true,"padRight":true},{"text":"1. 
Then, Itô’s formula yields","element":"span"}],[{"style":{"width":"95%"},"width":1309,"height":243,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-12.png","element":"img"}],[{"text":"Taking expectations, using that ","element":"span"},{"text":"v ","element":"span"},{"text":"solves the HJB equation ","element":"span"},{"href":"#id-40","text":"(14) ","element":"a"},{"text":"and that ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-13.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is in general suboptimal yield","element":"span"}],[{"style":{"width":"89%"},"width":1229,"height":143,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-14.png","element":"img"}],[{"text":"= ","element":"span"},{"text":"v","element":"span"},{"text":"(","element":"span"},{"text":"x","element":"span"},{"text":")+","element":"span"},{"text":"E","element":"span"}],[{"style":{"width":"94%"},"width":1297,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-15.png","element":"img"}],[{"text":"Classical results yield ","element":"span"},{"style":{"height":19.2},"width":50.12,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-16.png","element":"img","alt":" E[","inline":true},{"text":"sup","element":"span"},{"style":{"height":20.1},"width":312.96,"height":50.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-17.png","element":"img","alt":"0≤t≤T |Xπt |2] ≤ K","inline":true},{"text":"(1+","element":"span"},{"style":{"height":17.36},"width":125.72,"height":43.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-18.png","element":"img","alt":"x2)eKT","inline":true,"padRight":true},{"text":", for some constant ","element":"span"},{"text":"K > ","element":"span"},{"text":"0 independent of ","element":"span"},{"text":"n 
","element":"span"},{"text":"(but dependent on ","element":"span"},{"text":"T ","element":"span"},{"text":"and the model coefficients). Sending ","element":"span"},{"style":{"height":8.8},"width":125.92,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-19.png","element":"img","alt":" n → ∞","inline":true},{"text":", we deduce that","element":"span"}],[{"style":{"width":"98%"},"width":1349,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-20.png","element":"img"}],[{"text":"where we have used the dominated convergence theorem and that ","element":"span"},{"style":{"height":16},"width":143,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-21.png","element":"img","alt":" π ∈ A(x","inline":true},{"text":").","element":"span"}],[{"text":"Next, we recall the admissible condition lim inf","element":"span"},{"style":{"height":19.2},"width":301.4,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-22.png","element":"img","alt":"T →∞ e−ρT E�(XπT","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":138.2,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-23.png","element":"img","alt":"2� = 0.","inline":true,"padRight":true},{"text":"This, together with the fact that ","element":"span"},{"style":{"height":13.1},"width":80.92,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-24.png","element":"img","alt":" k2 <","inline":true,"padRight":true},{"text":"0, lead to lim sup","element":"span"},{"style":{"height":19.2},"width":322.52,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-25.png","element":"img","alt":"T →∞ E�e−ρT 
v(XπT","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":58.84,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-26.png","element":"img","alt":"�=","inline":true,"padRight":true},{"text":"0. Applying the dominated convergence theorem once more yields","element":"span"}],[{"style":{"width":"75%"},"width":1042,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-27.png","element":"img"}],[{"text":"for each ","element":"span"},{"style":{"height":11.6},"width":100.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-28.png","element":"img","alt":" x ∈ R","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":142.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-29.png","element":"img","alt":" π ∈ A(x","inline":true},{"text":"). Hence, ","element":"span"},{"style":{"height":16},"width":198.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-30.png","element":"img","alt":" v(x) ≥ V (x","inline":true},{"text":"), for all ","element":"span"},{"style":{"height":11.6},"width":100.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-31.png","element":"img","alt":" x ∈ R","inline":true},{"text":". 
On the other hand, we deduce that the right hand side of ","element":"span"},{"href":"#id-40","text":"(14) ","element":"a"},{"text":"is maximized at","element":"span"}],[{"style":{"width":"87%"},"width":1209,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/24-32.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":16.03},"width":290.72,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-0.png","element":"img","alt":" π∗ = {π∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"be the open-loop control distribution generated from the ","element":"span"},{"text":"above feedback law along with the corresponding state process ","element":"span"},{"style":{"height":16},"width":199.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-1.png","element":"img","alt":" {X∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":14.99},"width":130.52,"height":37.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-2.png","element":"img","alt":"X∗0 = x","inline":true},{"text":", and assume for now that ","element":"span"},{"style":{"height":16},"width":161.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-3.png","element":"img","alt":" π∗ ∈ A(x","inline":true},{"text":"). 
Then","element":"span"}],[{"style":{"width":"99%"},"width":1362,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-4.png","element":"img"}],[{"text":"Noting that lim inf","element":"span"},{"style":{"height":19.2},"width":322.04,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-5.png","element":"img","alt":"T →∞ E[e−ρT v(X∗T","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":58.84,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-6.png","element":"img","alt":"] ≤","inline":true,"padRight":true},{"text":"lim sup","element":"span"},{"style":{"height":19.2},"width":322.52,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-7.png","element":"img","alt":"T →∞ E[e−ρT v(X∗T","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":177.04,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-8.png","element":"img","alt":"] = 0, and","inline":true,"padRight":true},{"text":"applying the dominated convergence theorem yield","element":"span"}],[{"style":{"width":"76%"},"width":1056,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-9.png","element":"img"}],[{"text":"for any ","element":"span"},{"style":{"height":11.6},"width":100.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-10.png","element":"img","alt":" x ∈ R","inline":true},{"text":". This proves that ","element":"span"},{"text":"v ","element":"span"},{"text":"is indeed the value function, namely ","element":"span"},{"style":{"height":10.8},"width":104.92,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-11.png","element":"img","alt":" v ≡ V","inline":true,"padRight":true},{"text":". 
It remains to show that ","element":"span"},{"style":{"height":16},"width":161.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-12.png","element":"img","alt":" π∗ ∈ A(x","inline":true},{"text":"). First, we verify that","element":"span"}],[{"id":"id-66","style":{"width":"66%"},"width":909,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.03},"width":207.68,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-14.png","element":"img","alt":" {X∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"solves the SDE ","element":"span"},{"href":"#id-51","text":"(41)","element":"a"},{"text":". To this end, Itô’s formula yields, for ","element":"span"},{"text":"any ","element":"span"},{"style":{"height":14},"width":112.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-15.png","element":"img","alt":" T ≥ 0,","inline":true}],[{"style":{"width":"89%"},"width":1237,"height":249,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-16.png","element":"img"}],[{"text":"Following arguments similar to those in the proof of Lemma ","element":"span"},{"href":"#id-65","text":"10 ","element":"a"},{"text":"in Appendix C, we can show that ","element":"span"},{"style":{"height":16.22},"width":108.92,"height":40.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-17.png","element":"img","alt":" E[(X∗T","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-18.png","element":"img","alt":"2","inline":true},{"text":"] contains the terms ","element":"span"},{"style":{"height":17.89},"width":184.76,"height":44.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-19.png","element":"img","alt":" 
e(2 ˜A+ ˜C12)T","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.24},"width":65.24,"height":40.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-20.png","element":"img","alt":" e ˜AT","inline":true,"padRight":true},{"text":".","element":"span"}],[{"text":"If 2 ","element":"span"},{"text":"˜","element":"span"},{"text":"A ","element":"span"},{"text":"+ ","element":"span"},{"text":"˜","element":"span"},{"style":{"height":20.75},"width":151.52,"height":51.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-21.png","element":"img","alt":"C12 ≤ ˜A","inline":true},{"text":", then ˜","element":"span"},{"style":{"height":14},"width":73.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-22.png","element":"img","alt":"A ≤","inline":true,"padRight":true},{"text":"0, in which case ","element":"span"},{"href":"#id-66","text":"(52) ","element":"a"},{"text":"easily follows. Therefore, to show ","element":"span"},{"href":"#id-66","text":"(52)","element":"a"},{"text":", it remains to consider the case in which the term ","element":"span"},{"style":{"height":17.7},"width":184.76,"height":44.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-23.png","element":"img","alt":" e(2 ˜A+ ˜C12)T","inline":true,"padRight":true},{"text":"dominates ","element":"span"},{"style":{"height":16.24},"width":65.24,"height":40.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-24.png","element":"img","alt":" e ˜AT","inline":true,"padRight":true},{"text":", as ","element":"span"},{"style":{"height":11.2},"width":131.68,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-25.png","element":"img","alt":" T → ∞","inline":true},{"text":". 
In turn, using that ","element":"span"},{"style":{"height":13.1},"width":36.64,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-26.png","element":"img","alt":" k2","inline":true,"padRight":true},{"text":"solves the equation ","element":"span"},{"text":"(33)","element":"span"},{"text":", we obtain","element":"span"}],[{"id":"id-68","style":{"width":"99%"},"width":1373,"height":365,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-27.png","element":"img"}],[{"text":"Notice that the first fraction is nonpositive due to ","element":"span"},{"style":{"height":13.1},"width":87.64,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-28.png","element":"img","alt":" k2 <","inline":true,"padRight":true},{"text":"0, while the second fraction is bounded for any ","element":"span"},{"style":{"height":13.1},"width":81.88,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-29.png","element":"img","alt":" k2 <","inline":true,"padRight":true},{"text":"0. 
Using Assumption ","element":"span"},{"href":"#id-45","text":"3 ","element":"a"},{"text":"on the range of ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/25-30.png","element":"img","alt":" ρ","inline":true},{"text":", we then easily deduce ","element":"span"},{"href":"#id-66","text":"(52)","element":"a"},{"text":".","element":"span"}],[{"text":"Next, we establish the admissibility constraint","element":"span"}],[{"style":{"width":"41%"},"width":574,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-0.png","element":"img"}],[{"text":"The definition of ","element":"span"},{"text":"L ","element":"span"},{"text":"and the form of ","element":"span"},{"text":"r","element":"span"},{"text":"(","element":"span"},{"text":"x, u","element":"span"},{"text":") yield","element":"span"}],[{"style":{"width":"91%"},"width":1252,"height":469,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-1.png","element":"img"}],[{"text":"where we have applied computations similar to those in the proof of Theorem ","element":"span"},{"href":"#id-67","text":"8. 
","element":"a"},{"text":"Recall that","element":"span"}],[{"style":{"width":"92%"},"width":1275,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-2.png","element":"img"}],[{"text":"It is then clear that it suffices to prove ","element":"span"},{"style":{"height":19.6},"width":258.88,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-3.png","element":"img","alt":" E�� ∞0 e−ρt(X∗t","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":173.72,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-4.png","element":"img","alt":"2dt�< ∞,","inline":true,"padRight":true},{"text":"which follows easily since, as shown in ","element":"span"},{"href":"#id-68","text":"(54)","element":"a"},{"text":", ","element":"span"},{"style":{"height":12},"width":62.68,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-5.png","element":"img","alt":" ρ >","inline":true,"padRight":true},{"text":"2 ","element":"span"},{"text":"˜","element":"span"},{"text":"A","element":"span"},{"text":"+ ","element":"span"},{"text":"˜","element":"span"},{"style":{"height":20.66},"width":62.08,"height":51.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-6.png","element":"img","alt":"C12","inline":true,"padRight":true},{"text":"under Assumption ","element":"span"},{"href":"#id-45","text":"3. 
","element":"a"},{"text":"The remaining admissibility conditions for ","element":"span"},{"style":{"height":10.96},"width":40,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-7.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"can be easily verified.","element":"span"}]]},{"heading":"Appendix C: Proof of Theorem 7","paragraphs":[[{"text":"We first note that when (a) holds, the function ","element":"span"},{"text":"v ","element":"span"},{"text":"solves the HJB equation ","element":"span"},{"href":"#id-46","text":"(32) ","element":"a"},{"text":"of the exploratory LQ problem. Similarly for the classical LQ problem when (b) holds.","element":"span"}],[{"text":"Next, we prove the equivalence between (a) and (b). First, a comparison between the two HJB equations ","element":"span"},{"href":"#id-46","text":"(32) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-69","text":"(45) ","element":"a"},{"text":"yields that if ","element":"span"},{"text":"v ","element":"span"},{"text":"in (a) solves the former, then ","element":"span"},{"text":"w ","element":"span"},{"text":"in (b) solves the latter, and vice versa.","element":"span"}],[{"text":"Throughout this proof, we let ","element":"span"},{"text":"x ","element":"span"},{"text":"be fixed, being the initial state of both the exploratory problem in statement (a) and the classical problem in statement (b). 
Let ","element":"span"},{"style":{"height":16},"width":295.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-8.png","element":"img","alt":" π∗ = {π∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":293.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-9.png","element":"img","alt":" u∗ = {u∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"be respectively the open-loop ","element":"span"},{"text":"controls generated by the feedback controls ","element":"span"},{"style":{"height":10.96},"width":44.8,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-10.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.96},"width":43.36,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-11.png","element":"img","alt":" u∗","inline":true,"padRight":true},{"text":"of the two problems, and ","element":"span"},{"style":{"height":16.03},"width":308,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-12.png","element":"img","alt":"X∗ = {X∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.03},"width":281.6,"height":40.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-13.png","element":"img","alt":" x∗ = {x∗t , t ≥ 0}","inline":true,"padRight":true},{"text":"be respectively the corresponding state ","element":"span"},{"text":"processes, both starting from ","element":"span"},{"text":"x","element":"span"},{"text":". 
It remains to show the equivalence between the admissibility of ","element":"span"},{"style":{"height":10.96},"width":40,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-14.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"for the exploratory problem and that of ","element":"span"},{"style":{"height":10.96},"width":39.04,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-15.png","element":"img","alt":" u∗","inline":true,"padRight":true},{"text":"for the classical problem. To this end, we first compute ","element":"span"},{"style":{"height":16.42},"width":108.92,"height":41.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-16.png","element":"img","alt":" E[(X∗T","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-17.png","element":"img","alt":"2","inline":true},{"text":"] and ","element":"span"},{"style":{"height":16.42},"width":98.36,"height":41.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-18.png","element":"img","alt":" E[(x∗T","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-19.png","element":"img","alt":"2","inline":true},{"text":"].","element":"span"}],[{"style":{"width":"99%"},"width":1373,"height":167,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/26-20.png","element":"img"}],[{"style":{"width":"79%"},"width":1093,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-0.png","element":"img"}],[{"text":"+","element":"span"}],[{"style":{"width":"99%"},"width":1373,"height":290,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-1.png","element":"img"}],[{"text":"Similarly, the classical 
dynamics of ","element":"span"},{"style":{"height":10.96},"width":38.56,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-2.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"text":"under ","element":"span"},{"style":{"height":10.96},"width":39.04,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-3.png","element":"img","alt":" u∗","inline":true,"padRight":true},{"text":"solves","element":"span"}],[{"style":{"width":"53%"},"width":738,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-4.png","element":"img"}],[{"text":"The desired equivalence of admissibility then follows from the following ","element":"span"},{"id":"id-65","text":"lemma.","element":"span"}],[{"text":"Lemma 10 ","element":"span"},{"text":"We have that (i) ","element":"span"},{"text":"lim inf","element":"span"},{"style":{"height":22.55},"width":445.76,"height":56.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-5.png","element":"img","alt":"T →∞ e−ρT E[(X∗T)2] = 0","inline":true,"padRight":true},{"text":"if and only if ","element":"span"},{"text":"lim inf","element":"span"},{"style":{"height":22.73},"width":431.36,"height":56.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-6.png","element":"img","alt":"T →∞ e−ρT E[(x∗T)2] = 0","inline":true},{"text":"; (ii) ","element":"span"},{"style":{"height":28.8},"width":464.8,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-7.png","element":"img","alt":" E[∫_0^∞ e−ρt(X∗t)2dt] < ∞","inline":true,"padRight":true},{"text":"if and only if ","element":"span"},{"style":{"height":28.8},"width":435.52,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-8.png","element":"img","alt":"E[∫_0^∞ e−ρt(x∗t)2dt] < ∞","inline":true},{"text":".","element":"span"}],[{"text":"Proof. 
","element":"span"},{"text":"Denote ","element":"span"},{"style":{"height":16},"width":230.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-9.png","element":"img","alt":" n(t) := E [X∗t","inline":true,"padRight":true},{"text":"], for ","element":"span"},{"style":{"height":12.8},"width":56.44,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-10.png","element":"img","alt":" t ≥","inline":true,"padRight":true},{"text":"0. Then, a standard argument involving ","element":"span"},{"text":"a series of stopping times and the dominated convergence theorem yields the","element":"span"}],[{"style":{"width":"70%"},"width":976,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-11.png","element":"img"}],[{"text":"whose solution is ","element":"span"},{"style":{"height":28.8},"width":170.84,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-12.png","element":"img","alt":" n(t) =�x","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":21.44},"width":38,"height":53.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-13.png","element":"img","alt":"A2A1","inline":true}],[{"style":{"height":28.8},"width":194.48,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-14.png","element":"img","alt":"�eA1t − A2A1","inline":true,"padRight":true},{"text":", if ","element":"span"},{"style":{"height":16},"width":483.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-15.png","element":"img","alt":" A1 ̸= 0, and n(t) = x + A2t","inline":true},{"text":", if ","element":"span"},{"style":{"height":19.2},"width":807.04,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-16.png","element":"img","alt":"A1 = 0. 
Similarly, the function m(t) := E�(X∗t","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":115.48,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-17.png","element":"img","alt":"2�, t ≥","inline":true,"padRight":true},{"text":"0, solves the ODE","element":"span"}],[{"style":{"width":"90%"},"width":1238,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-18.png","element":"img"}],[{"text":"We can also show that ","element":"span"},{"style":{"height":19.2},"width":235.88,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-19.png","element":"img","alt":" n(t) = E�x∗t�","inline":true},{"text":", and deduce that ˆ","element":"span"},{"style":{"height":19.2},"width":254.08,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-20.png","element":"img","alt":"m(t) := E�(x∗t","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":34.76,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-21.png","element":"img","alt":"2�","inline":true},{"text":", ","element":"span"},{"style":{"height":12.8},"width":56.44,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-22.png","element":"img","alt":"t ≥","inline":true,"padRight":true},{"text":"0, satisfies","element":"span"}],[{"style":{"width":"83%"},"width":1144,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-23.png","element":"img"}],[{"text":"Next, we find explicit solutions to the above ODEs corresponding to various conditions on the parameters. 
(a) If ","element":"span"},{"style":{"height":17.39},"width":1037.84,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-24.png","element":"img","alt":" A1 = B21 = 0, then direct computation gives n(t) = x + A2t","inline":true},{"text":", and","element":"span"}],[{"style":{"width":"83%"},"width":1150,"height":348,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/27-25.png","element":"img"}],[{"style":{"width":"69%"},"width":958,"height":376,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/28-0.png","element":"img"}],[{"text":"(c) If ","element":"span"},{"style":{"height":28.8},"width":743.48,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/28-1.png","element":"img","alt":" A1 ̸= 0 and A1 + B21 = 0, then n(t) =�x","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":21.44},"width":38,"height":53.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/28-2.png","element":"img","alt":"A2A1","inline":true}],[{"style":{"height":28.8},"width":196.4,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/28-3.png","element":"img","alt":"�eA1t − A2A1","inline":true,"padRight":true},{"text":". 
Further ","element":"span"},{"text":"calculations yield","element":"span"}],[{"style":{"width":"87%"},"width":1198,"height":483,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/28-4.png","element":"img"}],[{"text":"(d) If ","element":"span"},{"style":{"height":28.8},"width":795.32,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/28-5.png","element":"img","alt":" A1 ≠ 0 and 2A1 + B21 = 0, we have n(t) = (x","inline":true,"padRight":true},{"text":"+ ","element":"span"},{"style":{"height":21.63},"width":38,"height":54.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/28-6.png","element":"img","alt":"A2/A1","inline":true}],[{"style":{"width":"102%"},"width":1410,"height":1037,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/28-7.png","element":"img"}],[{"style":{"width":"76%"},"width":1057,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1812.01552/images/29-0.png","element":"img"}],[{"text":"It is easy to see that for all cases (a)–(e), the assertions in the Lemma follow and we conclude.","element":"span"}]]},{"heading":"References","paragraphs":[[{"text":"Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the ","element":"span"},{"text":"multi-armed bandit problem. ","element":"span"},{"text":"Machine Learning","element":"span"},{"text":", 47(2-3):235–256, 2002.","element":"span"}],[{"text":"Ronen I Brafman and Moshe Tennenholtz. R-max—a general polynomial time ","element":"span"},{"text":"algorithm for near-optimal reinforcement learning. ","element":"span"},{"text":"Journal of Machine Learning Research","element":"span"},{"text":", 3(Oct):213–231, 2002.","element":"span"}],[{"text":"Cyrus Derman. ","element":"span"},{"text":"Finite state Markovian decision processes","element":"span"},{"text":". ","element":"span"},{"text":"Academic Press, New York, 1970.","element":"span"}],[{"id":"id-60","text":"Kenji Doya. 
","element":"span"},{"text":"Reinforcement learning in continuous time and space. ","element":"span"},{"text":"Neural Computation","element":"span"},{"text":", 12(1):219–245, 2000.","element":"span"}],[{"id":"id-15","text":"Nicole El Karoui, Nguyen Du Huu, and Monique Jeanblanc-Picqu´e. Compactifi- ","element":"span"},{"text":"cation methods in the control of degenerate diffusions: existence of an optimal control. ","element":"span"},{"text":"Stochastics","element":"span"},{"text":", 20(3):169–219, 1987.","element":"span"}],[{"id":"id-14","text":"Wendell H Fleming and Makiko Nisio. On stochastic relaxed control for partially ","element":"span"},{"text":"observed diffusions. ","element":"span"},{"text":"Nagoya Mathematical Journal","element":"span"},{"text":", 93:71–108, 1984.","element":"span"}],[{"id":"id-9","text":"Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement ","element":"span"},{"text":"learning via soft updates. In ","element":"span"},{"text":"Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence","element":"span"},{"text":", pages 202–211, 2016.","element":"span"}],[{"text":"John Gittins. A dynamic allocation index for the sequential design of experi- ","element":"span"},{"text":"ments. ","element":"span"},{"text":"Progress in statistics","element":"span"},{"text":", pages 241–266, 1974.","element":"span"}],[{"id":"id-10","text":"Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforce- ","element":"span"},{"text":"ment learning with deep energy-based policies. In ","element":"span"},{"text":"Proceedings of the 34th International Conference on Machine Learning","element":"span"},{"text":", pages 1352–1361, 2017.","element":"span"}],[{"id":"id-11","text":"Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. ","element":"span"},{"text":"Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. 
","element":"span"},{"text":"arXiv preprint arXiv:1801.01290","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-13","text":"Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. ","element":"span"},{"text":"Convergence of stochastic iterative dynamic programming algorithms. In ","element":"span"},{"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 703–710, 1994.","element":"span"}],[{"id":"id-6","text":"Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforce- ","element":"span"},{"text":"ment learning: A survey. ","element":"span"},{"text":"Journal of Artificial Intelligence Research","element":"span"},{"text":", 4:237– 285, 1996.","element":"span"}],[{"id":"id-61","text":"Ioannis Karatzas and Steven E Shreve. ","element":"span"},{"text":"Brownian motion and stochastic calculus","element":"span"},{"text":". Springer-Verlag, 2nd edition, 1991.","element":"span"}],[{"id":"id-38","text":"Haya Kaspi and Avishai Mandelbaum. Multi-armed bandits in discrete and ","element":"span"},{"text":"continuous time. ","element":"span"},{"text":"Annals of Applied Probability","element":"span"},{"text":", pages 1270–1290, 1998.","element":"span"}],[{"id":"id-17","text":"Thomas Kurtz and Richard Stockbridge. ","element":"span"},{"text":"Existence of Markov controls and characterization of optimal Markov controls. ","element":"span"},{"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 36(2):609–653, 1998.","element":"span"}],[{"id":"id-18","text":"Thomas Kurtz and Richard Stockbridge. Stationary solutions and forward equa- ","element":"span"},{"text":"tions for controlled and singular martingale problems. ","element":"span"},{"text":"Electronic Journal of Probability","element":"span"},{"text":", 6, 2001.","element":"span"}],[{"id":"id-3","text":"Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 
End-to-end ","element":"span"},{"text":"training of deep visuomotor policies. ","element":"span"},{"text":"The Journal of Machine Learning Research","element":"span"},{"text":", 17(1):1334–1373, 2016.","element":"span"}],[{"id":"id-22","text":"Weiwei Li and Emanuel Todorov. Iterative linearization methods for ","element":"span"},{"text":"approximately optimal control and estimation of non-linear stochastic system. ","element":"span"},{"text":"International Journal of Control","element":"span"},{"text":", 80(9):1439–1453, 2007.","element":"span"}],[{"id":"id-19","text":"Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, ","element":"span"},{"text":"Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ","element":"span"},{"text":"International Conference on Learning Representations","element":"span"},{"text":", 2016.","element":"span"}],[{"id":"id-37","text":"Avi Mandelbaum. Continuous multi-armed bandits and multi-parameter ","element":"span"},{"text":"processes. ","element":"span"},{"text":"The Annals of Probability","element":"span"},{"text":", pages 1527–1556, 1987.","element":"span"}],[{"id":"id-4","text":"Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J ","element":"span"},{"text":"Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, and Koray Kavukcuoglu. Learning to navigate in complex environments. ","element":"span"},{"text":"arXiv preprint arXiv:1611.03673","element":"span"},{"text":", 2016.","element":"span"}],[{"id":"id-2","text":"Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel ","element":"span"},{"text":"Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. 
","element":"span"},{"text":"Nature","element":"span"},{"text":", 518(7540):529, 2015.","element":"span"}],[{"id":"id-8","text":"Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging ","element":"span"},{"text":"the gap between value and policy based reinforcement learning. In ","element":"span"},{"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 2775–2785, 2017.","element":"span"}],[{"id":"id-23","text":"Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust- ","element":"span"},{"text":"pcl: An off-policy trust region method for continuous control. In ","element":"span"},{"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-20","text":"Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard ","element":"span"},{"text":"Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. ","element":"span"},{"text":"arXiv preprint arXiv:1706.01905","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-59","text":"Benjamin Recht. A tour of reinforcement learning: The view from continuous ","element":"span"},{"text":"control. ","element":"span"},{"text":"arXiv preprint arXiv:1806.09460v2","element":"span"},{"text":", 2018.","element":"span"}],[{"text":"Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample com- ","element":"span"},{"text":"plexity of optimistic exploration. In ","element":"span"},{"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 2256–2264, 2013.","element":"span"}],[{"text":"Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sam- ","element":"span"},{"text":"pling. ","element":"span"},{"text":"Mathematics of Operations Research","element":"span"},{"text":", 39(4):1221–1243, 2014.","element":"span"}],[{"text":"Claude Elwood Shannon. 
A mathematical theory of communication. ","element":"span"},{"text":"ACM SIGMOBILE Mobile Computing and Communications Review","element":"span"},{"text":", 5(1):3–55, 2001.","element":"span"}],[{"id":"id-0","text":"David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George ","element":"span"},{"text":"Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. ","element":"span"},{"text":"Mastering the game of Go with deep neural networks and tree search. ","element":"span"},{"text":"Nature","element":"span"},{"text":", 529(7587):484, 2016.","element":"span"}],[{"id":"id-1","text":"David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja ","element":"span"},{"text":"Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of Go without human knowledge. ","element":"span"},{"text":"Nature","element":"span"},{"text":", 550(7676):354, 2017.","element":"span"}],[{"id":"id-12","text":"Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. ","element":"span"},{"text":"Convergence results for single-step on-policy reinforcement-learning algorithms. ","element":"span"},{"text":"Machine Learning","element":"span"},{"text":", 38(3):287–308, 2000.","element":"span"}],[{"text":"Alexander L Strehl and Michael L Littman. An analysis of model-based interval ","element":"span"},{"text":"estimation for Markov decision processes. ","element":"span"},{"text":"Journal of Computer and System Sciences","element":"span"},{"text":", 74(8):1309–1331, 2008.","element":"span"}],[{"text":"Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning ","element":"span"},{"text":"in finite MDPs: PAC analysis. ","element":"span"},{"text":"Journal of Machine Learning Research","element":"span"},{"text":", 10(Nov):2413–2444, 2009.","element":"span"}],[{"id":"id-5","text":"Richard S Sutton and Andrew G Barto. 
","element":"span"},{"text":"Reinforcement learning: An introduction","element":"span"},{"text":". MIT press, 2018.","element":"span"}],[{"text":"William R Thompson. On the likelihood that one unknown probability exceeds ","element":"span"},{"text":"another in view of the evidence of two samples. ","element":"span"},{"text":"Biometrika","element":"span"},{"text":", 25(3/4):285–294, 1933.","element":"span"}],[{"id":"id-21","text":"Emanuel Todorov and Weiwei Li. A generalized iterative lqg method for locally- ","element":"span"},{"text":"optimal feedback control of constrained nonlinear stochastic systems. ","element":"span"},{"text":"In ","element":"span"},{"text":"American Control Conference, 2005. Proceedings of the 2005","element":"span"},{"text":", pages 300–306. IEEE, 2005.","element":"span"}],[{"id":"id-16","text":"Xun Yu Zhou. On the existence of optimal relaxed controls of stochastic partial ","element":"span"},{"text":"differential equations. ","element":"span"},{"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 30(2): 247–261, 1992.","element":"span"}],[{"id":"id-7","text":"Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Max- ","element":"span"},{"text":"imum entropy inverse reinforcement learning. ","element":"span"},{"text":"In ","element":"span"},{"text":"AAAI","element":"span"},{"text":", volume 8, pages 1433–1438. Chicago, IL, USA, 2008.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]