36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"1403.6530","publisher":"arxiv","paperJSON":{"title":"Variance-Constrained Actor-Critic Algorithms for Discounted and Average Reward MDPs","paperID":"1403.6530","avgLineHeight":11.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3c","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Keywords: ","element":"span"},{"text":"Markov decision process (MDP), reinforcement learning (RL), risk sensitive RL, actor-critic algorithms, multi-time-scale stochastic approximation, simultaneous perturbation stochastic approximation (SPSA), smoothed functional (SF).","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"The usual optimization criteria for an infinite horizon Markov decision process (MDP) are the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"expected sum of discounted rewards ","element":"span"},{"text":"and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"average reward ","element":"span"},{"href":"#id-0","referenceIndex":47,"text":"[47, ","element":"a"},{"href":"#id-1","referenceIndex":5,"text":"5]","element":"a"},{"text":". Many algorithms have been developed to maximize these criteria both when the model of the system is known (planning) and unknown (learning) ","element":"span"},{"href":"#id-2","referenceIndex":7,"text":"[7, ","element":"a"},{"href":"#id-3","referenceIndex":58,"text":"58]","element":"a"},{"text":". 
These algorithms can be categorized into ","element":"span"},{"style":{"fontWeight":"bold"},"text":"value function-based ","element":"span"},{"text":"methods that are mainly based on the two celebrated dynamic programming algorithms ","element":"span"},{"style":{"fontStyle":"italic"},"text":"value iteration ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"policy iteration","element":"span"},{"text":"; and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"policy gradient ","element":"span"},{"text":"methods that are based on updating the policy parameters in the direction of the gradient of a performance measure, i.e., the value function of the initial state or the average reward. Policy gradient methods estimate the gradient of the performance measure either without using an explicit representation of the value function (e.g., ","element":"span"},{"href":"#id-4","referenceIndex":67,"text":"[67, ","element":"a"},{"href":"#id-5","referenceIndex":38,"text":"38, ","element":"a"},{"href":"#id-6","referenceIndex":4,"text":"4]","element":"a"},{"text":") or using such a representation, in which case they are referred to as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"actor-critic ","element":"span"},{"text":"algorithms (e.g., ","element":"span"},{"href":"#id-7","referenceIndex":59,"text":"[59, ","element":"a"},{"href":"#id-8","referenceIndex":33,"text":"33, ","element":"a"},{"href":"#id-9","referenceIndex":43,"text":"43, ","element":"a"},{"href":"#id-10","referenceIndex":13,"text":"13, ","element":"a"},{"href":"#id-11","referenceIndex":14,"text":"14]","element":"a"},{"text":"). 
Using an explicit representation of the value function (e.g., linear function approximation) in actor-critic algorithms reduces the variance of the gradient estimate at the cost of introducing a bias.","element":"span"}],[{"text":"Actor-critic methods were among the earliest to be investigated in RL ","element":"span"},{"href":"#id-12","referenceIndex":2,"text":"[2, ","element":"a"},{"href":"#id-13","referenceIndex":56,"text":"56]","element":"a"},{"text":". They comprise a family of reinforcement learning (RL) methods that maintain two distinct algorithmic components: An ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Actor","element":"span"},{"text":", whose role is to maintain and update an action-selection policy; and a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Critic","element":"span"},{"text":", whose role is to estimate the value function associated with the actor’s policy. Thus, the critic addresses a problem of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"prediction","element":"span"},{"text":", whereas the actor is concerned with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"control","element":"span"},{"text":". A common practice is to update the policy parameters using stochastic gradient ascent, and to estimate the value function using some form of temporal difference (TD) learning ","element":"span"},{"href":"#id-14","referenceIndex":57,"text":"[57]","element":"a"},{"text":".","element":"span"}],[{"text":"However, in many applications, we may prefer to minimize some measure of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"risk ","element":"span"},{"text":"in addition to maximizing a usual optimization criterion. In such cases, we would like to use a criterion that incorporates a penalty for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"variability ","element":"span"},{"text":"induced by a given policy. 
This variability can be due to two types of uncertainties: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"1) ","element":"span"},{"text":"uncertainties in the model parameters, which is the topic of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"robust ","element":"span"},{"text":"MDPs (e.g., ","element":"span"},{"href":"#id-15","referenceIndex":42,"text":"[42, ","element":"a"},{"href":"#id-16","referenceIndex":24,"text":"24, ","element":"a"},{"href":"#id-17","referenceIndex":68,"text":"68]","element":"a"},{"text":"), and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"2) ","element":"span"},{"text":"the inherent uncertainty related to the stochastic nature of the system, which is the topic of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"risk-sensitive ","element":"span"},{"text":"MDPs (e.g., ","element":"span"},{"href":"#id-18","referenceIndex":31,"text":"[31, ","element":"a"},{"href":"#id-19","referenceIndex":51,"text":"51, ","element":"a"},{"href":"#id-20","referenceIndex":27,"text":"27]","element":"a"},{"text":").","element":"span"}],[{"text":"In risk-sensitive sequential decision-making, the objective is to maximize a risk-sensitive criterion such as the expected exponential utility ","element":"span"},{"href":"#id-18","referenceIndex":31,"text":"[31]","element":"a"},{"text":", a variance related measure ","element":"span"},{"href":"#id-19","referenceIndex":51,"text":"[51, ","element":"a"},{"href":"#id-20","referenceIndex":27,"text":"27]","element":"a"},{"text":", the percentile performance ","element":"span"},{"href":"#id-21","referenceIndex":28,"text":"[28]","element":"a"},{"text":", or conditional value-at-risk (CVaR) ","element":"span"},{"href":"#id-22","referenceIndex":48,"text":"[48, ","element":"a"},{"href":"#id-23","referenceIndex":50,"text":"50]","element":"a"},{"text":". 
Unfortunately, when we include a measure of risk in our optimality criteria, the corresponding optimal policy is usually no longer Markovian stationary (e.g., ","element":"span"},{"href":"#id-20","referenceIndex":27,"text":"[27]","element":"a"},{"text":") and/or computing it is not tractable (e.g., ","element":"span"},{"href":"#id-20","referenceIndex":27,"text":"[27, ","element":"a"},{"href":"#id-24","referenceIndex":37,"text":"37]","element":"a"},{"text":"). Although risk-sensitive sequential decision-making has a long history in operations research and finance, it has only recently attracted attention in the machine learning community. Most of the work on this topic (including the works mentioned above) has been in the context of MDPs (when the model of the system is known) and much less work has been done within the reinforcement learning (RL) framework (when the model is unknown and all the information about the system is obtained from the samples resulting from the agent’s interaction with the environment). In risk-sensitive RL, we can mention the work by Borkar ","element":"span"},{"href":"#id-25","referenceIndex":17,"text":"[17, ","element":"a"},{"href":"#id-26","referenceIndex":18,"text":"18, ","element":"a"},{"href":"#id-27","referenceIndex":21,"text":"21] ","element":"a"},{"text":"and Basu et al. ","element":"span"},{"href":"#id-28","referenceIndex":3,"text":"[3] ","element":"a"},{"text":"who considered the expected exponential utility, the one by Mihatsch and Neuneier ","element":"span"},{"href":"#id-29","referenceIndex":40,"text":"[40] ","element":"a"},{"text":"that formulated a new risk-sensitive control framework based on transforming the temporal difference errors that occur during learning, and the one by Tamar et al. ","element":"span"},{"href":"#id-30","referenceIndex":62,"text":"[62] ","element":"a"},{"text":"on several variance-related measures. Tamar et al. 
","element":"span"},{"href":"#id-30","referenceIndex":62,"text":"[62] ","element":"a"},{"text":"study stochastic shortest path problems, and in this context, propose a policy gradient algorithm (and in a more recent work ","element":"span"},{"href":"#id-31","referenceIndex":61,"text":"[61] ","element":"a"},{"text":"an actor-critic algorithm) for maximizing several risk-sensitive criteria that involve both the expectation and variance of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"return ","element":"span"},{"text":"random variable (defined as the sum of the rewards that the agent obtains in an episode).","element":"span"}],[{"text":"In this paper,","element":"span"},{"text":"1 ","element":"span"},{"text":"we develop actor-critic algorithms for optimizing variance-related risk measures in both discounted and average reward MDPs. In the following, we first summarize our contributions in the discounted reward setting and follow it with those in average reward setting.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Discounted reward setting. ","element":"span"},{"text":"Here we define the measure of variability as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"variance of the return ","element":"span"},{"text":"(similar to ","element":"span"},{"href":"#id-30","referenceIndex":62,"text":"[62]","element":"a"},{"text":"). 
We formulate the following constrained optimization problem with the aim of maximizing the mean of the return subject to its variance being bounded from above: For a given ","element":"span"},{"style":{"height":11.6},"width":98.78,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-0.png","element":"img","alt":" α > 0","inline":true},{"text":",","element":"span"}],[{"style":{"width":"39%"},"width":719,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-1.png","element":"img"}],[{"text":"In the above, ","element":"span"},{"style":{"height":17.38},"width":122.28,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-2.png","element":"img","alt":" V θ(x0)","inline":true,"padRight":true},{"text":"is the mean of the return, starting in state ","element":"span"},{"style":{"height":13.38},"width":38.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-3.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"for a policy identified by its parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-4.png","element":"img","alt":" θ","inline":true},{"text":", while ","element":"span"},{"style":{"height":17.39},"width":117.85,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-5.png","element":"img","alt":"Λθ(x0)","inline":true,"padRight":true},{"text":"is the variance of the return (see Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"for precise definitions). 
A standard approach to solve the above problem is to employ the Lagrangian relaxation procedure ","element":"span"},{"href":"#id-32","referenceIndex":6,"text":"[6] ","element":"a"},{"text":"and solve the following unconstrained problem:","element":"span"}],[{"style":{"width":"46%"},"width":841,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-7.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"is the Lagrange multiplier. For solving the above problem, it is required to derive a formula for the gradient of the Lagrangian ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-8.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":", both w.r.t. ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-9.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-10.png","element":"img","alt":" λ","inline":true},{"text":". While the gradient w.r.t. ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-11.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"is particularly simple since it is the constraint value, the other gradient, i.e., w.r.t. ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-12.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is complicated. 
We derive this formula in Lemma ","element":"span"},{"href":"#id-33","text":"1 ","element":"a"},{"text":"and show that ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/1-13.png","element":"img","alt":" ∇θL(θ, λ)","inline":true,"padRight":true},{"text":"requires the gradient of the value function at every state of the MDP (see the discussion in Sections ","element":"span"},{"text":"3 ","element":"span"},{"text":"and ","element":"span"},{"text":"4)","element":"span"},{"text":".","element":"span"}],[{"text":"Note that we operate in a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"simulation optimization ","element":"span"},{"text":"setting, i.e., we have access to reward samples from the underlying MDP. Thus, we need to estimate the mean and variance of the return (we use a TD critic for this purpose) and then use these estimates to compute the gradient of the Lagrangian. The latter is then used to descend in the policy parameter. We estimate the gradient of the Lagrangian using two simultaneous perturbation methods: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"simultaneous perturbation stochastic approximation ","element":"span"},{"text":"(SPSA) ","element":"span"},{"href":"#id-34","referenceIndex":52,"text":"[52] ","element":"a"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"smoothed functional ","element":"span"},{"text":"(SF) ","element":"span"},{"href":"#id-35","referenceIndex":32,"text":"[32]","element":"a"},{"text":", resulting in two separate discounted reward actor-critic algorithms. 
In addition, we propose second-order algorithms with a Newton step, using both SPSA and SF.","element":"span"}],[{"text":"Simultaneous perturbation methods have been popular in the field of stochastic optimization and the reader is referred to ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16] ","element":"a"},{"text":"for a textbook introduction. First introduced in ","element":"span"},{"href":"#id-34","referenceIndex":52,"text":"[52]","element":"a"},{"text":", the idea of SPSA is to perturb each coordinate of a parameter vector uniformly using Rademacher random variables, in the quest for finding the minimum of a function that is only observable via simulation. Traditional gradient schemes require ","element":"span"},{"style":{"height":13.19},"width":58.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-0.png","element":"img","alt":" 2κ1","inline":true,"padRight":true},{"text":"evaluations of the function, where ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-1.png","element":"img","alt":" κ1","inline":true,"padRight":true},{"text":"is the parameter dimension. On the other hand, SPSA requires only two evaluations irrespective of the parameter dimension and hence is an efficient scheme, especially useful in high-dimensional settings. While a one-simulation variant of SPSA was proposed in ","element":"span"},{"href":"#id-37","referenceIndex":53,"text":"[53]","element":"a"},{"text":", the original two-simulation SPSA algorithm is preferred as it is more efficient and is also seen to work better than its one-simulation variant. 
Later enhancements to the original SPSA scheme include deterministic perturbations based on certain Hadamard matrices ","element":"span"},{"href":"#id-38","referenceIndex":12,"text":"[12] ","element":"a"},{"text":"and second-order methods that estimate the Hessian using SPSA ","element":"span"},{"href":"#id-39","referenceIndex":54,"text":"[54, ","element":"a"},{"href":"#id-40","referenceIndex":8,"text":"8]","element":"a"},{"text":". The SF schemes are another class of simultaneous perturbation methods, which again perturb each coordinate of the parameter vector uniformly. However, unlike SPSA, Gaussian random variables are used here for the perturbation. Originally proposed in ","element":"span"},{"href":"#id-35","referenceIndex":32,"text":"[32]","element":"a"},{"text":", the SF schemes have been studied and enhanced in later works such as ","element":"span"},{"href":"#id-41","referenceIndex":55,"text":"[55, ","element":"a"},{"href":"#id-42","referenceIndex":9,"text":"9]","element":"a"},{"text":". Further, ","element":"span"},{"href":"#id-43","referenceIndex":15,"text":"[15] ","element":"a"},{"text":"proposes both SPSA-like and SF-like schemes for constrained optimization.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Average reward setting. 
","element":"span"},{"text":"Here we first define the measure of variability as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"long-run variance ","element":"span"},{"text":"of a policy as follows:","element":"span"}],[{"style":{"width":"39%"},"width":712,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":71.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-3.png","element":"img","alt":" ρ(θ)","inline":true,"padRight":true},{"text":"is the average reward under the policy identified by its parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-4.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"(see Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"for precise definitions). The aim here is to solve the following constrained optimization problem:","element":"span"}],[{"style":{"width":"34%"},"width":631,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-5.png","element":"img"}],[{"text":"As in the discounted setting, we derive an expression for the gradient of the Lagrangian (see Lemma ","element":"span"},{"href":"#id-44","text":"3)","element":"a"},{"text":". Unlike the discounted setting, we do not require sophisticated simulation optimization schemes, as the gradient expressions in Lemma ","element":"span"},{"href":"#id-44","text":"3 ","element":"a"},{"text":"suggest a simpler alternative that employs ","element":"span"},{"style":{"fontStyle":"italic"},"text":"compatible features ","element":"span"},{"href":"#id-7","referenceIndex":59,"text":"[59, ","element":"a"},{"href":"#id-9","referenceIndex":43,"text":"43]","element":"a"},{"text":". 
Compatible features for linearly approximating the action-value function of policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-6.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"are of the form ","element":"span"},{"style":{"height":16},"width":226.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-7.png","element":"img","alt":" ∇θ log µ(a|x)","inline":true},{"text":". These features are well-defined if the policy is differentiable w.r.t. its parameters ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-8.png","element":"img","alt":" θ","inline":true},{"text":". Sutton et al. ","element":"span"},{"href":"#id-7","referenceIndex":59,"text":"[59] ","element":"a"},{"text":"showed the advantages of using these features in approximating the action-value function in actor-critic algorithms. In ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":", the authors use compatible features to develop actor-critic algorithms for a risk-neutral setting. We extend this to the variance-constrained setting and establish that the square value function itself serves as a good baseline level when calculating the gradient of the average square reward (see the discussion surrounding Lemma ","element":"span"},{"href":"#id-45","text":"4)","element":"a"},{"text":". This facilitates the usage of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"compatible features ","element":"span"},{"text":"for obtaining unbiased estimates of both the average reward and the square reward. 
We then develop an actor-critic algorithm that employs these ","element":"span"},{"style":{"fontStyle":"italic"},"text":"compatible features ","element":"span"},{"text":"in order to descend in the policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-9.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and also identify the bias that arises due to function approximation (see Lemma ","element":"span"},{"href":"#id-46","text":"5)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof of convergence. ","element":"span"},{"text":"Using the ordinary differential equations (ODE) approach, we establish the asymptotic convergence of our algorithms to locally risk-sensitive optimal policies. Our algorithms employ multi-timescale stochastic approximation in both settings. The convergence proof proceeds by analysing each timescale separately. In essence, the iterates on a faster timescale view those on a slower timescale as quasi-static, while the iterates on a slower timescale view those on a faster timescale as equilibrated. Using this principle, we show that the TD critics (on the fastest timescale in all the algorithms) converge to fixed points of the Bellman operator, for any fixed policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-10.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-11.png","element":"img","alt":" λ","inline":true},{"text":". 
Next, for any given ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-12.png","element":"img","alt":" λ","inline":true},{"text":", the policy update asymptotically tracks the corresponding ODE and converges to its equilibria. Finally, the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-13.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"updates on the slowest timescale converge, and the overall convergence is to a local saddle point of the Lagrangian. Moreover, the limiting point is feasible for the constrained optimization problem mentioned above, i.e., the policy obtained upon convergence satisfies the constraint that the variance is upper-bounded by ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/2-14.png","element":"img","alt":" α","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Simulation experiments. ","element":"span"},{"text":"We demonstrate the usefulness of our discounted and average reward risk-sensitive actor-critic algorithms in a traffic signal control application. The objective in our formulation is to minimize the total number of vehicles in the system, which indirectly minimizes the delay experienced by the system. The motivation behind using a risk-sensitive control strategy is to reduce the variations in the delay experienced by road users. From the results, we observe that the risk-sensitive algorithms proposed in this paper result in a long-term (discounted or average) cost that is higher than that of their risk-neutral variants. 
However, in terms of the empirical variance of the cost (both discounted and average), the risk-sensitive algorithms outperform their risk-neutral variants.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"It is important to note that both our discounted and average reward algorithms can be easily extended to other variance-related risk criteria such as the Sharpe ratio, which is popular in financial decision-making ","element":"span"},{"href":"#id-47","referenceIndex":49,"style":{"fontStyle":"italic"},"text":"[49] ","element":"a"},{"style":{"fontStyle":"italic"},"text":"(see Remarks ","element":"span"},{"href":"#id-48","style":{"fontStyle":"italic"},"text":"5 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-49","style":{"fontStyle":"italic"},"text":"9 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"for more details).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Another important point is that the ","element":"span"},{"text":"expected exponential utility ","element":"span"},{"style":{"fontStyle":"italic"},"text":"risk measure can also be considered as an approximation of the mean-variance tradeoff due to the following Taylor expansion (see, e.g., Eq. 
11 in ","element":"span"},{"href":"#id-29","referenceIndex":40,"style":{"fontStyle":"italic"},"text":"[40]","element":"a"},{"style":{"fontStyle":"italic"},"text":")","element":"span"}],[{"style":{"width":"42%"},"width":766,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and we know that it is much easier to design actor-critic or other reinforcement learning algorithms ","element":"span"},{"href":"#id-25","referenceIndex":17,"style":{"fontStyle":"italic"},"text":"[17, ","element":"a"},{"href":"#id-26","referenceIndex":18,"style":{"fontStyle":"italic"},"text":"18, ","element":"a"},{"href":"#id-28","referenceIndex":3,"style":{"fontStyle":"italic"},"text":"3, ","element":"a"},{"href":"#id-27","referenceIndex":21,"style":{"fontStyle":"italic"},"text":"21] ","element":"a"},{"style":{"fontStyle":"italic"},"text":"for this risk measure than those that will be presented in this paper. However, this formulation is limited in the sense that it requires knowing the ideal tradeoff between the mean and variance, since it takes ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-1.png","element":"img","alt":" β","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"as an input. 
On the other hand, the mean-variance formulations considered in this paper are more general because","element":"span"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"(1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"we optimize for the Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-2.png","element":"img","alt":" λ","inline":true},{"style":{"fontStyle":"italic"},"text":", which plays a similar role to ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-3.png","element":"img","alt":" β","inline":true},{"style":{"fontStyle":"italic"},"text":", as a tradeoff between the mean and variance, and","element":"span"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"(2) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"it is usually more natural to know an upper bound on the variance (as in the mean-variance formulations considered in this paper) than to know the ideal tradeoff between the mean and variance (as considered in the expected exponential utility formulation).","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Despite all this, we should not consider these formulations as replacements for each other or try to find a formulation that is the best for all problems, but instead should consider them as different formulations that each might be the right fit for a specific problem.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Closely related works. 
","element":"span"},{"text":"In comparison to ","element":"span"},{"href":"#id-30","referenceIndex":62,"text":"[62] ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-31","referenceIndex":61,"text":"[61]","element":"a"},{"text":", which are the most closely related contributions, we would like to point out the following:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(1) ","element":"span"},{"text":"The authors develop policy gradient and actor-critic methods for stochastic shortest path problems in ","element":"span"},{"href":"#id-30","referenceIndex":62,"text":"[62] ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-31","referenceIndex":61,"text":"[61]","element":"a"},{"text":", respectively. On the other hand, we devise actor-critic algorithms for both discounted and average reward MDP settings.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(2) ","element":"span"},{"text":"More importantly, we note the difficulty in the discounted formulation, which requires estimating the gradient of the value function at every state of the MDP and also sampling from two different distributions. This precludes us from using ","element":"span"},{"style":{"fontStyle":"italic"},"text":"compatible features ","element":"span"},{"text":"- a method that has been employed successfully in actor-critic algorithms in a risk-neutral setting (cf. ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":") as well as more recently in ","element":"span"},{"href":"#id-31","referenceIndex":61,"text":"[61] ","element":"a"},{"text":"for a risk-sensitive stochastic shortest path setting. 
We alleviate the above-mentioned problems for the discounted setting by employing simultaneous perturbation-based schemes for estimating the gradient in the first-order methods and the Hessian in the second-order methods that we propose.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(3) ","element":"span"},{"text":"Unlike ","element":"span"},{"href":"#id-30","referenceIndex":62,"text":"[62, ","element":"a"},{"href":"#id-31","referenceIndex":61,"text":"61] ","element":"a"},{"text":"who consider a fixed ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-4.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"in their constrained formulations, we perform dual ascent using sample variance constraints and optimize the Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-5.png","element":"img","alt":" λ","inline":true},{"text":". 
In rigorous terms, ","element":"span"},{"style":{"height":13.19},"width":43.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-6.png","element":"img","alt":" λn","inline":true,"padRight":true},{"text":"in our algorithms is shown to converge to a local maximum of ","element":"span"},{"style":{"height":17.38},"width":194.39,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-7.png","element":"img","alt":" ∇λL(θλ, λ)","inline":true,"padRight":true},{"text":"(here ","element":"span"},{"style":{"height":13.38},"width":38.81,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-8.png","element":"img","alt":" θλ","inline":true,"padRight":true},{"text":"is the limit of the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-9.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"recursion for a given value of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-10.png","element":"img","alt":" λ","inline":true},{"text":") and the limit ","element":"span"},{"style":{"height":10.99},"width":39.24,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-11.png","element":"img","alt":" λ∗","inline":true,"padRight":true},{"text":"is such that the variance constraint is satisfied for the corresponding policy ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/3-12.png","element":"img","alt":" θλ∗","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Organization of the paper. ","element":"span"},{"text":"The rest of the paper is organized as follows: In Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"we describe the RL setting. 
In Section ","element":"span"},{"text":"3, ","element":"span"},{"text":"we describe the risk-sensitive MDP in the discounted setting and propose actor-critic algorithms for this setting in Section ","element":"span"},{"text":"4. ","element":"span"},{"text":"In Section ","element":"span"},{"text":"5, ","element":"span"},{"text":"we present the risk measure for the average setting and propose an actor-critic algorithm that optimizes this risk measure in Section ","element":"span"},{"text":"6. ","element":"span"},{"text":"In Sections ","element":"span"},{"text":"7–","element":"span"},{"text":"8, ","element":"span"},{"text":"we present the convergence proofs for the algorithms in discounted and average reward settings, respectively. In Section ","element":"span"},{"text":"9, ","element":"span"},{"text":"we describe the experimental setup and present the results in both average and discounted cost settings. Finally, in Section ","element":"span"},{"text":"10, ","element":"span"},{"text":"we provide the concluding remarks and outline a few future research directions.","element":"span"}]]},{"heading":"2 Preliminaries","paragraphs":[[{"text":"We consider sequential decision-making tasks that can be formulated as a reinforcement learning (RL) problem. In RL, an agent interacts with a dynamic, stochastic, and incompletely known environment, with the goal of optimizing some measure of its ","element":"span"},{"style":{"fontStyle":"italic"},"text":"long-term ","element":"span"},{"text":"performance. This interaction is often modeled as a Markov decision process (MDP). 
An MDP is a tuple ","element":"span"},{"style":{"height":17.39},"width":266.33,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-0.png","element":"img","alt":" (X, A, R, P, x0)","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"are the state and action spaces; ","element":"span"},{"style":{"height":16},"width":203.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-1.png","element":"img","alt":" R(x, a), x ∈","inline":true},{"style":{"height":14.8},"width":158.71,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-2.png","element":"img","alt":"X, a ∈ A","inline":true,"padRight":true},{"text":"is the reward random variable whose expectation is denoted by ","element":"span"},{"style":{"height":19.2},"width":521.94,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-3.png","element":"img","alt":" r(x, a) = E[R(x, a)]; P(·|x, a)","inline":true,"padRight":true},{"text":"is the transition probability distribution; and ","element":"span"},{"style":{"height":14.18},"width":122.38,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-4.png","element":"img","alt":" x0 ∈ X","inline":true,"padRight":true},{"text":"is the initial state","element":"span"},{"href":"#id-50","text":"2","element":"a"},{"text":". We assume that both the state and action spaces are finite.","element":"span"}],[{"text":"The rule according to which the agent acts in its environment (selects an action at each state) is called a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"policy","element":"span"},{"text":". 
A Markovian stationary policy ","element":"span"},{"style":{"height":16},"width":100.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-5.png","element":"img","alt":" µ(·|x)","inline":true,"padRight":true},{"text":"is a probability distribution over actions, conditioned on the current state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". The goal in an RL problem is to find a policy that optimizes the long-term performance measure of interest, e.g., maximizes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"expected discounted sum of rewards ","element":"span"},{"text":"or the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"average reward","element":"span"},{"text":".","element":"span"}],[{"text":"In policy gradient and actor-critic methods, we define a class of parameterized stochastic policies","element":"span"},{"style":{"height":19.2},"width":239.25,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-6.png","element":"img","alt":"{µ(·|x; θ), x ∈","inline":true},{"style":{"height":19.2},"width":303.37,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-7.png","element":"img","alt":"X, θ ∈ Θ ⊆ Rκ1}","inline":true},{"text":", estimate the gradient of the performance measure w.r.t. the policy parameters ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-8.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"from the observed system trajectories, and then improve the policy by adjusting its parameters in the direction of the gradient. 
Since in this setting a policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-9.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"is represented by its ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-10.png","element":"img","alt":" κ1","inline":true},{"text":"-dimensional parameter vector ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-11.png","element":"img","alt":" θ","inline":true},{"text":", policy dependent functions can be written as a function of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-12.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"in place of ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-13.png","element":"img","alt":" µ","inline":true},{"text":". 
So, we use ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-14.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-15.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"interchangeably in the paper.","element":"span"}],[{"text":"We make the following assumptions on the policy, parameterized by ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-16.png","element":"img","alt":" θ","inline":true},{"text":":","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(A1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any state-action pair ","element":"span"},{"style":{"height":16},"width":250.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-17.png","element":"img","alt":" (x, a) ∈ X ×A","inline":true},{"style":{"fontStyle":"italic"},"text":", the policy ","element":"span"},{"style":{"height":16},"width":147.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-18.png","element":"img","alt":" µ(a|x; θ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is continuously differentiable in the parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-19.png","element":"img","alt":"θ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"(A2) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Markov chain induced by any policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-20.png","element":"img","alt":" θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is irreducible.","element":"span"}],[{"text":"The above assumptions are standard requirements in policy gradient and actor-critic methods.","element":"span"}],[{"text":"Finally, we denote by ","element":"span"},{"style":{"height":16},"width":96.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-21.png","element":"img","alt":" dµ(x)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":397.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-22.png","element":"img","alt":" πµ(x, a) = dµ(x)µ(a|x)","inline":true},{"text":", the stationary distribution of state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"under policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-23.png","element":"img","alt":" µ","inline":true},{"text":", respectively. The stationary distributions can be seen to exist because we consider a finite state-action space setting and irreducibility here implies positive recurrence. 
Similarly in the discounted formulation, we define the ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-24.png","element":"img","alt":" γ","inline":true},{"text":"-discounted visiting distribution of state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"under policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-25.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":19.72},"width":873.62,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-26.png","element":"img","alt":" dµγ(x|x0) = (1 − γ) Σ∞n=0 γn Pr(xn = x|x0 = x0; µ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.72},"width":500.94,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-27.png","element":"img","alt":" πµγ (x, a|x0) = dµγ(x|x0)µ(a|x)","inline":true},{"text":".","element":"span"}]]},{"heading":"3 Discounted Reward Setting","paragraphs":[[{"text":"For a given policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-28.png","element":"img","alt":" µ","inline":true},{"text":", we define the return of a state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"(state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":")","element":"span"},{"text":") as the sum of discounted rewards encountered by the agent when it starts at state 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"(state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":")","element":"span"},{"text":") and then follows policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-29.png","element":"img","alt":" µ","inline":true},{"text":", i.e.,","element":"span"}],[{"style":{"width":"45%"},"width":830,"height":243,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-30.png","element":"img"}],[{"text":"The expected value of these two random variables are the value and action-value functions of policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-31.png","element":"img","alt":" µ","inline":true},{"text":", i.e.,","element":"span"}],[{"id":"id-50","style":{"width":"78%"},"width":1426,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/4-32.png","element":"img"}],[{"text":"The goal in the standard (risk-neutral) discounted reward formulation is to find an optimal policy ","element":"span"},{"style":{"height":19.68},"width":385.92,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-0.png","element":"img","alt":" µ∗ = arg maxµ V µ(x0)","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.39},"width":38.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-1.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"is the initial state of the system. 
The most common measure of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"variability ","element":"span"},{"text":"in the stream of rewards is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"variance of the return","element":"span"},{"text":", defined by","element":"span"}],[{"style":{"width":"73%"},"width":1335,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-2.png","element":"img"}],[{"text":"The above measure was first introduced by Sobel ","element":"span"},{"href":"#id-19","referenceIndex":51,"text":"[51]","element":"a"},{"text":". Note that","element":"span"}],[{"id":"id-51","style":{"width":"18%"},"width":344,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-3.png","element":"img"}],[{"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"square reward value function ","element":"span"},{"text":"of state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"under policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-4.png","element":"img","alt":" µ","inline":true},{"text":". 
On similar lines, we define the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"square reward action-value function ","element":"span"},{"text":"of state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"under policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-5.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"id":"id-54","style":{"width":"25%"},"width":456,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-6.png","element":"img"}],[{"text":"From the Bellman equation of ","element":"span"},{"style":{"height":16},"width":103.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-7.png","element":"img","alt":" Λµ(x)","inline":true},{"text":", proposed by Sobel ","element":"span"},{"href":"#id-19","referenceIndex":51,"text":"[51]","element":"a"},{"text":", it is straightforward to derive the following Bellman equations for ","element":"span"},{"style":{"height":16},"width":107.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-8.png","element":"img","alt":" U µ(x)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":157.59,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-9.png","element":"img","alt":" W µ(x, a)","inline":true},{"text":":","element":"span"}],[{"style":{"width":"98%"},"width":1778,"height":201,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-10.png","element":"img"}],[{"text":"Although ","element":"span"},{"style":{"height":11.6},"width":46.67,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-11.png","element":"img","alt":" Λµ","inline":true,"padRight":true},{"text":"of 
","element":"span"},{"href":"#id-51","text":"(1) ","element":"a"},{"text":"satisfies a Bellman equation, unfortunately, it lacks the monotonicity property of dynamic programming (DP), and thus, it is not clear how the related risk measures can be optimized by standard DP algorithms ","element":"span"},{"href":"#id-19","referenceIndex":51,"text":"[51]","element":"a"},{"text":". Policy gradient and actor-critic algorithms are good candidates to deal with this risk measure.","element":"span"}],[{"text":"We consider the following risk-sensitive measure for discounted MDPs: For a given ","element":"span"},{"style":{"height":11.6},"width":98.77,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-12.png","element":"img","alt":" α > 0","inline":true},{"text":",","element":"span"}],[{"id":"id-53","style":{"width":"69%"},"width":1266,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-13.png","element":"img"}],[{"text":"Assuming that there is at least one policy (in the class of parameterized policies that we consider) that satisfies the variance constraint above, it can be inferred from Theorem 3.8 of ","element":"span"},{"href":"#id-52","referenceIndex":1,"text":"[1] ","element":"a"},{"text":"that there exists an optimal policy that uses at most one randomization.","element":"span"}],[{"text":"It is important to note that the algorithms proposed in this paper can be used for any risk-sensitive measure that is based on the variance of the return such as","element":"span"}],[{"text":"1. 
","element":"span"},{"style":{"height":17.38},"width":208.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-14.png","element":"img","alt":" minθ Λθ(x0)","inline":true,"padRight":true},{"text":"subject to ","element":"span"},{"style":{"height":17.38},"width":200.91,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-15.png","element":"img","alt":"V θ(x0) ≥ α","inline":true},{"text":",","element":"span"}],[{"text":"2. ","element":"span"},{"text":"max","element":"span"},{"style":{"height":19.2},"width":360.63,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-16.png","element":"img","alt":"θ V θ(x0) − α�Λθ(x0","inline":true},{"text":")","element":"span"},{"text":",","element":"span"}],[{"text":"3. Maximizing the Sharpe Ratio, i.e., ","element":"span"},{"style":{"height":19.2},"width":398.24,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-17.png","element":"img","alt":" maxθ V θ(x0)/�Λθ(x0)","inline":true},{"text":". Sharpe Ratio (SR) is a popular risk measure in financial decision-making ","element":"span"},{"href":"#id-47","referenceIndex":49,"text":"[49]","element":"a"},{"text":". 
Section ","element":"span"},{"href":"#id-48","text":"5 ","element":"a"},{"text":"presents extensions of our proposed discounted reward algorithms to optimize the Sharpe ratio.","element":"span"}],[{"text":"To solve ","element":"span"},{"href":"#id-53","text":"(3)","element":"a"},{"text":", we employ the Lagrangian relaxation procedure ","element":"span"},{"href":"#id-32","referenceIndex":6,"text":"[6] ","element":"a"},{"text":"to convert it to the following unconstrained problem:","element":"span"}],[{"style":{"width":"73%"},"width":1327,"height":71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-18.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-19.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"is the Lagrange multiplier. The goal here is to find the saddle point of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-20.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":", i.e., a point ","element":"span"},{"style":{"height":16},"width":128.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-21.png","element":"img","alt":" (θ∗, λ∗)","inline":true,"padRight":true},{"text":"that satisfies","element":"span"}],[{"style":{"width":"45%"},"width":819,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-22.png","element":"img"}],[{"text":"For a standard convex optimization problem with mild regularity conditions, one can ensure the existence of a unique saddle point. 
Further, convergence to this point can be achieved by descending in ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-23.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and ascending in ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-24.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"using ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-25.png","element":"img","alt":" ∇θL(θ, λ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":173.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/5-26.png","element":"img","alt":" ∇λL(θ, λ)","inline":true},{"text":", respectively.","element":"span"}],[{"text":"However, we operate in a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"simulation optimization ","element":"span"},{"text":"setting, where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(i) ","element":"span"},{"text":"only sample estimates of the Lagrangian are observed; and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(ii) ","element":"span"},{"text":"the objective (Lagrangian) is not necessarily convex in ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-0.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"(or there is no unique saddle point). 
Hence, performing primal descent and dual ascent, one can only get to a local saddle point, i.e., a tuple ","element":"span"},{"style":{"height":16},"width":128.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-1.png","element":"img","alt":" (θ∗, λ∗)","inline":true,"padRight":true},{"text":"which is a local minimum w.r.t. ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-2.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and a local maximum w.r.t. ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-3.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"of the Lagrangian.","element":"span"}],[{"text":"In our setting, the necessary gradients of the Lagrangian are as follows:","element":"span"}],[{"style":{"width":"72%"},"width":1310,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-4.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":17.39},"width":785.96,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-5.png","element":"img","alt":" ∇θΛθ(x0) = ∇θU θ(x0) − 2V θ(x0)∇θV θ(x0)","inline":true},{"text":", in order to compute ","element":"span"},{"style":{"height":17.39},"width":169.1,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-6.png","element":"img","alt":" ∇θΛθ(x0)","inline":true,"padRight":true},{"text":"it would be enough to calculate ","element":"span"},{"style":{"height":17.39},"width":173.52,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-7.png","element":"img","alt":" ∇θV θ(x0)","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":17.39},"width":172.97,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-8.png","element":"img","alt":" ∇θU θ(x0)","inline":true},{"text":". Using the above definitions, we are now ready to derive the expressions for the gradient of ","element":"span"},{"style":{"height":17.38},"width":122.28,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-9.png","element":"img","alt":" V θ(x0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":121.74,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-10.png","element":"img","alt":" U θ(x0)","inline":true},{"text":", which in turn constitute the main ingredients in calculating ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-11.png","element":"img","alt":" ∇θL(θ, λ)","inline":true},{"text":".","element":"span"}],[{"id":"id-33","style":{"fontWeight":"bold"},"text":"Lemma 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under (A1) and (A2), we have","element":"span"}],[{"style":{"width":"101%"},"width":1844,"height":205,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":19.73},"width":147.18,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-13.png","element":"img","alt":"�dθγ(x|x0)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.73},"width":187.93,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-14.png","element":"img","alt":" �πθγ(x, a|x0)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are the ","element":"span"},{"style":{"height":16.99},"width":38.85,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-15.png","element":"img","alt":" γ2","inline":true},{"style":{"fontStyle":"italic"},"text":"-discounted visiting distributions of state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"under policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-16.png","element":"img","alt":" µ","inline":true},{"style":{"fontStyle":"italic"},"text":", respectively, and are defined as","element":"span"}],[{"style":{"width":"49%"},"width":906,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"The proof of ","element":"span"},{"style":{"height":17.38},"width":155.49,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-18.png","element":"img","alt":" ∇V θ(x0)","inline":true,"padRight":true},{"text":"is standard and can be found, for instance, in ","element":"span"},{"href":"#id-9","referenceIndex":43,"text":"[43]","element":"a"},{"text":". To prove ","element":"span"},{"style":{"height":17.38},"width":154.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-19.png","element":"img","alt":" ∇U θ(x0)","inline":true},{"text":", we start by the fact that from ","element":"span"},{"href":"#id-54","text":"(2) ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":16.78},"width":452.58,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-20.png","element":"img","alt":" U(x) = �a µ(x|a)W(x, a)","inline":true},{"text":". If we take the derivative w.r.t. 
","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-21.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"from both sides of this ","element":"span"},{"text":"equation and obtain","element":"span"}],[{"id":"id-55","style":{"width":"93%"},"width":1693,"height":841,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/6-22.png","element":"img"}],[{"text":"By unrolling the last equation using the definition of ","element":"span"},{"style":{"height":16},"width":119.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-0.png","element":"img","alt":" ∇U(x)","inline":true,"padRight":true},{"text":"from ","element":"span"},{"href":"#id-55","text":"(5)","element":"a"},{"text":", we obtain","element":"span"}],[{"style":{"width":"84%"},"width":1539,"height":663,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-1.png","element":"img"}],[{"text":"In ","element":"span"},{"href":"#id-56","referenceIndex":60,"text":"[60]","element":"a"},{"text":", a policy gradient result analogous to Lemma ","element":"span"},{"href":"#id-33","text":"1 ","element":"a"},{"text":"is provided for the value function in the case of fullstate representations. In the average reward setting, a similar result helps in extension to incorporate function approximation - see the actor-critic algorithms in ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":"3","element":"span"},{"text":". However, a similar approach is not viable for discounted setting and this motivates the use of stochastic optimization techniques like SPSA/SF (cf. ","element":"span"},{"href":"#id-57","referenceIndex":10,"text":"[10]","element":"a"},{"text":"). The problem is further complicated in the variance-constrained setting that we consider because:","element":"span"}],[{"text":"1. 
two different sampling distributions, ","element":"span"},{"style":{"height":19.72},"width":40.72,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-2.png","element":"img","alt":" πθγ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.72},"width":40.72,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-3.png","element":"img","alt":" �πθγ","inline":true},{"text":", are used for ","element":"span"},{"style":{"height":17.38},"width":155.49,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-4.png","element":"img","alt":" ∇V θ(x0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":154.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-5.png","element":"img","alt":" ∇U θ(x0)","inline":true},{"text":", and","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"height":17.38},"width":148.79,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-6.png","element":"img","alt":" ∇V θ(x′)","inline":true,"padRight":true},{"text":"appears in the second sum of the ","element":"span"},{"style":{"height":17.38},"width":154.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-7.png","element":"img","alt":" ∇U θ(x0)","inline":true,"padRight":true},{"text":"equation, which implies that we need to estimate the gradient of the value function ","element":"span"},{"style":{"height":13.39},"width":47.1,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-8.png","element":"img","alt":" V θ","inline":true,"padRight":true},{"text":"at every state of the MDP, and not just at the initial state ","element":"span"},{"style":{"height":13.39},"width":38.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-9.png","element":"img","alt":" 
x0","inline":true},{"text":".","element":"span"}],[{"text":"To alleviate the above mentioned problems, we borrow the principle of simultaneous perturbation for estimating the gradient ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-10.png","element":"img","alt":" ∇θL(θ, λ)","inline":true,"padRight":true},{"text":"and develop novel risk-sensitive actor-critic algorithms in the following section.","element":"span"}]]},{"heading":"4 Discounted Reward Risk-Sensitive Actor-Critic Algorithms","paragraphs":[[{"text":"In this section, we present actor-critic algorithms for optimizing the risk-sensitive measure ","element":"span"},{"href":"#id-53","text":"(3)","element":"a"},{"text":". These algorithms are based on two simultaneous perturbation methods: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"simultaneous perturbation stochastic approximation ","element":"span"},{"text":"(SPSA) and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"smoothed functional ","element":"span"},{"text":"(SF).","element":"span"}],[{"id":"id-73","style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Algorithm Structure","element":"span"}],[{"text":"For the purpose of finding an optimal risk-sensitive policy, a standard procedure would update the policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-11.png","element":"img","alt":"θ","inline":true,"padRight":true},{"text":"and Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-12.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"in two nested loops as follows:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"An inner 
loop that descends in ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-13.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"using the gradient of the Lagrangian ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-14.png","element":"img","alt":" L(θ, λ)","inline":true,"padRight":true},{"text":"w.r.t. ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-15.png","element":"img","alt":" θ","inline":true},{"text":", and","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"An outer loop that ascends in ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-16.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"using the gradient of the Lagrangian ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-17.png","element":"img","alt":" L(θ, λ)","inline":true,"padRight":true},{"text":"w.r.t. 
","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-18.png","element":"img","alt":" λ","inline":true},{"text":".","element":"span"}],[{"text":"Using two-timescale stochastic approximation ","element":"span"},{"href":"#id-58","referenceIndex":20,"text":"[20, ","element":"a"},{"text":"Chapter 6], the two loops above can run in parallel, as follows:","element":"span"}],[{"id":"id-60","style":{"width":"68%"},"width":1240,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/7-19.png","element":"img"}],[{"text":"In the above,","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":13.99},"width":49.89,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-0.png","element":"img","alt":" An","inline":true,"padRight":true},{"text":"is a positive definite matrix that fixes the order of the algorithm. For the first order methods, ","element":"span"},{"style":{"height":14.8},"width":170.46,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-1.png","element":"img","alt":" An = I (I","inline":true,"padRight":true},{"text":"is the identity matrix), while for the second order methods ","element":"span"},{"style":{"height":17.9},"width":326.48,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-2.png","element":"img","alt":" An → ∇2θL(θn, λn)","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":8.8},"width":125.92,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-3.png","element":"img","alt":" n → ∞","inline":true},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-4.png","element":"img","alt":" 
Γ","inline":true,"padRight":true},{"text":"is a projection operator that keeps the iterate ","element":"span"},{"style":{"height":13.19},"width":38.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-5.png","element":"img","alt":" θn","inline":true,"padRight":true},{"text":"stable by projecting onto a compact and convex set ","element":"span"},{"style":{"height":21.46},"width":399.31,"height":53.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-6.png","element":"img","alt":"Θ := �κ1i=1[θ(i)min, θ(i)max]","inline":true},{"text":". In particular, for any ","element":"span"},{"style":{"height":18.18},"width":815.4,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-7.png","element":"img","alt":" θ ∈ Rκ1, Γ(θ) = (Γ(1)(θ(1)), . . . , Γ(κ1)(θ(κ1)))T","inline":true,"padRight":true},{"text":", with ","element":"span"},{"style":{"height":21.24},"width":665.76,"height":53.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-8.png","element":"img","alt":"Γ(i)(θ(i)) := min(max(θ(i)min, θ(i)), θ(i)max)","inline":true},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":13.19},"width":43.9,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-9.png","element":"img","alt":" Γλ","inline":true,"padRight":true},{"text":"is a projection operator that keeps the Lagrange multiplier ","element":"span"},{"style":{"height":13.19},"width":43.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-10.png","element":"img","alt":" λn","inline":true,"padRight":true},{"text":"within the interval ","element":"span"},{"style":{"height":16},"width":143.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-11.png","element":"img","alt":" [0, λmax]","inline":true},{"text":", for some large positive constant 
","element":"span"},{"style":{"height":13.19},"width":177.18,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-12.png","element":"img","alt":" λmax < ∞","inline":true,"padRight":true},{"text":"and can be defined analogously to ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-13.png","element":"img","alt":" Γ","inline":true},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":16},"width":198.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-14.png","element":"img","alt":" ζ1(n), ζ2(n)","inline":true,"padRight":true},{"text":"are step-sizes selected such that the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-15.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"update is on the faster timescale and the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-16.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"update is on the slower timescale. 
Note that another timescale ","element":"span"},{"style":{"height":16},"width":90.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-17.png","element":"img","alt":" ζ3(n)","inline":true,"padRight":true},{"text":"that is the fastest is used for the TD-critic, which provides the estimate of the Lagrangian for a given ","element":"span"},{"style":{"height":16},"width":92.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-18.png","element":"img","alt":" (θ, λ)","inline":true},{"text":".","element":"span"}],[{"text":"We make the following assumptions on the step-size schedules:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(A3) ","element":"span"},{"text":"The step size schedules ","element":"span"},{"style":{"height":16},"width":280.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-19.png","element":"img","alt":" {ζ3(n)}, {ζ2(n)}","inline":true},{"text":", and ","element":"span"},{"style":{"height":16},"width":130.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-20.png","element":"img","alt":" {ζ1(n)}","inline":true,"padRight":true},{"text":"satisfy","element":"span"}],[{"id":"id-59","style":{"width":"69%"},"width":1262,"height":260,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-21.png","element":"img"}],[{"text":"Equations ","element":"span"},{"href":"#id-59","text":"8 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-59","text":"9 ","element":"a"},{"text":"are standard step-size conditions in stochastic approximation algorithms, and Equation ","element":"span"},{"href":"#id-59","text":"10 ","element":"a"},{"text":"ensures that the policy parameter update is on the faster time-scale ","element":"span"},{"style":{"height":16},"width":130.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-22.png","element":"img","alt":" 
{ζ2(n)}","inline":true},{"text":", and the Lagrange multiplier update is on the slower time-scale ","element":"span"},{"style":{"height":16},"width":130.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-23.png","element":"img","alt":" {ζ1(n)}","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Simulation optimization. ","element":"span"},{"text":"We operate in a setting where we only observe simulated rewards of the underlying MDP. Thus, we must estimate the mean and variance of the return (we use a TD-critic for this purpose) and then use these estimates to compute the gradient of the Lagrangian. The gradient ","element":"span"},{"style":{"height":16},"width":173.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-24.png","element":"img","alt":" ∇λL(θ, λ)","inline":true,"padRight":true},{"text":"has a particularly simple form of ","element":"span"},{"style":{"height":17.38},"width":215.1,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-25.png","element":"img","alt":" (Λθ(x0)−α)","inline":true},{"text":", suggesting the usage of sample variance constraints to perform the dual ascent for the Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-26.png","element":"img","alt":" λ","inline":true},{"text":". On the other hand, the expression for ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-27.png","element":"img","alt":" ∇θL(θ, λ)","inline":true,"padRight":true},{"text":"is complicated (see Lemma ","element":"span"},{"href":"#id-33","text":"1) ","element":"a"},{"text":"and warrants the usage of a simulation optimization scheme that can provide gradient estimates from sample observations. 
We employ simultaneous perturbation schemes for estimating the gradient (and in the case of second order methods, the Hessian) of the Lagrangian ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-28.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":". The idea in these methods is to estimate the gradients ","element":"span"},{"style":{"height":17.39},"width":173.52,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-29.png","element":"img","alt":" ∇θV θ(x0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.39},"width":172.97,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-30.png","element":"img","alt":" ∇θU θ(x0)","inline":true,"padRight":true},{"text":"(needed for estimating the gradient ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-31.png","element":"img","alt":" ∇θL(θ, λ)","inline":true},{"text":") using two simulated trajectories of the system corresponding to policies with parameters ","element":"span"},{"style":{"height":13.19},"width":38.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-32.png","element":"img","alt":" θn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.92},"width":228.56,"height":42.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-33.png","element":"img","alt":" θ+n = θn + pn","inline":true},{"text":". 
Here ","element":"span"},{"style":{"height":10},"width":40.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-34.png","element":"img","alt":" pn","inline":true,"padRight":true},{"text":"is a perturbation vector that is specific to the algorithm.","element":"span"}],[{"text":"Based on the order, our algorithms can be classified as:","element":"span"}],[{"text":"1. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"First order","element":"span"},{"text":": This corresponds to ","element":"span"},{"style":{"height":13.99},"width":134.14,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-35.png","element":"img","alt":" An = I","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-60","text":"(6)","element":"a"},{"text":". The proposed algorithms here include RS-SPSA-G and RS-SF-G, where the former estimates the gradient using SPSA, while the latter uses SF. These algorithms use the following choice for the perturbation vector: ","element":"span"},{"style":{"height":14.8},"width":174.96,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-36.png","element":"img","alt":" pn = β∆n","inline":true},{"text":". 
Here ","element":"span"},{"style":{"height":14.4},"width":100.01,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-37.png","element":"img","alt":" β > 0","inline":true,"padRight":true},{"text":"is a positive constant and ","element":"span"},{"style":{"height":13.99},"width":53.21,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-38.png","element":"img","alt":" ∆n","inline":true,"padRight":true},{"text":"is a perturbation random variable, i.e., a ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-39.png","element":"img","alt":" κ1","inline":true},{"text":"-vector of independent Rademacher (for SPSA) and Gaussian ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"text":"(for SF) random variables.","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Second order","element":"span"},{"text":": This corresponds to ","element":"span"},{"style":{"height":13.99},"width":49.89,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-40.png","element":"img","alt":" An","inline":true,"padRight":true},{"text":"which converges to ","element":"span"},{"style":{"height":17.38},"width":212.76,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-41.png","element":"img","alt":" ∇2L(θn, λn)","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":8.8},"width":132.55,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-42.png","element":"img","alt":" n → ∞","inline":true},{"text":". 
The proposed algorithms here include RS-SPSA-N and RS-SF-N, where the former uses SPSA for gradient/Hessian estimates and the latter employs SF for the same. These algorithms use the following choice for perturbation vector: For RS-SPSA-N, ","element":"span"},{"style":{"height":14.8},"width":450.6,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-43.png","element":"img","alt":" pn = β∆n + β �∆n, β > 0","inline":true,"padRight":true},{"text":"is a positive constant and ","element":"span"},{"style":{"height":13.99},"width":53.21,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-44.png","element":"img","alt":" ∆n","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.99},"width":53.21,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-45.png","element":"img","alt":"�∆n","inline":true,"padRight":true},{"text":"are perturbation parameters that are ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-46.png","element":"img","alt":" κ1","inline":true},{"text":"-vectors of independent Rademacher random variables, respectively. 
For RS-SF-N, ","element":"span"},{"style":{"height":14.8},"width":172.73,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-47.png","element":"img","alt":"pn = β∆n","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.99},"width":53.21,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-48.png","element":"img","alt":" ∆n","inline":true,"padRight":true},{"text":"is a ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/8-49.png","element":"img","alt":" κ1","inline":true,"padRight":true},{"text":"vector of Gaussian ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"text":"random variables.","element":"span"}],[{"id":"id-61","style":{"width":"73%"},"width":1325,"height":299,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-0.png","element":"img"}],[{"text":"Figure 1: The overall flow of our simultaneous perturbation based actor-critic algorithms.","element":"figcaption","subtype":"caption"}],[{"id":"id-62","style":{"width":"100%"},"width":1814,"height":695,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-1.png","element":"img"}],[{"text":"The overall flow of our proposed actor-critic algorithms is illustrated in Figure ","element":"span"},{"href":"#id-61","text":"1 ","element":"a"},{"text":"and Algorithm ","element":"span"},{"href":"#id-62","text":"1. 
","element":"a"},{"text":"The overall operation involves the following two loops: At each time instant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":",","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Inner Loop (Critic Update): ","element":"span"},{"text":"For a fixed policy (given as ","element":"span"},{"style":{"height":13.19},"width":38.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-2.png","element":"img","alt":" θn","inline":true},{"text":"), simulate two system trajectories, each of length","element":"span"}],[{"style":{"width":"94%"},"width":1714,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-3.png","element":"img"}],[{"style":{"height":17.38},"width":140.96,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-4.png","element":"img","alt":"�V θn(x0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.2},"width":144.75,"height":50.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-5.png","element":"img","alt":" �V θ+n (x0)","inline":true},{"text":", and square value functions ","element":"span"},{"style":{"height":17.38},"width":140.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-6.png","element":"img","alt":" �U θn(x0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.2},"width":144.2,"height":50.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-7.png","element":"img","alt":" �U θ+n (x0)","inline":true},{"text":", corresponding to the policy parameter ","element":"span"},{"style":{"height":13.19},"width":38.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-8.png","element":"img","alt":" θn","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":16.92},"width":44.81,"height":42.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-9.png","element":"img","alt":" θ+n","inline":true,"padRight":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Outer Loop (Actor Update): ","element":"span"},{"text":"Estimate the gradient/Hessian of ","element":"span"},{"style":{"height":17.38},"width":122.28,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-10.png","element":"img","alt":"�V θ(x0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":121.73,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-11.png","element":"img","alt":"�U θ(x0)","inline":true},{"text":", and hence the gradient/Hessian of the Lagrangian ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-12.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":", using either SPSA ","element":"span"},{"href":"#id-63","text":"(21) ","element":"a"},{"text":"or SF ","element":"span"},{"href":"#id-64","text":"(22) ","element":"a"},{"text":"methods. Using these estimates, update the policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-13.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"in the descent direction using either a gradient or a Newton decrement, and the Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-14.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"in the ascent direction.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Trajectory length ","element":"span"},{"style":{"height":9.19},"width":54.99,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-15.png","element":"img","alt":" mn","inline":true},{"style":{"fontStyle":"italic"},"text":") A simple setting is to have ","element":"span"},{"style":{"height":13.19},"width":188.32,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-16.png","element":"img","alt":" mn = Cnς","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a constant and ","element":"span"},{"style":{"height":12.4},"width":101.04,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-17.png","element":"img","alt":" ς > 0","inline":true},{"style":{"fontStyle":"italic"},"text":", i.e., have trajectories that increase in length as a function of outer loop index ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
A constant trajectory length","element":"span"}],[{"style":{"width":"99%"},"width":1811,"height":77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/9-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"value estimate is close enough to the true value.","element":"span"}],[{"text":"In the next section, we describe the TD-critic and subsequently, in Sections ","element":"span"},{"href":"#id-65","text":"4.3–","element":"a"},{"href":"#id-66","text":"4.4, ","element":"a"},{"text":"present the first and second order actor critic algorithms, respectively.","element":"span"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"TD-Critic","element":"span"}],[{"text":"In our actor-critic algorithms, the critic uses linear approximation for the value and square value functions, i.e., ","element":"span"},{"style":{"height":16},"width":275.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-0.png","element":"img","alt":"�V (x) ≈ vTφv(x)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":278.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-1.png","element":"img","alt":"�U(x) ≈ uTφu(x)","inline":true},{"text":", where the features ","element":"span"},{"style":{"height":16},"width":85.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-2.png","element":"img","alt":" φv(·)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":87.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-3.png","element":"img","alt":" φu(·)","inline":true,"padRight":true},{"text":"are from low-dimensional spaces 
","element":"span"},{"style":{"height":10.98},"width":61.44,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-4.png","element":"img","alt":"Rκ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.98},"width":61.44,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-5.png","element":"img","alt":" Rκ3","inline":true},{"text":", respectively. Let ","element":"span"},{"style":{"height":13.19},"width":44.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-6.png","element":"img","alt":" Φv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":47.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-7.png","element":"img","alt":" Φu","inline":true,"padRight":true},{"text":"denote ","element":"span"},{"style":{"height":16},"width":152.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-8.png","element":"img","alt":" |X| × κ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":152.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-9.png","element":"img","alt":" |X| × κ3","inline":true,"padRight":true},{"text":"dimensional matrices, whose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th columns are ","element":"span"},{"style":{"height":22.17},"width":714,"height":55.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-10.png","element":"img","alt":" φ(i)v =�φ(i)v (x), x ∈ X�T, i = 1, . . . , κ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.17},"width":714.01,"height":55.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-11.png","element":"img","alt":" φ(i)u =�φ(i)u (x), x ∈ X�T, i = 1, . . . 
, κ3","inline":true},{"text":". Let ","element":"span"},{"style":{"height":16},"width":368.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-12.png","element":"img","alt":"Sv := {Φvv | v ∈ Rκ2}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":375.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-13.png","element":"img","alt":" Su := {Φuu | u ∈ Rκ3}","inline":true},{"text":", denote the subspaces within which we approximate the value ","element":"span"},{"text":"and square value functions. We make the following standard assumption as in ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":":","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(A4) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The basis functions ","element":"span"},{"style":{"height":21.12},"width":153.51,"height":52.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-14.png","element":"img","alt":" {φ(i)v }κ2i=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":21.12},"width":153.51,"height":52.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-15.png","element":"img","alt":" {φ(i)u }κ3i=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are linearly independent. 
In particular, ","element":"span"},{"style":{"height":12.4},"width":185.38,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-16.png","element":"img","alt":" κ2, κ3 ≪ n","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.19},"width":44.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-17.png","element":"img","alt":" Φv","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.19},"width":47.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-18.png","element":"img","alt":"Φu","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are full rank. Moreover, for every ","element":"span"},{"style":{"height":11.78},"width":130.89,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-19.png","element":"img","alt":" v ∈ Rκ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":15.2},"width":296.78,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-20.png","element":"img","alt":" u ∈ Rκ3, Φvv ≠ e","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":15.2},"width":144.56,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-21.png","element":"img","alt":" Φuu ≠ e","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":"-dimensional vector with all entries equal to one.","element":"span"}],[{"text":"Let 
","element":"span"},{"style":{"height":13.19},"width":48.89,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-22.png","element":"img","alt":" Πu","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":45.89,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-23.png","element":"img","alt":" Πv","inline":true,"padRight":true},{"text":"be operators that project onto ","element":"span"},{"style":{"height":13.19},"width":40.44,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-24.png","element":"img","alt":" Sv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":43.44,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-25.png","element":"img","alt":" Su","inline":true},{"text":", respectively and as a consequence of the above assumption, can be defined as follows:","element":"span"}],[{"style":{"width":"78%"},"width":1422,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-26.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.39},"width":49.1,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-27.png","element":"img","alt":" Dθ","inline":true,"padRight":true},{"text":"is a diagonal ","element":"span"},{"style":{"height":16},"width":161.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-28.png","element":"img","alt":" |X| × |X|","inline":true,"padRight":true},{"text":"matrix with entries ","element":"span"},{"style":{"height":17.39},"width":103.54,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-29.png","element":"img","alt":" dθ(x),","inline":true,"padRight":true},{"text":"for each 
","element":"span"},{"style":{"height":11.6},"width":104.5,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-30.png","element":"img","alt":" x ∈ X","inline":true},{"text":".","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":17.38},"width":233.47,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-31.png","element":"img","alt":" T θ = [T θv ; T θu]","inline":true},{"text":", where ","element":"span"},{"style":{"height":17.32},"width":43.82,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-32.png","element":"img","alt":" T θv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.32},"width":43.82,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-33.png","element":"img","alt":" T θu","inline":true,"padRight":true},{"text":"denote the Bellman operators for value and square value functions of the ","element":"span"},{"text":"policy governed by parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-34.png","element":"img","alt":" θ","inline":true},{"text":", respectively. 
These operators are defined as follows: for any ","element":"span"},{"style":{"height":17.38},"width":167.65,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-35.png","element":"img","alt":" y ∈ R2|X|","inline":true},{"text":", let ","element":"span"},{"style":{"height":10},"width":35.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-36.png","element":"img","alt":" yv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10},"width":38.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-37.png","element":"img","alt":" yu","inline":true,"padRight":true},{"text":"denote the first and last ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|X| ","element":"span"},{"text":"entries, respectively. Then","element":"span"}],[{"style":{"width":"99%"},"width":1812,"height":428,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-38.png","element":"img"}],[{"text":"We now claim that the projected Bellman operator ","element":"span"},{"style":{"height":10.8},"width":58.89,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-39.png","element":"img","alt":" ΠT","inline":true,"padRight":true},{"text":"is a contraction mapping w.r.t. the ","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-40.png","element":"img","alt":" ν","inline":true},{"text":"-weighted norm, for any policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-41.png","element":"img","alt":" θ","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under (A2) and (A4), there exists a ","element":"span"},{"style":{"height":16},"width":159.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-42.png","element":"img","alt":" ν ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.4},"width":95.98,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-43.png","element":"img","alt":" ¯γ < 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that","element":"span"}],[{"style":{"width":"41%"},"width":751,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-44.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"First, it is well-known that ","element":"span"},{"style":{"height":17.32},"width":92.64,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-45.png","element":"img","alt":" ΠvT θv","inline":true,"padRight":true},{"text":"is a contraction mapping (cf. Lemma 6 in ","element":"span"},{"href":"#id-67","referenceIndex":65,"text":"[65]","element":"a"},{"text":"). 
This can be inferred as ","element":"span"},{"text":"follows: For any ","element":"span"},{"style":{"height":17.38},"width":198.4,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-46.png","element":"img","alt":" y, ¯y ∈ R2|X|","inline":true},{"text":",","element":"span"}],[{"style":{"width":"30%"},"width":559,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-47.png","element":"img"}],[{"text":"We have used the fact that ","element":"span"},{"style":{"height":17.39},"width":315.4,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-48.png","element":"img","alt":" ∥P θv∥Dθ ≤ ∥v∥Dθ","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":14.99},"width":149.54,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-49.png","element":"img","alt":" v ∈ R|X|","inline":true,"padRight":true},{"text":"(For a proof, see Lemma 1 in ","element":"span"},{"href":"#id-67","referenceIndex":65,"text":"[65]","element":"a"},{"text":"). 
The claim that ","element":"span"},{"style":{"height":17.32},"width":92.64,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-50.png","element":"img","alt":" ΠvT θv","inline":true,"padRight":true},{"text":"is a contraction mapping now follows from the fact that the projection operator ","element":"span"},{"style":{"height":13.19},"width":45.89,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-51.png","element":"img","alt":" Πv","inline":true,"padRight":true},{"text":"is non-expansive.","element":"span"}],[{"style":{"width":"96%"},"width":1752,"height":300,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/10-52.png","element":"img"}],[{"text":"The first inequality above follows from the aforementioned facts that ","element":"span"},{"style":{"height":13.38},"width":46.11,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-0.png","element":"img","alt":" P θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":48.89,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-1.png","element":"img","alt":" Πu","inline":true,"padRight":true},{"text":"are non-expansive. The second inequality follows by using the equivalence of norms (cf. the justification for Eq. (7) in the proof of Lemma 7 in ","element":"span"},{"href":"#id-68","referenceIndex":63,"text":"[63]","element":"a"},{"text":").","element":"span"}],[{"style":{"width":"78%"},"width":1421,"height":418,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-2.png","element":"img"}],[{"text":"The claim follows by setting ","element":"span"},{"style":{"height":13.2},"width":736.19,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-3.png","element":"img","alt":" ¯γ = γ + ϵ. 
■","inline":true}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":180.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-4.png","element":"img","alt":" [Φv¯v; Φu¯u]","inline":true,"padRight":true},{"text":"denote the unique fixed point of the projected Bellman operator ","element":"span"},{"style":{"height":10.8},"width":58.89,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-5.png","element":"img","alt":" ΠT","inline":true},{"text":", i.e.,","element":"span"}],[{"id":"id-71","style":{"width":"72%"},"width":1311,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.19},"width":45.89,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-7.png","element":"img","alt":" Πv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":48.89,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-8.png","element":"img","alt":" Πu","inline":true,"padRight":true},{"text":"project onto the linear spaces spanned by the columns of ","element":"span"},{"style":{"height":13.19},"width":44.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-9.png","element":"img","alt":" Φv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":47.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-10.png","element":"img","alt":" Φu","inline":true},{"text":", respectively.","element":"span"}],[{"text":"We now describe the TD algorithm that updates the critic parameters corresponding to the value and square value functions (note that we require critic estimates for both the unperturbed and the perturbed policy parameters). 
This algorithm is an extension of the algorithm proposed by ","element":"span"},{"href":"#id-68","referenceIndex":63,"text":"[63] ","element":"a"},{"text":"to the discounted setting. Recall from Algorithm ","element":"span"},{"href":"#id-62","text":"1 ","element":"a"},{"text":"that, at any instant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", the TD-critic runs two ","element":"span"},{"style":{"height":9.19},"width":54.99,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-11.png","element":"img","alt":" mn","inline":true,"padRight":true},{"text":"length trajectories corresponding to policy parameters ","element":"span"},{"style":{"height":13.19},"width":38.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-12.png","element":"img","alt":" θn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.99},"width":161.54,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-13.png","element":"img","alt":" θn + δ∆n","inline":true},{"text":". 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Critic Update: ","element":"span"},{"text":"Calculate the temporal difference (TD)-errors ","element":"span"},{"style":{"height":16.93},"width":111.41,"height":42.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-14.png","element":"img","alt":" δm, δ+m","inline":true,"padRight":true},{"text":"for the value and ","element":"span"},{"style":{"height":16.93},"width":108.33,"height":42.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-15.png","element":"img","alt":" ϵm, ϵ+m","inline":true,"padRight":true},{"text":"for the square ","element":"span"},{"text":"value functions using ","element":"span"},{"href":"#id-69","text":"(19)","element":"a"},{"text":", and update the critic parameters ","element":"span"},{"style":{"height":16.93},"width":114.61,"height":42.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-16.png","element":"img","alt":" vm, v+m","inline":true,"padRight":true},{"text":"for the value and ","element":"span"},{"style":{"height":16.93},"width":121.61,"height":42.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-17.png","element":"img","alt":" um, u+m","inline":true,"padRight":true},{"text":"for the square value ","element":"span"},{"text":"functions as follows:","element":"span"}],[{"id":"id-69","style":{"width":"99%"},"width":1812,"height":381,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"δ_m = R(x_m, a_m) + γ v_m^⊤ φ_v(x_{m+1}) − v_m^⊤ φ_v(x_m), ","element":"span"},{"text":"(19) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ϵ_m = R(x_m, a_m)^2 + 2γ R(x_m, a_m) v_m^⊤ φ_v(x_{m+1}) + γ^2 u_m^⊤ φ_u(x_{m+1}) − u_m^⊤ φ_u(x_m),","element":"span"}],[{"style":{"width":"10%"},"width":184,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-25.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"δ_m^+ = R(x_m^+, a_m^+) + γ (v_m^+)^⊤ φ_v(x_{m+1}^+) − (v_m^+)^⊤ φ_v(x_m^+), ","element":"span"},{"text":"(20) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ϵ_m^+ = R(x_m^+, a_m^+)^2 + 2γ R(x_m^+, a_m^+) (v_m^+)^⊤ φ_v(x_{m+1}^+) + γ^2 (u_m^+)^⊤ φ_u(x_{m+1}^+) − (u_m^+)^⊤ φ_u(x_m^+).","element":"span"}],[{"text":"Note that the TD-error ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-37.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"for the square value function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"comes directly from its Bellman equation ","element":"span"},{"href":"#id-54","text":"(2)","element":"a"},{"text":". 
Theorem ","element":"span"},{"href":"#id-70","text":"6 ","element":"a"},{"text":"in Section ","element":"span"},{"text":"7 ","element":"span"},{"text":"establishes that the critic parameters ","element":"span"},{"style":{"height":16},"width":134.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-38.png","element":"img","alt":" (vn, un)","inline":true,"padRight":true},{"text":"governed by ","element":"span"},{"href":"#id-69","text":"(17) ","element":"a"},{"text":"converge to the solutions ","element":"span"},{"style":{"height":16},"width":92.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-39.png","element":"img","alt":" (¯v, ¯u)","inline":true,"padRight":true},{"text":"of the fixed point equation ","element":"span"},{"href":"#id-71","text":"(16)","element":"a"},{"text":".","element":"span"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"4.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"First-Order Algorithms: RS-SPSA-G and RS-SF-G","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"SPSA","element":"span"},{"text":"-based estimate for ","element":"span"},{"style":{"height":17.38},"width":155.49,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-40.png","element":"img","alt":" ∇V θ(x0)","inline":true},{"text":", and similarly for ","element":"span"},{"style":{"height":17.38},"width":154.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-41.png","element":"img","alt":" ∇U θ(x0)","inline":true},{"text":", is given by","element":"span"}],[{"id":"id-63","style":{"width":"78%"},"width":1423,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/11-42.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-0.png","element":"img","alt":" 
∆","inline":true,"padRight":true},{"text":"is a vector of independent Rademacher random variables. The advantage of this estimator is that it perturbs all directions at the same time (the numerator is identical in all ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-1.png","element":"img","alt":" κ1","inline":true,"padRight":true},{"text":"components). So, the number of function measurements needed for this estimator is always two, independent of the dimension ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-2.png","element":"img","alt":" κ1","inline":true},{"text":". However, unlike the SPSA estimates in ","element":"span"},{"href":"#id-34","referenceIndex":52,"text":"[52] ","element":"a"},{"text":"that use two-sided balanced estimates (simulations with parameters ","element":"span"},{"style":{"height":14.8},"width":110.82,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-3.png","element":"img","alt":" θ−β∆","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.8},"width":110.82,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-4.png","element":"img","alt":" θ+β∆","inline":true},{"text":"), our gradient estimates are one-sided (simulations with parameters ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-5.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.8},"width":126.26,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-6.png","element":"img","alt":" θ + β∆","inline":true},{"text":") and resemble those in 
","element":"span"},{"href":"#id-72","referenceIndex":23,"text":"[23]","element":"a"},{"text":". The use of one-sided estimates is primarily because the updates of the Lagrangian parameter ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-7.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"require a simulation with the running parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-8.png","element":"img","alt":" θ","inline":true},{"text":". Using a balanced gradient estimate would therefore come at the cost of an additional simulation (the resulting procedure would then require three simulations), which we avoid by using one-sided gradient estimates.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"SF","element":"span"},{"text":"-based method estimates not the gradient of a function ","element":"span"},{"style":{"height":16},"width":87.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-9.png","element":"img","alt":" H(θ)","inline":true,"padRight":true},{"text":"itself, but rather the convolution of ","element":"span"},{"style":{"height":16},"width":120.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-10.png","element":"img","alt":" ∇H(θ)","inline":true,"padRight":true},{"text":"with the Gaussian density function ","element":"span"},{"style":{"height":17.38},"width":176.68,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-11.png","element":"img","alt":" N(0, β2I)","inline":true},{"text":", i.e.,","element":"span"}],[{"style":{"width":"56%"},"width":1024,"height":195,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-12.png","element":"img"}],[{"text":"where 
","element":"span"},{"style":{"height":15.59},"width":41.7,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-13.png","element":"img","alt":" Gβ","inline":true,"padRight":true},{"text":"is a ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-14.png","element":"img","alt":" κ1","inline":true},{"text":"-dimensional p.d.f. The first equality above follows by using integration by parts and the second one by using the fact that ","element":"span"},{"style":{"height":19.9},"width":360.24,"height":49.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-15.png","element":"img","alt":" ∇zGβ(z) = −zβ2 Gβ(z)","inline":true,"padRight":true},{"text":"and by substituting ","element":"span"},{"style":{"height":16},"width":159.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-16.png","element":"img","alt":" z′ = z/β","inline":true},{"text":". As ","element":"span"},{"style":{"height":14.4},"width":117.86,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-17.png","element":"img","alt":" β → 0","inline":true},{"text":", it can be seen that ","element":"span"},{"style":{"height":16.79},"width":137.65,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-18.png","element":"img","alt":"CβH(θ)","inline":true,"padRight":true},{"text":"converges to ","element":"span"},{"style":{"height":16},"width":138.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-19.png","element":"img","alt":" ∇θH(θ)","inline":true,"padRight":true},{"text":"(see Chapter 6 of ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16]","element":"a"},{"text":"). 
Thus, a one-sided SF estimate of ","element":"span"},{"style":{"height":17.38},"width":155.49,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-20.png","element":"img","alt":" ∇V θ(x0)","inline":true,"padRight":true},{"text":"is given by","element":"span"}],[{"id":"id-64","style":{"width":"82%"},"width":1489,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-21.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-22.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"text":"is a vector of independent Gaussian ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"text":"random variables.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Actor Update: ","element":"span"},{"text":"Estimate the gradients ","element":"span"},{"style":{"height":17.39},"width":155.49,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-23.png","element":"img","alt":" ∇V θ(x0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.39},"width":154.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-24.png","element":"img","alt":" ∇U θ(x0)","inline":true,"padRight":true},{"text":"using SPSA ","element":"span"},{"href":"#id-63","text":"(21) ","element":"a"},{"text":"or SF ","element":"span"},{"href":"#id-64","text":"(22) ","element":"a"},{"text":"and update the policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-25.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as 
follows","element":"span"},{"text":"4","element":"span"},{"text":": For ","element":"span"},{"style":{"height":14},"width":214.3,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-26.png","element":"img","alt":" i = 1, . . . , κ1","inline":true},{"text":",","element":"span"}],[{"id":"id-92","style":{"width":"97%"},"width":1762,"height":392,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-27.png","element":"img"}],[{"text":"For both the SPSA and SF variants, the Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-28.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"is updated as follows:","element":"span"}],[{"style":{"width":"78%"},"width":1427,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-29.png","element":"img"}],[{"text":"In the above, note the following:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1) ","element":"span"},{"style":{"height":14.4},"width":97.78,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-30.png","element":"img","alt":" β > 0","inline":true,"padRight":true},{"text":"is a small fixed constant and ","element":"span"},{"style":{"height":18.54},"width":68.94,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-31.png","element":"img","alt":" ∆(i)n","inline":true,"padRight":true},{"text":"’s are independent Rademacher and Gaussian ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"text":"random variables ","element":"span"},{"text":"in the SPSA and SF updates, respectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2) ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-32.png","element":"img","alt":" Γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":43.9,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-33.png","element":"img","alt":" Γλ","inline":true,"padRight":true},{"text":"are the projection operators defined in Section ","element":"span"},{"href":"#id-73","text":"4.1","element":"a"},{"text":". They keep the iterates ","element":"span"},{"style":{"height":16},"width":134.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/12-34.png","element":"img","alt":" (θn, λn)","inline":true,"padRight":true},{"text":"stable and hence ensure convergence of the algorithms.","element":"span"}],[{"text":"We provide a proof of convergence of the first-order SPSA and SF algorithms to a tuple ","element":"span"},{"style":{"height":18.6},"width":148.14,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-0.png","element":"img","alt":" (θλ∗, λ∗)","inline":true},{"text":", which is a (local) saddle point of the risk-sensitive objective function ","element":"span"},{"style":{"height":17.38},"width":637.39,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-1.png","element":"img","alt":"L̂(θ, λ) ≜ −V̂θ(x0) + λ(Λ̂θ(x0) − α)","inline":true},{"text":". 
Further, the limit ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-2.png","element":"img","alt":" θλ∗","inline":true,"padRight":true},{"text":"satisfies the variance constraint, i.e., ","element":"span"},{"style":{"height":21.81},"width":237.58,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-3.png","element":"img","alt":"�Λθλ∗(x0) ≤ α","inline":true},{"text":". See Theorems ","element":"span"},{"href":"#id-74","text":"7–","element":"a"},{"href":"#id-75","text":"9 ","element":"a"},{"text":"and Proposition ","element":"span"},{"href":"#id-76","text":"1 ","element":"a"},{"text":"in Section ","element":"span"},{"text":"7 ","element":"span"},{"text":"for details.","element":"span"}],[{"style":{"width":"0%"},"width":4,"height":2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Remark 4. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"(On the bias in gradient estimates) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Recall that ","element":"span"},{"style":{"height":16},"width":83.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-5.png","element":"img","alt":"�V (θ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the approximate value function for policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-6.png","element":"img","alt":" θ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Using a Taylor expansion of ","element":"span"},{"style":{"height":16},"width":74.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-7.png","element":"img","alt":"�V (·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"around ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-8.png","element":"img","alt":" θ","inline":true},{"style":{"fontStyle":"italic"},"text":", we obtain:","element":"span"}],[{"style":{"width":"58%"},"width":1062,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Assuming a uniform upper bound ","element":"span"},{"style":{"height":13.19},"width":44.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-10.png","element":"img","alt":" C2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on ","element":"span"},{"style":{"height":17.38},"width":125.76,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-11.png","element":"img","alt":" ∇2 �V (·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and noting that the components of ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-12.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are Rademacher, we obtain","element":"span"}],[{"style":{"width":"75%"},"width":1374,"height":345,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Using similar arguments as above, one can conclude 
that","element":"span"}],[{"style":{"width":"50%"},"width":919,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-14.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":13.19},"width":44.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-15.png","element":"img","alt":" C3","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"upper bounds ","element":"span"},{"style":{"height":17.38},"width":125.21,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-16.png","element":"img","alt":" ∇2 �U(·)","inline":true},{"style":{"fontStyle":"italic"},"text":". From the foregoing, the gradient expression for the Lagrangian, and the fact that the value function is bounded above (since we operate in a finite state-action space), it follows that the bias of the one-sided SPSA estimate of the gradient of the Lagrangian is ","element":"span"},{"style":{"height":16},"width":87.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-17.png","element":"img","alt":" O(β)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Later (in Theorem ","element":"span"},{"href":"#id-74","style":{"fontStyle":"italic"},"text":"7) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"we establish that the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-18.png","element":"img","alt":" θ","inline":true},{"style":{"fontStyle":"italic"},"text":"-recursion converges to an ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-19.png","element":"img","alt":" ϵ","inline":true},{"style":{"fontStyle":"italic"},"text":"-neighborhood of the set of local minima of the Lagrangian, provided ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-20.png","element":"img","alt":" β","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is small enough.","element":"span"}],[{"id":"id-48","style":{"width":"75%"},"width":1360,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"The actor recursions for the variants of the RS-SPSA-G and RS-SF-G algorithms that optimize the SR objective are as follows:","element":"span"}],[{"style":{"width":"96%"},"width":1751,"height":365,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/13-22.png","element":"img"}],[{"style":{"width":"96%"},"width":1751,"height":362,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Note that only the actor recursion changes for SR optimization, while the rest of the updates that include the critic recursions for nominal and perturbed parameters remain the same as before in the SPSA and SF based algorithms. 
Further, SR optimization does not involve the Lagrange parameter ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-1.png","element":"img","alt":" λ","inline":true},{"style":{"fontStyle":"italic"},"text":", and thus, the proposed actor-critic algorithms are two time-scale (instead of three time-scale as in the described algorithms) stochastic approximation algorithms in this case.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 6. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"(One-simulation SR variant.) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For the SR objective, the proposed algorithms can be modified to work with only one simulated trajectory of the system. This is because in the SR case, we do not require the Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-2.png","element":"img","alt":" λ","inline":true},{"style":{"fontStyle":"italic"},"text":", and thus, the simulated trajectory corresponding to the nominal policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-3.png","element":"img","alt":" θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is not necessary. 
In this implementation, the gradient is estimated as ","element":"span"},{"style":{"height":18.18},"width":490.58,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-4.png","element":"img","alt":" ∇iS(θ) ≈ S(θ + β∆)/β∆(i)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for SPSA and as ","element":"span"},{"style":{"height":18.18},"width":508.68,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-5.png","element":"img","alt":"∇iS(θ) ≈ (∆(i)/β)S(θ + β∆)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for SF.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 7. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"(Monte-Carlo Critic) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In the above algorithms, the critic uses a TD method to evaluate the policies. These algorithms can be implemented with a Monte-Carlo critic that at each time instant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"computes a sample average of the total discounted rewards corresponding to the nominal ","element":"span"},{"style":{"height":13.19},"width":38.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-6.png","element":"img","alt":" θn","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and perturbed ","element":"span"},{"style":{"height":14.8},"width":149.97,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-7.png","element":"img","alt":" θn+β∆n","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"policy parameter. 
This implementation would be similar to that in ","element":"span"},{"href":"#id-30","referenceIndex":62,"style":{"fontStyle":"italic"},"text":"[62]","element":"a"},{"style":{"fontStyle":"italic"},"text":", except here we use simultaneous perturbation methods to estimate the gradient.","element":"span"}],[{"id":"id-66","style":{"fontWeight":"bold"},"text":"4.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Second-Order Algorithms: RS-SPSA-N and RS-SF-N","element":"span"}],[{"text":"Recall from Section ","element":"span"},{"href":"#id-73","text":"4.1 ","element":"a"},{"text":"that a second-order scheme updates the policy parameter in the following manner:","element":"span"}],[{"style":{"width":"70%"},"width":1287,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-8.png","element":"img"}],[{"text":"From the above, it is evident that for any second-order method, an estimate of the Hessian ","element":"span"},{"style":{"height":17.9},"width":170.63,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-9.png","element":"img","alt":" ∇2θL(θ, λ)","inline":true,"padRight":true},{"text":"of the ","element":"span"},{"text":"Lagrangian is necessary, in addition to an estimate of the gradient ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-10.png","element":"img","alt":" ∇θL(θ, λ)","inline":true},{"text":". As in the case of the gradient based schemes outlined earlier, we employ the simultaneous perturbation technique to develop these estimates. The first algorithm, henceforth referred to as RS-SPSA-N, uses SPSA for the gradient/Hessian estimates. On the other hand, the second algorithm, henceforth referred to as RS-SF-N, uses a smoothed functional (SF) approach for the gradient/Hessian estimates. 
As confirmed by our numerical experiments, second-order methods are in general more accurate, though at the cost of inverting the Hessian matrix in each step.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"RS-SPSA-N Algorithm","element":"span"}],[{"text":"The Hessian w.r.t. ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-11.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-12.png","element":"img","alt":" L(θ, λ)","inline":true,"padRight":true},{"text":"can be written as follows:","element":"span"}],[{"style":{"width":"90%"},"width":1649,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-13.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Critic Update: ","element":"span"},{"text":"As in the case of the gradient-based schemes, we run two simulations. 
However, the perturbed simulation here corresponds to the policy parameter ","element":"span"},{"style":{"height":16},"width":242.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-14.png","element":"img","alt":" θ + β(∆ + �∆)","inline":true},{"text":", where ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-15.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-16.png","element":"img","alt":"�∆","inline":true,"padRight":true},{"text":"are ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-17.png","element":"img","alt":"κ1","inline":true},{"text":"-dimensional vectors of independent Rademacher random variables. 
The critic parameters ","element":"span"},{"style":{"height":10},"width":101.53,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-18.png","element":"img","alt":" vn, un","inline":true,"padRight":true},{"text":"from the unperturbed simulation and ","element":"span"},{"style":{"height":16.93},"width":112.72,"height":42.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/14-19.png","element":"img","alt":"v+n , u+n","inline":true,"padRight":true},{"text":"from the perturbed simulation are updated as described earlier in Section ","element":"span"},{"href":"#id-77","text":"4.2.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Gradient and Hessian Estimates: ","element":"span"},{"text":"Using an SPSA-based estimation technique (see Chapter 7 of ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16]","element":"a"},{"text":"), the gradient and Hessian of the value function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":", and similarly of the square value function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":", are estimated as follows: For ","element":"span"},{"style":{"height":14},"width":227.18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-0.png","element":"img","alt":" i = 1, . . . , κ1,","inline":true}],[{"style":{"width":"62%"},"width":1142,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-1.png","element":"img"}],[{"text":"The correctness of the above estimates in the limit as ","element":"span"},{"style":{"height":14.4},"width":116.22,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-2.png","element":"img","alt":" β → 0","inline":true,"padRight":true},{"text":"can be inferred from Lemma ","element":"span"},{"href":"#id-78","text":"11 ","element":"a"},{"text":"in the Appendix. 
The main idea is to apply suitable Taylor expansions and observe that the bias terms vanish because the components of ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-3.png","element":"img","alt":" ∆","inline":true},{"text":", being Rademacher, are zero-mean. As in the case of RS-SPSA, this is a one-sided estimate with the unperturbed simulation required for updating the Lagrange multiplier.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Hessian Update: ","element":"span"},{"text":"Using the critic values from the two simulations, we estimate the Hessian ","element":"span"},{"style":{"height":17.9},"width":170.62,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-4.png","element":"img","alt":" ∇2θL(θ, λ)","inline":true,"padRight":true},{"text":"as follows: ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":18.54},"width":96.36,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-5.png","element":"img","alt":" H(i,j)n","inline":true,"padRight":true},{"text":"denote the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"th estimate of the ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j","element":"span"},{"text":")","element":"span"},{"text":"th element of the Hessian. Then, for ","element":"span"},{"style":{"height":14},"width":256.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-6.png","element":"img","alt":" i, j = 1, . . . 
, κ1","inline":true},{"text":", with ","element":"span"},{"style":{"height":13.6},"width":89.56,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-7.png","element":"img","alt":" i ≤ j","inline":true},{"text":", the update is","element":"span"}],[{"id":"id-108","style":{"width":"99%"},"width":1811,"height":395,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-8.png","element":"img"}],[{"text":"The last condition above ensures that the Hessian update proceeds on a faster timescale in comparison to the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-9.png","element":"img","alt":" θ","inline":true},{"text":"-recursion (see ","element":"span"},{"href":"#id-79","text":"(31) ","element":"a"},{"text":"below). Finally, we set ","element":"span"},{"style":{"height":23.52},"width":414.38,"height":58.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-10.png","element":"img","alt":" Hn+1 = Υ�[H(i,j)n+1]|κ1|i,j=1�","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":73.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-11.png","element":"img","alt":" Υ(·)","inline":true,"padRight":true},{"text":"denotes an operator that projects a square matrix onto the set of symmetric and positive definite matrices. 
This projection is a standard requirement to ensure convergence of ","element":"span"},{"style":{"height":13.19},"width":53.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-12.png","element":"img","alt":" Hn","inline":true,"padRight":true},{"text":"to the Hessian ","element":"span"},{"style":{"height":17.9},"width":170.63,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-13.png","element":"img","alt":" ∇2θL(θ, λ)","inline":true,"padRight":true},{"text":"and we state the following standard assumption (cf. ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16, ","element":"a"},{"text":"Chapter 7]) on this operator:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(A5) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any sequence of matrices ","element":"span"},{"style":{"height":16},"width":91.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-14.png","element":"img","alt":" {An}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":91.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-15.png","element":"img","alt":" {Bn}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"style":{"height":11.78},"width":125.55,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-16.png","element":"img","alt":" Rκ1×κ1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":21.96},"width":381.23,"height":54.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-17.png","element":"img","alt":" limn→∞ ∥ An − Bn ∥ = 0","inline":true},{"style":{"fontStyle":"italic"},"text":", the 
","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-18.png","element":"img","alt":" Υ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfies ","element":"span"},{"style":{"height":21.96},"width":504.52,"height":54.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-19.png","element":"img","alt":" limn→∞ ∥ Υ(An) − Υ(Bn) ∥ = 0","inline":true},{"style":{"fontStyle":"italic"},"text":". Further, for any sequence of matrices ","element":"span"},{"style":{"height":16},"width":90.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-20.png","element":"img","alt":" {Cn}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"style":{"height":11.79},"width":125.54,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-21.png","element":"img","alt":" Rκ1×κ1","inline":true},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"style":{"width":"69%"},"width":1259,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-22.png","element":"img"}],[{"text":"As suggested in ","element":"span"},{"href":"#id-80","referenceIndex":29,"text":"[29]","element":"a"},{"text":", a possible definition of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-23.png","element":"img","alt":" Υ","inline":true,"padRight":true},{"text":"is to perform an eigen-decomposition of ","element":"span"},{"style":{"height":13.19},"width":53.13,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-24.png","element":"img","alt":" Hn","inline":true,"padRight":true},{"text":"and then make all eigenvalues positive. 
This avoids singularity of ","element":"span"},{"style":{"height":13.19},"width":53.13,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-25.png","element":"img","alt":" Hn","inline":true,"padRight":true},{"text":"and also satisfies the above assumption. In our experiments, we use this scheme for projecting ","element":"span"},{"style":{"height":13.19},"width":53.12,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-26.png","element":"img","alt":" Hn","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Actor Update: ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":17.32},"width":202.18,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-27.png","element":"img","alt":" Mn △= H−1n","inline":true,"padRight":true},{"text":"denote the inverse of the Hessian estimate ","element":"span"},{"style":{"height":13.19},"width":53.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-28.png","element":"img","alt":" Hn","inline":true},{"text":". 
We incorporate a Newton decrement to update the policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-29.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"id":"id-79","style":{"width":"98%"},"width":1782,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-30.png","element":"img"}],[{"text":"In the long run, ","element":"span"},{"style":{"height":13.19},"width":58.66,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-31.png","element":"img","alt":" Mn","inline":true,"padRight":true},{"text":"converges to ","element":"span"},{"style":{"height":17.9},"width":211.03,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-32.png","element":"img","alt":" ∇2θL(θ, λ)−1","inline":true},{"text":", while the last term in the brackets in ","element":"span"},{"href":"#id-79","text":"(31) ","element":"a"},{"text":"converges to ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-33.png","element":"img","alt":" ∇θL(θ, λ)","inline":true,"padRight":true},{"text":"and hence, the update ","element":"span"},{"href":"#id-79","text":"(31) ","element":"a"},{"text":"can be seen to descend in ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/15-34.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"using a Newton decrement. 
Note that the Lagrange multiplier update here is the same as that in RS-SPSA-G.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"RS-SF-N Algorithm","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gradient and Hessian Estimates: ","element":"span"},{"text":"While the gradient estimate here is the same as that in the RS-SF-G algorithm, the Hessian is estimated as follows: Recall that ","element":"span"},{"style":{"height":20.7},"width":412.9,"height":51.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-0.png","element":"img","alt":" ∆ =�∆(1), . . . , ∆(κ1)�T","inline":true,"padRight":true},{"text":"is a vector of mutually independent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"text":"random variables. Let ","element":"span"},{"style":{"height":17.63},"width":101.06,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-1.png","element":"img","alt":"¯H(∆)","inline":true,"padRight":true},{"text":"be a ","element":"span"},{"style":{"height":10.39},"width":128.5,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-2.png","element":"img","alt":" κ1 × κ1","inline":true,"padRight":true},{"text":"matrix defined as","element":"span"}],[{"id":"id-82","style":{"width":"99%"},"width":1812,"height":397,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-3.png","element":"img"}],[{"text":"The correctness of the above estimate in the limit as ","element":"span"},{"style":{"height":14.4},"width":114.6,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-4.png","element":"img","alt":" β → 0","inline":true,"padRight":true},{"text":"can be seen from Lemma ","element":"span"},{"href":"#id-81","text":"12 
","element":"a"},{"text":"in the Appendix. The main idea involves convolving the Hessian with a Gaussian density function (similar to RS-SF) and then performing integration by parts twice.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Critic Update: ","element":"span"},{"text":"As in the case of the RS-SF-G algorithm, we run two simulations with unperturbed and perturbed policy parameters, respectively. Recall that the perturbed simulation corresponds to the policy parameter ","element":"span"},{"style":{"height":14.8},"width":126.24,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-5.png","element":"img","alt":" θ + β∆","inline":true},{"text":", where ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-6.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"text":"represent a vector of independent ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-7.png","element":"img","alt":" κ1","inline":true},{"text":"-dimensional Gaussian ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"text":"random variables. 
The critic parameters for both of these simulations are updated as described earlier in Section ","element":"span"},{"href":"#id-77","text":"4.2.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Hessian Update: ","element":"span"},{"text":"As in RS-SPSA-N, let ","element":"span"},{"style":{"height":18.54},"width":96.36,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-8.png","element":"img","alt":" H(i,j)n","inline":true,"padRight":true},{"text":"denote the ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j","element":"span"},{"text":")","element":"span"},{"text":"th element of the Hessian estimate ","element":"span"},{"style":{"height":13.19},"width":53.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-9.png","element":"img","alt":" Hn","inline":true,"padRight":true},{"text":"at time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":". Using ","element":"span"},{"href":"#id-82","text":"(33)","element":"a"},{"text":", we devise the following update rule for the Hessian estimate ","element":"span"},{"style":{"height":13.19},"width":53.13,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-10.png","element":"img","alt":" Hn","inline":true},{"text":": For ","element":"span"},{"style":{"height":14},"width":402.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-11.png","element":"img","alt":" i, j, k = 1, . . . 
, κ1, j < k","inline":true},{"text":", the update is","element":"span"}],[{"style":{"width":"99%"},"width":1812,"height":612,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-12.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Actor Update: ","element":"span"},{"text":"Using the gradient and Hessian estimates from the above, we update the policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-13.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"id":"id-83","style":{"width":"98%"},"width":1781,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-14.png","element":"img"}],[{"text":"As in the case of RS-SPSA-N, it can be seen that the above update rule is equivalent to descent with a Newton decrement, since ","element":"span"},{"style":{"height":13.19},"width":58.66,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-15.png","element":"img","alt":" Mn","inline":true,"padRight":true},{"text":"converges to ","element":"span"},{"style":{"height":17.9},"width":211.03,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-16.png","element":"img","alt":" ∇2θL(θ, λ)−1","inline":true},{"text":", and the last term in the brackets in ","element":"span"},{"href":"#id-83","text":"(36) ","element":"a"},{"text":"converges to ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-17.png","element":"img","alt":" ∇θL(θ, λ)","inline":true},{"text":". 
","element":"span"},{"text":"The Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/16-18.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"update here is the same as that in RS-SF-G.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The second-order variants of the algorithms for SR optimization can be worked out along similar lines as outlined in Section ","element":"span"},{"href":"#id-66","style":{"fontStyle":"italic"},"text":"4.4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and the details are omitted here.","element":"span"}]]},{"heading":"5 Average Reward Setting","paragraphs":[[{"text":"The average reward under policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-0.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"is defined as","element":"span"}],[{"style":{"width":"76%"},"width":1396,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":10.8},"width":39.74,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-2.png","element":"img","alt":" dµ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.58},"width":43.15,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-3.png","element":"img","alt":" πµ","inline":true,"padRight":true},{"text":"are the stationary distributions of policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-4.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"over states and state-action pairs, 
respectively (see Section ","element":"span"},{"text":"2)","element":"span"},{"text":". The goal in the standard (risk-neutral) average reward formulation is to find an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"average optimal ","element":"span"},{"text":"policy, i.e., ","element":"span"},{"style":{"height":18.3},"width":336.41,"height":45.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-5.png","element":"img","alt":" µ∗ = arg maxµ ρ(µ)","inline":true},{"text":". For all states ","element":"span"},{"style":{"height":11.6},"width":104.5,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-6.png","element":"img","alt":" x ∈ X","inline":true,"padRight":true},{"text":"and actions ","element":"span"},{"style":{"height":12.4},"width":101.79,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-7.png","element":"img","alt":" a ∈ A","inline":true},{"text":", the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"differential ","element":"span"},{"text":"action-value and value functions of policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-8.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"are defined respectively as","element":"span"}],[{"style":{"width":"45%"},"width":831,"height":217,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-9.png","element":"img"}],[{"text":"These functions satisfy the following Poisson equations ","element":"span"},{"href":"#id-0","referenceIndex":47,"text":"[47]","element":"a"}],[{"style":{"width":"78%"},"width":1431,"height":195,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-10.png","element":"img"}],[{"text":"In the context of risk-sensitive MDPs, different criteria have been proposed to define a measure of 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"variability ","element":"span"},{"text":"in the average reward setting, among which we consider the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"long-run variance ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-11.png","element":"img","alt":" µ","inline":true,"padRight":true},{"href":"#id-20","referenceIndex":27,"text":"[27] ","element":"a"},{"text":"defined as","element":"span"}],[{"style":{"width":"86%"},"width":1560,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-12.png","element":"img"}],[{"text":"This notion of variability is based on the observation that it is the frequency of occurrence of state-action pairs that determines the variability in the average reward. It is easy to show that","element":"span"}],[{"style":{"width":"60%"},"width":1097,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-13.png","element":"img"}],[{"text":"We consider the following risk-sensitive measure for average reward MDPs in this paper:","element":"span"}],[{"id":"id-84","style":{"width":"67%"},"width":1222,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-14.png","element":"img"}],[{"text":"for a given ","element":"span"},{"style":{"height":11.6},"width":102.59,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-15.png","element":"img","alt":" α > 0","inline":true},{"text":".","element":"span"},{"text":"5 ","element":"span"},{"text":"As in the discounted setting, we employ the Lagrangian relaxation procedure to convert ","element":"span"},{"href":"#id-84","text":"(40) ","element":"a"},{"text":"to the unconstrained 
problem","element":"span"}],[{"style":{"width":"41%"},"width":751,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-16.png","element":"img"}],[{"text":"As in the discounted setting, we descend in ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-17.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"using ","element":"span"},{"style":{"height":16},"width":592.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-18.png","element":"img","alt":" ∇θL(θ, λ) = −∇θρ(θ) + λ∇θΛ(θ)","inline":true,"padRight":true},{"text":"and ascend in ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-19.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"using ","element":"span"},{"style":{"height":16},"width":410.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-20.png","element":"img","alt":"∇λL(θ, λ) = Λ(θ) − α","inline":true},{"text":", to find the saddle point of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-21.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":16},"width":600.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-22.png","element":"img","alt":" ∇θΛ(θ) = ∇θη(θ) − 2ρ(θ)∇θρ(θ)","inline":true},{"text":", in order to compute ","element":"span"},{"style":{"height":16},"width":130.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-23.png","element":"img","alt":" ∇θΛ(θ)","inline":true,"padRight":true},{"text":"it would be enough to calculate ","element":"span"},{"style":{"height":16},"width":123.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-24.png","element":"img","alt":" ∇θρ(θ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":123.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-25.png","element":"img","alt":" ∇θη(θ)","inline":true},{"text":". Let ","element":"span"},{"style":{"height":10.8},"width":50.55,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-26.png","element":"img","alt":" U µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":62.17,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/17-27.png","element":"img","alt":" W µ","inline":true,"padRight":true},{"text":"denote the ","element":"span"},{"text":"differential value and action-value functions associated with the square reward under policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-0.png","element":"img","alt":" µ","inline":true},{"text":", respectively. 
These two quantities satisfy the following Poisson equations:","element":"span"}],[{"id":"id-85","style":{"width":"79%"},"width":1446,"height":196,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-1.png","element":"img"}],[{"text":"The gradients of ","element":"span"},{"style":{"height":16},"width":71.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-2.png","element":"img","alt":" ρ(θ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":72.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-3.png","element":"img","alt":" η(θ)","inline":true,"padRight":true},{"text":"are given by the following lemma:","element":"span"}],[{"id":"id-44","style":{"fontWeight":"bold"},"text":"Lemma 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under (A1) and (A2), we have","element":"span"}],[{"id":"id-87","style":{"width":"72%"},"width":1316,"height":202,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof of ","element":"span"},{"style":{"height":16},"width":123.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-5.png","element":"img","alt":" ∇θρ(θ)","inline":true,"padRight":true},{"text":"can be found in the literature (e.g., ","element":"span"},{"href":"#id-7","referenceIndex":59,"text":"[59, ","element":"a"},{"href":"#id-8","referenceIndex":33,"text":"33]","element":"a"},{"text":"). 
To prove ","element":"span"},{"style":{"height":16},"width":123.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-6.png","element":"img","alt":" ∇θη(θ)","inline":true},{"text":", we start from the fact that, by ","element":"span"},{"href":"#id-85","text":"(41)","element":"a"},{"text":", we have ","element":"span"},{"style":{"height":16.78},"width":466.32,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-7.png","element":"img","alt":" U(x) = ∑a µ(a|x)W(x, a)","inline":true},{"text":". If we take the derivative w.r.t. ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-8.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"of both sides of this ","element":"span"},{"text":"equation, we obtain","element":"span"}],[{"id":"id-86","style":{"width":"88%"},"width":1597,"height":308,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-9.png","element":"img"}],[{"text":"The second equality follows by substituting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"from ","element":"span"},{"href":"#id-85","text":"(41)","element":"a"},{"text":". 
Now if we take the weighted sum, weighted by ","element":"span"},{"style":{"height":16},"width":139.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-10.png","element":"img","alt":" dµ(x) =","inline":true},{"style":{"height":17.39},"width":93.04,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-11.png","element":"img","alt":"dθ(x)","inline":true},{"text":", of both sides of ","element":"span"},{"href":"#id-86","text":"(44)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"99%"},"width":1812,"height":335,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-12.png","element":"img"}],[{"text":"Note that ","element":"span"},{"href":"#id-87","text":"(43) ","element":"a"},{"text":"for calculating ","element":"span"},{"style":{"height":16},"width":105.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-13.png","element":"img","alt":" ∇η(θ)","inline":true,"padRight":true},{"text":"closely resembles ","element":"span"},{"href":"#id-87","text":"(42) ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":16},"width":105.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-14.png","element":"img","alt":" ∇ρ(θ)","inline":true},{"text":", and thus, similar to what we have for ","element":"span"},{"href":"#id-87","text":"(42)","element":"a"},{"text":", any function ","element":"span"},{"style":{"height":11.2},"width":203.09,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-15.png","element":"img","alt":" b : X → R","inline":true,"padRight":true},{"text":"can be added to or subtracted from ","element":"span"},{"style":{"height":16},"width":173.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-16.png","element":"img","alt":" W(x, a; θ)","inline":true,"padRight":true},{"text":"on the RHS of 
","element":"span"},{"href":"#id-87","text":"(43) ","element":"a"},{"text":"without changing the result of the integral (see e.g., ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":"). So, we can replace ","element":"span"},{"style":{"height":16},"width":173.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-17.png","element":"img","alt":" W(x, a; θ)","inline":true,"padRight":true},{"text":"with the square reward advantage function ","element":"span"},{"style":{"height":16},"width":563.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-18.png","element":"img","alt":" B(x, a; θ) = W(x, a; θ) − U(x; θ)","inline":true,"padRight":true},{"text":"on the RHS of ","element":"span"},{"href":"#id-87","text":"(43) ","element":"a"},{"text":"in the same manner as we can replace ","element":"span"},{"style":{"height":16},"width":162.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-19.png","element":"img","alt":" Q(x, a; θ)","inline":true,"padRight":true},{"text":"with the advantage function ","element":"span"},{"style":{"height":16},"width":550,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-20.png","element":"img","alt":" A(x, a; θ) = Q(x, a; θ) − V (x; θ)","inline":true,"padRight":true},{"text":"on the RHS of ","element":"span"},{"href":"#id-87","text":"(42) ","element":"a"},{"text":"without changing the result of the integral. 
We define the temporal difference (TD) errors ","element":"span"},{"style":{"height":13.99},"width":37.71,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-21.png","element":"img","alt":" δn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":4.8},"width":36.18,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-22.png","element":"img","alt":" ϵn","inline":true,"padRight":true},{"text":"for the differential value and square value functions as","element":"span"}],[{"style":{"width":"42%"},"width":766,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-23.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-24.png","element":"img","alt":"V̂","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":14},"width":74.17,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-25.png","element":"img","alt":"Û, ρ̂","inline":true},{"text":", and ","element":"span"},{"style":{"height":10.4},"width":21.75,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-26.png","element":"img","alt":" η̂","inline":true,"padRight":true},{"text":"are unbiased estimators of ","element":"span"},{"style":{"height":16},"width":222.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-27.png","element":"img","alt":" V µ, U µ, ρ(µ)","inline":true},{"text":", and ","element":"span"},{"style":{"height":16},"width":76.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-28.png","element":"img","alt":" η(µ)","inline":true},{"text":", respectively, then we show in Lemma ","element":"span"},{"href":"#id-45","text":"4 ","element":"a"},{"text":"that 
","element":"span"},{"style":{"height":13.99},"width":37.71,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-29.png","element":"img","alt":"δn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":4.8},"width":36.18,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-30.png","element":"img","alt":" ϵn","inline":true,"padRight":true},{"text":"are unbiased estimates of the advantage functions ","element":"span"},{"style":{"height":11.6},"width":48.89,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-31.png","element":"img","alt":" Aµ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":51.22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-32.png","element":"img","alt":" Bµ","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":16},"width":500.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-33.png","element":"img","alt":" E[δn| xn, an, µ] = Aµ(xn, an)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"id":"id-45","style":{"height":16},"width":498.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/18-34.png","element":"img","alt":"E[ϵn| xn, an, µ] = Bµ(xn, an)","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any given policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-0.png","element":"img","alt":" µ","inline":true},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"style":{"width":"67%"},"width":1221,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The first statement ","element":"span"},{"style":{"height":16},"width":503.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-2.png","element":"img","alt":" E[δn| xn, an, µ] = Aµ(xn, an)","inline":true,"padRight":true},{"text":"has been proved in Lemma 3 of ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":", so here we only prove the second statement ","element":"span"},{"style":{"height":16},"width":498.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-3.png","element":"img","alt":" E[ϵn| xn, an, µ] = Bµ(xn, an)","inline":true},{"text":". 
We may write","element":"span"}],[{"style":{"width":"88%"},"width":1606,"height":602,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-4.png","element":"img"}],[{"text":"From Lemma ","element":"span"},{"href":"#id-45","text":"4, ","element":"a"},{"text":"we notice that ","element":"span"},{"style":{"height":14.8},"width":85.36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-5.png","element":"img","alt":" δnψn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":83.82,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-6.png","element":"img","alt":" ϵnψn","inline":true,"padRight":true},{"text":"are unbiased estimates of ","element":"span"},{"style":{"height":16},"width":109.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-7.png","element":"img","alt":" ∇ρ(µ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":109.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-8.png","element":"img","alt":" ∇η(µ)","inline":true},{"text":", respectively, where ","element":"span"},{"style":{"height":16},"width":569,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-9.png","element":"img","alt":"ψn = ψ(xn, an) = ∇ log µ(an|xn)","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"compatible ","element":"span"},{"text":"feature (see e.g., ","element":"span"},{"href":"#id-7","referenceIndex":59,"text":"[59, ","element":"a"},{"href":"#id-9","referenceIndex":43,"text":"43]","element":"a"},{"text":").","element":"span"}]]},{"heading":"6 Average Reward Risk-Sensitive Actor-Critic Algorithm","paragraphs":[[{"text":"We now present our risk-sensitive actor-critic algorithm for average reward MDPs. 
Algorithm ","element":"span"},{"href":"#id-88","text":"2 ","element":"a"},{"text":"presents the complete structure of the algorithm along with the update rules for the average rewards ","element":"span"},{"style":{"height":10.4},"width":99.8,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-10.png","element":"img","alt":" ρ̂n, η̂n","inline":true},{"text":"; TD errors ","element":"span"},{"style":{"height":14.8},"width":93.29,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-11.png","element":"img","alt":" δn, ϵn","inline":true},{"text":"; critic ","element":"span"},{"style":{"height":10},"width":101.53,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-12.png","element":"img","alt":" vn, un","inline":true},{"text":"; and actor ","element":"span"},{"style":{"height":14},"width":101.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-13.png","element":"img","alt":" θn, λn","inline":true,"padRight":true},{"text":"parameters. The projection operators ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-14.png","element":"img","alt":" Γ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":43.9,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-15.png","element":"img","alt":" Γλ","inline":true,"padRight":true},{"text":"are as defined in Section ","element":"span"},{"text":"4, ","element":"span"},{"text":"and similar to the discounted setting, are necessary for the convergence proof of the algorithm. 
The step-size schedules satisfy (A3) defined in Section ","element":"span"},{"text":"4, ","element":"span"},{"text":"and in addition the step-size schedule ","element":"span"},{"style":{"height":16},"width":130.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-16.png","element":"img","alt":" {ζ4(n)}","inline":true,"padRight":true},{"text":"satisfies ","element":"span"},{"style":{"height":16},"width":259.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-17.png","element":"img","alt":" ζ4(n) = kζ3(n)","inline":true},{"text":", for some positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". This is to ensure that the average and critic updates are on the (same) fastest time-scale ","element":"span"},{"style":{"height":16},"width":130.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-18.png","element":"img","alt":" {ζ4(n)}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":130.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-19.png","element":"img","alt":"{ζ3(n)}","inline":true},{"text":", the policy parameter update is on the intermediate time-scale ","element":"span"},{"style":{"height":16},"width":130.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-20.png","element":"img","alt":" {ζ2(n)}","inline":true},{"text":", and the Lagrange multiplier update is on the slowest time-scale ","element":"span"},{"style":{"height":16},"width":130.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-21.png","element":"img","alt":" {ζ1(n)}","inline":true},{"text":". 
This results in a three time-scale stochastic approximation algorithm.","element":"span"}],[{"text":"As in the discounted setting, the critic uses linear approximation for the differential value and square value functions, i.e., ","element":"span"},{"style":{"height":16},"width":279.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-22.png","element":"img","alt":"V̂(x) = v⊤φv(x)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":282.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-23.png","element":"img","alt":"Û(x) = u⊤φu(x)","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":85.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-24.png","element":"img","alt":" φv(·)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":87.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-25.png","element":"img","alt":" φu(·)","inline":true,"padRight":true},{"text":"are feature vectors of size ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-26.png","element":"img","alt":" κ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.59},"width":38.96,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-27.png","element":"img","alt":"κ3","inline":true},{"text":", respectively. 
Although our estimates of ","element":"span"},{"style":{"height":16},"width":71.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-28.png","element":"img","alt":" ρ(θ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":72.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-29.png","element":"img","alt":" η(θ)","inline":true,"padRight":true},{"text":"are unbiased, since we use biased estimates for ","element":"span"},{"style":{"height":13.38},"width":47.1,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-30.png","element":"img","alt":" V θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.38},"width":46.55,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-31.png","element":"img","alt":"U θ","inline":true,"padRight":true},{"text":"(linear approximations in the critic), our gradient estimates ","element":"span"},{"style":{"height":16},"width":123.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-32.png","element":"img","alt":" ∇θρ(θ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":123.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-33.png","element":"img","alt":" ∇θη(θ)","inline":true},{"text":", and as a result ","element":"span"},{"style":{"height":16},"width":170.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-34.png","element":"img","alt":" ∇θL(θ, λ)","inline":true},{"text":", are biased. 
The following lemma shows the bias in our estimate of ","element":"span"},{"style":{"height":16},"width":170.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-35.png","element":"img","alt":" ∇θL(θ, λ)","inline":true},{"text":".","element":"span"}],[{"id":"id-46","style":{"fontWeight":"bold"},"text":"Lemma 5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The bias of our actor-critic algorithm in estimating ","element":"span"},{"style":{"height":16},"width":170.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-36.png","element":"img","alt":" ∇θL(θ, λ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for fixed ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-37.png","element":"img","alt":" θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-38.png","element":"img","alt":" λ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is","element":"span"}],[{"style":{"width":"85%"},"width":1553,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-39.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":17.38},"width":148.93,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-40.png","element":"img","alt":" vθ⊤φv(·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":17.38},"width":152.89,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-41.png","element":"img","alt":" uθ⊤φu(·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are estimates of 
","element":"span"},{"style":{"height":17.38},"width":92.7,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-42.png","element":"img","alt":" V θ(·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":17.38},"width":92.15,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-43.png","element":"img","alt":" U θ(·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"upon convergence of the TD recursion, and","element":"span"}],[{"style":{"width":"59%"},"width":1078,"height":196,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/19-44.png","element":"img"}],[{"id":"id-88","style":{"width":"100%"},"width":1814,"height":959,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The bias in estimating ","element":"span"},{"style":{"height":16},"width":152.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-1.png","element":"img","alt":" ∇L(θ, λ)","inline":true,"padRight":true},{"text":"consists of the bias in estimating ","element":"span"},{"style":{"height":16},"width":105.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-2.png","element":"img","alt":" ∇ρ(θ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":105.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-3.png","element":"img","alt":" ∇η(θ)","inline":true},{"text":". Lemma 4 in Bhatnagar et al. 
","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14] ","element":"a"},{"text":"shows the bias in estimating ","element":"span"},{"style":{"height":16},"width":105.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-4.png","element":"img","alt":" ∇ρ(θ)","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"style":{"width":"52%"},"width":959,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.39},"width":893.37,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-6.png","element":"img","alt":" δθn = R(xn, an) − �ρn+1 + vθ⊤φv(xn+1) − vθ⊤φv(xn)","inline":true},{"text":". Similarly we can prove that the bias in estimating ","element":"span"},{"style":{"height":16},"width":105.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-7.png","element":"img","alt":"∇η(θ)","inline":true,"padRight":true},{"text":"is","element":"span"}],[{"style":{"width":"53%"},"width":962,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.38},"width":918.85,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-9.png","element":"img","alt":" ϵθn = R(xn, an) − �ηn+1 + uθ⊤φu(xn+1) − uθ⊤φu(xn)","inline":true},{"text":". 
The claim follows by putting these two results ","element":"span"},{"text":"together and given the fact that ","element":"span"},{"style":{"height":16},"width":520.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-10.png","element":"img","alt":" ∇Λ(θ) = ∇η(θ) − 2ρ(θ)∇ρ(θ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":530.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-11.png","element":"img","alt":" ∇L(θ, λ) = −∇ρ(θ) + λ∇Λ(θ)","inline":true},{"text":". Note that the following fact holds for the bias in estimating ","element":"span"},{"style":{"height":16},"width":105.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-12.png","element":"img","alt":" ∇ρ(θ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":105.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-13.png","element":"img","alt":" ∇η(θ)","inline":true},{"text":":","element":"span"}],[{"id":"id-49","style":{"width":"99%"},"width":1811,"height":430,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-14.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and thus, the actor recursion for the SR-variant of our average reward risk-sensitive actor-critic algorithm is as follows:","element":"span"}],[{"style":{"width":"82%"},"width":1498,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/20-15.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Note that the rest of the updates, including the average reward, TD errors, and critic recursions are as in the risk-sensitive actor-critic algorithm presented in Algorithm ","element":"span"},{"href":"#id-88","style":{"fontStyle":"italic"},"text":"2. 
","element":"a"},{"style":{"fontStyle":"italic"},"text":"Similar to the discounted setting, since there is no Lagrange multiplier in the SR optimization, the resulting actor-critic algorithm is a two time-scale stochastic approximation algorithm.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 10. ","element":"span"},{"id":"id-89","style":{"fontStyle":"italic"},"text":"In the discounted setting, another popular variability measure is the ","element":"span"},{"text":"discounted normalized variance ","element":"span"},{"href":"#id-20","referenceIndex":27,"style":{"fontStyle":"italic"},"text":"[27]","element":"a"}],[{"style":{"width":"66%"},"width":1202,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":20.57},"width":632.19,"height":51.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-1.png","element":"img","alt":" ργ(µ) = �x,a dµγ(x|x0)µ(a|x)r(x, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.72},"width":148.11,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-2.png","element":"img","alt":" dµγ(x|x0)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-3.png","element":"img","alt":" γ","inline":true},{"style":{"fontStyle":"italic"},"text":"-discounted visiting distribution of state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"under policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-4.png","element":"img","alt":" 
µ","inline":true},{"style":{"fontStyle":"italic"},"text":", defined in Section ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The variability measure ","element":"span"},{"href":"#id-89","text":"(50) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"has close resemblance to the average reward variability measure ","element":"span"},{"href":"#id-84","text":"(39)","element":"a"},{"style":{"fontStyle":"italic"},"text":", and thus, any (discounted) risk measure based on ","element":"span"},{"href":"#id-89","text":"(50) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"can be optimized similar to the corresponding average reward risk measure ","element":"span"},{"href":"#id-84","text":"(39)","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 11. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"(Simultaneous perturbation analogues) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In the average reward setting, a simultaneous perturbation algorithm would estimate the average reward ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-5.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and the square reward ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-6.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on the faster timescale and use these to estimate the gradient of the performance objective. 
However, a drawback of this approach, compared to the algorithm proposed above, is the need for two simulated trajectories (instead of one) for each policy update.","element":"span"}],[{"text":"In the following section, we establish the convergence of our average reward actor-critic algorithm to a (local) saddle point of the risk-sensitive objective function ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-7.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":".","element":"span"}]]},{"heading":"7 Convergence Analysis of the Discounted Reward Risk-Sensitive Actor-","paragraphs":[[{"style":{"width":"23%"},"width":432,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-8.png","element":"img"}],[{"text":"Our proposed actor-critic algorithms use multi-timescale stochastic approximation and we use the ordinary differential equation (ODE) approach (see Chapter 6 of ","element":"span"},{"href":"#id-58","referenceIndex":20,"text":"[20]","element":"a"},{"text":") to analyze their convergence. We first provide the analysis for the SPSA-based first-order algorithm RS-SPSA-G in Section ","element":"span"},{"href":"#id-90","text":"7.1 ","element":"a"},{"text":"and later provide the necessary modifications to the proof for the SF-based first-order algorithm and the SPSA/SF-based second-order algorithms.","element":"span"}],[{"id":"id-90","style":{"fontWeight":"bold"},"text":"7.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence of the First-Order Algorithm: RS-SPSA-G","element":"span"}],[{"text":"Recall that RS-SPSA-G is a two-loop scheme where the inner loop is a TD critic that evaluates the value/square value functions for both the unperturbed and the perturbed policy parameters. 
On the other hand, the outer loop is a two-timescale stochastic approximation algorithm, where the faster timescale updates policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-9.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"in the descent direction using SPSA estimates of the gradient of the Lagrangian and the slower timescale performs dual ascent for the Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-10.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"using sample constraint values. The faster timescale ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-11.png","element":"img","alt":" θ","inline":true},{"text":"-recursion sees the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-12.png","element":"img","alt":"λ","inline":true},{"text":"-updates on the slower timescales as quasi-static, while the slower timescale ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-13.png","element":"img","alt":" λ","inline":true},{"text":"-recursion sees the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-14.png","element":"img","alt":" θ","inline":true},{"text":"-updates as equilibrated.","element":"span"}],[{"text":"The proof of convergence of the RS-SPSA-G algorithm to a (local) saddle point of the risk-sensitive objective function 
","element":"span"},{"style":{"height":19.2},"width":1305.45,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-15.png","element":"img","alt":"�L(θ, λ)△= −�V θ(x0) + λ(�Λθ(x0) − α)= − �V θ(x0) + λ��U θ(x0) − �V θ(x0)2 − α�","inline":true},{"text":"contains the following three main steps:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 1: Critic’s Convergence. ","element":"span"},{"text":"We establish that, for any given values of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-16.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-17.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"that are updated on slower timescales, the TD critic converges to a fixed point of the projected Bellman operator for value and square value functions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 2: Convergence of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-18.png","element":"img","alt":" θ","inline":true},{"style":{"fontWeight":"bold"},"text":"-recursion. ","element":"span"},{"text":"We utilize the fact that owing to projection, the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-19.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"parameter is stable. 
Using a Lyapunov argument, we show that the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-20.png","element":"img","alt":" θ","inline":true},{"text":"-recursion tracks the ODE ","element":"span"},{"href":"#id-91","text":"(54) ","element":"a"},{"text":"in the asymptotic limit, for any given value of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/21-21.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"on the slowest timescale.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 3: Convergence of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-0.png","element":"img","alt":" λ","inline":true},{"style":{"fontWeight":"bold"},"text":"-recursion. ","element":"span"},{"text":"This step is similar to earlier analysis for constrained MDPs. 
In particular, we show that ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-1.png","element":"img","alt":" λ","inline":true},{"text":"-recursion in ","element":"span"},{"href":"#id-92","text":"(23) ","element":"a"},{"text":"converges and the overall convergence of ","element":"span"},{"style":{"height":16},"width":134.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-2.png","element":"img","alt":" (θn, λn)","inline":true,"padRight":true},{"text":"is to a local saddle point ","element":"span"},{"style":{"height":18.6},"width":148.14,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-3.png","element":"img","alt":"(θλ∗, λ∗)","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-4.png","element":"img","alt":"�L(θ, λ)","inline":true},{"text":", with ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-5.png","element":"img","alt":" θλ∗","inline":true,"padRight":true},{"text":"satisfying the variance constraint in ","element":"span"},{"href":"#id-53","text":"(3)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 1: (Critic’s Convergence) ","element":"span"},{"text":"Since the critic’s update is in the inner loop, we can assume in this analysis that ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-6.png","element":"img","alt":"θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-7.png","element":"img","alt":" 
λ","inline":true,"padRight":true},{"text":"are time-invariant quantities. The following theorem shows that the TD critic estimates for the value and square value function converge to the fixed point given by ","element":"span"},{"href":"#id-71","text":"(16)","element":"a"},{"text":", for any given policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-8.png","element":"img","alt":" θ","inline":true},{"text":".","element":"span"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Theorem 6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under (A1)-(A4), for any given policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-9.png","element":"img","alt":" θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-10.png","element":"img","alt":" λ","inline":true},{"style":{"fontStyle":"italic"},"text":", the critic parameters ","element":"span"},{"style":{"height":16},"width":89.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-11.png","element":"img","alt":"{vm}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":93.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-12.png","element":"img","alt":" {um}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"governed by the recursions of ","element":"span"},{"href":"#id-69","text":"(17) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"converge almost surely, 
i.e.,","element":"span"}],[{"style":{"width":"34%"},"width":618,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"In the above, ","element":"span"},{"style":{"height":9.6},"width":21.51,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-14.png","element":"img","alt":" ¯v","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.6},"width":23,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-15.png","element":"img","alt":" ¯u","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are the solutions to the TD fixed point equations for policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-16.png","element":"img","alt":" θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(see ","element":"span"},{"href":"#id-71","text":"(16) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"in Section ","element":"span"},{"href":"#id-77","style":{"fontStyle":"italic"},"text":"4.2).","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Remark 12. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"It is easy to conclude from the above theorem that the TD critic parameters for the perturbed policy parameter also converge almost surely, i.e., ","element":"span"},{"style":{"height":16.92},"width":172.98,"height":42.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-17.png","element":"img","alt":" v+m → ¯v+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.92},"width":178.54,"height":42.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-18.png","element":"img","alt":" u+m → ¯u+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"a.s., where ","element":"span"},{"style":{"height":12.98},"width":45.74,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-19.png","element":"img","alt":" ¯v+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":12.98},"width":47.81,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-20.png","element":"img","alt":" ¯u+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are the unique ","element":"span"},{"style":{"fontStyle":"italic"},"text":"solutions to TD fixed point relations for perturbed policy ","element":"span"},{"style":{"height":12.8},"width":117.95,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-21.png","element":"img","alt":" θ + δ∆","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Here ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-22.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a fixed realization of the perturbation random variable that is updated in the outer loop.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"-recursion in ","element":"span"},{"href":"#id-69","text":"(17) ","element":"a"},{"text":"is performing TD with function approximation for the value function, while the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":"-recursion is doing the same for the square value function. The convergence of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":"-recursion to the fixed point in ","element":"span"},{"href":"#id-71","text":"(16) ","element":"a"},{"text":"can be inferred from ","element":"span"},{"href":"#id-67","referenceIndex":65,"text":"[65]","element":"a"},{"text":".","element":"span"}],[{"text":"Using an approach similar to ","element":"span"},{"href":"#id-93","referenceIndex":64,"text":"[64]","element":"a"},{"text":", we combine the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"recursions and establish convergence using a stability argument as follows. Let ","element":"span"},{"style":{"height":16},"width":278.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-23.png","element":"img","alt":" wm = (vm, um)T","inline":true},{"text":". 
Then, ","element":"span"},{"href":"#id-69","text":"(17) ","element":"a"},{"text":"can be seen to be equivalent to","element":"span"}],[{"style":{"width":"99%"},"width":1813,"height":529,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-24.png","element":"img"}],[{"text":"The above ODE has a unique globally asymptotically stable equilibrium, since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"is negative definite. To see the latter fact, observe that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"is block triangular and hence its eigenvalues are those of ","element":"span"},{"style":{"height":17.38},"width":321.32,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-25.png","element":"img","alt":" ΦTvDθ(γP θ − I)Φv","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.39},"width":343.81,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-26.png","element":"img","alt":"ΦTuDθ(γ2P θ − I)Φu","inline":true},{"text":". It can be inferred from Theorem 2 of ","element":"span"},{"href":"#id-67","referenceIndex":65,"text":"[65] ","element":"a"},{"text":"that the aforementioned matrices are negative ","element":"span"},{"text":"definite. 
For the sake of completeness, we provide a brief sketch in the following: For any ","element":"span"},{"style":{"height":14.99},"width":163.02,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-27.png","element":"img","alt":" V ∈ R|X|","inline":true},{"text":", it can be shown that","element":"span"},{"style":{"height":20.72},"width":336.61,"height":51.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-28.png","element":"img","alt":"��P θV��Dθ ≤ ∥V ∥Dθ","inline":true,"padRight":true},{"text":"(see Lemma 1 in ","element":"span"},{"href":"#id-67","referenceIndex":65,"text":"[65] ","element":"a"},{"text":"for a proof). Now,","element":"span"}],[{"style":{"width":"39%"},"width":717,"height":203,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-29.png","element":"img"}],[{"text":"Hence, ","element":"span"},{"style":{"height":20.4},"width":689.93,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-30.png","element":"img","alt":" V TDθ(γP θ − I)V ≤ (γ − 1) ∥V ∥2Dθ < 0","inline":true},{"text":". By (A4), we know that ","element":"span"},{"style":{"height":13.19},"width":44.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-31.png","element":"img","alt":" Φv","inline":true,"padRight":true},{"text":"is full rank implying the negative ","element":"span"},{"text":"definiteness of ","element":"span"},{"style":{"height":17.38},"width":317.8,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-32.png","element":"img","alt":" ΦTvDθ(γP θ − I)Φv","inline":true},{"text":". 
Using the same argument as above and replacing ","element":"span"},{"style":{"height":13.19},"width":44.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-33.png","element":"img","alt":" Φv","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":13.19},"width":47.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-34.png","element":"img","alt":" Φu","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-35.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":16.98},"width":38.84,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-36.png","element":"img","alt":" γ2","inline":true},{"text":", ","element":"span"},{"text":"one can conclude that ","element":"span"},{"style":{"height":17.38},"width":339.74,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/22-37.png","element":"img","alt":" ΦTuDθ(γ2P θ − I)Φu","inline":true},{"text":" is also negative definite.","element":"span"}],[{"text":"The final claim now follows by applying Theorems 2.1-2.2(i) of ","element":"span"},{"href":"#id-94","referenceIndex":22,"text":"[22]","element":"a"},{"text":", provided we verify assumptions (A1)-(A2) there. The latter assumptions are given as follows:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(A1) ","element":"span"},{"text":"The function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"is Lipschitz. 
For any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":", define ","element":"span"},{"style":{"height":16},"width":291.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-0.png","element":"img","alt":" hc(w) = h(cw)/c","inline":true},{"text":". Then, there exists a continuous function ","element":"span"},{"style":{"height":13.19},"width":54.96,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-1.png","element":"img","alt":" h∞","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":13.19},"width":156.14,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-2.png","element":"img","alt":" hc → h∞","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":8.8},"width":119.25,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-3.png","element":"img","alt":" c → ∞","inline":true,"padRight":true},{"text":"uniformly on compacts. 
Furthermore, the origin is an asymptotically stable equilibrium for the ODE","element":"span"}],[{"id":"id-95","style":{"width":"55%"},"width":1006,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"(A2) ","element":"span"},{"text":"The martingale difference ","element":"span"},{"style":{"height":16},"width":267.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-5.png","element":"img","alt":" {∆Mm, m ≥ 1}","inline":true,"padRight":true},{"text":"is square-integrable with","element":"span"}],[{"style":{"width":"43%"},"width":791,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.19},"width":139.5,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-7.png","element":"img","alt":" C0 < ∞","inline":true},{"text":".","element":"span"}],[{"style":{"width":"96%"},"width":1754,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"is negative definite, it is easy to see that the origin is an asymptotically stable equilibrium for the ODE ","element":"span"},{"href":"#id-95","text":"(53)","element":"a"},{"text":". (A2) can also be verified by using the same arguments that were used to show that the martingale difference associated with the regular TD algorithm with function approximation satisfies a bound on the second moment (cf. 
","element":"span"},{"href":"#id-67","referenceIndex":65,"text":"[65]","element":"a"},{"text":").","element":"span"}],[{"style":{"width":"99%"},"width":1810,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-9.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Step 2: (Analysis of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-10.png","element":"img","alt":" θ","inline":true},{"style":{"fontWeight":"bold"},"text":"-recursion) ","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":10.79},"width":164.36,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-11.png","element":"img","alt":" mn → ∞","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":8.8},"width":131.59,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-12.png","element":"img","alt":" n → ∞","inline":true},{"text":", we can assume that the inner TD critic loop has converged for the purpose of analysing the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-13.png","element":"img","alt":" θ","inline":true},{"text":"-recursion in ","element":"span"},{"href":"#id-92","text":"(23)","element":"a"},{"text":". Due to timescale separation, the value of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-14.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"(updated on a slower timescale) is assumed to be constant for the analysis of the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-15.png","element":"img","alt":" θ","inline":true},{"text":"-update. 
To see this in rigorous terms, first rewrite the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-16.png","element":"img","alt":" λ","inline":true},{"text":"-recursion as","element":"span"}],[{"style":{"width":"28%"},"width":508,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-17.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":28.8},"width":775.36,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-18.png","element":"img","alt":"ˆH(n) = ζ1(n)ζ2(n)�uTnφu(x0) −�vTnφv(x0)�2 − α�","inline":true},{"text":". Since the critic recursions converge, it is easy to see that","element":"span"}],[{"style":{"height":18.83},"width":180.11,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-19.png","element":"img","alt":"supn ˆH(n)","inline":true,"padRight":true},{"text":"is finite. Combining this with the observation that ","element":"span"},{"style":{"height":24.43},"width":205.69,"height":61.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-20.png","element":"img","alt":"ζ1(n)ζ2(n) = o(1)","inline":true,"padRight":true},{"text":"due to the assumption (A3) on step-sizes, ","element":"span"},{"text":"we see that the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-21.png","element":"img","alt":" λ","inline":true},{"text":"-recursion above tracks the ODE ","element":"span"},{"style":{"height":15.01},"width":96.4,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-22.png","element":"img","alt":"˙λ = 0","inline":true},{"text":".","element":"span"}],[{"text":"In the following, we show that the update of 
","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-23.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is equivalent to gradient descent for the function ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-24.png","element":"img","alt":"�L(θ, λ)","inline":true,"padRight":true},{"text":"and converges to a limiting set that depends on ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-25.png","element":"img","alt":" λ","inline":true},{"text":". Consider the following ODE","element":"span"}],[{"id":"id-91","style":{"width":"59%"},"width":1087,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-26.png","element":"img"}],[{"text":"with the limiting set ","element":"span"},{"style":{"height":19.92},"width":590.6,"height":49.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-27.png","element":"img","alt":" Zλ =�θ ∈ C : ˇΓ�∇�L(θt, λ)�= 0�","inline":true},{"text":". 
In the above, ","element":"span"},{"style":{"height":18.43},"width":67.47,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-28.png","element":"img","alt":"ˇΓ(·)","inline":true,"padRight":true},{"text":"is a projection operator that ensures that the evolution of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-29.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"via the ODE ","element":"span"},{"href":"#id-91","text":"(54) ","element":"a"},{"text":"stays within the set ","element":"span"},{"style":{"height":21.46},"width":379.24,"height":53.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-30.png","element":"img","alt":" Θ := ∏_{i=1}^{κ1} [θ(i)min, θ(i)max]","inline":true,"padRight":true},{"text":"and is defined as follows: for any ","element":"span"},{"text":"bounded continuous function ","element":"span"},{"style":{"height":16},"width":66.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-31.png","element":"img","alt":" f(·)","inline":true},{"text":",","element":"span"}],[{"id":"id-100","style":{"width":"65%"},"width":1193,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-32.png","element":"img"}],[{"text":"Notice that the limit above may not exist; in that case, as pointed out on p. 191 of ","element":"span"},{"href":"#id-96","referenceIndex":36,"text":"[36]","element":"a"},{"text":", one can define ","element":"span"},{"style":{"height":18.43},"width":131.02,"height":46.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-33.png","element":"img","alt":"ˇΓ(f(θ))","inline":true,"padRight":true},{"text":"to be the set of all possible limit points. 
From the definition above, it can be inferred that for ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-34.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"in the interior of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-35.png","element":"img","alt":" Θ","inline":true},{"text":", ","element":"span"},{"style":{"height":18.43},"width":258.76,"height":46.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-36.png","element":"img","alt":"ˇΓ(f(θ)) = f(θ)","inline":true},{"text":", while for ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-37.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"on the boundary of ","element":"span"},{"style":{"height":18.43},"width":181.74,"height":46.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-38.png","element":"img","alt":" Θ, ˇΓ(f(θ))","inline":true,"padRight":true},{"text":"is the projection of ","element":"span"},{"style":{"height":16},"width":75.11,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-39.png","element":"img","alt":" f(θ)","inline":true,"padRight":true},{"text":"onto the tangent space of the boundary of ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-40.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"at ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-41.png","element":"img","alt":" θ","inline":true},{"text":".","element":"span"}],[{"text":"The main result regarding the convergence of the policy parameter 
","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-42.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"for both the RS-SPSA-G and RS-SF-G algorithms is as follows:","element":"span"}],[{"id":"id-74","style":{"fontWeight":"bold"},"text":"Theorem 7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under (A1)-(A4), for any given Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-43.png","element":"img","alt":" λ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":11.6},"width":96.94,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-44.png","element":"img","alt":" ε > 0","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":14.4},"width":118.78,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-45.png","element":"img","alt":" β0 > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":16.52},"width":452.98,"height":41.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-46.png","element":"img","alt":" β ∈ (0, β0), θn → θ∗ ∈ Zελ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"almost surely. 
Here ","element":"span"},{"style":{"height":19.2},"width":678.02,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-47.png","element":"img","alt":" Zελ = {θ ∈ C : ||θ − θ0|| < ε, θ0 ∈ Zλ}","inline":true},{"style":{"fontStyle":"italic"},"text":"denotes the set of points in the ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-48.png","element":"img","alt":" ε","inline":true},{"style":{"fontStyle":"italic"},"text":"-neighborhood of ","element":"span"},{"style":{"height":13.19},"width":47.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/23-49.png","element":"img","alt":" Zλ","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"To prove the above claim, we require the well-known Hirsch lemma (see ","element":"span"},{"href":"#id-97","referenceIndex":30,"text":"[30, ","element":"a"},{"text":"p. 339]). For the sake of completeness, we recall this result below. Consider the ODE:","element":"span"}],[{"id":"id-98","style":{"width":"54%"},"width":996,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-0.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"be an asymptotically stable attractor for the above ODE and let ","element":"span"},{"style":{"height":10.8},"width":50.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-1.png","element":"img","alt":" Kϵ","inline":true,"padRight":true},{"text":"denote its ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-2.png","element":"img","alt":" ϵ","inline":true},{"text":"-neighbourhood. 
Given ","element":"span"},{"style":{"height":14.4},"width":159.84,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-3.png","element":"img","alt":"T, η > 0","inline":true},{"text":", we call a bounded, measurable ","element":"span"},{"style":{"height":17.38},"width":403.35,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-4.png","element":"img","alt":" y(·) : R+ ∪ {0} → RN","inline":true},{"text":", a ","element":"span"},{"style":{"height":16},"width":97.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-5.png","element":"img","alt":" (T, η)","inline":true},{"text":"-perturbation of ","element":"span"},{"href":"#id-98","text":"(56) ","element":"a"},{"text":"if there exist ","element":"span"},{"style":{"height":14},"width":578.58,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-6.png","element":"img","alt":"0 = T0 < T1 < T2 < · · · < Tr ↑ ∞","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":14.79},"width":299.56,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-7.png","element":"img","alt":" Tr+1 − Tr ≥ T ∀r","inline":true,"padRight":true},{"text":"and solutions ","element":"span"},{"style":{"height":16},"width":327.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-8.png","element":"img","alt":" θr(t), t ∈ [Tr, Tr+1]","inline":true,"padRight":true},{"text":"of ","element":"span"},{"href":"#id-98","text":"(56) ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":13.2},"width":92.22,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-9.png","element":"img","alt":" r ≥ 0","inline":true},{"text":", such 
that","element":"span"}],[{"style":{"width":"27%"},"width":507,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 8 ","element":"span"},{"text":"(Hirsch Lemma)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given ","element":"span"},{"style":{"height":14.4},"width":295.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-11.png","element":"img","alt":" ϵ, T > 0, ∃¯η > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":16},"width":180.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-12.png","element":"img","alt":" ∆ ∈ (0, ¯η)","inline":true},{"style":{"fontStyle":"italic"},"text":", every ","element":"span"},{"style":{"height":16},"width":97.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-13.png","element":"img","alt":" (T, η)","inline":true},{"style":{"fontStyle":"italic"},"text":"-perturbation of ","element":"span"},{"href":"#id-98","style":{"fontStyle":"italic"},"text":"(56) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"converges to ","element":"span"},{"style":{"height":10.8},"width":50.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-14.png","element":"img","alt":" Kϵ","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"href":"#id-74","style":{"fontWeight":"bold"},"text":"7","element":"a"},{"text":") The ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-15.png","element":"img","alt":" θ","inline":true},{"text":"-update in ","element":"span"},{"href":"#id-92","text":"(23) ","element":"a"},{"text":"can be rewritten using the converged TD-parameters ","element":"span"},{"style":{"height":16},"width":92.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-16.png","element":"img","alt":" (¯v, ¯u)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.98},"width":145.68,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-17.png","element":"img","alt":" (¯v+, ¯u+)","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"id":"id-99","style":{"width":"92%"},"width":1675,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-18.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"65%"},"width":1187,"height":226,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-19.png","element":"img"}],[{"text":"Since the trajectory length ","element":"span"},{"style":{"height":10.79},"width":164.67,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-20.png","element":"img","alt":" mn → ∞","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":8.8},"width":131.91,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-21.png","element":"img","alt":" n → ∞","inline":true},{"text":", the TD-critic converges in the inner loop (see Theorem ","element":"span"},{"href":"#id-70","text":"6) ","element":"a"},{"text":"and hence, 
","element":"span"},{"style":{"height":16.79},"width":188.35,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-22.png","element":"img","alt":" ξ1,n = o(1)","inline":true},{"text":". Thus, the ","element":"span"},{"style":{"height":15.59},"width":62.78,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-23.png","element":"img","alt":" ξ1,n","inline":true,"padRight":true},{"text":"term can be ignored in the asymptotic analysis of the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-24.png","element":"img","alt":" θ","inline":true},{"text":"-recursion.","element":"span"}],[{"text":"Recall that ","element":"span"},{"style":{"height":9.6},"width":21.52,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-26.png","element":"img","alt":" ¯v","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.98},"width":45.75,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-25.png","element":"img","alt":" ¯v+","inline":true,"padRight":true},{"text":"are the converged critic parameters corresponding to policies ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-27.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.8},"width":130.27,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-28.png","element":"img","alt":" θ + β∆","inline":true},{"text":", respectively. 
Letting ","element":"span"},{"style":{"height":16},"width":130.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-29.png","element":"img","alt":"�V (θ) =","inline":true},{"style":{"height":17.38},"width":152.03,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-30.png","element":"img","alt":"¯vTφv(x0)","inline":true},{"text":", we obtain","element":"span"},{"text":"6","element":"span"}],[{"style":{"width":"69%"},"width":1252,"height":432,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-31.png","element":"img"}],[{"text":"The second equality above follows from a Taylor expansion of ","element":"span"},{"style":{"height":18.83},"width":74.67,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-32.png","element":"img","alt":"ˆV (·)","inline":true,"padRight":true},{"text":"around ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-33.png","element":"img","alt":" θ","inline":true},{"text":", whereas the third equality follows from the fact that the ","element":"span"},{"style":{"height":18.54},"width":68.94,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-34.png","element":"img","alt":" ∆(i)n","inline":true,"padRight":true},{"text":"’s are independent Rademacher random variables. 
Note that ","element":"span"},{"style":{"height":16},"width":75.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-35.png","element":"img","alt":" ξ(β)","inline":true,"padRight":true},{"text":"in the ","element":"span"},{"text":"second equality above can be seen to converge to zero as ","element":"span"},{"style":{"height":14.4},"width":106.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-36.png","element":"img","alt":" β → 0","inline":true},{"text":".","element":"span"}],[{"text":"On similar lines, letting ","element":"span"},{"style":{"height":17.38},"width":291.49,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-37.png","element":"img","alt":"�U(θ) = ¯uTφu(x0)","inline":true},{"text":", it can be seen that","element":"span"}],[{"style":{"width":"68%"},"width":1248,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/24-38.png","element":"img"}],[{"text":"Plugging the above in ","element":"span"},{"href":"#id-99","text":"(57)","element":"a"},{"text":", we obtain","element":"span"}],[{"style":{"width":"60%"},"width":1094,"height":208,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-0.png","element":"img"}],[{"text":"as ","element":"span"},{"style":{"height":14.4},"width":106.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-1.png","element":"img","alt":" β → 0","inline":true},{"text":". Thus, ","element":"span"},{"href":"#id-92","text":"(23) ","element":"a"},{"text":"can be seen to be a discretization of the ODE ","element":"span"},{"href":"#id-91","text":"(54)","element":"a"},{"text":". 
Further, ","element":"span"},{"style":{"height":13.19},"width":47.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-2.png","element":"img","alt":" Zλ","inline":true,"padRight":true},{"text":"is an asymptotically stable attractor for the ODE ","element":"span"},{"href":"#id-91","text":"(54)","element":"a"},{"text":", with ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-3.png","element":"img","alt":"�L(θ, λ)","inline":true,"padRight":true},{"text":"itself serving as a strict Lyapunov function. This can be inferred as follows:","element":"span"}],[{"style":{"width":"53%"},"width":973,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-4.png","element":"img"}],[{"text":"Define a linearly interpolated trajectory for the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-5.png","element":"img","alt":" θ","inline":true},{"text":"-recursion in ","element":"span"},{"href":"#id-92","text":"(23) ","element":"a"},{"text":"as follows: Let ","element":"span"},{"style":{"height":20.4},"width":409.32,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-6.png","element":"img","alt":" s(n) = Σ_{i=0}^{n−1} ζ2(i). 
¯θt","inline":true,"padRight":true},{"text":"is a piecewise linear interpolation defined according to ","element":"span"},{"style":{"height":19.48},"width":185.83,"height":48.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-7.png","element":"img","alt":"¯θt(n) = θn","inline":true,"padRight":true},{"text":"with linear interpolation on ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", s","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"+ 1)]","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Now, using standard stochastic approximation arguments (cf. ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16, ","element":"a"},{"text":"Theorem 5.12]), ","element":"span"},{"style":{"height":16.2},"width":30.71,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-8.png","element":"img","alt":"¯θt","inline":true,"padRight":true},{"text":"can be seen to be a ","element":"span"},{"style":{"height":16},"width":97.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-9.png","element":"img","alt":" (T, η)","inline":true},{"text":"-perturbation of the ODE ","element":"span"},{"href":"#id-91","text":"(54)","element":"a"},{"text":". The claim now follows from the Hirsch lemma. 
","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-10.png","element":"img","alt":"■","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Step 3: (Analysis of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-11.png","element":"img","alt":" λ","inline":true},{"style":{"fontWeight":"bold"},"text":"-recursion and Convergence to a Local Saddle Point) ","element":"span"},{"text":"We first show that the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-12.png","element":"img","alt":" λ","inline":true},{"text":"-recursion converges and then prove that the whole algorithm converges to a local saddle point of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-13.png","element":"img","alt":"�L(θ, λ)","inline":true},{"text":". We define the following ODE governing the evolution of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-14.png","element":"img","alt":" λ","inline":true},{"text":":","element":"span"}],[{"style":{"width":"79%"},"width":1440,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.38},"width":49.72,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-16.png","element":"img","alt":" θλt","inline":true,"padRight":true},{"text":"is the limiting point of the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-17.png","element":"img","alt":" θ","inline":true},{"text":"-recursion corresponding to 
","element":"span"},{"style":{"height":13.19},"width":35.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-18.png","element":"img","alt":" λt","inline":true},{"text":". Further, ","element":"span"},{"style":{"height":16.82},"width":43.91,"height":42.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-19.png","element":"img","alt":"ˇΓλ","inline":true,"padRight":true},{"text":"is an operator similar to the operator ","element":"span"},{"style":{"height":14.43},"width":25,"height":36.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-20.png","element":"img","alt":"ˇΓ","inline":true,"padRight":true},{"text":"defined in ","element":"span"},{"href":"#id-100","text":"(55) ","element":"a"},{"text":"and is defined as follows: For any bounded continuous function ","element":"span"},{"style":{"height":16},"width":66.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-21.png","element":"img","alt":" f(·)","inline":true},{"text":",","element":"span"}],[{"id":"id-103","style":{"width":"67%"},"width":1221,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-22.png","element":"img"}],[{"id":"id-75","style":{"fontWeight":"bold"},"text":"Theorem 9. 
","element":"span"},{"style":{"height":13.59},"width":146.03,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-23.png","element":"img","alt":" λn → F","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"almost surely as ","element":"span"},{"style":{"height":8.8},"width":131.01,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-24.png","element":"img","alt":" n → ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":21.68},"width":910.04,"height":54.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-25.png","element":"img","alt":" F△=�λ | λ ∈ [0, λmax], ˇΓλ��Λθλ(x0) − α�= 0, θλ ∈","inline":true},{"style":{"height":19.2},"width":72.78,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-26.png","element":"img","alt":"Zλ�","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof follows using standard stochastic approximation arguments. The first step is to rewrite the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-27.png","element":"img","alt":" λ","inline":true},{"text":"-recursion as follows:","element":"span"}],[{"style":{"width":"58%"},"width":1063,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-28.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":28.8},"width":1092.84,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-29.png","element":"img","alt":" ξ2,n :=�uTnφu(x0) −�vTnφv(x0)�2�−�¯uTφu(x0) −�¯vTφv(x0)�2�","inline":true},{"text":". 
Note that the converged critic parameters ","element":"span"},{"style":{"height":9.6},"width":21.52,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-30.png","element":"img","alt":" ¯v","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.6},"width":23,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-31.png","element":"img","alt":" ¯u","inline":true,"padRight":true},{"text":"correspond to the policy ","element":"span"},{"style":{"height":13.38},"width":56.72,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-32.png","element":"img","alt":" θλn","inline":true},{"text":". The latter is a limiting point of the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-33.png","element":"img","alt":" θ","inline":true},{"text":"-recursion, with the Lagrange multiplier ","element":"span"},{"style":{"height":13.19},"width":43.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-34.png","element":"img","alt":"λn","inline":true},{"text":". Owing to the convergence of the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-35.png","element":"img","alt":" θ","inline":true},{"text":"-recursion and of the TD-critic in the inner loop, we can conclude that ","element":"span"},{"style":{"height":16.79},"width":191.32,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-36.png","element":"img","alt":" ξ2,n = o(1)","inline":true},{"text":". 
Thus, ","element":"span"},{"style":{"height":15.59},"width":62.78,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-37.png","element":"img","alt":" ξ2,n","inline":true,"padRight":true},{"text":"adds an asymptotically vanishing bias term to the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-38.png","element":"img","alt":" λ","inline":true},{"text":"-recursion above. The claim follows by applying the standard result in Theorem 2 of ","element":"span"},{"href":"#id-58","referenceIndex":20,"text":"[20] ","element":"a"},{"text":"for convergence of stochastic approximation schemes. ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-39.png","element":"img","alt":"■","inline":true}],[{"text":"Recall that ","element":"span"},{"style":{"height":17.38},"width":619.95,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-40.png","element":"img","alt":"�L(θ, λ)△= −�V θ(x0) + λ(�Λθ(x0) − α)","inline":true,"padRight":true},{"text":"and hence ","element":"span"},{"style":{"height":17.39},"width":418.18,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-41.png","element":"img","alt":" ∇λ�L(θ, λ) = �Λθ(x0) − α","inline":true},{"text":". 
Thus,","element":"span"}],[{"style":{"width":"60%"},"width":1091,"height":189,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/25-42.png","element":"img"}],[{"text":"As in ","element":"span"},{"href":"#id-101","referenceIndex":19,"text":"[19]","element":"a"},{"text":", we invoke the envelope theorem of mathematical economics ","element":"span"},{"href":"#id-102","referenceIndex":39,"text":"[39] ","element":"a"},{"text":"to conclude that the ODE ","element":"span"},{"href":"#id-103","text":"(58) ","element":"a"},{"text":"is equivalent to the following","element":"span"}],[{"id":"id-105","style":{"width":"60%"},"width":1104,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-0.png","element":"img"}],[{"text":"Note that the above has to be interpreted in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Carathéodory ","element":"span"},{"text":"sense, i.e., as the following integral equation","element":"span"}],[{"style":{"width":"32%"},"width":592,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-1.png","element":"img"}],[{"text":"As noted in Lemma 4.3 of ","element":"span"},{"href":"#id-101","referenceIndex":19,"text":"[19]","element":"a"},{"text":", using the generalized envelope theorem from ","element":"span"},{"href":"#id-104","referenceIndex":41,"text":"[41] ","element":"a"},{"text":"it can be shown that the RHS of ","element":"span"},{"href":"#id-105","text":"(60) ","element":"a"},{"text":"coincides with that of ","element":"span"},{"href":"#id-103","text":"(58) ","element":"a"},{"text":"at differentiable points, while the ODE spends zero time at non-differentiable points (except at the points of maxima).","element":"span"}],[{"text":"We next claim that the limit ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-2.png","element":"img","alt":" 
θλ∗","inline":true,"padRight":true},{"text":"corresponding to ","element":"span"},{"style":{"height":10.98},"width":39.25,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-3.png","element":"img","alt":" λ∗","inline":true,"padRight":true},{"text":"satisfies the variance constraint in ","element":"span"},{"href":"#id-53","text":"(3)","element":"a"},{"text":", i.e.,","element":"span"}],[{"id":"id-76","style":{"fontWeight":"bold"},"text":"Proposition 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":10.99},"width":39.25,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-4.png","element":"img","alt":" λ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"style":{"height":21.68},"width":1008.25,"height":54.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-5.png","element":"img","alt":"ˆF△=�λ | λ ∈ [0, λmax), ˇΓλ��Λθλ(x0) − α�= 0, θλ ∈ Zλ�","inline":true},{"style":{"fontStyle":"italic"},"text":", the corresponding limiting point ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-6.png","element":"img","alt":" θλ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfies the variance constraint ","element":"span"},{"style":{"height":21.81},"width":231.88,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-7.png","element":"img","alt":"�Λθλ∗(x0) ≤ α","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Follows in a similar manner as Proposition 10.6 in ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16]","element":"a"},{"text":".","element":"span"}],[{"text":"From Theorems ","element":"span"},{"href":"#id-74","text":"7–","element":"a"},{"href":"#id-75","text":"9 ","element":"a"},{"text":"and Proposition ","element":"span"},{"href":"#id-76","text":"1, ","element":"a"},{"text":"it is evident that the actor recursion ","element":"span"},{"href":"#id-92","text":"(23) ","element":"a"},{"text":"converges to a tuple ","element":"span"},{"style":{"height":18.6},"width":148.14,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-8.png","element":"img","alt":" (θλ∗, λ∗)","inline":true,"padRight":true},{"text":"that is a local minimum w.r.t. ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-9.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and a local maximum w.r.t. ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-10.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-11.png","element":"img","alt":"�L(θ, λ)","inline":true},{"text":". In other words, overall convergence is to a (local) saddle point of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-12.png","element":"img","alt":"�L(θ, λ)","inline":true},{"text":". 
Further, the limit is also feasible for the constrained problem in ","element":"span"},{"href":"#id-53","text":"(3) ","element":"a"},{"text":"as ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-13.png","element":"img","alt":" θλ∗","inline":true,"padRight":true},{"text":"satisfies the variance constraint there.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"7.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence of the First-Order Algorithm: RS-SF-G","element":"span"}],[{"text":"Since RS-SPSA-G and RS-SF-G differ only in the method used to estimate the gradient, their proofs differ only in the second step, i.e., the convergence of the policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-14.png","element":"img","alt":" θ","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-74","style":{"fontWeight":"bold"},"text":"7 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"for SF","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"As in the case of the SPSA algorithm, we rewrite the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-15.png","element":"img","alt":" θ","inline":true},{"text":"-update in ","element":"span"},{"href":"#id-92","text":"(24) ","element":"a"},{"text":"using the converged TD-parameters and constant ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-16.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"style":{"width":"95%"},"width":1724,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-17.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.59},"width":173.09,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-18.png","element":"img","alt":" ξ1,n → 0","inline":true,"padRight":true},{"text":"(convergence of TD in the critic and, as a result, convergence of the critic’s parameters to ","element":"span"},{"style":{"height":16.18},"width":191.71,"height":40.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-19.png","element":"img","alt":"¯v, ¯u, ¯v+, ¯u+","inline":true},{"text":") in light of Theorem ","element":"span"},{"href":"#id-70","text":"6. ","element":"a"},{"text":"Next, we establish that","element":"span"}],[{"style":{"width":"97%"},"width":1767,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-20.png","element":"img"}],[{"text":"The above follows in a similar manner as Proposition ","element":"span"},{"text":"10.2 ","element":"span"},{"text":"of Bhatnagar et al. ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16]","element":"a"},{"text":". 
On similar lines, one can see that","element":"span"}],[{"style":{"width":"45%"},"width":830,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-21.png","element":"img"}],[{"text":"Thus, ","element":"span"},{"href":"#id-92","text":"(24) ","element":"a"},{"text":"can be seen to be a discretization of the ODE ","element":"span"},{"href":"#id-91","text":"(54) ","element":"a"},{"text":"and the rest of the analysis follows in a similar manner as in the SPSA proof. ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/26-22.png","element":"img","alt":"■","inline":true}],[{"style":{"fontWeight":"bold"},"text":"7.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence of the Second-Order Algorithms: RS-SPSA-N and RS-SF-N","element":"span"}],[{"text":"Convergence analysis of the second-order algorithms involves the same steps as that of the first-order algorithms. In particular, the first step involving the TD-critic and the third step involving the analysis of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-0.png","element":"img","alt":" λ","inline":true},{"text":"-recursion follow along similar lines as earlier, whereas ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-1.png","element":"img","alt":" θ","inline":true},{"text":"-recursion analysis in the second step differs significantly.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 2: (Analysis of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-2.png","element":"img","alt":" θ","inline":true},{"style":{"fontWeight":"bold"},"text":"-recursion for RS-SPSA-N and RS-SF-N) ","element":"span"},{"text":"Since the policy 
parameter is updated in the descent direction with a Newton decrement, the limiting ODE of the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-3.png","element":"img","alt":" θ","inline":true},{"text":"-recursion for the second order algorithms is given by","element":"span"}],[{"id":"id-106","style":{"width":"68%"},"width":1234,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.43},"width":25,"height":36.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-5.png","element":"img","alt":"ˇΓ","inline":true,"padRight":true},{"text":"is as before (see ","element":"span"},{"href":"#id-100","text":"(55)","element":"a"},{"text":"). Let","element":"span"}],[{"style":{"width":"60%"},"width":1091,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-6.png","element":"img"}],[{"text":"denote the set of asymptotically stable equilibrium points of the ODE ","element":"span"},{"href":"#id-106","text":"(61) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":15.5},"width":47.88,"height":38.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-7.png","element":"img","alt":" Zελ","inline":true,"padRight":true},{"text":"its ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-8.png","element":"img","alt":" ε","inline":true},{"text":"-neighborhood. Then, we ","element":"span"},{"text":"have the following analogue of Theorem ","element":"span"},{"href":"#id-74","text":"7 ","element":"a"},{"text":"for the RS-SPSA-N and RS-SF-N algorithms:","element":"span"}],[{"id":"id-107","style":{"fontWeight":"bold"},"text":"Theorem 10. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under (A1)-(A5), for any given Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-9.png","element":"img","alt":" λ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":11.6},"width":93.94,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-10.png","element":"img","alt":" ε > 0","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":14.4},"width":115.78,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-11.png","element":"img","alt":" β0 > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for all ","element":"span"},{"style":{"height":16.52},"width":439.43,"height":41.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-12.png","element":"img","alt":" β ∈ (0, β0), θn → θ∗ ∈ Zελ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"almost surely.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-107","style":{"fontWeight":"bold"},"text":"10 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"for RS-SPSA-N","element":"span"}],[{"text":"Before we prove Theorem ","element":"span"},{"href":"#id-107","text":"10, ","element":"a"},{"text":"we establish that the Hessian estimate ","element":"span"},{"style":{"height":13.19},"width":53.13,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-13.png","element":"img","alt":" Hn","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-108","text":"(30) ","element":"a"},{"text":"converges almost surely to the true Hessian 
","element":"span"},{"style":{"height":17.9},"width":191.21,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-14.png","element":"img","alt":" ∇2θL(θn, λ)","inline":true,"padRight":true},{"text":"in the following lemma.","element":"span"}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"Lemma 11. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With ","element":"span"},{"style":{"height":14.4},"width":106.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-15.png","element":"img","alt":" β → 0","inline":true},{"style":{"fontStyle":"italic"},"text":", for all ","element":"span"},{"style":{"height":16},"width":288.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-16.png","element":"img","alt":" i, j ∈ {1, . . . , κ1}","inline":true},{"style":{"fontStyle":"italic"},"text":", we have the following claims with probability one:","element":"span"}],[{"style":{"width":"60%"},"width":1093,"height":361,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-17.png","element":"img"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"(iii)","element":"span"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"(iv)","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proofs of the above claims follow from Propositions 10.10, 10.11 and Lemmas 7.10 and 7.11 of ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16]","element":"a"},{"text":", respectively. ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-18.png","element":"img","alt":"■","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"(Theorem ","element":"span"},{"href":"#id-107","style":{"fontWeight":"bold"},"text":"10 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"for RS-SPSA-N) ","element":"span"},{"text":"As in the case of the first order methods, due to timescale separation, we can treat ","element":"span"},{"style":{"height":13.19},"width":131.87,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-19.png","element":"img","alt":" λn ≡ λ","inline":true},{"text":", a constant and use the converged TD-parameters to arrive at the following equivalent update rules for the Hessian recursion ","element":"span"},{"href":"#id-108","text":"(30) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-20.png","element":"img","alt":" θ","inline":true},{"text":"-recursion ","element":"span"},{"href":"#id-79","text":"(31)","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"94%"},"width":1720,"height":244,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/27-21.png","element":"img"}],[{"text":"In lieu of Lemma ","element":"span"},{"href":"#id-78","text":"11, ","element":"a"},{"text":"the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-0.png","element":"img","alt":" θ","inline":true},{"text":"-recursion above is equivalent to the following:","element":"span"}],[{"style":{"width":"74%"},"width":1354,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-1.png","element":"img"}],[{"text":"The above can be seen as a discretization of the ODE ","element":"span"},{"href":"#id-106","text":"(61)","element":"a"},{"text":", with 
","element":"span"},{"style":{"height":13.19},"width":47.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-2.png","element":"img","alt":" Zλ","inline":true,"padRight":true},{"text":"serving as its asymptotically stable attractor. The rest of the claim follows in a similar manner as Theorem ","element":"span"},{"href":"#id-74","text":"7. ","element":"a"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-3.png","element":"img","alt":"■","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-107","style":{"fontWeight":"bold"},"text":"10 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"for RS-SF-N","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We first establish the following result for the gradient and Hessian estimators employed in RS-SF-N:","element":"span"}],[{"id":"id-81","style":{"fontWeight":"bold"},"text":"Lemma 12. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With ","element":"span"},{"style":{"height":14.4},"width":106.63,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-4.png","element":"img","alt":" β → 0","inline":true},{"style":{"fontStyle":"italic"},"text":", we have the following claims with probability one:","element":"span"}],[{"style":{"width":"70%"},"width":1287,"height":273,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proofs of the above claims follow from Propositions 10.1 and 10.2 of ","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"[16]","element":"a"},{"text":", respectively. 
","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-6.png","element":"img","alt":"■","inline":true}],[{"text":"The rest of the analysis is identical to that of RS-SPSA-N. ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-7.png","element":"img","alt":"■","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Remark 13. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"On Convergence Rate.","element":"span"},{"style":{"fontStyle":"italic"},"text":") In the above, we established asymptotic limits for all our algorithms using the ODE approach. To the best of our knowledge, there are no convergence rate results available for multi-timescale stochastic approximation schemes, and hence, for actor-critic algorithms. This is true even for the actor-critic algorithms that do not incorporate any risk criterion. In ","element":"span"},{"href":"#id-109","referenceIndex":34,"style":{"fontStyle":"italic"},"text":"[34]","element":"a"},{"style":{"fontStyle":"italic"},"text":", the authors provide asymptotic convergence rate results for ","element":"span"},{"text":"linear ","element":"span"},{"style":{"fontStyle":"italic"},"text":"two-timescale recursions. 
It would be an interesting direction for future research to obtain concentration bounds for general (non-linear) two-timescale schemes.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"While a rigorous analysis of the convergence rate of our proposed schemes is difficult, one could make a few concessions and use the following argument to see that the SPSA-based algorithms converge quickly: In order to analyse the rate of convergence of the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-8.png","element":"img","alt":" θ","inline":true},{"style":{"fontStyle":"italic"},"text":"-recursion, assume (for sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":") that the TD-critic has converged in the inner loop. This is justified because the trajectory lengths ","element":"span"},{"style":{"height":10.79},"width":166.52,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-9.png","element":"img","alt":" mn → ∞","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"as ","element":"span"},{"style":{"height":8.8},"width":133.8,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-10.png","element":"img","alt":" n → ∞","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and under appropriate step-size settings (or with iterate averaging) one can obtain a convergence rate of the order ","element":"span"},{"style":{"height":16.28},"width":166.62,"height":40.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-11.png","element":"img","alt":" O (1/√n)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on the root mean square error of TD (see 
","element":"span"},{"href":"#id-110","referenceIndex":35,"style":{"fontStyle":"italic"},"text":"[35]","element":"a"},{"style":{"fontStyle":"italic"},"text":"). Now, if one holds ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-12.png","element":"img","alt":" λ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"fixed, then invoking asymptotic normality results for SPSA (see Proposition 2 in ","element":"span"},{"href":"#id-34","referenceIndex":52,"style":{"fontStyle":"italic"},"text":"[52]","element":"a"},{"style":{"fontStyle":"italic"},"text":") it can be shown that ","element":"span"},{"style":{"height":18.19},"width":221.06,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-13.png","element":"img","alt":"n1/3(θn−θλ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is asymptotically normal, where ","element":"span"},{"style":{"height":13.39},"width":38.82,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-14.png","element":"img","alt":" θλ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a limit point in the set ","element":"span"},{"style":{"height":13.19},"width":47.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-15.png","element":"img","alt":" Zλ","inline":true},{"style":{"fontStyle":"italic"},"text":". Similar results also hold for second-order SPSA variants (cf. Theorem 3a in ","element":"span"},{"href":"#id-39","referenceIndex":54,"style":{"fontStyle":"italic"},"text":"[54]","element":"a"},{"style":{"fontStyle":"italic"},"text":"). 
Both the aforementioned claims are proved using a well-known result on asymptotic normality of stochastic approximation schemes due to Fabian ","element":"span"},{"href":"#id-111","referenceIndex":26,"style":{"fontStyle":"italic"},"text":"[26]","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Second-order schemes such as RS-SPSA-N have an advantage over their first-order counterpart RS-SPSA-G from an asymptotic normality perspective. This is because obtaining the optimal convergence rate for RS-SPSA-G requires that the step-size ","element":"span"},{"style":{"height":16},"width":90.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-16.png","element":"img","alt":" ζ2(n)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is set to ","element":"span"},{"style":{"height":16},"width":130.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-17.png","element":"img","alt":" ζ2(0)/n","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":17.9},"width":493.78,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-18.png","element":"img","alt":" ζ2(0) > 1/λmin(∇2θL(θλ, λ))","inline":true},{"style":{"fontStyle":"italic"},"text":", whereas there is no ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such constraint for the second-order algorithm RS-SPSA-N. Here ","element":"span"},{"style":{"height":16},"width":139.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-19.png","element":"img","alt":" λmin(A)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denotes the minimum eigenvalue of the matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
The reader is referred to ","element":"span"},{"href":"#id-112","referenceIndex":25,"style":{"fontStyle":"italic"},"text":"[25] ","element":"a"},{"style":{"fontStyle":"italic"},"text":"for a detailed discussion of the convergence rate of (one-timescale) SPSA-based schemes using asymptotic mean-square error.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 14. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Unstable Equilibria.","element":"span"},{"style":{"fontStyle":"italic"},"text":") The limit set ","element":"span"},{"style":{"height":13.19},"width":47.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-20.png","element":"img","alt":" Zλ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"contains both stable and unstable equilibria and the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-21.png","element":"img","alt":" θ","inline":true},{"style":{"fontStyle":"italic"},"text":"-recursion can possibly end up at an unstable equilibrium point. One may avoid this situation by including additional noise in the randomized policy that drives the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-22.png","element":"img","alt":" θ","inline":true},{"style":{"fontStyle":"italic"},"text":"-recursion. 
For instance, define an ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-23.png","element":"img","alt":" η","inline":true},{"style":{"fontStyle":"italic"},"text":"-offset policy as","element":"span"}],[{"style":{"width":"31%"},"width":571,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/28-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"The above policy can be used in place of the regular ","element":"span"},{"style":{"height":16},"width":122.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-0.png","element":"img","alt":" µ(· | x)","inline":true},{"style":{"fontStyle":"italic"},"text":", so that the algorithm is pulled away from unstable equilibria. Providing theoretical guarantees for such a scheme is non-trivial, and we leave it for future work.","element":"span"}]]},{"heading":"8 Convergence Analysis of the Average Reward Risk-Sensitive Actor-Critic","paragraphs":[[{"style":{"width":"14%"},"width":254,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-1.png","element":"img"}],[{"text":"As in the discounted setting, we use the ODE approach ","element":"span"},{"href":"#id-58","referenceIndex":20,"text":"[20] ","element":"a"},{"text":"to analyze the convergence of our average reward risk-sensitive actor-critic algorithm. The proof involves three main steps:","element":"span"}],[{"text":"1. 
The first step is the convergence of ","element":"span"},{"style":{"height":14.4},"width":116.14,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-2.png","element":"img","alt":" ρ, η, V","inline":true,"padRight":true},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":", for any fixed policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-3.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-4.png","element":"img","alt":" λ","inline":true},{"text":". This corresponds to a TD(0) (with extension to ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-5.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":") proof. 
Using arguments similar to that in Step 2 of the proof of RS-SPSA-G, one can show that the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-6.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-7.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"recursions track ","element":"span"},{"style":{"height":17.4},"width":105.89,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-8.png","element":"img","alt":"˙θt = 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.4},"width":110.43,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-9.png","element":"img","alt":"˙λt = 0","inline":true},{"text":", when viewed from the TD critic timescale ","element":"span"},{"style":{"height":16},"width":120.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-10.png","element":"img","alt":" {ζ3(t)}","inline":true},{"text":". 
Thus, the policy ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-11.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and Lagrange multiplier ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-12.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"are assumed to be constant in the analysis of the critic recursion.","element":"span"}],[{"id":"id-113","style":{"width":"97%"},"width":1762,"height":139,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-13.png","element":"img"}],[{"text":"where the projection operator ","element":"span"},{"style":{"height":14.43},"width":25,"height":36.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-14.png","element":"img","alt":"ˇΓ","inline":true,"padRight":true},{"text":"ensures that the evolution of ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-15.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"via the ODE ","element":"span"},{"href":"#id-113","text":"(63) ","element":"a"},{"text":"stays within the compact and convex set ","element":"span"},{"style":{"height":11.79},"width":145.85,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-16.png","element":"img","alt":" Θ ⊂ Rκ1","inline":true,"padRight":true},{"text":"and is defined in ","element":"span"},{"href":"#id-100","text":"(55)","element":"a"},{"text":". 
Here again, it is assumed that ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-17.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"is fixed because the ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-18.png","element":"img","alt":" θ","inline":true},{"text":"-recursion is on a faster timescale than ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-19.png","element":"img","alt":" λ","inline":true},{"text":"’s.","element":"span"}],[{"text":"3. The final step is to establish the convergence of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-20.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"and to show that the whole algorithm converges to a local saddle point of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-21.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":", where the limit is shown to satisfy the variance constraint in ","element":"span"},{"href":"#id-84","text":"(40)","element":"a"},{"text":".","element":"span"}],[{"id":"id-115","style":{"fontWeight":"bold"},"text":"Step 1: Critic’s Convergence","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 13. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any given policy ","element":"span"},{"style":{"height":16},"width":350.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-22.png","element":"img","alt":" µ, {�ρn}, {�ηn}, {vn}","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":16},"width":84.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-23.png","element":"img","alt":" {un}","inline":true},{"style":{"fontStyle":"italic"},"text":", defined in Algorithm ","element":"span"},{"href":"#id-88","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and by the critic recursion ","element":"span"},{"href":"#id-88","text":"(46) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"converge to ","element":"span"},{"style":{"height":16},"width":231.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-24.png","element":"img","alt":" ρ(µ), η(µ), vµ","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":10.58},"width":41.81,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-25.png","element":"img","alt":" uµ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"almost surely, where ","element":"span"},{"style":{"height":10.58},"width":39.74,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-26.png","element":"img","alt":" vµ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":10.58},"width":41.81,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-27.png","element":"img","alt":" uµ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are the unique solutions 
to","element":"span"}],[{"style":{"width":"83%"},"width":1513,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-28.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"respectively. In ","element":"span"},{"href":"#id-114","text":"(64)","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":11.8},"width":57.66,"height":29.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-29.png","element":"img","alt":" Dµ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denotes the diagonal matrix with entries ","element":"span"},{"style":{"height":16},"width":96.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-30.png","element":"img","alt":" dµ(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":11.6},"width":110.65,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-31.png","element":"img","alt":" x ∈ X","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":14.74},"width":47.82,"height":36.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-32.png","element":"img","alt":" T µv","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.74},"width":47.82,"height":36.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-33.png","element":"img","alt":" T µu","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bellman operators for the differential value and square value functions of policy ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-34.png","element":"img","alt":" 
µ","inline":true},{"style":{"fontStyle":"italic"},"text":", defined as","element":"span"}],[{"id":"id-114","style":{"width":"81%"},"width":1473,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-35.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":10.59},"width":41.34,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-36.png","element":"img","alt":" rµ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":11.8},"width":54.18,"height":29.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-37.png","element":"img","alt":" P µ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are the reward vector and transition probability matrix of policy ","element":"span"},{"style":{"height":16},"width":305,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-38.png","element":"img","alt":" µ, Rµ = diag(rµ)","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"e ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a vector of size ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(the size of the state space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":") with elements all equal to one.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof follows in a similar manner as Lemma 5 in ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":". 
","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-39.png","element":"img","alt":"■","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Step 2: Actor’s Convergence","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":19.92},"width":645.92,"height":49.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-40.png","element":"img","alt":" Zλ =�θ ∈ C : ˇΓ�− ∇L(θ, λ)�= 0�","inline":true},{"text":"denote the set of asymptotically stable equilibrium points of the ODE ","element":"span"},{"href":"#id-113","text":"(63) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":19.2},"width":653.05,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-41.png","element":"img","alt":" Zελ =�θ ∈ C : ||θ − θ0|| < ε, θ0 ∈ Zλ�","inline":true},{"text":"denote the set of points in the ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-42.png","element":"img","alt":" ε","inline":true},{"text":"-neighborhood of ","element":"span"},{"style":{"height":13.19},"width":47.88,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-43.png","element":"img","alt":" Zλ","inline":true},{"text":". The main result regarding the convergence of the policy parameter in ","element":"span"},{"href":"#id-88","text":"(47) ","element":"a"},{"text":"is as follows:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 14. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (A1)-(A4). 
Then, for a given ","element":"span"},{"style":{"height":14.4},"width":269.27,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-44.png","element":"img","alt":" ε > 0, ∃β > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that if ","element":"span"},{"style":{"height":16},"width":331.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-45.png","element":"img","alt":" supθ ∥B(θ, λ)∥ < β","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":13.19},"width":38.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-46.png","element":"img","alt":" θn","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"governed by ","element":"span"},{"href":"#id-88","text":"(47) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"converges almost surely to ","element":"span"},{"style":{"height":15.5},"width":47.88,"height":38.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-47.png","element":"img","alt":" Zελ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"as ","element":"span"},{"style":{"height":8.8},"width":125.92,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/29-48.png","element":"img","alt":" n → ∞","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":375.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-0.png","element":"img","alt":" F(n) = σ(θm, m ≤ n)","inline":true,"padRight":true},{"text":"denote a sequence of ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-1.png","element":"img","alt":" σ","inline":true},{"text":"-fields. 
We have","element":"span"}],[{"style":{"width":"76%"},"width":1382,"height":623,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-2.png","element":"img"}],[{"text":"By setting ","element":"span"},{"style":{"height":16},"width":316.11,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-3.png","element":"img","alt":" ξn = ˆρn+1 − ρ(θn)","inline":true},{"text":", we may write the above equation as","element":"span"}],[{"id":"id-116","style":{"width":"84%"},"width":1524,"height":655,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-4.png","element":"img"}],[{"text":"Since Algorithm ","element":"span"},{"href":"#id-88","text":"2 ","element":"a"},{"text":"uses an unbiased estimator for ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-5.png","element":"img","alt":" ρ","inline":true},{"text":", we have ","element":"span"},{"style":{"height":16},"width":243.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-6.png","element":"img","alt":" ˆρn+1 → ρ(θn)","inline":true},{"text":", and thus, ","element":"span"},{"style":{"height":14},"width":127.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-7.png","element":"img","alt":" ξn → 0","inline":true},{"text":". The terms ","element":"span"},{"text":"(+) ","element":"span"},{"text":"asymptotically vanish in view of Lemma ","element":"span"},{"href":"#id-115","text":"13 ","element":"a"},{"text":"(Critic convergence). Finally, the terms ","element":"span"},{"style":{"height":16},"width":51.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-8.png","element":"img","alt":" (∗)","inline":true,"padRight":true},{"text":"can be seen to vanish using standard martingale arguments (cf. 
Theorem 2 in ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":"). Thus, ","element":"span"},{"href":"#id-116","text":"(66) ","element":"a"},{"text":"can be seen to be equivalent in an asymptotic sense to","element":"span"}],[{"id":"id-117","style":{"width":"100%"},"width":1816,"height":234,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-9.png","element":"img"}],[{"text":"Note that the bias of Algorithm ","element":"span"},{"href":"#id-88","text":"2 ","element":"a"},{"text":"in estimating ","element":"span"},{"style":{"height":16},"width":152.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-10.png","element":"img","alt":" ∇L(θ, λ)","inline":true,"padRight":true},{"text":"is (see Lemma ","element":"span"},{"href":"#id-46","text":"5)","element":"a"}],[{"style":{"width":"85%"},"width":1553,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-11.png","element":"img"}],[{"text":"So, if the bias ","element":"span"},{"style":{"height":16},"width":329.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-12.png","element":"img","alt":" supθ ∥B(θ, λ)∥ → 0","inline":true},{"text":", the trajectories of ","element":"span"},{"href":"#id-117","text":"(69) ","element":"a"},{"text":"converge to those of ","element":"span"},{"href":"#id-91","text":"(54) ","element":"a"},{"text":"uniformly on compacts for the same initial condition, and the claim follows. 
","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-13.png","element":"img","alt":"■","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Step 3: ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/30-14.png","element":"img","alt":" λ","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Convergence and Overall Convergence of the Algorithm","element":"span"}],[{"style":{"width":"47%"},"width":864,"height":576,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-0.png","element":"img"}],[{"text":"Figure 2: The 2x2-grid network used in our traffic signal control experiments.","element":"figcaption","subtype":"caption"}],[{"text":"As in the discounted setting, we first show that the ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-1.png","element":"img","alt":" λ","inline":true},{"text":"-recursion converges and then prove convergence to a local saddle point of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-2.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":". 
Consider the ODE","element":"span"}],[{"id":"id-119","style":{"width":"60%"},"width":1089,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.82},"width":43.91,"height":42.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-4.png","element":"img","alt":"ˇΓλ","inline":true,"padRight":true},{"text":"is a projection operator that forces the evolution of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-5.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"via ","element":"span"},{"href":"#id-103","text":"(58) ","element":"a"},{"text":"to stay within ","element":"span"},{"style":{"height":16},"width":143.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-6.png","element":"img","alt":" [0, λmax]","inline":true,"padRight":true},{"text":"and is defined in ","element":"span"},{"href":"#id-103","text":"(59)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 15. 
","element":"span"},{"style":{"height":13.59},"width":140.93,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-7.png","element":"img","alt":" λn → F","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"almost surely as ","element":"span"},{"style":{"height":10.4},"width":116.39,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-8.png","element":"img","alt":" t → ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":19.92},"width":914.82,"height":49.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-9.png","element":"img","alt":" F△=�λ | λ ∈ [0, λmax], ˇΓλ�Λ(θλ) − α�= 0, θλ ∈ Zλ�","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof follows in a similar manner as that of Theorem 3 in ","element":"span"},{"href":"#id-118","referenceIndex":11,"text":"[11]","element":"a"},{"text":". ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-10.png","element":"img","alt":"■","inline":true}],[{"text":"As in the discounted setting, the following proposition claims that the limit ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-11.png","element":"img","alt":" θλ∗","inline":true,"padRight":true},{"text":"corresponding to ","element":"span"},{"style":{"height":10.99},"width":39.25,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-12.png","element":"img","alt":" λ∗","inline":true,"padRight":true},{"text":"satisfies the variance constraint in ","element":"span"},{"href":"#id-84","text":"(40)","element":"a"},{"text":", i.e.,","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":10.98},"width":39.25,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-13.png","element":"img","alt":" λ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"style":{"height":21.68},"width":1008.25,"height":54.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-14.png","element":"img","alt":"ˆF△=�λ | λ ∈ [0, λmax), ˇΓλ�Λθλ(x0) − α�= 0, θλ ∈ Zλ�","inline":true},{"style":{"fontStyle":"italic"},"text":", the corresponding limiting point ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-15.png","element":"img","alt":" θλ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfies the variance constraint ","element":"span"},{"style":{"height":21.81},"width":231.88,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-16.png","element":"img","alt":" Λθλ∗(x0) ≤ α","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Using arguments similar to those used to prove convergence of RS-SPSA-G, it can be shown that the ODE ","element":"span"},{"href":"#id-119","text":"(70) ","element":"a"},{"text":"is equivalent to ","element":"span"},{"style":{"height":20.49},"width":399.89,"height":51.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-17.png","element":"img","alt":"˙λt = ˇΓλ�∇λL(θλt, λt)�","inline":true},{"text":"and thus, the actor parameters ","element":"span"},{"style":{"height":16},"width":134.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-18.png","element":"img","alt":" (θn, λn)","inline":true,"padRight":true},{"text":"updated according to ","element":"span"},{"href":"#id-88","text":"(47) 
","element":"a"},{"text":"converge to a (local) saddle point ","element":"span"},{"style":{"height":18.6},"width":148.14,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-19.png","element":"img","alt":" (θλ∗, λ∗)","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":16},"width":119.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-20.png","element":"img","alt":" L(θ, λ)","inline":true},{"text":". Moreover, the limiting point ","element":"span"},{"style":{"height":14.6},"width":53.72,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-21.png","element":"img","alt":" θλ∗","inline":true,"padRight":true},{"text":"satisfies the variance constraint in ","element":"span"},{"href":"#id-84","text":"(40)","element":"a"},{"text":".","element":"span"}]]},{"heading":"9 Experimental Results","paragraphs":[[{"text":"We evaluate our algorithms in the context of a traffic signal control application. The objective in our formulation is to minimize the total number of vehicles in the system, which indirectly minimizes the delay experienced by the system. The motivation behind using a risk-sensitive control strategy is to reduce the variations in the delay experienced by road users.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"9.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Implementation","element":"span"}],[{"text":"We consider both infinite horizon discounted and average settings for the traffic signal control MDP, formulated as in ","element":"span"},{"href":"#id-120","referenceIndex":44,"text":"[44]","element":"a"},{"text":". 
We briefly recall their formulation here: The state at each time ","element":"span"},{"style":{"height":12.4},"width":79.36,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-22.png","element":"img","alt":" t, xn","inline":true},{"text":", is the vector of queue lengths and elapsed times and is given by ","element":"span"},{"style":{"height":16},"width":714.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-23.png","element":"img","alt":" xn = (q1(n), . . . , qN(n), t1(n), . . . , tN(n))","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is the number of signalled lanes in the road network considered. Here ","element":"span"},{"style":{"height":10},"width":28.79,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-24.png","element":"img","alt":" qi","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.39},"width":25.38,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-25.png","element":"img","alt":" ti","inline":true,"padRight":true},{"text":"denote the queue length and elapsed time since the signal turned to red on lane ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". The actions ","element":"span"},{"style":{"height":9.19},"width":41.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-26.png","element":"img","alt":" an","inline":true,"padRight":true},{"text":"belong to the set of feasible sign configurations. 
The single-stage cost function ","element":"span"},{"style":{"height":16},"width":98.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/31-27.png","element":"img","alt":" h(xn)","inline":true,"padRight":true},{"text":"is defined as follows:","element":"span"}],[{"style":{"width":"89%"},"width":1623,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14},"width":154.04,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-1.png","element":"img","alt":" ri, si ≥ 0","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":13.19},"width":180.94,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-2.png","element":"img","alt":" ri + si = 1","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":11.19},"width":123.67,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-3.png","element":"img","alt":" r2 > s2","inline":true},{"text":". The set ","element":"span"},{"style":{"height":15.59},"width":34.52,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-4.png","element":"img","alt":" Ip","inline":true,"padRight":true},{"text":"is the set of prioritized lanes in the road network considered. 
While the weights ","element":"span"},{"style":{"height":10},"width":88.25,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-5.png","element":"img","alt":" r1, s1","inline":true,"padRight":true},{"text":"are used to differentiate between the queue length and elapsed time factors, the weights ","element":"span"},{"style":{"height":10},"width":88.24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-6.png","element":"img","alt":" r2, s2","inline":true,"padRight":true},{"text":"help in prioritization of traffic.","element":"span"}],[{"text":"Given the above traffic control setting, we aim to minimize both the long run discounted and average sum of the cost function ","element":"span"},{"style":{"height":16},"width":98.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-7.png","element":"img","alt":" h(xn)","inline":true},{"text":". We implement the following algorithms using the Green Light District (GLD) simulator ","element":"span"},{"href":"#id-121","referenceIndex":66,"text":"[66]","element":"a"},{"text":"7","element":"span"},{"text":":","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Discounted Setting","element":"span"}],[{"text":"1. 
","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"SPSA-G","element":"span"},{"text":": This is a first-order risk-neutral algorithm with SPSA-based gradient estimates that updates the parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-8.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"id":"id-123","style":{"width":"42%"},"width":767,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-9.png","element":"img"}],[{"text":"where the critic parameters ","element":"span"},{"style":{"height":16.92},"width":104.47,"height":42.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-10.png","element":"img","alt":" vn, v+n","inline":true,"padRight":true},{"text":"are updated according to ","element":"span"},{"href":"#id-69","text":"(17)","element":"a"},{"text":". Note that this is a two-timescale algorithm ","element":"span"},{"text":"with a TD critic on the faster timescale and the actor on the slower timescale. Unlike RS-SPSA-G, this algorithm, being risk-neutral, does not involve the Lagrange multiplier recursion.","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"SF-G","element":"span"},{"text":": This is a first-order risk-neutral algorithm that is similar to SPSA-G, except that the gradient estimation scheme used here is based on the smoothed functional (SF) technique. The update of the policy parameter in this algorithm is given by","element":"span"}],[{"style":{"width":"48%"},"width":879,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-11.png","element":"img"}],[{"text":"3. 
","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"SPSA-N","element":"span"},{"text":": This is a risk-neutral algorithm and is the second-order counterpart of SPSA-G. The Hessian update in this algorithm is as follows: For ","element":"span"},{"style":{"height":14},"width":356.42,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-12.png","element":"img","alt":" i, j = 1, . . . , κ1, i < j","inline":true},{"text":", the update is","element":"span"}],[{"style":{"width":"71%"},"width":1304,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-13.png","element":"img"}],[{"text":"and for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i > j","element":"span"},{"text":", we set ","element":"span"},{"style":{"height":22.53},"width":250.84,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-14.png","element":"img","alt":" H(i,j)n+1 = H(j,i)n+1","inline":true},{"text":". As in RS-SPSA-N, let ","element":"span"},{"style":{"height":17.32},"width":195.03,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-15.png","element":"img","alt":" Mn△= H−1n","inline":true,"padRight":true},{"text":", where ","element":"span"},{"style":{"height":23.52},"width":378.3,"height":58.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-16.png","element":"img","alt":" Hn = Υ�[H(i,j)n ]|κ1|i,j=1�","inline":true},{"text":". The actor updates the parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-17.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"94%"},"width":1712,"height":192,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-18.png","element":"img"}],[{"text":"4. 
","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"SF-N","element":"span"},{"text":": This is a risk-neutral algorithm and is the second-order counterpart of SF-G. It updates the Hessian and the actor as follows: For ","element":"span"},{"style":{"height":14},"width":402.86,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-19.png","element":"img","alt":" i, j, k = 1, . . . , κ1, j < k","inline":true},{"text":", the Hessian update is","element":"span"}],[{"style":{"width":"88%"},"width":1607,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/32-20.png","element":"img"}],[{"text":"and for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j > k","element":"span"},{"text":", we set ","element":"span"},{"style":{"height":22.53},"width":264.1,"height":56.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-0.png","element":"img","alt":" H(j,k)n+1 = H(k,j)n+1","inline":true,"padRight":true},{"text":". As before, let ","element":"span"},{"style":{"height":17.32},"width":195.6,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-1.png","element":"img","alt":" Mn△= H−1n","inline":true,"padRight":true},{"text":", with ","element":"span"},{"style":{"height":13.19},"width":53.13,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-2.png","element":"img","alt":" Hn","inline":true,"padRight":true},{"text":"formed as in SPSA-N. 
Then, ","element":"span"},{"text":"the actor update for the parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-3.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is as follows:","element":"span"}],[{"style":{"width":"79%"},"width":1440,"height":203,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-4.png","element":"img"}],[{"text":"5. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"RS-SPSA-G","element":"span"},{"text":": This is the first-order risk-sensitive actor-critic algorithm that attempts to solve ","element":"span"},{"href":"#id-84","text":"(40) ","element":"a"},{"text":"and updates according to ","element":"span"},{"href":"#id-92","text":"(23)","element":"a"},{"text":".","element":"span"}],[{"text":"6. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"RS-SF-G","element":"span"},{"text":": This is a first-order algorithm and the risk-sensitive variant of SF-G that updates the actor according to ","element":"span"},{"href":"#id-92","text":"(24)","element":"a"},{"text":".","element":"span"}],[{"text":"7. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"RS-SPSA-N","element":"span"},{"text":": This is a second-order risk-sensitive algorithm that estimates gradient and Hessian using SPSA and updates them according to ","element":"span"},{"href":"#id-79","text":"(31)","element":"a"},{"text":".","element":"span"}],[{"text":"8. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"RS-SF-N","element":"span"},{"text":": This second-order risk-sensitive algorithm is the SF counterpart of RS-SPSA-N, and updates according to ","element":"span"},{"href":"#id-83","text":"(36)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Average Setting","element":"span"}],[{"text":"1. 
","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"AC","element":"span"},{"text":": This is an actor-critic algorithm that minimizes the long-run average sum of the single-stage cost function ","element":"span"},{"style":{"height":16},"width":98.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-5.png","element":"img","alt":" h(xn)","inline":true},{"text":", without considering any risk criteria. This is similar to Algorithm 1 in Bhatnagar et al. ","element":"span"},{"href":"#id-11","referenceIndex":14,"text":"[14]","element":"a"},{"text":".","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"RS-AC","element":"span"},{"text":": This is the risk-sensitive actor-critic algorithm that attempts to solve ","element":"span"},{"href":"#id-84","text":"(40) ","element":"a"},{"text":"and is described in Section ","element":"span"},{"text":"6.","element":"span"}],[{"text":"The underlying policy that guides the selection of the sign configuration in each of the algorithms above is a parameterized Boltzmann family and has the form","element":"span"}],[{"style":{"width":"72%"},"width":1320,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-6.png","element":"img"}],[{"text":"All our algorithms incorporate function approximation owing to the curse of dimensionality associated with larger road networks. For instance, assuming only ","element":"span"},{"text":"20 ","element":"span"},{"text":"vehicles per lane of a 2x2-grid network, the cardinality of the state space is approximately of the order ","element":"span"},{"style":{"height":13.38},"width":71.75,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-7.png","element":"img","alt":" 1032","inline":true,"padRight":true},{"text":"and the situation is aggravated as the size of the road network increases. 
The choice of features used in each of our algorithms is as described in Section V-B of ","element":"span"},{"href":"#id-122","referenceIndex":45,"text":"[45]","element":"a"},{"text":".","element":"span"}],[{"text":"The experiments for each algorithm comprised the following two phases:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Policy Search Phase: ","element":"span"},{"text":"Here, each iteration involved a simulation run with the nominal policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-8.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as well as the perturbed policy parameter ","element":"span"},{"style":{"height":12.99},"width":44.82,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-9.png","element":"img","alt":" θ+","inline":true,"padRight":true},{"text":"(algorithm-specific). We ran each algorithm for ","element":"span"},{"text":"500 ","element":"span"},{"text":"iterations, where the run length for a particular policy parameter was ","element":"span"},{"text":"150 ","element":"span"},{"text":"steps.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Policy Test Phase: ","element":"span"},{"text":"After the completion of the policy search phase, we froze the policy parameter and ran ","element":"span"},{"text":"50 ","element":"span"},{"text":"independent simulations with this (converged) choice of the parameter. The results presented subsequently are averages over these ","element":"span"},{"text":"50 ","element":"span"},{"text":"runs.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-119","text":"2 ","element":"a"},{"text":"shows a snapshot of the road network used for conducting the experiments, taken from the GLD simulator. Traffic is added to the network at each time step from the edge nodes. 
The spawn frequencies specify the rate at which traffic is generated at each edge node and follow a Poisson distribution. The spawn frequencies are set such that the ratio of the number of vehicles on the main roads (the horizontal ones in Fig. ","element":"span"},{"href":"#id-119","text":"2) ","element":"a"},{"text":"to the number on the side roads is ","element":"span"},{"text":"100 : 5","element":"span"},{"text":". This setting is close to what is observed in practice and has also been used, for instance, in ","element":"span"},{"href":"#id-120","referenceIndex":44,"text":"[44, ","element":"a"},{"href":"#id-122","referenceIndex":45,"text":"45]","element":"a"},{"text":". In all our experiments, we set the weights in the single-stage cost function ","element":"span"},{"href":"#id-123","text":"(71) ","element":"a"},{"text":"as follows: ","element":"span"},{"style":{"height":13.59},"width":239.06,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-10.png","element":"img","alt":"r1 = r2 = 0.5","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":308.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/33-11.png","element":"img","alt":" r2 = 0.6, s2 = 0.4","inline":true},{"text":". 
For the SPSA and SF-based algorithms in the discounted setting, we set","element":"span"}],[{"id":"id-124","style":{"width":"79%"},"width":1450,"height":1290,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-0.png","element":"img"}],[{"text":"Figure 3: Performance comparison in the discounted setting using the distribution of ","element":"figcaption","subtype":"caption"},{"style":{"height":17.38},"width":124.28,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-1.png","element":"img","alt":" Dθ(x0)","inline":true},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"the parameter ","element":"span"},{"style":{"height":11.6},"width":126.45,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-2.png","element":"img","alt":" δ = 0.2","inline":true,"padRight":true},{"text":"and the discount factor ","element":"span"},{"style":{"height":14.4},"width":130.08,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-3.png","element":"img","alt":" γ = 0.9","inline":true},{"text":". The parameter ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-4.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"in the formulations ","element":"span"},{"href":"#id-84","text":"(40) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-53","text":"(3) ","element":"a"},{"text":"was set to ","element":"span"},{"text":"20","element":"span"},{"text":". 
The step-size sequences are chosen as follows:","element":"span"}],[{"style":{"width":"84%"},"width":1528,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-5.png","element":"img"}],[{"text":"Further, the constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"related to ","element":"span"},{"style":{"height":16},"width":90.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-6.png","element":"img","alt":" ζ4(n)","inline":true,"padRight":true},{"text":"in the risk-sensitive average reward algorithm is set to ","element":"span"},{"text":"1","element":"span"},{"text":". It is easy to see that the choice of step-sizes above satisfies (A4). The projection operator ","element":"span"},{"style":{"height":13.19},"width":35.91,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-7.png","element":"img","alt":" Γi","inline":true,"padRight":true},{"text":"was set to project the iterate ","element":"span"},{"style":{"height":14.19},"width":55.54,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-8.png","element":"img","alt":" θ(i)","inline":true,"padRight":true},{"text":"onto the set ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"10]","element":"span"},{"text":", for all ","element":"span"},{"style":{"height":14},"width":214.3,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-9.png","element":"img","alt":" i = 1, . . . , κ1","inline":true},{"text":", while the projection operator for the Lagrange multiplier used the set ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1000]","element":"span"},{"text":". 
All the experiments were performed on a 2.53 GHz Intel quad-core machine with 3.8 GB of RAM.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"9.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Results","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-124","text":"3 ","element":"a"},{"text":"shows the distribution of the discounted cumulative reward ","element":"span"},{"style":{"height":17.38},"width":124.28,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-10.png","element":"img","alt":" Dθ(x0)","inline":true,"padRight":true},{"text":"for the algorithms in the discounted setting. Figure ","element":"span"},{"href":"#id-125","text":"4 ","element":"a"},{"text":"shows the total arrived road users (TAR) obtained for all the algorithms in the discounted setting, whereas Figure ","element":"span"},{"href":"#id-125","text":"5 ","element":"a"},{"text":"presents the average junction waiting time (AJWT) for the first-order SF-based algorithm RS-SF-G.","element":"span"},{"text":"8 ","element":"span"},{"text":"TAR is a throughput metric that measures the number of road users who have reached their destination, whereas AJWT is a delay metric that quantifies the average delay experienced by the road users.","element":"span"}],[{"text":"The performance of the algorithms in the average setting is presented in Figure ","element":"span"},{"href":"#id-126","text":"6. 
","element":"a"},{"text":"In particular, Figure ","element":"span"},{"href":"#id-126","text":"6(a) ","element":"a"},{"text":"shows the distribution of the average reward ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/34-11.png","element":"img","alt":" ρ","inline":true},{"text":", while Figure ","element":"span"},{"href":"#id-126","text":"6(b) ","element":"a"},{"text":"presents the average junction waiting time","element":"span"}],[{"style":{"width":"80%"},"width":1462,"height":1324,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/35-0.png","element":"img"}],[{"text":"Figure 4: Performance comparison of the algorithms in the discounted setting using the total arrived road users (TAR).","element":"figcaption","subtype":"caption"}],[{"id":"id-125","style":{"width":"38%"},"width":704,"height":562,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/35-1.png","element":"img"}],[{"text":"Figure 5: Performance comparison of the first-order SF-based algorithms, SF-G and RS-SF-G, using the average junction waiting time (AJWT).","element":"figcaption","subtype":"caption"}],[{"id":"id-126","style":{"width":"82%"},"width":1489,"height":633,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/36-0.png","element":"img"}],[{"text":"Figure 6: Performance comparison of the risk-neutral (AC) and risk-sensitive (RS-AC) average reward actor-critic algorithms using two different metrics.","element":"figcaption","subtype":"caption"}],[{"id":"id-127","style":{"width":"80%"},"width":1460,"height":633,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/36-1.png","element":"img"}],[{"text":"Figure 7: Convergence of SPSA-based algorithms in the discounted setting – illustration using two (arbitrarily chosen) coordinates of the parameter 
","element":"figcaption","subtype":"caption"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/36-2.png","element":"img","alt":" θ","inline":true},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"(AJWT) for the average cost algorithms.","element":"span"}],[{"text":"From Figures ","element":"span"},{"href":"#id-124","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-126","text":"6(a), ","element":"a"},{"text":"we notice that the risk-sensitive algorithms proposed in this paper result in a long-term (discounted or average) cost that is higher than that of their risk-neutral variants. However, in terms of the empirical variance of the cost (both discounted and average), the risk-sensitive algorithms outperform their risk-neutral variants. Amongst our algorithms in the discounted setting, we observe that the second-order schemes (RS-SPSA-N and RS-SF-N) exhibit better results than the first-order schemes, though they incur the additional computational cost of inverting the Hessian at each time step. Further, from a traffic signal control application standpoint, we notice from the throughput (TAR) and delay (AJWT) plots (see Figures ","element":"span"},{"href":"#id-125","text":"4, ","element":"a"},{"href":"#id-125","text":"5 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-126","text":"6(b))","element":"a"},{"text":" that the performance of the risk-sensitive algorithm variants is close to that of the corresponding risk-neutral algorithms in both settings considered.","element":"span"}],[{"text":"We observe that the policy parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/37-0.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"converges for the SPSA-based algorithms in the discounted setting. 
This is illustrated in Figures ","element":"span"},{"href":"#id-127","text":"7(a) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-127","text":"7(b). ","element":"a"},{"text":"Note that we established theoretical convergence of our algorithms earlier (see Sections ","element":"span"},{"text":"7 ","element":"span"},{"text":"and ","element":"span"},{"text":"8) ","element":"span"},{"text":"and these plots corroborate it. Further, these plots also show that the transient period, i.e., the initial phase when ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1403.6530/images/37-1.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"has not converged, is short. Similar observations hold for the other algorithms as well. The results of this section indicate the rapid empirical convergence of our proposed algorithms. This observation, coupled with the fact that they guarantee low variance of return, makes them attractive for implementation in risk-constrained systems.","element":"span"}]]},{"heading":"10 Conclusions and Future Work","paragraphs":[[{"text":"We proposed novel actor-critic algorithms for control in risk-sensitive discounted and average reward MDPs. All our algorithms involve a TD critic on the fast timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale. In the discounted setting, we pointed out the difficulty in estimating the gradient of the variance of the return and incorporated simultaneous-perturbation-based SPSA and SF approaches for gradient estimation in our algorithms. The average setting, on the other hand, allowed the actor to employ compatible features to estimate the gradient of the variance. We provided proofs of convergence to locally (risk-sensitive) optimal policies for all the proposed algorithms. 
Further, using a traffic signal control application, we observed that our algorithms resulted in lower empirical variance than their risk-neutral counterparts.","element":"span"}],[{"text":"As future work, it would be interesting to develop a risk-sensitive algorithm that uses a single trajectory in the discounted setting. Further, it would also be interesting to consider conditional value at risk (CVaR) as a measure of risk and develop a control algorithm that optimizes the return of an MDP with bounds on CVaR. The resulting algorithm could be applied to portfolio optimization in finance. An orthogonal direction of future research is to obtain finite-time bounds on the quality of the solution obtained by our algorithms. As mentioned earlier, this is challenging as, to the best of our knowledge, there are no convergence rate results available for multi-timescale stochastic approximation schemes, and hence, for actor-critic algorithms.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-52","text":"[1] Eitan Altman. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Constrained Markov decision processes","element":"span"},{"text":", volume 7. CRC Press, 1999.","element":"span"}],[{"id":"id-12","text":"[2] A. Barto, R. Sutton, and C. Anderson. Neuron-like elements that can solve difficult learning control problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Systems, Man, and Cybernetics","element":"span"},{"text":", 13:835–846, 1983.","element":"span"}],[{"id":"id-28","text":"[3] A. Basu, T. Bhattacharyya, and V. Borkar. A learning algorithm for risk-sensitive cost. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematics of Operations Research","element":"span"},{"text":", 33(4):880–898, 2008.","element":"span"}],[{"id":"id-6","text":"[4] J. Baxter and P. Bartlett. Infinite-horizon policy-gradient estimation. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Artificial Intelligence Research","element":"span"},{"text":", 15:319–350, 2001.","element":"span"}],[{"id":"id-1","text":"[5] D. Bertsekas. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dynamic Programming and Optimal Control","element":"span"},{"text":". Athena Scientific, 1995.","element":"span"}],[{"id":"id-32","text":"[6] D. Bertsekas. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nonlinear programming","element":"span"},{"text":". Athena Scientific, 1999.","element":"span"}],[{"id":"id-2","text":"[7] D. Bertsekas and J. Tsitsiklis. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neuro-Dynamic Programming","element":"span"},{"text":". Athena Scientific, 1996.","element":"span"}],[{"id":"id-40","text":"[8] S. Bhatnagar. Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based ","element":"span"},{"text":"optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACM Transactions on Modeling and Computer Simulation","element":"span"},{"text":", 15(1):74–107, 2005.","element":"span"}],[{"id":"id-42","text":"[9] S. Bhatnagar. Adaptive Newton-based multivariate smoothed functional algorithms for simulation ","element":"span"},{"text":"optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACM Transactions on Modeling and Computer Simulation","element":"span"},{"text":", 18(1):1–35, 2007.","element":"span"}],[{"id":"id-57","text":"[10] S. Bhatnagar. An actor–critic algorithm with function approximation for discounted cost constrained Markov ","element":"span"},{"text":"decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Systems & Control Letters","element":"span"},{"text":", 59(12):760–766, 2010.","element":"span"}],[{"id":"id-118","text":"[11] S. Bhatnagar and K. Lakshmanan. 
An online actor-critic algorithm with function approximation for ","element":"span"},{"text":"constrained Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Optimization Theory and Applications","element":"span"},{"text":", pages 1–21, 2012.","element":"span"}],[{"id":"id-38","text":"[12] S. Bhatnagar, M.C. Fu, S.I. Marcus, and I. Wang. Two-timescale simultaneous perturbation stochastic ","element":"span"},{"text":"approximation using deterministic perturbation sequences. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACM Transactions on Modeling and Computer Simulation","element":"span"},{"text":", 13(2):180–209, 2003. ISSN 1049-3301.","element":"span"}],[{"id":"id-10","text":"[13] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Incremental natural actor-critic algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Advances in Neural Information Processing Systems 20","element":"span"},{"text":", pages 105–112, 2007.","element":"span"}],[{"id":"id-11","text":"[14] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Automatica","element":"span"},{"text":", 45(11):2471–2482, 2009.","element":"span"}],[{"id":"id-43","text":"[15] S. Bhatnagar, N. Hemachandra, and V. Mishra. Stochastic approximation algorithms for constrained ","element":"span"},{"text":"optimization via simulation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACM Transactions on Modeling and Computer Simulation","element":"span"},{"text":", 21(3):15, 2011.","element":"span"}],[{"id":"id-36","text":"[16] S. Bhatnagar, H. Prasad, and L.A. Prashanth. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic Recursive Algorithms for Optimization","element":"span"},{"text":", volume 434. Springer, 2013.","element":"span"}],[{"id":"id-25","text":"[17] V. Borkar. 
A sensitivity formula for the risk-sensitive cost and the actor-critic algorithm. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Systems & Control Letters","element":"span"},{"text":", 44:339–346, 2001.","element":"span"}],[{"id":"id-26","text":"[18] V. Borkar. Q-learning for risk-sensitive control. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematics of Operations Research","element":"span"},{"text":", 27:294–311, 2002.","element":"span"}],[{"id":"id-101","text":"[19] V. Borkar. An actor-critic algorithm for constrained Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Systems & Control Letters","element":"span"},{"text":", 54(3):207–213, 2005.","element":"span"}],[{"id":"id-58","text":"[20] V. Borkar. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic approximation: a dynamical systems viewpoint","element":"span"},{"text":". Cambridge University Press, 2008.","element":"span"}],[{"id":"id-27","text":"[21] V. Borkar. Learning algorithms for risk-sensitive control. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Nineteenth International Symposium on Mathematical Theory of Networks and Systems","element":"span"},{"text":", pages 1327–1332, 2010.","element":"span"}],[{"id":"id-94","text":"[22] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and ","element":"span"},{"text":"reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 38(2):447–469, 2000.","element":"span"}],[{"id":"id-72","text":"[23] H. Chen, T. Duncan, and B. Pasik-Duncan. A Kiefer-Wolfowitz algorithm with randomized differences. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 44(3):442–453, 1999.","element":"span"}],[{"id":"id-16","text":"[24] E. Delage and S. 
Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Operations Research","element":"span"},{"text":", 58(1):203–213, 2010.","element":"span"}],[{"id":"id-112","text":"[25] J. Dippon and J. Renz. Weighted means in stochastic approximation of minima. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 35(5):1811–1827, 1997.","element":"span"}],[{"id":"id-111","text":"[26] V. Fabian. On asymptotic normality in stochastic approximation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Mathematical Statistics","element":"span"},{"text":", pages 1327–1332, 1968.","element":"span"}],[{"id":"id-20","text":"[27] J. Filar, L. Kallenberg, and H. Lee. Variance-penalized Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematics of Operations Research","element":"span"},{"text":", 14(1):147–161, 1989.","element":"span"}],[{"id":"id-21","text":"[28] J. Filar, D. Krass, and K. Ross. Percentile performance criteria for limiting average Markov decision ","element":"span"},{"text":"processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 40(1):2–10, 1995.","element":"span"}],[{"id":"id-80","text":"[29] P. Gill, W. Murray, and M. Wright. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Practical optimization","element":"span"},{"text":". Academic Press, 1981.","element":"span"}],[{"id":"id-97","text":"[30] M. W. Hirsch. Convergent activation dynamics in continuous time networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Networks","element":"span"},{"text":", 2:331–349, 1989.","element":"span"}],[{"id":"id-18","text":"[31] R. Howard and J. Matheson. Risk sensitive Markov decision processes. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Management Science","element":"span"},{"text":", 18(7):356–369, 1972.","element":"span"}],[{"id":"id-35","text":"[32] V. Katkovnik and Y. Kulchitsky. Convergence of a class of random search algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Automatic Remote Control","element":"span"},{"text":", 8:81–87, 1972.","element":"span"}],[{"id":"id-8","text":"[33] V. Konda and J. Tsitsiklis. Actor-critic algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Advances in Neural Information Processing Systems 12","element":"span"},{"text":", pages 1008–1014, 2000.","element":"span"}],[{"id":"id-109","text":"[34] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annals of Applied Probability","element":"span"},{"text":", pages 796–819, 2004.","element":"span"}],[{"id":"id-110","text":"[35] N. Korda and L.A. Prashanth. On TD(0) with function approximation: Concentration bounds and a ","element":"span"},{"text":"centered variant with exponential convergence. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1411.3224","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-96","text":"[36] H. Kushner and D. Clark. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic approximation methods for constrained and unconstrained systems","element":"span"},{"text":". Springer-Verlag, 1978.","element":"span"}],[{"id":"id-24","text":"[37] S. Mannor and J. Tsitsiklis. Mean-variance optimization in Markov decision processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twenty-Eighth International Conference on Machine Learning","element":"span"},{"text":", pages 177–184, 2011.","element":"span"}],[{"id":"id-5","text":"[38] P. Marbach. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Simulation-Based Methods for Markov Decision Processes","element":"span"},{"text":". PhD thesis, Massachusetts Institute of Technology, 1998.","element":"span"}],[{"id":"id-102","text":"[39] A. Mas-Colell, M. Whinston, and J. Green. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Microeconomic theory","element":"span"},{"text":". Oxford University Press, 1995.","element":"span"}],[{"id":"id-29","text":"[40] O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 49(2):267–290, 2002.","element":"span"}],[{"id":"id-104","text":"[41] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Econometrica","element":"span"},{"text":", 70(2):583–601, 2002.","element":"span"}],[{"id":"id-15","text":"[42] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Operations Research","element":"span"},{"text":", 53(5):780–798, 2005.","element":"span"}],[{"id":"id-9","text":"[43] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Sixteenth European Conference on Machine Learning","element":"span"},{"text":", pages 280–291, 2005.","element":"span"}],[{"id":"id-120","text":"[44] L.A. Prashanth and S. Bhatnagar. Reinforcement Learning With Function Approximation for Traffic Signal ","element":"span"},{"text":"Control. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Intelligent Transportation Systems","element":"span"},{"text":", 12(2):412–421, June 2011.","element":"span"}],[{"id":"id-122","text":"[45] L.A. Prashanth and S. Bhatnagar. 
Threshold Tuning Using Stochastic Optimization for Graded Signal ","element":"span"},{"text":"Control. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Vehicular Technology","element":"span"},{"text":", 61(9):3865–3880, Nov. 2012.","element":"span"}],[{"text":"[46] L.A. Prashanth and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Advances in Neural Information Processing Systems 26","element":"span"},{"text":", pages 252–260, 2013.","element":"span"}],[{"id":"id-0","text":"[47] M. Puterman. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov decision processes: Discrete stochastic dynamic programming","element":"span"},{"text":". John Wiley & Sons, 1994.","element":"span"}],[{"id":"id-22","text":"[48] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 125:235–261, 2010.","element":"span"}],[{"id":"id-47","text":"[49] W. Sharpe. Mutual fund performance. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Business","element":"span"},{"text":", 39(1):119–138, 1966.","element":"span"}],[{"id":"id-23","text":"[50] Y. Shen, W. Stannat, and K. Obermayer. Risk-sensitive Markov control processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 51(5):3652–3672, 2013.","element":"span"}],[{"id":"id-19","text":"[51] M. Sobel. The variance of discounted Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Applied Probability","element":"span"},{"text":", pages 794–802, 1982.","element":"span"}],[{"id":"id-34","text":"[52] J. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 37(3):332–341, 1992.","element":"span"}],[{"id":"id-37","text":"[53] J. Spall. A one-measurement form of simultaneous perturbation stochastic approximation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Automatica","element":"span"},{"text":", 33(1):109–112, 1997. ISSN 0005-1098.","element":"span"}],[{"id":"id-39","text":"[54] J. Spall. Adaptive stochastic approximation by the simultaneous perturbation method. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 45(10):1839–1853, 2000.","element":"span"}],[{"id":"id-41","text":"[55] M. A. Styblinski and L. J. Opalski. Algorithms and software tools for IC yield optimization based on ","element":"span"},{"text":"fundamental fabrication parameters. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Computer Aided Design CAD","element":"span"},{"text":", 1(5):79–89, 1986.","element":"span"}],[{"id":"id-13","text":"[56] R. Sutton. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Temporal credit assignment in reinforcement learning","element":"span"},{"text":". PhD thesis, University of Massachusetts Amherst, 1984.","element":"span"}],[{"id":"id-14","text":"[57] R. Sutton. Learning to predict by the methods of temporal differences. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 3:9–44, 1988.","element":"span"}],[{"id":"id-3","text":"[58] R. Sutton and A. Barto. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement learning: An introduction","element":"span"},{"text":". MIT Press, 1998.","element":"span"}],[{"id":"id-7","text":"[59] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. 
Policy gradient methods for reinforcement learning with ","element":"span"},{"text":"function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Advances in Neural Information Processing Systems 12","element":"span"},{"text":", pages 1057–1063, 2000.","element":"span"}],[{"id":"id-56","text":"[60] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for ","element":"span"},{"text":"reinforcement learning with function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NIPS","element":"span"},{"text":", volume 99, pages 1057–1063. Citeseer, 1999.","element":"span"}],[{"id":"id-31","text":"[61] A. Tamar and S. Mannor. Variance adjusted actor-critic algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1310.3697","element":"span"},{"text":", 2013.","element":"span"}],[{"id":"id-30","text":"[62] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twenty-Ninth International Conference on Machine Learning","element":"span"},{"text":", pages 387–396, 2012.","element":"span"}],[{"id":"id-68","text":"[63] A. Tamar, D. Di Castro, and S. Mannor. Temporal difference methods for the variance of the reward to go. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Thirtieth International Conference on Machine Learning","element":"span"},{"text":", pages 495–503, 2013.","element":"span"}],[{"id":"id-93","text":"[64] A. Tamar, D. Di Castro, and S. Mannor. Policy evaluation with variance related risk criteria in Markov ","element":"span"},{"text":"decision processes. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1301.0104","element":"span"},{"text":", 2013.","element":"span"}],[{"id":"id-67","text":"[65] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function ","element":"span"},{"text":"approximation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 42(5):674–690, 1997.","element":"span"}],[{"id":"id-121","text":"[66] M. Wiering, J. Vreeken, J. van Veenen, and A. Koopman. Simulation and optimization of traffic in a city. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Intelligent Vehicles Symposium","element":"span"},{"text":", pages 453–458, June 2004.","element":"span"}],[{"id":"id-4","text":"[67] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 8:229–256, 1992.","element":"span"}],[{"id":"id-17","text":"[68] H. Xu and S. Mannor. Distributionally robust Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematics of Operations Research","element":"span"},{"text":", 37(2):288–300, 2012.","element":"span"}]]}],"_version":"3.3.2"},"paperNode":"$28:props:children:props:children:0:props:product"}]]