Being Optimistic to Be Conservative: Quickly Learning a CVaR Policy

While maximizing expected return is the goal in most reinforcement learning approaches, risk-sensitive objectives such as conditional value at risk (CVaR) are more suitable for many high-stakes applications. However, relatively little is known about how to explore to quickly learn policies with good CVaR. In this paper, we present the first algorithm for sample-efficient learning of CVaR-optimal policies in Markov decision processes based on the optimism in the face of uncertainty principle. This method relies on a novel optimistic version of the distributional Bellman operator that moves probability mass from the lower to the upper tail of the return distribution. We prove asymptotic convergence and optimism of this operator for the tabular policy evaluation case. We further demonstrate that our algorithm finds CVaRoptimal policies substantially faster than existing baselines in several simulated environments with discrete and continuous state spaces.

Introduction A key goal in reinforcement learning (RL) is to quickly learn to make good decisions by interacting with an environment. In most cases the quality of the decision policy is evaluated with respect to its expected (discounted) sum of rewards. However, in many interesting cases, it is important to consider the full distributions over the potential sum of rewards, and the desired objective may be a risk-sensitive measure of this distribution. For example, a patient undergoing a surgery for a knee replacement will (hopefully) only experience that procedure once or twice, and may will be interested in the distribution of potential results for a single procedure, rather than what may happen on average if he or she were to undertake that procedure hundreds of time. Finance and (machine) control are other cases where interest in risk-sensitive outcomes are common. A popular risk-sensitive measure of a distribution of outcomes is the Conditional Value at Risk (CVaR) (Artzner et al. 1999). Intuitively, CVaR is the expected reward in the worst  α-fraction of outcomes, and has seen extensive use in financial portfolio optimization (Zhu and Fukushima

2009), often under the name “expected shortfall”. While there has been recent interest in the RL community in learning to converge or identify good CVaR decision policies in Markov decision processes (Chow and Ghavamzadeh 2014; Chow et al. 2015; Tamar, Glassner, and Mannor 2015; Dabney et al. 2018a), interestingly we are unaware of prior work focused on how to quickly learn such CVaR MDP policies, even though sample efficient RL for maximizing expected outcomes is a deep and well studied theoretical (Jaksch, Ortner, and Auer 2010; Dann et al. 2018) and empirical (Bellemare et al. 2016) topic. Sample efficient exploration seems of equal or even more importance in the case when the goal is risk-averse outcomes.

In this paper we work towards sample efficient reinforcement learning algorithms that can quickly identify a policy with an optimal CVaR. Our focus is in minimizing the amount of experience needed to find such a policy, similar in spirit to probably approximately correct RL methods for expected reward. Note that this is different than another important topic in risk-sensitive RL, which focuses on safe exploration: algorithms that focus on avoiding any potentially very poor outcomes during learning. These typically rely on local smoothness assumptions and do not typically focus on sample efficiency (Berkenkamp et al. 2017; Koller et al. 2018); an interesting question for future work is whether one can do both safe and efficient learning of a CVaR policy. Our work is suitable for the many settings where some outcomes are undesirable but not catastrophic.

Our approach is inspired by the popular and effective principle of optimism in the face of uncertainty (OFU) in sample efficient RL for maximizing expected outcomes (Strehl and Littman 2008; Brafman and Tennenholtz 2002). Such work typically works by considering uncertainty over the MDP model parameters or state-action value function, and constructing an optimistic value function given that uncertainty that is then used to guide decision making. To take a similar idea for rapidly learning the optimal CVaR policy, we seek to consider the uncertainty in the distribution of outcomes possible and the resulting CVaR value. To do so, we use the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality—while to our knowledge this has not been previously used in reinforcement learning settings, it is a very useful concentration inequality for our purposes as it provides bounds on the true cumulative distribution function (CDF) given a set of sampled outcomes. We leverage these bounds in order to compute optimistic estimates of the optimal CVaR.

Our interest is in creating empirically efficient and scalable algorithms that have a theoretically sound grounding. To that end, we introduce a new algorithm for quickly learning a CVaR policy in MDPs and show that at least in the evaluation case in tabular MDPs, this algorithm indeed produces optimistic estimates of the CVaR. We also show that it does converge eventually. We accompany the theoretical evidence with an empirical evaluation. We provide encouraging empirical results on a machine replacement task (Delage and Mannor 2010), a classic MDP where risk sensitive policies are critical, as well as a well validated simulator for type 1 diabetes (Man et al. 2014) and a simulated treatment optimization task for HIV (Ernst et al. 2006). In all cases we find a substantial benefit over simpler exploration strategies. To our knowledge this is the first algorithm that performs strategic exploration to learn good CVaR MDP policies.


Let X be a bounded random variable with cumulative distribution function  F(x) = P[X ≤ x]. The conditional value at risk (CVaR) at level  α ∈ (0, 1)of a random variable X is then defined as (Rockafellar, Uryasev, and others 2000):


We define the inverse CDF as  F −1(u) = inf{x : F(x) ≥u}. It is well known that when X has a continuous distribution,  CVaRα(X) = EX∼F [X|X ≤ F −1(α)](Acerbi and Tasche 2002). For ease of notation we sometimes write CVaR as a function of the CDF  F, CVaRα(F).

We are interested in the CVaR of the discounted cumulative reward in a Markov Decision Process (MDP). An MDP is defined by a tuple  (S, A, R, P, γ), where S and A are finite state and action space,  r ∼ R(s, a)is the reward distribution,  s′ ∼ P(s, a)is the transition kernel and γ ∈ [0, 1)is the discount factor. A stationary policy  πmaps each state  s ∈ Sto a probability distribution over action space A.

Let Z denote the space of distributions over returns (discounted cumulative rewards) from such an MDP, and assume that these returns are in  [Vmin, Vmax]almost surely, where  Vmin ≥ 0. We define  Zπ(s, a) ∈ Zto be the distribution of the return of policy  πwith CDF  FZπ(s,a)and initial state action pair  (s, a) ∈ S × Aas  Zπ(s, a) :=Lawπ (�∞t=0 γtRt|S0 = s, A0 = a). RL algorithms most commonly optimize policies for expected return and explicitly learn Q-values,  Qπ(s, a) = E[Zπ(s, a)]by applying approximate versions of Bellman backups. Instead, we are interested in other properties of the return distribution and we will build on several recently proposed algorithms that aim to learn a parametric model of the entire return distribution instead of only its expectation. Such approaches are known as distributional RL methods.

Distributional Reinforcement Learning Distributional RL methods apply a sample-based approximation to distributional versions of the usual Bellman operators. For example, one can define a distributional Bellman operator (Bellemare, Dabney, and Munos 2017) as  T π : Z → Zas


whereD=denotes equality in distribution, and the transition

operator is defined as  P πZ(s, a) D:= Z(s′, a′)with  s′ ∼P(·|s, a), a′ ∼ π(s). The optimality version T is similarly any  T Z = T πZwhere  πis an optimal policy w.r.t. expected return. Note that this is not necessarily unique when there are multiple optimal policies. (Rowland et al. 2018) showed that T πis a √γ-contraction in the Cram´er-metric, ¯ℓ2


One of the canonical algorithms in distributional RL is CDRL or C51 (Bellemare, Dabney, and Munos 2017) which represent the return distribution  Zπas a discrete distribution with fixed support on N atoms  {zi = Vmin + i∆z : 0 ≤ i <N}, ∆z := Vmax−VminN−1the discrete distribution is parameterized as  θ : S × A → RN:


Essentially, C51 uses a sample transition  (s, a, r, s′)to perform an approximate Bellman backup  Z ← ΠC ˆT Z, where ˆTis a sample-based Bellman operator and  ΠCis a projection back onto the support of discrete distribution {z0, . . . , zN−1}.


In contrast to the typical RL setup where an agent tries to maximize its expected return, we seek to learn a stationary policy that maximizes the  CVaRαof the return at risk level α.1 To find such policies quickly, we follow the optimism-in-the-face-of-uncertainty (OFU) principle and introduce optimism in our CVaR estimates to guide exploration. While adding a bonus to rewards is a popular approach for optimism in the standard expected return case (Ostrovski et al. 2017), we here follow a different approach and introduce optimism into our return estimates by shifting the empirical CDFs. Formally, consider a return distribution  Z(s, a) ∈ Zwith CDF  FZ(s,a)(x). We define the optimism operator  Oc :


Figure 1: Top-left: Empirical CDF Top-right: The lower DKW confidence band (a shifted-down version of the empirical CDF). Bottom-left: Empirical PDF. Bottom-right: Optimistic PDF.

Z → Z as


where c is a constant and  (·)+is short for  max{·, 0}. In the definition above, n(s, a) is the number of times the pair (s, a) has been observed so far or an approximation such as pseudo-counts (Bellemare et al. 2016). By shifting the cumulative distribution function down, this operator essentially puts probability mass from the lower tail to the highest possible value  Vmax. An illustration is provided in Figure 1. This approach to optimism is motivated by an application of the DKW-inequality to the empirical CDF. As shown recently by (Thomas and Learned-Miller 2019), this can yield tighter upper confidence bounds on the CVaR.

Theoretical Analysis The optimistic operator introduced above operates on the entire return distribution and our algorithm introduced in the next section combines this optimistic operator to estimated return-to-go distributions. As such, it belongs to the family of distributional RL methods (Dabney et al. 2018b). These methods are a recent development and come with strong asymptotic convergence guarantees when used for policy evaluation in tabular MDPs (Rowland et al. 2018). Yet, fi-nite sample guarantees such as regret or PAC bounds still remain elusive for distributional RL policy optimization algorithms. A key technical challenge in proving performing bounds for distributionally robust policy optimization during RL is that convergence of the distributional Bellman optimality operator can generally not be guaranteed. Prior results have only showed that if the optimization process itself is to compute a policy which maximizes expected returns, such as Qlearning, then convergence of the distirbutional Bellman optimality operator is guaranteed to converge. (Rowland et al.

2018, Theorem 2). Note however that if the goal is to leverage distributional information to compute a policy to maximize something other than expected outcomes, such as a risk sensitive policy like we consider here, no prior theoretical results are known in the reinforcement learning setting to our knowledge. However, it is promising that there is some empirical evidence that one can compute risk-sensitive policies using distributional Bellman operators (Dabney et al. 2018a) which suggests that more theoretical results may be possible.

Here we take a first step towards this goal. Our primary aim in this work is to provide tools to introduce optimism into distributional return-to-go estimates to guide sample-efficient exploration for CVaR. Therefore, our theoretical analysis focuses on showing that this form of optimism does not harm convergence and is indeed a principled way to obtain optimistic CVaR estimates.

First, we prove that the optimism operator is a non-expansion in the Cram´er distance. This results shows that this operator can be used with other contraction operators without negatively impacting the convergence behaviour. Specifically we can guarantee convergence with distributional Bellman backup. Proposition 1. For any c, the  Ocoperator is a non-expansion in the Cram´er distance ¯ℓ2. This implies that optimistic distributional Bellman backups  OcT πand the projected version  ΠCOcT πare √γ-contractions in ¯ℓ2and iterates of these operators converge in ¯ℓ2to a unique fixed-point.

Next, we provide theoretical evidence that this operator indeed produces optimistic CVaR estimates. Consider here batch policy evaluation in MDPs M with finite state- and action-spaces. Assume that we have collected a fixed number of samples n(s, a) (which can vary across states and actions) and build an empirical model ˆMof the MDP. For any policy  π, let ˆT πdenote the distributional Bellman operator in this empirical MDP. Then we indeed achieve optimistic estimates by the following result: Theorem 2. Let the shift parameter in the optimistic operator be sufficiently large which is  c = O (ln(|S||A|/δ)). Then with probability at least 1 − δ, the iterates CVaRα((Oc ˆT π)mZ0)converges for any risk level  αand initial  Z0 ∈ Zto an optimistic estimate of the policy’s conditional value at risk. That is, with probability at least  1−δ,


This theorem uses the DKW inequality which to the best of our knowledge has not been used for MDPs. Note, that the statement guarantees optimism for all risk levels  α ∈ [0, 1]without paying a penalty for it. Since we estimate the transitions and rewards for each state and action separately, one generally does not expect to be able to use a shift parameter smaller than  Ω(ln(|S||A|/δ)). Thus, Theorem 2 is unimprovable in that sense. Specifically, we avoid a polynomial dependency on the number of states |S| in the shift parameter c by combining two techniques: (1) concentration inequalities w.r.t. the optimal CVaR of the next state for a certain finite set of alphas and (2) a covering argument to get optimism for all infinitely many  α ∈ [0, 1]. This is substantially more involved than the expected reward case.

These results are a key step towards finite-sample analyses. In future work it would be very interesting to obtain a convergence analysis for distributional Bellman optimality operators in general, though this is outside the scope of this current paper. Such a result could lead to sample-complexity guarantees when combined with our existing analysis.

Algorithm In the policy evaluation case where we would like to compute optimistic estimates of the CVaR of a given observed policy  π, our algorithm essentially performs an approximate version of the optimistic Bellman update  OcT πwhere  T πis the distributional Bellman operator. For the control case where we would like to learn a policy that maximizes CVaR, we instead define a distributional Bellman optimality operator  Tα. Analogous to prior work (Bellemare, Dabney, and Munos 2017),  Tαis any operator that satisfies  TαZ = T πZfor some policy  πthat is greedy w.r.t. CVaR at level  α. Our algorithm then performs an approximate version of the optimistic Bellman backup  OcTα, shown in Algorithm 1. The main structure of our algorithm resembles categorical distributional reinforcement learning (C51) (Bellemare, Dabney, and Munos 2017). In a similar vein, our algorithm also maintains a return distribution estimate for each state-action pair, represented as a set of N weights  pi(s, a)for i ∈ [N]. These weights represent a discrete distribution with outcomes at N equally spaced locations  z0 < z1 < · · · <zN−1, each  ∆z = Vmax−VminN−1apart. The current probability assigned to outcome  ziin (s, a) is denoted by  pi(s, a), where the atom probabilities  p1:N(s, a)are given by a differentiable model such as a neural network, similar to C51. Note that other parameterized representations of the weights (Bellemare, Dabney, and Munos 2017) are straightforward to incorporate. The main differences between Algorithm 1 and existing distributional RL algorithms (e.g. C51) are highlighted in red. We first apply an optimism operator to our successor distribution  FZ(st+1,a)(Lines 4–6) to form an optimistic CDF ˜FZ(st+1,a)for all actions  a ∈ A. This operator should encourage exploring actions that might lead to higher CVaR policies for our input  α. These optimistic CDFs are also used to decide on the successor action in the control setting (Line 7). Then, similar to C51 we apply the Bellman operator ˆT zifor  i ∈ [N]and distribute the probability of ˜pito the immediate neighbours of ˆT zi, where we calculate the probability mass  ˜piwith the optimistic CDF ˜FZ(st+1,a⋆)(Line 10). Following (Bellemare, Dabney, and Munos 2017), we train this model using the cross-entropy loss, which for a particular state transition at time t is


where  m0:N−1are the weights of the target distribution computed in Lines 8–15 in Algorithm 1. In the tabular set-


ting we can directly update the probability mass  pjby


where  βis the learning rate.

In tabular settings, the counts n(s,a) can be directly stored and used; however, this is not the case in continuous settings. For this reason, we adopt the pseudo-count estimation method proposed by (Ostrovski et al. 2017) and replace n(s, a) by a pseudo-count ˆNt(s, a)in the optimistic distributional operator (Equation 4). Let  ρbe a density model and ρt(s, a)the probability assigned to the state action pair (s, a) by the model after t training steps. The prediction gain PG of  ρis defined


Where  ρ′t(s, a)is the probability assigned to (s, a) if it were trained on that same (s, a) one more time. Now we define

the pseudo count of (s, a) as


where  κis a constant hyper-parameter, and  (PG(s, a))+thresholds the value of the prediction gain at 0.

Our setting differs from (Ostrovski et al. 2017) in the sense that we have to compute the count before taking the action a. A naive way would be to try all actions and train the model to compute the counts but this method is slow and requires the environment to support an undo action. Instead, we can estimate PG for all actions as follows. Consider the density model parametrized by  θ, ρ(s, a; θ). After observing (s, a), the training step to maximize the log likelihood will update the parameters by  θ′ = θ+α∇θ log ρ(s, a; θ), where αis the learning rate. So we can approximate the new log probability using a first-order Taylor expansion


This calculation suggests that the prediction gain can be estimated just by computing the gradient of the log likelihood given a state-action pair, i.e.,  PG(s, a) ≈α(∇θ log ρ(s, a; θ))2. As discussed in (Graves et al. 2017) this estimate of prediction gain is biased, but empirically we have found this method to perform well.


We validate our algorithm empirically in three simulated environments against baseline approaches. Finance, health and operations are common areas where risk-sensitive strategies are important, and we focus on two health domains and one operations domain. Details, where omitted, are provided in the supplemental material.

Machine Replacement Machine repair and replacement is a classic example in the risk sensitive literature, though to our knowledge no prior work has considered how to quickly learn a good risk-sensitive policy for such domains. Here we consider a minor variant of a prior setting (Delage and Mannor 2010). Specifically, as shown in Figure 2, the environment consists of a chain of n (25 in our experiments) states. There are two actions: replace and don’t replace the machine. Choosing replace at any state terminates the episode, while choosing don’t replace moves the agent to the next state in the chain. At the end of the chain, choosing don’t replace terminates the episode with a high variance cost, and choosing replace terminates the episode with a higher cost but lower variance. This environment is especially a challenging exploration task due to the chain structure of the MDP, as well as the high variance of the reward distributions when taking actions in the last state. Additionally in this MDP it is feasible to exactly compute the  CVaR0.25-optimal policy, which allows us to compare the learned policy to the true optimal CVaR policy. Note here that the optimal policy for maximizing  CVaR0.25is to replace on the final state in the chain to avoid the high variance alternative;


Figure 2: Machine Replacement: This environment consists of a chain of n states, each affording two actions: replace and don’t replace.


Figure 3: Machine Replacement: The thick grey dashed line is the CVaR0.25-optimal policy. The thin dashed lines labeled as the suboptimal policy is the optimal expectation-maximizing policy. The shaded area shows the 95% confi-dence intervals.

in contrast, the optimal policy for expected return always chooses don’t replace.

HIV Treatment In order to test our algorithm on a larger continuous state space, we leverage an HIV Treatment simulator. The environment is based on the implementation by (Geramifard et al. 2015) of the physical model described in (Ernst et al. 2006). The patient state is represented as a 6-dimensional continuous vector and the reward is a function of number of free HIV viruses, immune response of the body to HIV, and side effects. There are four actions, each determining which drugs are administered for the next 20 day period: Reverse Transcriptase Inhibitors (RTI), Protease Inhibitors (PI), neither, or both. There are 50 time steps in total per episode, for a total of 1000 days. We chose here a larger number of days per time step compared to the typical setup (200 steps of 5 days each) to facilitate faster experimentation. This design choice also makes the exploration task harder, since taking one wrong action can drastically destabilize a patient’s trajectory. The original proposed model was deterministic, which makes the CVaR policy identical to the policy optimizing the expected value. Such simulators are rarely a perfect proxy for real systems, and in our setting we add Gaussian noise  ∼ N(0, 0.01)to the efficacy of each drug (RTI:  ϵ1and PI:  ϵ2in (Ernst et al. 2006)). This change necessitates risk-sensitive policies in this environment.

Diabetes 1 Treatment Patients with type 1 diabetes regulate their blood glucose level with insulin in order to avoid


Figure 4: Comparison of our approach against an  ϵ-greedy and IQN baseline. All models were trained to optimize the CVaR0.25of the return on a stochastic version of the HIV simulator (Ernst et al. 2006). Top: Objective CVaR0.25; Bottom: Discounted expected return of the same policies as in top plot.

hypoglycemia or hyperglycemia (very low or very high blood glucose level, respectively). A simulator has been created (Man et al. 2014) that is an open source version of a simulator that was approved by the FDA as a substitute for certain pre-clinical trials. The state is continuous-valued vector of the current blood glucose level and the amount of carbohydrate intake (through food). The action space is discretized into 6 levels of a bolus insulin injection. The reward function is defined similar to the prior work (Bastani 2014) as following:


Where  bg′ = bg/18.018018which is the estimate of bg (blood glucose) in mmol/L.

Additionally we inject two source of stochasticity into the taken action: First, we add Gaussian noise N(0, 1) to the action. Second, we delay the time of the injection by at most 5 steps, where the probability of injection at time t is higher than time  t + i, i ≥ 1following the power law. Each simulation lasts for 200 steps, during which a patient eats five meals. The agent chooses an action after each meal, and after the 200 steps each patient resets to its initial state.

This domain also readily offers a suite of related tasks, since the environment simulates 30 patients with slightly different dynamics. Tuning hyper-parameters on the same task can be misleading (Henderson et al. 2018), as is the case in our two previous benchmarks. In this setting we tune baselines and our method on one patient, and test the performance on different patients.

Baselines and Experimental Setup

The majority of prior risk-sensitive RL work has not focused on efficient exploration, and there has been very little deep distributional RL work focused on risk sensitivity. Our key contribution is to evaluate the impact of more strategic exploration on the efficiency with which a risk-sensitive policy can be learned. We compare to following approaches:

1.  ϵ-greedy CVaR: In this benchmark we use the same algorithm, except we do not introduce an optimism operator, instead using an  ϵ-greedy approach for exploration. This benchmark can be viewed as analogous to the distributional RL methods of C51 (Bellemare, Dabney, and Munos 2017) if the computed policy had optimized for CVaR instead of expected reward.

2. IQN-ϵ-greedy CVaR: In this benchmark we use implicit quantile network (IQN) that also uses  ϵ-greedy method for exploration (Dabney et al. 2018a). We adopted the dopamine implementation of IQN (Castro et al. 2018).

3. CVaR-AC: An actor-critic method proposed by (Chow and Ghavamzadeh 2014) that maximizes the expected return while satisfying an inequality constraint on the CVaR. This method relies on the stochasticity of the policy for exploration. Note that a comparison to an expectation maximizing al-

gorithm is uninformative since such approaches are maxi-

mizing different (non-risk-sensitive) objectives. All of these algorithms use hyperparameters, and it is well

recognized that  ϵ-greedy algorithms can often perform quite

well if their hyperparameters are well-tuned. To provide a fair comparison, we evaluated across a number of schedules for reducing the  ϵparameter for both  ϵ-greedy and IQN, and a small set of parameters (4-7) for the optimism value c for our method. We used the specification described in Appendix C of (Chow and Ghavamzadeh 2014) for CVaR-AC. The system architectures used in continuous settings are

identical for Baseline 1 (ϵ-greedy) and our method. This

consists of 2 hidden layers of size 32 with ReLU activation for Diabetes 1 Treatment, and 4 hidden layers of size 128 with ReLU activation for HIV Treatment, both followed by a softmax layer for each action. Similarly for IQN we used the same architecture, followed by a cosine embedding function and a fully connected layer of size 128 for HIV Treatment (32 for Diabetes 1 Treatment) with ReLU activation, followed by a softmax layer. The density model is a realNVP (Dinh, Sohl-Dickstein, and Bengio 2016) with 3 hidden layers each of size 64. All results are averaged over 10 runs and we report 95%

confidence intervals. We report the performance of  ϵ-greedy

at evaluation time (setting  ϵ= 0), which is the best performance of  ϵ-greedy. For the Diabetes Treatment domain, hyperparameters are

optimized only on adult#001. We then report results of

the methods using those hyperparameters on adult#003 ,


Results and Discussion Results on machine replacement environment (Figure 3), HIV Treatment (Figure 4) and Diabetes 1 Treatment (Figure


Figure 5: Type 1 diabetes simulator: CVaR0.25for three different adults. Plots are averaged over 10 runs with 95% CI.


Figure 6: Type 1 Diabetes simulator, percent of episodes where patients experienced a severe medical condition (hypoglycemia or hyperglycemia), averaged across 10 runs

5) all show our optimistic algorithm achieves better performance much faster than the baselines.

In Machine Replacement (Figure 3) we see that our method quickly converges to the optimal CVaR performance. Unfortunately despite our best efforts, our implementation of CVaR-AC did not perform well even on the simplest environment, so we did not show the performance of this method on other environments. One challenge here is that CVaR-AC has a significant number of hyper-parameters, including 3 different learning rates schedule for the optimization process, initial Lagrange multipliers and the kernel functions.

In the HIV Treatment we also see a clear and substantial benefit to our optimistic approach over the baseline  ϵ-greedy approach and IQN(Figure 4).

Figure 5 is particularly encouraging, as it shows the results for the diabetes simulator across 3 patients, where the hyperparameters were fixed after optimizing for a separate patient. Since in real settings it would be commonly necessary to fix the hyperparameters in advance, this result provides a nice demonstration that the optimistic approach can consistently equal or significantly improve over an  ϵ-greedy policy in related settings, similar to the well known results in Atari in which hyperparameters are optimized for one game and then used for multiple others.

”Safer” Exploration. Our primary contribution is a new algorithm to learn risk-sensitive policies quickly, with less data. However, an interesting side benefit of such a method might be that the number of extremely poor outcomes experienced over time may also be reduced, not due to explicitly prioritizing a form of safe exploration, but because our algorithm may enable a faster convergence to a safe policy. To evaluate this, we consider a risk measure proposed by (Clarke and Kovatchev 2009), which quantifies the risk of a severe medical condition based on how close their glucose level is to hypoglycemia (blood glucose,  ≤3.9 mmol/l) and

hyperglycemia (blood glucose,  ≥10 mmol/l).

Table 6 shows the fraction of episodes in which each patient experienced a severely poor outcome for each algorithm while learning. Optimism-based exploration approximately halves the number of episodes with severely poor outcomes, highlighting a side benefit of our optimistic approach of more quickly learning a good safe policy.


Optimizing policies for risk sensitivity in MDPs has been long studied, with policy gradient (Tamar, Glassner, and Mannor 2015; Tamar et al. 2015), actor critic (Tamar and Mannor 2013) and TD methods (Tamar and Mannor 2013; Sato, Kimura, and Kobayashi 2001). While most of this work considers mean-variance trade objectives, (Chow et al. 2015) establish a connection between a optimizing CVaR and robustness to modeling errors, presenting a value iteration algorithm. In contrast, we do not assume access to transition and rewards models. (Chow and Ghavamzadeh 2014) present a policy gradient and actor-critic algorithm for an expectation-maximizing objective with a CVaR constraint. None of these works considers systematic exploration but rely on heuristics such as  ϵ-greedy or on the stochasticity of the policy for exploration. Instead, we focus on how to explore systematically to find a good CVaR-policy.

Our work builds upon recent advances on distributional RL (Bellemare, Dabney, and Munos 2017; Rowland et al. 2018; Dabney et al. 2018b) which are still concerned with optimizing expected return. Notably, (Dabney et al. 2018a) aims to train risk-averse and risk-seeking agents, but does not address the exploration problem or attempts to find optimal policies quickly.

(Dilokthanakul and Shanahan 2018) uses risk-averse objectives to guide exploration for good performance w.r.t. expected return. (Moerland, Broekens, and Jonker 2018) leverages the return distribution learned in distributional RL as a means for optimism in deterministic environments. (Mavrin et al. 2019) follow a similar pattern but can handle stochastic environments by disentangling intrinsic and parametric uncertainty. While they also evaluate the policy that picks the VaR-greedy action in one experiment, their algorithm still optimizes expected return during learning. In general, these approaches are fundamentally different from ours which learns CVaR policies in stochastic environments efficiently by introducing optimism into the learned return distribution.


We present a new algorithm for quickly learning CVaRoptimal policies in Markov decision processes. This algorithm is the first to leverage optimism in combination with distributional reinforcement learning to learn risk-averse policies in a sample-efficient manner. Unlike existing work on expected return criteria which rely on reward bonuses for optimism, We introduce optimism by directly modifying the target return distribution and provide a theoretical justifica-tion that in the evaluation case for finite MDPs, this indeed yields optimistic estimates. We further empirically observe significantly faster learning of CVaR-optimal policies by our algorithm compared to existing baselines on several benchmark tasks. This includes simulated healthcare tasks where risk-averse policies are of particular interest: HIV medication treatment and insulin pump control for diabetes type 1 patients.

The research reported here was supported by NSF CAREER award, ONR Young Investigator award and Microsoft faculty fellowship.


Acerbi, C., and Tasche, D. 2002. On the coherence of expected shortfall. Journal of Banking & Finance 26(7):1487– 1503.

Artzner, P.; Delbaen, F.; Eber, J.-M.; and Heath, D. 1999. Coherent measures of risk. Mathematical finance 9(3):203– 228.

Bastani, M. 2014. Model-free intelligent diabetes management using machine learning.

Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 1471–1479.

Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 449–458. JMLR. org.

Berkenkamp, F.; Turchetta, M.; Schoellig, A.; and Krause, A. 2017. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, 908–918.

Brafman, R. I., and Tennenholtz, M. 2002. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3(Oct):213–231.

Castro, P. S.; Moitra, S.; Gelada, C.; Kumar, S.; and Bellemare, M. G. 2018. Dopamine: A Research Framework for Deep Reinforcement Learning.

Chow, Y., and Ghavamzadeh, M. 2014. Algorithms for cvar optimization in mdps. In Advances in neural information processing systems, 3509–3517.

Chow, Y.; Tamar, A.; Mannor, S.; and Pavone, M. 2015. Risk-sensitive and robust decision-making: a cvar optimization approach. In Advances in Neural Information Processing Systems, 1522–1530.

Clarke, W., and Kovatchev, B. 2009. Statistical tools to analyze continuous glucose monitor data. Diabetes technology & therapeutics 11(S1):S–45.

Dabney, W.; Ostrovski, G.; Silver, D.; and Munos, R. 2018a. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923.

Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2018b. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence.

Dann, C.; Li, L.; Wei, W.; and Brunskill, E. 2018. Policy cer-tificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056.

Delage, E., and Mannor, S. 2010. Percentile optimization for markov decision processes with parameter uncertainty. Operations research 58(1):203–213.

Dilokthanakul, N., and Shanahan, M. 2018. Deep reinforcement learning with risk-seeking exploration. In International Conference on Simulation of Adaptive Behavior, 201–211. Springer.

Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Ernst, D.; Stan, G.-B.; Goncalves, J.; and Wehenkel, L. 2006. Clinical data based optimal sti strategies for hiv: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, 667–672. IEEE. Geramifard, A.; Dann, C.; Klein, R. H.; Dabney, W.; and How, J. P. 2015. Rlpy: a value-function-based reinforcement learning framework for education and research. Journal of Machine Learning Research 16(46):1573–1578.

Graves, A.; Bellemare, M. G.; Menick, J.; Munos, R.; and Kavukcuoglu, K. 2017. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1311–1320. JMLR. org.

Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.

Jaksch, T.; Ortner, R.; and Auer, P. 2010. Near-optimal regret bounds for reinforcement learning.

Koller, T.; Berkenkamp, F.; Turchetta, M.; and Krause, A. 2018. Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), 6059–6066. IEEE.

Man, C. D.; Micheletto, F.; Lv, D.; Breton, M.; Kovatchev, B.; and Cobelli, C. 2014. The uva/padova type 1 diabetes simulator: new features. Journal of diabetes science and technology 8(1):26–34.

Mavrin, B.; Yao, H.; Kong, L.; Wu, K.; and Yu, Y. 2019. Distributional reinforcement learning for efficient exploration.

In International Conference on Machine Learning, 4424– 4434.

Moerland, T. M.; Broekens, J.; and Jonker, C. M. 2018. The potential of the return distribution for exploration in rl. arXiv preprint arXiv:1806.04242.

Ostrovski, G.; Bellemare, M. G.; van den Oord, A.; and Munos, R. 2017. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2721–2730. JMLR. org.

Rockafellar, R. T.; Uryasev, S.; et al. 2000. Optimization of conditional value-at-risk. Journal of risk 2:21–42.

Rowland, M.; Bellemare, M. G.; Dabney, W.; Munos, R.; and Teh, Y. W. 2018. An analysis of categorical distributional reinforcement learning. arXiv preprint arXiv:1802.08163.

Sato, M.; Kimura, H.; and Kobayashi, S. 2001. Td algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence 16(3):353–362.

Shapiro, A.; Dentcheva, D.; and Ruszczy´nski, A. 2009. Lectures on stochastic programming: modeling and theory. SIAM.

Strehl, A. L., and Littman, M. L. 2008. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences 74(8):1309–1331.

Tamar, A., and Mannor, S. 2013. Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697.

Tamar, A.; Chow, Y.; Ghavamzadeh, M.; and Mannor, S. 2015. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, 1468– 1476.

Tamar, A.; Glassner, Y.; and Mannor, S. 2015. Optimizing the cvar via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Thomas, P., and Learned-Miller, E. 2019. Concentration inequalities for conditional value at risk. In International Conference on Machine Learning, 6225–6233.

Weissman, T.; Ordentlich, E.; Seroussi, G.; Verdu, S.; and Weinberger, M. J. 2003. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep. Xie, J. 2018. A type-1 diabetes simulator, simglucose v0.2.1.

Zhu, S., and Fukushima, M. 2009. Worst-case conditional value-at-risk with application to robust portfolio management. Operations research 57(5):1155–1168.


Experimental Details

Machine Replacement Machine replacement environment consist of n states (we use n = 25 in the experiment), where action replace transitions to the terminal state with cost  N(µr,t, σr,t)at state t, where  µr,t = rmax − tn(rmax − rmin)and σr,t = 0.1 + 0.01t. Action don’t replace has cost  N(0, 1e − 2)and transitions to state t + 1. In our experiment we used rmax = 23, rmin = 10. However, for the last state n action don’t replace has cost  N(µr, 10), where we used  µr = 8, and transitions to the terminal state. For C51 algorithm we use  Vmin = −50, Vmax = 50, γ = 0.99, learning rate 0.01 and 51 atoms.

Tuning: We use  ϵ-greedy with schedule (start, end, n) = (0.9, 0.1, 5000) that starts with  ϵ =start and decays linearly to ϵ =end in n time steps, staying constant afterwards. This schedule achieved the best performance in our experiments when compared to other linear schedules {(0.9, 0.3, 5000), (0.9, 0.1, 10000), (0.9, 0.1, 15000), (0.9, 0.05, 5000)}, and exponential decays with schedule in the form of (ϵ0, d, step): {(0.9, 0.99, 5), (0.9, 0.99, 20), (0.9, 0.99, 2), (0.9, 0.99, 30), (0.5, 0.99, 5)} where  ϵ = ϵ0 × depisode/step. We have also tried our algorithm with optimism values of c = [0.25, 0.5, 1, 2].

For the actor critic method we use the CVaR limit as -10, radial basis function as kernel and other set of hyper-parameters are as described in the appendix of (Chow and Ghavamzadeh 2014).

Additional Experiments: Additional to the risk level  α = 0.25, we observe the same gain in the performance for other risk levels. As shown in figure 7, optimism based exploration shows a significant gain over  ϵ−greedy exploration for risk levels α = 0.1and  α = 0.5.

Type 1 Diabetes Simulator An open source implementation of type 1 diabetes simulator (Xie 2018) simulates 30 different virtual patients, 10 child, 10 adolescent and 10 adult. For our experiments in this paper we have used adult#003, adult#004 and adult#005. Additionally we have used "Dexcom" sensor for CGM (to measure blood glucose level) and "Insulet" as a choice of insulin pump. All simulations are 10 hours for each patient and after 10 hour, patient resets to the initial state. Each step of simulation is 3 minutes.

State space is a continous vector of size 2 (glucose level, meal size) where glucose level is the amount of glucose measured by "Dexcom" sensor and meal size is the amount of Carbohydrate in each meal.

Action space is defined as (bolus, basal=0) where amount of bolus injection discretized by 6 bins between 30 (max bolus, a property of the "Insulet" insulin pump) and 0 (no injection). Additionally we inject two source of stochasticity to the taken action, assume action  a = (ab, 0)at time t is the agent’s decision, then we take the action  a = (a′b, 0)at time  t′where:


Where  x ∼ P(x; 1) = 2x−1is drawn from the power law distribution and c = 5. Note that this means delay the action at most 5 step where the probability of taking the action at time t is higher than time  t + i, i ≥ 1following the power law. Since each step of simulation is 3 minutes, patient might take the insulin up to 15 minutes after the prescribed time by the agent.

Reward structure is defined similar to the prior work (Bastani 2014) as following:


Where  bg′ = bg/18.018018which is the estimate of bg (blood glucose) in mmol/L rather than mg/dL. Additionally if the amount of glucose is less than 39 mg/dL agent incurs a penalty of  −10.


0 2000 4000 6000 8000 10000 12000 Episodes



0 2000 4000 6000 8000 10000 12000 Episodes

Figure 7: Machine Replacement with different risk levels. Left: risk level  α = 0.1, Right: risk level  α = 0.5

We generated a meal plan scenario for all the patients that is meal of size 60, 20, 60, 20 CHO with the schedule 1 hour, 3 hours, 5 hours and 7 hours after starting the simulation. Notice that this will make the simulation horizon 200 steps and 5 actionable steps (initial state, and after each meal).

Categorical Distributional RL: The C51 model consist of 2 hidden layers each of size 32 and ReLU activation function, followed by of |A| each with 51 neurons followed by a softmax activation, for representing the distribution of each action.

We used Adam optimizer with learning rate  1e − 3, β1 = 0.9,β2 = 0.999and  ϵ = 1e − 8. We set  Vmax = 15, Vmin = −40, 51 probability atoms, and used batch size of 32. For computing the CVaR we use 50 samples of the return.

Density Model: For log likelihood density model we used realNVP (Dinh, Sohl-Dickstein, and Bengio 2016) with 3 layers each of size 64. The input of the model is a concatenated vector of (s, a). We used same hyper parameters for optimizer as in C51 model. We have used constant  κ = 1e − 5for computing the pseudo-count.

Tuning: We have tuned our method and  ϵ-greedy on patient adult#001 and used the same parameters for the other patients. We tried 5 different linear schedule of  ϵ-greedy, {(0.9, 0.1, 2), (0.9, 0.05, 4), (0.9, 0.05, 6), (0.9, 0.3, 4), (0.9, 0.3, 4), (0.9, 0.05, 10)} where first element is the initial  ϵ, second element is the final  ϵand the third element is episode ratio (i.e. epsilon starts from initial and reaches to the final value in episode ratio fraction of total number of episodes, linearly). Additionally we have tried 5 different exponential decay schedule for  ϵ-greedy in the form of (ϵ0, d, step): {(0.9, 0.99, 5), (0.9, 0.99, 20), (0.9, 0.99, 2), (0.9, 0.99, 30), (0.5, 0.99, 5)} where  ϵ = ϵ0 × depisode/step. The first of the exponential decay set preformed the best. We have also tested our algorithm with constant optimism values of [0.2, 0.4, 0.5, 0.8, 1, 2, 5] where we picked the best value 0.5.

HIV Simulator The environment is an implementation of the physical model described in (Ernst et al. 2006). The state space is of dimension 6 with and action space is of size 4, indicating the efficacy of being on the treatment.  ϵ1, ϵ2described in (Ernst et al. 2006) takes values as  ϵ1 ∈ {0, 0.7}and  ϵ2 ∈ {0, 0.3}. We have also added the stochasticity to the action by a random gaussian noise. So the efficacy of a drug is computed as  ϵi + N(0, 0.01).

The reward structure is defined similar to the prior work (Ernst et al. 2006). And we simulate for 1000 time steps, where agent can take action in 50 steps (each 20 simulation step) and actions remains constant in each interval. While trianing we normalize the reward by dividing them by 1e6.

Categorical Distributional RL: The C51 model consist of 4 hidden layers each of size 128 and ReLU activation function, followed by of |A| each with 151 neurons followed by a softmax activation, for representing the distribution of each action.

We used Adam optimizer with learning rate decay schedule from  1e − 3to  1e − 4in half a number of episodes,  β1 = 0.9, β2 = 0.999and  ϵ = 1e − 8. We set  Vmax = 40, Vmin = −10, 151 probability atoms, and used batch size of 32. For computing the CVaR we use 50 samples of the return.

Implicit Quantile Network: IQN model consists of 4 hidden layers with size 128 and ReLU activation. Then an embedding of size 64 computed by ReLU(�n−1i=0 cos(πiτ)wij + bj). Then we take the element wise multiplication of the embedding and the output of 4 hidden layers, followed by a fully connected layer with size 128 and ReLU activation, and a softmax layer. We used 8 samples for N and  N ′and 32 quantiles.

Density Model: For log likelihood density model we used realNVP (Dinh, Sohl-Dickstein, and Bengio 2016) with 3 layers each of size 64. The input of the model is a concatenated vector of (s, a). We used same hyper parameters for optimizer as in C51 model. We have used constant  κ = 1e − 5for computing the pseudo-count.

Tuning: We have tuned our method,  ϵ-greedy and IQN. For  ϵ-greedy we tried 5 different linear schedule of  ϵ-greedy, {(0.9, 0.05, 10), (0.9, 0.05, 8), (0.9, 0.05, 5), (0.9, 0.05, 4), (0.9, 0.05, 2)} where first element is the initial  ϵ, second element is the final  ϵand the third element is episode ratio (i.e. epsilon starts from initial and reaches to the final value in episode ratio fraction of total number of episodes, linearly). Additionally we have tried 5 different exponential decay schedule for  ϵ-greedy and IQN in the form of (ϵ0, d, step): {(1.0, 0.9, 10), (1.0, 0.9, 100), (1.0, 0.9, 500), (1.0, 0.99, 10), (1, 0.99, 100)} where ϵ = ϵ0 × depisode/step. The first of the linear decay set preformed the best. We have also tested our algorithm with constant optimism values of (0.2, 0.4, 0.5, 0.8, 1, 2, 5) where we picked the best value 0.8.

Theoretical Analysis

Proposition 3 (Restatement of Proposition 1). For any c, the  Ocoperator is a non-expansion in the Cram´er distance ¯ℓ2. This implies that optimistic distributional Bellman backups  OcT πand the projected version  ΠCOcT πare √γ-contractions in ¯ℓ2and iterates of these operators converge in ¯ℓ2to a unique fixed-point.

Proof. Consider  Z, Z′ ∈ Z, any state  s ∈ Sand action  a ∈ Awith CDFs  FZ(s,a)and  FZ′(s,a)and consider the application of

the optimism operator  Oc:


Generally, for any  a ≥ bwe have


and applying this case-by-case bound to the quantity in the integral above, we get


By taking the square root on both sides as well as a max over states and actions, we get that  Ocis a non-expansion in ¯ℓ2. The rest of the statement follows from the fact that  T πis a √γ-contraction and  ΠCa non-expansion (Rowland et al. 2018) and the Banach fixed-point theorem.

Theorem 4 (Restatement of Theorem 2). Let the shift parameter in the optimistic operator be sufficiently large which is c = O (ln(|S||A|/δ)). Then with probability at least  1 − δ, the iterates  CVaRα((Oc ˆT π)mZ0)converges for any risk level  αand initial  Z0 ∈ Zto an optimistic estimate of the policy’s conditional value at risk. That is, with probability at least  1 − δ,


Proof. By Lemma 8 and Lemma 3 by (Bellemare, Dabney, and Munos 2017), we know that  Zi+1 ← Oc ˆT πZiconverges to a unique fixed-point  Z∞, independent of the initial  Z0. Hence, without loss of generality, we can choose  Z0 = Zπ.

We proceed by first showing how our result will follow under a particular definition of c, and then show what that definition is. Assume that we have obtained a value for c that satisfies the assumption of Lemma 5 (see other parts of this appendix), and let ˜Z = Ziand  Z = Zπ. Then Lemma 5 implies that if  FZi(s,a) ≤ FZπ(s,a)for all (s, a), then also  FZi+1(s,a) ≤ FZπ(s,a)for all (s, a). Thus,  FZ∞(s,a) ≤ FZπ(s,a)for all (s, a). Finally, we can use Lemma 8 to obtain the desired result of our proof statement,  CVaRα(FZ∞(s,a)) ≥ CVaRα(FZπ(s,a))for all (s, a).

Going back, we use concentration inequalities to determine the value of c that ensures the required condition in Lemma 5 (expressed in Eq. (19)). The DKW-inequality which give us that for any (s, a) with probability at least  1 − δ


Further, the inequality by (Weissman et al. 2003) gives that

Combining both with a union bound over all  |S × A|state-action pairs, we get that it is sufficient to choose c = �(1 + 4|S|) ln(4|S||A|/δ) ≥�2|S| ln(4|S||A|/δ) +�ln(4|S||A|/δ)/2to ensure that  ˜c(s, a) ≥ ∥F ˆR(s,a) − FR(s,a)∥∞ +∥ ˆP(·|s, a) − P(·|s, a)∥1allowing us to apply Lemma 5.

However, we can improve this result by removing the polynomial dependency of c on the number of states |S| as follows. Consider a fixed (s, a) and denote  v(s′, x) := �a′∈A π(a′|s′)FγZ(s′,a′)+R(s,a)(x)where we set  Z = Zπ. Our goal is to derive

a concentration bound on �s′∈S�ˆP(s′|s, a) − P(s′|s, a)�v(s′, x)that is tighter than the bound derived from  ∥ ˆP(·|s, a) −

P(·|s, a)∥1. Note that v is not a random quantity and hence �s′∈S�ˆP(s′|s, a) − P(s′|s, a)�v(s′, x)is a normalized sum of independent random variables for any x. To deal with the continuous variable x which prevents us from applying a union bound over x directly, we use a covering argument. Let  K ∈ Nbe arbitrary and consider the discretization set


Define  ¯v(s′, x) = v(s′, max{x′ ∈ ¯X : x′ ≤ x})as the discretization of v at the discretization points in ¯X. This construction ensures that the discretization error is uniformly bounded by 1/K, that is,  |¯v(s, x) − v(s, x)| ≤ 1/Kholds for all  s ∈ Sand x ∈ [Vmin, Vmax]. Hence, we can bound for all  x ∈ [Vmin, Vmax]


where in 1 , we applied Hoeffding’s inequality to the first term in combination with a union bound over ¯Xas  ¯v(·, x)can only take  | ¯X|values in  R|S|. The second term was bounded with H¨older’s inequality and in 2 the concentration inequality from Eq. (12) was used. Combining this bound with Eq. (11) by applying a union bound over all states and actions, we get that picking


is sufficient to apply Lemma 5. Since  v(s′, ·)is non-decreasing, the size of the discretization set is at most  | ¯X| ≤ |S|Kand by picking  K =�|S|, we see that  c = O (ln(|S||A|/δ))is sufficient.


Lemma 5. Let  Z, ˜Z ∈ Zbe such that  F ˜Z(s,a) ≤ FZ(s,a)for all (s, a). Assume finitely many states  |S| < ∞and actions |A| < ∞. Let ˆR(s, a)and ˆP(s, a)be the empirical reward distributions and transition probabilities. Assume further that


where  ˜c(s, a) = c√n(s,a)is the shift in the optimism operator  Oc. Then  FOc ˆT πZ˜(s,a) ≤ FT πZ(s,a)for all (s, a). Note that it is sufficient to pick  ˜c(s, a) ≥ ∥F ˆR(s,a) − FR(s,a)∥∞ + ∥ ˆP(·|s, a) − P(·|s, a)∥1to ensure the Assumption in Eq. (19).

Proof. We start with some basic identities which follow directly from the definition of CDFs


where the integrals are Lebesgue-Stieltjes integrals and understood to be taken over  [Vmin, Vmax]. We omit the limits in the following to unclutter notation. These identities allow us to derive expressions for  FT πZ(s,a)and  FOc ˆT πZ′(s,a):


Here we exchanged the finite sum with the integral by linearity of integrals. Using these identities, we will show that FOc ˆT π ˜Z(s,a)(x) − FT πZ(s,a)(x) ≤ 0for all x. Consider any fixed x and first the case where the max in  FOc ˆT π ˜Z(s,a)(x)is attained by 0. In this case  FOc ˆT π ˜Z(s,a)(x) − FT πZ(s,a)(x) = −FT πZ(s,a)(x) ≤ 0because CDFs take values in [0, 1]. For the second case, we combine (26) and (29) to write



In the following, we consider each of the terms (30)–(32) separately. Let us start with (30) and bound Fγ ˜Z(s′,a′)+R(s,a)(x) − FγZ(s′,a′)+R(s,a)(x)


where the identity follows from the basic identities in Eq. (20) and the inequality from the assumption that  F ˜Z(s,a) ≤ FZ(s,a)for all (s, a). Hence, the term in Eq. (30) is always non-positive. Moving on to the term in Eq. (31) which we bound with similar tools as


This yields that Eq. (31) is bounded by this as well


Finally, the last term from Eq. (32) can be bounded as follows

which is non-positive as long as  ˜c(s, a) ≥ ∥F ˆR(s,a) − FR(s,a)∥∞ + ∥ ˆP(·|s, a) − P(·|s, a)∥1which completes the proof.

Technical lemmas Lemma 6. Let F and G be the CDFs of two non-negative random variables and let  νF , νGbe a maximizing value of  νin the definition of  CVaRα(F)and  CVaRα(G)respectively. Then:


Proof. Assume w.l.o.g. that  CVaRα(F) − CVaRα(G) ≥ 0. Denote by  νFany maximizing value of  νin the definition of CVaRα(F). By Lemma 4.2 and Equation (4.9) in (Acerbi and Tasche 2002), a possible value of  νFis  F −1(α). Then we can write the differences in CVaR as


Using Lemma 7 in Equation (48) gives

We can in full analogy upper-bound  CVaRα(G) − CVaRα(F)and arrive at the desired statement.

Lemma 7. Let F be a CDF of a bounded non-negative random variable and  ν ∈ Rbe arbitrary. Then  EF [(ν − X)+] =� ν0 F(y)dy. Hence, one can write the conditional value at risk of a variable  X ∼ Ffor any CDF F with F(0) = 0 as


Proof. We rewrite  EF [(ν − X)+]as follows


where 1 follows from  a =� a0 dx =� ∞0 1{a > x}dxwhich holds for any  a ≥ 0and 2 uses Tonelli’s theorem to exchange the two integrals. Plugging this identity into


and taking the sup over  νgives the desired result.

Lemma 8. Let G and F be CDFs of non-negative random variables so that  ∀x ≥ 0 : F(x) ≥ G(x). Then for any  α ∈ [0, 1], we have  CVaRα(F) ≤ CVaRα(G).

Proof. Consider now the following difference


By Lemma 7, we have that

Let  νFdenote a value of  νthat achieves the supremum in 1α� ν0 (α − F(y))dy(which exists by Lemma 4.2 and Equation (4.9) in (Acerbi and Tasche 2002)). Then


Lemma 9. Consider a sequence of CDFs  {Fn}on  x ∈ [Vmin, Vmax]with  Vmin ≥ 0that converges in  ℓ2distance to  FOas n → ∞. Then  CVaRα(Fn) → CVaRα(FO)as  n → ∞.

Proof. Consider the sequence of Wasserstein  d1distance between the CDFs:


Where the second line follows by H¨older’s inequality. The right hand side goes to 0 as  n → ∞, which implies convergence in Wasserstein  d1distance (p=1). Finally, using Lemma 6, convergence of CVaR follows.

Designed for Accessibility and to further Open Science