b

DiscoverSearch
About
My stuff
State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning
2018·arXiv
Abstract
Abstract

In the framework of MDP, although the general reward function takes three arguments—current state, action, and successor state; it is often simplified to a function of two arguments—current state and action. The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective involves the expected total reward only, this simplification works perfectly. However, when the objective is risk-sensitive, this sim-plification leads to an incorrect value. We propose three successively more general state-augmentation transformations (SATs), which preserve the reward sequences as well as the reward distributions and the optimal policy in risk-sensitive reinforcement learning. In risk-sensitive scenarios, firstly we prove that, for every MDP with a stochastic transition-based reward function, there exists an MDP with a deterministic state-based reward function, such that for any given (randomized) policy for the first MDP, there exists a corresponding policy for the second MDP, such that both Markov reward processes share the same reward sequence. Secondly we illustrate that two situations require the proposed SATs in an inventory control problem. One could be using Q-learning (or other learning methods) on MDPs with transition-based reward functions, and the other could be using methods, which are for the Markov processes with a deterministic state-based reward functions, on the Markov processes with general reward functions. We show the advantage of the SATs by considering Value-at-Risk as an example, which is a risk measure on the reward distribution instead of the measures (such as mean and variance) of the distribution. We illustrate the error in the reward distribution estimation from the reward simpli-fication, and show how the SATs enable a variance formula to work on Markov processes with general reward functions.

The most widely used optimization criterion in reinforcement learning (RL) is represented by the expected total reward. This risk-neutral criterion renders two functions— the value function and the Q-function—important in many applications. The two functions can be considered as substitutes for the reward functions, even in risk-sensitive reinforcement learning (Borkar 2002; Shen et al. 2014; Garc´ıa and Fern´andez 2015; Junges et al. 2016; Gilbert and Weng 2016; Huang and Haskell 2017; Chow et al. 2017; Berkenkamp et al. 2017). However, the reward function is usually in a more complicated form. In a Markov decision process (MDP), we say the reward function is transition-based if it involves current state, action, and successor state, and state-based if it involves current state and action only. Furthermore, the reward could be stochastic. For a (stochastic) transition-based reward function, the value function, or Q-function, implies a reward simplification, which changes the reward sequence  {Rt}, and only preserves the mean of the reward distribution. This may not be sufficient in many risk-sensitive applications, especially when the small probability events have serious consequences, such as self-driving and medical diagnosis. That is where a risk-sensitive criterion should be considered. A risk-sensitive criterion refers to a risk measure, or a risk function, which maps a random variable to a scalar. Since the risk measure depends on the distribution of the random variable, the reward simplifica-tion will lead to an incorrect result.

The motivation for our proposed methods arises from a contradiction between theory and practice with respect to the reward function. On the one hand, some methods require MDPs, or Markov reward processes, to be with deterministic (and state-based) reward functions. On the other hand, for many practical problems, the underlying MDPs or Markov processes have stochastic (and transition-based) reward functions, and the reward simplification changes the reward sequences, as well as the reward distributions, which are crucial in risk-sensitive scenarios. This paper proposes three successively more general state-augmentation transformations (SATs) for different settings (Cases 1, 2, and 3) to solve this problem. With the aid of the SATs, we can apply methods, which requires a deterministic state-based reward function, to MDPs (or Markov processes) with stochastic transition-based reward function, and at the same time preserve the reward sequence (and the reward distribution). We study the return—the discounted total reward �∞t=0 γt−1Rt—in an infinite-horizon MDP with finite state and action spaces, and consider the Value-at-Risk (VaR) objective as a risk-sensitive example. We generalize the transformation in (Ma and Yu 2017) to three successively more general SATs (Cases 1, 2, and 3), give a proof for the most general one, and illustrate the error from the reward simpli-fication on the return distribution.

In Section 2, we define the notations for the MDPs with four types of reward functions and two policy spaces, and pin down the reward simplification as the main problem in a risk-sensitive setting. We consider two VaR objectives to show the effect of the reward simplification, since VaRs are functions of the reward distribution instead of measures (moments, such as mean and variance) of the distribution. An infinite-horizon MDP is defined for an inventory control problem, which is used as an example for the proposed transformations. The return variance formula requires the Markov process to be with a deterministic reward function. In Section 3, we propose SATs for different cases, and show the error from the reward simplification. In Section 4, we give a literature review on risk study in reinforcement learning and risk-aware Q-learning. In Section 5, we have a discussion on the proposed transformations.

Briefly, in a policy evaluation setting, when the objective is risk-sensitive and the Markov reward process is with a (stochastic) transition-based reward function, the return distribution should be preserved instead of the expectation only, and an appropriate SAT should be applied. In a control setting, when a randomized policy is considered, we can apply the SAT in Case 3 to preserve the reward sequence within a specific policy space. For related studies which concerned risk-sensitive problems in RL, when the reward function is not deterministic and state-based, we believe that they should be revisited with proposed transformations.

In this section, firstly we present the notations for MDPs with four types of reward functions and two policy space. Secondly, we define two VaR objectives and the VaR function. Thirdly, we consider an inventory control problem, which is a straightforward example of MDP with a transition-based reward function.

Markov Decision Processes In this paper we focus on infinite-horizon discrete-time MDPs, which can be represented by

image

in which S is a finite state space, and  Xt ∈ Srepresents the state at (decision) epoch  t ∈ N; Axis the allowable action set for  x ∈ S, A = �x∈S Axis a finite action space, and Kt ∈ Arepresents the action at epoch t; r is a bounded reward function, and  Rtdenotes the immediate reward at epoch  t; p(y | x, a) = P(Xt+1 = y | Xt = x, Kt =a) denotes the homogeneous transition probability;  µis the initial state distribution;  γ ∈ (0, 1)is the discount factor.

In this paper we study the distribution of the return �∞t=1 γt−1Rtin infinite-horizon MDPs. For  x, y ∈ S, a ∈Ax, here we consider four types of reward functions: the deterministic state-based reward

image

the deterministic transition-based reward

image

the stochastic state-based reward

image

and the stochastic transition-based reward

image

With a slight abuse of notation, we also represent, for example,  rST (j | x, y) = P(Rt = j | Xt = x, Xt+1 = y) ∈[0, 1] for a Markov reward process.

When the reward function is not  rDStype, it is often sim-plified in the expectation way. For example, given a  rDT, the reward function can be simplified to a  rDSby

image

where  x, y ∈ S, a ∈ Ax. In practical problems, stochastic reward functions are often naively simplified to  rDSfunctions in a similar way. In RL, when the expected return is considered, and the Q-function or the value function is accessed, which implies such a reward simplification. The effect of the reward simplification on return distribution in a finite-horizon Markov reward process has been studied in (Ma and Yu 2017). Here we estimate the distribution with assuming it is approximately normal, illustrate the similar effect on return distribution, and generalize the transformation for more practical cases.

A policy  πdescribes how to choose actions sequentially. For infinite-horizon MDPs, we focus on two stationary Markovian policy spaces: the deterministic policy space  ΠD, and the randomized policy space  ΠR. A (time-homogeneous) Markov reward process is tantamount to an MDP with a (randomized) policy. Randomized policy is often considered in constrained MDPs (Altman 1999). Given an MDP with a randomized policy, the reward function is often naively simplified as well. Since most risk measures are law invariant (Kusuoka 2001), we generalize the transformation for settings mentioned above, in order to preserve the return distribution.

Value-at-Risk

Value-at-Risk originates from finance. For a given portfolio (which can be considered as an MDP with a policy), a loss threshold (target level), and a time-horizon, VaR concerns the probability that the loss on the portfolio exceeds the threshold over the time horizon. VaR is hard to deal with since it is not a coherent risk measure (Riedel 2004). Two VaR problems described in (Filar et al. 1995) are considered as optional objectives. Given an initial distribution  µand a policy  πin a specified policy space  Π, denote the return by Φπµ, and here we simplify it to  Φ. Denote the return distribu- tion with the policy  πby  F πΦ. VaR addresses the following problems.

Definition 1. Given a quantile  α ∈ [0, 1], find the optimal threshold  ρα = sup{τ ∈ R | P(Φ > τ) ≥ α, π ∈ Π} =sup{τ ∈ R | F πΦ(τ) ≤ 1 − α, π ∈ Π}.

This problem refers to the quantile function, i.e.,  F π−1Φ.

Definition 2. Given a threshold  τ ∈ R, find the optimal quantile  ητ = sup{α ∈ [0, 1] | F πΦ(τ) ≤ 1 − α, π ∈ Π}.

This problem concerns  F πΦ. When the estimated return distribution is strictly increasing, any point along the function  inf{F πΦ | π ∈ Π}is (estimated)  (ρα, 1 − ητ)with τ = ραor  α = 1−ητ. Therefore, both VaR objectives refers to the infimum function, and here we call it the VaR function. Since most risk measures are law invariant, we consider VaR as an example to show the effect on the distribution from the reward simplification.

image

For  x ∈ S, denote the cost to order x units by c(x), a fixed cost  W ≥ 0for placing orders, then we have the order cost  o(x) = (W + c(x))1[x>0]. Denote the revenue when x units of demand is fulfilled by f(x), the maintenance fee by m(x). The real reward function is  r(Xt, Kt, Xt+1) =f(Xt + Kt − Xt+1) − o(Kt) − m(Xt).We set the parameters as follows. The fixed order cost W = 4, the variable order cost c(x) = 2x, the maintenance fee m(x) = x, the warehouse capacity M = 2, and the price f(x) = 8x. The probabilities of demands are  P(Dt = 0) =0.25, P(Dt = 1) = 0.5, P(Dt = 2) = 0.25respectively. The initial distribution  µ(0) = 1. In this infinite-horizon MDP, the reward function is deterministic and transition-based. The simplified reward function  r′can be calculated by Equation (2), which is state-based. As illustrated in Figure 1, now we have two MDPs with different reward functions:  ⟨S, A, r, p, µ, γ⟩and  ⟨S, A, r′, p, µ, γ⟩.

Reward Distribution Estimation

Since we consider risk from the distributional perspective instead of the expected return, we need to estimate the reward distribution. The functional distribution estimations in Makrov reward processes have been studied for decades (Woodroofe 1992; Meyn and Tweedie 2009). However, there is no related central limit theorem for the discounted sum of rewards (return). To simplify the estimation, we estimate the return distribution by considering mean and variance only, which implies the assumption that the return is approximately normal. For an infinite-horizon Markov reward process with a deterministic state-based reward function, Sobel (1982) presented the return variance formula for Markov reward processes.

Theorem 3. (Sobel 1982) Given an infinite-horizon Markov reward process  ⟨S, r′π, pπ, γ⟩with the finite state space S = {1, · · · , |S|}, the reward function  r′πdeterministic state- based and bounded, and the discount factor  γ ∈ (0, 1). Denote the transition matrix by P, in which P(x, y) =

image

Figure 1: An MDP with a transition-based reward function and its counterpart with a state-based reward function following reward simplication. Labels along transitions denote a(r(x, a, y), p(y|x, a)), and labels next to states denote  a(r′(x, a)), the state-based reward function simpli-fied with Equation (2). For example, the labels in bold are interpreted as follows: the label 2(0, 0.5) below the transition from 0 to 1 means that the reward r(0, 2, 1) = 0 and the transition probability is 0.5; the label 2(0) near state 0 means when  Xt = 0and  Kt = 2, the simplified reward r′(0, 2) = 0.

pπ(y | x), x, y ∈ S. Denote the conditional return expectation by  vx = E(Φ | X0 = x)for any deterministic initial state  x ∈ S, and the conditional expectation vector by v. Similarly, denote the conditional return variance by ψx = V(Φ | X0 = x), and the conditional variance vector by  ψ. Let  θdenote the vector whose xth component is θx = �y∈S pπ(x, y)(r′π(x) + γvy)2 − v2x. Then

image

Now with the aid of Theorem 3, we can estimate the return distribution for the ergodic Markov reward process. Notice that the variance formula is for Markov reward process with a deterministic reward function only. In next section, we generalize the transformation for the Markov reward processes and MDPs in different cases, estimate the return distribution in the inventory control problem with the aid of a transformation, and compare it with the one from the reward simplification.

In this section, we propose the state-augmentation transformations (SATs) for three cases. Case 1: a Markov reward process with a stochastic, transition-based reward function;

Case 2: an MDP with a stochastic transition-based reward function, and a randomized policy; and

Case 3: an MDP with a randomized policy space.

Case 1 can be considered as an MDP with a  rST(or rSS) and a deterministic policy. Case 2 refers to the constrained MDPs. Case 3 describes a direct policy search (gradient descent method, for example) scenario from a risk-sensitive perspective. In all the three cases, the reward functions are often simplified in a similar way as in Equation (2), which will lose all moment information except for the first one (mean). Noticing that the state-transition transformation (Ma and Yu 2017) is for a Markov reward process with a deterministic, transition-based reward function, which we define as Case 0. Since Case 0 is a special case of Case 1, we denote this relationship by Case 0  ≺Case 1. Similarly, the four cases have the relationship

image

Concisely, we review the original state-transition transformation in Algorithm 1, give a theorem (Theorem 4) and its proof for the most general Case 3, and two corollaries (Corollary 5, Corollary 6) for Case 2 and 1. Considering the relationship between the cases, all needed algorithms and theorems for different cases can be derived from the constructive proof for Theorem 4 with some slight changes.

SAT for Case 3

We give the transformation theorem for MDPs with a  rST(Equation (1)) in a control setting and prove it as follows.

Theorem 4 (Transformation for MDPs). Given an MDP ⟨S, A, r, p, µ⟩with r stochastic and transition-based, there exists an MDP  ⟨S†, A, r†, p†, µ†⟩with  r†deterministic and state-based, such that for any given policy (possibly randomized) for  ⟨S, A, r, p, µ⟩, there exists a corresponding policy for  ⟨S†, A, r†, p†, µ†⟩, such that both Markov reward processes share the same return distribution.

Proof. The proof has two steps. Step 1 constructs a second MDP and shows that, for every possible sample path in the first MDP, there exists a corresponding sample path in the second MDP. Step 2 proves that, the probability of any possible sample path in first MDP equals to the probability of its counterpart in the second MDP.

Step 1: Define  S = {1, · · · , |S|}. For  x, y ∈ S, a ∈ Ax, define  J = supp(r(· | x, a, y)). Define  S‡ = S2 ×A × J. In order to remove the dependency of the initial state distribution on policy, define a null state space W = {w1, · · · , w|S|}, and  W ∩ S‡ = ∅. Define the state space S† = S‡ ∪ W.

For all  x† = (x, a, y, j), y† = (y, a′, y′, j′) ∈S†,where  y′ ∈ S, a′ ∈ Ay, j, j′ ∈ J, define the state-based reward function  r†(j | x†, ·) = 1, and  r†(0 | w·, ·) = 1; de-fine the transition kernel  p†(y† | x†, a′) = p(y′ | y, a′)r(j′ |y, a′, y′), and  p†(x† | wx, a) = p(y | x, a)r(j | x, a, y); the initial state distribution  µ†(wx) = µ(x).

Now we have two MDPs. Let MDP1 = ⟨S, A, r, p, µ⟩, and MDP2 = ⟨S†, A, r†, p†, µ†⟩. For any sample path (x1, a1, j1, x2, a2, j2, x3, a3, j3, x4 · · · )in MDP1, there

exists a sample path

(wx1, a1, 0, (x1, a1, j1, x2), a2, j1, (x2, a2, j2, x3), a3, j2, (x3, a3, j3, x4), · · · )

in  MDP2. Therefore, we proved that for every possible sample path for the first MDP, there exists a corresponding sample path for the second MDP.

Step 2: Next we prove the probabilities for the two sample paths are equal. Here we prove it by mathematical induction. Set time epoch to n after the first  xn+1in  MDP1, and after the first  (xn, an, jn, xn+1)in  MDP2.

Denote the partial sample path till epoch i by  ιiin  MDP1. Given any  π ∈ ΠRfor  MDP1, the probability for the sample path before epoch 1 in  MDP1is

image

There exists a policy  π†for  MDP2, with  π†(a | wx) =π(a | x), and  π†(a | x†) = π(a | y). The probability for the sample path before epoch 1 in  MDP2is

image

= µ(wx1)π†(a1 | wx1)p((x1, a1, j1, x2) | wx1, a1)= P(ι1 = (x1, a1, j1, x2)).

Therefore, at epoch 1, the two partial sample paths share the same probability.

Assuming that the two partial sample paths share the same probability at epoch n, then the probability for the sample path before epoch n + 1 in  MDP1is

image

The probability for the sample path before epoch n + 1 in MDP2is

image

By induction we proved that, the probability of any possible sample path in  ⟨S, A, r, p, µ⟩equals to the probability of its counterpart in  ⟨S†, A, r†, p†, µ†⟩. As a subsequence of a sample path, the reward sequence  {Rt}is preserved with a null reward at t = 1. Given the discount factor  γ, we can compensate this time drift effect simply by setting r†(x†) = r†(x†)/γto preserve the return distribution. Theorem 4 is proved.

image

It is worth noting that the SATs also works for some risk measures that are not law invariant, such as dynamic risk

image

Figure 2: (a) The Markov reward process for the MDP in Figure 1 with the policy  π. This Markov process has a transition-based reward function. The labels along transitions denote  rπ(x, y)(pπ(x, y)), and the labels  r′π(x)near states denote the state-based reward function simplified with Equation (2); (b) The transformed Markov reward process with a state-based reward function. For a Markov reward process with a deterministic transition-based reward function, the transformation takes transitions as states and attach each possible reward to a state, in order to preserve the reward sequence. The labels along transitions denote p†π(x†, y†), and the labels  r†(x†)near states denote the state-based reward function from the transformation.

measures (Ruszczy´nski 2010), since  {Rt}can be preserved as well. It should also be pointed out that, since the state space is augmented, the SATs has an effect on the computational complexity. Denote the complexity for an algorithm by T(|S|, |A|), it becomes  T(|S2 × A × J| + |S|, |A|)when the SAT for Case 3 is implemented.

SAT for Case 2 and 1

Here, we present SATs for MDPs in policy evaluation settings. Given an MDP with a randomized policy  π ∈ ΠR, the reward function is often simplified as well. Taking a deterministic state-based reward function  rDSfor example, the reward function is simplified to �y∈S π(a|x)rDS(x, a), x, y ∈ S, a ∈ Ax. In order to pre- serve the reward sequence, one way is to consider action in a “situation” in Remark 1. Remark 1 (State augmentation in the transformations). In order to transform a Markov reward process with a  rST(or rSS, rDT) to the one with a  rDS, and preserve the return sequence at the same time, a bijective mapping between a new “augmented” state space and the possible “situation” space is needed. For a Markov reward process with  rST, a possible situation can be defined by a tuple  ⟨x, y, j⟩, in which x, y ∈ S, j ∈ supp(rST (· | x, y)).

Next, we present the following corollaries for Case 2 and Case 1.

Corollary 5 (Transformation for MDP with a randomized policy). For a Markov decision process  ⟨S, A, r, p, µ⟩with r stochastic and transition-based, and a randomized policy  π ∈ ΠR, there exists a Markov reward process ⟨S†, r†π, p†π, µ†π⟩with  r†πdeterministic and state-based, such that both processes share the same return sequence.

Similarly, we present a corollary for Case 1, which is a special case of Corollary 5. Corollary 6 (Transformation for Markov process with a stochastic transition-based reward function). For a Markov reward process  ⟨S, rπ, pπ, µπ⟩with  rπstochastic and transition-based, there exists a Markov reward process ⟨S†, r†π, p†π, µ†π⟩with  r†πdeterministic and state-based, such that both processes share the same return sequence.

SAT for Case 0 When we use Q-learning, or other Q-function (value function) based learning methods in an MDP with a reward function which is not deterministic and state-based, it implies a reward simplification similar to Equation (2). This sim-plification changes the reward distribution as well as a risk measure. In this subsection, we show the effect on the distribution from the reward simplification. We apply the state-transition transformation (Ma and Yu 2017)—which is the SAT for Case 0—in an infinite-horizon with a discount factor setting, estimate the reward distribution and the VaR result, and compare them with the ones from reward simplifi-cation.

Figure 1 describes an MDP with transition-based reward function and its counterpart with state-based reward function following reward simplication. In this MDP, consider a deterministic policy  π = [2, 1, 0]—to order 2, 1, 0 item(s) when the inventory level is 0, 1, 2, respectively—then we

image

have a Markov reward process illustrated in Figure 2(a). In order to attach each possible reward value to a state to construct a  rDS, we consider each transition as a state, and apply the State-Transition Transformation (Algorithm 1 (Ma and Yu 2017)). The transformed Markov reward process is presented in Figure 2(b), where only some of the transitions are shown. Given the same Markov reward process without the transformation, the reward function is often simplified by Equation (2), which only preserves the expectation.

Now we set  γ = 0.95and compare the two return distributions—one from the transformation and the other from the reward simplification—with the averaged empirical return distribution. Since there is no central limit theorem for discounted Markov processes, to simplify the estimation, we estimate the return distribution by involving mean and variance only, which implies the assumption that the return is approximately normal. With the aids of Theorem 3, the two distributions are shown in Figure 3. The averaged empirical return distribution is from a simulation repeated

image

Figure 3: Comparison among the averaged empirical return distribution with error region, the estimated return distribution from the transformation, and the estimated return distribution from the reward simplification.

50 times with a time horizon 1000, with the error region representing the standard deviations of the means along return axis. The Kolmogorov–Smirnov statistic  DKS(Durbin 1973) is used to quantify the distribution difference (error). Denote the averaged empirical return distribution by ˆF πΦ, the estimated distribution for the transformed process by  QπΦ, and the estimated distribution for the process with the reward simplification by  Q′πΦ. For the case in Figure 3, DKS(Q′πΦ , ˆF πΦ) = supφ∈R |Q′πΦ (φ) − ˆF πΦ(φ)| ≈ 0.145, and  DKS(QπΦ, ˆF πΦ) ≈ 0.012. The results show that, the reward simplification leads to a nontrivial estimation error (0.145 > 0.012). Notice that the direct use of Q-learning will result in a risk measure on a learned distribution, which is supposed to converge to the estimated distribution with the reward simplification.

Now consider the VaR objective (VaR Definition 2 for example). The VaR function is obtained by enumerating the deterministic policies. Figure 4 shows the two estimated VaR functions. Since the VaR function can also be regarded as a return distribution, we can still use  DKSto measure the error from the reward simplification, and in this case DKS ≈ 0.150. Denote the optimal quantile for the MDP with the reward simplification by  ˆητ, then the error bound for the optimal quantile  sup{|ˆητ − ητ| : τ ∈ R} ≈ 0.150, which is nontrivial in this risk-sensitive problem. Remark 2 (Smaller variance from reward simplification). In Figure 3 we can tell that the distribution for the process with the reward simplification has a smaller variance. The reason can be intuitively explained by the analysis of variance (Scheff´e 1999). Taking the deterministic transition-based reward function for example. Considering the possible rewards for the same current state as a group, the variance for  Rtincludes the variances between groups and the vari-

image

Figure 4: Comparison among the averaged empirical VaR function, the estimated VaR function from the transformation, and the estimated VaR function from the reward sim-plification.

ances within groups. When the reward function is simplified with Equation (2), the variances within groups are removed, so the variance is smaller.

From the inventory example we can see that, some methods require Markov processes to be with deterministic and state-based reward functions only, and the reward simplifi-cation changes the reward sequences (distributions). If we want to use Q-learning, or other methods for Markov processes with  rDSin a risk-sensitive scenario, we should implement an appropriate SAT first.

The risk concerns arise in RL in two aspects. One refers to the “external” uncertainty in the model parameters, and this problem is known as robust MDPs. In the robust MDPs people optimize the expected return with the worst-case parameters, which belongs to a set of plausible MDP parameters. For example, an MDP with uncertain transition matrices (Nilim and Ghaoui 2005). The other one refers to the “internal” risk, which studies the stochastic property of the process. In general, there are three internal risk-sensitive objective classes in RL area, which have been studied for decades. One is the mean-variance risk measure (White 1988; Sobel 1994; Mannor and Tsitsiklis 2011), also known as the modern portfolio theory, in which the expected return is maximized with a given risk level (variance). The other is target-percentile risk measure (Filar et al. 1995; Wu and Lin 1999). This risk measure formulate the objective in terms of the probabilities of certain targets (or quantiles), and optimize, for example, the expected return. Another is utility risk measure, in which the exponential utility function is concerned. The internal risk concerns arise not only mathematically but also psychologically. A classic example in psychology is the “St. Petersburg Paradox,” which refers to a lottery with an infinite expected reward, but people only prefer to pay a small amount to play. This problem is thoroughly studied in utility theory, and a recent study brought this idea to RL (Prashanth et al. 2016). Some risk measures are coherent (Artzner et al. 1998), which share some intuitively reasonable properties (convexity, for example). Ruszczy´nski and Shapiro (2006) presented a thorough study on coherent risk optimization.

Q-learning has been studied in risk-sensitive RL for decades.Many risk-sensitive Q-learning studies are for MDPs with an  rDS. Borkar (2002) proved the convergence of Q-learning for an exponential utility cost objective with an ordinary differential equation method. A trajectory-based algorithm which combines policy gradient and actor-critic methods was presented to solve a CVaR-constrained problem (Chow et al. 2017). For robust MDP problems, with considering a set of general uncertainties (random action, unknown cost and safety hazards), an approach was provided to compute safe and optimal strategies iteratively (Junges et al. 2016). Q-learning has also been used to provide risk-sensitive analysis on the fMRI signals, which provides a better interpretation of the human behavior in a sequential decision task (Shen et al. 2014). The expectation-based worst case risk measure might not need the proposed SATs. For example, the minimax risk measures studied in (Huang and Haskell 2017). For a comprehensive survey on safe RL, see (Garc´ıa and Fern´andez 2015).

The proposed SATs transform MDPs and Markov reward processes with stochastic transition-based reward functions into ones with deterministic state-based reward functions, and preserve the reward sequences (and distributions) for risk-sensitive objectives. In an infinite-horizon time-homogeneous MDP for an inventory control problem, we illustrate the error on the distribution from the reward sim-plification. Taking the advantage of the variance formula presented by Sobel (1982), we estimate the return distribution for a Markov reward process. Since many RL methods require the reward function to be deterministic and state-based, the transformation is needed for the MDPs with other types of reward functions in the risk-sensitive problems. We generalize the transformation (Ma and Yu 2017) in different settings, and consider VaR as an example to show the effect of reward simplification on distribution.

In many practical problems, the rewards are simplified to deterministic and state-based. When the MDP is with a reward function which is not an  rDStype, the direct use of Qlearning also implies such a reward simplification. In this paper, the error from the reward simplification on distribution is illustrated, which is crucial in risk-sensitive cases. By implementing the transformation instead of the reward simpli-fication, the MDPs with different types of reward functions and (or) randomized policies are transformed to the ones with deterministic and state-based reward functions with an intact reward distribution.

The essence of the SATs is to attach each possible reward value to an augmented state to preserve the reward sequence. This attachment is crucial in risk-sensitive reinforcement learning, considering the wide application of value function and Q-function. Without the proposed SATs, the learned Qfunction (value function) only estimates the expected value of a given state-action pair (state) when the reward function is (stochastic) transition-based. Now with the SATs, the Qfunction (value function) can be considered as an approximation of the “real” value of a given augmented state-action pair (state). In other words, the proposed transformations “transform” the uncertainties from the transition, action, and the stochasticity of the reward function to the augmented state space. The proposed SATs present a platform for Qlearning in risk-sensitive RL, and we believe that many related studies should be revisited with the proposed SATs instead of applying the common-used reward simplification directly.

This research was supported in part by the scholarship from China Scholarship Council (CSC), and the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grants 509935 and 512046.

[Altman 1999] Altman, E. 1999. Constrained Markov Decision Processes. CRC Press.

[Artzner et al. 1998] Artzner, P.; Delbaen, F.; Eber, J.; and Heath, D. 1998. Coherent measures of risk. Mathematical Finance 9(3):1–24.

[Berkenkamp et al. 2017] Berkenkamp, F.; Turchetta, M.; Schoellig, A.; and Krause, A. 2017. Safe model-based reinforcement learning with stability guarantees. In Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS), 908–918.

[Borkar 2002] Borkar, V. S. 2002. Q-learning for risk-sensitive control. Mathematics of Operations Research 27(2):294–311.

[Chow et al. 2017] Chow, Y.; Ghavamzadeh, M.; Janson, L.; and Pavone, M. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18(1):6070–6120.

[Durbin 1973] Durbin, J. 1973. Distribution Theory for Tests based on the Sample Distribution Function. SIAM.

[Filar et al. 1995] Filar, J. A.; Krass, D.; Ross, K. W.; and Member, S. 1995. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control 40(I):2–10.

[Garc´ıa and Fern´andez 2015] Garc´ıa, J., and Fern´andez, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1):1437– 1480.

[Gilbert and Weng 2016] Gilbert, H., and Weng, P. 2016. Quantile reinforcement learning. arXiv:1611.00862.

[Huang and Haskell 2017] Huang, W., and Haskell, W. B. 2017. Risk-aware Q-learning for Markov decision processes. In Proceedings of the 56th IEEE Conference on Decision and Control (CDC), 4928–4933.

[Junges et al. 2016] Junges, S.; Jansen, N.; Dehnert, C.; Topcu, U.; and Katoen, J.-P. 2016. Safety-constrained reinforcement learning for MDPs. In Proceedings of the 22nd International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 130–146. Springer.

[Kusuoka 2001] Kusuoka, S. 2001. On law invariant coher- ent risk measures. In Advances in Mathematical Economics. Springer. 83–95.

[Ma and Yu 2017] Ma, S., and Yu, J. Y. 2017. Transition- based versus state-based reward functions for MDPs with Value-at-Risk. In Proceedings of the 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 974–981.

[Mannor and Tsitsiklis 2011] Mannor, S., and Tsitsiklis, J. 2011. Mean-variance optimization in Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML), 1–22.

[Meyn and Tweedie 2009] Meyn, S. P., and Tweedie, R. L. 2009. Markov Chains and Stochastic Stability. Springer Science & Business Media.

[Nilim and Ghaoui 2005] Nilim, A., and Ghaoui, L. E. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5):780–798.

[Prashanth et al. 2016] Prashanth, L. A.; Jie, C.; Fu, M.; Marcus, S.; and Szepesv´ari, C. 2016. Cumulative prospect theory meets reinforcement learning: Prediction and control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 1406–1415.

[Puterman 1994] Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.

[Riedel 2004] Riedel, F. 2004. Dynamic coherent risk measures. Stochastic Processes and their Applications 112(2):185–200.

[Ruszczy´nski and Shapiro 2006] Ruszczy´nski, A., and Shapiro, A. 2006. Optimization of convex risk functions. Mathematics of Operations Research 31(3):433–452.

[Ruszczy´nski 2010] Ruszczy´nski, A. 2010. Risk-averse dy- namic programming for Markov decision processes. Mathematical Programming 125(2):235–261.

[Scheff´e 1999] Scheff´e, H. 1999. The Analysis of Variance. John Wiley & Sons.

[Shen et al. 2014] Shen, Y.; Tobia, M. J.; Sommer, T.; and Obermayer, K. 2014. Risk-sensitive reinforcement learning. Neural Computation 26(7):1298–1328.

[Sobel 1982] Sobel, M. J. 1982. The variance of discounted Markov decision processes. Journal of Applied Probability 19(4):794–802.

[Sobel 1994] Sobel, M. J. 1994. Mean-variance tradeoffs in an undiscounted MDP. Operations Research 42(1):175– 183.

[White 1988] White, D. J. 1988. Mean , variance , and probabilistic criteria in finite Markov decision processes : A review. Journal of Optimization Theory and Applications 56(1):1–29.

[Woodroofe 1992] Woodroofe, M. 1992. A central limit the- orem for functions of a Markov chain with applications to shifts. Stochastic processes and their Applications 41(1):33– 44.

[Wu and Lin 1999] Wu, C., and Lin, Y. 1999. Minimizing risk models in Markov decision process with policies depending on target values. Journal of Mathematical Analysis and Applications 23(1):47–67.


Designed for Accessibility and to further Open Science