In reinforcement learning (RL), the standard criterion is the expectation of (discounted) total reward. However, stronger reliability guarantees are required in many problems, especially where the small probability events have serious consequences, such as self-driving cars and medical diagnosis. In these cases, a risk-sensitive criterion should be considered. In sequential decision making problems, a risk-sensitive criterion refers to a risk measure, or a risk function, which assigns a scalar to a reward sequence . In this paper, we focus on law-invariant risk measures [1], which are functions mapping a set of random variables to the real numbers. In an infinite time-horizon Markov decision process (MDP), a law-invariant risk measure is usually on the return
, where
is the discount factor. Two risk measures are considered as examples: mean-variance risk and exponential utility risk.
We show that, with the return variance formula and the state-augmentation transformation (SAT), the two risks can be estimated in an MDP with a stochastic transition-based reward and a randomized policy. The variance can be calculated in a Markov process, but with a deterministic reward function only. Besides, a number of RL methods have similar requirements [2], [3], [4], [5], [6] even in a risk-sensitive scenario. However, the reward functions are usually stochastic in many practical problems. It has been shown that, when the objective is risk-sensitive, and the reward needs to be converted to a simple form, the SAT should be implemented instead of a reward simplification [7]. In this paper, we present the SAT in a homomorphism version, and thoroughly discuss its pros and cons. A numerical experiment
Shuai Ma and Jia Yuan Yu are with the Concordia Institute of Information System Engineering, Concordia University, Montreal, Canada. Email: m shua@encs.concordia.ca, jiayuan.yu@concordia.ca.
shows how an SAT preserved the variance in an MDP with a randomized policy, comparing with the reward simplification.
The contribution of this paper is twofold. Firstly, we extend the SAT with a state lumping theorem to relieve the enlarged state space. Secondly, we estimate the two law-invariant risk measures with the return variance. In Section II, we review different types of internal risk measures in MDPs, and provide two variance-based estimation formulas. In Section III, we give the MDP notations, restate the SAT as a homomorphism of MDP. In Section IV, we propose a novel definition of isotopic states for state lumping in transformed Markov process, illustrate the SAT in a toy MDP example, and show the error on the two risks from naive reward simplification. In Section V, we have the conclusions and discuss the future research.
In RL, uncertainty is studied from two perspectives. One is the external uncertainty, which refers to the parameter uncertainty or disturbance. When the model is unknown, its parameters are usually estimated first, and then a model-based approach can be implemented to achieve the optimal policy. However, the parameter estimation depends on noisy data in practice, and the modeling errors may result in negative consequences. In control theory, this problem is known as robust control. Robust control methods consider the uncertain parameters within some compact sets, and optimize the expected return with the worst-case parameters, in order to achieve good robust performance and stability [8].
The other risk concerns the inherent (or internal) uncertainty, which results from the stochastic nature of the process. The inherent risk can be quantified by a dynamic measure [9] or a law-invariant measure [1]. The immediate reward at epoch t is defined by , a dynamic measure can be denoted in general as
which is sensitive to the order of the immediate rewards. Dynamic measures are usually assumed to have a set of properties, such as Markov, monotonicity and coherence, which yields a time-consistent risk measure with a nested structure. In this paper, we focus on law-invariant risk measures. Given a discount factor , a law-invariant measure in an infinite horizon is a functional
on the return
, where
. Three types of law-invariant risk have been widely studied in RL area.
Mean-variance risk: The mean-variance risk is also known in finance as the modern portfolio theory [10], [11], [12]. The mean-variance analysis aims at optimal return at a given level of risk, or the optimal risk at a given level of return. In RL, several mean-variance models have been studied. The variance and the standard deviation of the return are denoted by and
, respectively. One model could be
where k is a risk parameter, and when k > 0, it is a risk-averse objective. This is the first mean-variance model for exploring inventory management related problems [13]. The other model can be maximizing the expected return with a variance constraint, or minimizing the variance with an expected return constraint [14]. For a review on mean-variance risk, see [15].
Utility risk: Another type of risk measure can be utility risks. The original goal of a utility function is to represent the subjective preference [16]. One classic example can be the “St. Petersburg Paradox,” which refers to a lottery with an infinite expected reward, but no one would put up an arbitrary high stake to play it, since the probability of obtaining a high enough reward is too small. Mathematically, a utility function is a mapping from objective value space for all possible outcomes to subjective value space. A utility objective is usually in the form
, where U is a strictly increasing function. The most common used utility risk in RL is the exponential utility [17]
where models a constant risk sensitivity that risk-averse when
. This can be seen more clearly with the Taylor expansion of the utility
Quantile-based risk: The last type of risk measure used in practice refers to quantiles, which requires us to pay attention to discontinuities and intervals of quantile numbers. A commonly used quantile-based risk measure is value at risk (VaR). For VaR estimation with the SAT, see [7].
In short, we claim that most, if not all, inherent risk measures depend on the reward sequence , which can be preserved by the SAT in a risk-sensitive scenario. For law-invariant risk estimations, since VaR estimation in a Markov process with a stochastic reward has been throughly studied in [7], in this paper, we focus on the first two risk estimations with Eq. ????, respectively. In a MDP with a randomized policy, the return variance
is required by both of the two risk estimations. With the aid of the SAT,
can be calculated in a Markov process with a stochastic reward, which allows risk evaluation in practical RL problems.
In this section, firstly, we present notations of MDPs with a deterministic and stochastic reward. Secondly, we restate the state-augmentation transformation (SAT) as an MDP homomorphism. Thirdly, we have a discussion on the pros and cons of the SAT in different cases.
A. Markov Decision Processes
In this paper, we focus on infinite-horizon discrete-time MDPs with finite states and actions. An MDP with a deterministic reward can be represented by
in which S is a finite state space, and represents the state at (decision) epoch
is the allowable action set for
is a finite action space, and
A represents the action at epoch t; r is a bounded reward function, and
denotes the immediate reward at epoch t;
denotes the homogeneous transition probability;
is the initial state distribution;
is the discount factor. Since a number of risk-sensitive RL studies require the reward function to be deterministic and state-based [7], we only consider
R in a transformed MDP (or Markov process). Similarly, an MDP with a stochastic reward can be represented by
in which J is a finite subset of R, and is the set of possible values of the immediate rewards; d is the reward distribution, with y) the probability that the immediate reward at time t is j, given current state x, action a, and next state y.
A policy refers to a sequence of decision rules (
) for
, which describes how to choose actions sequentially. A randomized decision rule is
, i.e., for a given state, it outputs a distribution on the action space. Randomized policy is often considered in constrained MDPs [18] or MDPs with a risk objective. For an MDP with an infinite time horizon, we consider a stationary policy, which can be considered as one decision rule, i.e.,
. A Markov reward process is tantamount to an MDP with a randomized policy. A Markov process with a deterministic reward function can be represented by
In this study, we consider an MDP with a randomized policy. The function of SAT is to transform to M and preserve
, and this problem is concerned in four cases [7]. To illustrate how the SAT with state lumping works, we consider the Case 3 as an example, which refers to an MDP with a randomized policy.
B. Homomorphism Version of the Transformation
Here, we restate the SAT as an MDP homomorphism, which is more general from a mathematical perspective. In many practical problems, the rewards of the Markov processes are stochastic, but many methods may require the reward in a simple form. To enable the method in a risk-sensitive case, we may use the SAT to transform the Markov process and preserve the reward sequence . In a sense, the SAT generalizes a number of theoretical studies which are originally designed to work for MDPs or Markov processes with deterministic reward only. In this paper, the example is that the variance formula is for Markov processes with deterministic reward functions only. We restate the SAT as an MDP homomorphism. Comparing with the original SAT theorem [7], the homomorphism version of SAT works on a more abstract level. An MDP homomorphism is a formalism that captures an intuitive notion of specific equivalence between MDPs [19]. In order to convert an MDP M with d to
with r and preserve
, we consider each “situation”, which determines immediate reward, as an augmented state. We can then attach each possible reward value to an augmented state in
. Formally, we define the SAT homomorphism as follows.
Definition 2 (Policy lift): Let be a homomorphic image of M under h. Let
be a stochastic policy in M. Then
lifted to
is the policy
such that
for
and
for
.
Given an MDP with a policy, the randomness of the induced Markov reward process can be studied in its underlying probability space.
Definition 3 (Underlying probability space): Let be a probability space, and (E, B) a measurable space with
. An induced Markov reward process can be represented by an (E, B)-valued stochastic process on
with a family
of random variables
for
is called the underlying probability space of the process
. For all
, the mapping
is called the trajectory of the process with respect to
. The process
is progressively measurable with respect to the filtration
.
A homomorphism version of the SAT theorem is as follows, which claims that the probability measure on trajectories is preserved under h. Therefore, as a subsequence of sample path, the probability measure on is preserved as well.
Theorem 1 (Probability measure preservation): Let be an image of M under homomorphism h. Let
be the stochastic policy lifted from
. For the two processes M with
and
with
, there exists a bijection
, such that for the underlying probability space
for the first process, we have a sample path probability space
Fig. 1: A toy example with two states and two actions. The labels a(j, q) along transitions represent the action a, the immediate reward j, and the probability q = p(y | x, a)d(j | x, a, y).
for the second process, such that for any
.
The SAT theorem claims that, for an M with a stochastic transition-based reward, there exists an with a deterministic state-based reward, such that for any given
for M, there exists a corresponding
for
, such that both Markov reward processes share the same
.
Pros and cons of the SAT: There are mainly two benefits brought by the SAT. Firstly, it converts an M (or ), which has a discrete stochastic reward distributed on a finite support, into an
(or
) with a deterministic state- based reward function. The essence of the SAT is a surjective mapping from the situation space [7] to the reward space J. This mapping not only extends a number of methods provided for
to work for M, but also allows us to analyze the immediate rewards as states with methods such as z-transform [?]. Secondly, action is “removed” from the reward function and becomes a component of the augmented state. This renders action to impact on transitions only. There are two disadvantages as well. One is the size of the augmented state space is now (
), which is much larger than |S|. The other is that the SAT removes the recurrence property from a recurrent MDP, which may result in a prohibition of some methods. In the next section, we apply the SAT to an MDP with a randomized policy, estimate the two risks, and compare the results with the estimations from reward simplification.
In this section, we use a toy example to illustrate the SAT with a state lumping aiming at shrinking the augmented state space. Furthermore, we show the risk estimations and errors from reward simplification. Firstly, we describe the MDP with two states and a randomized policy, and propose a novel definition of isotopic class, in which all contained states can be regarded as one state in return study. Secondly, we estimate the two risks with the return variance calculation method, and show how the reward simplification change the risk estimations.
A. Isotopic State lumping in a Toy Example
Consider the process illustrated in Fig. 1 with two states and two actions, in which state 1 has two actions and state 1 has one. The process starts from state 1 at time t = 1. Let’s consider the randomized policy — uniformly choose an action for state 1—then we have a Markov reward process illustrated in Fig. 2(a). In order to calculate the return variance, the Markov process needs to have a deterministic reward function. One way to achieve this is to naively simplify the reward function by calculating the expectation conditioned on state. In the MDP with
, the reward simplification refers to
for . The induced Markov process is shown in Fig. 2(b). The other way to acquire a deterministic reward function is through SAT. A suitable SAT renders a deterministic reward function and preserve
. However, SAT also enlarges the state space from |S| to (
) in general. Considering the special structure of
, we propose a definition of isotopic states as an alleviation of the enlarged state space.
Definition 4 (Isotopic states): In a Markov process , if there exist two states
, such that,
• Condition 1: ; and
• Condition 2: for
, then we say
and
are isotopic. With the definition of isotopic states, we propose a theorem on reward preservation in state lumping as follows.
If the two states are isotopic in a Markov process
, then there exists a Markov process
, in which
, and for
and
,
(a) A Markov process with a stochastic transition-based reward function. The labels (j, q) along transitions represent the immediate reward j and the probability x, y represent the current and the next states, respectively. The two underlined situations are isotopic states, which can be lumped into one augmented state in the transformed Markov process.
(b) The Markov process from reward simplification. The labels q along transitions represent the probability labels r(x) in the text boxes represent the deterministic state-based reward values from reward simplification.
Fig. 2: The reward simplification on the Markov process with a stochastic transition-based reward function.
The results of this section are based in part on the results of [20], [21], in which the lumpability in Markov processes was thoroughly studied. By comparison, the Definition 4 and the Theorem 2 are based on single states instead of partitions of state space, i.e., in Condition 2 in Definition 4, and y are single states instead of partitions. Besides the fact that it is hard to partition a general state space
Fig. 3: The transformed Markov process with a deterministic state-based reward function. Some transitions are hidden. The underlined state comes from lumping two situations from the original Markov process in Fig. 2(a).
Fig. 4: The comparison among the empirical mean-variance risk (with a narrow error region zoomed in), the estimated mean-variance risk from reward simplification and the estimated mean-variance risk from SAT along the risk parameter .
with equivalent partitioned transition probabilities (Definition 8.4 in [22]), the main reason refers to a characteristic of the transformed transition probability from the SAT. Let’s consider the transformed Markov process in Fig. 3 with a state space S as an example. Since the transition distribution for
depends on
only, a sufficient condition for two states being isotopic could be that, the two states share the last two components. This sufficient condition can be stated as follows.
In a transformed Markov process with a state space S, if there exist two states , such that
and i = j, then x and y are isotopic.
Now, we can clearly see in Fig. 3 that, the two states and
are isotopic as well, which can be lumped to further simplify the analysis. In summary, for states which satisfied the two conditions in Definition 4, we may regard them as one state and keep
intact. Considering that the transition probability in the transformed Markov process is only sensitive to the start and end original states, the augmented state space can be shrunk under a fair condition. The transformed Markov process is illustrated in Fig. 3 with some transitions hidden. Notice that the underlined state is generated by lumping the two situations shown in Fig. 2(a). Though the isotopic class lumping is not specially designated for SAT, the transformed Markov process has a fair chance to relieve from the enlarged state space to some degree. In the next section, we estimate the two risks on the return with the aid of the SAT.
B. Risk Estimations and Errors from Reward Simplification
Here, we estimate the mean-variance risk and the exponential utility risk by Eq. ????, respectively. As shown in Section II, the two risks can be estimated with the return variance. For an infinite-horizon Markov reward process with a deterministic reward function, Sobel presented the formula
Fig. 5: The comparison among the empirical exponential utility risk, the estimated exponential utility risk from reward simplification, and the estimated exponential utility risk from SAT along the risk parameter .
for the return variance. See [23] for further information. Notice that the variance formula is for Markov processes with deterministic reward functions only. Though we have the SAT for Markov process with a discrete stochastic reward, how to apply the method to the ones with a continuous stochastic reward is a concern.
For each risk, we compare three estimations: empirical estimation, estimation from reward simplification, and estimation from SAT. The empirical estimation is calculated as follows. We run L = 50 groups of simulations to calculate the variance of an estimation, in each group we run M = 1000 simulations of the Markov process, and in each simulation we set the time horizon N = 1000.
Mean-variance risk estimation: The empirical estimation of mean-variance risk with a risk parameter k is
in which
where is an outcome of a simulation in group i, and
. The comparison among the empirical mean-variance risk, the estimated mean-variance risk from reward simplification, and the estimated mean-variance risk from SAT along the risk parameter
is shown in Fig. 4. For different k, the empirical mean-variance risk
is illustrated with an error region representing the standard deviations of the means. Since the error region is so narrow that it is hardly seen, we zoom in a piece to make it clear. Based on the observation, we can see that, the estimated mean-variance risk from SAT is close to the empirical mean-variance risk, but its counterpart from reward simplification is not. That is because the return variance is preserved by the SAT.
Exponential utility risk estimation: The empirical estimation of exponential utility risk with a risk parameter is calculated by
in which
where is an outcome of a simulation in group i, and
. The comparison among the empirical exponential utility risk, the estimated exponential utility risk from reward simplification, and the estimated exponential utility risk from SAT along the risk parameter
is shown in Fig. 5. It is worth noting that, the utility value at
is set by the average of the two adjacent values, since as the denominator in the Eq. ??,
cannot be zero. Based on the observation, we can see that when
, the estimated risk from SAT is close to the empirical one, but its counterpart from reward simplification is not. That is again because the return variance is preserved by the SAT. It is also found that, for the case
in Fig. 6, the estimated risk from SAT goes far from the empirical one. That is because there exists a term
in Eq. ??, and when
deviates too much from zero, this term brings a large error. Therefore, the estimation from SAT works only when
is close to zero, and its counterpart from reward simplification has a large error in both cases.
Thus far, we have shown that, the reward simplification changes the two risk estimations. There are a number of methods for MDPs and Markov processes with rewards which are not stochastic and transition-based, such as [24] on mean-variance risk and [2], [25] on exponential utility risk. We believe that, when people apply these methods for practical problems with stochastic and transition-based rewards, they should revisit the methods with the SAT instead of the reward simplification. This concern refers to the case when randomized policy is involved as well.
In this paper, we estimate the mean-variance risk and exponential risk with the return variance in a toy example. With the aid of the SAT, the variance formula is extended for MDPs with a stochastic transition-based reward and a randomize policy. Moreover, a definition of isotopic states is proposed in order to lump states and shrink the enlarged state space. The numerical experiment illustrates the SAT with a state lumping, and validates the risk estimation methods. One future work is to deal with the augmented state space. Though the isotopic class can be lumped into one state under conditions, the space is still very large in most cases. Another is to evaluate or estimate a generic risk measure iteratively. Currently, only a few risks can be evaluated in iterations. The third is to extend the methods for Markov processes with continuous time horizons and spaces in a risk-sensitive case.
Fig. 6: The comparison among the empirical exponential utility risk, the estimated exponential utility risk from reward simplification, and the estimated exponential utility risk from SAT along the risk parameter .
[1] S. Kusuoka, “On law invariant coherent risk measures,” in Advances in Mathematical Economics, pp. 83–95, Springer, 2001.
[2] V. S. Borkar, “Q-learning for risk-sensitive control,” Mathematics of Operations Research, vol. 27, no. 2, pp. 294–311, 2002.
[3] J. Garc´ıa and F. Fern´andez, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
[4] W. Huang and W. B. Haskell, “Risk-aware Q-learning for Markov decision processes,” in Proceedings of the 56th IEEE Conference on Decision and Control (CDC), pp. 4928–4933, 2017.
[5] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Riskconstrained reinforcement learning with percentile risk criteria,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6070–6120, 2017.
[6] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS), pp. 908–918, 2017.
[7] S. Ma and J. Y. Yu, “State-augmentation transformations for risk-sensitive reinforcement learning,” arXiv:1804.05950v2:, 2018.
[8] A. Nilim and L. E. Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005.
[9] A. Ruszczy´nski, “Risk-averse dynamic programming for Markov decision processes,” Mathematical Programming, vol. 125, no. 2, pp. 235– 261, 2010.
[10] D. J. White, “Mean , variance , and probabilistic criteria in finite Markov decision processes : A review,” Journal of Optimization Theory and Applications, vol. 56, no. 1, pp. 1–29, 1988.
[11] M. J. Sobel, “Mean-variance tradeoffs in an undiscounted MDP,” Operations Research, vol. 42, no. 1, pp. 175–183, 1994.
[12] S. Mannor and J. Tsitsiklis, “Mean-variance optimization in Markov decision processes,” in Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1–22, 2011.
[13] H.-S. Lau, “The newsboy problem under alternative optimization objectives,” Journal of the Operational Research Society, vol. 31, no. 6, pp. 525–535, 1980.
[14] T.-M. Choi, D. Li, and H. Yan, “Mean-variance analysis for the newsvendor problem,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 38, no. 5, pp. 1169– 1180, 2008.
[15] C.-H. Chiu and T.-M. Choi, “Supply chain risk analysis with mean-variance models: A technical review,” Annals of Operations Research, vol. 240, no. 2, pp. 489–507, 2016.
[16] R. A. Howard and J. E. Matheson, “Risk-sensitive Markov decision processes,” Management science, vol. 18, no. 7, pp. 356–369, 1972.
[17] K.-J. Chung and M. J. Sobel, “Discounted MDPs: Distribution functions and exponential utility maximization,” SIAM journal on control and optimization, vol. 25, no. 1, pp. 49–62, 1987.
[18] E. Altman, Constrained Markov Decision Processes. CRC Press, 1999.
[19] B. Ravindran and A. G. Barto, “Model minimization in hierarchical reinforcement learning,” in International Symposium on Abstraction, Reformulation, and Approximation, pp. 196–211, Springer, 2002.
[20] J. G. Kemeny and J. L. Snell, Finite Markov Chains. Springer-Verlag, New York, 1976.
[21] C. Burke and M. Rosenblatt, “A markovian function of a markov chain,” The Annals of Mathematical Statistics, vol. 29, no. 4, pp. 1112– 1122, 1958.
[22] P. G. Harrison and N. M. Patel, Performance modelling of communication networks and computer architectures (International Computer S. Addison-Wesley Longman Publishing Co., Inc., 1992.
[23] M. J. Sobel, “The variance of discounted Markov decision processes,” Journal of Applied Probability, vol. 19, no. 4, pp. 794–802, 1982.
[24] L. Xia, “Mean-variance optimization of discrete time discounted markov decision processes,” Automatica, vol. 88, pp. 76–82, 2018.
[25] Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer, “Risk-sensitive reinforcement learning,” Neural Computation, vol. 26, no. 7, pp. 1298– 1328, 2014.