b

DiscoverSearch
About
My stuff
A Scheme for Dynamic Risk-Sensitive Sequential Decision Making
2019·arXiv
Abstract
Abstract

We present a scheme for sequential decision making with a risk-sensitive objective and constraints in a dynamic environment. A neural network is trained as an approximator of the mapping from parameter space to space of risk and policy with risk-sensitive constraints. For a given risk-sensitive problem, in which the objective and constraints are, or can be estimated by, functions of the mean and variance of return, we generate a synthetic dataset as training data. Parameters defining a targeted process might be dynamic, i.e., they might vary over time, so we sample them within specified intervals to deal with these dynamics. We show that: i). Most risk measures can be estimated using return variance; ii). By virtue of the state-augmentation transformation, practical problems modeled by Markov decision processes with stochastic rewards can be solved in a risk-sensitive scenario; and iii). The proposed scheme is validated by a numerical experiment.

Cities are constantly involved in complex changing processes. City logistics is characterized by multiple stakeholders with different objectives and constraints. Sustainability plays an increasingly significant role in planning and management within organizations and across supply chains [1]. The sustainable city logistics studies express reflections and developments on the economic and the environmental problems on supply chain management, but only few of them take risk into consideration. Since many sustainable city logistics problems, such as planning, routing and scheduling, can be modeled as sequential decision making (SDM) problems [2], in this paper, we propose a scheme composed of reinforcement learning (RL) methods for risk-sensitive SDMs with constraints in dynamic environments.

image

Risk-sensitive objectives have been drawing more attention in many practical problems, especially in which the small probability events have serious consequences. Since various goals are targeted in practice, one solution is to optimize one of them as objective and take others as constraints. Furthermore, the environment might vary over time, so we sample the variables involved in the decision-making within specified intervals to deal with these dynamics. In this paper, we consider the SDM problems with three concerns: risk, constraint, and dynamic environment. In order to solve this problem, we propose a scheme for generating a synthetic dataset and training an approximator (here we use neural network as an example). The motivation for the proposed scheme arises from practical problems in which exact analytic risk analysis may be excessively complicated or prohibitively expensive, especially in a dynamic scenario. Though historical data is usually used for neural network training, in many practical situations, the recorded decisions are not optimal. Furthermore, in risk-sensitive cases, the criteria based on which the decisions were made are not clarified. Therefore, we generate a synthetic dataset for problems with specified risk-sensitive objective and constraints. For a given parameter set, we construct a Markov decision process (MDP), calculate the return variance for any deterministic policy, and then evaluate (or estimate) the specified risk measures and record the optimal policy.

A practical inventory control problem is considered as an example, in which the risk objective and constraints can be evaluated or estimated with return variance. An inventory control problem is dynamic in the long run. For example, the wholesale prices are changing all the time, as well as the uncertainties of the market demands and supplier reliabilities. How to adapt to these dynamics efficiently from a risk perspective is not only an academic concern, but also a business one.

image

In this section, firstly, we introduce the notations of constrained MDPs. Secondly, we present three types of law-invariant risk measures commonly studied in RL. We show that, the three risk measures can be evaluated or estimated with the return variance, and any of the three risks can be the objective or a constraint in our study, as well as other functions of the mean and variance of return. Thirdly, we review the return variance calculation method for a Markov process with a deterministic state-based reward. Finally, we restate the SAT as an MDP homomorphism, which enables the variance calculation method for a Markov process with a stochastic transition-based reward.

2.1 Constrained Markov Decision Processes

MDPs provide a framework for modeling sequential decision making process. In this paper, we focus on infinite-horizon discrete-time MDPs. An MDP with a

stochastic reward can be represented by

image

in which S is a finite state space, and  Xt ∈ Srepresents the state at epoch t ∈ N; Axis the allowable action set for  x ∈ S, A= �x∈S Axis a finite action space, and  Kt ∈ Arepresents the action at epoch t; J is a bounded finite subset of R, and is the set of possible values of the immediate rewards, and denote  Rtthe immediate reward at epoch t; p is the transition probability, with p(y | x, a) =  P(Xt+1 = y | Xt = x, Kt = a) denotes the homogeneous transition probability; d is the reward distribution, with d(j | x, a, y) =  P(Rt= j | Xt = x, Kt = a, Xt+1 = y) the probability that the immediate reward at time t is j, given current state x, action a, and next state  y; µis the initial state distribution; and  γ ∈(0, 1) is the discount factor. Given such an MDP, people concern the discounted total reward, or the return

image

In most cases, the expected return is considered as the optimality criterion (objective).

A policy  πdescribes how to choose actions sequentially. An MDP with a (randomized) policy induces a Markov reward process. In an infinite horizon case, a stationary policy space is considered. Randomized policy is often considered in constrained MDPs [3]. In this study, we focus on the deterministic policy space Π, which is easy-to-use in practice. An optimal policy with respect to a criterion refers to an induced Markov process which optimize the criterion.

In many practical SDM problems, a single objective might not suffice to describe the real cases. To deal with various goals, a natural way is to optimize one objective with constraints on other goals [3]. In particular, since city logistics is characterized by multiple stakeholders with different objectives and constraints, we model such problems as constrained MDPs. A constrained MDP is an MDP with a criterion composed of an objective and constraint(s). In this study, we consider risk-sensitive objective and constraints in the function space {f ∈ R | f = f(E(Φ), V(Φ))}, where the return Φ refers to Equation (1). Next, we review the risk measures widely studied in RL.

2.2 Risk Measures

In RL, uncertainty is studied from two perspectives. One is the external uncertainty, which refers to the parameter uncertainty or disturbance. When the model is unknown, its parameters are usually estimated first, and then the optimal solution is calculated with a model-based approach. However, the parameter estimation depends on noisy data in practice, and the modeling errors may result in negative consequences. In control theory, this problem is known as robust control. Robust control methods consider the uncertain parameters within some compact sets, and optimize the expected return with the worst-case parameters, in order to achieve good robust performance and stability [4].

The other refers to the inherent (or internal) uncertainty, which results from the stochastic nature of the process. We claim that most, if not all, inherent risk measures depend on the reward sequence (Rt : t ∈ {1, · · · , N}), and a inherent risk measure can be denoted by  ρ : RN → R ∪ {+∞}, where  N ∈N+ ∪ {+∞}. The inherent risk can be quantified by a dynamic measure or a law-invariant measure. Given a Markov process with a deterministic reward and a deterministic initial state, a sequence of risk measures (ρi : i ∈ {1, · · · , N}), and denote the immediate reward at epoch t by  Rt, a dynamic measure can be denoted in general as

image

which is sensitive to the order of the immediate rewards. Dynamic measures are usually assumed to have a set of properties, such as Markov, monotonicity and coherence, which yields a time-consistent risk measure with a nested structure. For further information on the dynamic risk measure, see [5].

Given a discount factor  γ, a law-invariant [6] measure in an infinite horizon is a functional Ψ on the return. Three types of law-invariant risk have been widely studied in RL area.

2.2.1 Utility risk

The original goal of a utility function is to represent the subjective preference [7]. One classic example can be the “St. Petersburg Paradox,” which refers to a lottery with an infinite expected reward, but no one would put up an arbitrary high stake to play it, since the probability of obtaining an high enough reward is too small. Mathematically, a utility function Ψ :  R → Ris a mapping from objective value space for all possible outcomes to subjective value space. A utility objective is usually in the form Ψ−1{E[U(Φ)]}, where Ψ is a strictly increasing function. Denoting the return variance by V(Φ), the most common used utility risk in RL is the exponential utility [8]

image

where  βmodels a constant risk sensitivity that risk-averse when  β <0. This can be seen more clearly with the Taylor expansion of the utility

image

2.2.2 Mean-variance risk

Another type of risk measure can be mean-variance risks. The mean-variance risk measure is also known in finance as the modern portfolio theory [9, 10, 11]. The mean-variance analysis aims at optimal return at a given level of risk, or the optimal risk at a given level of return. In RL, several mean-variance models

image

where k is a risk parameter, and when k > 0, it is a risk-averse objective. This is the first mean-variance model for exploring inventory management related problems [12]. The other model can be maximizing the expected return with a variance constraint, or minimizing the variance with an expected return constraint [13]. For a review on mean-variance risk, see [14].

2.2.3 Quantile-based risk

The last type of risk measure used in practice refers to quantiles, which requires us to pay attention to discontinuities and intervals of quantile numbers. A commonly used quantile-based risk measure is value at risk (VaR). VaR originates from finance. For a given portfolio, a loss threshold (target level) and a horizon, VaR concerns the probability that the loss on the portfolio exceeds the threshold over the time horizon. Mathematically, VaR can be defined as to find a policy which maximizes the smallest possible outcome with respect to a specified probability level. In RL, the VaR objective can also be defined as to find a policy which maximizes the probability that the return is larger than or equal to a specified target (threshold) [15, 16]. Denote  F πΦas the return CDF from a policy  π. In this paper, we consider two VaR problems [15] in an infinite-horizon MDP.

Problem 1 Given a quantile  α ∈[0, 1], find the optimal threshold  ρα= sup{τ ∈R | P(Φ > τ) ≥ α, π ∈ Π}= sup{τ ∈ R | F πΦ(τ) ≤ 1 − α, π ∈ Π}.

Problem 2 Given a threshold  τ ∈ R, find the optimal quantile  ητ= sup{α ∈[0, 1]  | F πΦ(τ) ≤ 1 − α, π ∈ Π}.

Both VaR problems relate to  PΦ= inf{F πΦ | π ∈ Π}, here we name itVaR function. In a long or infinite horizon case, the return distribution can be estimated with a strictly increasing function [17]. In this case, VaR can be considered as a study of the return distribution, since any point along  PΦis (estimated) (ρα, 1 − ητ) with  τ = ραor  α= 1  − ητ. Therefore, both VaR problems refer to  PΦ. VaR is straightforward but hard to deal with since it is not a coherent risk measure [18]. In many cases, conditional VaR (also known as expected shortfall) is preferred over VaR since it is coherent [19], i.e., it has some intuitively reasonable properties (convexity, for example). However, when the return can be assumed to be approximately normally distributed, VaR can be simply estimated with E(Φ) and V(Φ) [20].

In this paper, we focus on law-invariant risk measures on Φ. The three law-invariant risk measures can be either calculated (mean-variance risk), estimated (utility risk), or estimated with assumption (quantile-based risk) with E(Φ) and V(Φ). By virtue of the state-augmentation transformation (SAT), V(Φ) can be calculated in a Markov process with a stochastic reward, which allows risk evaluation in practical SDM problems. Next, we show how to calculate the return variance in a Markov process with a deterministic reward, and how the SAT enables the method to a Markov process with a stochastic reward.

2.3 Variance Formula for Markov Processes

As shown above, all law-invariant risk measures can be evaluated with the return variance. To calculate the return variance, Sobel [21] presented the formula for the Markov process with a deterministic reward.

Theorem 1 Given an infinite-horizon Markov process  ⟨S, rπ, pπ, γ⟩with a finite state space  S = {1, · · · , |S|}, a reward function  rπdeterministic state-based and bounded, and a discount factor  γ ∈(0, 1). Denote the transition matrix by P, in which P(x, y) =  pπ(y | x), x, y ∈ S. Denote the conditional return expectation by vx = E(Φ | X0 = x) for any deterministic initial state  x ∈ S, and the conditional expectation vector by v. Similarly, denote the conditional return variance by ψx = V(Φ | X0 = x), and the conditional variance vector by  ψ. Let  θdenote the vector whose xth component is  θx= �y∈S pπ(x, y)(rπ(x) +  γvy)2 − v2x. Then

image

Notice that the variance formula is for Markov processes with deterministic reward functions only. How to apply the method to practical problems with stochastic rewards is the problem. In next section, we review the SAT [22] as an MDP homomorphism, which enables the variance formula in a Markov process with a stochastic reward.

2.4 SAT Homomorphism

In many practical risk-sensitive problems, the rewards of the Markov processes are stochastic, but many methods may require the reward in a determined form [23, 24, 25, 26, 27, 28, 29]. To enable the method in those cases, we may use the SAT to transform the Markov process and preserve the reward sequence (Rt : t ∈ {1, · · · , N}). In this paper, the example is that the variance formula is for Markov processes with deterministic reward functions only. We restate the SAT as an MDP homomorphism. Comparing with the original SAT theorem [22], the homomorphism version of SAT is on a more abstract level. An MDP homomorphism is a formalism that captures an intuitive notion of specific equivalence between MDPs [30]. In order to convert an MDP M with d to  M†with r and preserve (Rt), we consider each “situation”, which determines immediate reward, as an augmented state. We can then attach each possible reward value to an augmented state in  M†. Formally, we define the SAT homomorphism as follows.

image

homomorphism h from an MDP M = ⟨S, A, J, p, d, µ⟩ to an MDP M†= ⟨S†, A, r, p†, µ†⟩. The state space S†= S‡ ∪ Sn, where S‡= S2 × A × J,Sn = {snull,x}x∈S, and Sn ∩ S‡=  ∅. For x†= (x, a, y, i), y†= (y, ay, z, j)  ∈ S‡,we have Ax†= Ay,r(x†) = i, and for ay ∈ Ay, p†(y† |(x†, ay) = p†(y† |(snull,y, ay) = p(z | y, ay)d(j | y, ay, z); for x†= snull,x ∈ Sn, we haveAx†= Ax,r(x†) = 0, and µ†(snull,x) = µ(x).

We call  M†the homomorphic image of M under h. For any policy  πin M, there exists a policy  π†in  M†, such that the two processes share the same (Rt). We define the mapping between the two policy spaces as a policy lift.

Definition 2 (Policy lift) Let  M†be a homomorphic image of M under h. Let  πbe a stochastic policy in M. Then  πlifted to  M†is the policy  π†such that  π†(a | x†) =  π(a | y) for  x†= (x, a, y, i) ∈ S‡and  π†(a | x†) =  π(a | x) for x†=  snull,x ∈ Sn.

Given an MDP with a policy, the randomness of the induced Markov reward process can be studied in its underlying probability space.

Definition 3 (Underlying probability space) Let (Ω, F, P) be a probability space, and (E, B) a measurable space with  E = S × J. An induced Markov reward process can be represented by an (E, B)-valued stochastic process on (Ω, F, P) with a family (Yt)t∈Nof random variables  Yt: (Ω, F) → (E, B) for t ∈ N. (Ω, F, P) is called the underlying probability space of the process (Yt)t∈N. For all  ω ∈, the mapping  Y (·, ω) :  t ∈ N → Yt(ω) ∈ Eis called the trajectory of the process with respect to  ω. The process (Yt)t∈Nis progressively measurable with respect to the filtration (Ft)t∈N.

A homomorphism version of the SAT theorem is as follows, which claims that the probability measure on trajectories is preserved under h. Therefore, as a subsequence of sample path, the probability measure on (Rt)t∈Nis preserved as well.

Theorem 2 (Probability measure preservation) Let  M†be an image of M under homomorphism h. Let  π†be the stochastic policy lifted from  π. For the two processes M with  πand  M†with  π†, there exists a bijection  fΩ: Ω  → Ω†, such that for the underlying probability space (Ω, F, P) for the first process, we have a sample path probability space (Ω†, {fΩ(b) :  b ∈ F}, P †) for the second process, such that for any  t ∈ N, P†({fΩ(b) :  b ∈ Ft}) =  P({Ft}).

image

In this section, we propose a scheme for solving risk-sensitive SDM problems with constraints in a dynamic environment. And we consider neural network (NN) as an example of the approximator for the scheme.

image

Figure 1: A dynamic risk evaluation scheme with NN and RL methods for optimal risk and policy in a sequential decision making problem.

3.1 Sequential Decision Making Scheme

A constrained, dynamic, risk-sensitive SDM problem can be considered as a process (environment) with a criterion (objective with constraints). For scuh an SDM problem, we assume that its process is defined by a vector of k environmental features  qe ∈ Qe = F1 × · · · × Fk, and denote it by  qe,tat epoch t. We allow the components of  qeto vary within some prespecified intervals. Similarly, a targeted risk measure and risk constraints might be defined by a vector of parameters  qp ∈ Qp, and and denote it by  qp,tat epoch t. To allow the decision-making for dynamic risk concern, the components of  qpare within some prespecified intervals as well. For any given  qe,tand  qp,t, denote the optimal risk by  qo,t ∈ Qo, and the optimal policy by  π∗t ∈Π. In this case, an imagined decision maker can be mathematically denoted by  ν : Qe × Qp → Qo ×Π, i.e., a theoretically perfect decision maker consider all the environmental parameters, the objective, and possible constraints, and choose the optimal policy with respect to some targeted risk measure. The goal is to train a NN to approximate ν.

Usually, historical data is used in training an approximator. However, at least two problems should be considered in practice. The first problem refers to information incompleteness. Most historical data contains no information on qpand (or)  π∗, and in many cases, even the information on  qeis incomplete. The second problem is optimality. In practice, the decision makers might prefer an easy-to-use policy than an optimal one, which is hard to determine since the practical problems could be different from the theoretical model diversely and subtly. Therefore, we propose to evaluate or estimate risk measures with RL methods and train NN with a synthetic dataset.

We present a scheme in Figure 1 to solve a constrained risk-sensitive SDM problem in a dynamic environment. In the feature representation module, we define the SDM process with  qe. In the SDM modeling module, a mathematic model is chosen to represent the process and offer a platform for risk evaluation, and here we use MDP as the model. In the risk evaluation module, we calculate the optimal policy for a specified risk measure with risk constraints. Here, for a given MDP with criterion (an objective with constraint(s), which can be risk-sensitive), firstly, we enumerate all deterministic policies to generate a Markov process. Secondly, we implement SAT to transform the Markov process to one with a determined reward (See Section 2.4), in order to calculate the return variance. Thirdly, we estimate the specified risk-measure(s) by the return variance (See Section 2.3), and here we consider VaR as an example (See Section 2.2). In the label generation module, the optimal VaR and policy is recorded and attached to the feature vector as a label. Finally, a functional approximator, such as an NN or a linear combination of polynomial basis or Fourier basis, is trained with the labeled feature dataset. In this paper, we consider an inventory control problem as an example. The feature set  qeis fully described in Section 4.

Briefly, the upper level of the scheme is for synthetic data generation (modeling and risk evaluation), and the lower level is for approximator training for dynamic SDM problems. After each decision-making epoch, some parameters will be updated with the outcome (such as market demand distribution) and the involved external variables (such as wholesale price). Suppose the updated parameters are still within the predefined intervals, the trained NN is able to output the estimated optimal policy and risk at the next decision-making epoch. We present the algorithm for the synthetic dataset generation (upper level of the scheme) as follows.

image

3.2 Approximator

A functional approximator needs training with a dataset to approximate the mapping from parameter space to risk and policy space. In this study, We use neural network (NN), which is a universal function approximator [31]. Other approximators, such as a linear combination of polynomial basis or Fourier basis, are not considered here. The definition of NN is as follows.

Definition 4 (Feed forward neural network) Let  Q ∈ Nbe the number of layers, and  N0, N1, · · · , NQ ∈ Nbe the numbers of nodes of different layers. Denote the activation function by  g : R → R. For any  q ∈ {1, · · · , Q}, x ∈ R, denote the affine function by  Wq(x) =  Aqx + bqfor some  Aq ∈ RNq×Nq−1and bq ∈ RNq. For any  i ∈ {1, · · · , Nq}, j ∈ {1, · · · , Nq−1}, and  q ∈ {1, · · · , Q}, the entry  Aq,i,jof  Aqdenotes the weight of the edge from the i-th node in the (q−1)-th layer to the j-th node in the q-th layer. For any layer  q ∈ {1, · · · , Q}, let Wq : RNq−1 → RNqbe an affine function. With  Hq = g◦Wqfor q = 1, · · · , Q−1, a function  H : RN0 → RNQdefined as

image

is called a feed forward neural network. Since we use NN as an estimator of the decision maker  ν, we denote the NN by  {ˆνg∞,N0,NQ}the set of neural networks mapping from  RN0to  RNQwith g.

In this study, for any given  qe, we construct an MDP model, and by virtue of the SAT [22], we can preserve the variance for a Markov process with a stochastic reward. As claimed in Section 2.2, most inherent risk measures can be evaluated or estimated by return variance with or without some assumption. We calculate  qoand  π∗and attached them to  qeand  qpas labels. Finally, we train an NN with the synthetic dataset. In the next section, we present an inventory management problem as an example to show the implementation details.

Figure 1 illustrates the scheme with NN and RL methods for a dynamic risk-sensitive SDM problem. The scheme includes synthetic dataset generation (upper layer), approximator training and predicting (lower layer). To generate the dataset for training and testing, we define and randomly generate a domain set. For each domain instance, we estimate the optimal VaR and the corresponding policy as follows. Firstly, construct an inventory MDP under some predefined assumptions. Secondly, enumerate all deterministic policies to achieve a set of Markov reward processes  {⟨S, Jπ, pπ, dπ, µ⟩}π∈Π, and calculate the VaR function for the MDP with a stochastic reward. Thirdly, acquire the set of transformed Markov process  {⟨S†, r†π†, p†π†, µ†π†⟩}π†∈Π†by Theorem 2, with  r†π†being deter- ministic and state-based. Fourthly, estimate the mean and variance vectors  v, ψby Theorem 1. Assuming the return is approximately normally distributed, we have the estimated return distributions for all  ⟨S†, r†π†, p†π†, µ†π†⟩to calculate the VaR function for the inventory MDP. Finally, for the risk sensitivity parameters in the domain instance, calculate risk-sensitive objective and constraints, and record  π∗. With the training dataset, we tune, train and validate an NN properly, and use it to predict  qoand  π∗for any  qeand  qp.

4.1 An Inventory Management Example

In the management of sustainable city logistics, both costs and risks are required to be simultaneously evaluated, and these are often conflicting [32]. Moreover, considering the changing environment, the solution should be sensitive to the dynamics of the model parameters  Qe × Qp. As far as we know, most decision-making methods lack a holistic integrated vision, and do not attempt to safeguard sustainable city logistics security at different levels, thus causing unnecessarily high costs and disturbances. To shift from the risk transfer and toleration towards a risk-sensitive control, the proposed scheme in Figure 1 can be regarded as a solution in this aspect. As a proof of principle for the applicability of the proposed scheme, we apply it with NN to a practical inventory control process as an example, in which three factors are modeled: multiple sourcing, supplier reliability, and backlog. It is worth noting that MDP is a discrete-time process, so the output optimal policy is a periodic-review strategy, which has its limitations comparing with its aperiodic-review counterparts. Furthermore, some assumptions are made to keep the model small. Hence, this inventory control SDM model is theoretical, and in many cases a more accurate estimation can be achieved through simulation with respect to complicated cases.

4.1.1 Multiple sourcing

Multiple sourcing (multisourcing) is a strategy that blends services from the optimal set of internal and external suppliers in the pursuit of business goals [33]. We consider two suppliers  U1and  U2, and a retailer R. The scenario involves a multinational corporation (R) which functions in the supply chain as a retailer. For a targeted product, this corporation has two suppliers  U1and  U2(in US and in India, for example). The supplier  U1in US is close and reliable, the other supplier  U2is far and unreliable but with a lower wholesale price.

4.1.2 Supplier reliability

Supplier (sourcing) reliability can be expressed in terms of quantity, quality or timing of orders due to multiple factors, such as equipment breakdowns, material shortages, warehouse capacity constraints, price inflations, strikes, embargoes, and political crises [34]. In the framework of MDP, one way to model these uncertainties is to take the status of a supplier as a Boolean variable, which represents whether the supplier is “available” or “unavailable”, and it can be modeled as a two-state Markov process [35]. In order to simplify the state representation, we assume that the unreliable supplier  U2is available with a probability  β1, and the retailer will be compensated with an amount  psif the

image

Figure 2: The product (solid) and order (dashed) flows between the retailer R and the two suppliers  U1and  U2. The letter T denotes transportation. The transportation lead time of  U1is negligible, hence ignored.

supplier with an order is unreliable at the current epoch. The probability  β1can be estimated using historical data.

Transportation can be another source of unreliability. It emphasizes the long delivery time which mainly results from a long physical distance. This uncertainty can be also a critical factor at times. In our model, we assume  U1is reliable and its transportation lead time can be ignored (for example, a day when one epoch is a month). Assume  U2offers a lower price but unreliable. Since it is far from R, its lead time is one period. To simplify the model, we consider the transportation factor within the supplier reliability.

4.1.3 Backlog and lost sale

For each unsatisfied unit of demand, a backlog or a lost sale occurs. Suppose the probabilities of backlog and lost sale are  β2and 1  − β2, respectively. The probability  β2can be estimated using historical data. We assume that when a backlog happens, the product will be sold at a lower price  pr −cb(which can be due to from fast delivery or other reasons), where  pris the retail price. A lost sale induces a cost  cl. Rewards for both cases will be calculated at the current epoch, in order to keep the state representation small. The backlog and lost sale instances are associated with market demand. To keep the stochastic reward function simple, finite possible demands are considered (i.e., a finite support for the reward function). For example, when the warehouse capacity is four units and the maximum demand is six units, then the possible maximum amount of backlogs or lost sales is two units. In the MDP model, the reward and cost from these two units will be received at a discount at the current epoch.

To simplify the model, other factors, such as fast delivery, order replacement and foreign currency exchange, have not been considered. The inventory control process is illustrated in Figure 2, in which the solid arrows represent the product flow, and the dashed arrows represent the order flow. Next, we construct an MDP for this inventory control problem.

4.2 Inventory MDP

Here we use an infinite-horizon MDP with finite state and action space to model the inventory control process. The related parameters are given in Table 1. For a set of specified parameters, we construct the MDP as follows. For time  t ∈ N, denote the inventory level by  It, and the number of items in transit  Jt. Given the warehouse capacity M, the state is  St= (It, Jt), where  It, Jt ∈ {0, · · · , M}. The action is  Kt= (kt,1, kt,2), where  kt,1, kt,2 ∈ {0, · · · , M}are the order quantities with the two suppliers. Assume that the maximum demand at retailer could be double of the warehouse capacity, i.e., supp(fD) =  {0, · · · , 2M}, and the demand is  dt ∈supp(fD) with probability  fD(dt). Given  β1and  β2to represnet supplier availability and backlog probabilities, respectively, the reward and transition probability can be formulated. For example, at time t, given the state St= (it, jt), the action  Kt= (kt,1, kt,2), the demand  dt ≤ it +jt +kt,1, U2being available currently, and the next state  St+1= (it+1, kt,2), the reward is

Rt= pr · dt − cf · (1[kt,1>0] + 1[kt,2>0]) −(c1 · kt,1+ c2 · kt,2) − ch · it,

where  dt = it + jt + kt,1 − it+1. Then, the transition probability is given by

P(St+1, ‘U2is available’  | St, Kt) =  fD(dt) · β1,

Consider another example, where  U2is unavailable. For the backlogging of one unit and the lost sale of one unit, the reward is

image

and the transition probability is

P(St+1, ‘U2is unavailable’, ‘one backlog’, ‘one lost sale’  | St, Kt) = fD(dt) ·(1  − β1) · β2 ·(1  − β2).

Next, we consider a VaR objective with a risk-sensitive constraint, and generate the synthetic training data for NN.

4.3 Dataset Generation

In this paper, we consider the VaR Problem 1 as the objective with a constraint E(Φ)/V(Φ) > q as an example, in which q is a prespecified constant. The intuitive meaning of this constraint is that the earning per unit of risk (variance) should be larger than a threshold q. We artificially generate a dataset for the inventory control process with an objective of minimizing VaR with a confidence level  αand a constraint E(Φ)/V(Φ) > q, in which q is a constant.

There are multiple ways to define a mapping from parameter space to risk and policy space. Here we define it as follows. Define the domain set K, with each instance  k ∈ Krefering to a vector of inventory control parameters (retail price, fixed ordering cost, etc.). Define the label set L, with each instance  l ∈ Lrefering to a vector of labels (optimal VaR, optimal policy, etc.). A dataset

Table 1: Inventory control parameter list

image

G = {(k1, l1), · · · , (km, lm)}is a finite set of pairs in  K × L, i.e., a set of labeled domain instances.

It worth noting that, to construct an NN one needs to predefine certain parameters, including but not limit to  Q, {Ni}and g (See Section 3.2 for details). The predefined parameters, which determine the network structure and how the NN is trained, are called hyperparameters. A fine-tuned NN with a large enough dataset should be able to make a satisfactory prediction. For our inventory control problem, all related parameters are set and sampled as follows.

Hyperparameters: A domain instance includes all variable values except for warehouse capacity M, discount factor  γ, and initial inventory distribution  µ, which are considered as hyperparameters. We set M = 3, and the inventory level space  S = {0, · · · , M}, and  γ= 0.95. We set  µ((0, 0)) = 1, i.e., at the beginning the inventory level and the item amount in transit are both zero. The latter two parameters are fixed only to simplify the problem.

Revenue related parameters: Set the retail price  pr ∈ Pr= [6, 10], and for generating the dataset, we take it as a random variable, whose distribution is U(6, 10) 1. Similarly, set the penalty on supplier’s unavailability  ps ∈ Ps= [0, 2] with  ps ∼ U(0,2), the wholesale prices from  U1and  U2are  c1 ∈ C1= [4, 6] with  c1 ∼ U(4,6) and  c2 ∈ C2= [1, 4] with  c2 ∼ U(1,4), the backlog cost cb ∈ Cb= [0, 2] with  cb ∼ U(0,2), the fixed ordering cost (for both supplier) cf ∈ Cf= [0, 2] with  cf ∼ U(0,2), and the holding cost  ch ∈ Ch= [0, 2] with ch ∼ U(0,2). For brevity, we set the lost sale cost  cl = pr − c2.

For probabilistic parameters, we set the supplier availability probability  β1 ∈[0, 1] with  β1 ∼ U(0.8,1), the backlog probability  β2 ∈[0, 1] with  β2 ∼ U(0,1), and the risk sensitivity parameter  α ∈[0, 1] with  α ∼ U(0,1). For the demand

image

Figure 3: The loss for training/validating a 3-layer network in 50 epochs.

distribution vector  vD ∈[0, 1]2M+1, we sample it uniformly from a 2M-simplex. After defining the domain features, labels are calculated, which include the optimal risk and policy.

We denote a policy list (vector) by  Lp, in which each item represents an allowable policy. An NN will be trained as an approximator of the function, whose inputs represent the inventory parameters and  α, and outputs represent ραand the corresponding policy (denoted by a Boolean vector or an index), i.e., the domain set  K = Pr × Ps × C1 × C2 × Cb × Cf × Ch ×[0.8, 1]  ×[0, 1]M+3, and the label set  L = R × {0, 1}|Lp|in general. Next, we train an NN with the synthetic dataset and show the validity of the proposed scheme.

4.4 Numerical Result

Since VaR is numeric and (deterministic) inventory policy is categorical, we consider both of them to be numeric. Considering the setting given in Section 4.3, we set a network  NN gw,17,7:  K → R7. Now we use Keras to construct, train and validate an NN for this small inventory problem. Firstly, we try a 3-layer NN. We set Q = 3, the numbers of nodes  N0= 17, N1= 12, N2= 8, N3= 7. We set the relu function and the linear function as the activation functions for the hidden and output layers, respectively. Set the mean squared error as the loss function, and adam as the optimizer. Given that all hyperparameters are determined, the network is trained with a training dataset  Gt ⊂ G. We train the NN 50 times for the means and variances of the loss values at different epochs. We set the batch size to be 50 for the batch gradient descent.

Since it is a regression problem, we measure the loss by the regression metrics, such as mean absolute error or mean squared error (MSE). In Figure 3, it shows that the result with the training data size 105converges within 10 epochs, with a hit rate around 95% in both training and validation phases. The error regions represent the standard deviations of means along the epoch axis.

The SDM problems considering risk, dynamic environment, and constraint are usually studied separately. Besides the works reviewed in Section 2.2, Shen [36] generalized risk measures to the valuation functions. The author applied a set of valuation functions, derived some model-free risk-sensitive reinforcement learning algorithms, and presented a risk control example in simulated algorithmic trading of stocks. For SDMs in dynamic environments, Hadoux [37] proposed a new model named Hidden Semi-Markov-Mode Markov Decision Process (HS3MDP), which represented non-stationary problems whose dynamics evolved among a finite set of contexts. The author adapted the Partially Observable Monte-Carlo Planning (POMCP) algorithm to HS3MDPs in order to efficiently solve those problems. The POMCP algorithm used a black-box environment simulator and a particle filter to approximate a belief state. The simulator relaxed the model-based requirement, and each filter particle represented a state of the POMDP being solved. For different types of dynamic environment, the author compared a regret-based method with its Markov counterpart [38]. In the regret-based method, the agent was involved in a two-players repeated game, where two agents (the player and the opponent, which can be the environment) chose an action to play, got a feedback, and repeated the game.

For SDMs with constraint(s), Chow [28] investigated the variability of stochastic rewards and the robustness to modeling errors. The author analyzed a unifying planning framework with coherent risk measures (such as CVaR), which was robust to inherent uncertainties and modeling errors, and output a time-consistent policy. A scalable approximate value-iteration algorithm on an augmented state space was developed for large scale problems with a data-driven setup. The author proposed novel policy gradient and actor-critic algorithms for CVaR-constrained and chance-constrained optimization in MDPs. Furthermore, the author proposed a framework for risk-averse model predictive control, where the risk was time-consistent and Markovian. In the framework of CMDP [3], Chow et al. [39] derived two LP-based algorithms—safe policy iteration and safe value iteration—for problems with constraints on expected cumulative costs. The algorithms hinged on a novel Lyapunov method, which constructed Lyapunov functions to provide an effective way to guarantee the global safety of a policy during training via a set of local, linear constraints. For unknown environment models and large state/action spaces, the authors proposed two scalable safe RL algorithms: safe DQN, an off-policy fitted Qiteration method, and safe DPI, an approximate policy iteration method. Another way to consider risk in an SDM problem is to avoid some dangerous system states. Geibel

and Wysotzki [40] defined the risk with respect to a policy as the probability of entering such states, and set the constraint as the probability being smaller than a predefined threshold. The authors presented a model-free RL algorithm with a deterministic policy space. The algorithm was based on weighting the original value function and the risk. The weight parameter was adapted to find a feasible solution for the constrained problem, and the probability defining the risk was expressed as the expectation of a cumulative return.

We propose a scheme using functional approximator and RL methods to solve SDM problems with risk-sensitive constraints in a dynamic scenario. We consider risk measures as functions of mean and variance of the return in an induced Markov process. As shown in Section 2.2, most, if not all, law-invariant risk measures can be evaluated or estimated with return variance. Considering that reward functions in practical problems are often stochastic, we implement SAT to enable the variance formula and preserve the reward sequence. In an inventory control problem with practical factors, an NN is trained and validated with a synthetic training dataset in the numerical experiment.

The current work can be extended in two ways. The first one has to do with the “no free lunch” theorem. In our inventory control model, the more factors considered, the larger size the policy space has. When the warehouse capacity is 3, there are 8748 deterministic policies, when it is 4 and 5, the size goes to around 2.4 ×107and 1.4 ×1012, respectively. How to deal with a policy space of such an astronomical magnitude in a risk-sensitive case is still an unsettled problem. The second one is how to deal with two different output data types in an NN efficiently. In our setting, the outputs include optimal risk value (numeric) and policy (categorical). A crucial task is how to set activation functions, normalization, and loss function for network training.

[1] K. Grzybowska and G. Kovcs, “Sustainable supply chain - supporting tools,” in 2014 Federated Conference on Computer Science and Information Systems, pp. 1321–1329, 2014.

[2] A. Franceschetti, Sustainable city logistics : fleet planning, routing and scheduling problems. PhD thesis, Technische Universiteit Eindhoven, 2015.

[3] E. Altman, Constrained Markov Decision Processes. CRC Press, 1999.

[4] A. Nilim and L. E. Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005.

[5] A. Ruszczy´nski, “Risk-averse dynamic programming for Markov decision processes,” Mathematical Programming, vol. 125, no. 2, pp. 235–261, 2010.

[6] S. Kusuoka, “On law invariant coherent risk measures,” in Advances in Mathematical Economics, pp. 83–95, Springer, 2001.

[7] R. A. Howard and J. E. Matheson, “Risk-sensitive Markov decision pro- cesses,” Management science, vol. 18, no. 7, pp. 356–369, 1972.

[8] K.-J. Chung and M. J. Sobel, “Discounted MDPs: Distribution functions and exponential utility maximization,” SIAM journal on control and optimization, vol. 25, no. 1, pp. 49–62, 1987.

[9] D. J. White, “Mean , variance , and probabilistic criteria in finite Markov decision processes : A review,” Journal of Optimization Theory and Applications, vol. 56, no. 1, pp. 1–29, 1988.

[10] M. J. Sobel, “Mean-variance tradeoffs in an undiscounted MDP,” Operations Research, vol. 42, no. 1, pp. 175–183, 1994.

[11] S. Mannor and J. Tsitsiklis, “Mean-variance optimization in Markov de- cision processes,” in Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1–22, 2011.

[12] H.-S. Lau, “The newsboy problem under alternative optimization objec- tives,” Journal of the Operational Research Society, vol. 31, no. 6, pp. 525– 535, 1980.

[13] T.-M. Choi, D. Li, and H. Yan, “Mean–variance analysis for the newsvendor problem,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 38, no. 5, pp. 1169–1180, 2008.

[14] C.-H. Chiu and T.-M. Choi, “Supply chain risk analysis with mean-variance models: A technical review,” Annals of Operations Research, vol. 240, no. 2, pp. 489–507, 2016.

[15] J. A. Filar, D. Krass, K. W. Ross, and S. Member, “Percentile performance criteria for limiting average Markov decision processes,” IEEE Transactions on Automatic Control, vol. 40, no. I, pp. 2–10, 1995.

[16] C. Wu and Y. Lin, “Minimizing risk models in Markov decision process with policies depending on target values,” Journal of Mathematical Analysis and Applications, vol. 23, no. 1, pp. 47–67, 1999.

[17] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability. Springer Science & Business Media, 2009.

[18] F. Riedel, “Dynamic coherent risk measures,” Stochastic Processes and their Applications, vol. 112, no. 2, pp. 185–200, 2004.

[19] P. Artzner, F. Delbaen, J. Eber, and D. Heath, “Coherent measures of risk,” Mathematical Finance, vol. 9, no. 3, pp. 1–24, 1998.

[20] S. Ma and J. Y. Yu, “Transition-based versus state-based reward functions for MDPs with Value-at-Risk,” in Proceedings of the 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 974–981, 2017.

[21] M. J. Sobel, “The variance of discounted Markov decision processes,” Journal of Applied Probability, vol. 19, no. 4, pp. 794–802, 1982.

[22] S. Ma and J. Y. Yu, “State-augmentation transformations for risk-sensitive reinforcement learning,” arXiv:1804.05950v2:, 2018.

[23] V. S. Borkar, “Q-learning for risk-sensitive control,” Mathematics of Operations Research, vol. 27, no. 2, pp. 294–311, 2002.

[24] Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer, “Risk-sensitive re- inforcement learning,” Neural Computation, vol. 26, no. 7, pp. 1298–1328, 2014.

[25] J. Garc´ıa and F. Fern´andez, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437– 1480, 2015.

[26] H. Gilbert and P. Weng, “Quantile reinforcement learning,” arXiv:1611.00862, 2016.

[27] W. Huang and W. B. Haskell, “Risk-aware Q-learning for Markov decision processes,” in Proceedings of the 56th IEEE Conference on Decision and Control (CDC), pp. 4928–4933, 2017.

[28] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk-constrained reinforcement learning with percentile risk criteria,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6070–6120, 2017.

[29] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model- based reinforcement learning with stability guarantees,” in Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS), pp. 908–918, 2017.

[30] B. Ravindran and A. G. Barto, “Model minimization in hierarchical rein- forcement learning,” in International Symposium on Abstraction, Reformulation, and Approximation, pp. 196–211, Springer, 2002.

[31] K. Hornik, “Approximation capabilities of multilayer feedforward net- works,” Neural networks, vol. 4, no. 2, pp. 251–257, 1991.

[32] E. Taniguchi, R. G. Thompson, and T. Yamada, “Emerging techniques for enhancing the practical application of city logistics models,” ProcediaSocial and Behavioral Sciences, vol. 39, pp. 3–18, 2012.

[33] L. Cohen and A. Young, Multisourcing: Moving beyond outsourcing to achieve growth and agility. Harvard Business Press, 2006.

[34] E. Mohebbi, “A replenishment model for the supply-uncertainty problem,” International Journal of Production Economics, vol. 87, pp. 25–37, 2004.

[35] S. S. Ahiska, S. R. Appaji, R. E. King, and D. P. Warsing Jr, “A Markov decision process-based policy characterization approach for a stochastic inventory control problem with unreliable sourcing,” International Journal of Production Economics, vol. 144, no. 2, pp. 485–496, 2013.

[36] Y. Shen, Risk sensitive Markov decision processes. PhD thesis, 01 2015.

[37] E. Hadoux, Markovian sequential decision-making in non-stationary environments: application to argumentative debates. PhD thesis, UPMC, Sorbonne Universites CNRS, 2015.

[38] S. P.-M. Choi, N. L. Zhang, and D.-Y. Yeung, “Solving hidden-mode markov decision problems.,” in AISTATS, Citeseer, 2001.

[39] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 8092–8101, 2018.

[40] P. Geibel and F. Wysotzki, “Risk-sensitive reinforcement learning applied to control under constraints,” Journal of Artificial Intelligence Research, vol. 24, pp. 81–108, 2005.

[41] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” arXiv preprint arXiv:1806.06923, 2018.

[42] S. Zhang, B. Mavrin, H. Yao, L. Kong, and B. Liu, “Quota: The quantile option architecture for reinforcement learning,” arXiv preprint arXiv:1811.02073, 2018.


Designed for Accessibility and to further Open Science