b

DiscoverSearch
About
My stuff
Learning in Markov Decision Processes under Constraints
2020·arXiv
Abstract
Abstract

We consider reinforcement learning (RL) in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step t, it earns a reward, and also incurs a cost-vector consisting of M costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of T time-steps, while simultaneously ensuring that the average values of the M cost expenditures are bounded by agent-specified thresholds  cubi , i = 1, 2, . . . , M.

In order to measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an M + 1 dimensional regret vector that is composed of its reward regret, and M cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the i-th component of the cost regret vector is the difference between its i-th cumulative cost expense and the expected cost expenditures  T cubi .

We prove that the expected value of the regret vector of UCRL-CMDP, is upper-bounded as ˜O�T 2/3�, where T is thetime horizon. We further show how to reduce the regret of a desired subset of the M costs, at the expense of increasing the regrets of rewards and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints, and derive algorithms that can tune the regret vector according to the agent’s requirements on its cost regrets.

Reinforcement Learning (RL) [1] involves an agent repeat- edly interacting with an environment modelled by a Markov Decision Process (MDP) [2]. More specifically, consider a controlled Markov process [2]  st, t = 1, 2, . . ., T. At each discrete time t, an agent applies control  at. State-space, and action space are denoted by S and A respectively, and are assumed to be finite. The controlled transition probabilities are denoted by  p := {p(s, a, s′) : s, s′ ∈ S, a ∈ A}. Thus, p(s, a, s′)is the probability that the system state transitions to state  s′upon applying action a in state s. The probabilities p(s, a, s′)are not known to the agent. At each discrete time t = 1, 2, . . ., T , the agent observes the current state of the environment  st, applies control action  at, and earns a reward rtthat is a known function of  (st, at). When the agent applies an action a in the state s, then it earns a reward equal to r(s, a) units. The agent does not know the controlled transition probabilities  p(s, a, s′)that describe the system dynamics of the environment. The performance of an agent or a RL

Rahul Singh is with the Department of ECE, Indian Institute of Science, Bengaluru, Karnataka, India. email: rahulsingh@iisc.ac.in.

Abhishek Gupta and Ness Shroff are with the Department of ECE, Ohio State University, Columbus, OH, USA. email: gupta.706@osu.edu,shroff@ece.osu.edu

algorithm is measured by the cumulative rewards that it earns over the time horizon.

However in many applications, in addition to earning rewards, the agent also incurs costs at each time. The underlying physical constraints impose constraints on its cumulative cost expenditures, so that the agent needs to balance its reward earnings with the cost accretion while also simultaneously learning the choice of optimal decisions, all in an online manner.

As a motivating example, consider a single-hop wireless network that consists of a wireless node that transmits data packets to a receiver over an unreliable wireless channel. The channel reliability, i.e., the probability that a transmission at time-step t is successful, depends upon the instantaneous channel state  cstand the transmission power  at. Thus, for example, this probability is higher when the channel is in a good state, or if transmission is carried out at higher power levels. The transmitter stores packets in a buffer, and its queue length at time t is denoted by  Qt. The wireless node is batteryoperated, and packet transmission consumes power. Hence, it is desired that the average power consumption is minimal. An appropriate performance metric for networks is the average queue length�E �Tt=1 Qt�/T[3], and hence it is required that the average queue length stays below a certain threshold. The AP has to choose  atadaptively so as to minimize the power consumption�E �Tt=1 at�/T, or equivalently maximize�E �Tt=1 −at�/T, while simultaneoulsy ensure that the average queue length is below a user-specified threshold, i.e. �E �Tt=1 Qt�/T ≤ cub. In this example, the state of the “environment” at time t is given by the queue length and the channel state  (Qt, cst). Thus, it might be “optimal” to utilize high transmission power levels only when the instantaneous queue length  Qtis large or the wireless channel’s state  cstis good. Such an adaptive strategy saves energy by transmitting at lower energy levels at other times. Since channel reliabilities are typically not known to the transmitter node, it does not know the transition probabilities  p(s, a, s′)that describe the controlled Markov process  (Qt, cst). Hence, it cannot compute the expectations of the average queue lengths and average power consumption for a fixed control policy, and needs to devise appropriate learning policies to optimize its performance under average-cost constraints. RL algorithms that we propose in this work solve exactly these classes of problems. Infact many important network control problems can be solved within the framework of constrained Markov decision processes (CMDPs). For example, [4], [5] utilize CMDPs in order to maximize the throughput in a stochastic network, where the network operator wants to satisfy constraints on delays, while [6] design policies that make dynamic decisions regarding network access in networks shared by different types of traffic. Similarly, it has been used in [7], [8] in order to maximize the timely throughput1 in stochastic networks. If the network/ system parameters are unknown, then the CMDP can be posed as a linear program (LP), and solved efficiently. However, in practice, network parameters are seldom known to the network operator, and it needs to design algorithms which “learn” the optimal policies in an “optimal” manner. Our work addresses precisely this issue.

RL Algorithms for unconstrained MDPs: RL problems without constraints are well-understood by now. Works such as [9][12] develop algorithms using the principle of “optimism under uncertainty.” UCRL2 of [12] is a popular RL algorithm that has a regret bound of ˜O(D(p)S√AT), where D(p) is the diameter [12] of the MDP p; the algorithms proposed in our work are based on UCRL2.

RL Algorithms for Constrained MDPs: [13] is an early work on optimally controlling unknown MDPs under average cost constraints. It utilizes the certainty equivalence (CE) principle, i.e., it applies controls that are optimal under the assumption that the true (but unknown) MDP parameters are equal to the empirical estimates, and also occasionally resorts to “forced explorations.” This algorithm yields asymptotically (as  T → ∞) the same reward rate as the case when the MDP parameters are known. However, analysis is performed under the assumption that the CMDP is strictly feasible. Moreover the algorithm lacks finite-time performance guarantees (bounds on regret). Unlike [13], we do not assume strict feasibility; infact we show that the use of confidence bounds allows us to get rid of the strict feasibility assumption. [14] derives a learning scheme based on multi time-scale stochastic approximation [15], in which the task of learning an optimal policy for the CMDP is decomposed into that of learning the optimal value of the dual variables, which correspond to the price of violating the average cost constraints, and that of learning the optimal policy for an unconstrained MDP parameterized by the dual variables. However, the proposed scheme lacks finite-time regret analysis, and might suffer from a large regret. Prima facie, this layered decomposition might not be optimal with respect to the sample-complexity of the online RL problem. The works [16][19] design policy-search algorithms for constrained RL problems. However unlike our work, they do not utilize the concept of regret vector, and their theoretical guarantees need further research. After the first draft of our work was published online, there appeared a few manuscripts/works that address various facets of learning in CMDPs, and these have some similarity with our work. For example [20] considers episodic RL problems with constraints in which the reward function is time-varying. Similarly, [21] also considers episodic RL in which the state is reset at the beginning of each episode. In contrast, we deal exclusively with non-episodic infinite horizon RL problems. In fact, as we show in our work, the primary difficulty in non-episodic constrained RL arises due to the fact that it is not possible to simultaneously “control/upper-bound” the reward and M costs during long runs of the controlled Markov process. Consequently, in order to control the regret vector, we make the assumption that the underlying MDP is unichain. However, this problem does not occur in the episodic RL case [20], [21] since the state is reset. Secondly, unlike the algorithms provided in our work, [20], [21] do not allow the agent to tune the regret vector.

Our contributions are summarized as follows.

1) We initiate the problem of designing RL algorithms that maximize the cumulative rewards while simultaneously satisfying average cost constraints. We propose an algorithm which we call UCRL for CMDPs, henceforth abbreviated as UCRL-CMDP. UCRL-CMDP is a mod-ification of the popular RL algorithm UCRL2 of [12] that utilizes the principle of optimism in the face of uncertainty (OFU) while making decisions. Since an algorithm that utilizes OFU does not need to satisfy cost constraints (this is briefly discussed in Section II-A), we modify OFU appropriately and derive the principle of balanced optimism in the face of uncertainty (BOFU). Under the BOFU principle, at the beginning of each RL episode, the agent has to solve for (i) an MDP, and (ii) a controller, such that the average costs of a system in which the dynamics are described by (i), and which is controlled using (ii), are less than or equal to the cost constraints. This is summarized in Algorithm 1.

2) In order to quantify the finite-time performance of an RL algorithm that has to perform under average cost constraints, we define its M + 1 dimensional “regret vector” that is composed of its reward regret (8) and M cost regrets (9). More precisely, considering solely the reward regret (as is done in the RL literature) overlooks the cost expenditures. Indeed, we show in Theorem 2 that the reward regret can be made arbitrary small (with a high probability) at the expense of an increase in the cumulative cost expenditure. Thus, while comparing the performance of two different learning algorithms, we also need to compare their cost expenditures. The reward regret of a learning algorithm is the difference between its reward and the reward of an optimal policy that knows the MDP parameters, while the i-th cost regret is the difference between the total cost incurred until T time-steps, and  cubi T.

3) We ask the following question in the constrained setup: What is the set of “achievable” M + 1 dimensional regret vectors? In Theorem 1 we show that the components of the regret vector of UCRL-CMDP, can be bounded as ˜O(T 2/3).

4) We show that the use of BOFU allows us to overcome the shortcomings of the CE approach that were encountered in [13], i.e., there are arbitrarily long time-

durations during which the CMDP in which the system dynamics are described by the current empirical estimates of transition probabilities is infeasible, and hence the agent is unable to utilize these estimates in order to make control decisions. As a by-product, BOFU also allows us to get rid of “forced explorations,” that were utilized in [13], i.e., employing randomized controls occasionally.

5) Analogous to the unconstrained RL setup, in which one is interested in quantifying a lower bound on the regret of any learning algorithm, we provide a partial characterization of the set of those M + 1-dimensional regret vectors, which cannot be achieved under any learning algorithm. More specifically, in Theorem 3 we show that a weighted sum of the M + 1 regrets is necessarily greater than  O�D(p)S�AT log(T )�, where D(p) is the diameter of the underlying MDP, and S, A is the number of states and control actions respectively.

6) In many applications, an agent is more sensitive to the cost expenditures of some specific resources as compared to the rest, and a procedure to “tune” the M + 1 dimensional regret vector is essential. In Section VI, we consider the scenario in which the agent can prespecify the desired bounds on each component of the cost regret vector, and introduce a modification to the UCRL-CMDP that allows the agent to keep the cost regrets below these bounds.

A. Failure of OFU in constrained RL problems

Consider a two-state S = {1, 2}, two-action A = {0, 1} MDP in which the controlled transition probabilities  p(1, 1, 1) = 1 − θand  p(1, 1, 2) = θare unknown, while remaining probabilitites are equal to .5. Assume that r(1, a), c(1, a) ≡ 0and  r(2, a), c(2, a) ≡ 1, i.e., reward and cost depend only upon the current state. Assume that  θ > .5, and the average cost threshold satisfies  cub < 2θ/(1 + 2θ). Since state 2 yields reward at the maximum rate, and  θ > .5this means that the optimal action in state 1 is 1. Let ˆθtand  ǫtdenote the empirical estimate of  θ, and the radius of confidence interval respectively at time t. Then UCRL2 sets the optimistic estimate of  θequal to ˆθt + ǫtand then implements the control that is optimal when true parameter value is equal to this estimate. Thus, if ˆθt + ǫt ≥ .5, then it chooses action 1 in state 1. Since with a high probability we have ˆθt + ǫt ≥ θ, and ˆθt +ǫt → θas  T → ∞[12], we have that when the index of the RL episode is sufficiently large, the agent implements action 1 in state 1. Since the average cost of this policy is 2θ/(1+2θ), this means that UCRL2 violates the average cost constraint.

In our setup, at each time t the agent earns a reward and also incurs M costs. Reward and cost functions are denoted by  r, {ci}Mi=1, S ×A �→ R. Thus, the instantaneous reward ob- tained upon taking an action a in the state s is equal to r(s, a), while the i-th cost is equal to  ci(s, a). A controlled Markov process in which the agent earns reward and incurs M costs is defined by the tuple  CMP = (S, A, p, r, c1, c2, . . . , cM). The probabilities  p(s, a, s′)are not known to the agent, while the reward and cost functions  r, {ci}Mi=1, S × A �→ Rare known to the agent. We will now briefly discuss some notions and results on MDPs.

Let  P (t)π,p,xdenote the t-step probability distribution when the policy  πis applied to the MDP p and the initial state is x, while  Pπ,pbe the corresponding stationary measure 2. For two measures  µ1, µ2, we let  ∥µ1 − µ2∥Vdenote the total variation distance [22] between  µ1and  µ2.

Definition 1: (Unichain MDP) The MDP p is unichain if under any stationary policy there is a single recurrent class. If an MDP is unichain [2], then for the Markov chain induced by any stationary policy  π, we have

image

where  C > 0, α < 1are constants. The mixing time of an MDP p is defined as  TM(p) := maxπ Eπ,pTs,s′, where  Ts,s′denotes the time taken by the Markov chain induced by policy πto hit state  s′, when it starts in state s.

Definition 2: (Control Policy) Let

image

be the |A|-simplex and  Ftdenote the sigma-algebra [23] generated by the random variables  {(sℓ, aℓ)}t−1ℓ=1∪st. A control policy  π[2], [24] is a collection of maps  Ft �→ ∆(A), t =1, 2, . . . that chooses action  aton the basis of past operational history of the system. Thus, under policy  π, we have that  atis chosen according to the probability distribution  π(Ft). A general control policy is allowed to be history-dependent and randomized.

Definition 3: (Stationary Policy) A stationary policy  π :S �→ ∆(A), is a mapping from state S to a probability distribution on the action space A, and prescribes randomized controls on the basis of the current state  st. Thus, under policy π, we have that  atis chosen according to the probability distribution  π(·|st).

A. Notation

Throughout, we use bold font for denoting vectors; for example the vector  (x1, x2, . . . , xN)is denoted by x. We use N to denote the set of natural numbers,  RMto denote the M dimensional Euclidean space, and  RM+to denote non-negative orthant of  RM. Inequalities between two vectors are to be understood component-wise. If E is an event [23], then1(E)denotes its indicator function. For a control policy  π,3

image

denotes its average reward, and

image

denotes its average i-th cost. For  x ∈ RN, we let  ∥x∥1denote its 1-norm.  0Mdenotes the M-dimensional zero vector consisting of all zeros. For  x, y ∈ R, we let  x ∨ y :=max{x, y}. Throughout, we abbreviate [M] := {1, 2, . . ., M}, S := |S|, A := |A|.

B. Constrained MDPs

We now present some definitions and standard results pertaining to constrained MDPs. These can be found in [25].

Definition 4 (Occupation Measure): Consider the controlled Markov process  stevolving under the application of a stationary policy  π. Its occupation measure  µπ ={µπ(s, a) : (s, a) ∈ S × A}is defined as

image

and describes the average amount of time that the process (st, at)spends on each possible state-action pair.

Definition 5 (SR(µ)): Consider a vector µ ={µ(s, a) : (s, a) ∈ S × A}that satisfies the constraints (6) and (7) below. Define  SR(µ)to be the following stationary randomized policy. When the state  stof the environment is equal to s, the policy chooses the action a with a probability equal to µ(s,a)�a′∈A µ(s,a′)if  �a′∈A µ(s, a′) > 0. However, if �a′∈A µ(s, a′) = 0, then the policy takes an action according to some pre-specified rule (e.g. implement  at = 0).

Consider the controlled Markov process CMP = (S, A, p, r, c1, c2, . . . , cM). The following dynamic optimization problem is a constrained Markov Decision Process (CMDP) [25],

image

where the maximization above is over the class of all history-dependent policies, and  cubidenotes the desired upper-bound on the average value of i-th cost expense. The optimal average reward rate of the CMDP is equal to the optimal value of the above LP, and is denoted by  r⋆.

Linear Programming (LP) approach for solving CMDPs: When the controlled transition probabilities  p(s, a, s′)are known, and p is unichain, an optimal policy for the CMDP (2)- (3) can be obtained by solving the following linear program (LP) [25],

image

Let  µ⋆be a solution of the above LP. Then, the stationary randomized policy  SR(µ⋆)solves (2)-(3). Moreover, it can be shown that the average reward and M costs of  SR(µ⋆)are independent of the initial starting state  s0if the MDP is unichain [25].

C. Learning Algorithms and Regret Vector

We will develop reinforcement learning algorithms to solve the finite-time horizon version of the CMDP (2)-(3) when the probabilities  p(s, a, s′)are not known to the agent. Let Ftdenote the sigma-algebra [23] generated by the random variables  {(sℓ, aℓ)}t−1ℓ=1∪st. A learning policy  πis a collection of maps  Ft �→ ∆(A), t = 1, 2, . . .that chooses action  aton the basis of past operational history of the system. In order to measure the performance of a learning algorithm, we define its reward and cost regrets. The “cumulative reward regret” until time T , denoted by  ∆(R)(T ), is defined as,

image

where  r⋆is the optimal average reward of the CMDP (2)-(3) when controlled transition probabilities  p(s, a, s′)are known. Note that  r⋆is the optimal value of the LP (4)-(7). The “cumulative cost regret” for the i-th cost until time T is denoted by  ∆(i)(T ), and is defined as,

image

Remark 1: In the conventional regret analysis of RL algorithms, the objective is to bound the reward regret  ∆(R)(T ). However, in our setup, due to considerations on the cost expenditures, we also need to bound the cost regrets  ∆(i)(T ). Indeed, as shown in the Section VI, we can force  ∆(R)(T )to be arbitrarily small at the expense of increased cost regrets, and also vice versa. The consideration of the regret vector, and the possibility of tuning its various components, is a key novelty of our work. The problem of tuning this vector is challenging because its various components are correlated.

We propose UCRL-CMDP to adaptively control an unknown CMDP. It is depicted in Algorithm 1. UCRL-CMDP maintains empirical estimates of the unknown transition probabilities as follows,

image

where  Nt(s, a)and  Nt(s, a, s′)denote the number of visits to (s, a) and  (s, a, s′)until t respectively.

Confidence Intervals: Additionally, it also maintains confi-dence interval  Ctassociated with the estimate  ˆptas follows,

image

image

where

image

b > 1 is a constant. Episode: UCRL-CMDP proceeds in episodes, and utilizes a single stationary control policy within an episode. Each episode is of duration  ⌈T α⌉steps. Let  τkdenote the start time of episode k. k-th episode is denoted by  Ek :={τk, τk + 1, . . . , τk+1 − 1}, and comprises of  τk+1 − τkconsecutive time-steps. Denote by  ktthe index of the ongoing episode at time t. At the beginning of  Ek, the agent solves the following constrained optimization problem in which the decision variables are (i) Occupation measure  µ = {µ(s, a) :(s, a) ∈ S×A}of the controlled process, and (ii) “Candidate”

MDP  p′,

image

The maximization w.r.t.  p′denotes that the agent is optimistic regarding the belief of the “true” (but unknown) MDP p, while that w.r.t.  µensures that the agent optimizes its control strategy for this optimistic MDP. The constraints (14) ensure that the cost expenditures do not exceed the thresholds  {cubi }Mi=1, and hence ensure that the agent also balances the cost expenses while being optimistic with respect to the rewards about the choice of the MDP thereby taking a balanced approach to optimism when the underlying MDP parameters are unknown. If the constraints (14) were absent, we would recover the UCRL2 algorithm of [12] that is based on the OFU principle [26], [27]. However, as is shown in Section II-A, the OFU principle might fail when it is applied for learning the optimal controls for CMDPs. Indeed, as is shown in the example in Section II-A, the limiting average cost is greater than the threshold value of cost. The BOFU principle proposed in this work is a natural extension of the OFU principle to the case when the agent has to satisfy certain constraints on costs, in addition to maximizing the rewards. In case the problem (13)-(17) is feasible, let  (˜µk, ˜pk)denote a solution. The agent then chooses ataccording to  SR(˜µk)within  Ek. However, in the event the LP (13)-(17) is infeasible, the agent implements an arbitrary stationary control policy that has been chosen at time t = 0. In summary, it implements a stationary controller within  Ek, which is denoted by  ˜πk. We make the following assumptions on the MDP p while analyzing UCRL-CMDP.

Assumption 1:

1) The MDP p = {p(s, a, s′) : s, s′ ∈ S, a ∈ A}is unichain. Thus, under a stationary policy  πwe have

image

where  C > 0, 0 ≤ ρ < 1. 2) The CMDP (2)-(3) is feasible, i.e., there exists a policy under which the average cost constraints (3) are satisfied. 3) Without loss of generality, we assume that the magnitude of rewards and costs are upper-bounded by 1, i.e.,

image

Hence, if  r⋆denotes optimal reward rate of (2)-(3), then r⋆ < 1. Moreover, the cost bounds  {cubi }Mi=1can be taken to be less than 1.

We establish the following bound on the regrets of UCRL- CMDP.

Theorem 1: Consider the UCRL-CMDP (Algorithm 1) applied with  δ = 1/Tto an MDP p that satisfies Assumption 1. The reward and cost regrets can be bounded as follows,

image

where  β > 1/2satisfies  2β − α = 1. Upon letting  α = 1/3in the above, the above reduces to

image

We begin by introducing few notation. If B denotes a subset of S, then we let  ΠBbe the set of those policy  πfor which the occupation measure  µπsatisfies  µπ(s) > 0for all  s ∈ B. Let  Bπdenote the set of states for which  µπ(s) > 0. We now derive few preliminary results that are used while proving the main result.

The following result can be shown by an application of Azuma-Hoeffding inequality [28].

image

where the confidence ball  Cτkis as in (11). Define the set G1 := {p ∈ Cτk, ∀k = 1, 2, . . ., K}. Then,

image

Proof: If the number of visits to (s, a) were determinisic, denoted by N(s, a), and  N(s, a, s′)the corresponding number of visits to  (s, a, s′), then it follows from Azuma-Hoeffding’s inequality [28] that

image

Upon letting  ǫequal to �2N(s, a) log (T b|S||A|)in the

above, we get that  |p(s, a, s′) − p(s, a, s′| ≤�

holds with a probability greater than 1T b|S||A|. The proof is then completed by considering all possibilities for N(s, a), state-action pairs and using union bound.

Lemma 2: Define

image

where K is the total number of episodes and  β > 1/2satisfies 2β − α = 1. We have

image

is a martingale difference sequence. Furthermore, since the duration of each episode is bounded by  T α, and �Nk(s, a) ≥ 1, we have nk(s,a)−E(nk(s,a)|Fτk)√Nk(s,a) ≤ T α. By

applying Azuma-Hoeffding’s inequality to this martingale difference sequence, we get that the probability of the event

image

upper-bounded by

image

Since  2β−(1+α) = 0, the above bound reduces to δSAT. The proof then follows by using union bound for all state-action pairs and k.

Lemma 3: If  s ∈ Bπk, then

image

Proof: Since we have  EπTs′,s ≤ TM, ∀s′ ∈ S, it follows from Markov’s inequality that the probability with which  stdoes not hit the state s in  2TMsteps, is less than 1/2, or equivalently the state s is visited atleast once with a probability greater than 1/2, which yields us the following,

image

The proof is then completed by dividing the total time of ⌈T α⌉steps in an episode into “mini-episodes” of  ⌈2TM⌉steps each, and noting that  nk(s, a)is the sum of number of visits to (s, a) during each such mini-episode.

We begin by giving an equivalent characterization of the UCRL-CMDP rule. At each  τk, it assigns an index  Ik(π)to each stationary policy  πas follows,

image

Proof: Note that  P (1)π,p,s, s ∈ Sdenotes the transition probabilities of the Markov chain that results when the policy πis applied to the MDP p. Consider an MDP  θ ∈ Cτk. Since on G, we have  p ∈ Cτk, we get the following,

image

so that from triangle inequality we have that

image

(23) then follows from Theorem 4.

For a policy  π, we denote the quantity  r⋆ − ¯r(π, p)by its instantaneous reward regret, and  ¯ci(π, p) − cubiits instantaneous cost regret for the i-th cost. We now show that if the instantaneous reward regret, or an instantaneous cost regret of a policy is greater than a certain threshold (that depends upon the radius of the confidence ball at time  τk), then it is not played during  Ek. For a stationary policy  π, define

image

Consider the following two possibilities.

Case A)  ¯ci(π, p) > cubi + δk(π)for some i: From (23) we have that  |¯ci(π, p) − ¯ci(π, θ)| ≤ δk(π)which implies ¯ci(π, θ) > cubifor all  θ ∈ Cτk. Thus  Ik(π) = −∞.

Case B) From (23) we have that  |¯r(π, p)− ¯r(π, θ)| ≤ δk(π)for all  θ ∈ Cτk, so that the index  Ik(π)is bounded by  ¯r(π, p)+δk(π).

The following result summarizes this discussion.

Lemma 5: Let  πbe a stationary randomized policy. On the set G we have that  Ik(π) = −∞if  ¯ci(π, p) > cubi +δk(π),for some  i ∈ [M]. Also,  Ik(π) ≤ ¯r(π, p) + δk(π).

We now show that if a stationary policy is feasible for the MDP p, i.e.  ¯ci(π, p) ≤ cubi , ∀i, then its index  Ik(π)is lower- bounded by  r⋆.

image

Upon combining Lemma 5 and Lemma 6, we obtain the following result.

image

by UCRL-CMDP. This means that  ¯ci(π, p) ≤ cubi + δk(πk), which shows that the instantaneous cost regret is bounded by δk(πk).

To bound the reward regret, note that it was shown in Lemma 6 that the index of an optimal policy is greater than  r⋆, and hence the index  Ik(πk)is greater than  r⋆. From Lemma 5 we have  Ik(πk) ≤ ¯r(π, p) + δk(πk). Hence we have that ¯r(π, p)+δk(πk) ≥ r⋆, or  ¯r(π, p) ≥ r⋆ −δk(πk), which shows that the instantaneous reward regret is bounded by  δk(πk).

We now use the result on instantaneous regrets in order to bound the cumulative regrets of UCRL-CMDP.

Proof of Theorem 1: We will only discuss bound on reward regret, since the bound on cost regrets can be derived using similar steps. Note that the expected regret during  Ekcan be written as follows,

image

where the last inequality follows from (18). Denote  ∆(R)k :=E��t∈Ek r⋆ − ¯r(πk, p)���Fτk�as the regret incurred during the k-th episode. It follows from the above discussion that the cumulative expected regret can be bounded as follows,

image

where K is the total number of episodes. Henceforth we will only focus on bounding the first term in the r.h.s. above. This is bounded separately on the sets  G, Gc1, Gc2.

We begin by bounding �Kk=1 ∆(R)k on G. We have,

image

where the first inequality follows from Lemma 7, and the last inequality follows from Lemma 3. We will now bound the term �Kk=1�(s,a):s∈BπkE(nk(s)|Fk)πk(a|s)√Nk(s,a). We have

image

As is shown in [12, p. 1578], the term �Kk=1�(s,a) nk(s,a)√Nk(s,a)can be bounded by  (√2+1)√SAT, while from (21) we have that on  G2, the term �Kk=1E(nk(s,a)|Fk)−nk(s,a)√Nk(s,a)is bounded

image

by  T β

image

above that on G we have

image

Upon summing (26) over episodes, and using (28), we obtain that the regret on G can be bounded as follows,

image

This completes the analysis on G.

We now analyze the regret on  Gc2. From Lemma 2, the probability of  Gc2is bounded by  δ. On  Gc2, the sample path regret �Kk=1 ∆(R)kcan be trivially bounded by T , so that its contribution to the expected regret is bounded by  δT.

To analyze the regret on  Gc1we note that if the confidence ball  Cτkat time  τkfails, then the regret during  Ekcan be bounded by the duration of  Ek. Since  τk+1 − τk = ⌈T α⌉, the regret during  Ekis bounded by  ⌈T α⌉. From Lemma 1 we have that the probability with which confidence ball fails at time t is upper-bounded by 2T 2b−1|S||A|. Hence, the expected regret from the failure of ball (in case an episode starts at t) at time t is bounded by 2⌈T α⌉T 2b−1|S||A|, so that the cumulative expected regret is bounded by 2T 2b−2|S||A|.

The upper-bounds for the regrets of UCRL-CMDP in Theorem 1 are the same for reward and M costs regrets. However, in many practical applications, an agent is more sensitive to over-utilizing certain specific costs, as compared to the other costs. Thus, in this section, we derive algorithms which enable the agent to tune the upper-bounds on the regrets of different costs. We also quantify the reward regret of these algorithms.

A. Modified UCRL-CMDP

Throughout this section we assume that p satisfies the following.

Assumption 2: For the MDP p, there exists a stationary policy under which the average costs are strictly below the thresholds  {cubi : i = 1, 2, . . ., M}. More precisely, there exists an  ǫ > 0and a stationary policy  πfeas.such that we have  ¯ci(πfeas.) < cubi − ǫ, ∀i ∈ [M]. Define

image

The modified algorithm maintains empirical estimates  ˆptand confidence intervals  Ct(11) in exactly the same manner as UCRL-CMDP (Algorithm 1) does. It also proceeds in episodes, and uses a single stationary control policy within an episode. However, at the beginning of each episode k, it solves the following optimization problem, which is a modification of the problem (13)-(17) that is solved by UCRL-CMDP. More concretely, the cost constraints (14) are replaced by the constraints (32) on the costs:

image

where,

image

image

and the parameters  bi ∈ (0, 1), i ∈ [M]are chosen by the agent. If the LP (31)-(35) is feasible, let  ˜µkbe an optimal occupation measure obtained by solving it. In this case, the agent implements  SR(˜µk)within  Ek. However, if the LP is infeasible, then it implements a stationary controller that has been chosen at time t = 0. This is summarized in Algorithm 2. We will analyze Algorithm 2 under the following assumption on the underlying MDP p.

We derive upper-bounds on regrets of modified algorithm in the following result.

Theorem 2: Consider the modified UCRL-CMDP with  δ =1/T , α = 1/3applied to an MDP p that satisfies Assumption 1 and Assumption 2. Then, the expected reward and cost regrets can be upper-bounded as follows:

image

and

image

Proof closely follows the proof of Theorem 1, hence we point out only the key differences. The modified UCRL algorithm assigns the following modified index to policy  π,

image

If for some i we have  ¯ci(π, θ) > cubi − di, ∀θ ∈ Cτk, then we set  Ik(π) = −∞.

The proof of next result is omitted since it is similar to that of Lemma 5.

Lemma 8: Let  πbe a stationary randomized policy. On the set G we have that  Ik(π) = −∞if  ¯ci(π, p) > cubi − di +δt(π),for some  i ∈ [M]. Also,  Ik(π) ≤ ¯r(π, p) + δt(π).

The following result allows us to derive bounds on the instantaneous regrets.

image

It follows from Lemma 14 that the optimal value of the CMDP maxπ ¯r(π, p), such that  ¯ci(π, p) ≤ cubi − di, ∀i ∈ [M], is greater than or equal to  r⋆ − z. Hence, it follows from the discussion above that the index of the policy which is optimal for this CMDP is greater than or equal to  r⋆ − z. As earlier, we bound the regret on the sets  G, Gc1and  Gc2separately. On G, the regret is bounded by the time spent playing sub-optimal policies.

image

policy  πwill not be played by UCRL-CMDP. This means that  ¯ci(π, p) ≤ cubi − di + δt(πk), which shows that the instantaneous cost regret is bounded by  δt(πk) − di.

To bound the instantaneous reward regret, note that it was shown in Lemma 9 that there is a policy with index greater than  r⋆ − z, and hence the index of  πkis necessarily greater than  r⋆ − z. Since from Lemma 8 we have that the index of  πkis upper-bounded by  ¯r(π, p) + δt(πk), we must have ¯r(π, p) + δt(πk) ≥ r⋆ − z, or  ¯r(π, p) ≥ r⋆ − δt(πk) − z, which shows that the instantaneous reward regret is bounded by  δt(πk) + z.

Proof of Theorem 2: Since the proof closely follows that of Theorem 1, we only point out the key differences. The decomposition result 25 holds for reward as well cost regrets. Similarly, the regrets on  Gc2and  Gc1can be bounded by  δTand 2T 2b−2|S||A|respectively. The only difference arises during bounding the terms �k ∆(R)kand �k ∆(i)k. It follows from Lemma 10 that the bound on �k ∆(R)kdiffers from (26) by an additional term zT , and similarly that on �k ∆(i)kdiffers by  ǫbiT. The proof is then completed by summing the bounds on regrets over the sets  G, Gc1, Gc2.

Let  λ ≥ 0M. Consider the Lagrangian relaxation of (2)-(3),

image

where c(st, at)is the vector that consists of costs ci(st, at), i ∈ [M]. Consider its associated dual function [29], D(λ) := maxπ L(λ; π), and the dual problem

image

Define the diameter D(p) of MDP p as follows, D(p) := maxs,s′ minπ T πs,s′. D(p)is finite if p is communicating [2].

Theorem 3: Consider a learning algorithm  φ. Then, there is a problem instance such that the regrets  ∆(R)(T ), {∆(i)(T )}Mi=1under  φsatisfy

image

where  λ⋆is an optimal solution of the dual problem (40).

Proof: We begin by considering an auxiliary reward maximization problem that involves the same MDP p, but in which the reward received at time t by the agent is equal to r(st, at)+ λ·�cub − c(st, at)�instead of  r(st, at). However, there are no average cost constraints in the auxiliary problem. Let  φ′be a history dependent policy for this auxiliary problem. Denote its optimal reward by  r⋆(λ). Then, the regret for cumulative rewards collected by  φ′in the auxiliary problem is given by

image

It follows from Theorem 5 of [12] that the controlled transition probabilities  p(s, a, s′)of the underlying MDP can be chosen so that this regret is greater than  .015 ·�D(p)SAT, i.e.,

image

We observe that any valid learning algorithm for the constrained problem is also a valid algorithm for the auxiliary problem. Thus, if  φis a learning algorithm for the problem with average cost constraints, then we have

image

We now substitute (46) in the above to obtain

image

Since the expression in the r.h.s. is maximized for values of  λwhich are optimal for the dual problem (40), we set it equal to  λ⋆, and then use Lemma 11 in order to obtain

image

This completes the proof.

We compare the performance of the proposed UCRL-CMDP algorithm with the Actor-Critic algorithm for CMDPs that was proposed in [14]. Actor-Critic algorithms are a popular class of online learning algorithms [30][32] that are based on multi-time-scale stochastic approximation [33], [34]. We compare algorithms on the example presented in Section I in which the goal is to learn an efficient network controller. We begin by explaining the experiment setup.

Experiment Setup: Consider the single-hop wireless network that was discussed in Section I, and consists of a single wireless node that transmits data packets to a receiver. The access point has to dynamically choose the transmission power atat each time t. For simplicity, we let the action set A be binary, and take the channel state to be static, i.e. it does not evolve or equivalently it assumes only a single value. Thus at = 0would mean that no packet was attempted transmission at time t, while  at = 1would mean that a single packet would be delivered, with a probability equal to the channel reliability. The number of packets that arrive at time t are denoted by  At. We let  At ∈ {0, 1, 2, 3}and assume that  Atare i.i.d. across times. The probability vector associated with  Atthat describes its probability mass function is taken equal to (.65, .2, .1, .05) for the experiments shown in Fig. 1, Fig. 2. The packet buffer is of a finite capacity, and can hold a maximum of B packets. Thus, the dynamics of the queue length can be described as follows,

image

where for  x ∈ Rwe let  (x)+ := max{x, 0}, and  x ∧ B :=min{x, B}, while  Dtis the number that depart (are delivered to destination) at time t. In our experiments we use B = 6, and take the channel reliability as .9. Hence, if  at = 1then Dtassumes the value 1 with a probability .9. The associated CMDP can be stated as follows:

image

We now discuss the Actor-Critic algorithm for CMDPs. We begin with some notation that are required in order to discuss Actor-Critic algorithm. Let {a(n)}, {b(n)}, {c(n)} be positive stepsize sequences satisfying �∞n=1 a(n) =∞, �∞n=1 b(n) = ∞, �∞n=1 c(n) = ∞, �∞n=1 a2(n) +�∞n=1 b2(n) + �∞n=1 c2(n) < ∞, and  b(n)a(n) → 0, c(n)b(n) → 0.In our experiments we use a(n) = 1/n, b(n) = 1/(n log n) and  c(n) = 1/(n log2 n). Let  Q :=�x ∈ R|A|−1 : xi ≥0 ∀i, �|A|−1j=1 xj ≤ 1�denote the simplex of subprobability vectors. Let  Γ(·)denote the map that projects a vector onto Q. Thus, if  x ∈ Qthen  Γ(x) = x, otherwise  Γ(x)is the point from Q that is closest to x.

Actor-Critic Algorithm for CMDPs: The algorithm carries out iterations for three quantities that evolve at different timescales and are coupled. To begin with, it replaces the original constrained MDP by an unconstrained one by imposing a penalty upon constraint violation. is held fixed, (or equivalently  r(st, at) − ˜λtc(st, at), since the term ˜λtcubdoes not depend upon the controls) The instantaneous reward for this modified MDP is equal to  r(st, at) − ˜λt�c(st, at) − cub�where ˜λt ≥ 0is the price associated with the constraint violation. ˜λt ≥ 0is itself being tuned in an online way, though at a slower time-scale. ˜λtserves as an estimate of the optimal value of the dual variable for the original CMDP. In order to solve this unconstrained MDP, the algorithm keeps an estimate of the value function  Vt : S �→ R, which is updated as follows,

image

where  s⋆is a designated state. Let  πt(a|s)denote the probability with which action a is implemented in state s at time t. Let  a⋆be a designated action. These probabilities are generated as follows. The algorithm maintains vectors ˆπt(s) = {ˆπt(a|s) : a ∈ A}for each state  s ∈ S, and updates it as follows,

image

where,

image

where  eais the unit vector with a 1 in the place corresponding to action  a4. The probability for action  a⋆is computed as follows,

image

The action probabilities  πtare then generated from  ˆπtas follows,

image

where  ǫt → 0. Finally, the price ˜λtis updated as follows,

image

where  cubis the threshold on average queue length as in (44). In our experiments we use  s⋆ = B, a⋆ = 0and  ǫt = 1/t. Results: Fig. 1 compares the cumulative regrets incurred by

these algorithms. We observe that the reward regret as well

as cost regret of UCRL-CMDP are low. We observe a serious drawback of the Actor-Critic algorithm’s performance, that the cost regret is prohibitively high. We then vary the budget  cubon the average queue length. These results are shown in Fig. 2. Once again, we make a similar observation, that UCRL-CMDP is effective in balancing both, the reward regret  ∆(R)(t)and the cost regret  ∆(1)(t), while the Actor-Critic algorithm yields a high cost regret. In both of these experiments the probability vector of arrivals was held fixed at (.65, .2, .1, .05). We vary this probability vector, and plot the regrets in Fig. 3b. Once again, UCRL-CMDP outperforms the Actor-Critic algorithm. Though the reward regret of Actor-Critic algorithm is lower than that of the UCRL-CMDP algorithms, this occurs at the expense of an undesireable much larger cost regret. In contrast, the reward regret as well as cost regret of UCRL-CMDP is low. Plots are obtained after averaging over 100 runs.

image

Fig. 1: Plot of the reward regret (a) and cost regret (b), for the network in which the probability vector associated with arrivals is (.65, .2, .1, .05), channel reliability is .9, and desired delay is cub = 4.5. Plots are obtained after averaging over 100 runs.

image

Fig. 2: Plot of the normalized reward regret (a) and cost regret (b), as the desired delay  cub is varied.

image

Fig. 3: Plot of the reward regret (a) and cost regret (b), as the probability distribution of the arrivals is varied. The probability vector of  Atis equal to  (.65 − .02i, .2, .1 + .01i, .05 + .01i), wherethe parameter i is varied from 0 to 9. The desired delay  cub is heldfixed at 4.5, and channel reliability at .9.

In this work, we initiate a study to develop learning algorithms that simultaneously control all the components of the regret vector while controlling unknown MDPs. We devised algorithms that are able to tune different components of the cost regret vector, and also obtained a non-achievability result that characterizes those regret vectors that cannot be achieved under any learning rule. In our work, we assume that the underlying MDP is unichain. An interesting research problem is to characterize the set of achievable regret vectors under the weaker assumption that the underlying MDP is communicating.

[1] R. S. Sutton and A. G. Barto, Reinforcement learning - an introduction, ser. Adaptive computation and machine learning. MIT Press, 1998. [Online]. Available: http://www.worldcat.org/oclc/37293240

[2] M. L. Puterman, Markov Decision Processes.: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[3] L. I. Sennott, Stochastic dynamic programming and the control of queueing systems. John Wiley & Sons, 2009, vol. 504.

[4] A. Lazar, “Optimal flow control of a class of queueing networks in equilibrium,” IEEE transactions on Automatic Control, vol. 28, no. 11, pp. 1001–1007, 1983.

[5] M.-T. T. Hsiao and A. A. Lazar, “Optimal decentralized flow control of markovian queueing networks with multiple controllers,” Performance evaluation, vol. 13, no. 3, pp. 181–204, 1991.

[6] P. Nain and K. Ross, “Optimal priority assignment with hard constraint,” IEEE transactions on Automatic Control, vol. 31, no. 10, pp. 883–888, 1986.

[7] R. Singh and P. Kumar, “Throughput optimal decentralized scheduling of multihop networks with end-to-end deadline constraints: Unreliable links,” IEEE Transactions on Automatic Control, vol. 64, no. 1, pp. 127–142, 2018.

[8] ——, “Adaptive csma for decentralized scheduling of multi-hop net- works with end-to-end deadline constraints,” IEEE/ACM Transactions on Networking, 2021.

[9] R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002.

[10] P. Auer and R. Ortner, “Logarithmic online regret bounds for undiscounted reinforcement learning,” in Advances in Neural Information Processing Systems, 2007, pp. 49–56.

[11] P. L. Bartlett and A. Tewari, “REGAL: A regularization based algorithm for reinforcement learning in weakly communicating mdps,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009. AUAI Press, 2009, pp. 35–42.

[12] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1563–1600, 2010.

[13] E. Altman and A. Schwartz, “Adaptive control of constrained Markov chains,” IEEE Transactions on Automatic Control, vol. 36, no. 4, pp. 454–462, 1991.

[14] V. S. Borkar, “An actor-critic algorithm for constrained Markov decision processes,” Systems & control letters, vol. 54, no. 3, pp. 207–213, 2005.

[15] ——, “Stochastic approximation with two time scales,” Systems & Control Letters, vol. 29, no. 5, pp. 291–294, 1997.

[16] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” arXiv preprint arXiv:1705.10528, 2017.

[17] Y. Liu, J. Ding, and X. Liu, “Ipo: Interior-point policy optimization under constraints,” arXiv preprint arXiv:1910.09615, 2019.

[18] C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” arXiv preprint arXiv:1805.11074, 2018.

[19] E. Uchibe and K. Doya, “Constrained reinforcement learning from intrinsic and extrinsic rewards,” in 2007 IEEE 6th International Conference on Development and Learning. IEEE, 2007, pp. 163–168.

[20] S. Qiu, X. Wei, Z. Yang, J. Ye, and Z. Wang, “Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions,” arXiv preprint arXiv:2003.00660, 2020.

[21] Y. Efroni, S. Mannor, and M. Pirotta, “Exploration-Exploitation in Constrained MDPs,” arXiv preprint arXiv:2003.02189, 2020.

[22] C. Villani, Optimal transport: old and new. Springer Science & Business Media, 2008, vol. 338.

[23] S. Resnick, A probability path. Springer, 2019.

[24] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identifica-tion, and adaptive control. SIAM, 2015.

[25] E. Altman, Constrained Markov Decision Processes. Chapman and Hall/CRC, March 1999.

[26] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985.

[27] R. Agrawal, “Sample mean based index policies by o (log n) regret for the multi-armed bandit problem,” Advances in Applied Probability, vol. 27, no. 4, pp. 1054–1078, 1995.

[28] K. Azuma, “Weighted sums of certain dependent random variables,” Tohoku Mathematical Journal, Second Series, vol. 19, no. 3, pp. 357– 367, 1967.

[29] D. P. Bertsekas, Nonlinear programming. Taylor & Francis, 1997, vol. 48, no. 3.

[30] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.

[31] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.

[32] V. R. Konda and V. S. Borkar, “Actor-critic–type learning algorithms for markov decision processes,” SIAM Journal on control and Optimization, vol. 38, no. 1, pp. 94–123, 1999.

[33] H. Kushner and G. G. Yin, Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media, 2003, vol. 35.

[34] V. S. Borkar, Stochastic approximation: a dynamical systems viewpoint. Springer, 2009, vol. 48.

[35] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.

[36] A. Y. Mitrophanov, “Sensitivity and convergence of uniformly ergodic markov chains,” Journal of Applied Probability, vol. 42, no. 4, pp. 1003– 1014, 2005.

image

We derive some preliminary results that will be utilized in the proof of Theorem 3.

Lemma 11: Consider the dual problem (40) associated with the CMDP (2), (3), and let  λ⋆be a solution of the dual problem. If Assumption 2 holds true, then we have that

image

where  r⋆is the optimal reward of CMDP (2), (3). Proof: Under Assumption 2, the CMDP (2)-(3) is strictly feasible, so that Slater’s constraint [35] is satisfied, and consequently strong duality holds true. Thus, if  λ⋆solves the dual problem (40), we then have that  D(λ⋆) = r⋆.

Lemma 12: Let  λ ≥ 0Mand  φbe a learning algorithm for the problem of maximizing cumulative rewards under average cost constraints. We then have the following,

image

image

We derive some results on the variations in the value of optimal reward of the CMDP (2)-(3) as a function of the cost budgets  cub. Consider a vector  ˆcubof cost budgets that satisfies

image

where  ǫ > 0. Now consider the following CMDP in which the upper-bounds on the average costs are equal to  {ˆcubi }Mi=1.

image

image

where the second inequality follows since a policy that is optimal for the problem (48)-(49) maximizes the Lagrangian ¯r(π) + �Mi=1 λi�ˆcubi − ¯ci(π)�when the Lagrange multiplier λis set equal to  λ⋆[29]. Rearranging the above inequality yields the desired result.

Lemma 14: Let the MDP p satisfy Assumption 1 and Assumption 2. If  r⋆(ˆcub)denotes optimal reward value of (48)- (49), and  r⋆is optimal reward of problem (2)-(3), then we have that

image

where  ˆηis as in Theorem ??,  ηis as in (30), and  ˆcsatis-fies (47).

Proof: As discussed in Section III-B, a CMDP can be posed as a linear program. Since under Assumption 2, both the CMDPs (2)-(3) and (48)-(49) are strictly feasible, we can use the strong duality property of linear programs [29] in order to conclude that the optimal value of the primal and the dual problems for both the CMDPs are equal. Thus,

image

Let  π(1), π(2)and  λ(1), λ(2)denote optimal policies and vector consisting of optimal dual variables for the two CMDPs. It then follows from (50) and (51) that,

image

Subtracting the second inequality from the first yields

image

where the last inequality follows from Lemma 13. This completes the proof.

B. Sensitivity of Markov Chains

The following result is essentially Corollary 3.1 of [36]. Consider a finite-state Markov chain with transition probabilities  {˜p(s, s′) : s, s′ ∈ S}. Let  P (t)sbe the probability distribution at time t when it starts in state s at time 0.

Theorem 4: Assume  ∥ ˜P (t)s − ˜P (∞)s ∥ ≤ Cρt, t ∈ N. Consider a Markov chain with transition probabilities  ˜q(s, s′). We have

image


Designed for Accessibility and to further Open Science