Learning in Markov Decision Processes under Constraints

2020·Arxiv

Abstract

Abstract

We consider reinforcement learning (RL) in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step t, it earns a reward, and also incurs a cost-vector consisting of M costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of T time-steps, while simultaneously ensuring that the average values of the M cost expenditures are bounded by agent-specified thresholds

In order to measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an M + 1 dimensional regret vector that is composed of its reward regret, and M cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the i-th component of the cost regret vector is the difference between its i-th cumulative cost expense and the expected cost expenditures

We prove that the expected value of the regret vector of UCRL-CMDP, is upper-bounded as time horizon. We further show how to reduce the regret of a desired subset of the M costs, at the expense of increasing the regrets of rewards and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints, and derive algorithms that can tune the regret vector according to the agent’s requirements on its cost regrets.

I. INTRODUCTION

Reinforcement Learning (RL) [1] involves an agent repeat- edly interacting with an environment modelled by a Markov Decision Process (MDP) [2]. More specifically, consider a controlled Markov process [2] . At each discrete time t, an agent applies control . State-space, and action space are denoted by S and A respectively, and are assumed to be finite. The controlled transition probabilities are denoted by . Thus, is the probability that the system state transitions to state upon applying action a in state s. The probabilities are not known to the agent. At each discrete time t = 1, 2, . . ., T , the agent observes the current state of the environment , applies control action , and earns a reward that is a known function of . When the agent applies an action a in the state s, then it earns a reward equal to r(s, a) units. The agent does not know the controlled transition probabilities that describe the system dynamics of the environment. The performance of an agent or a RL

Rahul Singh is with the Department of ECE, Indian Institute of Science, Bengaluru, Karnataka, India. email: rahulsingh@iisc.ac.in.

Abhishek Gupta and Ness Shroff are with the Department of ECE, Ohio State University, Columbus, OH, USA. email: gupta.706@osu.edu,shroff@ece.osu.edu

algorithm is measured by the cumulative rewards that it earns over the time horizon.

However in many applications, in addition to earning rewards, the agent also incurs costs at each time. The underlying physical constraints impose constraints on its cumulative cost expenditures, so that the agent needs to balance its reward earnings with the cost accretion while also simultaneously learning the choice of optimal decisions, all in an online manner.

As a motivating example, consider a single-hop wireless network that consists of a wireless node that transmits data packets to a receiver over an unreliable wireless channel. The channel reliability, i.e., the probability that a transmission at time-step t is successful, depends upon the instantaneous channel state and the transmission power . Thus, for example, this probability is higher when the channel is in a good state, or if transmission is carried out at higher power levels. The transmitter stores packets in a buffer, and its queue length at time t is denoted by . The wireless node is batteryoperated, and packet transmission consumes power. Hence, it is desired that the average power consumption is minimal. An appropriate performance metric for networks is the average queue length[3], and hence it is required that the average queue length stays below a certain threshold. The AP has to choose adaptively so as to minimize the power consumption, or equivalently maximize, while simultaneoulsy ensure that the average queue length is below a user-specified threshold, i.e. . In this example, the state of the “environment” at time t is given by the queue length and the channel state . Thus, it might be “optimal” to utilize high transmission power levels only when the instantaneous queue length is large or the wireless channel’s state is good. Such an adaptive strategy saves energy by transmitting at lower energy levels at other times. Since channel reliabilities are typically not known to the transmitter node, it does not know the transition probabilities that describe the controlled Markov process . Hence, it cannot compute the expectations of the average queue lengths and average power consumption for a fixed control policy, and needs to devise appropriate learning policies to optimize its performance under average-cost constraints. RL algorithms that we propose in this work solve exactly these classes of problems. Infact many important network control problems can be solved within the framework of constrained Markov decision processes (CMDPs). For example, [4], [5] utilize CMDPs in order to maximize the throughput in a stochastic network, where the network operator wants to satisfy constraints on delays, while [6] design policies that make dynamic decisions regarding network access in networks shared by different types of traffic. Similarly, it has been used in [7], [8] in order to maximize the timely throughput1 in stochastic networks. If the network/ system parameters are unknown, then the CMDP can be posed as a linear program (LP), and solved efficiently. However, in practice, network parameters are seldom known to the network operator, and it needs to design algorithms which “learn” the optimal policies in an “optimal” manner. Our work addresses precisely this issue.

II. PREVIOUS WORKS AND OUR CONTRIBUTIONS

RL Algorithms for unconstrained MDPs: RL problems without constraints are well-understood by now. Works such as [9]–[12] develop algorithms using the principle of “optimism under uncertainty.” UCRL2 of [12] is a popular RL algorithm that has a regret bound of , where D(p) is the diameter [12] of the MDP p; the algorithms proposed in our work are based on UCRL2.

RL Algorithms for Constrained MDPs: [13] is an early work on optimally controlling unknown MDPs under average cost constraints. It utilizes the certainty equivalence (CE) principle, i.e., it applies controls that are optimal under the assumption that the true (but unknown) MDP parameters are equal to the empirical estimates, and also occasionally resorts to “forced explorations.” This algorithm yields asymptotically (as ) the same reward rate as the case when the MDP parameters are known. However, analysis is performed under the assumption that the CMDP is strictly feasible. Moreover the algorithm lacks finite-time performance guarantees (bounds on regret). Unlike [13], we do not assume strict feasibility; infact we show that the use of confidence bounds allows us to get rid of the strict feasibility assumption. [14] derives a learning scheme based on multi time-scale stochastic approximation [15], in which the task of learning an optimal policy for the CMDP is decomposed into that of learning the optimal value of the dual variables, which correspond to the price of violating the average cost constraints, and that of learning the optimal policy for an unconstrained MDP parameterized by the dual variables. However, the proposed scheme lacks finite-time regret analysis, and might suffer from a large regret. Prima facie, this layered decomposition might not be optimal with respect to the sample-complexity of the online RL problem. The works [16]–[19] design policy-search algorithms for constrained RL problems. However unlike our work, they do not utilize the concept of regret vector, and their theoretical guarantees need further research. After the first draft of our work was published online, there appeared a few manuscripts/works that address various facets of learning in CMDPs, and these have some similarity with our work. For example [20] considers episodic RL problems with constraints in which the reward function is time-varying. Similarly, [21] also considers episodic RL in which the state is reset at the beginning of each episode. In contrast, we deal exclusively with non-episodic infinite horizon RL problems. In fact, as we show in our work, the primary difficulty in non-episodic constrained RL arises due to the fact that it is not possible to simultaneously “control/upper-bound” the reward and M costs during long runs of the controlled Markov process. Consequently, in order to control the regret vector, we make the assumption that the underlying MDP is unichain. However, this problem does not occur in the episodic RL case [20], [21] since the state is reset. Secondly, unlike the algorithms provided in our work, [20], [21] do not allow the agent to tune the regret vector.

Our contributions are summarized as follows.

1) We initiate the problem of designing RL algorithms that maximize the cumulative rewards while simultaneously satisfying average cost constraints. We propose an algorithm which we call UCRL for CMDPs, henceforth abbreviated as UCRL-CMDP. UCRL-CMDP is a mod-ification of the popular RL algorithm UCRL2 of [12] that utilizes the principle of optimism in the face of uncertainty (OFU) while making decisions. Since an algorithm that utilizes OFU does not need to satisfy cost constraints (this is briefly discussed in Section II-A), we modify OFU appropriately and derive the principle of balanced optimism in the face of uncertainty (BOFU). Under the BOFU principle, at the beginning of each RL episode, the agent has to solve for (i) an MDP, and (ii) a controller, such that the average costs of a system in which the dynamics are described by (i), and which is controlled using (ii), are less than or equal to the cost constraints. This is summarized in Algorithm 1.

2) In order to quantify the finite-time performance of an RL algorithm that has to perform under average cost constraints, we define its M + 1 dimensional “regret vector” that is composed of its reward regret (8) and M cost regrets (9). More precisely, considering solely the reward regret (as is done in the RL literature) overlooks the cost expenditures. Indeed, we show in Theorem 2 that the reward regret can be made arbitrary small (with a high probability) at the expense of an increase in the cumulative cost expenditure. Thus, while comparing the performance of two different learning algorithms, we also need to compare their cost expenditures. The reward regret of a learning algorithm is the difference between its reward and the reward of an optimal policy that knows the MDP parameters, while the i-th cost regret is the difference between the total cost incurred until T time-steps, and .

3) We ask the following question in the constrained setup: What is the set of “achievable” M + 1 dimensional regret vectors? In Theorem 1 we show that the components of the regret vector of UCRL-CMDP, can be bounded as .

4) We show that the use of BOFU allows us to overcome the shortcomings of the CE approach that were encountered in [13], i.e., there are arbitrarily long time-

durations during which the CMDP in which the system dynamics are described by the current empirical estimates of transition probabilities is infeasible, and hence the agent is unable to utilize these estimates in order to make control decisions. As a by-product, BOFU also allows us to get rid of “forced explorations,” that were utilized in [13], i.e., employing randomized controls occasionally.

5) Analogous to the unconstrained RL setup, in which one is interested in quantifying a lower bound on the regret of any learning algorithm, we provide a partial characterization of the set of those M + 1-dimensional regret vectors, which cannot be achieved under any learning algorithm. More specifically, in Theorem 3 we show that a weighted sum of the M + 1 regrets is necessarily greater than , where D(p) is the diameter of the underlying MDP, and S, A is the number of states and control actions respectively.

6) In many applications, an agent is more sensitive to the cost expenditures of some specific resources as compared to the rest, and a procedure to “tune” the M + 1 dimensional regret vector is essential. In Section VI, we consider the scenario in which the agent can prespecify the desired bounds on each component of the cost regret vector, and introduce a modification to the UCRL-CMDP that allows the agent to keep the cost regrets below these bounds.

A. Failure of OFU in constrained RL problems

Consider a two-state S = {1, 2}, two-action A = {0, 1} MDP in which the controlled transition probabilities and are unknown, while remaining probabilitites are equal to .5. Assume that and , i.e., reward and cost depend only upon the current state. Assume that , and the average cost threshold satisfies . Since state 2 yields reward at the maximum rate, and this means that the optimal action in state 1 is 1. Let and denote the empirical estimate of , and the radius of confidence interval respectively at time t. Then UCRL2 sets the optimistic estimate of equal to and then implements the control that is optimal when true parameter value is equal to this estimate. Thus, if , then it chooses action 1 in state 1. Since with a high probability we have , and as [12], we have that when the index of the RL episode is sufficiently large, the agent implements action 1 in state 1. Since the average cost of this policy is , this means that UCRL2 violates the average cost constraint.

III. PRELIMINARIES

In our setup, at each time t the agent earns a reward and also incurs M costs. Reward and cost functions are denoted by . Thus, the instantaneous reward ob- tained upon taking an action a in the state s is equal to r(s, a), while the i-th cost is equal to . A controlled Markov process in which the agent earns reward and incurs M costs is defined by the tuple . The probabilities are not known to the agent, while the reward and cost functions are known to the agent. We will now briefly discuss some notions and results on MDPs.

Let denote the t-step probability distribution when the policy is applied to the MDP p and the initial state is x, while be the corresponding stationary measure 2. For two measures , we let denote the total variation distance [22] between and .

Definition 1: (Unichain MDP) The MDP p is unichain if under any stationary policy there is a single recurrent class. If an MDP is unichain [2], then for the Markov chain induced by any stationary policy , we have

where are constants. The mixing time of an MDP p is defined as , where denotes the time taken by the Markov chain induced by policy to hit state , when it starts in state s.

Definition 2: (Control Policy) Let

be the |A|-simplex and denote the sigma-algebra [23] generated by the random variables . A control policy [2], [24] is a collection of maps 1, 2, . . . that chooses action on the basis of past operational history of the system. Thus, under policy , we have that is chosen according to the probability distribution . A general control policy is allowed to be history-dependent and randomized.

Definition 3: (Stationary Policy) A stationary policy , is a mapping from state S to a probability distribution on the action space A, and prescribes randomized controls on the basis of the current state . Thus, under policy , we have that is chosen according to the probability distribution .

A. Notation

Throughout, we use bold font for denoting vectors; for example the vector is denoted by x. We use N to denote the set of natural numbers, to denote the M dimensional Euclidean space, and to denote non-negative orthant of . Inequalities between two vectors are to be understood component-wise. If E is an event [23], thendenotes its indicator function. For a control policy ,3

denotes its average reward, and

denotes its average i-th cost. For , we let denote its 1-norm. denotes the M-dimensional zero vector consisting of all zeros. For , we let max{x, y}. Throughout, we abbreviate [M] := {1, 2, . . ., M}, S := |S|, A := |A|.

B. Constrained MDPs

We now present some definitions and standard results pertaining to constrained MDPs. These can be found in [25].

Definition 4 (Occupation Measure): Consider the controlled Markov process evolving under the application of a stationary policy . Its occupation measure is defined as

and describes the average amount of time that the process spends on each possible state-action pair.

Definition 5 (): Consider a vector that satisfies the constraints (6) and (7) below. Define to be the following stationary randomized policy. When the state of the environment is equal to s, the policy chooses the action a with a probability equal to if . However, if , then the policy takes an action according to some pre-specified rule (e.g. implement ).

Consider the controlled Markov process CMP = . The following dynamic optimization problem is a constrained Markov Decision Process (CMDP) [25],

where the maximization above is over the class of all history-dependent policies, and denotes the desired upper-bound on the average value of i-th cost expense. The optimal average reward rate of the CMDP is equal to the optimal value of the above LP, and is denoted by .

Linear Programming (LP) approach for solving CMDPs: When the controlled transition probabilities are known, and p is unichain, an optimal policy for the CMDP (2)- (3) can be obtained by solving the following linear program (LP) [25],

Let be a solution of the above LP. Then, the stationary randomized policy solves (2)-(3). Moreover, it can be shown that the average reward and M costs of are independent of the initial starting state if the MDP is unichain [25].

C. Learning Algorithms and Regret Vector

We will develop reinforcement learning algorithms to solve the finite-time horizon version of the CMDP (2)-(3) when the probabilities are not known to the agent. Let denote the sigma-algebra [23] generated by the random variables . A learning policy is a collection of maps that chooses action on the basis of past operational history of the system. In order to measure the performance of a learning algorithm, we define its reward and cost regrets. The “cumulative reward regret” until time T , denoted by , is defined as,

where is the optimal average reward of the CMDP (2)-(3) when controlled transition probabilities are known. Note that is the optimal value of the LP (4)-(7). The “cumulative cost regret” for the i-th cost until time T is denoted by , and is defined as,

Remark 1: In the conventional regret analysis of RL algorithms, the objective is to bound the reward regret . However, in our setup, due to considerations on the cost expenditures, we also need to bound the cost regrets . Indeed, as shown in the Section VI, we can force to be arbitrarily small at the expense of increased cost regrets, and also vice versa. The consideration of the regret vector, and the possibility of tuning its various components, is a key novelty of our work. The problem of tuning this vector is challenging because its various components are correlated.

IV. UCRL-CMDP: A LEARNING ALGORITHM FOR CMDPS

We propose UCRL-CMDP to adaptively control an unknown CMDP. It is depicted in Algorithm 1. UCRL-CMDP maintains empirical estimates of the unknown transition probabilities as follows,

where and denote the number of visits to (s, a) and until t respectively.

Confidence Intervals: Additionally, it also maintains confi-dence interval associated with the estimate as follows,

where

b > 1 is a constant. Episode: UCRL-CMDP proceeds in episodes, and utilizes a single stationary control policy within an episode. Each episode is of duration steps. Let denote the start time of episode k. k-th episode is denoted by , and comprises of consecutive time-steps. Denote by the index of the ongoing episode at time t. At the beginning of , the agent solves the following constrained optimization problem in which the decision variables are (i) Occupation measure of the controlled process, and (ii) “Candidate”

MDP ,

The maximization w.r.t. denotes that the agent is optimistic regarding the belief of the “true” (but unknown) MDP p, while that w.r.t. ensures that the agent optimizes its control strategy for this optimistic MDP. The constraints (14) ensure that the cost expenditures do not exceed the thresholds , and hence ensure that the agent also balances the cost expenses while being optimistic with respect to the rewards about the choice of the MDP thereby taking a balanced approach to optimism when the underlying MDP parameters are unknown. If the constraints (14) were absent, we would recover the UCRL2 algorithm of [12] that is based on the OFU principle [26], [27]. However, as is shown in Section II-A, the OFU principle might fail when it is applied for learning the optimal controls for CMDPs. Indeed, as is shown in the example in Section II-A, the limiting average cost is greater than the threshold value of cost. The BOFU principle proposed in this work is a natural extension of the OFU principle to the case when the agent has to satisfy certain constraints on costs, in addition to maximizing the rewards. In case the problem (13)-(17) is feasible, let denote a solution. The agent then chooses according to within . However, in the event the LP (13)-(17) is infeasible, the agent implements an arbitrary stationary control policy that has been chosen at time t = 0. In summary, it implements a stationary controller within , which is denoted by . We make the following assumptions on the MDP p while analyzing UCRL-CMDP.

Assumption 1:

1) The MDP is unichain. Thus, under a stationary policy we have

where . 2) The CMDP (2)-(3) is feasible, i.e., there exists a policy under which the average cost constraints (3) are satisfied. 3) Without loss of generality, we assume that the magnitude of rewards and costs are upper-bounded by 1, i.e.,

Hence, if denotes optimal reward rate of (2)-(3), then . Moreover, the cost bounds can be taken to be less than 1.

We establish the following bound on the regrets of UCRL- CMDP.

Theorem 1: Consider the UCRL-CMDP (Algorithm 1) applied with to an MDP p that satisfies Assumption 1. The reward and cost regrets can be bounded as follows,

where satisfies . Upon letting in the above, the above reduces to

V. PROOF OF THEOREM 1

We begin by introducing few notation. If B denotes a subset of S, then we let be the set of those policy for which the occupation measure satisfies for all . Let denote the set of states for which . We now derive few preliminary results that are used while proving the main result.

The following result can be shown by an application of Azuma-Hoeffding inequality [28].

where the confidence ball is as in (11). Define the set . Then,

Proof: If the number of visits to (s, a) were determinisic, denoted by N(s, a), and the corresponding number of visits to , then it follows from Azuma-Hoeffding’s inequality [28] that

Upon letting equal to in the

above, we get that

holds with a probability greater than . The proof is then completed by considering all possibilities for N(s, a), state-action pairs and using union bound.

Lemma 2: Define

where K is the total number of episodes and satisfies . We have

is a martingale difference sequence. Furthermore, since the duration of each episode is bounded by , and , we have . By

applying Azuma-Hoeffding’s inequality to this martingale difference sequence, we get that the probability of the event

upper-bounded by

Since , the above bound reduces to . The proof then follows by using union bound for all state-action pairs and k.

Lemma 3: If , then

Proof: Since we have , it follows from Markov’s inequality that the probability with which does not hit the state s in steps, is less than 1/2, or equivalently the state s is visited atleast once with a probability greater than 1/2, which yields us the following,

The proof is then completed by dividing the total time of steps in an episode into “mini-episodes” of steps each, and noting that is the sum of number of visits to (s, a) during each such mini-episode.

We begin by giving an equivalent characterization of the UCRL-CMDP rule. At each , it assigns an index to each stationary policy as follows,

Proof: Note that denotes the transition probabilities of the Markov chain that results when the policy is applied to the MDP p. Consider an MDP . Since on G, we have , we get the following,

so that from triangle inequality we have that

(23) then follows from Theorem 4.

For a policy , we denote the quantity by its instantaneous reward regret, and its instantaneous cost regret for the i-th cost. We now show that if the instantaneous reward regret, or an instantaneous cost regret of a policy is greater than a certain threshold (that depends upon the radius of the confidence ball at time ), then it is not played during . For a stationary policy , define

Consider the following two possibilities.

Case A) for some i: From (23) we have that which implies for all . Thus .

Case B) From (23) we have that for all , so that the index is bounded by .

The following result summarizes this discussion.

Lemma 5: Let be a stationary randomized policy. On the set G we have that if for some . Also, .

We now show that if a stationary policy is feasible for the MDP p, i.e. , then its index is lower- bounded by .

Upon combining Lemma 5 and Lemma 6, we obtain the following result.

by UCRL-CMDP. This means that , which shows that the instantaneous cost regret is bounded by .

To bound the reward regret, note that it was shown in Lemma 6 that the index of an optimal policy is greater than , and hence the index is greater than . From Lemma 5 we have . Hence we have that , or , which shows that the instantaneous reward regret is bounded by .

We now use the result on instantaneous regrets in order to bound the cumulative regrets of UCRL-CMDP.

Proof of Theorem 1: We will only discuss bound on reward regret, since the bound on cost regrets can be derived using similar steps. Note that the expected regret during can be written as follows,

where the last inequality follows from (18). Denote as the regret incurred during the k-th episode. It follows from the above discussion that the cumulative expected regret can be bounded as follows,

where K is the total number of episodes. Henceforth we will only focus on bounding the first term in the r.h.s. above. This is bounded separately on the sets .

We begin by bounding . We have,

where the first inequality follows from Lemma 7, and the last inequality follows from Lemma 3. We will now bound the term . We have

As is shown in [12, p. 1578], the term can be bounded by , while from (21) we have that on , the term is bounded

by

above that on G we have

Upon summing (26) over episodes, and using (28), we obtain that the regret on G can be bounded as follows,

This completes the analysis on G.

We now analyze the regret on . From Lemma 2, the probability of is bounded by . On , the sample path regret can be trivially bounded by T , so that its contribution to the expected regret is bounded by .

To analyze the regret on we note that if the confidence ball at time fails, then the regret during can be bounded by the duration of . Since , the regret during is bounded by . From Lemma 1 we have that the probability with which confidence ball fails at time t is upper-bounded by . Hence, the expected regret from the failure of ball (in case an episode starts at t) at time t is bounded by , so that the cumulative expected regret is bounded by .

VI. LEARNING UNDER BOUNDS ON COST REGRET

The upper-bounds for the regrets of UCRL-CMDP in Theorem 1 are the same for reward and M costs regrets. However, in many practical applications, an agent is more sensitive to over-utilizing certain specific costs, as compared to the other costs. Thus, in this section, we derive algorithms which enable the agent to tune the upper-bounds on the regrets of different costs. We also quantify the reward regret of these algorithms.

A. Modified UCRL-CMDP

Throughout this section we assume that p satisfies the following.

Assumption 2: For the MDP p, there exists a stationary policy under which the average costs are strictly below the thresholds . More precisely, there exists an and a stationary policy such that we have . Define

The modified algorithm maintains empirical estimates and confidence intervals (11) in exactly the same manner as UCRL-CMDP (Algorithm 1) does. It also proceeds in episodes, and uses a single stationary control policy within an episode. However, at the beginning of each episode k, it solves the following optimization problem, which is a modification of the problem (13)-(17) that is solved by UCRL-CMDP. More concretely, the cost constraints (14) are replaced by the constraints (32) on the costs:

where,

and the parameters are chosen by the agent. If the LP (31)-(35) is feasible, let be an optimal occupation measure obtained by solving it. In this case, the agent implements within . However, if the LP is infeasible, then it implements a stationary controller that has been chosen at time t = 0. This is summarized in Algorithm 2. We will analyze Algorithm 2 under the following assumption on the underlying MDP p.

We derive upper-bounds on regrets of modified algorithm in the following result.

Theorem 2: Consider the modified UCRL-CMDP with applied to an MDP p that satisfies Assumption 1 and Assumption 2. Then, the expected reward and cost regrets can be upper-bounded as follows:

and

VII. PROOF OF THEOREM 2

Proof closely follows the proof of Theorem 1, hence we point out only the key differences. The modified UCRL algorithm assigns the following modified index to policy ,

If for some i we have , then we set .

The proof of next result is omitted since it is similar to that of Lemma 5.

Lemma 8: Let be a stationary randomized policy. On the set G we have that if for some . Also, .

The following result allows us to derive bounds on the instantaneous regrets.

It follows from Lemma 14 that the optimal value of the CMDP , such that , is greater than or equal to . Hence, it follows from the discussion above that the index of the policy which is optimal for this CMDP is greater than or equal to . As earlier, we bound the regret on the sets and separately. On G, the regret is bounded by the time spent playing sub-optimal policies.

policy will not be played by UCRL-CMDP. This means that , which shows that the instantaneous cost regret is bounded by .

To bound the instantaneous reward regret, note that it was shown in Lemma 9 that there is a policy with index greater than , and hence the index of is necessarily greater than . Since from Lemma 8 we have that the index of is upper-bounded by , we must have , or , which shows that the instantaneous reward regret is bounded by .

Proof of Theorem 2: Since the proof closely follows that of Theorem 1, we only point out the key differences. The decomposition result 25 holds for reward as well cost regrets. Similarly, the regrets on and can be bounded by and respectively. The only difference arises during bounding the terms and . It follows from Lemma 10 that the bound on differs from (26) by an additional term zT , and similarly that on differs by . The proof is then completed by summing the bounds on regrets over the sets .

VIII. ACHIEVABLE REGRET VECTORS

Let . Consider the Lagrangian relaxation of (2)-(3),

where is the vector that consists of costs . Consider its associated dual function [29], , and the dual problem

Define the diameter D(p) of MDP p as follows, D(p) := is finite if p is communicating [2].

Theorem 3: Consider a learning algorithm . Then, there is a problem instance such that the regrets under satisfy

where is an optimal solution of the dual problem (40).

Proof: We begin by considering an auxiliary reward maximization problem that involves the same MDP p, but in which the reward received at time t by the agent is equal to instead of . However, there are no average cost constraints in the auxiliary problem. Let be a history dependent policy for this auxiliary problem. Denote its optimal reward by . Then, the regret for cumulative rewards collected by in the auxiliary problem is given by

It follows from Theorem 5 of [12] that the controlled transition probabilities of the underlying MDP can be chosen so that this regret is greater than , i.e.,

We observe that any valid learning algorithm for the constrained problem is also a valid algorithm for the auxiliary problem. Thus, if is a learning algorithm for the problem with average cost constraints, then we have

We now substitute (46) in the above to obtain

Since the expression in the r.h.s. is maximized for values of which are optimal for the dual problem (40), we set it equal to , and then use Lemma 11 in order to obtain

This completes the proof.

IX. SIMULATION RESULTS

We compare the performance of the proposed UCRL-CMDP algorithm with the Actor-Critic algorithm for CMDPs that was proposed in [14]. Actor-Critic algorithms are a popular class of online learning algorithms [30]–[32] that are based on multi-time-scale stochastic approximation [33], [34]. We compare algorithms on the example presented in Section I in which the goal is to learn an efficient network controller. We begin by explaining the experiment setup.

Experiment Setup: Consider the single-hop wireless network that was discussed in Section I, and consists of a single wireless node that transmits data packets to a receiver. The access point has to dynamically choose the transmission power at each time t. For simplicity, we let the action set A be binary, and take the channel state to be static, i.e. it does not evolve or equivalently it assumes only a single value. Thus would mean that no packet was attempted transmission at time t, while would mean that a single packet would be delivered, with a probability equal to the channel reliability. The number of packets that arrive at time t are denoted by . We let and assume that are i.i.d. across times. The probability vector associated with that describes its probability mass function is taken equal to (.65, .2, .1, .05) for the experiments shown in Fig. 1, Fig. 2. The packet buffer is of a finite capacity, and can hold a maximum of B packets. Thus, the dynamics of the queue length can be described as follows,

where for we let , and min{x, B}, while is the number that depart (are delivered to destination) at time t. In our experiments we use B = 6, and take the channel reliability as .9. Hence, if then assumes the value 1 with a probability .9. The associated CMDP can be stated as follows:

We now discuss the Actor-Critic algorithm for CMDPs. We begin with some notation that are required in order to discuss Actor-Critic algorithm. Let {a(n)}, {b(n)}, {c(n)} be positive stepsize sequences satisfying , , and .In our experiments we use a(n) = 1/n, b(n) = 1/(n log n) and . Let denote the simplex of subprobability vectors. Let denote the map that projects a vector onto Q. Thus, if then , otherwise is the point from Q that is closest to x.

Actor-Critic Algorithm for CMDPs: The algorithm carries out iterations for three quantities that evolve at different timescales and are coupled. To begin with, it replaces the original constrained MDP by an unconstrained one by imposing a penalty upon constraint violation. is held fixed, (or equivalently , since the term does not depend upon the controls) The instantaneous reward for this modified MDP is equal to where is the price associated with the constraint violation. is itself being tuned in an online way, though at a slower time-scale. serves as an estimate of the optimal value of the dual variable for the original CMDP. In order to solve this unconstrained MDP, the algorithm keeps an estimate of the value function , which is updated as follows,

where is a designated state. Let denote the probability with which action a is implemented in state s at time t. Let be a designated action. These probabilities are generated as follows. The algorithm maintains vectors for each state , and updates it as follows,

where,

where is the unit vector with a 1 in the place corresponding to action . The probability for action is computed as follows,

The action probabilities are then generated from as follows,

where . Finally, the price is updated as follows,

where is the threshold on average queue length as in (44). In our experiments we use and . Results: Fig. 1 compares the cumulative regrets incurred by

these algorithms. We observe that the reward regret as well

as cost regret of UCRL-CMDP are low. We observe a serious drawback of the Actor-Critic algorithm’s performance, that the cost regret is prohibitively high. We then vary the budget on the average queue length. These results are shown in Fig. 2. Once again, we make a similar observation, that UCRL-CMDP is effective in balancing both, the reward regret and the cost regret , while the Actor-Critic algorithm yields a high cost regret. In both of these experiments the probability vector of arrivals was held fixed at (.65, .2, .1, .05). We vary this probability vector, and plot the regrets in Fig. 3b. Once again, UCRL-CMDP outperforms the Actor-Critic algorithm. Though the reward regret of Actor-Critic algorithm is lower than that of the UCRL-CMDP algorithms, this occurs at the expense of an undesireable much larger cost regret. In contrast, the reward regret as well as cost regret of UCRL-CMDP is low. Plots are obtained after averaging over 100 runs.

Fig. 1: Plot of the reward regret (a) and cost regret (b), for the network in which the probability vector associated with arrivals is (.65, .2, .1, .05), channel reliability is .9, and desired delay is . Plots are obtained after averaging over 100 runs.

Fig. 2: Plot of the normalized reward regret (a) and cost regret (b), as the desired delay

Fig. 3: Plot of the reward regret (a) and cost regret (b), as the probability distribution of the arrivals is varied. The probability vector of is equal to the parameter i is varied from 0 to 9. The desired delay fixed at 4.5, and channel reliability at .9.

X. CONCLUSIONS AND FUTURE WORK

In this work, we initiate a study to develop learning algorithms that simultaneously control all the components of the regret vector while controlling unknown MDPs. We devised algorithms that are able to tune different components of the cost regret vector, and also obtained a non-achievability result that characterizes those regret vectors that cannot be achieved under any learning rule. In our work, we assume that the underlying MDP is unichain. An interesting research problem is to characterize the set of achievable regret vectors under the weaker assumption that the underlying MDP is communicating.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement learning - an introduction, ser. Adaptive computation and machine learning. MIT Press, 1998. [Online]. Available: http://www.worldcat.org/oclc/37293240

[2] M. L. Puterman, Markov Decision Processes.: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[3] L. I. Sennott, Stochastic dynamic programming and the control of queueing systems. John Wiley & Sons, 2009, vol. 504.

[4] A. Lazar, “Optimal flow control of a class of queueing networks in equilibrium,” IEEE transactions on Automatic Control, vol. 28, no. 11, pp. 1001–1007, 1983.

[5] M.-T. T. Hsiao and A. A. Lazar, “Optimal decentralized flow control of markovian queueing networks with multiple controllers,” Performance evaluation, vol. 13, no. 3, pp. 181–204, 1991.

[6] P. Nain and K. Ross, “Optimal priority assignment with hard constraint,” IEEE transactions on Automatic Control, vol. 31, no. 10, pp. 883–888, 1986.

[7] R. Singh and P. Kumar, “Throughput optimal decentralized scheduling of multihop networks with end-to-end deadline constraints: Unreliable links,” IEEE Transactions on Automatic Control, vol. 64, no. 1, pp. 127–142, 2018.

[8] ——, “Adaptive csma for decentralized scheduling of multi-hop net- works with end-to-end deadline constraints,” IEEE/ACM Transactions on Networking, 2021.

[9] R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002.

[10] P. Auer and R. Ortner, “Logarithmic online regret bounds for undiscounted reinforcement learning,” in Advances in Neural Information Processing Systems, 2007, pp. 49–56.

[11] P. L. Bartlett and A. Tewari, “REGAL: A regularization based algorithm for reinforcement learning in weakly communicating mdps,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009. AUAI Press, 2009, pp. 35–42.

[12] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1563–1600, 2010.

[13] E. Altman and A. Schwartz, “Adaptive control of constrained Markov chains,” IEEE Transactions on Automatic Control, vol. 36, no. 4, pp. 454–462, 1991.

[14] V. S. Borkar, “An actor-critic algorithm for constrained Markov decision processes,” Systems & control letters, vol. 54, no. 3, pp. 207–213, 2005.

[15] ——, “Stochastic approximation with two time scales,” Systems & Control Letters, vol. 29, no. 5, pp. 291–294, 1997.

[16] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” arXiv preprint arXiv:1705.10528, 2017.

[17] Y. Liu, J. Ding, and X. Liu, “Ipo: Interior-point policy optimization under constraints,” arXiv preprint arXiv:1910.09615, 2019.

[18] C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” arXiv preprint arXiv:1805.11074, 2018.

[19] E. Uchibe and K. Doya, “Constrained reinforcement learning from intrinsic and extrinsic rewards,” in 2007 IEEE 6th International Conference on Development and Learning. IEEE, 2007, pp. 163–168.

[20] S. Qiu, X. Wei, Z. Yang, J. Ye, and Z. Wang, “Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions,” arXiv preprint arXiv:2003.00660, 2020.

[21] Y. Efroni, S. Mannor, and M. Pirotta, “Exploration-Exploitation in Constrained MDPs,” arXiv preprint arXiv:2003.02189, 2020.

[22] C. Villani, Optimal transport: old and new. Springer Science & Business Media, 2008, vol. 338.

[23] S. Resnick, A probability path. Springer, 2019.

[24] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identifica-tion, and adaptive control. SIAM, 2015.

[25] E. Altman, Constrained Markov Decision Processes. Chapman and Hall/CRC, March 1999.

[26] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985.

[27] R. Agrawal, “Sample mean based index policies by o (log n) regret for the multi-armed bandit problem,” Advances in Applied Probability, vol. 27, no. 4, pp. 1054–1078, 1995.

[28] K. Azuma, “Weighted sums of certain dependent random variables,” Tohoku Mathematical Journal, Second Series, vol. 19, no. 3, pp. 357– 367, 1967.

[29] D. P. Bertsekas, Nonlinear programming. Taylor & Francis, 1997, vol. 48, no. 3.

[30] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.

[31] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.

[32] V. R. Konda and V. S. Borkar, “Actor-critic–type learning algorithms for markov decision processes,” SIAM Journal on control and Optimization, vol. 38, no. 1, pp. 94–123, 1999.

[33] H. Kushner and G. G. Yin, Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media, 2003, vol. 35.

[34] V. S. Borkar, Stochastic approximation: a dynamical systems viewpoint. Springer, 2009, vol. 48.

[35] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.

[36] A. Y. Mitrophanov, “Sensitivity and convergence of uniformly ergodic markov chains,” Journal of Applied Probability, vol. 42, no. 4, pp. 1003– 1014, 2005.

APPENDIX A

We derive some preliminary results that will be utilized in the proof of Theorem 3.

Lemma 11: Consider the dual problem (40) associated with the CMDP (2), (3), and let be a solution of the dual problem. If Assumption 2 holds true, then we have that

where is the optimal reward of CMDP (2), (3). Proof: Under Assumption 2, the CMDP (2)-(3) is strictly feasible, so that Slater’s constraint [35] is satisfied, and consequently strong duality holds true. Thus, if solves the dual problem (40), we then have that .

Lemma 12: Let and be a learning algorithm for the problem of maximizing cumulative rewards under average cost constraints. We then have the following,

APPENDIX B

We derive some results on the variations in the value of optimal reward of the CMDP (2)-(3) as a function of the cost budgets . Consider a vector of cost budgets that satisfies

where . Now consider the following CMDP in which the upper-bounds on the average costs are equal to .

where the second inequality follows since a policy that is optimal for the problem (48)-(49) maximizes the Lagrangian when the Lagrange multiplier is set equal to [29]. Rearranging the above inequality yields the desired result.

Lemma 14: Let the MDP p satisfy Assumption 1 and Assumption 2. If denotes optimal reward value of (48)- (49), and is optimal reward of problem (2)-(3), then we have that

where is as in Theorem ??, is as in (30), and satis-fies (47).

Proof: As discussed in Section III-B, a CMDP can be posed as a linear program. Since under Assumption 2, both the CMDPs (2)-(3) and (48)-(49) are strictly feasible, we can use the strong duality property of linear programs [29] in order to conclude that the optimal value of the primal and the dual problems for both the CMDPs are equal. Thus,

Let and denote optimal policies and vector consisting of optimal dual variables for the two CMDPs. It then follows from (50) and (51) that,

Subtracting the second inequality from the first yields

where the last inequality follows from Lemma 13. This completes the proof.

B. Sensitivity of Markov Chains

The following result is essentially Corollary 3.1 of [36]. Consider a finite-state Markov chain with transition probabilities . Let be the probability distribution at time t when it starts in state s at time 0.

Theorem 4: Assume . Consider a Markov chain with transition probabilities . We have