My stuff
Constrained Upper Confidence Reinforcement Learning

Constrained Markov Decision Processes are a class of stochastic decision problems in which the decision maker must select a policy that satisfies auxiliary cost constraints. This paper extends upper confidence reinforcement learning for settings in which the reward function and the constraints, described by cost functions, are unknown a priori but the transition kernel is known. Such a setting is well-motivated by a number of applications including exploration of unknown, potentially unsafe, environments. We present an algorithm C-UCRL and show that it achieves sub-linear regret (O(T34 �log(T/δ))) with respect to the reward while satisfying the constraints even while learning with probability  1 − δ. Illustrative examples are provided.

Markov Decision Processes (MDPs) have been successfully utilized to model sequential decision-making problems in stochastic environments. In the typical approach to learning a policy, the decision-maker trades off between exploration and exploitation, gradually improving their performance at the task as learning progresses. Reinforcement learning, a standard paradigm of learning in MDPs, has shown exceptional success in a variety of domains such as video games [22], robotics [20, 19], recommender systems [26], autonomous vehicles [25], among many others.

However, in many of these real-world applications, there is often additional constraints, or specifica-tions that lead to constraints, on the learning problem. For instance, a recommender system should avoid presenting offending items to users and autonomous vehicles must avoid crashing into others while navigating [14]. Building algorithms that respect safety constraints not only during normal operation, but also during the initial learning period, is a question of particular interest [18]. This problem is known as the safe exploration problem [23, 3]. In the standard MDP framework, an approach for baseline performance is risk-sensitive reinforcement learning [11, 14], where the optimization criterion is transformed in order to reflect a subjective measure balancing the return and the risk.

On the other hand, in a safety-critical environment, it is more reasonable to separate the return and the risk criterion, and enforce constraint satisfaction in the learning procedure. A standard formulation for an environment with safety constraints is the constrained MDPs (CMDPs) [2]. A decision-maker facing a CMDP aims to maximize the total reward while satisfying the constraints on costs in expectation over the whole trajectory.

In recent literature, policy gradient-based reinforcement learning algorithms have been proposed as a means to learn a policy for a CMDP. The following are two constrained policy search algorithms with state-of-the-art performance guarantees: Lagrangian-based actor-critic algorithm [5, 8] and Constrained Policy Optimization (CPO) [1]. However, for these policy gradient-based methods, safety is only approximately guaranteed after a sufficient learning period. The fundamental issue is that without a model, safety must be learned via trial and error, which means it may be violated during initial learning interactions.

Model-based approaches have utilized Gaussian processes to model the state safety values or the dynamic uncertainties [4, 17, 27, 7] or utilized Lyapunov-based methods [9] to guarantee safety during learning. Although these methods guarantee constraint satisfaction during learning, an arguably valuable analysis of the regret is lacking.

In unconstrained settings when the reward and transition kernel are unknown, upper confidence based reinforcement learning algorithms have been proposed—namely, UCRL2 [15]—with sub-linear regret. The key idea is to build confidence intervals on the reward and transition kernel and iteratively solve for policies using value iteration based methods.

In this work, we are not only interested in learning the optimal policy that satisfies the constraints via interacting with the stochastic environment, but also in ensuring performance guarantees on the learning algorithm during learning. With some practical scenarios in mind, we make the assumption that the rewards and constraint costs are unknown. For instance, consider a rover exploring the Mars landscape; here one can model the dynamics of the rover as known with some uncertainty and the reward and constraints which model the value of exploring the environment as unknown—e.g., constraints can be abstracted as costs which seek to limit the frequency of visiting a potentially hazardous states [13].

Motivated by upper confidence reinforcement learning [15], we introduce the constrained upper confi-dence reinforcement learning (C-UCRL) algorithm which combines elements of the classical UCRL2 algorithm with robust linear programming1. We define our goals as follows: (1) maintain constraint satisfaction throughout the learning process with high probability, and (2) achieve sub-linear regret comparing the rewards collected by the algorithm during learning with the reward of an optimal stochastic policy.

Contributions. The contributions can be summarized as follows. Building on UCRL2, we introduce the C-UCRL algorithm (Algorithm 1). We show that C-UCRL is guaranteed to satisfy constraints during learning with probability at least  1 − δ (Theorem 3) and achieves  O(T34 �log(T/δ))reward regret (Theorem 7). Of independent interest, we note that when the state space is trivial, the setting we consider subsumes stochastic multi-armed bandits with per-round budget constraints, where the optimal policy is a randomized policy across arms.

Organization. The rest of the paper is organized as follows. An overview of related work is provided in Section 2. Mathematical preliminaries and our algorithm are introduced in Section 3. Analysis of both constraints satisfaction and reward regret is provided in Section 4. Several illustrative examples are provided in Section 5. In those experiments we compare our proposed method to Risk-Sensitive UCRL2 algorithm and show that UCRL2 algorithm fails to converge to the optimal policy while our algorithm does. The paper is concluded in Section 6 with a brief summary and discussion of future directions.

Recently, several policy gradient-based reinforcement learning algorithms have been proposed for learning policies for CMDPs. In particular, there are two noteable constrained policy search algorithms which enjoy state-of-the-art performance: a Lagrangian-based algorithm [5, 8] and Constrained Policy Optimization (CPO) [1]. The Lagrangian-based algorithm formulates the CMDP problem as a minimax problem and uses primal-dual gradient optimization to find the saddle point solution. While this procedure will asymptotically converge to the saddle point solution, in general there is no guarantee on policies being safe during the learning procedure. On the other hand, CPO—a method that derives from an extension of trust-region policy optimization (TRPO)—guarantees monotonic performance improvements on the expected reward and a guarantee on constraint satisfaction throughout training. While this algorithm is safe during learning, analyzing its convergence is challenging and the regret analysis with respect to reward is lacking.

As an alternative to policy gradient reinforcement learning algorithms, linear programming based algorithms have been proposed. In [13], CMDPs with known reward, constraints, transition kernel but uncertain initial state distribution are considered. Linear programming based algorithms are proposed to solve for safe policies in this setting. In our setting, however, the reward and constraints are stochastic and considered unknown a priori, which the stochastic transition kernel is known.

Most similar to our approach is UCRL2; in particular, our approach can be viewed as an extension of UCRL2 [15], in some sense, by incorporating constraints; the one difference is that we assume the transition kernel is known while the classical UCRL2 algorithm does not. We leave extending our setting to unknown transition kernels to future work. As alluded to in the introduction, in UCRL2, the reward and transition kernel are approximated and the policy is obtained by value iteration based methods in a optimism in the face of uncertainty fashion. Further, the performance of UCRL2 is analyzed by bounding the regret with respect to the optimal deterministic policy. CMDPs, however, in general do not admit deterministic policies. In C-UCRL, the reward and constraints are approximated and the policy is obtained by solving a robust linear program. Performance is assessed by computing the reward regret with respect to the optimal randomized policy.

Finally, our work is related to the multi-armed bandit problem with constraints. Previous works, e.g., have considered the multi-armed bandit problem with an auxiliary cost in addition to the traditonal reward [12, 28]. The ‘game’ (between the player and the environment) ends when the sum of current costs associated with the played arms exceeds the remaining budget, which is fixed and known to the player. The typical approach is to construct upper confidence bounds for the reward-to-cost ratio and then utilize them in upper confidence bound-based algorithms. On the other hand, in our approach, we use upper confidence bounds for both reward and cost, and solve a linear program to obtain the policy policy. In related work, fairness constraints are incorporated into a multi-armed bandit setting; in particular, arms that are perceived to have less value/reward should never favored over better performing alternatives, despite a learning algorithm’s uncertainty over the true payoffs [16]. In such settings, the algorithm is forced to pick arms uniformly until the player has enough confidence of the performance of arms. Connecting to this body of work, our problem reduces to a constrained multi-armed bandit problem when there is a single state. The main difference between our setting and that of the majority existing multi-armed bandit literature with constraints is that the optimal policy and policies obtainable by our algorithm can be a randomized or stochastic policy as opposed to a deterministic ‘best arm’ policy.

An MDP is a tuple (S, A, P, r), where S is the set of states, A is the set of actions,  P : S ×A×A → [0, 1] isthe transition kernel such that  P(s′|s, a)is the probability of transitioning to state  s′ given that the previous state was s and the agent took action  a in s, and r : S×A → [0, 1]is the reward function. A stationary policy π : S × A → [0, 1]is a map from states to a probability distribution over actions, with  π(a|s)denoting the probability of selecting action a in state s. We consider the setting in which the transition kernel  P(s′|s, a)is known to the agent, but the reward and costs are stochastic and unknown. In the example of a rover exploring the surface of Mars, the agent (rover) is aware of the transition probability of next state based on its action, but the safety quality of each state is unknown. Let  S = |S| and A = |A| where | · | is thecardinality of its argument. We use the notation  [·] = {1, . . . , ·}for index sets.

3.1. Constrained Markov Decision Processes

A CMDP is an MDP augmented with ‘cost’ constraints that restrict the set of allowable policies for that MDP. For a given CMDP, we consider the performance measure to be the infinite horizon average reward

which is given by


where  τdenotes a trajectory  τ = (s0, a0, s1, . . . ), and τ ∼ πis shorthand for indicating that the distribution over trajectories depends on  π: s0 ∼ p(s0), at ∼ π(·|st), st+1 ∼ P(·|st, at). Similarly, define the average constraint costs by


where  {c1, . . . , cm} with ci : S × A → [0, 1]are the cost constraints. The CMDP is then defined by


where  {d1, . . . , dm}are upper bounds on the average constraint costs. Note that without loss of generality both the reward and costs are random variables with a distribution supported on [0, 1].

Denote the mean of reward and cost constraint functions as  ¯r(s, a) = E[r(s, a)], ¯ci(s, a) = E[ci(s, a)]where the expectation is taken with respect to the distribution of the reward and cost function of that state-action pair (s, a). If the transition kernel  P(s′|s, a), the mean of the reward function  ¯r(s, a), and mean cost functions  ¯ci(s, a)are all given, them we can solve the CMDP by solving the following linear program [2]:


To simplify notation, we write the above linear program in matrix form as follows:


where  ¯r ∈ RSA, y ∈ RSA, ¯c ∈ RSA×m, d ∈ Rm, P ∈ RS×SA, and Io ∈ RS×SA is a sparse matrix built by placing S row blocks of length A in a block diagonal fashion, where each row block consists of all ones. Here,  y ∈ RS×A represents the steady-state occupation measure defined by


With  ¯ythe solution of this linear program, the optimal stationary policy is


Remark. It is worth noting that unlike in tabular MDPs without constraints, where the optimal policy is always deterministic, the optimal policy in CMDPs could be stochastic [24]. It is, in fact, trivial to solve the CMDP if the optimal policy in CMDPs is deterministic because that means the constraints are not active.

3.2. Constrained Upper Confidence Reinforcement Learning Algorithm

Since the reward and constraint cost functions are unknown, motivated by UCRL2, we introduce C-UCRL (Algorithm 1). In general, the C-UCRL algorithm follows a principle of “optimism in the face of reward uncertainty; pessimism in the face of cost uncertainty.” That is, it defines confidence intervals for the reward and cost of each state-action pair given the observations so far, and solves for the optimistic policy that satisfies the constraints. More specifically, in C-UCRL, given the current confidence interval estimates, we


use a robust linear program [21] formulation to find a policy using the confidence intervals as determined at the current iteration.

In particular, in episode k, we start by executing the baseline policy  π0for a constant h number of iterations2. It is common to assume a initial safe baseline policy [1] and without loss of generality, we assume under such policy, the Markov chain resulting from the CMDP is irreducible and aperiodic [6]. This baseline policy could, e.g., be obtained by some prior information about which states are safe to start the conservative exploration3. After executing  π0, we define estimates of the reward and costs by


respectively, where  Nk(s, a), Rk(s, a), and Ci,k(s, a)are the state-action count, and cumulative reward and costs, respectively, as defined in Algorithm 1. The visitation frequency random variable  Nk(s, a) is definedto be the sum of indicators of whether or not the state-action pair (s, a) was visited in each iteration over all episodes. The corresponding reward  Rk(s, a)and constraint costs  Ci,k(s, a)are defined similarly.

Using these estimates, we define




defines the confidence interval as we show in Section 4. We then use (7) and (8) to define the following robust linear program:


A few comments here on guaranteeing that the feasible set is non-trivial are warranted. Our analysis results are predicated on  π0 and hbeing chosen such that in each episode the robust linear program we solve has at least one feasible solution. The duration h is chosen based on the mixing time of the induced Markov chain under the baseline policy with the goal of ensuring with high probability that the feasible set is not empty; for instance, ‘sufficient’ exploration will guarantee that  ˜c⊤1 y ≤ d for some y ∈ {Ioy = Py, 1⊤y = 1, y ≥ 0}.It is possible that in the first episode, even after h iterations of executing the baseline policy, that there is no y such that ˜c⊤1 y ≤ d4. A heuristic we use in practice is to run the baseline policy  π0for as many iterations as it takes for  y0 ∈ {Ioy = Py, 1⊤y = 1, y ≥ 0, ˜c⊤1 y ≤ d}. Then, we are guaranteed that in all future episodes,  y0is always in the feasible set of (RLP). We leave further exploration of theoretically guaranteeing that the (RLP) has a non-trivial feasible set in the first episode to future work.

Returning to the description of the algorithm, in episode k, the solution  ˜ykto the robust linear program is then used to construct the policy  ˜πk via (6). This policy is executed for a linearly increasing number of iterations  (k − 1)h where kis the episode index and h is the fixed duration used for executing the baseline policy. To summarize, for each episode of C-UCRL, we execute the baseline policy for h steps, estimate the reward and costs, and then execute  ˜πkfor a linearly increasing (in the number of epochs) number of steps (k − 1)h, making khthe total duration of episode k.

In this section, we summarize our analysis results. We first show that C-UCRL has guarantees on constraint satisfaction during learning. Then, we provide regret analysis with respect to the reward, showing that the regret is sub-linear.

4.1. Constraint/Safety Guarantees

To capture constraint satisfaction, we leverage the notion of  δ-safety.

Definition 1 (δ-safe)An algorithm is  δ-safe if, with probability at least  1 − δ, for all time steps t, the policy executed by the algorithm satisfies  Ci(πt) ≤ di, ∀i ∈ [m].

Following [15], we define the set of plausible CMDPs by the confidence intervals for the reward and each of the constraint costs. In particular, at episode  k, let Mkbe the set of plausible CMDPs with states and actions as in the underlying true CMDP M, define by all such CMDPs satisfying the following:


for all state-action pairs  (s, a) ∈ S × A. Let Mbe the set of plausible for all episodes k.

Lemma 2 For any fixed  k ≥ 1, the probability that the true CMDP M is not contained in the set of plausible CMDPs  Mkat episode k is at most  6δ/(π2t2k). Furthermore, with probability at least  1 − δ, forevery state-action pair  (s, a), cost ciand episode k, C-UCRL satisfies the following:


Hence, the probability that the true CMDP M is not in the set of all plausible CMDPs for any episode k is at most  δ—that is, Pr{M /∈ M} ≤ δ.

Proof Consider any fixed state-action pair (s, a) and its visitation frequency  Nk(s, a)up to episode k. If the state-action pair (s, a) has not been visited, then (11) and (12) trivially hold since  Nk(s, a) = 0by definition and the right-hand sides of (11) and (12) are greater than one when  Nk(s, a) = 0.

On the other hand, if  Nk(s, a)is not zero, meaning the state-action pair has been visited, then since for each (s, a) pair, the reward and constraint costs are all supported on [0, 1] and independent identically distributed (iid) real-valued random variables, we can apply Hoeffding’s inequality to get a bound on the deviation between the true mean  ¯r(s, a)(respectively,  ¯ci(s, a)) and the empirical mean  ˆrk(s, a)(respectively, ˆci,k(s, a)) given niid samples of the state-action pair (s, a):






Similarly, for each state-action pair (s, a) and constraint cost indexed by i,


Noting that from the above argument, the confidence intervals hold with probability one when (s, a) has not be visited, taking a union bound over all possible values of  n ∈ {1, . . . , tk} gives




where we have now written  Nk(s, a)for the number of visits in (s, a) up to episode k. This proves (11) and (12). Now, further union bounding over all state-action pairs (s, a) gives


for the reward. Analogously, taking a further union bound over all state-action pairs (s, a) and all constraint costs  i ∈ [m], gives


for the constraint costs. Summing (15) and (16), we get the first claim of the lemma—i.e.,


Now, since �∞ℓ=1 1ℓ2 = π26 , if in (15) and (16), we additionally union bounded over all episodes  k ∈{1, . . . , ∞}, we get that




which proves the final statement in the lemma.

Given that, for each episode, we can bound the gaps between the estimated reward (respectively, costs) and the mean reward (respectively, mean costs), with probability  1 − δ, we can provide an assurance on C-UCRL being δ-safe.

Theorem 3  C-UCRL is δ-safe.

Proof According to Lemma 2, with probability at least  1 − δ, ¯ci(s, a) ≤ ˜ci,k(s, a). The occupation measure  ˜ykobtained at each episode via (RLP) satisfies �s,a ˜ci,k(s, a)˜yk(s, a) ≤ di. Hence, Ci(˜πk) =�s,a ¯ci(s, a)˜yk(s, a) ≤ diwith probability  1 − δ.

4.2. Regret Analysis of C-UCRL

Given that we have shown that  C-UCRL is δ-safe, we now analyze the reward regret. In episode k of C-UCRL, we execute a baseline policy  π0 for htimes and policy  ˜πk for (k − 1)htimes. The pseudo-regret of episode k is given by


We first upper bound the per-step pseudo-regret of executing policy  ˜πk, ¯r⊤(¯y − ˜yk), where the first term is the expected average reward under the optimal policy  ¯πand the second term is the sub-optimal expected average reward under policy  ˜πk.

Using the confidence bounds in Lemma 2, define


and  ϵc(s, a) = ϵr(s, a)for each state-action pair and let  ϵr and ϵcdenote the vectors containing the values across all state-action pairs5. Define the following two linear programs:


where  0 ≤ r ≤ 1, 0 ≤ c ≤ 1, ϵr ≥ 0, and ϵc ≥ 0hold element wise.

Lemma 4 Assuming the domains of (18) and (19) are not empty, let  y1 and y2be solutions for each of the problems, respectively. If, for some constant  α > 0 and β > 0, there exist  y0 ∈ {y|Ay = 0, 1⊤y = 1, y ≥


We first find the upper bound of  r⊤(y1 − y3)where we note that  y3 and y1are the solutions of same linear program over different domains. Since the domain of  y3is smaller than  y1, we know that  r⊤(y1 − y3) ≥ 0.First, consider the trivial case that  y1 satisfies (c + ϵc)⊤y1 ≤ d. In this case,  y1 = y3 and r⊤(y1 − y3) = 0.Now we only consider the case such that  (c + ϵc)⊤y1 > d. Note that  (c + ϵc)⊤y0 ≤ d. Hence, there exists a


y0)). Further, we have


so that  c⊤(y1 − y4) = (1 − γ)β > 0 and


Combining this bound with


Since the domain for each of these problems is convex, we know that


Due to optimality,  r⊤y3 ≥ r⊤y4 so that


solutions of two linear programs with different objectives but the same domain. According to optimality of the solutions, we know that  r⊤y3 ≥ r⊤y2 and (r + ϵr)⊤y2 ≥ (r + ϵr)⊤y3. Combining these facts, we have that


Now, combining the bounds on  r⊤(y3 − y2) and r⊤(y1 − y3), we have that


We can use the preceding lemma to get a bound on the pseudo-regret.

Proposition 5 Denote  Y = {y|(Io − P)y = 0, 1⊤y = 1, y ≥ 0}. If there exists  y0 ∈ Y such that¯r⊤(¯y − y0) = α > 0, ¯c⊤(¯y − y0) = β > 0, then with probability at least  1 − δ,


Proof By definition


Define a sequence of subproblems by adding the confidence value to one additional constraint at a time as follows:


y(m) = arg maxy {¯r⊤y|(Io − P)y = 0, 1⊤y = 1, y ≥ 0, ˜c⊤i,ky ≤ di, i ∈ {1, . . . , m}}

Using the same proof technique as for that of Lemma 4, we obtain the bounds for each of the subproblems


Combining each of the bounds and the fact that




we have that


which completes the proof.

Note that according to Proposition 5, with probability at least  1 − δ, the per-step pseudo-regret of executing policy  ˜πkdepends on the confidence intervals of reward and costs of all state-action pairs. This is intuitive since in order for the policy  ˜πkto be close to the optimal policy  ¯π, we need to have good approximations of the reward and costs for all state-action pairs. To ensure this, we need to constantly explore the CMDP so that  Nk(s, a)is not ‘too small’ for any state-action pair. Since the Markov chain resulting from the baseline policy is irreducible and aperiodic, the steady state occupation measure  y0(s, a)corresponding to the baseline policy  π0(a|s)has the property that  y0(s, a) > 0, ∀s, a. Due to this universal exploration demand, we execute the baseline policy  π0for a constant number of times in each linear increasing episode in the C-UCRL algorithm.

To have a upper bound on the regret derived in Proposition 5, we need to have a lower bounds on Nk(s, a).Given our assumptions on the baseline policy as discussed above, define  ρ > 0 such thaty0(s, a) ≥ ρ > 0for all state-action pairs  (s, a) ∈ S × A. The following lemma gives a lower bound on the number of times each state-action pair is visited in episode k.

Lemma 6 Given a fixed total number of episodes K, with probability at least  1 − δ, for every state-action pair (s, a) and episode  k ∈ [K],


where  ξthe mixing time of the Markov chain induced by policy  π0, ρ > 0is such that  y0(s, a) ≥ ρ > 0 forall state-action pairs  (s, a) ∈ S × A, and ϕ = �s,ay′(s,a)2y0(s,a) , where y′is the initial state action distribution and  yois the steady state action distribution under the baseline policy.

Proof Consider the exploration phase (when the baseline policy  π0is executed) of the k-th episode in Algorithm 1. For a given episode  ℓand for a fixed state-action pair  (s, a), let Xℓ,1, . . . , Xℓ,hbe the indicator variables of whether state-action pair (s, a) has be selected at each step within the episode  ℓ. Let Yℓ =�hi=1 Xℓ,i and thus E[Yℓ] = y0(s, a)h. Applying the Chernoff-Hoeffding bound in [10, Theorem 3], gives




the above bound becomes


Using the assumption that  y0(s, a) ≥ ρ > 0, ∀(s, a), the union bound over all state-action pairs (s, a) and episodes  k ∈ [K]is given by


and  Nk(s, a) ≥ �k−1ℓ=1 Yℓsince in each episode  ˜πℓis executed  hℓtimes after the baseline policy so that Nℓ(s, a)may be larger. Hence,


holds with probability at least  1 − δ.


Combining Proposition 5 and Lemma 6 and summing over K episodes, we obtain the total regret bound for C-UCRL.

Theorem 7 Suppose that  δ ≤ ϕSAK exp(− ρh288ξ). Under the assumptions of Proposition 5, with probabil- ity at least  1 − δ, C-UCRLhas total pseudo-regret  ∆(T) = O(T34 �log(T/δ)).




so that  ζ ≤ 12ρh.

Combining this with Lemma 6, we have that


where the second to last inequality follows from Jensen’s inequality and the final step follows from T = �Kk=1 kh = K(K−1)2 h so that K < (2T/h)1/2.


Remark. Adding constants related to the dimension of the CMDP, we have the regret bound


4.3. Specializing to the Constrained Multi-Armed Bandit Setting

Constrained Multi-Armed Bandits (CMABs) can be viewed as a special case of CMDPs, where there is only one state, S = 1 and the transition kernel is trivially staying in that state with all actions. The policy in a CMAB is a probabilistic distribution over actions/arms y(a) and the goal is to solve the following linear program:


Similarly, the per-step pseudo-regret is defined as  ¯r⊤(¯y − ˜yk) where ¯yis the optimal randomized policy and ˜ykis the policy execute in episode k of C-UCRL. Running C-UCRL with S = 1, the following corollaries hold.

Corollary 8 In CMABs,  C-UCRL is δ-safe.

Corollary 9 In CMABs, Under the assumptions of Proposition 5, with probability at least  1 − δ, C-UCRLhas total pseudo-regret  ∆(T) = O(T34 �log(T/δ)).

The proofs of the above two corollaries follow directly from the corresponding results in the preceding section.

The goal of this section is to explore a few illustrative examples which highlight different features of our approach.


Figure 1: Two armed bandit with per-round budget constraint: (a) mean reward and cost of each arm as well as the per-round constraint; (b) average number of times arm one is pulled; (c) the cumulative regret of C-UCRL.

5.1. Two Armed Bandit with per Round Budget Constraints

We first consider a simple two arms bandit example. As stated before, the CMDP reduces to a constrained multi-armed bandit problem when |S| = 1. The reward and cost of each arm are unknown and stochastic. In our simulation, the reward and cost is draw from a binomial distribution, with the mean shown in Figure 1(a). Even though arm one has a better reward, we cannot pull arm one all the time since the constraint is set to be less than the mean cost of arm one. The optimal policy is to pull arm one with probability 0.75 and arm two with probability 0.25. The baseline policy we use to start exploration is pulling the two arms uniformly at random. Figure 1(b) and 1(c) show the average number of times arm one is pulled and the cumulative regret of C-UCRL, respectively. The average pull count of arm one never exceeds 0.75.

5.2. Three State CMDP

To demonstrate the performance of C-UCRL, we consider a simple three state CMDP. As show in Figure 2(a), the CMDP we consider has three states and two actions. An agent can take either a risky exploratory action in which the navigate to another state or they can take the safe action and remain in the current state. There is no reward or cost for staying in the current state but there will be a stochastic reward and cost if the agent navigates. In the simulation, the reward and cost of each state-action pair are each draw from a binomial distribution, with the means defined in the labels on edges in Figure 2(a). Obviously, without this constraint, the optimal policy is to navigate in each of the states. In this problem, we consider the constraint that in expectation, the average cost should be less than 0.2. This constraint prevents the agents from continuously navigating between the three states. In particular, as shown in Figure 2(b), the constrained optimal policy is a randomized policy that has positive probability on the safe action in each state. The relatively conservative baseline policy we use in C-UCRL for exploration is staying in the current state with probability 0.8 and navigate to the next state with probability 0.2.

We compare our approach with the UCRL2 algorithm. However, UCRL2 does not allow for constraints or multiple reward/cost criteria. Hence, we leverage the idea of risk sensitive reinforcement learning [18], where we treat a linear combination of reward and cost—i.e.,  r−λc—as the reward for the UCRL2 algorithm (Algorithm 2). The hyperparameter  λrepresents the trade off between the reward and cost, the combination of which represents the reward in the classical implementation of UCRL2; we refer to risk-sensitive UCRL2 by RS-UCRL2. Figure 2(c) shows the constraint violation probability in 30 training episodes by RS-UCRL2 algorithm with different  λ. Figure 3(a)shows the cumulative regret and average cost of the


Figure 2: Simple CMDP. (a) CMDP structure; (b) optimal policy computed with the true mean reward and mean cost, with and without the constraint on cost, d = 0.2; (c) probability of constraint violation in 30 training episodes by risk-sensitive UCRL2 (RS-UCRL2).

C-UCRL and RS-UCRL2 algorithms. As we can see, when the cost value is underestimated (λ = 1.9),applying RS-UCRL2 directly leads to a ‘good’ reward (i.e., the regret is negative as it gets more reward than the optimal randomized policy), yet the constraints are violated. On the other hand, when the costs are overestimated (λ = 2.1), RS-UCRL2is too conservative about the cost and, thus, receives high regret. We can observe that C-UCRL does not violate the constraint during learning though in this experiment,  δ is setto be 0.1, meaning that with probability at least 0.9, the constraint will not be violated in all episodes.


Figure 3: C-UCRL vs. RS-UCRL2: (a) Cumulative regret and average cost for C-UCRL and risk sensitive UCRL2; (b) Policy learned by C-UCRL and RS-UCRL2.

The fundamental problem with RS-UCRL2 is that with only one criterion, the policy it learns will always be a deterministic policy, while in this CMDP, the optimal policy is randomized. Figure 3(b) shows the policy learned by  C-UCRL and RS-UCRL2. When λ = 1.9, RS-UCRL2learn the optimal policy as there is no constraint, which leads to constraint violation. When  λ = 2.1, the policy learned by RS-UCRL2 is to stay in one state forever. On the contrary, the policy learned by C-UCRL algorithm converges to the optimal randomized policy.

5.3. Grid World with Safety Constraints

Motivated by the goal of ensuring safety in reinforcement learning safety, we validate our algorithms using a 2D grid-world exploration problem [18, 2.24]. This example also represents a crude abstraction of rovers exploring the surface of Mars as described in [27].

Figure 4(a) shows the CMDP structure. The green color in each state represents the mean cost of that state, and the darker the color, the higher the cost is. In the Mars exploration problem, those darker states are the states with large slope that the agents want to avoid. The constraint we enforce is the upper bound of the per-step probability of step into those state with large slope—i.e., the more risky or potentially unsafe states to explore. The agent starts from the origin state ‘O’ and receives reward 1 if it reaches the destination state ‘D’ after which it returns to the origin. In the simulation, the cost of each state is draw from a binomial distribution, with the mean shown in the figure. At each time step, the agent can take action to move into any of its four neighboring states. Due to the stochastic environment, transitions are stochastic (i.e., even if the agents action is to go “North, the environment can send the vehicle with a small probability to “East).

Without safety constraints, the optimal policy is obviously to always choose the orange route in Figure 4(a). However, with constraints, as we can see in Figure 4(b), the optimal policy is a randomized policy that use both blue and orange routes with some probabilities. The relatively conservative baseline policy we


Figure 4: Grid World with Safety Constraints. (a) Grid world structure: the states with darker green color have larger mean cost, and ‘O’ and ‘D’ are the origin and destination states, respectively; (b) Policy learned by different algorithms: the blue column represents the probability of going ‘West’ (choose blue route) and orange column represents the probability of going ‘North’ (choose orange route).


Figure 5: Cumulative regret and average cost of C-UCRL and RS-UCRL2.

use in C-UCRL for exploration is choose both routes uniformly at random. Figure 5 show the cumulative regret and average cost of the C-UCRL and RS-UCRL2 algorithm and Figure 4(b) shows the policy learned by them. As we can see, RS-UCRL2 either learns to only choose orange or blue route respectively, causing either constraint violation or large reward regret, while C-UCRL converges to the optimal policy.

Figure 6(a) shows the structure of another larger scale safety grid world example. The green states in the figure have mean cost 1 and the others have zero cost. The blue state is the origin state and the red state is the destination state, which has reward 1. Figure 6(b) shows the cumulative regret and average cost of the C-UCRL algorithm and RS-UCRL2 algorithm. The RS-UCRL2 algorithm is able to learn a policy that does not violate the constraint if we choose a conservative  λ, however, with much larger reward regret as compared to C-UCRL.

We formulate the problem of safe reinforcement learning when the transition kernel is known but the reward and constraint costs are unknown a priori as a CMDP and propose a C-UCRL algorithm to learn the optimal policy. Theoretically, we show that C-UCRL algorithm is guaranteed to satisfy the constraints during learning with probability at least  1 − δand achieves  O(T34 �log(T/δ))reward regret. Empirically, we provide examples which demonstrate two key properties relative to comparable algorithms: 1) C-UCRL is able to


Figure 6: Grid World with Safety Constraints. (a) CMDP grid world structure: the states with green color


learn the optimal policy which in general is a randomized policy as opposed to a deterministic policy, and 2) C-UCRL has high-probability guarantees on remaining safe while learning.

Let us comment briefly on some of the limitations of our approach and avenues for future research. First, we remark that artful selection of the baseline policy  π0and the duration h for executing it in each episode is required. We choose h based on the mixing time of the Markov chain induced by  π0. The choice of these two facets is really central to the algorithm as it defines the exploration phase and hence, the robust linear program that we solve for finding  ˜πk. The baseline and duration need to be chosen such that in each episode the linear program has a non-trivial feasible set. Our results are predicated on this being case; as noted in Section 3, in practice, however, it may not be. To handle this, we suggest the heuristic of executing the baseline policy in episode  k = 1 until ˜c⊤1 y0 ≤ d. A better understanding of how to ensure that in each episode the feasible set remains non-trivial is an avenue of future work.

We note also that it is likely that C-UCRL has a much worse sample complexity as compared to approaches which do not impose any criteria on safe learning during the exploration period. Better understanding of this trafeoff is an avenue for future work. Furthermore, our approach requires knowledge of the transition kernel. It is not immediately obvious how to extend classical approaches such as UCRL2, without further exacerbating sample complexity issues, due to the fact that central proof technique we employ is the robust linear programming formulation in each episode which is used to obtain a policy based on the confidence bounds. Alternative approaches may be better suited if the transition kernel is unknown. Another interesting direction that arose in our study of CMDPs is that there is potential to extend the theoretical results of UCRL2 to RS-UCRL2 through a primal-dual lens for capturing the hyper-parameter  λ; investigating this direction may lead to an alternative for addressing the unknown transition kernel setting, however, the issue of the optimal policy being non-deterministic for the true underlying CMDP and the fact that UCRL2 seeks out deterministic policies remains.

[1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.

[2] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.

[3] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man´e. Con- crete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.

[4] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based rein- forcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.

[5] Shalabh Bhatnagar and K Lakshmanan. An online actor–critic algorithm with function approximation for constrained markov decision processes. Journal of Optimization Theory and Applications, 153(3): 688–708, 2012.

[6] Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.

[7] Richard Cheng, G´abor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforce- ment learning through barrier functions for safety-critical continuous control tasks. arXiv preprint arXiv:1903.08792, 2019.

[8] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained rein- forcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–51, 2018.

[9] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov- based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708, 2018.

[10] Kai-Min Chung, Henry Lam, Zhenming Liu, and Michael Mitzenmacher. Chernoff-hoeffding bounds for markov chains: Generalized and simplified. arXiv preprint arXiv:1201.0559, 2012.

[11] Stefano P Coraluppi and Steven I Marcus. Risk-sensitive and minimax control of discrete-time, finite- state markov decision processes. Automatica, 35(2):301–309, 1999.

[12] Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.

[13] Mahmoud El Chamie, Yue Yu, Behc¸et Ac¸ıkmes¸e, and Masahiro Ono. Controlled markov processes with safety state constraints. IEEE Transactions on Automatic Control, 64(3):1003–1018, 2019.

[14] Javier Garcıa and Fernando Fern´andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[15] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

[16] Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016.

[17] Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pages 6059–6066. IEEE, 2018.

[18] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.

[19] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuo- motor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[20] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[21] David G Luenberger, Yinyu Ye, et al. Linear and nonlinear programming, volume 2. Springer, 1984.

[22] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[23] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. arXiv preprint arXiv:1205.4810, 2012.

[24] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[25] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.

[26] Guy Shani, David Heckerman, and Ronen I Brafman. An mdp-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.

[27] Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of con- strained mdps using gaussian processes. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[28] Datong P Zhou and Claire J Tomlin. Budget-constrained multi-armed bandits with multiple plays. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Designed for Accessibility and to further Open Science