Exploration-Exploitation in Constrained MDPs

2020·Arxiv

Abstract

Abstract

In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade-off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward while satisfying the constraints. While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process. In this work, we analyze two approaches for learning in CMDPs. The first approach leverages the linear formulation of CMDP to perform optimistic planning at each episode. The second approach leverages the dual formulation (or saddle-point formulation) of CMDP to perform incremental, optimistic updates of the primal and dual variables. We show that both achieves sublinear regret w.r.t. the main utility while having a sublinear regret on the constraint violations. That being said, we highlight a crucial difference between the two approaches; the linear programming approach results in stronger guarantees than in the dual formulation based approach.

1 Introduction

Markov Decision Processes (MDPs) have been successfully used to model several applications, including video games, robotics, recommender systems and many more. However, MDPs do not take into account additional constrains that can affect the optimal policy and the learning process. For example, while driving, we want to reach our destination but we want to avoid to go off-road, overcome the speed limits, collide with other cars [Garcıa and Fern´andez, 2015]. Constrained MDPs [Altman, 1999] extend MDPs to handle constraints on the long term performance of the policy. A learning agent in a CMDP has to maximize the cumulative reward while satisfying all the constraints. Clearly, the optimal solution of a CMDP is different than the one of an MDP when at least one constraint is active. Then, the optimal policy, among the set of policies which satisfies the constraint, is stochastic.

In this paper, we focus on the online learning problem of CMDPs. While interacting with an unknown MDP, the agent has to trade-off exploration to gather information about the system and exploration to maximize the cumulative reward. Performing such exploration in a CMDP may be unsafe since may lead to numerous violations of the constraints. Since the constraints depend on the long term performance of the agent and the CMDP is unknown, the agent cannot exactly evaluate the constraints. It can only exploit the current information to build an estimate of the constraints. The objective is thus to design an algorithm with a small number of violations of the constraints.

Objective and Contributions. The objective of this technical report is to provide an extensive analysis of exploration strategies for tabular constrained MDPs with finite-horizon cost. Similar to [Agrawal and Devanur, 2019], we allow the agent to violate the constraints over the learning process but we require the cumulative cost of constraint violations to be small (i.e., sublinear). Opposite to [Zheng and Ratliff, 2020], we consider the CMDP to be unknown, i.e., the agent does not know the transition kernel, the reward function and the constraints.

The performance of the learning agent is measured through the regret, that accounts for the difference in executing the optimal policy and the learning agent. We define two regrets: i) the regret w.r.t. to the main objective (as in standard MDP), ii) the regret w.r.t. the constraint violations. These terms account for both convergence to the optimal policy and cumulative cost for violations of the constraints. We introduce and analyze the following exploration strategies:

OptCMDP leverages the ideas of UCRL2 [Jaksch et al., 2010]. At each episodes, it builds a set of plausible CMDPs compatible with the observed samples, and plays the optimal policy of the CMDP with the lowest cost (i.e., optimistic CMDP). To solve this planning problem, we introduce an extended linear programming (LP) problem in the space of occupancy measures. The important property is that there always exists a feasible solution of this extended LP.

OptCMDP-bonus merges the uncertainties about costs and transitions used by OptCMDP into an exploration bonus. As a consequence, OptCMDP-bonus solves a single (optimistic) CMDP rather than planning in the space of plausible CMDPs. This leads to a more computationally efficient algorithm. In fact, this planning problem can be solved through an LP with O(SAH) constraints and decision variables, a factor O(S) smaller than the LP solved by OptCMDP.

OptDual-CMDP leverages the saddle-point formulation of constrained MDP [e.g., Altman, 1999]. It solves this problem using an optimistic version of the dual projected sub-gradient algorithm (e.g., Beck 2017). At each episode, OptDual-CMDP solves an optimistic MDP defined using the estimated Lagrangian multiplier. Then, it uses the computed solution to update the Lagrange multipliers via

Table 1: Summary of the regret bounds obtained in this work. Algorithms OptCMDP, OptCMDP-bonus, OptDual-CMDP, OptPrimalDual-CMDP are formulated and analyzed in sections 3, 4, 5.1, 5.2, respectively. The constant term, which is omitted from the table, of OptCMDP-bonus is significantly worse than the one of OptCMDP. Notice that different types of regrets are bounded (see Section 2 for definitions).

projected sub-gradient. The main advantage of this algorithm needs to solve a simple optimistic planning problem for MDPs (rather than for CMDPs).

OptPrimalDual-CMDP exploits a primal-dual algorithm to solve the saddle-point problem associated to a CMDP. It performs incremental updates both on the primal and dual variables. It uses mirror descent to update the Q-function (thus the policy) and projected subgradient descent to update the Lagrange multipliers. Similarly to OptCMDP-bonus, this algorithm exploits an exploration bonus for both cost and constraint costs. This allows to use a simple dynamic programming approach to compute the Q-functions (no need to solve a constrained optimization problem).

For all the proposed algorithms, we provide an upper-bound to the regret and the cumulative constraint violations (see Tab. 1). While the incremental algorithms (OptDual-CMDP and OptPrimalDual-CMDP) may be more amenable for practical applications, they present limitations from a theoretical perspective. In fact, we were able to prove weaker guarantees for the Lagrangian approaches compared to UCRL-like algorithms (i.e., OptCMDP and OptCMDP-bonus). While for UCRL-like algorithms we can bound the sum of positive errors, for Lagrangian algorithms we were able to bound only the cumulative (signed) error. This weaker term allows for “cancellation of errors” (see discussion in Sec. 2.2). Whether it is possible to provide stronger guarantees is left as an open question. Despite this, we think that the analysis of Lagrangian approaches is important since it is at the core of many practical algorithms. For example, the Lagrangian formulation of CMDPs has been used in [Tessler et al., 2019, Paternain et al., 2019], but never analyzed from a regret perspective.

1.1 Related Work

The problem of online learning under constraints (with guarantees) have been analyzed both in bandits and in RL. Conservative exploration focuses on the problem of learning an optimal policy while satisfying a constrained w.r.t. to a predefined baseline policy. This problem can be seen as a specific instance of CMDPs where the constraint is that the policy should perform (in the long run) better than a predefined baseline policy. Conservative exploration has been analyzed both in bandits [Wu et al., 2016, Kazerouni et al., 2017, Garcelon et al., 2020a] and in RL [Garcelon et al., 2020b]. All these algorithms are able to guarantee that the performance of the learning agent is at least as good as the one of the baseline policy with high probability at any time.While they enjoy strong theoretical guarantees, they performs poorly in practice since are too conservative. In fact, the idea of these algorithms is to build budget (e.g., by playing the baseline policy) in order to be able to take standard exploratory actions. Concurrently to this paper, [Zheng and Ratliff, 2020] has extended conservative exploration to CMDP with average reward objective. They assume that the transition functions are known, but the rewards and costs (i.e., the constraints) are unknown. The goal is thus to guarantee that, at any time, the policy executed by the agent satisfies the constraints with high probability. These requirement poses several limitations. Similarly to [Garcelon et al., 2020b], they need to assume that the MDP is ergodic and that the initial policy is safe (i.e., satisfies the constraints). Furthermore, despite the theoretical guarantees, this approach is not practical due to these strong requirements/assumptions. Agrawal and Devanur [2019] studied the exploration problem for bandits under constraints as well as bandits with knapsack constraints [Badanidiyuru et al., 2013]. Algorithms OptCMDP and OptCMDP-bonus can be understood as generalizing their bandit setting to an CMDP setting. That being said, in the following we derive regret bounds on a stronger type of regret relatively to Agrawal and Devanur [2019] (see Remark 1).

There are several approaches in the literature that have focused on (approximately) solving CMDPs. These methods are mainly based on Lagrangian-formulation [Bhatnagar and Lakshmanan, 2012, Chow et al., 2017, Tessler et al., 2019, Paternain et al., 2019] or constrained optimization [Achiam et al., 2017]. Lagrangianbased methods formulate the CMDP optimization problem as a saddle-point problem and optimize it using primal-dual algorithms. While these algorithms may eventually converge to the true policy, they have no guarantees on the policies recovered during the learning process. Constrained Policy Optimization (CPO) [Achiam et al., 2017] leverages the intuition behind conservative approaches [e.g., Kakade and Langford, 2002] to force the policy to improve overtime. This is a practical implementation of conservative exploration where the baseline policy is updated at each iteration.

Another way to solve CMDPs and guarantee safety during learning is through Lyapunov functions [Chow et al., 2018, 2019]. Despite the fact that some of these algorithms are approximately safe over the learning process, analysing the convergence is challenging and the regret analysis is lacking. Other approaches use Gaussian processes to model the dynamics and/or the value function [Berkenkamp et al., 2017, Wachi et al., 2018, Koller et al., 2018, Cheng et al., 2019] in order to be able to estimate the constraints and (approximately) guarantee safety over learning.

A related approach is the literature about budget learning in bandits [e.g., Ding et al., 2013, Combes et al., 2015]. In this setting, the agent is provided with a budget (known and fix in advance) and the learning process is stopped as soon as the budget is consumed. The goal is to learn how to efficiently handle the budget in order to maximize the cumulative reward. A widely studied case of budget bandit is bandit with knapsack [e.g., Agrawal and Devanur, 2014, Badanidiyuru et al., 2018]. In our setting, we do not have a “real” concept of budget and the length of the learning process does not depend on the total cost of constraint violations. This paper is also related to learning with fairness constraints [e.g., Joseph et al., 2016]. Similarly to conservative exploration, fairness constraints can be sometimes formulated as a specific instance of CMDPs.

2 Preliminaries

We start introducing finite-horizon Markov Decision Processes (MDPs) and their constrained version. We define [N] := {1, . . . , N}, for all .

2.1 Finite-Horizon Constrained MDPs

Finite Horizon MDPs. We consider finite-horizon MDPs with time-dependent dynamics [Puterman, 1994]. A finite-horizon constraint MDP is defined by the tuple ), where S and A are the state and action spaces with cardinalities S and A, respectively. The non-stationary immediate cost for taking an action a at state s is a random variable [0, 1] with expectation ). The transition probability is ), the probability of transitioning to state upon taking action a at state s at time-step h. The initial state in each episode is chosen to be the same state and is the horizon. Furthermore, : is the maximum number of non-zero transition probabilities across the entire state-action pairs.

A Markov non-stationary randomized policy where maps states to probabilities ∆on the action set A. We denote by ), the action taken at time h at state according to a policy . For any ] and (, the state-action value function of a non-stationary policy ) is defined as

where the expectation is over the environment and policy randomness. The value function is (s) = ). Since the horizon is finite, under some regularity conditions, [Shreve and Bertsekas, 1978], there always exists an optimal Markov non-stationary deterministic policy whose value and action-value functions are defined as (((s) and (s, a) = sup). The Bellman principle of optimality (or Bellman optimality equation) allows to efficiently compute the optimal solution of an MDP using backward induction:

where and (), for all . The optimal policy is thus greedy w.r.t. [e.g., Puterman, 1994]. Notice that by boundedness of the cost, for any h and (s, a), all functions , are bounded in [0+ 1].

We can reformulate the optimization problem by using the occupancy measure [e.g., Puterman, 1994, Altman, 1999]. The occupancy measure of a policy is defined as the set of distributions generated by executing the policy in the finite-horizon MDP M [e.g., Zimin and Neu, 2013]:

For ease of notation, we define the matrix notation where its (s, a, h) element is given by ). This implies the following relation between the occupancy measure and the value of a policy:

where such that element (s, a, h) element is given by ).

where the first relation holds by linearity of expectation.

Finite Horizon Constraint MDPs. A constraint MDP [Altman, 1999] is an MDP supplied with a set of I constraints , where and [0, H]. The immediate constraint when taking an action a from state s at time-step h is random variable [0, 1] with expectation ). The expected cost of the constraint violation from state s at time-step h is defined as

Similarly to (2), we can rewrite the constraint in terms of occupancy measure: (). Notice that by boundedness of the constraint cost, for any h, i and (s, a), all functions ), (() are bounded in [0+ 1]. The objective of a CMDP is to find a policy minimizing the cost while satisfying all the constraints. Formally,

where and such that

The optimal value is the value of from the initial state, i.e., (().

Assumption 1 (Feasibility). The unknown CMDP is feasible, i.e., there exists an unknown policy Πwhich satisfies the constraints. Thus, an optimal policy exists as well.

It is important to stress that the optimal policy of a CMDP may be stochastic [e.g., Altman, 1999], i.e., may not exist an optimal deterministic policy. In fact, due to the constraints, the Bellman optimality principle, see Eq. 1, may not hold anymore. This means that we cannot leverage backward induction and the greedy operator. Altman [1999] showed that it is possible to compute the optimal policy of a constrained problem by using linear programming. We will review this approach in Sec. 2.3.

2.2 The Learning Problem.

We consider an agent which repeatedly interacts with a CMDP in a sequence of K episodes of fixed length H by playing a non-stationary policy ) where . Each episode k starts from the fixed initial state . The learning agent does not know the transition or reward functions, and it relies on the samples (i.e., trajectories) observed over episodes to improve its performance over time.

The performance of the agent is measured using multiple objectives: i) the regret relatively to the value of the best policy, and ii) the amount of constraint violations. In sections 3 and 4 we analyze algorithms with guarantees on the following type of regrets

where [. The term Reg) represents the maximum cumulative cost for violations of the constraints.

We later continue and analyze algorithms with reduced computational complexity in sections 5.1 and 5.2. For these algorithms, we supply regret guarantees for all ] with respect to a weaker measure of regrets defined as follows.

Remark 1. Note that in our setting, the immediate regret (() might be negative since policy might violate the constraints. For this reason, bounding the regret as Reg) is stronger than bounding Reg) in the sense that the a bound on the first implies a bound on the latter; but not vice-versa.

Similar relation holds between the two definitions of the constraint violations types of regret; a bound on Reg) implies a bound on Reg(K; d), but the opposite does not holds. In words, a bound on the first implies a bound on the absolute sum of constraint violations where the latter bounds the cumulative constraint violations, and, thus, allows for “error cancellations”.

2.3 Linear Programming for CMDPs

In Sec. 2, we have seen that the cost criteria can be expressed as the expectation of the immediate cost w.r.t. to the occupancy measure. The convexity and compactness of this space is essential for the analysis of constrained MDPs. We refer the reader to [Altman, 1999, Chap. 3 and 4] for an analysis in infinite horizon problems.

We start stating two basic properties of an occupancy measure q. In this section, we remove the dependence on the model p to ease the notation. It is easy to see that the occupancy measure of any policy satisfies [e.g., Zimin and Neu, 2013, Bhattacharya and Kharoufeh, 2017]:

for all . For , we have that

Notice that (s, a) = 1. As a consequence, by summing the first constraint in (8) over s we have that ]. Thus the satisfying the constraints are probability measures. We denote by ∆) the space of occupancy measures.

Since the set ∆) can be described by a set of affine constraints, we can state the following property. Please refer to [e.g., Puterman, 1994, Altman, 1999, Mannor and Tsitsiklis, 2005] for more details.

An important consequence of the linearity of the cost criteria and of the structure of ∆(M) is that the original control problem can be reduced to a Linear Program (LP) where the optimization variables are measures. Furthermore, optimal solutions of the LP define the optimal Markov policy through the occupancy measure. In fact, a policy generates an occupancy measure ∆(M) if

The constrained problem (3) is equivalent to the LP:

2.4 Notations and Definitions.

Throughout the paper, we use ] and ] to denote time-step inside an episode and the index of an episode, respectively. The filtration includes all events (states, actions, and costs) until the end of the k-th episode, including the initial state of the k + 1 episode. We denote by ), the number of times that the agent has visited state-action pair (s, a) at the h-th step, and by, the empirical average of a random variable X. Both quantities are based on experience gathered until the end of the episode and are measurable. Since is measurable, so is (s, a; p). Furthermore, from this definition we have that for any X which is measureable

We use ) to refer to a quantity that depends on X up to a poly-log expression of a quantity at most polynomial in S, A, K, H and . Similarly, represents up to numerical constans or poly-log factors. We define max{X, Y }.

3 Upper Conﬁdence Bounds for CMDPs

We start by considering a natural adaptation of UCRL2 [Jaksch et al., 2010] to the setting of CMDPs which we call OptCMDP (see Algorithm 1).

Let denote the number of times a pair (s, a) was observed before episode k. At each episode, OptCMDP estimates the transition model, cost function and constraint cost function by their empirical average:

Following the approach of optimism-in-the-face-of-uncertainty we would like to act with an optimistic policy. To this end, we generalize the notion of optimism from the bandit setup presented in [Agrawal and Devanur, 2019] to the RL setting. Specifically, we would like for our algorithm to satisfy the following demands:

(a) The optimal policy should be contained in the feasible set in every episode.

(b) Value optimism. The value of every policy should be optimistic relatively to its true value, (() where are the optimistic cost and model by which the algorithm calculates the value of a policy.

Indeed, optimizing over a set which satisfy (a) while satisfying (b) results in an optimistic estimate of ().

Similar to UCRL2, at the beginning of each episode k, OptCMDP constructs confidence intervals for the costs and the dynamics of the CMDP. Formally, for any (we define

where the size of the confidence intervals is built using empirical Bernstein inequality [e.g., Audibert et al., 2007, Maurer and Pontil, 2009] for the transitions and Hoeffding inequality for the costs:

where Var

(1 )) [e.g., Dann and Brunskill, 2015]. The set of plausible CMDPs associated with the confidence intervals is then ) : . Once been computed, OptCMDP finds a solution to the optimization problem

While this problem is well-defined and feasible, we can simplify it and avoid to optimize over the sets and . We define

to be the lower confidence bounds on the costs. Then, we can solve the following optimization problem

Proposition 2. The optimization problem (14) is feasible. Denote by the policy recovered solving (14) and by ) the associated CMDP. Then, policy is optimismtic, i.e.,

Proof. The proof of optimism is reported in Lem. 9 and the feasibility is proven in Lem. 10.

The extended LP problem. Problem (14) is similar to (3), the crucial difference is that the true costs and dynamics are unknown. Since we cannot directly optimize this problem, we propose to rewrite (14) as an extended LP problem by considering the state-action-state occupancy measure ) defined as ). We leverage the Bernstein structure of (see Eq. 10) to formulate the extended LP over variable z:

This LP has ) constraints and ) decision variables. Such an approach was also used in Jin et al. [2019] in a different context. Notice that can be chosen by using different concentration inequalities, e.g., concentration inequality for probability distributions. Rosenberg and Mansour [2019] showed that even in that case we can formulate an extended LP.

Once we have computed z, we can recover the policy and the transitions as

sProposition 2 shows that (a) and (b) are satisfied and the solution is optimistic. This allows us to provide the following guarantees.

Theorem 3 (Regret Bounds for OptCMDP). Fix (0, 1). With probability at least 1 for any ] the following regret bounds hold

4 Exploration Bonus for CMDPs

OptCMDP is an efficient algorithm for exploration in constrained MDPs. An obvious shortcoming of OptCMDP is its high computational complexity due to the solution of the extended LP with ) constraints and decision variables. In this section, we present a bonus-based algorithm for exploration in CMDPs that we call OptCMDP-bonus. This algorithm can be seen as a generalization of UCBVI [Azar et al., 2017] to constrained MDPs. The main advantage of OptCMDP-bonus is that it requires to solve a single CMDP. To this extent, it has to solve an LP problem with O(SAH) constraints and decision variables.

At each episode k, OptCMDP-bonus builds an optimistic CMDP ) where

while

andare the empirical estimates defined in (9). The term integrates the uncertainties about costs and transitions into a single exploration bonus. Formally,

where and are defined as in (11). Then, OptCMDP-bonus solves the following optimization problem

This problem can be solved using the LP described in Sec. 2.3. In App. B.2, we show that is an optimistic policy, i.e., (().

Theorem 4 (Regret Bounds for OptCMDP-bonus). Fix (0, 1). With probability at least 1 for any ] the following regret bounds hold

The regret bounds of OptCMDP-bonus include the same term as of OptCMDP. However, the constant term in the regret bounds of OptCMDP-bonus has worst dependence w.r.t. S, H, N. This suggests that in the limit of large state space the bonus-based approach for CMDPs have worse performance relatively to the optimistic model approach.

Remark 2. The origin of the worst regret bound comes from the larger bonus term (16) we need to add to compensate on the lack of knowledge of the transition model. This bonus term, allows us to replace the optimistic planning w.r.t. a set of transition models (as in OptCMDP) by using the empirical transition model. However, it leads to a value function which is not bounded within [0, H] but within []. To circumvent this problem, a truncated Bellman operator has been used [e.g., Azar et al.,

However, plugging this idea into the CMDP problem (Sec. 2.3) is not simple. In particular, it is not clear how to enforce truncation in the space of occupancy measures. Thus, reduction to LP seems problematic to obtain. At the same time, using dynamic programming to solve CMDP is problematic due to the presence of constraints (and the lack of Bellman optimality principle). We leave it for future work to devise a polynomial algorithm to solve this problem, or establishing it is a “hard-problem” to solve. If solved, it would result in an algorithm with similar performance to that of OptCMDP (up to polylog and constant factors).

5 Optimistic Dual and Primal-Dual Approaches for CMDPs

In previous sections, we analyzed algorithms which require access to a solver of an LP with at least Ω(SHA) decision variables and constraints. In the limit of large state space, solving such linear program is expected to be prohibitively expensive in terms of computational cost. Furthermore, most of the practically used RL algorithms [e.g., Achiam et al., 2017, Tessler et al., 2019] are motivated by the Lagrangian formulation of CMDPs.

Motivated by the need to reduce the computational cost, we follow the Lagrangian approach to CMDPs in which the dual problem to CMDP (3) is being solved. Introducing Lagrange multipliers , the dual problem to (3) is given by

With this in mind, a natural way to solve a CMDP is to use a dual sub-gradient algorithm [see e.g., Beck, 2017] or a primal-dual gradient algorithm. Viewing the problem in this manner, a CMDP can be solved by playing a game between two-player; the agent and the Lagrange multiplier . This process is expected to converge to the Nash equilibrium with value . Furthermore, strong duality is known to hold for CMDP [e.g., Altman, 1999] and thus the expected value of this game is expected to converge to (). This general approach is also followed in the line of works on online learning with long-term constraints [e.g., Mahdavi et al., 2012, Yu et al., 2017]. There, the problem does not have a decision horizon H nor state space as in our case.

As the environment is unknown, and the agents gathers its experience based on samples, the algorithm should use an exploration mechanism with care. To handle the exploration, we use the optimism approach. In the following sections, we formulate and establish regret bounds for optimistic dual and primal-dual approaches to solve a CMDP. These algorithms are computationally easier than the algorithms of previous sections. Unfortunately, the regret bounds obtained in this section are weaker. We establish bounds on Reg(K; c) (resp. Reg(K; d)) instead of Reg) (resp. R