Exploration-Exploitation in Constrained MDPs

2020·arXiv

Abstract

1 Introduction

Markov Decision Processes (MDPs) have been successfully used to model several applications, including video games, robotics, recommender systems and many more. However, MDPs do not take into account additional constrains that can affect the optimal policy and the learning process. For example, while driving, we want to reach our destination but we want to avoid to go off-road, overcome the speed limits, collide with other cars [Garcıa and Fern´andez, 2015]. Constrained MDPs [Altman, 1999] extend MDPs to handle constraints on the long term performance of the policy. A learning agent in a CMDP has to maximize the cumulative reward while satisfying all the constraints. Clearly, the optimal solution of a CMDP is different than the one of an MDP when at least one constraint is active. Then, the optimal policy, among the set of policies which satisfies the constraint, is stochastic.

In this paper, we focus on the online learning problem of CMDPs. While interacting with an unknown MDP, the agent has to trade-off exploration to gather information about the system and exploration to maximize the cumulative reward. Performing such exploration in a CMDP may be unsafe since may lead to numerous violations of the constraints. Since the constraints depend on the long term performance of the agent and the CMDP is unknown, the agent cannot exactly evaluate the constraints. It can only exploit the current information to build an estimate of the constraints. The objective is thus to design an algorithm with a small number of violations of the constraints.

Objective and Contributions. The objective of this technical report is to provide an extensive analysis of exploration strategies for tabular constrained MDPs with finite-horizon cost. Similar to [Agrawal and Devanur, 2019], we allow the agent to violate the constraints over the learning process but we require the cumulative cost of constraint violations to be small (i.e., sublinear). Opposite to [Zheng and Ratliff, 2020], we consider the CMDP to be unknown, i.e., the agent does not know the transition kernel, the reward function and the constraints.

The performance of the learning agent is measured through the regret, that accounts for the difference in executing the optimal policy and the learning agent. We define two regrets: i) the regret w.r.t. to the main objective (as in standard MDP), ii) the regret w.r.t. the constraint violations. These terms account for both convergence to the optimal policy and cumulative cost for violations of the constraints. We introduce and analyze the following exploration strategies:

OptCMDP leverages the ideas of UCRL2 [Jaksch et al., 2010]. At each episodes, it builds a set of plausible CMDPs compatible with the observed samples, and plays the optimal policy of the CMDP with the lowest cost (i.e., optimistic CMDP). To solve this planning problem, we introduce an extended linear programming (LP) problem in the space of occupancy measures. The important property is that there always exists a feasible solution of this extended LP.

OptCMDP-bonus merges the uncertainties about costs and transitions used by OptCMDP into an exploration bonus. As a consequence, OptCMDP-bonus solves a single (optimistic) CMDP rather than planning in the space of plausible CMDPs. This leads to a more computationally efficient algorithm. In fact, this planning problem can be solved through an LP with O(SAH) constraints and decision variables, a factor O(S) smaller than the LP solved by OptCMDP.

OptDual-CMDP leverages the saddle-point formulation of constrained MDP [e.g., Altman, 1999]. It solves this problem using an optimistic version of the dual projected sub-gradient algorithm (e.g., Beck 2017). At each episode, OptDual-CMDP solves an optimistic MDP defined using the estimated Lagrangian multiplier. Then, it uses the computed solution to update the Lagrange multipliers via

Table 1: Summary of the regret bounds obtained in this work. Algorithms OptCMDP, OptCMDP-bonus, OptDual-CMDP, OptPrimalDual-CMDP are formulated and analyzed in sections 3, 4, 5.1, 5.2, respectively. The constant term, which is omitted from the table, of OptCMDP-bonus is significantly worse than the one of OptCMDP. Notice that different types of regrets are bounded (see Section 2 for definitions).

projected sub-gradient. The main advantage of this algorithm needs to solve a simple optimistic planning problem for MDPs (rather than for CMDPs).

OptPrimalDual-CMDP exploits a primal-dual algorithm to solve the saddle-point problem associated to a CMDP. It performs incremental updates both on the primal and dual variables. It uses mirror descent to update the Q-function (thus the policy) and projected subgradient descent to update the Lagrange multipliers. Similarly to OptCMDP-bonus, this algorithm exploits an exploration bonus for both cost and constraint costs. This allows to use a simple dynamic programming approach to compute the Q-functions (no need to solve a constrained optimization problem).

For all the proposed algorithms, we provide an upper-bound to the regret and the cumulative constraint violations (see Tab. 1). While the incremental algorithms (OptDual-CMDP and OptPrimalDual-CMDP) may be more amenable for practical applications, they present limitations from a theoretical perspective. In fact, we were able to prove weaker guarantees for the Lagrangian approaches compared to UCRL-like algorithms (i.e., OptCMDP and OptCMDP-bonus). While for UCRL-like algorithms we can bound the sum of positive errors, for Lagrangian algorithms we were able to bound only the cumulative (signed) error. This weaker term allows for “cancellation of errors” (see discussion in Sec. 2.2). Whether it is possible to provide stronger guarantees is left as an open question. Despite this, we think that the analysis of Lagrangian approaches is important since it is at the core of many practical algorithms. For example, the Lagrangian formulation of CMDPs has been used in [Tessler et al., 2019, Paternain et al., 2019], but never analyzed from a regret perspective.

1.1 Related Work

The problem of online learning under constraints (with guarantees) have been analyzed both in bandits and in RL. Conservative exploration focuses on the problem of learning an optimal policy while satisfying a constrained w.r.t. to a predefined baseline policy. This problem can be seen as a specific instance of CMDPs where the constraint is that the policy should perform (in the long run) better than a predefined baseline policy. Conservative exploration has been analyzed both in bandits [Wu et al., 2016, Kazerouni et al., 2017, Garcelon et al., 2020a] and in RL [Garcelon et al., 2020b]. All these algorithms are able to guarantee that the performance of the learning agent is at least as good as the one of the baseline policy with high probability at any time.While they enjoy strong theoretical guarantees, they performs poorly in practice since are too conservative. In fact, the idea of these algorithms is to build budget (e.g., by playing the baseline policy) in order to be able to take standard exploratory actions. Concurrently to this paper, [Zheng and Ratliff, 2020] has extended conservative exploration to CMDP with average reward objective. They assume that the transition functions are known, but the rewards and costs (i.e., the constraints) are unknown. The goal is thus to guarantee that, at any time, the policy executed by the agent satisfies the constraints with high probability. These requirement poses several limitations. Similarly to [Garcelon et al., 2020b], they need to assume that the MDP is ergodic and that the initial policy is safe (i.e., satisfies the constraints). Furthermore, despite the theoretical guarantees, this approach is not practical due to these strong requirements/assumptions. Agrawal and Devanur [2019] studied the exploration problem for bandits under constraints as well as bandits with knapsack constraints [Badanidiyuru et al., 2013]. Algorithms OptCMDP and OptCMDP-bonus can be understood as generalizing their bandit setting to an CMDP setting. That being said, in the following we derive regret bounds on a stronger type of regret relatively to Agrawal and Devanur [2019] (see Remark 1).

There are several approaches in the literature that have focused on (approximately) solving CMDPs. These methods are mainly based on Lagrangian-formulation [Bhatnagar and Lakshmanan, 2012, Chow et al., 2017, Tessler et al., 2019, Paternain et al., 2019] or constrained optimization [Achiam et al., 2017]. Lagrangianbased methods formulate the CMDP optimization problem as a saddle-point problem and optimize it using primal-dual algorithms. While these algorithms may eventually converge to the true policy, they have no guarantees on the policies recovered during the learning process. Constrained Policy Optimization (CPO) [Achiam et al., 2017] leverages the intuition behind conservative approaches [e.g., Kakade and Langford, 2002] to force the policy to improve overtime. This is a practical implementation of conservative exploration where the baseline policy is updated at each iteration.

Another way to solve CMDPs and guarantee safety during learning is through Lyapunov functions [Chow et al., 2018, 2019]. Despite the fact that some of these algorithms are approximately safe over the learning process, analysing the convergence is challenging and the regret analysis is lacking. Other approaches use Gaussian processes to model the dynamics and/or the value function [Berkenkamp et al., 2017, Wachi et al., 2018, Koller et al., 2018, Cheng et al., 2019] in order to be able to estimate the constraints and (approximately) guarantee safety over learning.

A related approach is the literature about budget learning in bandits [e.g., Ding et al., 2013, Combes et al., 2015]. In this setting, the agent is provided with a budget (known and fix in advance) and the learning process is stopped as soon as the budget is consumed. The goal is to learn how to efficiently handle the budget in order to maximize the cumulative reward. A widely studied case of budget bandit is bandit with knapsack [e.g., Agrawal and Devanur, 2014, Badanidiyuru et al., 2018]. In our setting, we do not have a “real” concept of budget and the length of the learning process does not depend on the total cost of constraint violations. This paper is also related to learning with fairness constraints [e.g., Joseph et al., 2016]. Similarly to conservative exploration, fairness constraints can be sometimes formulated as a specific instance of CMDPs.

2 Preliminaries

We start introducing finite-horizon Markov Decision Processes (MDPs) and their constrained version. We define [N] := {1, . . . , N}, for all .

2.1 Finite-Horizon Constrained MDPs

Finite Horizon MDPs. We consider finite-horizon MDPs with time-dependent dynamics [Puterman, 1994]. A finite-horizon constraint MDP is defined by the tuple ), where S and A are the state and action spaces with cardinalities S and A, respectively. The non-stationary immediate cost for taking an action a at state s is a random variable [0, 1] with expectation ). The transition probability is ), the probability of transitioning to state upon taking action a at state s at time-step h. The initial state in each episode is chosen to be the same state and is the horizon. Furthermore, : is the maximum number of non-zero transition probabilities across the entire state-action pairs.

A Markov non-stationary randomized policy where maps states to probabilities ∆on the action set A. We denote by ), the action taken at time h at state according to a policy . For any ] and (, the state-action value function of a non-stationary policy ) is defined as

where the expectation is over the environment and policy randomness. The value function is (s) = ). Since the horizon is finite, under some regularity conditions, [Shreve and Bertsekas, 1978], there always exists an optimal Markov non-stationary deterministic policy whose value and action-value functions are defined as (((s) and (s, a) = sup). The Bellman principle of optimality (or Bellman optimality equation) allows to efficiently compute the optimal solution of an MDP using backward induction:

where and (), for all . The optimal policy is thus greedy w.r.t. [e.g., Puterman, 1994]. Notice that by boundedness of the cost, for any h and (s, a), all functions , are bounded in [0+ 1].

We can reformulate the optimization problem by using the occupancy measure [e.g., Puterman, 1994, Altman, 1999]. The occupancy measure of a policy is defined as the set of distributions generated by executing the policy in the finite-horizon MDP M [e.g., Zimin and Neu, 2013]:

For ease of notation, we define the matrix notation where its (s, a, h) element is given by ). This implies the following relation between the occupancy measure and the value of a policy:

where such that element (s, a, h) element is given by ).

where the first relation holds by linearity of expectation.

Finite Horizon Constraint MDPs. A constraint MDP [Altman, 1999] is an MDP supplied with a set of I constraints , where and [0, H]. The immediate constraint when taking an action a from state s at time-step h is random variable [0, 1] with expectation ). The expected cost of the constraint violation from state s at time-step h is defined as

Similarly to (2), we can rewrite the constraint in terms of occupancy measure: (). Notice that by boundedness of the constraint cost, for any h, i and (s, a), all functions ), (() are bounded in [0+ 1]. The objective of a CMDP is to find a policy minimizing the cost while satisfying all the constraints. Formally,

where and such that

The optimal value is the value of from the initial state, i.e., (().

Assumption 1 (Feasibility). The unknown CMDP is feasible, i.e., there exists an unknown policy Πwhich satisfies the constraints. Thus, an optimal policy exists as well.

It is important to stress that the optimal policy of a CMDP may be stochastic [e.g., Altman, 1999], i.e., may not exist an optimal deterministic policy. In fact, due to the constraints, the Bellman optimality principle, see Eq. 1, may not hold anymore. This means that we cannot leverage backward induction and the greedy operator. Altman [1999] showed that it is possible to compute the optimal policy of a constrained problem by using linear programming. We will review this approach in Sec. 2.3.

2.2 The Learning Problem.

We consider an agent which repeatedly interacts with a CMDP in a sequence of K episodes of fixed length H by playing a non-stationary policy ) where . Each episode k starts from the fixed initial state . The learning agent does not know the transition or reward functions, and it relies on the samples (i.e., trajectories) observed over episodes to improve its performance over time.

The performance of the agent is measured using multiple objectives: i) the regret relatively to the value of the best policy, and ii) the amount of constraint violations. In sections 3 and 4 we analyze algorithms with guarantees on the following type of regrets

where [. The term Reg) represents the maximum cumulative cost for violations of the constraints.

We later continue and analyze algorithms with reduced computational complexity in sections 5.1 and 5.2. For these algorithms, we supply regret guarantees for all ] with respect to a weaker measure of regrets defined as follows.

Remark 1. Note that in our setting, the immediate regret (() might be negative since policy might violate the constraints. For this reason, bounding the regret as Reg) is stronger than bounding Reg) in the sense that the a bound on the first implies a bound on the latter; but not vice-versa.

Similar relation holds between the two definitions of the constraint violations types of regret; a bound on Reg) implies a bound on Reg(K; d), but the opposite does not holds. In words, a bound on the first implies a bound on the absolute sum of constraint violations where the latter bounds the cumulative constraint violations, and, thus, allows for “error cancellations”.

2.3 Linear Programming for CMDPs

In Sec. 2, we have seen that the cost criteria can be expressed as the expectation of the immediate cost w.r.t. to the occupancy measure. The convexity and compactness of this space is essential for the analysis of constrained MDPs. We refer the reader to [Altman, 1999, Chap. 3 and 4] for an analysis in infinite horizon problems.

We start stating two basic properties of an occupancy measure q. In this section, we remove the dependence on the model p to ease the notation. It is easy to see that the occupancy measure of any policy satisfies [e.g., Zimin and Neu, 2013, Bhattacharya and Kharoufeh, 2017]:

for all . For , we have that

Notice that (s, a) = 1. As a consequence, by summing the first constraint in (8) over s we have that ]. Thus the satisfying the constraints are probability measures. We denote by ∆) the space of occupancy measures.

Since the set ∆) can be described by a set of affine constraints, we can state the following property. Please refer to [e.g., Puterman, 1994, Altman, 1999, Mannor and Tsitsiklis, 2005] for more details.

An important consequence of the linearity of the cost criteria and of the structure of ∆(M) is that the original control problem can be reduced to a Linear Program (LP) where the optimization variables are measures. Furthermore, optimal solutions of the LP define the optimal Markov policy through the occupancy measure. In fact, a policy generates an occupancy measure ∆(M) if

The constrained problem (3) is equivalent to the LP:

2.4 Notations and Definitions.

Throughout the paper, we use ] and ] to denote time-step inside an episode and the index of an episode, respectively. The filtration includes all events (states, actions, and costs) until the end of the k-th episode, including the initial state of the k + 1 episode. We denote by ), the number of times that the agent has visited state-action pair (s, a) at the h-th step, and by, the empirical average of a random variable X. Both quantities are based on experience gathered until the end of the episode and are measurable. Since is measurable, so is (s, a; p). Furthermore, from this definition we have that for any X which is measureable

We use ) to refer to a quantity that depends on X up to a poly-log expression of a quantity at most polynomial in S, A, K, H and . Similarly, represents up to numerical constans or poly-log factors. We define max{X, Y }.

3 Upper Conﬁdence Bounds for CMDPs

We start by considering a natural adaptation of UCRL2 [Jaksch et al., 2010] to the setting of CMDPs which we call OptCMDP (see Algorithm 1).

Let denote the number of times a pair (s, a) was observed before episode k. At each episode, OptCMDP estimates the transition model, cost function and constraint cost function by their empirical average:

Following the approach of optimism-in-the-face-of-uncertainty we would like to act with an optimistic policy. To this end, we generalize the notion of optimism from the bandit setup presented in [Agrawal and Devanur, 2019] to the RL setting. Specifically, we would like for our algorithm to satisfy the following demands:

(a) The optimal policy should be contained in the feasible set in every episode.

(b) Value optimism. The value of every policy should be optimistic relatively to its true value, ((