Markov Decision Processes (MDPs) have been successfully used to model several applications, including video games, robotics, recommender systems and many more. However, MDPs do not take into account additional constrains that can affect the optimal policy and the learning process. For example, while driving, we want to reach our destination but we want to avoid to go off-road, overcome the speed limits, collide with other cars [Garcıa and Fern´andez, 2015]. Constrained MDPs [Altman, 1999] extend MDPs to handle constraints on the long term performance of the policy. A learning agent in a CMDP has to maximize the cumulative reward while satisfying all the constraints. Clearly, the optimal solution of a CMDP is different than the one of an MDP when at least one constraint is active. Then, the optimal policy, among the set of policies which satisfies the constraint, is stochastic.
In this paper, we focus on the online learning problem of CMDPs. While interacting with an unknown MDP, the agent has to trade-off exploration to gather information about the system and exploration to maximize the cumulative reward. Performing such exploration in a CMDP may be unsafe since may lead to numerous violations of the constraints. Since the constraints depend on the long term performance of the agent and the CMDP is unknown, the agent cannot exactly evaluate the constraints. It can only exploit the current information to build an estimate of the constraints. The objective is thus to design an algorithm with a small number of violations of the constraints.
Objective and Contributions. The objective of this technical report is to provide an extensive analysis of exploration strategies for tabular constrained MDPs with finite-horizon cost. Similar to [Agrawal and Devanur, 2019], we allow the agent to violate the constraints over the learning process but we require the cumulative cost of constraint violations to be small (i.e., sublinear). Opposite to [Zheng and Ratliff, 2020], we consider the CMDP to be unknown, i.e., the agent does not know the transition kernel, the reward function and the constraints.
The performance of the learning agent is measured through the regret, that accounts for the difference in executing the optimal policy and the learning agent. We define two regrets: i) the regret w.r.t. to the main objective (as in standard MDP), ii) the regret w.r.t. the constraint violations. These terms account for both convergence to the optimal policy and cumulative cost for violations of the constraints. We introduce and analyze the following exploration strategies:
OptCMDP leverages the ideas of UCRL2 [Jaksch et al., 2010]. At each episodes, it builds a set of plausible CMDPs compatible with the observed samples, and plays the optimal policy of the CMDP with the lowest cost (i.e., optimistic CMDP). To solve this planning problem, we introduce an extended linear programming (LP) problem in the space of occupancy measures. The important property is that there always exists a feasible solution of this extended LP.
OptCMDP-bonus merges the uncertainties about costs and transitions used by OptCMDP into an exploration bonus. As a consequence, OptCMDP-bonus solves a single (optimistic) CMDP rather than planning in the space of plausible CMDPs. This leads to a more computationally efficient algorithm. In fact, this planning problem can be solved through an LP with O(SAH) constraints and decision variables, a factor O(S) smaller than the LP solved by OptCMDP.
OptDual-CMDP leverages the saddle-point formulation of constrained MDP [e.g., Altman, 1999]. It solves this problem using an optimistic version of the dual projected sub-gradient algorithm (e.g., Beck 2017). At each episode, OptDual-CMDP solves an optimistic MDP defined using the estimated Lagrangian multiplier. Then, it uses the computed solution to update the Lagrange multipliers via
Table 1: Summary of the regret bounds obtained in this work. Algorithms OptCMDP, OptCMDP-bonus, OptDual-CMDP, OptPrimalDual-CMDP are formulated and analyzed in sections 3, 4, 5.1, 5.2, respectively. The constant term, which is omitted from the table, of OptCMDP-bonus is significantly worse than the one of OptCMDP. Notice that different types of regrets are bounded (see Section 2 for definitions).
projected sub-gradient. The main advantage of this algorithm needs to solve a simple optimistic planning problem for MDPs (rather than for CMDPs).
OptPrimalDual-CMDP exploits a primal-dual algorithm to solve the saddle-point problem associated to a CMDP. It performs incremental updates both on the primal and dual variables. It uses mirror descent to update the Q-function (thus the policy) and projected subgradient descent to update the Lagrange multipliers. Similarly to OptCMDP-bonus, this algorithm exploits an exploration bonus for both cost and constraint costs. This allows to use a simple dynamic programming approach to compute the Q-functions (no need to solve a constrained optimization problem).
For all the proposed algorithms, we provide an upper-bound to the regret and the cumulative constraint violations (see Tab. 1). While the incremental algorithms (OptDual-CMDP and OptPrimalDual-CMDP) may be more amenable for practical applications, they present limitations from a theoretical perspective. In fact, we were able to prove weaker guarantees for the Lagrangian approaches compared to UCRL-like algorithms (i.e., OptCMDP and OptCMDP-bonus). While for UCRL-like algorithms we can bound the sum of positive errors, for Lagrangian algorithms we were able to bound only the cumulative (signed) error. This weaker term allows for “cancellation of errors” (see discussion in Sec. 2.2). Whether it is possible to provide stronger guarantees is left as an open question. Despite this, we think that the analysis of Lagrangian approaches is important since it is at the core of many practical algorithms. For example, the Lagrangian formulation of CMDPs has been used in [Tessler et al., 2019, Paternain et al., 2019], but never analyzed from a regret perspective.
1.1 Related Work
The problem of online learning under constraints (with guarantees) have been analyzed both in bandits and in RL. Conservative exploration focuses on the problem of learning an optimal policy while satisfying a constrained w.r.t. to a predefined baseline policy. This problem can be seen as a specific instance of CMDPs where the constraint is that the policy should perform (in the long run) better than a predefined baseline policy. Conservative exploration has been analyzed both in bandits [Wu et al., 2016, Kazerouni et al., 2017, Garcelon et al., 2020a] and in RL [Garcelon et al., 2020b]. All these algorithms are able to guarantee that the performance of the learning agent is at least as good as the one of the baseline policy with high probability at any time.While they enjoy strong theoretical guarantees, they performs poorly in practice since are too conservative. In fact, the idea of these algorithms is to build budget (e.g., by playing the baseline policy) in order to be able to take standard exploratory actions. Concurrently to this paper, [Zheng and Ratliff, 2020] has extended conservative exploration to CMDP with average reward objective. They assume that the transition functions are known, but the rewards and costs (i.e., the constraints) are unknown. The goal is thus to guarantee that, at any time, the policy executed by the agent satisfies the constraints with high probability. These requirement poses several limitations. Similarly to [Garcelon et al., 2020b], they need to assume that the MDP is ergodic and that the initial policy is safe (i.e., satisfies the constraints). Furthermore, despite the theoretical guarantees, this approach is not practical due to these strong requirements/assumptions. Agrawal and Devanur [2019] studied the exploration problem for bandits under constraints as well as bandits with knapsack constraints [Badanidiyuru et al., 2013]. Algorithms OptCMDP and OptCMDP-bonus can be understood as generalizing their bandit setting to an CMDP setting. That being said, in the following we derive regret bounds on a stronger type of regret relatively to Agrawal and Devanur [2019] (see Remark 1).
There are several approaches in the literature that have focused on (approximately) solving CMDPs. These methods are mainly based on Lagrangian-formulation [Bhatnagar and Lakshmanan, 2012, Chow et al., 2017, Tessler et al., 2019, Paternain et al., 2019] or constrained optimization [Achiam et al., 2017]. Lagrangianbased methods formulate the CMDP optimization problem as a saddle-point problem and optimize it using primal-dual algorithms. While these algorithms may eventually converge to the true policy, they have no guarantees on the policies recovered during the learning process. Constrained Policy Optimization (CPO) [Achiam et al., 2017] leverages the intuition behind conservative approaches [e.g., Kakade and Langford, 2002] to force the policy to improve overtime. This is a practical implementation of conservative exploration where the baseline policy is updated at each iteration.
Another way to solve CMDPs and guarantee safety during learning is through Lyapunov functions [Chow et al., 2018, 2019]. Despite the fact that some of these algorithms are approximately safe over the learning process, analysing the convergence is challenging and the regret analysis is lacking. Other approaches use Gaussian processes to model the dynamics and/or the value function [Berkenkamp et al., 2017, Wachi et al., 2018, Koller et al., 2018, Cheng et al., 2019] in order to be able to estimate the constraints and (approximately) guarantee safety over learning.
A related approach is the literature about budget learning in bandits [e.g., Ding et al., 2013, Combes et al., 2015]. In this setting, the agent is provided with a budget (known and fix in advance) and the learning process is stopped as soon as the budget is consumed. The goal is to learn how to efficiently handle the budget in order to maximize the cumulative reward. A widely studied case of budget bandit is bandit with knapsack [e.g., Agrawal and Devanur, 2014, Badanidiyuru et al., 2018]. In our setting, we do not have a “real” concept of budget and the length of the learning process does not depend on the total cost of constraint violations. This paper is also related to learning with fairness constraints [e.g., Joseph et al., 2016]. Similarly to conservative exploration, fairness constraints can be sometimes formulated as a specific instance of CMDPs.
We start introducing finite-horizon Markov Decision Processes (MDPs) and their constrained version. We define [N] := {1, . . . , N}, for all .
2.1 Finite-Horizon Constrained MDPs
Finite Horizon MDPs. We consider finite-horizon MDPs with time-dependent dynamics [Puterman, 1994]. A finite-horizon constraint MDP is defined by the tuple ), where S and A are the state and action spaces with cardinalities S and A, respectively. The non-stationary immediate cost for taking an action a at state s is a random variable
[0, 1] with expectation
). The transition probability is
), the probability of transitioning to state
upon taking action a at state s at time-step h. The initial state in each episode is chosen to be the same state
and
is the horizon. Furthermore,
:
is the maximum number of non-zero transition probabilities across the entire state-action pairs.
A Markov non-stationary randomized policy where
maps states to probabilities ∆
on the action set A. We denote by
), the action taken at time h at state
according to a policy
. For any
] and (
, the state-action value function of a non-stationary policy
) is defined as
where the expectation is over the environment and policy randomness. The value function is (s) =
). Since the horizon is finite, under some regularity conditions, [Shreve and Bertsekas, 1978], there always exists an optimal Markov non-stationary deterministic policy
whose value and action-value functions are defined as
(
(
(s) and
(s, a) = sup
). The Bellman principle of optimality (or Bellman optimality equation) allows to efficiently compute the optimal solution of an MDP using backward induction:
where and
(
), for all
. The optimal policy
is thus greedy w.r.t.
[e.g., Puterman, 1994]. Notice that by boundedness of the cost, for any h and (s, a), all functions
,
are bounded in [0
+ 1].
We can reformulate the optimization problem by using the occupancy measure [e.g., Puterman, 1994, Altman, 1999]. The occupancy measure of a policy
is defined as the set of distributions generated by executing the policy
in the finite-horizon MDP M [e.g., Zimin and Neu, 2013]:
For ease of notation, we define the matrix notation where its (s, a, h) element is given by
). This implies the following relation between the occupancy measure and the value of a policy:
where such that element (s, a, h) element is given by
).
where the first relation holds by linearity of expectation.
Finite Horizon Constraint MDPs. A constraint MDP [Altman, 1999] is an MDP supplied with a set of I constraints , where
and
[0, H]. The immediate
constraint when taking an action a from state s at time-step h is random variable
[0, 1] with expectation
). The expected cost of the
constraint violation from state s at time-step h is defined as
Similarly to (2), we can rewrite the constraint in terms of occupancy measure: (
). Notice that by boundedness of the constraint cost, for any h, i and (s, a), all functions
),
(
(
) are bounded in [0
+ 1]. The objective of a CMDP is to find a policy minimizing the cost while satisfying all the constraints. Formally,
where and
such that
The optimal value is the value of from the initial state, i.e.,
(
(
).
Assumption 1 (Feasibility). The unknown CMDP is feasible, i.e., there exists an unknown policy Π
which satisfies the constraints. Thus, an optimal policy exists as well.
It is important to stress that the optimal policy of a CMDP may be stochastic [e.g., Altman, 1999], i.e., may not exist an optimal deterministic policy. In fact, due to the constraints, the Bellman optimality principle, see Eq. 1, may not hold anymore. This means that we cannot leverage backward induction and the greedy operator. Altman [1999] showed that it is possible to compute the optimal policy of a constrained problem by using linear programming. We will review this approach in Sec. 2.3.
2.2 The Learning Problem.
We consider an agent which repeatedly interacts with a CMDP in a sequence of K episodes of fixed length H by playing a non-stationary policy ) where
. Each episode k starts from the fixed initial state
. The learning agent does not know the transition or reward functions, and it relies on the samples (i.e., trajectories) observed over episodes to improve its performance over time.
The performance of the agent is measured using multiple objectives: i) the regret relatively to the value of the best policy, and ii) the amount of constraint violations. In sections 3 and 4 we analyze algorithms with guarantees on the following type of regrets
where [. The term Reg
) represents the maximum cumulative cost for violations of the constraints.
We later continue and analyze algorithms with reduced computational complexity in sections 5.1 and 5.2. For these algorithms, we supply regret guarantees for all ] with respect to a weaker measure of regrets defined as follows.
Remark 1. Note that in our setting, the immediate regret (
(
) might be negative since policy
might violate the constraints. For this reason, bounding the regret as Reg
) is stronger than bounding Reg
) in the sense that the a bound on the first implies a bound on the latter; but not vice-versa.
Similar relation holds between the two definitions of the constraint violations types of regret; a bound on Reg) implies a bound on Reg(K; d), but the opposite does not holds. In words, a bound on the first implies a bound on the absolute sum of constraint violations where the latter bounds the cumulative constraint violations, and, thus, allows for “error cancellations”.
2.3 Linear Programming for CMDPs
In Sec. 2, we have seen that the cost criteria can be expressed as the expectation of the immediate cost w.r.t. to the occupancy measure. The convexity and compactness of this space is essential for the analysis of constrained MDPs. We refer the reader to [Altman, 1999, Chap. 3 and 4] for an analysis in infinite horizon problems.
We start stating two basic properties of an occupancy measure q. In this section, we remove the dependence on the model p to ease the notation. It is easy to see that the occupancy measure of any policy satisfies [e.g., Zimin and Neu, 2013, Bhattacharya and Kharoufeh, 2017]:
for all . For
, we have that
Notice that (s, a) = 1. As a consequence, by summing the first constraint in (8) over s we have that
]. Thus the
satisfying the constraints are probability measures. We denote by ∆
) the space of occupancy measures.
Since the set ∆) can be described by a set of affine constraints, we can state the following property. Please refer to [e.g., Puterman, 1994, Altman, 1999, Mannor and Tsitsiklis, 2005] for more details.
An important consequence of the linearity of the cost criteria and of the structure of ∆(M) is that the original control problem can be reduced to a Linear Program (LP) where the optimization variables are measures. Furthermore, optimal solutions of the LP define the optimal Markov policy through the occupancy measure. In fact, a policy generates an occupancy measure
∆(M) if
The constrained problem (3) is equivalent to the LP:
2.4 Notations and Definitions.
Throughout the paper, we use ] and
] to denote time-step inside an episode and the index of an episode, respectively. The filtration
includes all events (states, actions, and costs) until the end of the k-th episode, including the initial state of the k + 1 episode. We denote by
), the number of times that the agent has visited state-action pair (s, a) at the h-th step, and by
, the empirical average of a random variable X. Both quantities are based on experience gathered until the end of the
episode and are
measurable. Since
is
measurable, so is
(s, a; p). Furthermore, from this definition we have that for any X which is
measureable
We use ) to refer to a quantity that depends on X up to a poly-log expression of a quantity at most polynomial in S, A, K, H and
. Similarly,
represents
up to numerical constans or poly-log factors. We define
max{X, Y }.
We start by considering a natural adaptation of UCRL2 [Jaksch et al., 2010] to the setting of CMDPs which we call OptCMDP (see Algorithm 1).
Let denote the number of times a pair (s, a) was observed before episode k. At each episode, OptCMDP estimates the transition model, cost function and constraint cost function by their empirical average:
Following the approach of optimism-in-the-face-of-uncertainty we would like to act with an optimistic policy. To this end, we generalize the notion of optimism from the bandit setup presented in [Agrawal and Devanur, 2019] to the RL setting. Specifically, we would like for our algorithm to satisfy the following demands:
(a) The optimal policy
should be contained in the feasible set in every episode.
(b) Value optimism. The value of every policy should be optimistic relatively to its true value, (
(
) where
are the optimistic cost and model by which the algorithm calculates the value of a policy.
Indeed, optimizing over a set which satisfy (a) while satisfying (b) results in an optimistic estimate of (
).
Similar to UCRL2, at the beginning of each episode k, OptCMDP constructs confidence intervals for the costs and the dynamics of the CMDP. Formally, for any (we define
where the size of the confidence intervals is built using empirical Bernstein inequality [e.g., Audibert et al., 2007, Maurer and Pontil, 2009] for the transitions and Hoeffding inequality for the costs:
where Var
(1
)) [e.g., Dann and Brunskill, 2015]. The set of plausible CMDPs associated with the confidence intervals is then
) :
. Once
been computed, OptCMDP finds a solution to the optimization problem
While this problem is well-defined and feasible, we can simplify it and avoid to optimize over the sets and
. We define
to be the lower confidence bounds on the costs. Then, we can solve the following optimization problem
Proposition 2. The optimization problem (14) is feasible. Denote by the policy recovered solving (14) and by
) the associated CMDP. Then, policy
is optimismtic, i.e.,
Proof. The proof of optimism is reported in Lem. 9 and the feasibility is proven in Lem. 10.
The extended LP problem. Problem (14) is similar to (3), the crucial difference is that the true costs and dynamics are unknown. Since we cannot directly optimize this problem, we propose to rewrite (14) as an extended LP problem by considering the state-action-state occupancy measure ) defined as
). We leverage the Bernstein structure of
(see Eq. 10) to formulate the extended LP over variable z:
This LP has ) constraints and
) decision variables. Such an approach was also used in Jin et al. [2019] in a different context. Notice that
can be chosen by using different concentration inequalities, e.g.,
concentration inequality for probability distributions. Rosenberg and Mansour [2019] showed that even in that case we can formulate an extended LP.
Once we have computed z, we can recover the policy and the transitions as
sProposition 2 shows that (a) and (b) are satisfied and the solution is optimistic. This allows us to provide the following guarantees.
Theorem 3 (Regret Bounds for OptCMDP). Fix (0, 1). With probability at least 1
for any
] the following regret bounds hold
OptCMDP is an efficient algorithm for exploration in constrained MDPs. An obvious shortcoming of OptCMDP is its high computational complexity due to the solution of the extended LP with ) constraints and decision variables. In this section, we present a bonus-based algorithm for exploration in CMDPs that we call OptCMDP-bonus. This algorithm can be seen as a generalization of UCBVI [Azar et al., 2017] to constrained MDPs. The main advantage of OptCMDP-bonus is that it requires to solve a single CMDP. To this extent, it has to solve an LP problem with O(SAH) constraints and decision variables.
At each episode k, OptCMDP-bonus builds an optimistic CMDP ) where
while
and
are the empirical estimates defined in (9). The term
integrates the uncertainties about costs and transitions into a single exploration bonus. Formally,
where and
are defined as in (11). Then, OptCMDP-bonus solves the following optimization problem
This problem can be solved using the LP described in Sec. 2.3. In App. B.2, we show that is an optimistic policy, i.e.,
(
(
).
Theorem 4 (Regret Bounds for OptCMDP-bonus). Fix (0, 1). With probability at least 1
for any
] the following regret bounds hold
The regret bounds of OptCMDP-bonus include the same term as of OptCMDP. However, the constant term in the regret bounds of OptCMDP-bonus has worst dependence w.r.t. S, H, N. This suggests that in the limit of large state space the bonus-based approach for CMDPs have worse performance relatively to the optimistic model approach.
Remark 2. The origin of the worst regret bound comes from the larger bonus term (16) we need to add to compensate on the lack of knowledge of the transition model. This bonus term, allows us to replace the optimistic planning w.r.t. a set of transition models (as in OptCMDP) by using the empirical transition model. However, it leads to a value function which is not bounded within [0, H] but within []. To circumvent this problem, a truncated Bellman operator has been used [e.g., Azar et al.,
However, plugging this idea into the CMDP problem (Sec. 2.3) is not simple. In particular, it is not clear how to enforce truncation in the space of occupancy measures. Thus, reduction to LP seems problematic to obtain. At the same time, using dynamic programming to solve CMDP is problematic due to the presence of constraints (and the lack of Bellman optimality principle). We leave it for future work to devise a polynomial algorithm to solve this problem, or establishing it is a “hard-problem” to solve. If solved, it would result in an algorithm with similar performance to that of OptCMDP (up to polylog and constant factors).
In previous sections, we analyzed algorithms which require access to a solver of an LP with at least Ω(SHA) decision variables and constraints. In the limit of large state space, solving such linear program is expected to be prohibitively expensive in terms of computational cost. Furthermore, most of the practically used RL algorithms [e.g., Achiam et al., 2017, Tessler et al., 2019] are motivated by the Lagrangian formulation of CMDPs.
Motivated by the need to reduce the computational cost, we follow the Lagrangian approach to CMDPs in which the dual problem to CMDP (3) is being solved. Introducing Lagrange multipliers , the dual problem to (3) is given by
With this in mind, a natural way to solve a CMDP is to use a dual sub-gradient algorithm [see e.g., Beck, 2017] or a primal-dual gradient algorithm. Viewing the problem in this manner, a CMDP can be solved by playing a game between two-player; the agent and the Lagrange multiplier
. This process is expected to converge to the Nash equilibrium with value
. Furthermore, strong duality is known to hold for CMDP [e.g., Altman, 1999] and thus the expected value of this game is expected to converge to
(
). This general approach is also followed in the line of works on online learning with long-term constraints [e.g., Mahdavi et al., 2012, Yu et al., 2017]. There, the problem does not have a decision horizon H nor state space as in our case.
As the environment is unknown, and the agents gathers its experience based on samples, the algorithm should use an exploration mechanism with care. To handle the exploration, we use the optimism approach. In the following sections, we formulate and establish regret bounds for optimistic dual and primal-dual approaches to solve a CMDP. These algorithms are computationally easier than the algorithms of previous sections. Unfortunately, the regret bounds obtained in this section are weaker. We establish bounds on Reg(K; c) (resp. Reg(K; d)) instead of Reg) (resp. Reg
)) as in previous section (see Sec. 2.2 for details).
5.1 Optimistic Dual Algorithm for CMDPs
We start by describing the optimistic dual approach for CMDPs. OptDual-CMDP is based upon the dual projected sub-gradient algorithm (e.g., Beck [2017]). It can also be interpreted through the lens of online learning. In this sense, we can interpret OptDual-CMDP as solving a two-player game in a decentralized manner where the first player (the agent, ) applies “be-the-leader” algorithm, and the second player (the Lagrange multiplier,
) uses projected gradient-descent.
Algorithm OptDual-CMDP (see Alg. 3) acts by performing two stages in each iteration. At the first stage it solves the following optimistic problem:
where and
are the same as in Sec. 3 (refer to (10) and (13)). This problem corresponds to finding the optimal policy (denoted
) of the following extended MDP
) :
(
(
. Since this is an extended MDP and not
a CMDP, we can use standard dynamic programming techniques. One possibility is to use the extended LP similar to the one introduced in Sec. 3. Otherwise, we can use backward induction to compute
with . Then,
arg min
). To compute
(s, a) we can use Alg. 3 in [Jin et al., 2019].
At the second stage, OptDual-CMDP updates the Lagrange multipliers proportionally to the violation of the “optimistic” constraints: +
(
.
The following assumption is standard for the analysis of dual projected sub-gradient method which we make as well. This assumption is quite mild and demands a policy which satisfy the constraint with equality exists. For example, a policy with zero constraint-cost (from state ) exists this assumption hold.
for which
for all the constraints
]. Set
The following theorem establishes guarantees for both the performance and the total constraint violation (see App. C for the proof).
Theorem 5 (Regret Bounds for OptDual-CMDP). For any ] the regrets the following bounds hold
See that the regret bounded in Theorem 5 is Reg and not Regas in Sec. 3 and 4. This difference in types of regret, as we believe, is not an artifact of the analysis. It can be directly attributed to bounds from convex analysis [Beck, 2017]. Meaning, establishing a guarantee on Reg
, instead on Reg, for OptDual-CMDP requires to improve convergence guarantees of dual projected gradient-descent.
Finally, we think that it may be possible to use exploration bonus instead of solving the extended problem. However, we leave this point for future work.
5.2 Optimistic Primal Dual approach for CMDPs
In this section, we formulate and analyze OptPrimalDual-CMDP (Algorithm 4). This algorithm performs incremental, optimistic updates of both primal and dual variables. Optimism is achieved by using exploration bonuses (refer to Sec. 4).
Instead of solving an extended MDP as OptDual-CMDP, OptPrimalDual-CMDPevaluates the Q-functions of both the cost and constraint cost w.r.t. the current policy by using the optimistic costs
and the empirical transition model ¯
. Note that the optimistic cost and constraint costs are obtained using
the exploration bonus ) defined in Eq. 15 (see also Eq. 14). Then, it applies a Mirror Descent (MD) [Beck and Teboulle, 2003] update on the weighted Q-function
and updates the dual variables, i.e., the Lagrange multipliers , by a projected gradient step. Since we optimize over the simplex and choose the Bregman distance to be the KL-divergence, the update rule of MD has a close solution (see the policy update step in Alg. 4).
Importantly, in the policy evaluation stage OptPrimalDual-CMDP uses a truncated policy evaluation, which prevents the value function to be negative (see Algorithm 5). This allows us to avoid the problems experienced in OptCMDP-bonus when such truncation is not being performed.
Furthermore, differently then in OptDual-CMDP, in OptPrimalDual-CMDP we project the dual parameter to be within the set Λ: 0
. Such projection can be done efficiently. We remark that such an approach was also applied in [Nedi´c and Ozdaglar, 2009] for convex-concave saddle-points problems. The reason for restricting the set of Lagrange multipliers to Λ
for our needs is to keep
bounded (if a component of
diverges then
might diverge). On the other hand, we wish to keep the set sufficiently big- otherwise, we cannot supply guarantees on the constraint violations. The set Λ
is sufficient to meet both these needs. We remark that projecting on Λ
with
would also lead to convergence guarantees by applying similar proof techniques.
The computational complexity of OptPrimalDual-CMDP amounts to estimate the state-action value functions (
(
) instead of solving an extended MDP as in OptDual-CMDP. However, as the following theorem establishes, the reduced computational cost comes with a worse regrets guarantees. As for OptDual-CMDP we assume a slater point exists (see Assumption 2).
The following theorem establishes guarantees for both the performance and the total constraint violation (see App. D for the proof).
Theorem 6 (Regret Bounds for OptPrimalDual-CMDP). For any ] the regrets the following bounds hold
Observe that Theorem 6 has worst performance relatively to Theorem 5 w.r.t. the terms multiplying theterm. However, its constant term has similar performance to the constant term in Theorem 5.
In this work, we formulated and analyzed different algorithms by which safety constraints can be combined in the framework of RL by combining learning in CMDPs. We investigated both UCRL-like approaches (Sec. 3 and 4) motivated by UCRL2 [Jaksch et al., 2010], as well as, optimistic dual and primal-dual approaches, motivated by practical successes of closely related algorithms [e.g., Achiam et al., 2017, Tessler et al., 2019]. For all these algorithms, we established regret guarantees for both the performance and constraint violations.
Interestingly, although the dual and primal-dual approaches are nowadays more practically acceptable, we uncovered an important deficiency of these methods; these have ‘weaker’ performance guarantees (Reg) relatively to UCRL-like algorithms (Reg). This fact highlights an important practical message if an algorithm designer is interested in good performance w.r.t. Reg
. Furthermore, the primal-dual algorithm (section 5.2), which is computationally easier, has worse performance relatively to the optimistic dual algorithm (section 5.1). In light of these observations, we believe an important future venue is to further study the computational-performance tradeoff in safe RL. This would allow algorithm designers better understanding into the types of guarantees that can be obtained when using different types of safe RL algorithms.
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 22–31. PMLR, 2017.
Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In EC, pages 989–1006. ACM, 2014.
Shipra Agrawal and Nikhil R Devanur. Bandits with global convex constraints and objective. Operations Research, 67(5):1486–1502, 2019.
Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
Jean-Yves Audibert, R´emi Munos, and Csaba Szepesv´ari. Tuning bandit algorithms in stochastic en- vironments. In ALT, volume 4754 of Lecture Notes in Computer Science, pages 150–165. Springer, 2007.
Mohammad Gheshlaghi Azar, Ian Osband, and R´emi Munos. Minimax regret bounds for reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 2017.
Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216. IEEE, 2013.
Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. J. ACM, 65(3):13:1–13:55, 2018.
Amir Beck. First-order methods in optimization, volume 25. SIAM, 2017.
Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. Safe model-based rein- forcement learning with stability guarantees. In NIPS, pages 908–918, 2017.
Shalabh Bhatnagar and K Lakshmanan. An online actor–critic algorithm with function approximation for constrained markov decision processes. Journal of Optimization Theory and Applications, 153(3): 688–708, 2012.
Arnab Bhattacharya and Jeffrey P Kharoufeh. Linear programming formulation for non-stationary, finite-horizon markov decision process models. Operations Research Letters, 45(6):570–574, 2017.
Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.
Richard Cheng, G´abor Orosz, Richard M. Murray, and Joel W. Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In AAAI, pages 3387– 3395. AAAI Press, 2019.
Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforce- ment learning with percentile risk criteria. J. Mach. Learn. Res., 18:167:1–167:51, 2017.
Yinlam Chow, Ofir Nachum, Edgar A. Du´e˜nez-Guzm´an, and Mohammad Ghavamzadeh. A lyapunov- based approach to safe reinforcement learning. In NeurIPS, pages 8103–8112, 2018.
Yinlam Chow, Ofir Nachum, Aleksandra Faust, Mohammad Ghavamzadeh, and Edgar A. Du´e˜nez- Guzm´an. Lyapunov-based safe policy optimization for continuous control. CoRR, abs/1901.10031, 2019.
Richard Combes, Chong Jiang, and Rayadurgam Srikant. Bandits with budgets: Regret lower bounds and optimal algorithms. In SIGMETRICS, pages 245–257. ACM, 2015.
Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In NIPS, pages 2818–2826, 2015.
Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. In AAAI. AAAI Press, 2013.
Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. arXiv preprint arXiv:1905.11527, 2019.
Yonathan Efroni, Lior Shani, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. arXiv preprint arXiv:2002.08243, 2020.
Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Improved algo- rithms for conservative exploration in bandits. CoRR, abs/2002.03221, 2020a.
Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conservative exploration in reinforcement learning. CoRR, abs/2002.03218, 2020b.
Javier Garcıa and Fernando Fern´andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial mdps with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
Matthew Joseph, Michael J. Kearns, Jamie H. Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In NIPS, pages 325–333, 2016.
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conservative con- textual linear bandits. In NIPS, pages 3910–3919, 2017.
Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-based model pre- dictive control for safe exploration. In CDC, pages 6059–6066. IEEE, 2018.
Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(Sep):2503–2528, 2012.
Shie Mannor and John N. Tsitsiklis. On the empirical state-action frequencies in markov decision processes under general policies. Math. Oper. Res., 30(3):545–561, 2005.
Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
Angelia Nedi´c and Asuman Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.
Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
Santiago Paternain, Luiz F. O. Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. In NeurIPS, pages 7553–7563, 2019.
Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision pro- cesses. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 5478–5486. PMLR, 2019.
Steven E Shreve and Dimitri P Bertsekas. Alternative theoretical frameworks for finite horizon discrete- time stochastic optimal control. SIAM Journal on control and optimization, 16(6):953–978, 1978.
Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. In ICLR (Poster). OpenReview.net, 2019.
Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of con- strained mdps using gaussian processes. In AAAI, pages 6548–6556. AAAI Press, 2018.
Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesv´ari. Conservative bandits. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1254–1262. JMLR.org, 2016.
Hao Yu, Michael Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.
Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7304–7312. PMLR, 2019.
Liyuan Zheng and Lillian J. Ratliff. Constrained upper confidence reinforcement learning. CoRR, abs/2001.09377, 2020.
Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In NIPS, pages 1583–1591, 2013.
In this section, we establish regret guarantees for OptCMDP (Alg. 1). As a first step, we recall the algorithm and we formally states the confidence intervals. The empirical transition model, cost function and constraint cost functions are defined as in (9). We recall that OptCMDP constructs confidence intervals for the costs and the dynamics of the CMDP. Formally, for any (we define
where
A.1 Failure Events
Define the following failure events.
Furthermore, the following relations hold by standard arguments.
• Let Then Pr
, by Hoeffding’s inequality, and using a union bound argument on all s, a, all possible values of
), all
] and
]. Furthermore, for
[0, 1].
• Let . Using Thm. 4 in [Maurer and Pontil, 2009], for every fixed s, a, h, k and valueof
), we have that
See that for any 2, we use Theorem 4 in [Maurer and Pontil, 2009], and for
{0, 1} the bound holds trivially. This also implies that
since . Applying union bound on all s, a, h, and all possible values of
) and
] and set
we get that Pr
. This analysis was also used in [Jin et al., 2019].
• Let Then, Pr
. The proof is given in [Dann et al., 2017, Cor. E.4].
Remark 3. Boundness of of immediate cost and constraints cost. Notice that we assumed that the random variables [0, 1] and
[0, 1] for any s, a, h.
Lemma 7 (Good event of Pr
When the failure events does not hold we say the algorithm is outside the failure event, or inside the good event G which is the complement ofG.
The fact holds conditioning on the good event implies the following result [e.g., Jin et al., 2019, Lem. 8].
Lemma 8. Conditioned on the basic good event, for all there exists constants
0 for which we have that
A.2 Optimism
Recall that and
such that
and
, with
and
defined in (13).
Lemma 9 (Optimism). Conditioning on the good event, for any there exists a transition model
for which (i)
), and , (ii)
).
Proof. Conditioning on the good event, the true model p is contained in . Furthermore, conditioned on the good event
and
component-wise. Thus, setting
we get
where we used the fact that 0 component-wise.
is Feasible Policy.). Conditioning on the good event,
is a feasible policy for any
], i.e.,
be the set of policies which do not violate the constraint w.r.t. all possible models at episode k. Observe that Πis the set of feasible policies at episode k for OptCMDP.
Conditioning on the good event, by Lemma 9 implies that exists
such that
. Thus,
Since it implies that
From the two lemmas we arrive to the following important corollary
Corollary 11. Conditioning on the good event (i) (
(
), and, (ii)
(
(
).
Proof. The following relations hold.
The second relation holds by Lemma 10 and the forth relation holds by Lemma 9.
A.3 Proof of Theorem 3
In this section, we establish the following regret bounds for OptCMDP (see Alg. 1).
Theorem 3 (Regret Bounds for OptCMDP). Fix (0, 1). With probability at least 1
for any
] the following regret bounds hold
Proof. We start by conditioning on the good event. By Lem. 7 it holds with probability at least 1 We now analyze the regret relatively to the cost c. The following relations hold for any
].
The second and third relations hold by optimism, i.e., Cor. 11. The forth relation holds by Lem. 29. See that assumptions 1,2,3 of Lem. 29 are satisfied conditioning on the good event.
We now turn to prove the regret bound on the constraint violation. For any ] and
] the following relations hold.
The first relation holds since (
as the optimization problem solved in every episode is feasible (see Lem. 10). Furthermore, by optimism
(
(
) (see the first relation of Lem. 9). The third relation holds by applying Lem. 29. See that assumptions (a), (b) and (c) of Lem. 29 are satisfied conditioning on the good event (see also Lem. 8).
In this section, we establish regret guarantees for OptCMDP-bonus (see Alg. 2). The main advantage of this algorithm w.r.t. OptCMDP is the computational complexity. While OptCMDP requires to solve an extended CMDP through an LP with ) constraints and decision variable, OptCMDP-bonus requires to find the solution of a single CMDP by solving an LP with O(SAH) constraints and variables.
At each episode k, OptCMDP-bonus builds an optimistic CMDP ) where
while
where and
are defined as in (20). The policy by which we act at episode k is given by solving the following optimization problem
where and
is defined as in (15). Solving this problem can be done by solving an LP, much similar to the LP by which a CMDP is solved (Section 2.3).
Before supplying the proof of Theorem 4 we formally defining the set of good events which we show holds with high probability. Conditioning on the good, we establish the optimism of OptCMDP-bonus and then regret bounds for OptCMDP-bonus.
B.1 Failure Events
We define the same set of good events as for OptCMDP (App. A.1). We restate this set here for convenience.
As in App. A.1 the union of these events hold with probability greater than 1 .
Lemma 12 (Good event of Pr
When the failure events does not hold we say the algorithm is outside the failure event, or inside the good event G which is the complement ofG.
Lemma 13. Conditioned on the basic good event, for all there exists constants
0 for which we have that
B.2 Optimism
Lemma 14 (Per-State Optimism.). Conditioning on the good event, for any ] it holds that
Proof. For any s, a, h, k, conditioning on the good event,
where the forth relation holds conditioning on the good event, and the fifth relation by the choice of the bonus ). Combining (23) and (24) we get that
Repeating this analysis while replacing with
we conclude the proof of the lemma.
Lemma 15 (Optimism). Conditioning on the good event, for any it holds that (i)
(
(s; c, p), and, (ii)
(
(
).
Proof. For any ] we have that
where we used the value difference lemma (see Lem. 35). Applying the first statement of Lem. 14 which hold for any s, a, h, k (conditioning on the good event) we conclude the proof of the first claim.
The second claim follows by the same analysis on the difference (
(
), i.e., using the value difference lemma and the second claim in Lem. 14.
The following lemma shows that the problem solved by OptCMDP-bonus is always feasible. This lemma follows the same idea used to prove the feasibility for OptCMDP (see Lem. 10).
is Feasible Policy.). Conditioning on the good event,
is a feasible policy for any
], i.e.,
be the set of policies which do not violate the constraint w.r.t. all possible models at the episode. Conditioning on the good event, by Lem. 15
implies that
. Thus,
Since it implies that
From the two lemmas we arrive to the following corollary as
Proof. The following relations hold.
The second relation holds by Lem. 16 and the forth relation holds by Lem. 15.
B.3 Proof of Theorem 4
In this section, we establish the following regret bounds for OptCMDP-bonus algorithm.
Theorem 4 (Regret Bounds for OptCMDP-bonus). Fix (0, 1). With probability at least 1
for any
] the following regret bounds hold
Unlike the proof of the OptCMDP-bonus algorithm (Thm. 3), the value function is not constraint to be within [0, H] . However, since the bonus is bounded, the estimated value function is bounded in the range of []. Although this discrepency, in the following we are able to reach similar dependence in
. The fact the estimated value is bounded in OptCMDP-bonus differently then in OptCMDP results in worse constant term as Thm. 4 exhibits (see Remark 2).
Proof. We start by conditioning on the good event. By Lem. 7, it holds with probability at least 1 We now analyze the regret relatively to the cost c. The following relations hold for any
]:
The second and third relations hold by optimism, see Cor. 17. The forth relation holds by Lem. 31. See that assumptions 1,2,3 of Lem. 31 are satisfied conditioning on the good event. Assumption 4 of Lem. 31 holds by the optimism of the value estimate (see Lem. 15). Assumption 5 of Lem. 31 holds by Lem. 14.
We now turn to prove the regret bound on the constraint violation. For any ] and
] the following relations hold.
The first relation holds since (
as the optimization problem solved in every episode is feasible, see Lem. 16. Furthermore, by optimism
(
(
) (see the first relation of Lem. 15). The third relation holds by applying Lem. 31. See that assumptions 1,2,3 of Lem. 31 are satisfied conditioning on the good event (see also Lem. 13).
In this section, we establish regret guarantees for OptDual-CMDP by proving Theorem 5. Unlike both previous sections, OptDual-CMDP does not require an LP solver, but repeatedly solves MDPs with uncertainty in their transition model.
Before supplying the proof of Theorem 5 we formally define the set of good events which we show holds with high probability. Conditioning on the good, we establish the optimism of OptDual-CMDP and then regret bounds for OptDual-CMDP. The regret bound of OptDual-CMDP relies on results from constraint convex optimization with some minor adaptations which we establish in Appendix G.
C.1 Definitions
We introduce a notation that will be used across the proves of this section. Following this notation allows us to apply generic results from convex optimization to the problem.
• The optimistic and true constraints valuation are denoted by
• The optimistic value, true value, and optimal value are denoted by
C.2 Failure Events
We define the same set of good events as for OptDual-CMDP (Appendix A.1). We restate this set here for convenience.
As in Appendix A.1 the union of these events hold with probability greater than 1 .
Lemma 18 (Good event of Pr
When the failure events does not hold we say the algorithm is outside the failure event, or inside the good event G which is the complement ofG.
Lemma 19. Conditioned on the basic good event, for all there exists constants
0 for which we have that
C.3 Proof of Theorem 5
In this section, we establish the following regret bound for OptDual-CMDP.
Theorem 5 (Regret Bounds for OptDual-CMDP). For any ] the regrets the following bounds hold
We start by proving several useful lemmas on which the proof is based upon.
Proof. We have that
The first relation holds since satisfies the constraint (Assumption 1) which implies that (
0, and that
0 by the update rule. The second relation holds since conditioning on the good event the true model is contained in
as well as
.
Lemma 21 (Update Rule Recursion Bound)]
Summing this relation for ] and multiplying both sides by
2 we get
Rearranging we get,
for any .
We are now ready to establish Theorem 5.
Proof. Plugging Lemma 20 into Lemma 21 we get
We now bound each term in (26). Notice that ;
] (where
); it is a value function defined on an MDP with immediate cost in [
] and
[0, H]. Thus, we have that
Applying Lemma 29 (see that assumptions (a), (b) and (c) hold conditioning on the good event), we get that
Plugging these bounds back into (26) and setting
for any .
where , and
is given in Assumption 2. See that
and that, by the definition,
Setting
Since the bound holds for any ] we get that
Now, by the convexity of the state-action frequency (see Proposition 1) function there exists a policy which satisfies
(
) for any
. Since both f and g are linear in
) we have that
Applying Corollary 44 and Theorem 42 we conclude that
for any ].
Remark 4 (Convexity of the RL Objective Function). Although it is common to refer to the objective function in RL as non-convex, in the state action visitation polytope the objective is linear and, hence, convex (however, the problem is constraint to the state action visitation polytope). Thus, we can use Theorem 42 and Cor. 44 which are valid for constraint convex problems.
In this section we establish regret guarantees for OptPrimalDual-CMDP by proving Theorem 6. Unlike for OptDual-CMDP, OptPrimalDual-CMDP requires an access to a (truncated) policy estimation algorithm which returns ), i.e., the Q-function w.r.t. to the empirical transition model and optimistic cost and constraint cost. This reduces the computational complexity of OptPrimalDual-CMDP. However, it results in worse performance guarantees relatively to OptDual-CMDP.
Before supplying the proof of Theorem 6 we formally define the set of good events which we show holds with high probability. Conditioning on the good, we establish the optimism of OptPrimalDual-CMDP and then regret bounds for OptPrimalDual-CMDP. The regret bounds of OptPrimalDual-CMDP relies on results from constraint convex optimization with some minor adaptations which we establish in Appendix G.
D.1 Failure Events
We define the same set of good events as for UCRL-OptCMDP (Appendix A.1). We restate this set here for convenience.
As in Appendix A.1 the union of these events hold with probability greater than 1 .
Lemma 22 (Good event of Pr
When the failure events does not hold we say the algorithm is outside the failure event, or inside the good event G which is the complement ofG.
Lemma 23. Conditioned on the basic good event, for all there exists constants
0 for which we have that
D.2 Optimality and Optimism
Proof. By definition,
Let
Applying the extended value difference lemma 34 we get that
To bound (i), we apply Lemma 26 while setting .
To bound (ii), observe that by Lemma 25 for all s, a, h, k it holds that
This implies that
since (ii) is an expectation over negative terms. Combining (29) and (30) we conclude that
Lemma 25 (Policy Estimation Optimism). Conditioning on the good event, for any s, a, h, k the following bound holds
See that (
(
) are defined in the update rule of OptPrimalDual-CMDP (Algorithm 4).
Proof. For all s, a, h, k the following relations hold.
where (
(
Furthermore, see that
since maxmax{0, a} + max{0, b}. Similarly, for any
],
Plugging (34) and (35) into (33) we get
We now show each of these terms is negative conditioning on the good event.
Furthermore, observe that
The second relation holds since
[0, H] by the update rule (OptPrimalDual-CMDP uses truncated policy evaluation, see Algorithm 5). The third relation holds conditioning on the good event. The forth relation holds by the choice of
. Applying (40) we get that
) +
Similarly, we get that each term in the sums at (38),(39) is non-positive. Since 0 we conclude that both (38)
0 and (39)
0. Thus, we establish that
Proof. This term accounts for the optimization error, bounded by the OMD analysis. By standard analysis of OMD [Orabona, 2019] with the KL divergence used as the Bregman distance (see Lemma 40) we have that for any s, h and for policy any ,
where is a fixed step size.
By the form of (31) we get that
0 since it is a sum of positive terms (policy evaluation is done with truncated policy evaluation, see Algorithm 4). Furthermore, we upper bound
for any s, a, h, k as follows,
The second relation holds by the fact that (
(
by the update rule (both
1, thus, an expectation over an H such terms is smaller than H) and the fact
0 (by the update rule).
Plugging this bound into (41) we get that for any s, a, h
Thus, the following relations hold.
See that the first relation holds as the expectation does not depend on k. Thus, by linearity of expectation, we can switch the order of summation and expectation. The second relation holds since (42) holds for any s.
Finally, by choosing 2 log
(1 +
), we obtain
D.3 Proof of Theorem 6
In this section, we establish the following regret bound for OptPrimalDual-CMDP.
Theorem 6 (Regret Bounds for OptPrimalDual-CMDP). For any ] the regrets the following bounds hold
We start by proving several useful lemmas on which the proof is based upon.
Proof. We have that
The first relation holds since satisfies the constraint (Assumption 1) which implies that (
0, and that
0 by the update rule.
We now state a lemma which corresponds to Lemma 21 from previous section.
Lemma 28 (Update Rule Recursion Bound Primal-Dual): 0
]
Proof. Similar proof to Lemma 21 while using the fact that projection to the set: 0
is non-expansive operator as the operator [
.
We are now ready to establish Theorem 6.
Proof. Applying Lemma 27 into Lemma 28 we get
Applying Lemma 30 (see that assumptions (1),(2),(3) hold conditioning on the good event), we get that
Lastly, by Lemma 24,
Plugging these bounds back into (44) and setting
for any 0 .
where , and
is given in Assumption 2 See that 0
. Furthermore, it holds that
Set
where we applied (46) in the second relation. Since the bound (47) holds for any i we get that
Now, by the convexity of the state-action frequency function (Proposition 1) there exists a policy which satisfies
(
) for any
. Since both f and g are linear in
) we have that
Applying Corollary 44 and Theorem 42 we conclude that
for any ].
Lemma 29 (On Policy Errors for Optimistic Model). Let ) be a a cost function, and its optimistic cost. Let p be the true transition dynamics of the MDP and
be an estimated transition dynamics. Let
(
(
) be the value of a policy
according to the cost and transition model l, p and
, respectively. Assume the following holds for all
]:
Proof. The following relations hold.
where the first relation holds by the value difference Lem. 35. We now bound the terms (i) and (ii).
Bound on (i). To bound (i) we use the assumption (1) and get,
The first relation holds by assumption (a). The second relation holds since is the policy by which the agent acts at episode k in the true MDP. The third relation holds by Lem. 36.
Bound on (ii). To bound (ii) use the fact that
for every s since the immediate cost is bounded in ) +
) component-
wise up to constants, since the second term is bounded by (1). Thus,
The first relation holds by plugging the bound (48) and assumption (b) into (ii). The second relation holds by Jensen’s inequality. The third relation holds since p is a probability distribution. The forth relation holds since is the policy with which the agent interacts with the true CMDP. The fifth relation holds by Lem. 36 (its assumption holds by assumption (c)).
Combining the bounds on (i) and (ii) we conclude the proof.
Lemma 30 (On Policy Errors for Truncated Policy Estimation). Let ) be a a cost function, and its optimistic cost. Let p be the true transition dynamics of the MDP and
be an estimated transition dynamics. Let
(s; l, p) be the value of a policy
according to the cost and transition model l, p. Furthermore, let
(
) be a value function calculated by a truncated value estimation (see Algorithm 5) by the cost and transition model
. Assume the following holds for all
]:
(s, a)
(s, a)
.
(s
s, a)
(s
s, a)
Proof. The following relations hold.
Observe that
where the first relation holds by the extended value difference lemma 34. Plugging back to (50) we get
We now bound the terms (i) and (ii).
Bound on (i). To bound (i) we use the assumption (1) and get,
The first relation holds by assumption (1). The second relation holds since is the policy by which the agent acts at the
episode at the true MDP. The third relation holds by Lemma 36.
Bound on (ii). To bound (ii) use the fact that
for every s since the immediate cost is bounded in 1 +
)
component-wise up to constants, since the second term is bounded by (1). Thus,
The first relation holds by plugging the bound (51) and assumption (2) into (ii). The second relation holds by Jensen’s inequality. The third relation holds since p is a probability distribution. The third relation holds since is the policy with which the agent interacts with the true MDP p. The fifth relation holds by Lemma 36 (its assumption holds by assumption (3)).
Combining the bounds on (i) and (ii) we conclude the proof.
Lemma 31 (On Policy Errors for Bonus Based Optimism). Let ) be a cost function, and its optimistic cost. Let p be the true transition dynamics of the MDP and
be an estimated transition dynamics. Let
(
(
) be the value of a policy
according to the cost and transition model l, p and
, respectively. Assume the following holds for all
]:
(s, a)
q
(s, a; p)
H ln
.
4. V (s;
,
Proof. Denote for any (
(
hold:
where the first relation holds by the value difference lemma (see Lem. 35).
Bound on (i) and (ii). Since 0 (the value of the true MDP is bounded in [0, H]), we can bound both (i) and (ii) by the same analysis as in Lem. 29. Thus,
Bound on (iii). Applying Lem. 32 we obtain the following bound
Plugging the bounds on terms (i), (ii), and (iii) into (52) we get
Applying Lem. 38, by which , we get
Proof. Denote for any (
(
) and
(
(s; l, p). The following relations hold:
since ] by the assumption on its instantaneous cost (assumption (1) of Lem. 31). Note that
[0, H] as usual. Term (ii) is bounded as follows
The first relation holds by Jensen’s inequality while using the fact that ) has at most N non-zero terms. The second relation holds by Cauchy-Schwartz inequality. The third relation follows from properties of the occupancy measure (see Eq. 8). In particular,
). The forth relation holds by applying Lem. 37 and bounding (
)) due to (54) and
0 due to optimism (assumption (4) of Lem. 31). The fifth relation holds by Lemma 33 (see that its assumption holds by assumption (5)). The sixth relation holds by
.
Plugging the bounds on term (i), (55), and term (ii), (56), into (53) we get
Denoting (
, where
Lemma 33. Let ) be a cost function and its optimistic cost. Let p,p be two transition probabilities. Let
(
(s; l, p) and
(
(
) be the value of a policy
according to the cost and transition model l, p and
, respectively. Assume that
Proof. By definition
where the first relation holds by the value difference lemma 35 and the last relation holds due to the assumption 57. Iterating on this relation we get that for any
By summing this relation for and rearranging we get
Thus,
We start stating the value difference lemma (a.k.a. simulation lemma). This lemma has been used in several papers [e.g., Cai et al., 2019, Efroni et al., 2020]. The following lemma is central for the analysis of OptPrimalDual-CMDP.
Lemma 34 (Extended Value Difference). Let be two policies, and
) and
) be two MDPs. Let
) be an approximation of the Q-function of policy
on the MDP M for all h, s, a, and let
(
. Then,
The following lemma is standard [see e.g., Dann et al., 2017, Lem. E.15], and can be seen as a corollary of the extended value difference lemma.
Lemma 35 (Value difference lemma). Consider two MDPs ) and
(
). For any policy
and any s, h the following relation holds.
The following lemmas are standard. There proof can be found in [Dann et al., 2017, Zanette and Brunskill, 2019, Efroni et al., 2019] (e.g., Efroni et al. 2019, Lem. 38).
Lemma 37 (e.g., Zanette and Brunskill [2019], Lem. 13)]
Lemma 38 (Consequences of Self Bounding Property)
Proof. We have that
Since 0 this implies that
where in the second relation we used the relation (+ 2
.
F.1 Online Mirror Descent
In each iteration of Online Mirror Descent (OMD) the following problem is solved:
where is a stepsize, and
) is the bregman distance. When choosing
) as the KL-divergence, and the set C is the unit simplex OMD has the following closed form,
The following lemma [Orabona, 2019, Theorem 10.4] provides a fundamental inequality which will be used in our analysis.
Lemma 39 (Fundamental inequality of Online Mirror Descent). Assume 0 for k = 1, ..., K and i = 1, ..., d. Let
. Using OMD with the KL-divergence, learning rate
, and with uniform initialization,
], the following holds for any
,
In our analysis, we will be solving the OMD problem for each time-step h and state s separately,
Therefore, by adapting the above lemma to our notation, we get the following lemma,
Lemma 40 (Fundamental inequality of Online Mirror Descent for RL). Let 0. Let
) be the uniform distribution for any
] and
. Assume that
[0, M] for all s, a, h, k. Then, by solving (60) separately for any
] and
, the following holds for any stationary policy
,
Proof. First, observe that for any k, h, s, we solve the optimization problem defined in (60) which is the same as (59). By the fact that the estimators used in our analysis are non-negative, we can apply Lemma 39 separately for each h, s with ) and
). Lastly, bounding (
and
concludes the result.
In this section we enumerate several results from constraint convex optimization which are central to establish the bounds for the dual algorithms. To keep the generality of discussion, we follow results from Beck [2017], Chapter 3, and consider a general constraint convex optimization problem
where and
) are convex real valued functions,
. By defining the vector of constraints
Furthermore, we define the dual problem to (61). The dual function is
Where dom(. Furthermore, denote an optimal solution of (62) by
.
We make the following assumption which will be verified to hold. The assumption implies strong duality, i.e., .
x such that g(x) < 0 and exists a point ri(X) satifying
, where ri(X) is the relative interior of X.
The following theorem is proved in Beck 2017.
Theorem 41 (Beck 2017, Theorem 3.59.)(62) iff
Using this result we arrive to the following theorem, which is a variant of Beck 2017, Theorem 3.60.
Theorem 42. Let be an optimal solution of the dual problem (62) and assume that 2
. Let
satisfy
and
Since () is an optimal solution of the dual problem it follows by Theorem 41 that (
). Therefore, for any (
dom(v)
Set . See that
Thus, (63) implies that
We obtain the following relations.
where the last relation holds by (64). Rearranging, we get
by using the assumption 2.
Lastly, we have the following useful result by which we can bound the optimal dual parameter by the properties of a slater point. This result is an adjustment of Beck 2017, Theorem 8.42.
By definition, for any we have that
Therefore,
which implies that for any (
From this theorem we get the following corollary.
be a point satisfying g(x) < 0 and Ax + b = 0, and
be an optimal dual solution. Then,
Proof. Since (be an optimal solution of the dual problem (62).