Reinforcement Learning (RL) studies how an agent learns to maximize its expected total reward by interacting with an unknown environment over time [54]. Safe RL (SRL) involves extra restrictions arising from real-world problems [30, 6, 25]. Examples include collision-avoidance in self-driving cars [29], switching cost limitations in medical applications [10], and legal and business restrictions in financial management [2]. A standard environment model for SRL is the Constrained Markov Decision Process (CMDP) [5] that extends the classical MDP by adding an extra safety-related utility, and translates the safety demand into a constraint on the expected total utility [3]. The presence of safety constraints makes the celebrated exploration-exploitation trade-off more challenging.
Many existing SRL algorithms for CMDPs are policy-based, and primal-dual type constrained optimization methods are commonly used to deal with policy constraints, e.g., Constrained Policy Gradient (CPG) [57], Lagrangian-based Actor-Critic (AC) [14, 13, 19, 55, 39], Primal-Dual Policy Optimization (PDPO) [46, 45, 53], Constrained Policy Optimization (CPO) [3, 65], and Reward Constrained Policy Optimization (RCPO) [55]. These SRL algorithms either do not have a theoretical guarantee or can only be shown to converge asymptotically in the batch offline setting. It is imperative to study theoretical guarantees for SRL algorithms regarding computational/statistical efficiency.
In this work, we look at a more challenging problem of finding a sequence of policies in response to streaming samples of reward functions, constraint functions, and transition. We seek to provide theoretical guarantees on the convergence rate of approaching the best-fixed policy in hindsight as well as feasibility region formed by constraints. Unlike scenarios where the safety constraint is known a priori [56, 12, 22, 60, 20, 21, 59], we have to explore the unknown environment and adapt the policy to the safety constraint with the minimal violation [6, 30]. We refer this task as safe exploration. Recent policy-based SRL algorithms for CMDPs, e.g., CPO [3, 65] and PDPO [46, 45], seek a single safe policy via the constrained policy optimization whose sample efficiency guarantees over streaming or time-varying data are largely unknown. In this paper, we aim to answer a theoretical question:
Can we design a provably sample efficient online policy optimization algorithm for CMDP?
Contribution. We propose a provably efficient SRL algorithm for CMDP with an unknown transition model in the linear episodic setting – the Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm – where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. Theoretically, we prove that the proposed algorithm achieves an ) regret and the same
) constraint violation, where d is the dimension of the feature mapping, H is the horizon of each episode, and T is the total number of steps. We establish these bounds in the setting where the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state space only through the dimension of the feature mapping and thus hold even when the number of states goes to infinity. To the best of our knowledge, our result is the first provably efficient online policy optimization for CMDP with safe exploration in the function approximation setting.
Related Work. Our work is related to a line of provably efficient RL algorithms based on linear function approximation where the exploration is achieved by adding an Upper Confidence Bound (UCB) bonus [63, 64, 32]. A closely-related recent work is Proximal Policy Optimization (PPO) [51]. As is shown in [40, 18], PPO converges to the optimal policy sublinearly and an optimistic variant of PPO is sample efficient with UCB exploration in the linear setting. However, such results only hold for unconstrained RL problems. Our work seeks to extend an optimistic variant of PPO for CMDP with UCB exploration and establish theoretical efficiency guarantees. For the large CMDP with unknown transition models, there is a line of works that relates to the policy optimization under constraints, e.g., CPG [57], CPO [3, 65], RCPO [55], and IPPO [41]. However, their theoretical guarantees are still lacking. By contrast, our work is supported by theoretical efficiency guarantees.
Recent results on learning CMDPs with unknown transitions and rewards are closely related. As we know, most of them are model-based and only apply for finite state-action spaces. The works [52, 26] independently leverage upper confidence bound (UCB) on fixed reward, utility, and transition probability to propose sample-efficient algorithms for tabular CMDPs; [52] obtains an ) regret and constraint violation via linear program in the average cost setup in time T; [26] obtains an
) regret and constraint violation in the episodic setting via linear program and primal-dual policy optimization, where S is a state space and A is an action space. In [48], the authors study an adversarial stochastic shortest path problem under constraints and unknown transitions with
) regret and constraint violation. The work [9] extends Q-learning with optimism for finite state-action CMDPs with peak constraints. The work [17] proposes UCB-based convex planning for episodic tabular CMDPs in dealing with convex or hard constraints. In contrast, our proposed algorithm can potentially be applied to infinite state space scenarios with sublinear regret and constraint violation bounds only depending on the implicit dimension as opposed to the true dimension of the state space.
We consider an episodic Markov decision process (MDP) – MDP(S, A, H, P, r) – where S is a state space, A is an action space, H is a fixed length of each episode, is a collection of transition probability measures, and
is a collection of reward functions. We assume that S is a measurable space with possibly infinite number of elements. Moreover, for each step
) is a transition kernel over next state if action a is taken for state
1] is a reward function. The constrained MDP – CMDP(S, A, H, P, r, g) – also contains utility functions
1] is the utility function. We assume that reward/utility are deterministic. Our analysis readily generalizes to the setting where reward and utility are random.
Let the policy space ∆(where ∆(A) denotes a probability simplex over the action space. Let
) be a policy taken by the agent at episode
is the action that the agent takes at state
. For simplicity, we assume the initial state
to be fixed as
in different episodes; it readily generalizes to be random according to some distribution. The agent interacts with the environment in the kth episode as follows. At the beginning, the agent determines a policy
. Then, at each step
], the agent observes the state
, determines an action
following the policy
), and receives a reward
) together with an utility
Meanwhile, the MDP evolves into next state
drawing from the probability measure
). The episode terminates at state
; when this happens, no control action is taken and both reward and utility functions are equal to zero. In this paper, we focus a challenging bandit setting where the agent only observes the values of reward/utility functions,
), at visited state-action pair (
); see examples in [49, 27, 45]. We assume that reward/utility functions are fixed over episodes.
Given a policy ), the value function
associated with the reward function r
at each step h are the expected values of total rewards,
for all ], where the expectation
is taken over the random state-action sequence
; the action
follows the policy
) at the state
and the next state
follows the transition dynamics
). Thus, the action-value function
associated with the reward function r is the expected value of total rewards when the agent starts from state-action pair (x, a) at step h and follows policy
for all (]. Similarly, we define the value function
the action-value function
associated with the utility function g. Denote symbol
. For brevity, we take the shorthand
The Bellman equations associated with a policy
are given by
where , for all (
, the inner product of a function
Learning Performance. The learning agent aims to find a solution of a constrained problem in which the objective function is the expected total rewards and the constraint is on the expected
where we take ] to avoid triality. It is readily generalized to the problem with multiple constraints. Let
) be a solution to problem (2). Since the policy
is computed from knowing all reward and utility functions, it is commonly referred to as an optimal policy in-hindsight.
The associated Lagrangian of problem (2) is given by is the primal variable and
0 is the dual variable. We can cast problem (2) into a saddle-point problem: maximize
), which is convex in Y and is non-concave in
due to the non-concavity of value functions in policy space [4]. To address the non-concavity, we will exploit constrained RL problem structure to propose a new variant of Lagrange multipliers method for constrained RL problems in Section 3, which warrants a new line of primal-dual mirror descent type analysis in sequel. It is different from unconstrained RL analysis, e.g., [4, 18].
Another key feature of constrained RL is the safe exploration under constraints [30]. Without any constraint information a priori, it is infeasible for each policy to satisfy the constraint since utility information in constraints is only revealed after a policy is decided. Instead, we allow each policy to violate the constraint in each episode and minimize regret while minimizing total constraint violations for safe exploration over K episodes; aslo see [26]. We define the regret as the difference between the total reward value of policy in hindsight and that of the agent’s policy
over K episodes, and the constraint violation as a difference between the offset Kb and the total utility value of the agent’s policy
In this paper, we design algorithms, taking bandit feedback of the reward/utility functions, with both regret and constraint violation being sublinear in the total number of steps T := HK. Put differently, the algorithm should ensure that given ), then both Regret(K) =
) and Violation(
) hold with high probability.
To achieve sublinear regret and constraint violation, we assume standard Slater condition for problem (2) and recall strong duality; see the proof in [46, 45], and boundedness of the optimal dual variable. The dual function is D(Y ) := maximize) and the optimal dual variable
Assumption 1 (Slater Condition)
Lemma 1 (Strong Duality, Boundedness of
The Slater condition is mild in practice and commonly adopted in previous works; e.g., [46, 26, 48, 23]. The implied properties for problem (2) by Slater condition will be useful in our algorithm design and analysis.
Linear Function Approximation. We focus on CMDPs with linear transition kernels in feature maps.
Assumption 2. The CMDP(S, A, H, P, r, g) is a linear MDP with a kernal feature map and a value feature map
, if for any
, there exists a vector
such that for any (
where max(. Moreover, we assume that for any function
Assumption 2 adapts the definition of linear kernel MDP [7, 68, 18] for CMDPs. Linear kernel MDP examples include tabular MDPs [68], feature embedded transition models [64], and linear combinations of base models [43]. We can construct related examples of CMDPs with linear structure by adding adding proper constraints. For usefulness of linear structure, see discussions in [24, 58, 36]. For more general transition dynamics, see factored linear model [47].
Although our definition in Assumption 2 and linear MDPs [63, 32] all include tabular MDPs as special cases, they define transition dynamics using different feature maps. They are not comparable since one cannot be implied by the other [68]. The design of provably efficient RL algorithms beyond these linear type CMDPs remains open.
In Algorithm 1, we present a new variant of Proximal Policy Optimization [51] – an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm. We effectuate the optimism through the Upper-Confidence Bounds (UCB) and address the constraints using a new variant of Lagrange multipliers method combined with the performance difference lemma; see Lemma 10 in Appendix F or works [50, 51, 18]. In each episode, our algorithm consists of three main stages. The first stage (lines 4–8) is Policy Improvement: we receive a new policy by improving previous
via a mirror descent type optimization; The second stage (line 9) is Dual Update: we update dual variable
based on the constraint violation induced by previous policy
third stage (line 10) is Policy Evaluation: we optimistically evaluate newly obtained policy via the least-squares policy evaluation with an additional UCB bonus term for exploration.
Policy Improvement. In the k-th episode, a natural attempt of obtaining a policy is to solve
a Lagrangian-based policy optimization problem,
where ) is the Lagrangian and the dual variable
0 is from the last episode. We will show later that
can be updated efficiently. This type update is used by several works, e.g. [38, 46, 45, 55]. The issue with them is that it relies on an oracle solver involving either Q-learning [28], PPO [51], or TRPO [50] to deliver a near-optimal policy. Thus, the overall algorithmic complexity is expensive and it is not suitable for the online use. In contrast, this work utilizes RL problem structure and show that only an easily-computable proximal step is sufficient by establishing efficiency guarantees without using any oracle solvers.
Denote symbol . Via the performance difference lemma, we can expand value functions
) at the previously known policy
where is taken over the random state-action sequence
. Thus, we introduce an approximation of
) for any state-action sequence
where can be estimated from an optimistic policy evaluation that will be discussed later. With this notion, in each episode, instead of solving a Lagrangian-based policy optimization, we perform a simple policy update in online mirror descent fashion,
where ) is a mixed policy of the previous one and the uniform distribution Unif(
1]. The constant
0 is a trade-off parameter,
) is the KL divergence between
is absolutely continuous in
The policy mixing step ensures such absolute continuity and implies uniformly bounded KL divergence (see Lemma 15 in Appendix F). Ignoring other
-irrelevant terms, we update
in terms of previous policy
Since the above update is separable over states , we can update the policy
in Algorithm 1, a closed solution for any step
above update reduces to one step in an optimistic variant of PPO [18]. The idea of KL-divergence regularization in policy optimization has been widely used in many unconstrained scenarios, e.g., NPG [33], PPO [51], TRPO [50], and their neural variants [61, 40]. Our method is distinct from them in that it is based on the performance difference lemma and the optimistically estimated value functions.
Dual Update. To infer the constraint violation for the dual update, we esimate ) via an optimistic policy evaluation that will be discussed later. We update the Lagrange multiplier
0 by moving
to the direction of minimizing the Lagrangian
line 9 in Algorithm 1, where
0 is a stepsize and Proj
is a projection onto [0
upper bound
, we choose
so that projection interval [ 0
includes the optimal dual variable
; also see works [26, 44].
The dual update works as a trade-off between the reward maximization and the constraint violation reduction. If the current policy satisfies the approximated constraint, i.e.,
we put less weight on the action-value function associated with the utility and maximize the reward; otherwise, we sacrifice the reward a bit to satisfy the constraint. The dual update has a similar use in dealing with constraints in CMDP, e.g., Lagrangian-based AC [19, 39], and online constrained optimization problems [66, 62, 67]. In contrast, we handle the dual update via the optimistic policy evaluation that yields an arguably efficient approximation of constraint violation.
Policy Evaluation. The last stage of the kth episode takes the Least-Squares Temporal Difference (LSTD) [16, 15, 37, 35] to evaluate the policy based on previous
1 historical trajectories. For each step
], instead of
in the Bellman equations (1), we estimate
(
is updated by the minimizer of the regularized least-squares problem over
where and
0 is the regularization parameter. Similarly, we estimate
(
. We display the least-squares solution in line 4–6 of Algorithm 2 where symbol
r or g. We also estimate
is updated by the minimizer of another regularized least-squares problem,
where 0 is the regularization parameter. Similarly, we estimate
. The least- squares solutions lead to line 8–9 of Algorithm 2.
value function iteratively in line 11 of Algorithm 2, where
is an estimate of
(
is an estimate of
; we add UCB bonus terms Γ
so that
all become upper confidence bounds for exploration. We take Γand leave the parameter
0 to be tuned later. Moreover, the bounded reward/utility
1] implies
We remark the computational efficiency of Algorithm 1. For the time complexity, since line 6 is a scalar update, they need O(d|A|T) time. A dominating calculation is from lines 5/9 in Algorithm 2. If we use the Sherman–Morrison formula for computing (ΛAnother important calculation is the integration from line 6 in Algorithm 2. We can either compute it analytically if it is tractable or approximate it via the Monte Carlo integration efficiently [68] that assumes O(dT) time. Therefore, the time complexity is
) in total. For the space complexity, we don’t need to store policy since it is recursively calculated via (4). By updating
Λ
) recursively, it takes
We now prove that the regret and the constraint violation for Algorithm 1 are sublinear in T := KH; T is the total number of steps taken by the algorithm; K is the total number of episodes; H is the episode horizon. We recall that |A| is the cardinality of action space A and d is the dimension of the feature map.
Theorem 1 (Linear Kernal MDP: Regret and Constraint Violation). Let Assumptions 1 and 2 hold. Fix and
in Algorithm 1, where
is an absolute constant. Suppose log
The above result establishes that OPDOP enjoys an ) regret and an
constraint violation if we set algorithm parameters
properly. Our results have the optimal dependence on the total number of steps T up to some logarithmic factors. The d dependence occurs due to the uniform concentration for controlling the fluctuations in the least-squares policy evaluation. This matches the existing bounds in the linear MDP setting without any constraints [18, 7, 68]. Our bounds differ from them only by an extra H, which is a price introduced by the uniform bound on the constraint violation. It is noticed that we don’t observe reward/utility functions, our algorithm works for bandit feedback after each episode.
We remark the tabular setting for Algorithm 1; see Appendix C for details. The tabular CMDP is a special case of Assumption 2 by taking canonical bases as feature mappings; see them in Section 5. The feature map has dimension and thus Theorem 1 automatically provides
) regret and constraint violation for the tabular CMDPs. The
dependence relies on the least-squares policy evaluation and it can be improved via other optimistic policy evolution methods if we are limited to the tabular case. We provide such results in Section 5.
4.1 Proof Outline of Theorem 1
We sketch the proof for Theorem 1. We state key lemmas and delay their full versions and complete proofs to Appendix B. In what follows, we fix 1) and use the shorthand w.p. for with probability.
Regret Analysis. We take a regret decomposition,
where (R.I) = is an optimal policy in hindsight, and (R.II) =
) is estimated via our optimistic policy evaluation given by Algorithm 2. As we use
to estimate
, it leads a model prediction error in the Bellman equations,
; similarly define
. In Appendix D.3, the UCB optimism of
, shows that or any (
1
By assumptions of Theorem 1, the policy improvement (4) yields Lemma 1, depicting total differences of estimates ) to the optimal ones.
Lemma 1 (Policy Improvement: Primal-Dual Mirror Descent Step). Let assumptions of Theorem 1
Lemma 1 displays a coupling between the regret and the constraint violation. This coupling also finds use in online convex optimization [42, 67, 62, 34] and CMDP problems [26]. The proof of Lemma 1 takes a primal-dual mirror descent type analysis of line 6 of Algorithm 1 by utilizing the performance difference lemma. Our primal-dual mirror descent analysis enable standard ones in [51, 40, 18] to deal with constraints.
Via the dual update (5), we show the second total differences scales
). Together with a decomposition of (R.II), (R.II) =
is a martingale, we have Lemma 2.
Lemma 2. Let assumptions of Theorem 1 hold. Then, Regret(
Finally, we note that is a martingale that scales as
) via the Azuma-Hoeffding inequality. For the model prediction error, we use the UCB optimism and apply the elliptical potential lemma.
Lemma 3. Let assumptions of Theorem 1 hold. Then,
Applying probability bounds from Lemmas 3 and 4 to Lemma 2 yields our regret bound.
Constraint Violation Analysis. We take a violation decomposition,
where (V.II) = . We still begin with the policy improvement (4) to have Lemma 5, a refined result from Lemma 1.
Lemma 5 (Policy Improvement: Refined Primal-Dual Mirror Descent Step). Let assumptions of
Lemma 5 separates the dual update in the second total differences in Lemma 1. We prove Lemma 5 by combining Lemma 1 with the UCB optimism and a change of variable of
dual update (5).
martingale. By adding (V.II) to the inequality in Lemma 5 with multiplier 0, and also adding (R.II) to it, then taking
0; otherwise
where ) for a policy
bound Γ
as done in Lemmas 3 and 4.
Last, by strong duality, applying the constraint violation bound from Lemma 9 in Appendix E leads to)). This gives our desired violation bound.
The tabular CMDP(S, A, H, P, r, g) is a special case of Assumption 2 with Let
. We take the following feature maps
, and parameter vectors,
where is a canonical basis of
associated with (
) reads that for any (
)th entry of
); similarly we define
and
. We can verify that
, and for any
and any (
. Therefore, we take
in Assumption (2) for the tabular case.
The proof of Theorem 1 is generic, since it is ready to achieve sublinear regret and constraint violation bounds as long as the policy evaluation is sample-efficient, e.g., the UCB design of ‘optimism in the face of uncertainty.’ In what follows, we introduce another efficient policy evaluation for line 10 of Algorithm 1 in the tabular case. Let us first introduce some notation. For any (, and any (
, we define two visitation counters
in episode k,
This allows us to estimate reward function r, utility function g, and transition kernel for episode
for all (0 is the regularization parameter. Moreover, we introduce the bonus term Γ
which adapts the counter-based bonus terms in the literature [8, 31], where
0 is to be determined later.
Using the estimated transition kernels , the estimated reward/utility functions
, and the bonus terms
, we now can estimate the action-value function via line 7 of Algorithm 3 for any (
summarize the above procedure in Algorithm 3. Using already estimated
we execute the policy improvement and the dual update in Algorithm 1.
Similar to Theorem 1, we provide theoretical guarantees in Theorem 2; see Appendix C.2 for the proof. Theorem 2 improves (|S|, |A|) dependence in Theorem 1 for the tabular case and also improves |S| dependence in [26]. It is worthy mentioning our Algorithm 1 is generic in handling an infinite state space.
Theorem 2 (Tabular Case: Regret and Constraint Violation). Let Assumption 1 hold and let Assumption 2 hold with feature maps (8). Fix . In Algorithm 1, we set
is an absolute constant. Then, with probability 1
, the regret and the constraint violation in (3) satisfy
In this paper, we have developed a provably efficient safe reinforcement learning algorithm in the linear MDP setting. The algorithm extends the proximal policy optimization to the constrained MDP and incorporates the UCB exploration. We prove that the proposed algorithm obtains an ) regret and an
) constraint violation under mild regularity conditions where T is the total number of steps taken by the algorithm. Moreover, our algorithm works in settings where reward/utility functions are given by bandit feedback. To the best of our knowledge, our algorithm is the first provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting. We hope that our work provides a step towards a principled way to design efficient safe reinforcement learning algorithms.
[1] Y. Abbasi-Yadkori, D. P´al, and C. Szepesv´ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[2] N. Abe, P. Melville, C. Pendus, C. K. Reddy, D. L. Jensen, V. P. Thomas, J. J. Bennett, G. F. Anderson, B. R. Cooley, M. Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In International Conference on Knowledge Discovery and Data Mining, pages 75–84, 2010.
[3] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, volume 70, pages 22–31, 2017.
[4] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
[5] E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
[6] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man´e. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
[7] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. F. Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
[8] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.
[9] Q. Bai, A. Gattami, and V. Aggarwal. Model-free algorithm and regret analysis for MDPs with peak constraints. arXiv preprint arXiv:2003.05555, 2020.
[10] Y. Bai, T. Xie, N. Jiang, and Y.-X. Wang. Provably efficient Q-learning with low switching cost. In Advances in Neural Information Processing Systems, pages 8002–8011, 2019.
[11] A. Beck. First-order methods in optimization. SIAM, 2017.
[12] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
[13] S. Bhatnagar and K. Lakshmanan. An online actor–critic algorithm with function approxima- tion for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.
[14] V. S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
[15] J. A. Boyan. Least-squares temporal difference learning. In International Conference on Machine Learning, pages 49–56, 1999.
[16] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
[17] K. Brantley, M. Dudik, T. Lykouris, S. Miryoosefi, M. Simchowitz, A. Slivkins, and W. Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051, 2020.
[18] Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.
[19] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
[20] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.
[21] Y. Chow, O. Nachum, A. Faust, M. Ghavamzadeh, and E. Duenez-Guzman. Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031, 2019.
[22] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.
[23] D. Ding, K. Zhang, T. Ba¸sar, and M. R. Jovanovi´c. Natural policy gradient primal-dual method for constrained Markov decision processes. In Advances in Neural Information Processing Systems, 2020.
[24] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
[25] G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learn- ing. arXiv preprint arXiv:1904.12901, 2019.
[26] Y. Efroni, S. Mannor, and M. Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
[27] M. El Chamie, Y. Yu, B. A¸cıkme¸se, and M. Ono. Controlled Markov processes with safety state constraints. IEEE Transactions on Automatic Control, 64(3):1003–1018, 2018.
[28] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
[29] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
[30] J. Garcıa and F. Fern´andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[31] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
[32] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
[33] S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.
[34] A. Koppel, K. Zhang, H. Zhu, and T. Ba¸sar. Projected stochastic primal-dual method for constrained online learning with kernels. IEEE Transactions on Signal Processing, 67(10):2528– 2542, 2019.
[35] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.
[36] T. Lattimore and C. Szepesvari. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.
[37] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In International Conference on Machine Learning, pages 615–622, 2010.
[38] H. Le, C. Voloshin, and Y. Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712, 2019.
[39] Q. Liang, F. Que, and E. Modiano. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480, 2018.
[40] B. Liu, Q. Cai, Z. Yang, and Z. Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10564–10575, 2019.
[41] Y. Liu, J. Ding, and X. Liu. IPO: Interior-point policy optimization under constraints. arXiv preprint arXiv:1910.09615, 2019.
[42] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.
[43] A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020, 2020.
[44] A. Nedi´c and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.
[45] S. Paternain, M. Calvo-Fullana, L. F. Chamon, and A. Ribeiro. Safe policies for reinforcement learning via primal-dual methods. arXiv preprint arXiv:1911.09101, 2019.
[46] S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro. Constrained reinforcement learn- ing has zero duality gap. In Advances in Neural Information Processing Systems, pages 7553– 7563, 2019.
[47] B. ´A. Pires and C. Szepesv´ari. Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pages 121–151, 2016.
[48] S. Qiu, X. Wei, Z. Yang, J. Ye, and Z. Wang. Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions. Preprint, 2020.
[49] G. J. Schell, W. J. Marrero, M. S. Lavieri, J. B. Sussman, and R. A. Hayward. Data-driven Markov decision process approximations for personalized hypertension treatment planning. MDM Policy & Practice, 1(1), 2016.
[50] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[51] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[52] R. Singh, A. Gupta, and N. B. Shroff. Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
[53] A. Stooke, J. Achiam, and P. Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. arXiv preprint arXiv:2007.03964, 2020.
[54] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT press, 2018.
[55] C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2019.
[56] M. Turchetta, F. Berkenkamp, and A. Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320, 2016.
[57] E. Uchibe and K. Doya. Constrained reinforcement learning from intrinsic and extrinsic re- wards. In International Conference on Development and Learning, pages 163–168, 2007.
[58] B. Van Roy and S. Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
[59] A. Wachi and Y. Sui. Safe reinforcement learning in constrained Markov decision processes. arXiv preprint arXiv:2008.06626, 2020.
[60] A. Wachi, Y. Sui, Y. Yue, and M. Ono. Safe exploration and optimization of constrained MDPs using Gaussian processes. In AAAI Conference on Artificial Intelligence, 2018.
[61] L. Wang, Q. Cai, Z. Yang, and Z. Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.
[62] X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints. arXiv preprint arXiv:1908.00305, 2019.
[63] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
[64] L. F. Yang and M. Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
[65] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge. Projection-based constrained policy optimization. In International Conference on Learning Representations, 2019.
[66] H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.
[67] J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, pages 6137–6146, 2018.
[68] D. Zhou, J. He, and Q. Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.
Our analysis begins with decomposition of the regret given in (3).
where we add and subtract the value ) estimated from an optimistic policy evaluation by Algorithm 2; the policy
in hindsight is the best policy in hindsight for problem (2). To bound the total regret (12), we would like to analyze (R.I) and (R.II) separately.
First, we define the model prediction error for the reward as
for all (], which describes the prediction error in the Bellman equations (1) using
. With this notation, we expand (R.I) into
where the first double sum is linear in terms of the policy difference and the second one describes the total model prediction error. The above expansion is based on the standard performance difference lemma (see Lemma 10) and we provide a proof in Section D.1 for readers’ convenience. Meanwhile, if we define the model prediction error for the utility as
then, similarly, we can expand
To analyze the constraint violation, we also introduce an useful decomposition,
which the inserted value ) is estimated from an optimistic policy evaluation by Algorithm 2.
For notational simplicity, we introduce the underlying probability structure as follows. For any (], we define
-algebra generated by state-action sequences, reward and utility functions,
Here, is a null state for any
A filtration is a sequence of
in terms of time index
which holds that The estimated reward/utility value functions,
, and the associated Q-functions,
-measurable since they are obtained from previous
1 historical trajectories. With these notations, we can expand (R.II) in (12) into
where is a martingale adapted to the filtration
in terms of time index t. Similarly, we have it for (V.II),
where is a martingale adapted to the filtration
in terms of time index t. We prove (19) in Section D.2 for completeness (also see [18, Lemma 4.2]); (20) is similar.
in the action-value function estimation of Algorithm 2. By the UCB argument, if we set and
is an absolute constant, then for any (
(
with probability 12 where the symbol
. We prove (21) in Section D.3 for completeness.
In what follows we delve into the analysis of the regret and the constraint violation.
The goal is to prove that the regret and the constraint violation for Algorithm 1 are sublinear in the total number of steps: T := KH, taken by the algorithm, where K is the total number of episodes and H is the horizon length. We recall that |A| is the size of action space A and d is the feature map’s dimension. We repeat Theorem 1 here for readers’ convenience.
Theorem 1 (Linear Kernal MDP: Regret and Constraint Violation). Let Assumptions 1 and 2 hold. Fix in Algorithm 1 with the full-information setting, where
is an absolute constant. Suppose log
. Then, the regret and the constraint violation in (3) satisfy
We divide the proof into two parts for the regret bound and the constraint violation, respectively, in Section B.1 and Section B.2.
Our analysis begins with a primal-dual mirror descent type analysis for the policy update in line 6 of Algorithm 1. In Lemma 2, we present a key upper bound on the total differences of estimated values ) given by Algorithm 2 to the optimal ones.
Lemma 2 (Policy Improvement: Primal-Dual Mirror Descent Step). Let Assumption 1 and Assumption 2 hold. In Algorithm 1, if we set
Proof. We recall that line 6 of Algorithm 1 follows a solution to the following subproblem,
where we use the shorthandthe shorthand
)) if dependence on the state-action sequence
is clear from context. We note that (23) is in form of a mirror descent subproblem in Lemma 14. We can apply the pushback property with
Equivalently, we write the above inequality as follows,
By taking expectation on both sides of (24) over the random state-action sequence
starting from
, and applying decompositions (14) and (16), we have
The rest is to bound the right-hand side of the above inequality (25). By the H¨older’s inequality and the Pinsker’s inequality, we first have
Then, using the square completion,
where we dropoff the first quadratic term for the inequality, and
where the last inequality is due to , a result from line 12 in Algorithm 2, and 0
. Taking the same expectation
as previously on both sides of (26) and substituting it into the left-hand side of (25) yield,
where in the second inequality we note the fact that Lemma 15.
We note that is initialized to be zero. By taking a telescoping sum of both sides of (27) from k = 1 to k = K + 1 and shifting the index k by one, we have
where we ignore )] and utilize
Finally, we take in the lemma to complete the proof.
By the dual update of Algorithm 1, we can simplify the result in Lemma 2 and return back to the regret (12).
Lemma 3. Let Assumption 1 and Assumption 2 hold. In Algorithm 1, if we set
Regret(
Proof. By the dual update in line 9 in Algorithm 1, we have
where we use the feasibility of in the last inequality. Since
above inequality implies that
By noting the UCB result (21) and 0, the inequality (22) implies that
Then, we add (30) to the above inequality and take
where is an absolute constant. Finally, we combine (19) and (31) to complete the proof.
By Lemma 3, the rest is to bound the last two terms in the right-hand side of (29). We next show two probability bounds for them in Lemma 4 and Lemma 5, separately.
Lemma 4 (Model Prediction Error Bound). Let Assumption 2 hold. Fix . If we set
in Algorithm 1, then with probability 1
it holds that
Proof. By the UCB result (21), with probability 12 for any (
we have
for any (. Therefore, we have
where Γcation of the Cauchy-Schwartz inequality shows that
Since we take 1, we have
The rest is to apply Lemma 13. First, for any
] it holds that
Therefore, we have
Similarly, we can show that
Applying the above inequalities (34) and (35) to (33) leads to
Finally, we set = 1 to obtain (32).
Lemma 5 (Matingale Bound). In Algorithm 1, it holds with probability 1
Proof. In the verification of (19) (see Section D.2), we introduce the following matingale,
where
This shows that ]. Application of the Azuma-Hoeffding inequality yields,
For 1), if we set
), then the inequality (36) holds with probability at least 1
We now are ready to show the desired regret bound. Applying (32) and (36) to the right-hand side of the inequality (29), we have
with probability 1 are absolute constants. Then, with probability 1
that Regret(
where C is an absolute constant.
In Lemma 2, we have provided an useful upper bound on the total differences that are weighted by the dual update . To extract the constraint violation, we first refine Lemma 2 as follows.
Lemma 6 ([Policy Improvement: Refined Primal-Dual Mirror Descent Step). Let Assumptions 1
Proof. By the dual update in line 9 in Algorithm 1, for any
where we apply the non-expansiveness of projection in the first inequality and for the last inequality. By summing the above inequality from k = 1 to k = K, we have
which implies that
By adding the above inequality to (28) in Lemma 2 and noting that and the UCB result (21), we have
By taking in the lemma, we complete the proof.
According to Lemma 6, we can multiply (20) by Y and add it, together with (19), to (37),
We now are ready to show the desired constraint violation bound. We note that there exists a policy ). By the oc- cupancy measure method [5],
) are linear in terms of an occupancy measure induced by policy
and initial state
. Thus, an average of K occupancy measures is still an occupancy measure that produces policy
with values
). Particularly, we take
0; otherwise
. Therefore, we have
where we apply the UCB result (21) for the last inequality.
Finally, we recall two immediate results of Lemma 4 and Lemma 5. Fix 1), the proof of Lemma 4 also shows that with probability 1
and the proof of Lemma 5 shows that with probability 1
If we take log ) implies that with probability 1
where is an absolute constant. Finally, by noting our choice of
, we can apply Lemma 9 to conclude that Violation(
with probability 1 is an absolute constant.
As mentioned, Assumption 2 includes a tabular CMDP(S, A, H, P, r, g) as a special case with |S| < . We take the following feature maps and parameter vectors,
where is a canonical basis of
associated with (
) reads that for any (
)th entry of
); similarly we define
. Thus, we can see that
We can also verify that
and for any ] and any (
Therefore, the tabular CMDP is a special case of Assumption (2) with
We now detail Algorithm 1 for the tabular case as follows. Our policy evaluation works with regression feature
where . Thus, for any (¯
)th entry of
) is given by
which shows that ) is a sparse vector with |S| nonzero elements at
the (
)th entry of
). For instance of
, the regularized least-squares problem (6) becomes
where [)th entry of w, and the solution
serves as an estimator of the transition kernel
). On the other hand, since
, the regularized least- squares problem (7) becomes
where [)th entry of u, the solution
gives an estimate of
By adding similar UCB bonus terms Γ
given in Algorithm 2, we estimate the action-value function as follows,
for any (. Similarly, we estimate
thus
). Using already estimated
execute the policy improvement and the dual update in Algorithm 1.
We restate the result of Theorem 1 for the tabular case as follows.
Corollary 1 (Regret and Constraint Violation). For the tabular CMDP with feature maps (41), let Assumption 1 hold. Fix In Algorithm 1, we set
is an absolute constant. Then, the regret and the constraint violation in (3) satisfy
Proof. It follows the proof of Theorem 1 by noting that the tabular CMDP is a special linear MDP in Assumption 2, with , and we have log
automatically.
As we see in the proof of Theorem 1, our final regret or constraint violation bounds are dominated by the accumulated bonus terms, which come from the design of ‘optimism in the face of uncertainty.’ This framework provides a powerful flexibility for Algorithm 1 to incorporate other optimistic policy evaluation methods. In what follows, we introduce Algorithm 1 with a variant of optimistic policy evaluation.
and any (, we define two visitation counters
in episode
This allows us to estimate transition kernel , reward function r, and utility function g for episode
where 0 is the regularization parameter. Moreover, we introduce the bonus term Γ
which adapts the counter-based bonus terms in the literature [8, 31], where 0 is to be determined later. Using the estimated transition kernels
, the estimated reward/utility functions
, and the bonus terms
, we now can estimate the action-value function via
for any (. We summarize the above procedure in Algorithm 3. Using already estimated
, we can execute the policy improvement and the dual update in Algorithm 1.
Similar to Theorem 1, we prove the following regret and constraint violation bounds.
Theorem 3 (Regret and Constraint Violation). For the tabular CMDP with feature maps (41), let Assumption 1 hold. Fix In Algorithm 1, we set
is an absolute constant. Then, the regret and the constraint violation in (3) satisfy
Proof. The proof is similar to Theorem 1. Since we only change the policy evaluation, all previous policy improvement results still hold. By Lemma 3, we have
where is the model prediction error given by (13) and
is a martingale adapted to the filtration
in terms of time index t defined in (18). By Lemma 5, it holds with probability 1
). The rest is to bound the double sum term. As shown in Section D.4, with probability 1
2 it holds that for any (
Together with the choice of Γ
Define mapping ¯, we can utilize Lemma 13. For any (
where we apply the Cauchy-Schwartz inequality for the second inequality and Lemma 13 for the third inequality. Notice that (
Notice that log ), we conclude the desired regret bound.
For the constraint violation analysis, Lemmas 6 still holds. Similar to (39), we have
where ). Similar to Lemma 5, it holds with probability 1
. As shown in Section D.4, with probability 1
3 it holds that
0 for any (
(
. Therefore, we have
which leads to the desired constraint violation bound due to Lemma 9 and we set previously.
In this section, we collect some verifications for readers’ convenience.
For any (], we recall the definitions of
in the Bellman equations (1) and
from line 12 in Algorithm 2,
We can expand the difference
Recall the equality in the Bellman equations (1) and the model prediction error,
As a result of the above two, it is easy to see that
Substituting the above difference into the right-hand side of (43) yields,
which displays a recursive formula over h. Thus, we expand ) recursively with
V =
With this notation, repeating the above recursion (44) over
Finally, notice that = 0, we use the definitions of
to conclude (14). Similarly, we can also use the above argument to verify (16).
We expand the model prediction error
where we use the Bellman equation ) in the last equality. With the above formula, we expand the difference
Therefore, we have the following recursive formula over h,
Following the definitions of . Thus, for any (
Notice that 1). Clearly,
empty. We define a martingale sequence,
where is the time index. Clearly, this martingale is adapted to the filtration
, and particularly,
Finally, we combine the above martingale with (45) to obtain (19). Similarly, we can show (20).
We recall the definition of the feature map
for any (. By Assumption 2, we have
where the second equality is due to the definition of from line 4 of Algorithm 2 in the fourth equality, and we recursively replace
1] in the last equality.
Therefore,
for any (, where we apply the Cauchy-Schwarz inequality twice in the inequality. By Lemma 12, set
= 1, with probability 1
2 it holds that
Also notice that Λ. Thus, by taking an appropriate absolute constant C, we obtain that
for any (under the event of Lemma 12.
Algorithm 2, with probability 1 2 it holds that
for any (
We note that reward/utility functions are fixed over episodes, the difference
where we apply the Cauchy-Schwartz inequality in the inequality. Notice that Λ
(
We recall the model prediction error and the estimated state-action value function
in line 11 of Algorithm 2,
for any (), we first have
Then, we can show that
for any (
Therefore, (48) reduces to
On the other hand, notice that (
for any (
Therefore, we have proved that with probability 1 2 it holds that
for any (
Similarly, we can show another inequality
Let be a set of bounded function on
, we consider the difference between
) as follows,
for any (, where we apply the triangle inequality for the inequality.
Let ). Conditioning on the filtration
is a zero-mean and H/2-subGaussian random variable. By Lemma 11, we use
and thus with probability at least 1 it holds that
for any (]. Also, since 0
By returning to (49) and setting = 1, with probability at least 1
it holds that
for any
Let be a distance on
with respect to distance
Thus, for any , there exists
such that max
. By the triangle inequality, we have
Furthermore, we choose ) and take an union bound over
), with probability at least 1
2 it holds that
for all (is an absolute constant. We recall our choice of Γ
with probability at least 1
2 it holds that
for any (
where we utilize 1 in the inequalities.
We now are ready to check the model prediction error defined by (13),
for any (. On the other hand, notice that (
for any (. Hence, we complete the proof of (42).
We collect some standard results from the literature for readers’ convenience. We rephrase them for our constrained problem (2),
in which we maximize over all policies and ]. Let the optimal solution be
Let the Lagrangian be 0 is the Lagrange multiplier or dual variable. The associated dual function is defined as
and the optimal dual is
We recall that the problem (2) enjoys strong duality under the standard Slater condition. The proof is a special case of [45, Proposition 1] in finite-horizon.
Lemma 7 (Strong Duality). [45, Proposition 1] If the Slater condition holds, then the strong duality holds,
It is implied by the strong duality that the optimal solution to the dual problem: minimize) is obtained at
. Denote the set of all optimal dual variables as Λ
Under the Slater condition, an useful property of the dual variable is that the sublevel sets are bounded [11, Section 8.5].
Lemma 8 (Boundedness of Sublevel Sets of the Dual Function). Let the Slater condition hold. Fix , it holds that
where we utilize the Slater point ¯in the last inequality. We complete the proof by noting
Corollary 2 (Boundedness of
Another useful theorem from the optimization [11, Section 3.5] is given as follows. It describes that the constraint violation ) can be bounded similarly even if we have some weak bound. We next state and prove it for our problem, which is used in our constraint violation analysis in Section B.
Lemma 9 (Constraint Violation). Let the Slater condition hold and Assume that
By definition, ). It has been shown as a special case of [45, Proposition 1] that
is concave. First, we show that
(0). By the Lagrangian and the strong duality,
For any
If we maximize the right-hand side of above inequality over then
which show that (0). On the other hand, if we take
Combing the above two yields
By our assumption and
First, we state a lemma that is used throughout this paper.
Lemma 10 (Performance Difference Lemma). For any two policies that
Proof. See the proof of Lemma 3.2 in [18].
Next, we state an useful concentration inequality for the standard self-normalized processes.
Lemma 11 (Concentration of Self-normalized Processes)be a filtration and
be a R-valued stochastic process such that
-measurable for any
. Assume that for any
, conditioning on
is a zero-mean and
-subGaussian random variable with the variance proxy
stochastic process such that
-measurable for any
be a deterministic and positive-definite matrix. For any
Proof. See the proof of Theorem 1 in [1].
The above concentration inequality can be customized to our setting in the following form without using covering number arguments as in [32].
in Algorithm 2. Fix
. Then, for any (
for
Proof. See the proof of Lemma D.1 in [18].
Lemma 13 (Elliptical Potential Lemma)be a sequence of functions in
be a positive definite matrix. Let Λ
Then for any
it holds that
Proof. See the proof of Lemma D.2 in [32] or [18].
Lemma 14 (Pushback Property of KL-divergence)be a concave function where ∆ is a probability simplex in
be the interior of ∆. Let
for a fixed
. Then, for any
Proof. See the proof of Lemma 14 in [62].
Lemma 15 (Bounded KL-divergence Difference)be two probability distributions in ∆(
Proof. See the proof of Lemma 31 in [62].