Provably Efficient Safe Exploration via Primal-Dual Policy Optimization

2020·Arxiv

Abstract

Abstract

We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing SRL algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an ) regret and an ) constraint violation, where d is the dimension of the feature mapping, H is the horizon of each episode, and T is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state-action space only through the dimension of the feature mapping and thus our results hold even when the number of states goes to infinity. To the best of our knowledge, we provide the first provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.

1 Introduction

Reinforcement Learning (RL) studies how an agent learns to maximize its expected total reward by interacting with an unknown environment over time [54]. Safe RL (SRL) involves extra restrictions arising from real-world problems [30, 6, 25]. Examples include collision-avoidance in self-driving cars [29], switching cost limitations in medical applications [10], and legal and business restrictions in financial management [2]. A standard environment model for SRL is the Constrained Markov Decision Process (CMDP) [5] that extends the classical MDP by adding an extra safety-related utility, and translates the safety demand into a constraint on the expected total utility [3]. The presence of safety constraints makes the celebrated exploration-exploitation trade-off more challenging.

Many existing SRL algorithms for CMDPs are policy-based, and primal-dual type constrained optimization methods are commonly used to deal with policy constraints, e.g., Constrained Policy Gradient (CPG) [57], Lagrangian-based Actor-Critic (AC) [14, 13, 19, 55, 39], Primal-Dual Policy Optimization (PDPO) [46, 45, 53], Constrained Policy Optimization (CPO) [3, 65], and Reward Constrained Policy Optimization (RCPO) [55]. These SRL algorithms either do not have a theoretical guarantee or can only be shown to converge asymptotically in the batch offline setting. It is imperative to study theoretical guarantees for SRL algorithms regarding computational/statistical efficiency.

In this work, we look at a more challenging problem of finding a sequence of policies in response to streaming samples of reward functions, constraint functions, and transition. We seek to provide theoretical guarantees on the convergence rate of approaching the best-fixed policy in hindsight as well as feasibility region formed by constraints. Unlike scenarios where the safety constraint is known a priori [56, 12, 22, 60, 20, 21, 59], we have to explore the unknown environment and adapt the policy to the safety constraint with the minimal violation [6, 30]. We refer this task as safe exploration. Recent policy-based SRL algorithms for CMDPs, e.g., CPO [3, 65] and PDPO [46, 45], seek a single safe policy via the constrained policy optimization whose sample efficiency guarantees over streaming or time-varying data are largely unknown. In this paper, we aim to answer a theoretical question:

Can we design a provably sample efficient online policy optimization algorithm for CMDP?

Contribution. We propose a provably efficient SRL algorithm for CMDP with an unknown transition model in the linear episodic setting – the Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm – where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. Theoretically, we prove that the proposed algorithm achieves an ) regret and the same ) constraint violation, where d is the dimension of the feature mapping, H is the horizon of each episode, and T is the total number of steps. We establish these bounds in the setting where the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state space only through the dimension of the feature mapping and thus hold even when the number of states goes to infinity. To the best of our knowledge, our result is the first provably efficient online policy optimization for CMDP with safe exploration in the function approximation setting.

Related Work. Our work is related to a line of provably efficient RL algorithms based on linear function approximation where the exploration is achieved by adding an Upper Confidence Bound (UCB) bonus [63, 64, 32]. A closely-related recent work is Proximal Policy Optimization (PPO) [51]. As is shown in [40, 18], PPO converges to the optimal policy sublinearly and an optimistic variant of PPO is sample efficient with UCB exploration in the linear setting. However, such results only hold for unconstrained RL problems. Our work seeks to extend an optimistic variant of PPO for CMDP with UCB exploration and establish theoretical efficiency guarantees. For the large CMDP with unknown transition models, there is a line of works that relates to the policy optimization under constraints, e.g., CPG [57], CPO [3, 65], RCPO [55], and IPPO [41]. However, their theoretical guarantees are still lacking. By contrast, our work is supported by theoretical efficiency guarantees.

Recent results on learning CMDPs with unknown transitions and rewards are closely related. As we know, most of them are model-based and only apply for finite state-action spaces. The works [52, 26] independently leverage upper confidence bound (UCB) on fixed reward, utility, and transition probability to propose sample-efficient algorithms for tabular CMDPs; [52] obtains an ) regret and constraint violation via linear program in the average cost setup in time T; [26] obtains an ) regret and constraint violation in the episodic setting via linear program and primal-dual policy optimization, where S is a state space and A is an action space. In [48], the authors study an adversarial stochastic shortest path problem under constraints and unknown transitions with ) regret and constraint violation. The work [9] extends Q-learning with optimism for finite state-action CMDPs with peak constraints. The work [17] proposes UCB-based convex planning for episodic tabular CMDPs in dealing with convex or hard constraints. In contrast, our proposed algorithm can potentially be applied to infinite state space scenarios with sublinear regret and constraint violation bounds only depending on the implicit dimension as opposed to the true dimension of the state space.

2 Problem Setup

We consider an episodic Markov decision process (MDP) – MDP(S, A, H, P, r) – where S is a state space, A is an action space, H is a fixed length of each episode, is a collection of transition probability measures, and is a collection of reward functions. We assume that S is a measurable space with possibly infinite number of elements. Moreover, for each step ) is a transition kernel over next state if action a is taken for state 1] is a reward function. The constrained MDP – CMDP(S, A, H, P, r, g) – also contains utility functions 1] is the utility function. We assume that reward/utility are deterministic. Our analysis readily generalizes to the setting where reward and utility are random.

Let the policy space ∆(where ∆(A) denotes a probability simplex over the action space. Let ) be a policy taken by the agent at episode is the action that the agent takes at state . For simplicity, we assume the initial state to be fixed as in different episodes; it readily generalizes to be random according to some distribution. The agent interacts with the environment in the kth episode as follows. At the beginning, the agent determines a policy . Then, at each step ], the agent observes the state , determines an action following the policy ), and receives a reward ) together with an utility Meanwhile, the MDP evolves into next state drawing from the probability measure ). The episode terminates at state ; when this happens, no control action is taken and both reward and utility functions are equal to zero. In this paper, we focus a challenging bandit setting where the agent only observes the values of reward/utility functions, ), at visited state-action pair (); see examples in [49, 27, 45]. We assume that reward/utility functions are fixed over episodes.

Given a policy ), the value function associated with the reward function r

at each step h are the expected values of total rewards,

for all ], where the expectation is taken over the random state-action sequence ; the action follows the policy ) at the state and the next state follows the transition dynamics ). Thus, the action-value function associated with the reward function r is the expected value of total rewards when the agent starts from state-action pair (x, a) at step h and follows policy

for all (]. Similarly, we define the value function the action-value function associated with the utility function g. Denote symbol . For brevity, we take the shorthand The Bellman equations associated with a policy are given by

where , for all (, the inner product of a function

Learning Performance. The learning agent aims to find a solution of a constrained problem in which the objective function is the expected total rewards and the constraint is on the expected

where we take ] to avoid triality. It is readily generalized to the problem with multiple constraints. Let ) be a solution to problem (2). Since the policy is computed from knowing all reward and utility functions, it is commonly referred to as an optimal policy in-hindsight.

The associated Lagrangian of problem (2) is given by is the primal variable and 0 is the dual variable. We can cast problem (2) into a saddle-point problem: maximize), which is convex in Y and is non-concave in due to the non-concavity of value functions in policy space [4]. To address the non-concavity, we will exploit constrained RL problem structure to propose a new variant of Lagrange multipliers method for constrained RL problems in Section 3, which warrants a new line of primal-dual mirror descent type analysis in sequel. It is different from unconstrained RL analysis, e.g., [4, 18].

Another key feature of constrained RL is the safe exploration under constraints [30]. Without any constraint information a priori, it is infeasible for each policy to satisfy the constraint since utility information in constraints is only revealed after a policy is decided. Instead, we allow each policy to violate the constraint in each episode and minimize regret while minimizing total constraint violations for safe exploration over K episodes; aslo see [26]. We define the regret as the difference between the total reward value of policy in hindsight and that of the agent’s policy over K episodes, and the constraint violation as a difference between the offset Kb and the total utility value of the agent’s policy

In this paper, we design algorithms, taking bandit feedback of the reward/utility functions, with both regret and constraint violation being sublinear in the total number of steps T := HK. Put differently, the algorithm should ensure that given ), then both Regret(K) = ) and Violation() hold with high probability.

To achieve sublinear regret and constraint violation, we assume standard Slater condition for problem (2) and recall strong duality; see the proof in [46, 45], and boundedness of the optimal dual variable. The dual function is D(Y ) := maximize) and the optimal dual variable

Assumption 1 (Slater Condition)

Lemma 1 (Strong Duality, Boundedness of

The Slater condition is mild in practice and commonly adopted in previous works; e.g., [46, 26, 48, 23]. The implied properties for problem (2) by Slater condition will be useful in our algorithm design and analysis.

Linear Function Approximation. We focus on CMDPs with linear transition kernels in feature maps.

Assumption 2. The CMDP(S, A, H, P, r, g) is a linear MDP with a kernal feature map and a value feature map , if for any , there exists a vector such that for any (

where max(. Moreover, we assume that for any function

Assumption 2 adapts the definition of linear kernel MDP [7, 68, 18] for CMDPs. Linear kernel MDP examples include tabular MDPs [68], feature embedded transition models [64], and linear combinations of base models [43]. We can construct related examples of CMDPs with linear structure by adding adding proper constraints. For usefulness of linear structure, see discussions in [24, 58, 36]. For more general transition dynamics, see factored linear model [47].

Although our definition in Assumption 2 and linear MDPs [63, 32] all include tabular MDPs as special cases, they define transition dynamics using different feature maps. They are not comparable since one cannot be implied by the other [68]. The design of provably efficient RL algorithms beyond these linear type CMDPs remains open.

3 Proposed Algorithm

In Algorithm 1, we present a new variant of Proximal Policy Optimization [51] – an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm. We effectuate the optimism through the Upper-Confidence Bounds (UCB) and address the constraints using a new variant of Lagrange multipliers method combined with the performance difference lemma; see Lemma 10 in Appendix F or works [50, 51, 18]. In each episode, our algorithm consists of three main stages. The first stage (lines 4–8) is Policy Improvement: we receive a new policy by improving previous via a mirror descent type optimization; The second stage (line 9) is Dual Update: we update dual variable based on the constraint violation induced by previous policy third stage (line 10) is Policy Evaluation: we optimistically evaluate newly obtained policy via the least-squares policy evaluation with an additional UCB bonus term for exploration.

Policy Improvement. In the k-th episode, a natural attempt of obtaining a policy is to solve

a Lagrangian-based policy optimization problem,

where ) is the Lagrangian and the dual variable 0 is from the last episode. We will show later that can be updated efficiently. This type update is used by several works, e.g. [38, 46, 45, 55]. The issue with them is that it relies on an oracle solver involving either Q-learning [28], PPO [51], or TRPO [50] to deliver a near-optimal policy. Thus, the overall algorithmic complexity is expensive and it is not suitable for the online use. In contrast, this work utilizes RL problem structure and show that only an easily-computable proximal step is sufficient by establishing efficiency guarantees without using any oracle solvers.

Denote symbol . Via the performance difference lemma, we can expand value functions ) at the previously known policy

where is taken over the random state-action sequence . Thus, we introduce an approximation of ) for any state-action sequence

where can be estimated from an optimistic policy evaluation that will be discussed later. With this notion, in each episode, instead of solving a Lagrangian-based policy optimization, we perform a simple policy update in online mirror descent fashion,

where ) is a mixed policy of the previous one and the uniform distribution Unif(1]. The constant 0 is a trade-off parameter, ) is the KL divergence between is absolutely continuous in The policy mixing step ensures such absolute continuity and implies uniformly bounded KL divergence (see Lemma 15 in Appendix F). Ignoring other -irrelevant terms, we update in terms of previous policy

Since the above update is separable over states , we can update the policy in Algorithm 1, a closed solution for any step above update reduces to one step in an optimistic variant of PPO [18]. The idea of KL-divergence regularization in policy optimization has been widely used in many unconstrained scenarios, e.g., NPG [33], PPO [51], TRPO [50], and their neural variants [61, 40]. Our method is distinct from them in that it is based on the performance difference lemma and the optimistically estimated value functions.

Dual Update. To infer the constraint violation for the dual update, we esimate ) via an optimistic policy evaluation that will be discussed later. We update the Lagrange multiplier 0 by moving to the direction of minimizing the Lagrangian line 9 in Algorithm 1, where 0 is a stepsize and Projis a projection onto [0upper bound , we choose so that projection interval [ 0includes the optimal dual variable ; also see works [26, 44].

The dual update works as a trade-off between the reward maximization and the constraint violation reduction. If the current policy satisfies the approximated constraint, i.e., we put less weight on the action-value function associated with the utility and maximize the reward; otherwise, we sacrifice the reward a bit to satisfy the constraint. The dual update has a similar use in dealing with constraints in CMDP, e.g., Lagrangian-based AC [19, 39], and online constrained optimization problems [66, 62, 67]. In contrast, we handle the dual update via the optimistic policy evaluation that yields an arguably efficient approximation of constraint violation.

Policy Evaluation. The last stage of the kth episode takes the Least-Squares Temporal Difference (LSTD) [16, 15, 37, 35] to evaluate the policy based on previous 1 historical trajectories. For each step ], instead of in the Bellman equations (1), we estimate (is updated by the minimizer of the regularized least-squares problem over

where and 0 is the regularization parameter. Similarly, we estimate (. We display the least-squares solution in line 4–6 of Algorithm 2 where symbol r or g. We also estimate is updated by the minimizer of another regularized least-squares problem,

where 0 is the regularization parameter. Similarly, we estimate . The least- squares solutions lead to line 8–9 of Algorithm 2.

value function iteratively in line 11 of Algorithm 2, where is an estimate of (is an estimate of ; we add UCB bonus terms Γso that

all become upper confidence bounds for exploration. We take Γand leave the parameter 0 to be tuned later. Moreover, the bounded reward/utility 1] implies

We remark the computational efficiency of Algorithm 1. For the time complexity, since line 6 is a scalar update, they need O(d|A|T) time. A dominating calculation is from lines 5/9 in Algorithm 2. If we use the Sherman–Morrison formula for computing (ΛAnother important calculation is the integration from line 6 in Algorithm 2. We can either compute it analytically if it is tractable or approximate it via the Monte Carlo integration efficiently [68] that assumes O(dT) time. Therefore, the time complexity is ) in total. For the space complexity, we don’t need to store policy since it is recursively calculated via (4). By updating Λ) recursively, it takes

4 Regret and Constraint Violation Analysis

We now prove that the regret and the constraint violation for Algorithm 1 are sublinear in T := KH; T is the total number of steps taken by the algorithm; K is the total number of episodes; H is the episode horizon. We recall that |A| is the cardinality of action space A and d is the dimension of the feature map.

Theorem 1 (Linear Kernal MDP: Regret and Constraint Violation). Let Assumptions 1 and 2 hold. Fix and in Algorithm 1, where is an absolute constant. Suppose log

The above result establishes that OPDOP enjoys an ) regret and an constraint violation if we set algorithm parameters properly. Our results have the optimal dependence on the total number of steps T up to some logarithmic factors. The d dependence occurs due to the uniform concentration for controlling the fluctuations in the least-squares policy evaluation. This matches the existing bounds in the linear MDP setting without any constraints [18, 7, 68]. Our bounds differ from them only by an extra H, which is a price introduced by the uniform bound on the constraint violation. It is noticed that we don’t observe reward/utility functions, our algorithm works for bandit feedback after each episode.

We remark the tabular setting for Algorithm 1; see Appendix C for details. The tabular CMDP is a special case of Assumption 2 by taking canonical bases as feature mappings; see them in Section 5. The feature map has dimension and thus Theorem 1 automatically provides ) regret and constraint violation for the tabular CMDPs. The dependence relies on the least-squares policy evaluation and it can be improved via other optimistic policy evolution methods if we are limited to the tabular case. We provide such results in Section 5.

4.1 Proof Outline of Theorem 1

We sketch the proof for Theorem 1. We state key lemmas and delay their full versions and complete proofs to Appendix B. In what follows, we fix 1) and use the shorthand w.p. for with probability.

Regret Analysis. We take a regret decomposition,

where (R.I) = is an optimal policy in hindsight, and (R.II) = ) is estimated via our optimistic policy evaluation given by Algorithm 2. As we use to estimate , it leads a model prediction error in the Bellman equations, ; similarly define . In Appendix D.3, the UCB optimism of , shows that or any (1

By assumptions of Theorem 1, the policy improvement (4) yields Lemma 1, depicting total differences of estimates ) to the optimal ones.

Lemma 1 (Policy Improvement: Primal-Dual Mirror Descent Step). Let assumptions of Theorem 1

Lemma 1 displays a coupling between the regret and the constraint violation. This coupling also finds use in online convex optimization [42, 67, 62, 34] and CMDP problems [26]. The proof of Lemma 1 takes a primal-dual mirror descent type analysis of line 6 of Algorithm 1 by utilizing the performance difference lemma. Our primal-dual mirror descent analysis enable standard ones in [51, 40, 18] to deal with constraints.

Via the dual update (5), we show the second total differences scales ). Together with a decomposition of (R.II), (R.II) = is a martingale, we have Lemma 2.

Lemma 2. Let assumptions of Theorem 1 hold. Then, Regret(

Finally, we note that is a martingale that scales as ) via the Azuma-Hoeffding inequality. For the model prediction error, we use the UCB optimism and apply the elliptical potential lemma.

Lemma 3. Let assumptions of Theorem 1 hold. Then,

Applying probability bounds from Lemmas 3 and 4 to Lemma 2 yields our regret bound.

Constraint Violation Analysis. We take a violation decomposition,

where (V.II) = . We still begin with the policy improvement (4) to have Lemma 5, a refined result from Lemma 1.

Lemma 5 (Policy Improvement: Refined Primal-Dual Mirror Descent Step). Let assumptions of

Lemma 5 separates the dual update in the second total differences in Lemma 1. We prove Lemma 5 by combining Lemma 1 with the UCB optimism and a change of variable of dual update (5).

martingale. By adding (V.II) to the inequality in Lemma 5 with multiplier 0, and also adding (R.II) to it, then taking 0; otherwise

where ) for a policy bound Γas done in Lemmas 3 and 4.

Last, by strong duality, applying the constraint violation bound from Lemma 9 in Appendix E leads to)). This gives our desired violation bound.

5 Further Results on Tabular Case

The tabular CMDP(S, A, H, P, r, g) is a special case of Assumption 2 with Let . We take the following feature maps , and parameter vectors,

where is a canonical basis of associated with () reads that for any ()th entry of ); similarly we define and . We can verify that , and for any and any (. Therefore, we take in Assumption (2) for the tabular case.

The proof of Theorem 1 is generic, since it is ready to achieve sublinear regret and constraint violation bounds as long as the policy evaluation is sample-efficient, e.g., the UCB design of ‘optimism in the face of uncertainty.’ In what follows, we introduce another efficient policy evaluation for line 10 of Algorithm 1 in the tabular case. Let us first introduce some notation. For any (, and any (, we define two visitation counters in episode k,

This allows us to estimate reward function r, utility function g, and transition kernel for episode

for all (0 is the regularization parameter. Moreover, we introduce the bonus term Γwhich adapts the counter-based bonus terms in the literature [8, 31], where 0 is to be determined later.

Using the estimated transition kernels , the estimated reward/utility functions , and the bonus terms , we now can estimate the action-value function via line 7 of Algorithm 3 for any (summarize the above procedure in Algorithm 3. Using already estimated we execute the policy improvement and the dual update in Algorithm 1.

Similar to Theorem 1, we provide theoretical guarantees in Theorem 2; see Appendix C.2 for the proof. Theorem 2 improves (|S|, |A|) dependence in Theorem 1 for the tabular case and also improves |S| dependence in [26]. It is worthy mentioning our Algorithm 1 is generic in handling an infinite state space.

Theorem 2 (Tabular Case: Regret and Constraint Violation). Let Assumption 1 hold and let Assumption 2 hold with feature maps (8). Fix . In Algorithm 1, we set is an absolute constant. Then, with probability 1 , the regret and the constraint violation in (3) satisfy

6 Concluding Remarks

In this paper, we have developed a provably efficient safe reinforcement learning algorithm in the linear MDP setting. The algorithm extends the proximal policy optimization to the constrained MDP and incorporates the UCB exploration. We prove that the proposed algorithm obtains an ) regret and an ) constraint violation under mild regularity conditions where T is the total number of steps taken by the algorithm. Moreover, our algorithm works in settings where reward/utility functions are given by bandit feedback. To the best of our knowledge, our algorithm is the first provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting. We hope that our work provides a step towards a principled way to design efficient safe reinforcement learning algorithms.

References

[1] Y. Abbasi-Yadkori, D. P´al, and C. Szepesv´ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] N. Abe, P. Melville, C. Pendus, C. K. Reddy, D. L. Jensen, V. P. Thomas, J. J. Bennett, G. F. Anderson, B. R. Cooley, M. Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In International Conference on Knowledge Discovery and Data Mining, pages 75–84, 2010.

[3] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, volume 70, pages 22–31, 2017.

[4] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

[5] E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.

[6] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man´e. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[7] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. F. Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.

[8] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.

[9] Q. Bai, A. Gattami, and V. Aggarwal. Model-free algorithm and regret analysis for MDPs with peak constraints. arXiv preprint arXiv:2003.05555, 2020.

[10] Y. Bai, T. Xie, N. Jiang, and Y.-X. Wang. Provably efficient Q-learning with low switching cost. In Advances in Neural Information Processing Systems, pages 8002–8011, 2019.

[11] A. Beck. First-order methods in optimization. SIAM, 2017.

[12] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.

[13] S. Bhatnagar and K. Lakshmanan. An online actor–critic algorithm with function approxima- tion for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.

[14] V. S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.

[15] J. A. Boyan. Least-squares temporal difference learning. In International Conference on Machine Learning, pages 49–56, 1999.

[16] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.

[17] K. Brantley, M. Dudik, T. Lykouris, S. Miryoosefi, M. Simchowitz, A. Slivkins, and W. Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051, 2020.

[18] Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.

[19] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.

[20] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.

[21] Y. Chow, O. Nachum, A. Faust, M. Ghavamzadeh, and E. Duenez-Guzman. Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031, 2019.

[22] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.

[23] D. Ding, K. Zhang, T. Ba¸sar, and M. R. Jovanovi´c. Natural policy gradient primal-dual method for constrained Markov decision processes. In Advances in Neural Information Processing Systems, 2020.

[24] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.

[25] G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learn- ing. arXiv preprint arXiv:1904.12901, 2019.

[26] Y. Efroni, S. Mannor, and M. Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.

[27] M. El Chamie, Y. Yu, B. A¸cıkme¸se, and M. Ono. Controlled Markov processes with safety state constraints. IEEE Transactions on Automatic Control, 64(3):1003–1018, 2018.

[28] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

[29] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.

[30] J. Garcıa and F. Fern´andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[31] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

[32] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.

[33] S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.

[34] A. Koppel, K. Zhang, H. Zhu, and T. Ba¸sar. Projected stochastic primal-dual method for constrained online learning with kernels. IEEE Transactions on Signal Processing, 67(10):2528– 2542, 2019.

[35] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

[36] T. Lattimore and C. Szepesvari. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.

[37] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In International Conference on Machine Learning, pages 615–622, 2010.

[38] H. Le, C. Voloshin, and Y. Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712, 2019.

[39] Q. Liang, F. Que, and E. Modiano. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480, 2018.

[40] B. Liu, Q. Cai, Z. Yang, and Z. Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10564–10575, 2019.

[41] Y. Liu, J. Ding, and X. Liu. IPO: Interior-point policy optimization under constraints. arXiv preprint arXiv:1910.09615, 2019.

[42] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.

[43] A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020, 2020.

[44] A. Nedi´c and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.

[45] S. Paternain, M. Calvo-Fullana, L. F. Chamon, and A. Ribeiro. Safe policies for reinforcement learning via primal-dual methods. arXiv preprint arXiv:1911.09101, 2019.

[46] S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro. Constrained reinforcement learn- ing has zero duality gap. In Advances in Neural Information Processing Systems, pages 7553– 7563, 2019.

[47] B. ´A. Pires and C. Szepesv´ari. Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pages 121–151, 2016.

[48] S. Qiu, X. Wei, Z. Yang, J. Ye, and Z. Wang. Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions. Preprint, 2020.

[49] G. J. Schell, W. J. Marrero, M. S. Lavieri, J. B. Sussman, and R. A. Hayward. Data-driven Markov decision process approximations for personalized hypertension treatment planning. MDM Policy & Practice, 1(1), 2016.

[50] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[51] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[52] R. Singh, A. Gupta, and N. B. Shroff. Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.

[53] A. Stooke, J. Achiam, and P. Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. arXiv preprint arXiv:2007.03964, 2020.

[54] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT press, 2018.

[55] C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2019.

[56] M. Turchetta, F. Berkenkamp, and A. Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320, 2016.

[57] E. Uchibe and K. Doya. Constrained reinforcement learning from intrinsic and extrinsic re- wards. In International Conference on Development and Learning, pages 163–168, 2007.

[58] B. Van Roy and S. Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.

[59] A. Wachi and Y. Sui. Safe reinforcement learning in constrained Markov decision processes. arXiv preprint arXiv:2008.06626, 2020.

[60] A. Wachi, Y. Sui, Y. Yue, and M. Ono. Safe exploration and optimization of constrained MDPs using Gaussian processes. In AAAI Conference on Artificial Intelligence, 2018.

[61] L. Wang, Q. Cai, Z. Yang, and Z. Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.

[62] X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints. arXiv preprint arXiv:1908.00305, 2019.

[63] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.

[64] L. F. Yang and M. Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.

[65] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge. Projection-based constrained policy optimization. In International Conference on Learning Representations, 2019.

[66] H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.

[67] J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, pages 6137–6146, 2018.

[68] D. Zhou, J. He, and Q. Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.

A Preliminaries

Our analysis begins with decomposition of the regret given in (3).

where we add and subtract the value ) estimated from an optimistic policy evaluation by Algorithm 2; the policy in hindsight is the best policy in hindsight for problem (2). To bound the total regret (12), we would like to analyze (R.I) and (R.II) separately.

First, we define the model prediction error for the reward as

for all (], which describes the prediction error in the Bellman equations (1) using . With this notation, we expand (R.I) into

where the first double sum is linear in terms of the policy difference and the second one describes the total model prediction error. The above expansion is based on the standard performance difference lemma (see Lemma 10) and we provide a proof in Section D.1 for readers’ convenience. Meanwhile, if we define the model prediction error for the utility as

then, similarly, we can expand

To analyze the constraint violation, we also introduce an useful decomposition,

which the inserted value ) is estimated from an optimistic policy evaluation by Algorithm 2.

For notational simplicity, we introduce the underlying probability structure as follows. For any (], we define -algebra generated by state-action sequences, reward and utility functions,

Here, is a null state for any A filtration is a sequence of in terms of time index

which holds that The estimated reward/utility value functions, , and the associated Q-functions, -measurable since they are obtained from previous 1 historical trajectories. With these notations, we can expand (R.II) in (12) into

where is a martingale adapted to the filtration in terms of time index t. Similarly, we have it for (V.II),

where is a martingale adapted to the filtration in terms of time index t. We prove (19) in Section D.2 for completeness (also see [18, Lemma 4.2]); (20) is similar.

in the action-value function estimation of Algorithm 2. By the UCB argument, if we set and is an absolute constant, then for any ((

with probability 12 where the symbol . We prove (21) in Section D.3 for completeness.

In what follows we delve into the analysis of the regret and the constraint violation.

B Proof of Regret and Constraint Violation

The goal is to prove that the regret and the constraint violation for Algorithm 1 are sublinear in the total number of steps: T := KH, taken by the algorithm, where K is the total number of episodes and H is the horizon length. We recall that |A| is the size of action space A and d is the feature map’s dimension. We repeat Theorem 1 here for readers’ convenience.

Theorem 1 (Linear Kernal MDP: Regret and Constraint Violation). Let Assumptions 1 and 2 hold. Fix in Algorithm 1 with the full-information setting, where is an absolute constant. Suppose log . Then, the regret and the constraint violation in (3) satisfy

We divide the proof into two parts for the regret bound and the constraint violation, respectively, in Section B.1 and Section B.2.

Our analysis begins with a primal-dual mirror descent type analysis for the policy update in line 6 of Algorithm 1. In Lemma 2, we present a key upper bound on the total differences of estimated values ) given by Algorithm 2 to the optimal ones.

Lemma 2 (Policy Improvement: Primal-Dual Mirror Descent Step). Let Assumption 1 and Assumption 2 hold. In Algorithm 1, if we set

Proof. We recall that line 6 of Algorithm 1 follows a solution to the following subproblem,

where we use the shorthandthe shorthand )) if dependence on the state-action sequence is clear from context. We note that (23) is in form of a mirror descent subproblem in Lemma 14. We can apply the pushback property with

Equivalently, we write the above inequality as follows,