b

DiscoverSearch
About
My stuff
Provably Efficient Safe Exploration via Primal-Dual Policy Optimization
2020·arXiv
Abstract
Abstract

We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing SRL algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an �O(dH2.5√T) regret and an �O(dH2.5√T) constraint violation, where d is the dimension of the feature mapping, H is the horizon of each episode, and T is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state-action space only through the dimension of the feature mapping and thus our results hold even when the number of states goes to infinity. To the best of our knowledge, we provide the first provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.

Reinforcement Learning (RL) studies how an agent learns to maximize its expected total reward by interacting with an unknown environment over time [54]. Safe RL (SRL) involves extra restrictions arising from real-world problems [30, 6, 25]. Examples include collision-avoidance in self-driving cars [29], switching cost limitations in medical applications [10], and legal and business restrictions in financial management [2]. A standard environment model for SRL is the Constrained Markov Decision Process (CMDP) [5] that extends the classical MDP by adding an extra safety-related utility, and translates the safety demand into a constraint on the expected total utility [3]. The presence of safety constraints makes the celebrated exploration-exploitation trade-off more challenging.

Many existing SRL algorithms for CMDPs are policy-based, and primal-dual type constrained optimization methods are commonly used to deal with policy constraints, e.g., Constrained Policy Gradient (CPG) [57], Lagrangian-based Actor-Critic (AC) [14, 13, 19, 55, 39], Primal-Dual Policy Optimization (PDPO) [46, 45, 53], Constrained Policy Optimization (CPO) [3, 65], and Reward Constrained Policy Optimization (RCPO) [55]. These SRL algorithms either do not have a theoretical guarantee or can only be shown to converge asymptotically in the batch offline setting. It is imperative to study theoretical guarantees for SRL algorithms regarding computational/statistical efficiency.

In this work, we look at a more challenging problem of finding a sequence of policies in response to streaming samples of reward functions, constraint functions, and transition. We seek to provide theoretical guarantees on the convergence rate of approaching the best-fixed policy in hindsight as well as feasibility region formed by constraints. Unlike scenarios where the safety constraint is known a priori [56, 12, 22, 60, 20, 21, 59], we have to explore the unknown environment and adapt the policy to the safety constraint with the minimal violation [6, 30]. We refer this task as safe exploration. Recent policy-based SRL algorithms for CMDPs, e.g., CPO [3, 65] and PDPO [46, 45], seek a single safe policy via the constrained policy optimization whose sample efficiency guarantees over streaming or time-varying data are largely unknown. In this paper, we aim to answer a theoretical question:

Can we design a provably sample efficient online policy optimization algorithm for CMDP?

Contribution. We propose a provably efficient SRL algorithm for CMDP with an unknown transition model in the linear episodic setting – the Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm – where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. Theoretically, we prove that the proposed algorithm achieves an �O(dH2.5√T) regret and the same �O(dH2.5√T) constraint violation, where d is the dimension of the feature mapping, H is the horizon of each episode, and T is the total number of steps. We establish these bounds in the setting where the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state space only through the dimension of the feature mapping and thus hold even when the number of states goes to infinity. To the best of our knowledge, our result is the first provably efficient online policy optimization for CMDP with safe exploration in the function approximation setting.

Related Work. Our work is related to a line of provably efficient RL algorithms based on linear function approximation where the exploration is achieved by adding an Upper Confidence Bound (UCB) bonus [63, 64, 32]. A closely-related recent work is Proximal Policy Optimization (PPO) [51]. As is shown in [40, 18], PPO converges to the optimal policy sublinearly and an optimistic variant of PPO is sample efficient with UCB exploration in the linear setting. However, such results only hold for unconstrained RL problems. Our work seeks to extend an optimistic variant of PPO for CMDP with UCB exploration and establish theoretical efficiency guarantees. For the large CMDP with unknown transition models, there is a line of works that relates to the policy optimization under constraints, e.g., CPG [57], CPO [3, 65], RCPO [55], and IPPO [41]. However, their theoretical guarantees are still lacking. By contrast, our work is supported by theoretical efficiency guarantees.

Recent results on learning CMDPs with unknown transitions and rewards are closely related. As we know, most of them are model-based and only apply for finite state-action spaces. The works [52, 26] independently leverage upper confidence bound (UCB) on fixed reward, utility, and transition probability to propose sample-efficient algorithms for tabular CMDPs; [52] obtains an O(�|A|T 1.5 log T) regret and constraint violation via linear program in the average cost setup in time T; [26] obtains an  O(|S|�|S||A|H4T) regret and constraint violation in the episodic setting via linear program and primal-dual policy optimization, where S is a state space and A is an action space. In [48], the authors study an adversarial stochastic shortest path problem under constraints and unknown transitions with  O(|S|�|A|H2T) regret and constraint violation. The work [9] extends Q-learning with optimism for finite state-action CMDPs with peak constraints. The work [17] proposes UCB-based convex planning for episodic tabular CMDPs in dealing with convex or hard constraints. In contrast, our proposed algorithm can potentially be applied to infinite state space scenarios with sublinear regret and constraint violation bounds only depending on the implicit dimension as opposed to the true dimension of the state space.

We consider an episodic Markov decision process (MDP) – MDP(S, A, H, P, r) – where S is a state space, A is an action space, H is a fixed length of each episode,  P = {Ph}H1is a collection of transition probability measures, and  r = {rh}Hh = 1 is a collection of reward functions. We assume that S is a measurable space with possibly infinite number of elements. Moreover, for each step h ∈ [H], Ph( · | x, a) is a transition kernel over next state if action a is taken for state  x and rh:S ×A → [0,1] is a reward function. The constrained MDP – CMDP(S, A, H, P, r, g) – also contains utility functions  g = {gh}Hh = 1 where gh: S × A → [0,1] is the utility function. We assume that reward/utility are deterministic. Our analysis readily generalizes to the setting where reward and utility are random.

Let the policy space ∆(A | S, H) be {{πh( · | · )}Hh = 1: πh( · | x) ∈ ∆(A), ∀x ∈ S and h ∈ [H]},where ∆(A) denotes a probability simplex over the action space. Let  πk ∈ ∆(A | S, H) be a policy taken by the agent at episode  k, where πkh( · | xkh): S → Ais the action that the agent takes at state xkh. For simplicity, we assume the initial state  xk1 to be fixed as  x1in different episodes; it readily generalizes to be random according to some distribution. The agent interacts with the environment in the kth episode as follows. At the beginning, the agent determines a policy  πk. Then, at each step  h ∈ [H], the agent observes the state  xkh ∈ S, determines an action  akh following the policy πkh( · | xkh), and receives a reward  rh(xkh, akh) together with an utility  gh(xkh, akh).Meanwhile, the MDP evolves into next state  xkh+1 drawing from the probability measure  Ph( · | xkh, akh). The episode terminates at state  xkH+1; when this happens, no control action is taken and both reward and utility functions are equal to zero. In this paper, we focus a challenging bandit setting where the agent only observes the values of reward/utility functions,  rh(xkh, akh), gh(xkh, akh), at visited state-action pair (xkh, akh); see examples in [49, 27, 45]. We assume that reward/utility functions are fixed over episodes.

Given a policy  π ∈ ∆(A | S, H), the value function  V πr,h associated with the reward function r

at each step h are the expected values of total rewards,

image

for all  x ∈ S, h ∈ [H], where the expectation  Eπis taken over the random state-action sequence {(xh, ah)}Hh = i; the action  ahfollows the policy  πh( · |xh) at the state  xhand the next state  xh+1follows the transition dynamics  Ph( · |xh, ah). Thus, the action-value function  Qπr,h(x, a): S×A → Rassociated with the reward function r is the expected value of total rewards when the agent starts from state-action pair (x, a) at step h and follows policy  π,

image

for all (x, a) ∈ S × A and h ∈ [H]. Similarly, we define the value function  V πg,h: S → R andthe action-value function  Qπg,h(x, a): S × A → Rassociated with the utility function g. Denote symbol  ⋄ = r or g. For brevity, we take the shorthand  PhV π⋄,h+1(x, a) := Ex′∼Ph( · | x,a)V π⋄,h+1(x′).The Bellman equations associated with a policy  πare given by

image

where  V π⋄,h(x) =�Qπ⋄,h (x, · ), πh( · | x)�A, for all (x, a) ∈ S × A. Fix x ∈ S, the inner product of a function  f: S × A → R with π( · | x) ∈ ∆(A) is ⟨f(x, · ), π( · | x)⟩A = �a ∈ A⟨f(x, a), π(a | x)⟩.

Learning Performance. The learning agent aims to find a solution of a constrained problem in which the objective function is the expected total rewards and the constraint is on the expected

image

where we take  b ∈ (0, H] to avoid triality. It is readily generalized to the problem with multiple constraints. Let  π⋆ ∈ ∆(A | S, H) be a solution to problem (2). Since the policy  π⋆ is computed from knowing all reward and utility functions, it is commonly referred to as an optimal policy in-hindsight.

The associated Lagrangian of problem (2) is given by  L(π, Y ) := V πr,1(x1)+Y (V πg,1(x1)−b), whereπis the primal variable and  Y ≥0 is the dual variable. We can cast problem (2) into a saddle-point problem: maximizeπ ∈ ∆(A | S,H) minimizeY ≥0 L(π, Y), which is convex in Y and is non-concave in πdue to the non-concavity of value functions in policy space [4]. To address the non-concavity, we will exploit constrained RL problem structure to propose a new variant of Lagrange multipliers method for constrained RL problems in Section 3, which warrants a new line of primal-dual mirror descent type analysis in sequel. It is different from unconstrained RL analysis, e.g., [4, 18].

Another key feature of constrained RL is the safe exploration under constraints [30]. Without any constraint information a priori, it is infeasible for each policy to satisfy the constraint since utility information in constraints is only revealed after a policy is decided. Instead, we allow each policy to violate the constraint in each episode and minimize regret while minimizing total constraint violations for safe exploration over K episodes; aslo see [26]. We define the regret as the difference between the total reward value of policy  π⋆ in hindsight and that of the agent’s policy  πkover K episodes, and the constraint violation as a difference between the offset Kb and the total utility value of the agent’s policy  πk over K episodes,

image

In this paper, we design algorithms, taking bandit feedback of the reward/utility functions, with both regret and constraint violation being sublinear in the total number of steps T := HK. Put differently, the algorithm should ensure that given  ǫ > 0, if T = O(1/ǫ2), then both Regret(K) = O(ǫ) and Violation(K) = O(ǫ) hold with high probability.

To achieve sublinear regret and constraint violation, we assume standard Slater condition for problem (2) and recall strong duality; see the proof in [46, 45], and boundedness of the optimal dual variable. The dual function is D(Y ) := maximizeπ L(π, Y) and the optimal dual variable Y ⋆ := argminY ≥ 0 D(Y ).

Assumption 1 (Slater Condition). There exists γ > 0 and ¯π ∈ ∆(A | S, H) such that V ¯πg,1(x1) ≥b + γ.

Lemma 1 (Strong Duality, Boundedness of  Y ⋆). Let Assumption 1 hold. Then V π⋆r,1 (x1) = D(Y ⋆).Moreover, 0 ≤ Y ⋆ ≤ (V π⋆r,1 (x1) − V ¯πr,1(x1))/γ.

The Slater condition is mild in practice and commonly adopted in previous works; e.g., [46, 26, 48, 23]. The implied properties for problem (2) by Slater condition will be useful in our algorithm design and analysis.

Linear Function Approximation. We focus on CMDPs with linear transition kernels in feature maps.

Assumption 2. The CMDP(S, A, H, P, r, g) is a linear MDP with a kernal feature map  ψ: S ×A × S → Rd1 and a value feature map  ϕ: S × A → Rd2, if for any  h ∈ [H], there exists a vector θh ∈ Rd1 with ∥θh∥2 ≤ √d1such that for any (x, a, x′) ∈ S × A × S,

image

where max(∥θr,h∥2 , ∥θg,h∥2) ≤ √d2. Moreover, we assume that for any function  V : S → [0, H],���S ψ(x, a, x′)V (x′)dx′��2 ≤ √d1 H for all (x, a) ∈ S × A, and max(d1, d2) ≤ d.

Assumption 2 adapts the definition of linear kernel MDP [7, 68, 18] for CMDPs. Linear kernel MDP examples include tabular MDPs [68], feature embedded transition models [64], and linear combinations of base models [43]. We can construct related examples of CMDPs with linear structure by adding adding proper constraints. For usefulness of linear structure, see discussions in [24, 58, 36]. For more general transition dynamics, see factored linear model [47].

Although our definition in Assumption 2 and linear MDPs [63, 32] all include tabular MDPs as special cases, they define transition dynamics using different feature maps. They are not comparable since one cannot be implied by the other [68]. The design of provably efficient RL algorithms beyond these linear type CMDPs remains open.

In Algorithm 1, we present a new variant of Proximal Policy Optimization [51] – an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm. We effectuate the optimism through the Upper-Confidence Bounds (UCB) and address the constraints using a new variant of Lagrange multipliers method combined with the performance difference lemma; see Lemma 10 in Appendix F or works [50, 51, 18]. In each episode, our algorithm consists of three main stages. The first stage (lines 4–8) is Policy Improvement: we receive a new policy  πk by improving previous  πk−1 via a mirror descent type optimization; The second stage (line 9) is Dual Update: we update dual variable  Y k based on the constraint violation induced by previous policy  πk; Thethird stage (line 10) is Policy Evaluation: we optimistically evaluate newly obtained policy via the least-squares policy evaluation with an additional UCB bonus term for exploration.

image

Policy Improvement. In the k-th episode, a natural attempt of obtaining a policy  πk is to solve

a Lagrangian-based policy optimization problem,

image

where  L(π, Y) is the Lagrangian and the dual variable  Y k−1 ≥0 is from the last episode. We will show later that  Y k−1 can be updated efficiently. This type update is used by several works, e.g. [38, 46, 45, 55]. The issue with them is that it relies on an oracle solver involving either Q-learning [28], PPO [51], or TRPO [50] to deliver a near-optimal policy. Thus, the overall algorithmic complexity is expensive and it is not suitable for the online use. In contrast, this work utilizes RL problem structure and show that only an easily-computable proximal step is sufficient by establishing efficiency guarantees without using any oracle solvers.

Denote symbol  ⋄ = r or g. Via the performance difference lemma, we can expand value functions V π⋄,1(x1) at the previously known policy  πk−1,

image

where  Eπk−1is taken over the random state-action sequence  {(xh, ah)}H1 . Thus, we introduce an approximation of  V π⋄,1(x1) for any state-action sequence  {(xh, ah)}H1 induced by π,

image

where  V k−1⋄,h and Qk−1⋄,h can be estimated from an optimistic policy evaluation that will be discussed later. With this notion, in each episode, instead of solving a Lagrangian-based policy optimization, we perform a simple policy update in online mirror descent fashion,

image

where  �πk−1h ( · | xh) = (1 − θ) πk−1h ( · | xh) + θ Unif(A) is a mixed policy of the previous one and the uniform distribution Unif(A) with θ ∈ (0,1]. The constant  α >0 is a trade-off parameter, D(π | �πk−1) is the KL divergence between  π and �πk−1 in which πis absolutely continuous in  �πk−1.The policy mixing step ensures such absolute continuity and implies uniformly bounded KL divergence (see Lemma 15 in Appendix F). Ignoring other  π-irrelevant terms, we update  πk in terms of previous policy  πk−1 by

image

Since the above update is separable over states  {xh}H1 , we can update the policy  πk as line 6in Algorithm 1, a closed solution for any step  h ∈ [H]. If we set Y k−1 = 0 and θ = 0, theabove update reduces to one step in an optimistic variant of PPO [18]. The idea of KL-divergence regularization in policy optimization has been widely used in many unconstrained scenarios, e.g., NPG [33], PPO [51], TRPO [50], and their neural variants [61, 40]. Our method is distinct from them in that it is based on the performance difference lemma and the optimistically estimated value functions.

Dual Update. To infer the constraint violation for the dual update, we esimate  V k−1g,1 (x1) forV πkg,1 (x1) via an optimistic policy evaluation that will be discussed later. We update the Lagrange multiplier  Y ≥0 by moving  Y k to the direction of minimizing the Lagrangian  L(π, Y ) over Y inline 9 in Algorithm 1, where  η >0 is a stepsize and Proj[ 0, χ ]is a projection onto [0, χ] with anupper bound  χ on Y k. By Lemma 1, we choose  χ = 2H/γ ≥ 2Y ⋆ so that projection interval [ 0, χ ]includes the optimal dual variable  Y ⋆; also see works [26, 44].

The dual update works as a trade-off between the reward maximization and the constraint violation reduction. If the current policy  πk satisfies the approximated constraint, i.e.,  b−Lk−1g (πk) ≤ 0,we put less weight on the action-value function associated with the utility and maximize the reward; otherwise, we sacrifice the reward a bit to satisfy the constraint. The dual update has a similar use in dealing with constraints in CMDP, e.g., Lagrangian-based AC [19, 39], and online constrained optimization problems [66, 62, 67]. In contrast, we handle the dual update via the optimistic policy evaluation that yields an arguably efficient approximation of constraint violation.

image

Policy Evaluation. The last stage of the kth episode takes the Least-Squares Temporal Difference (LSTD) [16, 15, 37, 35] to evaluate the policy  πk based on previous  k −1 historical trajectories. For each step  h ∈ [H], instead of  PhV πkr,h+1 in the Bellman equations (1), we estimate  PhV kr,h+1 by(φkr,h)⊤wkr,h where wkr,h is updated by the minimizer of the regularized least-squares problem over

image

where  φτr,h( · , · ) :=�S ψ( · , · , x′ )V τr,h+1(x′)dx′, V τr,h+1(·) = ⟨Qτr,h+1( · , · ), πτh+1( · | · )⟩A for h ∈ [H−1]and  V τH+1 = 0, and λ >0 is the regularization parameter. Similarly, we estimate  PhV kg,h+1 by(φkg,h)⊤wkg,h. We display the least-squares solution in line 4–6 of Algorithm 2 where symbol  ⋄ =r or g. We also estimate  rh(·, ·) by ϕ⊤ukr,h where ukr,h is updated by the minimizer of another regularized least-squares problem,

image

where  λ >0 is the regularization parameter. Similarly, we estimate  gh(·, ·) by ϕ⊤ukg,h. The least- squares solutions lead to line 8–9 of Algorithm 2.

image

value function  {Qk⋄,h}Hh = 1 iteratively in line 11 of Algorithm 2, where  ϕ⊤uk⋄,h is an estimate of  ⋄h and(φk⋄,h)⊤wk⋄,h is an estimate of  PhV k⋄,h+1; we add UCB bonus terms Γkh( · , · ), Γk⋄,h( · , · ): S × A → R+so that

image

all become upper confidence bounds for exploration. We take Γkh = β(ϕ⊤(Λkh)−1ϕ)1/2 and Γk⋄,h =β((φk⋄,h)⊤(Λk⋄,h)−1φk⋄,h)1/2and leave the parameter  β >0 to be tuned later. Moreover, the bounded reward/utility  ⋄h ∈ [0,1] implies  Qk⋄,h ∈ [0, H − h + 1].

We remark the computational efficiency of Algorithm 1. For the time complexity, since line 6 is a scalar update, they need O(d|A|T) time. A dominating calculation is from lines 5/9 in Algorithm 2. If we use the Sherman–Morrison formula for computing (Λkh)−1, it takes O(d2T) time.Another important calculation is the integration from line 6 in Algorithm 2. We can either compute it analytically if it is tractable or approximate it via the Monte Carlo integration efficiently [68] that assumes O(dT) time. Therefore, the time complexity is  O(d2|A|T) in total. For the space complexity, we don’t need to store policy since it is recursively calculated via (4). By updating  Y k,Λkh, Λk⋄,h, wk⋄,h, uk⋄,h, and ⋄h(xkh, akh) recursively, it takes  O((d2 + |A|)H) space.

We now prove that the regret and the constraint violation for Algorithm 1 are sublinear in T := KH; T is the total number of steps taken by the algorithm; K is the total number of episodes; H is the episode horizon. We recall that |A| is the cardinality of action space A and d is the dimension of the feature map.

Theorem 1 (Linear Kernal MDP: Regret and Constraint Violation). Let Assumptions 1 and 2 hold. Fix  p ∈ (0, 1). We set α =�log |A|/(H2K), β = C1�dH2 log (dT/p), η = 1/√K, θ = 1/K,and  λ = 1in Algorithm 1, where  C1is an absolute constant. Suppose log  |A| = O�d2 log2 (dT/p)�.

image

The above result establishes that OPDOP enjoys an �O(dH2.5√T) regret and an �O(dH2.5√T)constraint violation if we set algorithm parameters  {α, β, η, θ, λ}properly. Our results have the optimal dependence on the total number of steps T up to some logarithmic factors. The d dependence occurs due to the uniform concentration for controlling the fluctuations in the least-squares policy evaluation. This matches the existing bounds in the linear MDP setting without any constraints [18, 7, 68]. Our bounds differ from them only by an extra H, which is a price introduced by the uniform bound on the constraint violation. It is noticed that we don’t observe reward/utility functions, our algorithm works for bandit feedback after each episode.

We remark the tabular setting for Algorithm 1; see Appendix C for details. The tabular CMDP is a special case of Assumption 2 by taking canonical bases as feature mappings; see them in Section 5. The feature map has dimension  d = |S|2|A|and thus Theorem 1 automatically provides  O(|S|2|A|H2.5√T) regret and constraint violation for the tabular CMDPs. The  d = |S|2|A|dependence relies on the least-squares policy evaluation and it can be improved via other optimistic policy evolution methods if we are limited to the tabular case. We provide such results in Section 5.

4.1 Proof Outline of Theorem 1

We sketch the proof for Theorem 1. We state key lemmas and delay their full versions and complete proofs to Appendix B. In what follows, we fix  p ∈ (0,1) and use the shorthand w.p. for with probability.

Regret Analysis. We take a regret decomposition,

image

where (R.I) = �Kk = 1�V π⋆r,1 (x1)−V kr,1(x1)�in which π⋆ is an optimal policy in hindsight, and (R.II) = �Kk = 1�V kr,1(x1)−V πkr,1 (x1)�in which V kr,1(x1) is estimated via our optimistic policy evaluation given by Algorithm 2. As we use  V kr,h+1 to estimate  V πkr,h+1, it leads a model prediction error in the Bellman equations,  ιkr,h := rkh + PhV kr,h+1 − Qkr,h; similarly define  ιkg,h. In Appendix D.3, the UCB optimism of  ιk⋄,h with ⋄ = r or g, shows that or any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A, w.p.1  − p/2, we have

image

By assumptions of Theorem 1, the policy improvement (4) yields Lemma 1, depicting total differences of estimates  V kr,1(x1), V kg,1(x1) to the optimal ones.

Lemma 1 (Policy Improvement: Primal-Dual Mirror Descent Step). Let assumptions of Theorem 1

image

Lemma 1 displays a coupling between the regret and the constraint violation. This coupling also finds use in online convex optimization [42, 67, 62, 34] and CMDP problems [26]. The proof of Lemma 1 takes a primal-dual mirror descent type analysis of line 6 of Algorithm 1 by utilizing the performance difference lemma. Our primal-dual mirror descent analysis enable standard ones in [51, 40, 18] to deal with constraints.

Via the dual update (5), we show the second total differences  − �Kk = 1 Y k�V π⋆g,1(x1) − V kg,1(x1)�scales  O(√K). Together with a decomposition of (R.II), (R.II) =  − �Kk = 1�Hh = 1 ιkr,h(xkh, akh) +MKr,H,2, where MKr,H,2 is a martingale, we have Lemma 2.

Lemma 2. Let assumptions of Theorem 1 hold. Then, Regret(K) ≤ O(H2.5�T log |A|) +�Kk = 1�Hh = 1�Eπ⋆[ιkr,h(xh, ah)]−ιkr,h(xkh, akh)�+ MKr,H,2.

Finally, we note that  MKr,H,2 is a martingale that scales as  O(H√T) via the Azuma-Hoeffding inequality. For the model prediction error, we use the UCB optimism and apply the elliptical potential lemma.

Lemma 3. Let assumptions of Theorem 1 hold. Then, �Kk = 1�Hh = 1�Eπ⋆[ιkr,h(xh, ah)] −ιkr,h(xkh, akh)�≤ O(dH1.5�T log (K) log (dT/p)), w.p. 1 − p/2.

image

Applying probability bounds from Lemmas 3 and 4 to Lemma 2 yields our regret bound.

Constraint Violation Analysis. We take a violation decomposition,

image

where (V.II) = �Kk = 1�V kg,1(x1) − V πkg,1 (x1)�. We still begin with the policy improvement (4) to have Lemma 5, a refined result from Lemma 1.

Lemma 5 (Policy Improvement: Refined Primal-Dual Mirror Descent Step). Let assumptions of

image

Lemma 5 separates the dual update  Y k in the second total differences in Lemma 1. We prove Lemma 5 by combining Lemma 1 with the UCB optimism and a change of variable of  Y k from thedual update (5).

image

martingale. By adding (V.II) to the inequality in Lemma 5 with multiplier  Y ≥0, and also adding (R.II) to it, then taking  Y = 0 if �Kk = 1�b − V πkg,1 (x1)�≤0; otherwise  Y = χ, w.p. 1 − p, we have,

image

where  V π′r,1(x1) = 1K�Kk = 1 V πkr,1 (x1) and V π′g,1(x1) = 1K�Kk = 1 V πkg,1 (x1) for a policy  π′. Here, webound Γkh + Γk⋄,h and MK⋄,H,2 as done in Lemmas 3 and 4.

Last, by strong duality, applying the constraint violation bound from Lemma 9 in Appendix E leads to�b − V π′g,1(x1)�+ ≤ O(dH2.5√T log (dT/p) /(χK)). This gives our desired violation bound.

The tabular CMDP(S, A, H, P, r, g) is a special case of Assumption 2 with  |S| < ∞ and |A| < ∞.Let  d1 = |S|2|A| and d2 = |S||A|. We take the following feature maps  ψ(x, a, x′) ∈ Rd1, ϕ(x, a) ∈Rd2, and parameter vectors,

image

where  e(x,a,x′)is a canonical basis of  Rd1 associated with (x, a, x′) and θh = Ph( · , · , ·) reads that for any (x, a, x′) ∈ S ×A×S, the (x, a, x′)th entry of  θh is P(x′ | x, a); similarly we define  e(x,a), θr,h,and  θg,h. We can verify that  ∥θh∥ ≤ √d1, ∥θr,h∥ ≤ √d2, ∥θg,h∥ ≤ √d2, and for any  V : S → [0, H]and any (x, a) ∈ S × A, we have���x′ ∈ S ψ(x, a, x′)V (x′)�� ≤�|S|H ≤ √d1H. Therefore, we take d := max (d1, d2) = |S|2|A|in Assumption (2) for the tabular case.

The proof of Theorem 1 is generic, since it is ready to achieve sublinear regret and constraint violation bounds as long as the policy evaluation is sample-efficient, e.g., the UCB design of ‘optimism in the face of uncertainty.’ In what follows, we introduce another efficient policy evaluation for line 10 of Algorithm 1 in the tabular case. Let us first introduce some notation. For any (h, k) ∈ [H] × [K], any (x, a, x′) ∈ S × A × S, and any (x, a) ∈ S × A, we define two visitation counters  nkh(x, a, x′) and nkh(x, a) at step hin episode k,

image

This allows us to estimate reward function r, utility function g, and transition kernel  Phfor episode

image

for all (x, a, x′) ∈ S × A × S, (x, a) ∈ S × A where λ >0 is the regularization parameter. Moreover, we introduce the bonus term Γkh: S × A → R, Γkh(x, a) = β�nkh(x, a) + λ�−1/2which adapts the counter-based bonus terms in the literature [8, 31], where  β >0 is to be determined later.

Using the estimated transition kernels  {�Pkh}Hh = 1, the estimated reward/utility functions {�rkh, �gkh}Hh = 1, and the bonus terms  {Γkh}Hh = 1, we now can estimate the action-value function via line 7 of Algorithm 3 for any (x, a) ∈ S ×A, where ⋄ = r or g. Thus, V k⋄,h(x) = ⟨Qk⋄,h(x, ·), πkh(· | x)⟩A. Wesummarize the above procedure in Algorithm 3. Using already estimated  {Qkr,h(·, ·), Qkg,h(·, ·)}Hh = 1,we execute the policy improvement and the dual update in Algorithm 1.

Similar to Theorem 1, we provide theoretical guarantees in Theorem 2; see Appendix C.2 for the proof. Theorem 2 improves (|S|, |A|) dependence in Theorem 1 for the tabular case and also improves |S| dependence in [26]. It is worthy mentioning our Algorithm 1 is generic in handling an infinite state space.

Theorem 2 (Tabular Case: Regret and Constraint Violation). Let Assumption 1 hold and let Assumption 2 hold with feature maps (8). Fix  p ∈ (0, 1). In Algorithm 1, we set  α =�log |A|/(H2K),β = C1H�|S| log(|S||A|T/p), η = 1/√K, θ = 1/K, and λ = 1 where C1is an absolute constant. Then, with probability 1  − p, the regret and the constraint violation in (3) satisfy

image

In this paper, we have developed a provably efficient safe reinforcement learning algorithm in the linear MDP setting. The algorithm extends the proximal policy optimization to the constrained MDP and incorporates the UCB exploration. We prove that the proposed algorithm obtains an �O(√T) regret and an �O(√T) constraint violation under mild regularity conditions where T is the total number of steps taken by the algorithm. Moreover, our algorithm works in settings where reward/utility functions are given by bandit feedback. To the best of our knowledge, our algorithm is the first provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting. We hope that our work provides a step towards a principled way to design efficient safe reinforcement learning algorithms.

[1] Y. Abbasi-Yadkori, D. P´al, and C. Szepesv´ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] N. Abe, P. Melville, C. Pendus, C. K. Reddy, D. L. Jensen, V. P. Thomas, J. J. Bennett, G. F. Anderson, B. R. Cooley, M. Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In International Conference on Knowledge Discovery and Data Mining, pages 75–84, 2010.

[3] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, volume 70, pages 22–31, 2017.

[4] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

[5] E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.

[6] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man´e. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[7] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. F. Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.

[8] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.

[9] Q. Bai, A. Gattami, and V. Aggarwal. Model-free algorithm and regret analysis for MDPs with peak constraints. arXiv preprint arXiv:2003.05555, 2020.

[10] Y. Bai, T. Xie, N. Jiang, and Y.-X. Wang. Provably efficient Q-learning with low switching cost. In Advances in Neural Information Processing Systems, pages 8002–8011, 2019.

[11] A. Beck. First-order methods in optimization. SIAM, 2017.

[12] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.

[13] S. Bhatnagar and K. Lakshmanan. An online actor–critic algorithm with function approxima- tion for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.

[14] V. S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.

[15] J. A. Boyan. Least-squares temporal difference learning. In International Conference on Machine Learning, pages 49–56, 1999.

[16] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.

[17] K. Brantley, M. Dudik, T. Lykouris, S. Miryoosefi, M. Simchowitz, A. Slivkins, and W. Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051, 2020.

[18] Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.

[19] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.

[20] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.

[21] Y. Chow, O. Nachum, A. Faust, M. Ghavamzadeh, and E. Duenez-Guzman. Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031, 2019.

[22] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.

[23] D. Ding, K. Zhang, T. Ba¸sar, and M. R. Jovanovi´c. Natural policy gradient primal-dual method for constrained Markov decision processes. In Advances in Neural Information Processing Systems, 2020.

[24] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.

[25] G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learn- ing. arXiv preprint arXiv:1904.12901, 2019.

[26] Y. Efroni, S. Mannor, and M. Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.

[27] M. El Chamie, Y. Yu, B. A¸cıkme¸se, and M. Ono. Controlled Markov processes with safety state constraints. IEEE Transactions on Automatic Control, 64(3):1003–1018, 2018.

[28] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

[29] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.

[30] J. Garcıa and F. Fern´andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[31] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

[32] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.

[33] S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.

[34] A. Koppel, K. Zhang, H. Zhu, and T. Ba¸sar. Projected stochastic primal-dual method for constrained online learning with kernels. IEEE Transactions on Signal Processing, 67(10):2528– 2542, 2019.

[35] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

[36] T. Lattimore and C. Szepesvari. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.

[37] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In International Conference on Machine Learning, pages 615–622, 2010.

[38] H. Le, C. Voloshin, and Y. Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712, 2019.

[39] Q. Liang, F. Que, and E. Modiano. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480, 2018.

[40] B. Liu, Q. Cai, Z. Yang, and Z. Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10564–10575, 2019.

[41] Y. Liu, J. Ding, and X. Liu. IPO: Interior-point policy optimization under constraints. arXiv preprint arXiv:1910.09615, 2019.

[42] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.

[43] A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020, 2020.

[44] A. Nedi´c and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.

[45] S. Paternain, M. Calvo-Fullana, L. F. Chamon, and A. Ribeiro. Safe policies for reinforcement learning via primal-dual methods. arXiv preprint arXiv:1911.09101, 2019.

[46] S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro. Constrained reinforcement learn- ing has zero duality gap. In Advances in Neural Information Processing Systems, pages 7553– 7563, 2019.

[47] B. ´A. Pires and C. Szepesv´ari. Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pages 121–151, 2016.

[48] S. Qiu, X. Wei, Z. Yang, J. Ye, and Z. Wang. Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions. Preprint, 2020.

[49] G. J. Schell, W. J. Marrero, M. S. Lavieri, J. B. Sussman, and R. A. Hayward. Data-driven Markov decision process approximations for personalized hypertension treatment planning. MDM Policy & Practice, 1(1), 2016.

[50] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[51] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[52] R. Singh, A. Gupta, and N. B. Shroff. Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.

[53] A. Stooke, J. Achiam, and P. Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. arXiv preprint arXiv:2007.03964, 2020.

[54] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT press, 2018.

[55] C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2019.

[56] M. Turchetta, F. Berkenkamp, and A. Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320, 2016.

[57] E. Uchibe and K. Doya. Constrained reinforcement learning from intrinsic and extrinsic re- wards. In International Conference on Development and Learning, pages 163–168, 2007.

[58] B. Van Roy and S. Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.

[59] A. Wachi and Y. Sui. Safe reinforcement learning in constrained Markov decision processes. arXiv preprint arXiv:2008.06626, 2020.

[60] A. Wachi, Y. Sui, Y. Yue, and M. Ono. Safe exploration and optimization of constrained MDPs using Gaussian processes. In AAAI Conference on Artificial Intelligence, 2018.

[61] L. Wang, Q. Cai, Z. Yang, and Z. Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.

[62] X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints. arXiv preprint arXiv:1908.00305, 2019.

[63] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.

[64] L. F. Yang and M. Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.

[65] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge. Projection-based constrained policy optimization. In International Conference on Learning Representations, 2019.

[66] H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.

[67] J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, pages 6137–6146, 2018.

[68] D. Zhou, J. He, and Q. Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.

Our analysis begins with decomposition of the regret given in (3).

image

where we add and subtract the value  V kr,1(x1) estimated from an optimistic policy evaluation by Algorithm 2; the policy  π⋆ in hindsight is the best policy in hindsight for problem (2). To bound the total regret (12), we would like to analyze (R.I) and (R.II) separately.

First, we define the model prediction error for the reward as

image

for all (k, h) ∈ [K] × [H], which describes the prediction error in the Bellman equations (1) using V kr,h+1 instead of V πkr,h+1. With this notation, we expand (R.I) into

image

where the first double sum is linear in terms of the policy difference and the second one describes the total model prediction error. The above expansion is based on the standard performance difference lemma (see Lemma 10) and we provide a proof in Section D.1 for readers’ convenience. Meanwhile, if we define the model prediction error for the utility as

image

then, similarly, we can expand �Kk = 1�V π⋆g,1(x1) − V kg,1(x1)�into

image

To analyze the constraint violation, we also introduce an useful decomposition,

image

which the inserted value  V kg,1(x1) is estimated from an optimistic policy evaluation by Algorithm 2.

For notational simplicity, we introduce the underlying probability structure as follows. For any (k, h) ∈ [K] × [H], we define  Fkh,1 as a σ-algebra generated by state-action sequences, reward and utility functions,

image

image

Here,  xkH+1 is a null state for any  k ∈ [K].A filtration is a sequence of  σ-algebras{Fkh,m}(k,h,m) ∈ [K]×[H]×[2] in terms of time index

image

which holds that  Fkh,m ⊂ Fk′h′,m′ for any t ≤ t′.The estimated reward/utility value functions, V kr,h, V kg,h, and the associated Q-functions,  Qkr,h, Qkg,h are Fk1,1-measurable since they are obtained from previous  k −1 historical trajectories. With these notations, we can expand (R.II) in (12) into

image

where  {Mkr,h,m}(k,h,m)∈[K]×[H]×[2] is a martingale adapted to the filtration  {Fkh,m}(k,h,m)∈[K]×[H]×[2]in terms of time index t. Similarly, we have it for (V.II),

image

where  {Mkg,h,m}(k,h,m)∈[K]×[H]×[2] is a martingale adapted to the filtration  {Fkh,m}(k,h,m)∈[K]×[H]×[2]in terms of time index t. We prove (19) in Section D.2 for completeness (also see [18, Lemma 4.2]); (20) is similar.

image

in the action-value function estimation of Algorithm 2. By the UCB argument, if we set  λ = 1and  β = C1�dH2 log(dT/p) where C1is an absolute constant, then for any (k, h) ∈ [K] × [H] and(x, a) ∈ S × A, we have

image

with probability 1−p/2 where the symbol  ⋄ = r or g. We prove (21) in Section D.3 for completeness.

In what follows we delve into the analysis of the regret and the constraint violation.

The goal is to prove that the regret and the constraint violation for Algorithm 1 are sublinear in the total number of steps: T := KH, taken by the algorithm, where K is the total number of episodes and H is the horizon length. We recall that |A| is the size of action space A and d is the feature map’s dimension. We repeat Theorem 1 here for readers’ convenience.

Theorem 1 (Linear Kernal MDP: Regret and Constraint Violation). Let Assumptions 1 and 2 hold. Fix  p ∈ (0, 1). We set α =�log |A|/(H2K), β = C1�dH2 log (dT/p), η = 1/√K, θ = 1/K, andλ = 1in Algorithm 1 with the full-information setting, where  C1is an absolute constant. Suppose log  |A| = O�d2 log2 (dT/p)�. Then, the regret and the constraint violation in (3) satisfy

image

image

We divide the proof into two parts for the regret bound and the constraint violation, respectively, in Section B.1 and Section B.2.

image

Our analysis begins with a primal-dual mirror descent type analysis for the policy update in line 6 of Algorithm 1. In Lemma 2, we present a key upper bound on the total differences of estimated values  V kr,1(x1) and V kg,1(x1) given by Algorithm 2 to the optimal ones.

Lemma 2 (Policy Improvement: Primal-Dual Mirror Descent Step). Let Assumption 1 and Assumption 2 hold. In Algorithm 1, if we set  α =�log |A|/(H2√K) and θ = 1/K, then

image

Proof. We recall that line 6 of Algorithm 1 follows a solution  πk to the following subproblem,

image

where we use the shorthand�Qk−1r,h + Y k−1Qk−1g,h , πh�for�(Qk−1r,h + Y k−1Qk−1g,h )(xh, · ), πh( · | xh)�andthe shorthand  D(πh | �πk−1h ) for D(πh( · | xh) | �πk−1h ( · | xh)) if dependence on the state-action sequence {xh, ah}Hh = 1 is clear from context. We note that (23) is in form of a mirror descent subproblem in Lemma 14. We can apply the pushback property with  x⋆ = πkh, y = �πk−1h and z = π⋆h,

image

Equivalently, we write the above inequality as follows,

image

By taking expectation  Eπ⋆on both sides of (24) over the random state-action sequence  {(xh, ah)}H1starting from  x1, and applying decompositions (14) and (16), we have

image

The rest is to bound the right-hand side of the above inequality (25). By the H¨older’s inequality and the Pinsker’s inequality, we first have

image

Then, using the square completion,

image

where we dropoff the first quadratic term for the inequality, and���πk−1h − πk−1h ��1 ≤ θ, we have

image

where the last inequality is due to ��Qk−1r,h ��∞ ≤ H, a result from line 12 in Algorithm 2, and 0  ≤ Y k−1 ≤ χ. Taking the same expectation  Eπ⋆as previously on both sides of (26) and substituting it into the left-hand side of (25) yield,

image

≤ α(1 + χ)2H32 + θ (1 + χ) H2 + 1α

image

where in the second inequality we note the fact that  D(π⋆h | �πk−1h ) − D(π⋆h | πk−1h ) ≤ θ log |A| fromLemma 15.

We note that  Y 0 is initialized to be zero. By taking a telescoping sum of both sides of (27) from k = 1 to k = K + 1 and shifting the index k by one, we have

image

where we ignore  −α−1 �Hh = 1 Eπ⋆[D(π⋆h | πK+1h)] and utilize

image

Finally, we take  χ := H/γ and α, θin the lemma to complete the proof.

By the dual update of Algorithm 1, we can simplify the result in Lemma 2 and return back to the regret (12).

Lemma 3. Let Assumption 1 and Assumption 2 hold. In Algorithm 1, if we set  α =

image

Regret(K) = C3H2.5�T log |A| +

image

Proof. By the dual update in line 9 in Algorithm 1, we have

image

where we use the feasibility of  π⋆ in the last inequality. Since  Y 0 = 0 and |b − V k−1g,1 (x1)| ≤ H, theabove inequality implies that

image

By noting the UCB result (21) and  Y k ≥0, the inequality (22) implies that

image

Then, we add (30) to the above inequality and take  η = 1/√K,

image

where  C3is an absolute constant. Finally, we combine (19) and (31) to complete the proof.

By Lemma 3, the rest is to bound the last two terms in the right-hand side of (29). We next show two probability bounds for them in Lemma 4 and Lemma 5, separately.

Lemma 4 (Model Prediction Error Bound). Let Assumption 2 hold. Fix  p ∈ (0, 1). If we set β = C1�dH2 log (dT/p)in Algorithm 1, then with probability 1  − p/2it holds that

image

Proof. By the UCB result (21), with probability 1−p/2 for any (k, h) ∈ [K]×[H] and (x, a) ∈ S×A,we have

image

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A. Therefore, we have

image

where Γkh(·, ·) = β(ϕ(·, ·)⊤(Λkh)−1ϕ(·, ·))1/2 and Γkr,h(·, ·) = β(φkr,h(·, ·)⊤(Λkr,h)−1φkr,h(·, ·))1/2. Appli-cation of the Cauchy-Schwartz inequality shows that

image

Since we take  β = C1�dH2 log (dT/p) with C1 >1, we have  H/β ≤ 1.The rest is to apply Lemma 13. First, for any  h ∈ [H] it holds that

image

Therefore, we have

image

Similarly, we can show that

image

Applying the above inequalities (34) and (35) to (33) leads to

image

Finally, we set  β = C1�dH2 log (dT /p) and λ= 1 to obtain (32).

Lemma 5 (Matingale Bound). Fix p ∈ (0, 1). In Algorithm 1, it holds with probability 1−p/2 that

image

Proof. In the verification of (19) (see Section D.2), we introduce the following matingale,

image

where�Ikhf�(x) :=�f(x, ·), πkh(·|x)�.

image

This shows that  |Dkr,h,1|, |Dkr,h,2| ≤ 2H for all (k, h) ∈ [K]×[H]. Application of the Azuma-Hoeffding inequality yields,

image

For  p ∈ (0,1), if we set  s = 4H�T log (4/p), then the inequality (36) holds with probability at least 1  − p/2.

We now are ready to show the desired regret bound. Applying (32) and (36) to the right-hand side of the inequality (29), we have

image

with probability 1  − p where C1, C3are absolute constants. Then, with probability 1  − p it holdsthat Regret(K) ≤ CdH2.5√T log�dTp

where C is an absolute constant.

image

In Lemma 2, we have provided an useful upper bound on the total differences that are weighted by the dual update  Y k. To extract the constraint violation, we first refine Lemma 2 as follows.

Lemma 6 ([Policy Improvement: Refined Primal-Dual Mirror Descent Step). Let Assumptions 1

image

Proof. By the dual update in line 9 in Algorithm 1, for any  Y ∈ [0, χ] we have

image

where we apply the non-expansiveness of projection in the first inequality and  |b − V kg,1(x1)| ≤ Hfor the last inequality. By summing the above inequality from k = 1 to k = K, we have

image

which implies that

image

By adding the above inequality to (28) in Lemma 2 and noting that  V π⋆,kg,1 (x1) ≥ band the UCB result (21), we have

image

By taking  χ = H/γ, and α, θ, ηin the lemma, we complete the proof.

According to Lemma 6, we can multiply (20) by Y and add it, together with (19), to (37),

image

We now are ready to show the desired constraint violation bound. We note that there exists a policy  π′ such that V π′r,1(x1) = 1K�Kk = 1 V πkr,1 (x1) and V π′g,1(x1) = 1K�Kk = 1 V πkg,1 (x1). By the oc- cupancy measure method [5],  V πkr,1 (x1) and V πkg,1 (x1) are linear in terms of an occupancy measure induced by policy  πk and initial state  x1. Thus, an average of K occupancy measures is still an occupancy measure that produces policy  π′ with values  V π′r,1(x1) and V π′g,1(x1). Particularly, we take Y = 0 when �Kk = 1�b − V πkg,1 (x1)�<0; otherwise  Y = χ. Therefore, we have

image

≤ C4H2.5�T log |A|K + 1K

image

where we apply the UCB result (21) for the last inequality.

Finally, we recall two immediate results of Lemma 4 and Lemma 5. Fix  p ∈ (0,1), the proof of Lemma 4 also shows that with probability 1  − p/2,

image

and the proof of Lemma 5 shows that with probability 1  − p/2,

image

If we take log  |A| = O(d2 log2(dT/p)), (39) implies that with probability 1  − p we have

image

where  C5is an absolute constant. Finally, by noting our choice of  χ ≥ 2Y ⋆, we can apply Lemma 9 to conclude that Violation(K) ≤ C′dH2.5√T log�dTp

with probability 1  − p, where C′is an absolute constant.

As mentioned, Assumption 2 includes a tabular CMDP(S, A, H, P, r, g) as a special case with |S| < ∞ and |A| < ∞. We take the following feature maps and parameter vectors,

image

where  e(x,a,x′)is a canonical basis of  Rd1 associated with (x, a, x′) and θh = Ph( · , · , ·) reads that for any (x, a, x′) ∈ S × A × S, the (x, a, x′)th entry of  θh is P(x′ | x, a); similarly we define  e(x,a),θr,h, and θg,h. Thus, we can see that

image

We can also verify that

image

and for any  V : S → [0, H] and any (x, a) ∈ S × A, we have

image

Therefore, the tabular CMDP is a special case of Assumption (2) with  d := max (d1, d2) = |S|2|A|.

image

We now detail Algorithm 1 for the tabular case as follows. Our policy evaluation works with regression feature  φτ⋄,h: S × A → Rd2,

image

where  ⋄ = r or g. Thus, for any (¯x, ¯a, ¯x′) ∈ S × A × S, the (¯x, ¯a, ¯x′)th entry of  φτ⋄,h(x, a) is given by

image

which shows that  φτ⋄,h(x, a) is a sparse vector with |S| nonzero elements at  {(x, a, x′), x′ ∈ S} andthe (x, a, x′)th entry of  φτ⋄,h(x, a) is V τ⋄,h+1(x′). For instance of  ⋄ = r, the regularized least-squares problem (6) becomes

image

where [w](x,a,x′) is the (x, a, x′)th entry of w, and the solution  wkr,h serves as an estimator of the transition kernel  Ph(· | ·, ·). On the other hand, since  ϕ(xτh, aτh) = e(xτh,aτh), the regularized least- squares problem (7) becomes

image

where [u](x,a) is the (x, a)th entry of u, the solution  ukr,h gives an estimate of  rh(x, a) as ϕ(x, a)⊤ukr,h.By adding similar UCB bonus terms Γkh, Γkr,h: S × A → Rgiven in Algorithm 2, we estimate the action-value function as follows,

image

for any (x, a) ∈ S × A. Thus, V kr,h(x) = ⟨Qkr,h(x, ·), πkh(· | x)⟩A. Similarly, we estimate  gh(x, a) andthus  Qkg,h(x, a) and V kg,h(x). Using already estimated  {Qkr,h(·, ·), Qkg,h(·, ·), V kr,h(·), Qkg,h(·)}Hh = 1, weexecute the policy improvement and the dual update in Algorithm 1.

We restate the result of Theorem 1 for the tabular case as follows.

Corollary 1 (Regret and Constraint Violation). For the tabular CMDP with feature maps (41), let Assumption 1 hold. Fix  p ∈ (0, 1).In Algorithm 1, we set  α = �log |A|/(H2K), β =C1�|S|2|A|H2 log (|S||A|T/p), η = 1/√K, θ = 1/K, and λ = 1 where C1is an absolute constant. Then, the regret and the constraint violation in (3) satisfy

image

Proof. It follows the proof of Theorem 1 by noting that the tabular CMDP is a special linear MDP in Assumption 2, with  d = |S|2|A|, and we have log  |A| ≤ O�d2 log (dT/p)�automatically.

image

As we see in the proof of Theorem 1, our final regret or constraint violation bounds are dominated by the accumulated bonus terms, which come from the design of ‘optimism in the face of uncertainty.’ This framework provides a powerful flexibility for Algorithm 1 to incorporate other optimistic policy evaluation methods. In what follows, we introduce Algorithm 1 with a variant of optimistic policy evaluation.

image

and any (x, a) ∈ S×A, we define two visitation counters  nkh(x, a, x′) and nkh(x, a) at step hin episode

image

This allows us to estimate transition kernel  Ph, reward function r, and utility function g for episode

image

where  λ >0 is the regularization parameter. Moreover, we introduce the bonus term Γkh: S×A → R,

image

which adapts the counter-based bonus terms in the literature [8, 31], where  β >0 is to be determined later. Using the estimated transition kernels  {�Pkh}Hh = 1, the estimated reward/utility functions {�rkh, �gkh}Hh = 1, and the bonus terms  {Γkh}Hh = 1, we now can estimate the action-value function via

image

for any (x, a) ∈ S ×A, where ⋄ = r or g. Thus, V k⋄,h(x) = ⟨Qk⋄,h(x, ·), πkh(· | x)⟩A. We summarize the above procedure in Algorithm 3. Using already estimated  {Qkr,h(·, ·), Qkg,h(·, ·)}Hh = 1, we can execute the policy improvement and the dual update in Algorithm 1.

Similar to Theorem 1, we prove the following regret and constraint violation bounds.

Theorem 3 (Regret and Constraint Violation). For the tabular CMDP with feature maps (41), let Assumption 1 hold. Fix  p ∈ (0, 1).In Algorithm 1, we set  α = �log |A|/(H2K), β =C1H�|S| log(|S||A|T/p), η = 1/√K, θ = 1/K, and λ = 1 where C1is an absolute constant. Then, the regret and the constraint violation in (3) satisfy

image

Proof. The proof is similar to Theorem 1. Since we only change the policy evaluation, all previous policy improvement results still hold. By Lemma 3, we have

image

where  ιkr,h is the model prediction error given by (13) and  {Mkr,h,m}(k,h,m)∈[K]×[H]×[2]is a martingale adapted to the filtration  {Fkh,m}(k,h,m)∈[K]×[H]×[2]in terms of time index t defined in (18). By Lemma 5, it holds with probability 1  − p/3 that |MKr,H,2| ≤ 4�H2T log(4/p). The rest is to bound the double sum term. As shown in Section D.4, with probability 1  − p/2 it holds that for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A,

image

Together with the choice of Γkh, we have

image

Define mapping ¯φ: S × A → R|S||A| as ¯φ(x, a) = e(x,a), we can utilize Lemma 13. For any (k, h) ∈ [K] × [H], we have

image

where we apply the Cauchy-Schwartz inequality for the second inequality and Lemma 13 for the third inequality. Notice that (K + λ)I ⪰ ¯ΛKh and ¯Λ1h = λI. Hence,

image

Notice that log  |A| ≤ O�|S|2|A| log2(|S||A|T/p)�. By setting λ = 1 and β =C1H�|S| log(|S||A|T/p), we conclude the desired regret bound.

For the constraint violation analysis, Lemmas 6 still holds. Similar to (39), we have

image

where  V π′r,1(x1) = 1K�Kk = 1 V πkr,1 (x1) and V π′g,1(x1) = 1K�Kk = 1 V πkg,1 (x1). Similar to Lemma 5, it holds with probability 1  − p/3 that |MKg,H,2| ≤ 4�H2T log(6/p) for ⋄ = r or g. As shown in Section D.4, with probability 1  − p/3 it holds that  −4Γkh(x, a) ≤ ιk⋄,h(x, a) ≤0 for any (k, h) ∈ [K] × [H] and(x, a) ∈ S × A. Therefore, we have

image

which leads to the desired constraint violation bound due to Lemma 9 and we set  λ and β aspreviously.

In this section, we collect some verifications for readers’ convenience.

image

For any (k, h) ∈ [K] × [H], we recall the definitions of  V π⋆r,h in the Bellman equations (1) and  V kr,hfrom line 12 in Algorithm 2,

image

We can expand the difference  V π⋆r,h(x) − V kr,h(x) into

image

Recall the equality in the Bellman equations (1) and the model prediction error,

image

As a result of the above two, it is easy to see that

image

Substituting the above difference into the right-hand side of (43) yields,

image

which displays a recursive formula over h. Thus, we expand  V π⋆r,1 (x1) − V kr,1(x1) recursively with

image

V π⋆r,1 (x1) − V kr,1(x1) = �P1�V π⋆r,2 − V kr,2�(x1, ·), π⋆1(·|x1)�+�ιkr,1(x1, ·), π⋆1(·|x1)⟩ + ξk1(x1)= �P1�P2�V π⋆r,3 − V kr,3�(x2, ·), π⋆2(·|x2)�(x1, ·) , π⋆1(·|x1)�

image

With this notation, repeating the above recursion (44) over  h ∈ [H] yields

image

Finally, notice that  V π⋆r,H+1 = V kr,H+1 = 0, we use the definitions of  Ph and Ihto conclude (14). Similarly, we can also use the above argument to verify (16).

image

We expand the model prediction error  ιkr,h into,

image

where we use the Bellman equation  Qπkr,h(xkh, akh) = rh(xkh, akh)+(PhV πkr,h+1)(xkh, akh) in the last equality. With the above formula, we expand the difference  V kr,1(x1) − V πkr,1 (x1) into

image

Therefore, we have the following recursive formula over h,

image

Following the definitions of  Fkh,1 and Fkh,2, we know Dkr,h,1 ∈ Fkh,1 and Dkr,h,2 ∈ Fkh,2. Thus, for any (k, h) ∈ [K] × [H],

image

Notice that  t(k, 0, 2) = t(k − 1, H, 2) = 2H(k −1). Clearly,  Fk0,2 = Fk−1H,2 for any k ≥ 2. Let F10,2 beempty. We define a martingale sequence,

image

where  t(k, h, m) := 2(k − 1)H + 2(h − 1) + mis the time index. Clearly, this martingale is adapted to the filtration  {Fkh,m}(k,h,m)∈[K]×[H]×[2], and particularly,

image

Finally, we combine the above martingale with (45) to obtain (19). Similarly, we can show (20).

image

We recall the definition of the feature map  φkr,h,

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A. By Assumption 2, we have

image

where the second equality is due to the definition of φkr,h, we exploit Λkr,h =�k−1τ = 1 φτr,h(xτh, aτh)φτr,h(xτh, aτh)⊤ + λIfrom line 4 of Algorithm 2 in the fourth equality, and we recursively replace  φτr,h(xτh, aτh)⊤θh by (PhV τr,h+1) (xτh, aτh) for all τ ∈ [k −1] in the last equality.

image

Therefore,

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A, where we apply the Cauchy-Schwarz inequality twice in the inequality. By Lemma 12, set  λ= 1, with probability 1  − p/2 it holds that

image

Also notice that Λkr,h ⪰ λI and ∥θh∥ ≤√d, thus ∥θh∥(Λkr,h)−1 ≤√λd. Thus, by taking an appropriate absolute constant C, we obtain that

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × Aunder the event of Lemma 12.

image

Algorithm 2, with probability 1  − p/2 it holds that

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A.

We note that reward/utility functions are fixed over episodes,  rh(xτh, aτh) := ϕ(xτh, aτh)⊤θr,h Forthe difference  ϕ(x, a)⊤ukr,h − rh(x, a), we have

image

where we apply the Cauchy-Schwartz inequality in the inequality. Notice that Λkh ⪰ λI and ∥θr,h∥ ≤

image

(k, h) ∈ [K] × [H] and (x, a) ∈ S × A,

image

We recall the model prediction error  ιkr,h := rh + PhV kr,h+1 − Qkr,h and the estimated state-action value function  Qkr,h in line 11 of Algorithm 2,

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A. By (46) and (47), we first have

image

Then, we can show that

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A.

Therefore, (48) reduces to

image

On the other hand, notice that (rkh + PhV kr,h+1)(x, a) ≤ H − h + 1, thus

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A.

Therefore, we have proved that with probability 1  − p/2 it holds that

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A.

Similarly, we can show another inequality  −2(Γkh + Γkg,h)(x, a) ≤ ιkg,h(x, a) ≤ 0.

image

Let  V = {V : S → [0, H]}be a set of bounded function on  S. Fo any V ∈ V, we consider the difference between �x′ ∈ S �Pkh(x′ | ·, ·)V (x′) and �x′ ∈ S Ph(x′ | ·, ·)V (x′) as follows,

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A, where we apply the triangle inequality for the inequality.

Let  ητh := V (xτh+1) − (PhV )(xτh, aτh). Conditioning on the filtration  Fkh,1, ητh is a zero-mean and H/2-subGaussian random variable. By Lemma 11, we use  Y = λI and Xτ = 1{(x, a) = (xτh, aτh)}

and thus with probability at least 1  − δit holds that

image

for any (k, h) ∈ [K] × [H]. Also, since 0  ≤ V ≤ H, we have

image

By returning to (49) and setting  λ= 1, with probability at least 1  − δit holds that

image

for any  k ≥ 1.

Let  d(V, V ′) = maxx ∈ S |V (x) − V ′(x)|be a distance on  V. For any ǫ, an ǫ-covering Vǫ of Vwith respect to distance  d(·, ·) satisfies

image

Thus, for any  V ∈ V, there exists  V ′ ∈ Vǫsuch that maxx ∈ S |V (x) − V ′(x)| ≤ ǫ. By the triangle inequality, we have

image

Furthermore, we choose  δ = (p/3) / (|Vǫ||S||A|) and take an union bound over  V ∈ Vǫ and (x, a) ∈

S × A. By (50), with probability at least 1  − p/2 it holds that

image

for all (k, h) and (x, a), where C1is an absolute constant. We recall our choice of Γkh and β. Hence,with probability at least 1  − p/2 it holds that

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ |S| × |A|, where β := C1H�|S| log(|S||A|T/p).

image

where we utilize  λ = 1 and β ≥1 in the inequalities.

We now are ready to check the model prediction error  ιkr,h defined by (13),

image

≤ �rkh(x, a) + �x′ ∈ S �Pkh(x′ | x, a)V kr,h+1(x′) + 2Γkh(x, a) − (rh + PhV kr,h+1)(x, a)≤ 4Γkh(x, a)

for any (x, a) ∈ S × A. On the other hand, notice that (rh + PhV kr,h+1)(x, a) ≤ H − h + 1, thus

image

for any (k, h) ∈ [K] × [H] and (x, a) ∈ S × A. Hence, we complete the proof of (42).

We collect some standard results from the literature for readers’ convenience. We rephrase them for our constrained problem (2),

image

in which we maximize over all policies and  b ∈ (0, H]. Let the optimal solution be  π⋆ such that

image

Let the Lagrangian be  L(π, Y ) := V πr,1(x1)+Y (V πg,1(x1)−b), where Y ≥0 is the Lagrange multiplier or dual variable. The associated dual function is defined as

image

and the optimal dual is  Y ⋆ := argminY ≥ 0 D(Y ),

image

We recall that the problem (2) enjoys strong duality under the standard Slater condition. The proof is a special case of [45, Proposition 1] in finite-horizon.

image

Lemma 7 (Strong Duality). [45, Proposition 1] If the Slater condition holds, then the strong duality holds,

image

It is implied by the strong duality that the optimal solution to the dual problem: minimizeY ≥ 0 D(Y) is obtained at  Y ⋆. Denote the set of all optimal dual variables as Λ⋆.

Under the Slater condition, an useful property of the dual variable is that the sublevel sets are bounded [11, Section 8.5].

Lemma 8 (Boundedness of Sublevel Sets of the Dual Function). Let the Slater condition hold. Fix C ∈ R. For any Y ∈ {Y ≥ 0 | D(Y ) ≤ C}, it holds that

image

where we utilize the Slater point ¯πin the last inequality. We complete the proof by noting  γ > 0.

Corollary 2 (Boundedness of  Y ⋆). If we take C = V π⋆r,1 (x1) = D(Y ⋆), then Λ⋆ = {Y ≥ 0 | D(Y ) ≤

image

Another useful theorem from the optimization [11, Section 3.5] is given as follows. It describes that the constraint violation  b−V πg,1(x1) can be bounded similarly even if we have some weak bound. We next state and prove it for our problem, which is used in our constraint violation analysis in Section B.

Lemma 9 (Constraint Violation). Let the Slater condition hold and  Y ⋆ ∈ Λ⋆. Let C⋆ ≥ 2Y ⋆.Assume that  �π ∈ ∆(A | S, H) satisfies

image

where [x]+ = max(x, 0).

image

By definition,  v(0) = V π⋆r,1 (x1). It has been shown as a special case of [45, Proposition 1] that  v(τ)is concave. First, we show that  −Y ⋆ ∈ ∂v(0). By the Lagrangian and the strong duality,

L(π, Y ⋆) ≤ maximizeπ ∈ ∆(A | S,H) L(π, Y ⋆) = D(Y ⋆) = V π⋆r,1 (x1) = v(0), for all π ∈ ∆(A | S, H).

For any  π ∈ {π ∈ ∆(A | S, H) | V πg,1(x1) ≥ b + τ},

image

If we maximize the right-hand side of above inequality over  π ∈ {π ∈ ∆(A | S, H) | V πg,1(x1) ≥ b+τ},then

image

which show that  −Y ⋆ ∈ ∂v(0). On the other hand, if we take  τ = �τ := −(b − V �πg,1(x1))+, then

image

Combing the above two yields

image

By our assumption and  �τ =�b − V �πg (ρ)�+,

image

First, we state a lemma that is used throughout this paper.

Lemma 10 (Performance Difference Lemma). For any two policies  π, π′ ∈ ∆(A | S, H), it holdsthat

image

Proof. See the proof of Lemma 3.2 in [18].

Next, we state an useful concentration inequality for the standard self-normalized processes.

Lemma 11 (Concentration of Self-normalized Processes). Let {Ft}∞t = 0 be a filtration and  {ηt}∞t = 1be a R-valued stochastic process such that  ηt is Ft-measurable for any  t ≥ 0. Assume that for any  t ≥ 0, conditioning on  Ft, ηtis a zero-mean and  σ-subGaussian random variable with the variance proxy  σ2 > 0, i.e., E�eληt | Ft�≤ eλ2σ2/2 for any λ ∈ R. Let {Xt}∞t = 1 be an Rd-valuedstochastic process such that  Xt is Ft-measurable for any  t ≥ 0. Let Y ∈ Rd×d be a deterministic and positive-definite matrix. For any  t ≥ 0, we define

image

image

Proof. See the proof of Theorem 1 in [1].

The above concentration inequality can be customized to our setting in the following form without using covering number arguments as in [32].

Lemma 12. Let λ = 1in Algorithm 2. Fix  δ ∈ (0, 1). Then, for any (k, h) ∈ [K] × [H] it holdsfor  ⋄ = r or g that

image

Proof. See the proof of Lemma D.1 in [18].

Lemma 13 (Elliptical Potential Lemma). Let {φt}∞t=1 be a sequence of functions in  Rd and Λ0 ∈Rd×d be a positive definite matrix. Let Λt = Λ0+�t−1i = 1 φiφ⊤i . Assume ∥φt∥2 ≤ 1 and λmin (Λ0) ≥ 1.Then for any  t ≥ 1it holds that

image

Proof. See the proof of Lemma D.2 in [32] or [18].

Lemma 14 (Pushback Property of KL-divergence). Let f : ∆ → Rbe a concave function where is a probability simplex in  Rd. Let ∆o be the interior of . Let  x⋆ = argmaxx ∈ ∆ f(x) − α−1D(x, y)for a fixed  y ∈ ∆o and α > 0. Then, for any  z ∈ ∆,

image

Proof. See the proof of Lemma 14 in [62].

Lemma 15 (Bounded KL-divergence Difference). Let π1, π2be two probability distributions in ∆(A). Let �π2 = (1 − θ)π2 + 1θ/|A| where θ ∈ (0, 1]. Then,

image

Proof. See the proof of Lemma 31 in [62].


Designed for Accessibility and to further Open Science