Reinforcement Learning (RL) (Sutton & Barto, 2018) has achieved astounding successes in games (Mnih et al., 2015; OpenAI, 2018; Silver et al., 2018; Vinyals et al., 2019), matching or surpassing human performance in several occasions. However, the much-anticipated applications of RL to real-world tasks such as robotics (Kober, Bagnell, & Peters, 2013), autonomous driving (Okuda, Kaji- wara, & Terashima, 2014) and finance (Li & Hoi, 2014) seem still far. This technological delay may be due to the very nature of RL, which relies on the repeated interaction of the learning machine with the surrounding environment, e.g., a manufacturing plant, a trafficked road, a stock market. The trial-and-error process resulting from this interaction is what makes RL so powerful and general. However, it also poses significant challenges in terms of sample efficiency (Recht, 2019) and safety (Amodei et al., 2016).
In reinforcement learning, the term safety can actually refer to a variety of problems (Garc´ıa & Fern´andez, 2015). The general concern is always the same: avoiding or limiting damage. In financial applications, it is typically a loss of money. In robotics and autonomous driving, one should also consider direct damage to people and property. In this work, we do not make assumptions about the nature of the damage, but we assume it is entirely encoded in the scalar reward signal that is presented to the agent in order to evaluate its actions. Other works (e.g., Turchetta, Berkenkamp, & Krause, 2016) employ a distinct safety signal, separate from rewards.
A further distinction is necessary on the scope of safety constraints with respect to the agent’s life. One may simply require the final behavior, the one that is deployed at the end of the learning process, to be safe. This is typically the case when learning is performed in simulation, but the final controller has to be deployed in the real world. The main challenges there are in transferring safety properties from simulation to reality (e.g., Tan et al., 2018). In other cases, learning must be performed, or at least finalized, on the actual system, because no reliable simulator is available (e.g., Peters & Schaal, 2008). In such a scenario, safety must be enforced for the whole duration of the learning process. This poses a further challenge, as the agent must necessarily go through a sequence of sub-optimal behaviors before learning its final policy. The problem of learning while containing the damage is also known as safe exploration (Amodei et al., 2016) and will be the focus of this work.
Garc´ıa and Fern´andez (2015) provide a comprehensive survey on safe RL, where the existing approaches are organized into two main families: methods that modify the exploration process directly in order to explicitly avoid dangerous actions (e.g., Gehring & Precup, 2013), and methods that constrain exploration in a more indirect way by modifying the reward optimization process. The former typically require some sort of external knowledge, such as human demonstrations or advice (e.g., Abbeel, Coates, & Ng, 2010; Clouse & Utgoff, 1992). In this work, we only assume online access to a sufficiently informative reward signal and prior knowledge of some worst-case constants that are easy to obtain. Optimization-based methods (those belonging to the second class) are more suited for this scenario. A particular kind, identified by Garc´ıa and Fern´andez as constrained criteria (Castro, Tamar, & Mannor, 2012; Kadota, Kurano, & Yasuda, 2006; Moldovan & Abbeel, 2012), enforces safety by introducing constraints in the optimization problem, i.e., reward maximization.
A typical constraint is that the agent’s performance, i.e., the sum of rewards, must never be less than a user-specified threshold (Geibel & Wysotzki, 2005; Thomas, Theocharous, & Ghavamzadeh, 2015), which may be the average performance of a trusted baseline policy. Under the assumption that the reward signal also encodes danger, low performances can be matched with dangerous behaviors, so that the performance threshold works as a safety threshold. This falls into the general framework of Seldonian machine learning introduced by Thomas et al. (2019).
If we only cared about the safety of the final controller, the traditional RL objective — maximizing cumulated reward — would be enough. However, most RL algorithms are known to yield oscillating performances during the learning phase. Regardless of the final solution, the intermediate ones may violate the threshold, hence yield unsafe behavior. This problem is known as policy oscillation (Bertsekas, 2011; Wagner, 2011).
A similar constraint, which confronts the policy oscillation problem even more directly, is Monotonic Improvment (MI, S. Kakade & Langford, 2002; Pirotta, Restelli, Pecorino, & Calandriello, 2013), and is the one adopted in this work. The requirement is that each new policy implemented by the agent during the learning process does not perform worse than the previous one. In this way, if the initial policy is safe, so will be all the subsequent ones.
The way safety constraints such as MI can be imposed on the optimization process depends, of course, on what kind of policies are considered as candidates and on how the optimization itself is performed. These two aspects are often tied and will depend on the specific kind of RL algorithm that is employed. Policy Search or Optimization (PO, Deisenroth, Neumann, Peters, et al., 2013) is a family of RL algorithms where the class of candidate policies is fixed in advance and a direct search for the best one within the class is performed. This makes PO algorithms radically different from value-based algorithms such as Deep Q-Networks (Mnih et al., 2015), where the optimal policy is a byproduct of a learned value function. Although value-based methods gained great popularity from their successes in games, PO algorithms are better suited for real-world tasks, especially the ones involving cyberphysical systems. The main reasons are the ability of PO methods to deal with high-dimensional continuous state and action spaces, convergence guarantees (Sutton, McAllester, Singh, & Mansour, 2000), robustness to sensor noise, and the superior control on the set of feasible policies. The latter allows introducing domain knowledge into the optimization process, possibly including some safety constraints.
In this work, we focus on Policy Gradient methods (PG, Peters & Schaal,
2008; Sutton et al., 2000), where the set of candidate policies is a class of parametric distributions and the optimization is performed via stochastic gradient ascent on the performance objective as a function of the policy parameters. In particular, we analyze the prototypical PG algorithm, REINFORCE (Williams, 1992) and see how the MI constraints can be imposed by adaptively selecting its meta-parameters during the learning process. To achieve this, we study in more depth the stochastic gradient-based optimization process that is at the core of all PG methods (Robbins & Monro, 1951). In particular, we identify a general family of parametric policies that makes the optimization objective Lipschitz-smooth (Nesterov, 2013) and allows easy upper-bounding of the related smoothness constant. This family, referred to as smoothing policies, includes commonly used policy classes from the PG literature, namely Gaussian and Softmax policies. Using known properties of Lipschitz-smooth functions, we then provide lower bounds on the performance improvement produced by gradient-based updates, as a function of tunable meta-parameters. This, in turn, allows identifying those meta-parameter schedules that guarantee MI with high probability. In previous work, a similar result was achieved only for Gaussian policies (Papini, Pirotta, & Restelli, 2017; Pirotta, Restelli, & Bascetta, 2013).
The meta-parameters studied here are the step size of the policy updates, or learning rate, and the batch size of gradient estimates, i.e., the number of trials that are performed within a single policy update. These meta-parameters, already present in the original REINFORCE algorithm, are typically selected by hand and fixed for the whole learning process (Duan, Chen, Houthooft, Schulman, & Abbeel, 2016). Besides guaranteeing monotonic improvement, our proposed method removes the burden of selecting these meta-parameters. This safe, automatic selection within the REINFORCE algorithmic framework yields SPG, our Safe Policy Gradient algorithm.
The paper is organized as follows: in Section 2 we introduce the necessary background on Markov decision processes, policy optimization, and smooth functions. In Section 3, we introduce smoothing policies and show the useful properties they induce on the policy optimization problem, most importantly a lower bound on the performance improvement yielded by an arbitrary policy parameter update (Theorem 7). In Section 4, we exploit these properties to select the step size of REINFORCE in a way that guarantees MI with high probability when the batch size is fixed, then we achieve similar results with an adaptive batch size. In Section 5, we design a monotonically improving policy gradient algorithm with adaptive batch size, called Safe Policy Gradient (SPG), and show how the latter can also be adapted to weaker improvement constraints. In Section 6, we offer a detailed comparison of our contributions with the most closely related literature. In Section 7 we empirically evaluate SPG on simulated control tasks. Finally, we discuss the limitations of our approach and propose directions for future work in Section 8.
In this section, we revise continuous Markov Decision Processes (MDPs, Put- erman, 2014), actor-only Policy Gradient algorithms (PG, Deisenroth et al., 2013), and some general properties of smooth functions.
2.1 Markov Decision Processes
A Markov Decision Process (MDP, Puterman, 2014) is a tuple , comprised of a measurable state space S, a measurable action space A, a Markovian transition kernel
, where ∆
denotes the set of probability distributions over S, a reward function
, a discount factor
(0, 1) and an initial-state distribution
. We only consider bounded-reward MDPs, and denote with
sup
(a known upper bound on) the maximum absolute reward. This is the only prior knowledge we have on the task. The MDP is used to model the interaction of a rational agent with the environment. We model the agent’s behavior with a policy
, a stochastic mapping from states to actions. The initial state is drawn as
. For each time step t = 0, 1, . . . , the agent draws an action
), conditional on the current state
. Then, the agent obtains a reward
) and the state of the environment transitions to
). The goal of the agent is to maximize the expected sum of discounted rewards, or performance measure:
We focus on continuous MDPs, where states and actions are real vectors: and
. However, all the results naturally extend to the discrete case by replacing integrals with summations. See (Bertsekas & Shreve, 2004; Puterman, 2014) on matters of measurability and integrability, which just require common technical assumptions. We slightly abuse notation by denoting probability measures (assumed to be absolutely continuous) and density functions with the same symbol.
Given an MDP, the purpose of RL is to find an optimal policy arg max
) without knowing the transition kernel p and the reward function r in advance, but only through interaction with the environment. To better characterize this optimization objective, it is convenient to introduce further quantities. We denote with
the transition kernel of the Markov Process induced by policy
, i.e.,
) :=
) da. The t-step transition kernel under policy
is defined inductively as follows:
for all and
1. The t-step transition kernel allows to define the following conditional state-occupancy measure:
measuring the (discounted) probability of visiting a state starting from s and following policy . The following property of
—a variant of the generalized eigenfunction property by Ciosek and Whiteson (2020, Lemma 20)—will be useful (proof in Appendix A.1):
be any policy and f be any integrable function on S satisfying the following recursive equation:
for all and some integrable function g on S. Then:
for all
The state-value function ) =
is the discounted sum of rewards obtained, in expectation, by following policy
from state s, and satisfies Bellman’s equation (Puterman, 2014):
Similarly, the action-value function:
is the discounted sum of rewards obtained, in expectation, by taking action a in state s and following afterwards. The two value functions are closely related:
For bounded-reward MDPs, the value functions are bounded for every policy :
where = sup
and
= sup
. Using the definition of state-value function we can rewrite the performance measure as follows:
where:
is the state-occupancy probability under the starting-state distribution .
2.2 Parametric policies
In this work, we only consider parametric policies. Given a d-dimensional parameter vector , a parametric policy is a stochastic mapping from states to actions parametrized by
, denoted with
. The search for the optimal policy is thus limited to the policy class Π
. This corresponds to finding an optimal parameter, i.e.,
arg max
). For ease of notation, we often write
in place of
in function arguments and superscripts, e.g.,
) and
) in place of
and
), respectively.
We restrict our attention to policies that are twice differentiable w.r.t.
, for which the gradient
) and the Hessian
) are defined everywhere and finite. For ease of notation, we omit the
subscript in
when clear from the context. Given any twice-differentiable scalar function f : Θ
, we denote with
the i-th gradient component, i.e.,
, and with
the Hessian element of coordinates (i, j), i.e.,
. We also write
) to denote
when this does not introduce any ambiguity.
2000) allows us to characterize the gradient of the performance measure ) as an expectation over states and actions visited under
The gradient of the log-likelihood log
) is called score function, while the Hessian of the log-likelihood
log
) is sometimes called observed information.
2.3 Actor-only policy gradient
In practice, we always consider finite episodes of length T. We call this the effective horizon of the MDP, chosen to be sufficiently large so that the problem does not lose generality.We denote with
:= (
) a trajectory, i.e., a sequence of states and actions of length T such that
,
) for t = 0
1 and some policy
. In this context, the performance measure of a parametric policy
can be defined as:
where ) is the probability density of the trajectory
that can be generated by following policy
, i.e.,
) =
). Let
be a batch
of N trajectories generated with
, i.e.,
i.i.d. for i = 1, . . . , N. Let
) be an estimate of the policy gradient
) based on D. Such an estimate can be used to perform stochastic gradient ascent on the performance objective
):
where 0 is a step size and N = |D| is called batch size. This yields an Actor-only Policy Gradient method, summarized in Algorithm 1.
Under mild conditions, this algorithm is guaranteed to converge to a local optimum (Sutton et al., 2000). This is reasonable since the objective
) is non-convex in general.
As for the gradient estimator, we can use REINFORCE (Glynn, 1986; Williams, 1992):
or its refinement, G(PO)MDP (Baxter & Bartlett, 2001), which typically suffers from less variance (Peters & Schaal, 2008):
where the superscript on states and actions denotes the i-th trajectory of the dataset and b is a (possibly time-dependent and vector-valued) control variate, or baseline. Both estimators are unbiased for any action-independent baseline.Peters and Schaal (2008) prove that Algorithm 1 with the G(PO)MDP estimator is equivalent to Monte-Carlo PGT (Policy Gradient Theorem, Sutton et al., 2000), and provide variance-minimizing baselines for both REINFORCE and G(PO)MDP.
Algorithm 1 is called actor-only to discriminate it from actor-critic policy gradient algorithms (Konda & Tsitsiklis, 1999), where an approximate value function, or critic, is employed in the gradient computation. In this work, we will focus on actor-only algorithms, for which safety guarantees are more easily proven.Generalizations of Algorithm 1 include reducing the variance of gradient estimates through baselines and other stochastic-optimization techniques (e.g., Papini, Binaghi, Canonaco, Pirotta, & Restelli, 2018; Shen, Ribeiro, Hassani, Qian, & Mi, 2019; Xu, Gao, & Gu, 2020) using a vector step size (Papini et al., 2017; Yu, Aberdeen, & Schraudolph, 2006); making the step size adaptive, i.e., iteration and/or data-dependent (Pirotta, Restelli, & Bascetta, 2013); making the batch size N also adaptive (Papini et al., 2017); applying a preconditioning matrix to the gradient, as in Natural Policy Gradient (S. Kakade, 2002) and second-order methods (Furmston & Barber, 2012).
2.4 Smooth functions
In the following we denote with the
-norm of vector x, which is the Euclidean norm for p = 2. . For a matrix
= sup
= 1} denotes the induced norm, which is the spectral norm for p = 2. When the p subscript is omitted, we always mean p = 2.
Let be a (non-convex) vector-valued function. We call g Lipschitz continuous if there exists L > 0 such that, for every
:
Let be a real-valued differentiable function. We call f Lipschitz smooth if its gradient is Lipschitz continuous, i.e., there exists L > 0 such that, for every
:
Whenever we want to specify the Lipschitz constant L of the gradient, we call f L-smooth.We also call L the smoothness constant of f. For a twice-differentiable function, the following holds:
Proposition 2 Let X be a convex subset of be a twice-differentiable function. If the Hessian is uniformly bounded in spectral norm by L > 0, i.e.,
Lipschitz smooth functions admit a quadratic bound on the deviation from linear behavior:
Proposition 3 (Quadratic Bound) Let X be a convex subset of be an L-smooth function. Then, for every
where denotes the dot product.
This bound is often useful for optimization purposes (Nesterov, 2013).
In this section, we provide lower bounds on performance improvement based on general assumptions on the policy class.
3.1 Smoothing policies
We introduce a family of parametric stochastic policies having properties that we deem desirable for policy-gradient learning. We call them smoothing, as they are characterized by the smoothness of the performance measure:
be a class of twice-differentiable parametric stochastic policies, where Θ
is convex. We call it smoothing if there exist non-negative constants
such that, for every state and in expectation over actions, the Euclidean norm of the score function:
Note that the definition requires that the bounding constants be independent of the policy parameters and the state. For this reason, the existence of such constants depends on the policy parameterization.
We call a policy class (
)-smoothing when we want to specify the bounding constants. In Appendix B, we show that some of the most commonly used policies, such as the Gaussian policy for continuous actions and the Softmax policy for discrete actions, are smoothing. The smoothing constants for these classes are reported in Table 1. In the following sections, we will exploit the smoothness of the performance measure induced by smoothing policies to develop a monotonically improving policy gradient algorithm. However, smoothing policies have other interesting properties. For instance, variance upper bounds for REINFORCE/G(PO)MDP with Gaussian policies (Pirotta, Restelli, & Bascetta, 2013; Zhao, Hachiya, Niu, & Sugiyama, 2011) can be generalized to smoothing policies (see Appendix D for details). Other nice properties of smoothing policies, such as Lipschitzness of the performance measure, are discussed in (R. Yuan et al., 2021, Lemma D.1).
Table 1: Smoothing constants and smoothness constant L for Gaussian and Softmax policies, where M is an upper bound on the Euclidean norm of the feature function, R is the maximum absolute-value reward,
is the discount factor,
is the standard deviation of the Gaussian policy and
is the temperature of the Softmax policy. We also report the improved smoothness constant by R. Yuan et al. (2021) as
.
3.2 Policy Hessian
We now show that the Hessian of the performance measure ) for a smoothing policy has bounded spectral norm. We start by writing the policy Hessian for a general parametric policy as follows. The result is well known (S. Kakade, 2001), but we report a proof in Appendix A.4 for completeness. Also, note that our smoothing-policy assumption is weaker than the typical one (uniformly bounded policy derivatives). See Appendix A.3 for details.
be a smoothing policy. The Hessian of the performance measure is:
For smoothing policies, we can bound the policy Hessian in terms of the constants from Definition 1:
-smoothing policy
, the spectral norm of the policy Hessian can be upper-bounded as follows:
Proof By the Policy Gradient Theorem (see the proof of Theorem 1 by Sutton et al., 2000):
where (23) is from the Cauchy-Schwarz inequality and (8), and (24) is from the smoothing-policy assumption. Next, we bound the gradient of the action-value function. From (7):
where the interchange of gradient and expectation in (26) is justified by the smoothing-policy assumption (see Appendix A.3 for details) and (27) is from (24). Finally, from Proposition 4:
where (28) is from Jensen inequality (all norms are convex) and the triangle inequality, (29) is fromfor any two vectors x and y, (30) is from (8) and (27), and the last inequality is from the smoothing-policy assumption.
3.3 Smooth Performance
For a smoothing policy, the performance measure ) is Lipschitz smooth with a smoothness constant that only depends on the smoothing constants, the reward magnitude, and the discount factor. This result is of independent interest as it can be used to establish convergence rates for policy gradient algorithms (R. Yuan et al., 2021).
-smoothing policy class
, the performance measure
-smooth with the following smoothness constant:
Proof From Lemma 5, L is a bound on the spectral norm of the policy Hessian. From Lemma 2, this is a valid Lipschitz constant for the policy gradient, hence the performance measure is
The smoothness of the performance measure, in turn, yields the following property on the guaranteed performance improvement:
-smoothing policy class. For every
where
Proof It suffices to apply Lemma 3 with the Lipschitz constant from Lemma 6.
The smoothness constant L for Gaussian and Softmax policies is reported in Table 1.
In the following, we will exploit this property of smoothing policies to enforce safety guarantees on the policy updates performed by Algorithm 1, i.e., stochastic gradient ascent updates. However, Theorem 7 applies to any policy update ∆as long as
+ ∆
Θ.
Very recently, R. Yuan et al. (2021, Lemma 4.4) provided an improved smoothness constant for smoothing policies:
This is a significant step forward since it improves the dependence on the effective horizon by a (1factor. In Table 1 we report explicit expressions for
in the case of linear Gaussian and Softmax policies. We will use these superior smoothness constant in the numerical simulations of Section 7.
In this section, we provide a step size for Algorithm 1 that maximizes a lower bound on the performance improvement for smoothing policies. This yields safety in the sense of Monotonic Improvement (MI), i.e., non-negative performance improvements at each policy update:
at least with high probability.
In policy optimization, at each learning iteration k, we ideally want to find the policy update ∆that maximizes the new performance
+ ∆
), or equivalently:
since ) is fixed. Unfortunately, the performance of the updated policy cannot be known in advance.
For this reason, we replace the optimization objective in (35) with a lower bound, i.e., a guaranteed improvement. In particular, taking Algorithm 1 as our starting point, we maximize the guaranteed improvement of a policy gradient update (line 5) by selecting optimal meta-parameters. The solution of this meta-optimization problem provides a lower bound on the actual performance improvement. As long as this is always non-negative, MI is guaranteed.
4.1 Adaptive Step Size – Exact Framework
To decouple the pure optimization aspects of this problem from gradient estimation issues, we first consider an exact policy gradient update, i.e., ), where we assume to have a first-order oracle, i.e., to be able to compute the exact policy gradient
). This assumption is clearly not realistic, and will be removed in Section 4.2. In this simplified framework, performance improvement can be guaranteed deterministically. Furthermore, the only relevant meta-parameter is the step size
of the update. We first need a lower bound on the performance improvement
). For a smoothing policy, we can use the following:
-smoothing policy class. Let
, the performance improvement of
can be lower bounded as follows:
This bound is in the typical form of performance improvement bounds (e.g., Cohen, Yu, & Wright, 2018; S. Kakade & Langford, 2002; Pirotta, Restelli, & Bascetta, 2013; Schulman et al., 2015): a positive term accounting for the anticipated advantage of over
, and a penalty term accounting for the mismatch between the two policies, which makes the anticipated advantage less reliable. In our case, the mismatch is measured by the curvature of the performance measure w.r.t. the policy parameters, via the smoothness constant L. This lower bound is quadratic in
, hence we can easily find the optimal step size
.
be the guaranteed performance improvement of an exact policy gradient update, as defined in Theorem 8. Under the same assumptions,
is maximized by the constant step size
, which guarantees the following non-negative performance improvement:
Proof We just maximize ) as a (quadratic) function of
. The global optimum
is attained by
. The improvement guarantee follows from Theorem 8.
4.2 Adaptive Step Size – Approximate Framework
In practice, we cannot compute the exact gradient ), but only an estimate
) obtained from a finite dataset D of trajectories. In this section, N denotes the fixed size of D. To find the optimal step size, we just need to adapt the performance-improvement lower bound of Theorem 8 to stochastic-gradient updates. Since sample trajectories are involved, this new lower bound will only hold with high probability. To establish statistical guarantees, we make the following assumption on how the (unbiased) gradient estimate concentrates around its expected value:
Assumption 1 Fixed a parameter Θ, a batch size
and a failure probability
1), with probability at least 1
where |D| is a dataset of N i.i.d. trajectories collected with a known function.
We will discuss how this assumption is satisfied in cases of interest in Section 5 and Appendix C. Under the above assumption, we can adapt Theorem 8 to the stochastic-gradient case as follows:
-smoothing policy class. Let
. Under Assumption 1, provided
, the performance improvement of
can be lower bounded, with probability at least
, as follows:
Assumption 1, holds with probability at least 1
. For the rest of the proof, we will assume
Let
for short. Under
, by the triangular inequality:
where the latter inequality is from (37). We first consider the case in which
where the last inequality is from (41).
From Theorem 10 we can easily obtain an optimal step size, as done in the exact setting, provided the batch size is sufficiently large:
be the guaranteed performance improvement of a stochastic policy gradient update, as defined in Theorem 10. Under the same assumptions, provided the batch size satisfies:
is maximized by the following adaptive step size:
which guarantees, with probability at least , the following non-negative performance improvement:
, the second argument of the max operator in (41) is selected. In this case, no positive improvement can be guaranteed and the optimal non-negative step size is
= 0. Thus, we focus on the case
. In this case, the first argument of the max operator is selected. Optimizing
) as a function of
alone, which is again quadratic, yields (44) as the optimal step size and (45) as the maximum guaranteed improvement.
In this case, the optimal step size is adaptive, i.e., time-varying and data-dependent. The constant, optimal step size for the exact case (Corollary 9) is recovered in the limit of infinite data, i.e., . In the following we discuss why this adaptive step size should not be used in practice, and propose an alternative solution.
4.3 Adaptive Batch Size
The safe step size from Corollary 11 requires the batch size to be large enough. As soon as the condition (43) fails to hold, the user is left with the decision whether to interrupt the learning process or collect more data — an undesirable property for a fully autonomous system. To avoid this, a large batch size must be selected from the start, which results in a pointless waste of data in the early learning iterations. Even so, Equation (43), used as a stopping condition, would be susceptible to random oscillations of the stochastic gradient magnitude, interrupting the learning process prematurely.
As observed in (Papini et al., 2017), controlling also the batch size N of the gradient estimation can be advantageous. Intuitively, a larger batch size yields a more reliable estimate, which in turn allows a safer policy gradient update. The larger the batch size, the higher the guaranteed improvement, which would lead to selecting the highest possible value of N. However, we must take into account the cost of collecting the trajectories, which is non-negligible in real-world problems (e.g., robotics). For this reason, we would like the meta-parameters to maximize the per-trajectory performance improvement:
where D is a dataset of N i.i.d. trajectories sampled with . We can then use the lower bound from Theorem 10 to find the jointly optimal safe step size and batch size, similarly to what was done in (Papini et al., 2017) for the special case of Gaussian policies:
be the lower bound on the performance improvement of a stochastic policy gradient update, as defined in Theorem 10. Under the same assumptions, the continuous relaxation of
is maximized by the following
Using in the stochastic gradient ascent update guarantees, with probability at least
, the following non-negative performance improvement:
consider the continuous relaxation of Υ(can be any positive real
number. For , the first argument of the max operator in (36) can be selected. Note that the second argument is always a valid choice, since it is a lower bound on the first one for every
1. Thus, we separately solve the following constrained
� � � �
Both problems can be solved in closed form using KKT conditions. The first one (49) yields Υwith the values of
and
). The second one (50) yields a worse optimum Υ
Hence, we keep the first solution. From Theorem 10, using
would guar- antee
). Of course, only integer batch sizes
can be used. However, for , the right-hand side of (36) is monotonically increasing in
, the guarantee (48) is still valid when
are employed in the stochastic gradient ascent update.
In this case, the optimal step size is constant, and is exactly half the one for the exact case (Corollary 9). In turn, the batch size is adaptive: when the norm of the (estimated) gradient is small, a large batch size is selected. Intuitively, this allows to counteract the variance of the estimator, which is large relatively to the gradient magnitude. One may worry about the recursive dependence of on itself through
. We will overcome this issue in the next section.
In this section, we leverage the theoretical results of the previous sections to design a reinforcement learning algorithm with monotonic improvement guarantees. For the reasons discussed above, we adopt the adaptive-batch-size approach from Section 4.3.
Corollary 12 provides a constant step size and a schedule for the batch size (
that jointly maximize per-trajectory performance improvement under a monotonic-improvement constraint. Plugging these meta-parameters into Algorithm 1, we could obtain a safe policy gradient algorithm. Unfortunately, the closed-form expression for
provided in (47) cannot be used directly. We must compute the batch size before collecting the batch of trajectories
, but
depends on
itself. To overcome this issue, we collect trajectories in an incremental fashion until the optimal batch size is achieved. We call this algorithm Safe Policy Gradient (SPG), outlined in Algorithm 2. The user specifies the failure probability
for each iteration k, while the smoothness constant L and the concentration bound
: (0, 1)
can be computed depending on the policy class and the gradient estimator (see Tables 1 and 2).
We can study the data-collecting process of SPG as a stopping problem. Fixed an iteration k, let ) be the sigma-algebra generated by the first i trajectories collected at that iteration. Let
] be short for
In Section 4 and 4.3 we assumed the Euclidean norm of the gradient estimation error to be bounded by
with probability 1
for some function
: (0, 1)
. For Algorithm 2 to be well-behaved, we need gradient estimates to concentrate exponentially, which translates into the following, stronger assumption:
Assumption 2 Fixed a parameter Θ, a batch size
and a failure probability
1), with probability at least 1
where |D| is a dataset of N i.i.d. trajectories collected with ) for some problem-dependent constant C that is independent of
and N.
This is satisfied by REINFORCE/G(PO)MDP with Softmax and Gaussian policies, as shown in Appendix C. In Table 2 we summarize the value of the error bound ) to be used in the different scenarios. Equipped with this exponential tail bound we can prove that, at any given (outer) iteration of SPG, the data-collecting process (inner loop) terminates:
Lemma 13 Fix an iteration k of Algorithm 2 and let the number of trajectories that are collected at that iteration. Under Assumption 2, provided
Proof First, note that is a stopping time w.r.t. the filtration (
the event
. By Assumption 2,
. This allows to upper bound the expectation of
as follows:
Table 2: Gradient estimation error bound ) for Gaussian and Softmax policies using REINFORCE (RE.), GPOMDP (GP.), or the random-horizon estimator discussed in Appendix C.2 (RH.) as gradient estimator, where d is the dimension of the policy parameter, M is an upper bound on the max norm of the feature function, R is the maximum absolute-valued reward,
is the discount factor, T is the task horizon,
is the standard deviation of the Gaussian policy and
is the temperature of the Softmax policy.
where (56) is by Assumption 2 and the last inequality is by Lemma 21 assuming . If the latter is not true, we still get:
We can now prove that the policy updates of SPG are safe.
Theorem 14 Consider Algorithm 2 applied to a smoothing policy, where unbiased policy gradient estimator. Under Assumption 2, for any iteration
provided
, with probability at least
Proof Fix an (outer) iteration k of Algorithm 2 and let short. Using an unbiased policy gradient estimator we ensure
so
) is a martingale difference sequence adapted to (
use an optional stopping argument to show that
is an unbiased policy gradient estimate. Lemma 13 shows that
is a stopping time w.r.t. the filtration (
that is finite in expectation. Furthermore, by Assumption 2, integrating the tail:
Hence, by optional stopping (Lemma 22), ] = 0. Since
). This shows that the policy update of Algorithm 2 is an unbiased policy-gradient update. By the stopping condition:
So holds with probability at least 1
, the performance improvement guarantee is by Corollary 12, Equation (65), and the choice of the step size
We have shown that the policy updates of SPG are safe with probability 1 , where the failure probability
can be specified by the user for each iteration k. Typically, one would like to ensure monotonic improvement for the whole duration of the learning process. This can be achieved by appropriate confidence schedules. If the number of updates K is fixed a priori,
guarantees monotonic improvement with probability 1
. The same can be obtained by using an adaptive confidence schedule
, even when the number of updates is not known in advance. Both results are easily shown by taking a union bound over
1. Notice how having an exponential tail bound like the one from Assumption 2 is fundamental for the batch size to have a logarithmic dependence on the number of policy updates.
5.1 Towards a Practical Algorithm
The version of SPG we have just analyzed is very conservative. The price for guaranteeing monotonic improvement is slow convergence, even in small problems (see Section 7.1 for an example). In this section, we discuss possible variants and generalizations of Algorithm 2 aimed at the development of a more practical method. In doing so, we still stay faithful to the principle of satisfying the safety requirement specified by the user with no compromises. We just list the changes here. See Appendix E for a more rigorous discussion.
Improved smoothness constant.
As mentioned in Section 3.3, we can use the improved smoothness constant by R. Yuan et al. (2021), denoted in the following, which has a better dependence on the effective horizon. This yields a larger step size with the same theoretical guarantees, and allows to tackle problems with longer horizons in practice.
Mini-batches.
In the inner loop of Algorithm 2, instead of just one trajectory at a time, we can collect mini-batches of n independent trajectories. For instance, 2 is required to employ the variance-reducing baselines discussed in Section 2. Moreover, a carefully picked mini-batch size n can make the early gradient estimates more stable, leading to an earlier stopping of the inner loop and a smaller batch size
. We leave the investigation of the optimal value of n to future work.
Largest safe step size.
The meta parameters of Algorithm 2 were selected to maximize a lower bound on the per-trajectory performance improvement. Although we believe this is the most theoretically justified choice, we could gain some convergence speed by using a larger step size. From Theorem 10, it is easy to check that = 1/L is the largest constant step size we can use with our choice of adaptive batch size from Algorithm 2. We leave the investigation of alternative safe combinations of batch size and (possibly adaptive) step size to future work.
Empirical Bernstein bound.
The stopping condition of Algorithm 2 (line 11) is based on a Hoeffding-style bound on the gradient estimation error. In the case of policies with bounded score function, such as Softmax policies (see Appendix B.2), we can use an empirical Bernstein bound instead (Maurer & Pontil, 2009). This requires some modifications to the algorithm, but yields a smaller adaptive batch size with the same safety guarantees. See Appendix E for details. Unfortunately, we cannot use the empirical Bernstein bound with the Gaussian policy because of its unbounded score function (see Appendix B.1).
Weaker safety requirements.
Monotonic improvement is a very strong requirement, so we do expect an algorithm with strict monotonic improvement guarantees like SPG to be very data-hungry and slow to converge. However, with little effort, Algorithm 2 can be modified to handle weaker safety requirements. A common one is the baseline constraint (Garcelon et al., 2020a; Laroche, Trichelair, & Des Combes, 2019, e.g.,), where the performance of the policy is required to never be (significantly) lower than the performance of a baseline policy . In a real safety-critical application, the reward could be designed so that policies with performance greater than
) are always safe. In other applications,
can be an existing, reliable controller that the user wants to replace with an adaptive RL agent. In this case, assuming
, the baseline constraint guarantees that the learning agent never performs worse than the original controller. In our numerical simulations of Section 7, we will consider a stronger version of the baseline constraint that we call milestone constraint. In this case, the agent’s policy must never perform (significantly) worse than the best performance observed so far. Formally, for all
1:
where [0, 1] is a user-defined significance parameter. The idea is as follows: every time the agent reaches a new level of performance (a milestone), it should never do significantly worse than that. When
= 1, this reduces to monotonic improvement. When
1, some amount of performance oscillation is allowed, but this relaxation can significantly improve the learning speed. Of course, the user has full control on this trade-off through the meta-parameter
. In Appendix E we show that variants of Algorithm 2 satisfy the milestone constraint (and other requirements, such as the baseline constraint) with probability 1
for given significance
and failure probability
. We experiment with the milestone constraint in Section 7.2.
In this section, we discuss previous results on MI guarantees for policy gradient algorithms.
The seminal work on monotonic performance improvement is by S. Kakade and Langford (2002). In this work, policy gradient approaches are soon dismissed because of their lack of exploration, although they guarantee MI in the limit of an infinitesimally small step size. The authors hence focus on value-based RL, proposing the Conservative Policy Iteration (CPI) algorithm, where the new policy is a mixture of the old policy
and a greedy one
. The guaranteed improvement of this new policy (S. Kakade & Langford, 2002, Theorem 4.1) depends on the coefficient
of this convex combination, which plays a similar role as the learning rate in our Theorem 8:
where = max
and
) =
) denotes the advantage function of policy
. In fact, both lower bounds have a positive term that accounts for the expected improvement of the new policy w.r.t. the old one, and a penalization term due to the mismatch between the two. The CPI approach is refined by Pirotta, Restelli, Pecorino, and Calan- driello (2013), who propose the Safe Policy Iteration (SPI) algorithm (see also Metelli, Pirotta, Calandriello, & Restelli, 2021).
Specific performance improvement bounds for policy gradient algorithms were first provided by Pirotta, Restelli, and Bascetta (2013) by adapting previous results on policy iteration (Pirotta, Restelli, Pecorino, & Calandriello, 2013) to continuous MDPs. However, the penalty term can only be computed for shallow Gaussian policies (App. B.1) in practice. The bound for the exact framework is:
where |A| denotes the volume of the action space. From Table 1, our bound for the same setting is (Corollary 9):
J(J(
J(
M
Rσ
γ)
which has the same dependence on the step size, the policy standard deviation , the effective horizon (1
, the maximum reward R and the maximum feature norm M. Besides being more general, our penalty term does not depend on the problematic |A| term (the action space is theoretically unbounded for Gaussian policies) and replaces the
norm of (70) with the smaller
norm. Due to the different constants, we cannot say our penalty is always smaller, but the change of norm could make a big difference in practice, especially for large parameter dimension d. Pirotta, Restelli, and Bascetta (2013) also study the approximate framework. However, albeit formulated in terms of the estimated gradient, their lower bound (Theorem 5.2) still pertains exact policy gradient updates, since
is defined as
). This easy-to-overlook observation makes our Theorem 10 the first rigorous monotonic improvement guarantee for stochastic policy gradient updates of the form
+
). Pirotta, Restelli, and Bascetta (2013) use their results to design an adaptive step-size schedule for REINFORCE and G(PO)MDP, similarly to what we propose in this paper, but limited to Gaussian policies. Papini et al. (2017) rely on the same improvement lower bound (70) to design an adaptivebatch size algorithm, the most similar to our SPG. Again, their monotonic improvement guarantees are limited to shallow Gaussian policies.
Another related family of performance improvement lower bounds, inspired once again by S. Kakade and Langford (2002), is that of TRPO. These are very general results that apply to arbitrary pairs of stochastic policies, although they are mostly used to construct policy gradient algorithms in practice. Specializing Theorem 1 by Schulman et al. (2015) to our setting and applying the
KL lower bound suggested by the authors we can get the following:
where is a stochastic policy. Unfortunately, the lower bound for a policy gradient update (exact or stochastic) cannot be computed exactly. Approximations can lead to very good practical algorithms such as TRPO, but not to actually implementable algorithms with rigorous monotonic improvement guarantees like our SPG. Achiam, Held, Tamar, and Abbeel (2017) and Pajari- nen, Thai, Akrour, Peters, and Neumann (2019) are able to remove some approximations, but not all.
If we were to derive a computable worst-case lower bound starting from (71), we would get a result similar to (70). In fact, Pirotta, Restelli, and Bascetta (2013) explicitly upper-bound the KL divergence in their derivations, which is why the final result is limited to Gaussian policies. We overcome this difficulty by directly upper-bounding the curvature of the objective function (Lemma 5). Furthermore, Theorem 7 suggests that our theory is not limited to policy gradient updates. Arbitrary update directions are considered in (Papini, Battistello, & Restelli, 2020).
Pirotta, Restelli, and Bascetta (2015) provide performance improvement lower bounds (Lemma 8) and adaptive-step algorithms for policy gradients under Lipschitz continuity assumptions on the MDP and the policy. Our assumptions on the environment are much weaker since we only require boundedness of the reward. Intuitively, stochastic policies smooth out the irregularities of the environment in computing expected return objectives. In turn, the results of Pirotta et al. (2015) also apply to deterministic policies.
Cohen et al. (2018) provide a general safe policy improvement strategy that can be applied also to policy gradient updates. However, it requires to maintain and evaluate a set of policies per iteration instead of a single one.
As mentioned, R. Yuan et al. (2021, Lemma 4.4) also study policy gradient with smoothing policies, providing an improved smoothness constant and proving Lipschitz continuity of the objective function. However, their main focus is sample complexity of vanilla policy gradient.
In this section, we test our SPG algorithm on simulated control tasks. We first test Algorithm 2 with monotonic improvement guarantees on a small continuous-control problem. We then experiment with the milestone-constraint
Fig. 1: Performance of SPG and AdaBatch (Papini et al., 2017) on the LQR task with Gaussian policy. Results are averaged over 5 independent runs. The shaded areas correspond to 10 standard deviations. A marker corresponds to 100 policy updates.
relaxation proposed in Section 5.1 on a classic RL benchmark — cart-pole balancing.
7.1 Linear-Quadratic Regulator with Gaussian Policy
The first task is a 1-dimensional Linear-Quadratic Regulator (LQR, Dorato, Cerone, and Abdallah (1994)), a typical continuous-control benchmark. See Appendix F.1 for a detailed task specification. We use a Gaussian policy (Appendix B.1) that is linear in the state, ) =
). The task horizon is T = 10 and we use
= 0.9 as a discount factor. The policy mean parameter is initialized to
= 0 and the variance is fixed as
= 1. For this task, the maximum reward (in absolute value) is R = 1 and the only feature is the state itself, giving M = 1. Hence, the smoothness constant
200 is easily computed (see Table 1). Similarly, the error bound can be retrieved from Table 2. We compare the SPG (Algorithm 2) with an existing adaptive-batch-size policy gradient algorithm for Gaussian policies (Papini et al., 2017), discussed in the previous section and labeled AdaBatch in the plots. SPG is run with a mini-batch size of n = 100 (see Section 5.1), and AdaBatch (in the version with Bernstein’s inequality as recommended in the original paper) with an initial batch size of
= 100. Both use the adaptive confidence schedule
+ 1)) discussed in Section 5, with an overall failure probability of
= 0.05. We also consider SPG with a twice-as-large step size
= 1
, as discussed in Section 5.1.
Figure 1 shows the expected performance of the algorithms on the LQR task. For this task, we are able to compute the expected performance in closed form given the policy parameters (Peters, 2002). This allows to filter out the oscillations due to the stochasticity of policy and environment, focusing on
Fig. 2: Batch size of SPG and AdaBatch on the LQR task. Results are averaged over 5 independent runs. The shaded areas correspond to one standard deviation. A marker corresponds to 100 policy updates
actual (expected) performance oscillations. It is also why the variability among different seeds is so small (note that, for this figure, shaded areas correspond to 10 standard deviations. They correspond to a single standard deviation in the other figures). Performance is plotted against the total number of collected trajectories for fair comparison. The distribution of policy updates can be deduced from the markers. We can see that indeed all the safe PG algorithms exhibit monotonic improvement. SPG converges faster than AdaBatch. This is mostly due to the larger step size of SPG (we observed that the step size of SPG was more than 100 times larger than the one of AdaBatch in most of the updates). This allows SPG to converge faster even with fewer policy updates. The variant of SPG with a larger step size (= 1
) converges faster to a good policy, but the original version from Algorithm 2 achieves higher performance on the long run. This indicates that maximizing the lower bound on per-trajectory performance improvement from Theorem 10 is indeed meaningful.
Figure 2 shows the batch size of the different algorithms. The batch size of SPG is mostly larger than that of AdaBatch. From Section 6 we know that the monotonic improvement guarantee of SPG is more rigorous, so a larger batch size is justified. Notice also that the batch size of SPG is smaller than that of AdaBatch in the early iterations, suggesting that the former is more adaptive.
7.2 Cart-Pole with Softmax Policy
The second task is cart-pole (Barto, Sutton, & Anderson, 1983). We use the implementation from openai/gym, which has 4-dimensional continuous states and finite actions, . See Appendix F.2 for further details. The policy is Softmax (Appendix B.2), linear in the state:
exp(
), with
Fig. 3: Performance of GPOMDP and SPG (for different values of the sig-nificance parameter ) on the cart-pole task with Softmax policy. Results are averaged over 5 independent runs. The shaded areas correspond to one standard deviation. A marker corresponds to 1000 policy updates.
a separate parameter for each action (= [
]). We use a fixed temperature
= 1, initial policy parameters set to zero (this corresponds to a uniform policy) and
= 0.9 as a discount factor. For SPG, we employ all the practical variants proposed in Section 5.1. In particular, since the Softmax policy has a bounded score function, we can use the empirical Bernstein bound. Note that we could not have done the same for the LQG task since the score function of the Gaussian policy is unbounded (see Appendix C). Moreover, we consider the relaxed milestone constraint for different values of the significance parameter,
. The overall failure probability is always
= 0.2, the mini-batch size is n = 100, and the step size is
= 1
We compare with GPOMDP with the same step size but a fixed batch size of N = 100, which comes with no safety guarantees, and corresponds to
= 0. In Figure 3 we plot the performance against the total number of collected trajectories. As expected, a more relaxed constraint yields faster convergence. However, no significant performance oscillations are observed, not even in the case of GPOMDP, suggesting that the choice of meta-parameters is still overconservative. In Figure 4 (left) we report the evolution of the batch size of SPG during the learning process. Note how, in this case, the batch size seems to converge to a constant value. In Figure 4 (right) we illustrate the milestone constraint. The solid line is the performance of SPG with
= 0.1, while the dotted line is the performance lower-threshold enforced by the milestone constraint, representing 90% of the highest performance achieved so far. As desired, the actual performance never falls under the threshold.
Fig. 4: Further results for SPG on the cart-pole task. On the left, the batch size is plotted against the total number of trajectories. A marker corresponds to 1000 policy updates. On the right, the performance at each policy update (solid line) is compared with the performance threshold (dashed line) when = 0.1. In both plots, shaded areas correspond to one standard deviation.
We have identified a general class of policies, called smoothing policies, for which the performance measure (expected total reward) is a smooth function of policy parameters. We have exploited this property to select meta-parameters for actor-only policy gradient that guarantee monotonic performance improvement. We have shown that an adaptive batch size can be used in combination with a constant step size for improved efficiency, especially in the early stages of learning. We have designed a monotonically improving policy gradient algorithm, called Safe Policy Gradient (SPG), with adaptive batch size. We have shown how SPG can also be applied to weaker performance-improvement constraints. Finally, we have tested SPG on simulated control tasks.
Albeit the safety motivations are clearly of practical interest, our contribution is mostly theoretical. The meta-parameters proposed in Section 4 and used in SPG are based on worst-case problem-dependent constants that are known and easy to compute, but can be very large. This would lead to overconservative behavior in most problems of interest. However, we believe that this work provides a solid starting point to develop safe and efficient policy gradient algorithms that are rooted in theory.
To conclude, we propose some possible ideas for future work that are aimed to close this gap between theory and practice. While we used the empirical Bernstein bound to characterize the gradient estimation error for Softmax policies, the same cannot be done for Gaussian policies due to their unbounded score function. Tighter concentration inequalities should be studied for this case. The convergence rate of SPG should also be studied. The main challenge here is the growing batch size. The numerical simulations of section 7.1 suggest the the growth is sublinear. Moreover, we have observed convergence to a fixed batch size under the weaker milestone constraint in Section 7.2. It is also worth to investigate whether SPG can be combined with stochastic variancereduction techniques (e.g., Papini et al., 2018; H. Yuan, Lian, Liu, & Zhou, 2020). Convergence to global optima should also be investigated, as is now common in the policy optimization literature (Agarwal et al., 2020; Bhandari & Russo, 2019; Zhang et al., 2020). Actor-critic algorithms (Konda & Tsitsik- lis, 1999) are more used than actor-only algorithms in practice (e.g., Haarnoja, Zhou, Abbeel, & Levine, 2018) due to their reduced variance. Thus, extending our improvement guarantees to this class of algorithms is also important. The main challenge lies in handling the bias due to the critic. A promising first step is to consider compatible critics that yield unbiased gradient estimates (Konda & Tsitsiklis, 1999; Sutton et al., 2000). Although the class of smoothing policies is very broad, we have restricted our attention to Gaussian and Softmax policies with given features. Other policy classes, such as beta policies (Chou, Maturana, & Scherer, 2017) should be considered. Most importantly, deep policies should be considered that also learn the features from data, especially given their success in practice (Duan et al., 2016). See Appendix B.3 for a brief discussion. Other possible extensions include generalizing the monotonic improvement guarantees to other concepts of safety, such as learning under constraints, or risk-averse RL (Bisi, Sabbioni, Vittori, Pap- ini, & Restelli, 2020). Finally, the conservative approach adopted in this work could prevent exploration, making some tasks very hard to learn. We studied the case of Gaussian policies with adaptive standard deviation in (Papini et al., 2020). Future work should consider the trade-off between safety, efficiency and exploration in greater generality.
Acknowledgments. The authors would like to thank Gergely Neu for his suggestions on how to improve the Safe Policy Gradient algorithm and its theoretical analysis.
A.1 Markov Decision Processes
Proof
be any policy and f be any integrable function on S satisfying the following recursive equation:
for all and some integrable function g on S. Then:
for all
A.2 Lipschitz-Smooth Functions
The following results, reported in Section 2, are well known in the literature (Nesterov, 1998), but we also report proofs for the sake of completeness:
Proposition 17 Let X be a convex subset of be a twice-differentiable function. If the Hessian is uniformly bounded in spectral norm by L > 0, i.e.,
). Convexity of X guarantees
1]. Twice-differentiability of
continuous, which in turn implies g is continuous. From the Fundamental Theorem of Calculus:
Proposition 18 (Quadratic Bound) Let X be a convex subset of be an L-smooth function. Then, for every
where denotes the dot product.
). Convexity of X guarantees
1]. Lipschitz smoothness implies continuity of f, which in turn implies g is continuous. From the Fundamental Theorem of Calculus:
where (A6) is from the Cauchy-Schwartz inequality and (A7) is from the Lipschitz smoothness of
A.3 Smoothing Policies and Differentiability
Our proofs of the results of Section 3.2 rely on the interchange of integrals (w.r.t. states and actions) and derivatives (w.r.t. policy parameters). In the policy gradient literature (cf. S. Kakade, 2001; Konda & Tsitsiklis, 1999; Sutton et al., 2000), these are typically justified by assuming the derivatives of the policy are bounded uniformly over states and actions, that is:
for all , and i, j = 1, 2, . . . , d. The policy gradient itself originally relies on this assumption (Konda & Tsitsiklis, 1999), although weaker requirements are possible (see Bhandari & Russo, 2019, Section 5.1, for a recent discussion). The main problem with (A8) is that the uniform bounds may depend on huge quantities such as the diameter of the parameter space. Even worse, for (linear) Gaussian policies, the first derivative is unbounded:
even when ) is bounded, since
. However, these policies are smoothing (see Appendix B.1).
The following application of the Leibniz Integral Rule (cf. Klenke, 2013, Theorem 6.28) shows that our smoothing-policy assumption (Definition 1), can replace the stronger (A8) in differentiating expectations:
, be a class of smoothing policies and
be any function such that
is integrable on S. Then
for all i, j = 1, 2, . . . , d.
where is the element of the canonical basis of
corresponding to the i-th coordinate. By linearity of integration:
By assumption, ) is differentiable, so it is continuous. Fix an
mean value theorem, there exist a
on the segment connecting
that:
By the smoothing-policy assumption, Θ is convex, so Θ, and again by the smoothing-policy assumption:
showing that ) is bounded by a function that is integrable w.r.t. a. By the dominated convergence theorem, we can interchange the limit and the integral in (A14) to obtain:
By (A17) and Holder’s inequality, , which is integrable on S. We can then use the same interchange argument to show that:
For the second derivative, we can just repeat the whole argument from the previous paragraph on ). Continuity of the integrand, which is necessary to apply the mean value theorem, follows from twice differentiability of the policy. To apply the dominated convergence theorem, we use the following:
by the triangular inequality and the smoothing-policy assumption.
With some work, one can use Lemma 19 to justify all the interchanges of differentiation and integrals from Section 3.2 and Appendix A.4, as the original derivations (S. Kakade, 2001; Sutton et al., 2000) were justified by (A8).
A.4 Policy Hessian
In the following, the interchange of differentiation and integrals is justified by our smoothing-policy assumption. See Appendix A.3 for details.
be a smoothing policy. The Hessian of the performance measure is:
(A19) is from the log-derivative trick, (A20) is from another application of the log-derivative trick, (A21) is from (5), and (A22) is from Lemma 1 with recursive term. Computing the Hessian of the performance measure is then trivial:
where the first equality is from (9). Combining (A22), (A23) and (10) we obtain the statement of the lemma.
A.5 Auxiliary Lemmas
Lemma 21 For any a, b > 0 such that ab > 1, a sufficient condition for is
Proof This can be deduced from the properties of the Lambert function. However, it is easier to verify it directly. Letting x = 2a log(ab), the first inequality becomes:
2log(2 log(ab)), (A24) and log(2 log(
1. Finally, notice that
) is increasing for
-dimensional vector-valued martingale difference sequence and
be a stopping time, both with respect to a filtration
and there exists
Proof Consider any martingale . We are going to apply Doob’s optional stopping theorem (See Thm 12.5.9 from Grimmett & Stirzaker, 2020)
to each element
. Sufficient conditions for
where the last inequality is by hypothesis. So, by optional stopping, for all
]. We can repeat the same argument for
In this section, we show that some of the most commonly used parametric policies are smoothing and provide the corresponding Lipschitz constants for the policy gradient.
B.1 Gaussian policy
Consider a scalar-action, fixed-variance, shallow Gaussian policy:
where 0 is the standard deviation, and
is a vector-valued feature function that is bounded in Euclidean norm, i.e., sup
. This common policy turns out to be smoothing.
be the set of Gaussian policies defined in (B26), with parameter set
, standard deviation
and feature function
be a non-negative constant such that
-smoothing with the following constants:
The corresponding Lipschitz constant of the policy gradient is:
. Note that
the following derivatives:
From these constants, the Lipschitz constant of the policy gradient is easily computed (Lemma 6).
B.2 Softmax policy
Consider a fixed-temperature, shallow Softmax policy for a discrete action space:
where 0 is the temperature, and
is a vector-valued feature function that is bounded in Euclidean norm, i.e., sup
. This policy is smoothing.
be the set of Softmax policies defined in (B33), with parameter set
, temperature
and feature function
be a non-negative constant such that
)-smoothing with the following constants:
The corresponding Lipschitz constant of the policy gradient is:
Proof In this case, we can simply bound formly over states and actions. The smoothing conditions follow trivially. We need the following derivatives:
Lipschitz constant of the policy gradient is easily computed (Lemma 6).
Note the similarity with the Gaussian constants from Lemma 23. The temperature parameter plays a similar role to the standard deviation
.
The smoothness constants for Gaussian and Softmax policies are summarized in Table 1.
B.3 Preliminary Results on Deep Policies
The policies we have considered so far rely on given feature maps from state (and action) space to low-dimensional linear space. For many applications, a linear map is not expressive enough to represent good policies. Deep policies (Duan et al., 2016) use Neural Networks (NN) to extract more powerful representations from data. Here we provide a first analysis on how the properties of the NN affect the smoothing properties of the policy. As an example, consider a Gaussian policy with mean parametrized by a NN, that is: (B39)
where is a NN with weights
. The score function is then:
and its second-order counterpart:
For the policy to be smoothing, we need bounds on the gradient and Hessian of the NN w.r.t. its weights (in Euclidean and spectral norm, respectively), both uniformly over the state space. This may suggest the use of activation functions that are smooth and have bounded derivatives for any input, such as tanh or sigmoid activations. We shall study the impact of the network architecture on the smoothing constants in future work.
In this section, we provide exponential tail inequalities for REINFORCE and G(PO)MDP (see Section 2) policy gradient estimators with policy classes of interest. For the G(PO)MDP estimator, it is useful to notice that it can be equivalently written as (Peters & Schaal, 2008; Sutton et al., 2000):
just by reordering. For simplicity, we will consider estimators without variance-reducing baselines. First, let us consider the case of a bounded score function:
. Then, for any
, with probability
where for REINFORCE and
for G(PO)MDP.
and let ) denote the policy gradient estimate obtained from a single trajectory
sampled from
. For both gradient estimators:
where the first inequality uses the fact that, for any max
. By linearity of expectation and unbiasedness of the gradient estimator,
. Hence, by (C50) and Hoeffding’s inequality, with probability 1
where N = |D|. To turn this into a bound on the Euclidean norm, we need a covering argument. For arbitrary 0, consider an
It is a well known result that a finite cover exists such that
with probability 1
where the first inequality is by Cauchy-Schwarz inequality and definition of second one is by union bound over the finite elements of
, and the last inequality uses the covering number of the sphere in
. Finally, by letting
The Softmax policy described in Appendix B.2 satisfies the assumption of Lemma 25 with W = where M is an upper bound on
, as shown in the proof of Lemma 24.
Unfortunately, the Gaussian policy class from Appendix B.1 is not covered by Lemma 25, since its score function is unbounded. Motivated by the broad use of Gaussian policies in applications, we provide an ad-hoc bound for this class:
be the class of shallow Gaussian policies from Lemma 23. Then, for any
, with probability
where for REINFORCE and
for G(PO)MDP.
be defined (differently for the two gradients estimators) as in the proof of Lemma 25. Again, let
be the unit sphere in
vector
) denote the policy gradient estimate obtained from a single trajectory
sampled from
. Consider the filtration (
) is the sigma-algebra representing all the knowledge up to the t-th state included. Conditional on
). Hence, conditionally on
Let -measurable and
is a martingale difference sequence adapted to (
Furthermore, (C59) shows that, for any
where the first inequality is by conditionally (
)-subgaussian and
is conditionally (
)-subgaussian. Using Azuma’s inequality, for any
showing that -subgaussian. From this and
, using Hoeffding’s inequality for averages of i.i.d. subgaussian random variables:
with probability 1 . Finally, using the same covering argument as in the proof of Lemma 25, with probability 1
The values of ) for Gaussian and Softmax policies are summarized in Table 2.
C.1 Empirical Bernstein Bound
For bounded-score policies (such as the Softmax), we can improve Lemma 25 using an empirical Bernstein inequality (Maurer & Pontil, 2009):
. Then, for any
, with probability
Lemma 25.
Proof Recall that is a set of trajectories sampled independently from
) denote the policy gradient estimate obtained from trajectory
, and recall
) denotes the sample mean. Fix a vector
, the unit sphere in
for short. Then, as shown in (C50):
and . Moreover, (
are i.i.d. (conditionally on
fixed in this case). By Theorem 4 from (Maurer & Pontil, 2009), with probability 1
where the inequality is by Cauchy-Schwarz and = 1. Since ˆV does not depend on v, we can use the same covering argument as in the proof of Lemma 25 to obtain the desired result.
To use this concentration inequality in SPG, Algorithm 2 must be modified, as discussed in Appendix E.
C.2 Infinite-Horizon Estimators
To obtain an unbiased estimate of the gradient for the original infinite-horizon performance measure considered in the paper, we can modify our simulation protocol as suggested in (Bedi, Parayil, Zhang, Wang, & Koppel, 2021). Consider a random-horizon G(PO)MDP estimator that, for each episode:
1. Samples a random horizon Geom(1
) from a geometric distribution;
2. Generates a trajectory of length T with the current policy
;
3. Outputs ) =
log
.
The result can be averaged over a batch of independent trajectories, each with its own independently sampled random length. This policy gradient estimator is unbiased (Bedi et al., 2021, Lemma 1). The random horizon should be accounted for in the concentration bounds of Lemma 25, 26, and 27. However, note that the term , for the G(PO)MDP estimator, is bounded as follows:
for any 0. Hence, Lemma 25, 26, and 27 all hold for the random-horizon estimator with
. The corresponding error bounds are reported in Table 2. We leave a more refined analysis of the variance and tail behavior of this random-horizon estimator to future work.
In this section, we provide upper bounds on the variance of the (finite-horizon) REINFORCE and G(PO)MDP estimators, generalizing existing results for Gaussian policies (Pirotta, Restelli, & Bascetta, 2013; Zhao et al., 2011) to smoothing policies. We begin by bounding the variance of the REINFORCE estimator:
-smoothing policy class
and an effective task horizon T, for every
, the variance of the REINFORCE estimator (with zero baseline) is upper-bounded as follows:
This is a generalization of Lemma 5.3 from Pirotta, Restelli, and Bascetta (2013), which in turn is an adaptation of Theorem 2 from Zhao et al. (2011). In the Gaussian case, the original lemma is recovered by plugging the smoothing constant from Lemma 23. Note also that, from the definition of smoothing policy, only the second condition (20) is actually necessary for Lemma 28 to hold.
For the G(PO)MDP estimator, we obtain an upper bound that does not grow linearly with the horizon T:
-smoothing policy class
and an effective task horizon T, for every
, the variance of the G(PO)MDP estimator (with zero baseline) is upper-bounded as follows:
Table D1: Upper bounds on the variance ) for common policy gradient estimators (single trajectory, no baseline), assuming the policy is smoothing (Definition 1). Here R is the maximum absolute-valued reward,
is the discount factor, T is the task horizon, and the smoothing constant
can be retrieved from Table 1 depending on the policy class.
where (D79) is from the fact that the trajectories are i.i.d., (D80) is from the CauchySchwarz inequality, (D81) is from the same argument used for (D74) in the proof of Lemma 28, and (D82) is from the sum of the arithmetico-geometric sequence.
This is a generalization of Lemma 5.5 from Pirotta, Restelli, and Bascetta (2013). Again, in the Gaussian case, the original lemma is recovered by plugging the smoothing constant =
from Lemma 23. Note that this variance upper bound stays finite in the limit
, which is not the case for REINFORCE.
In this section, we will analyze in more detailed the variants of SPG introduced in Section 5.1. In particular, we will consider a very general relaxed improvement guarantee, then we will specialize it to the baseline and milestone constraints discussed in the main paper.
The pseudocode for the relaxed version of SPG is provided in Algorithm 3. The algorithm takes as additional inputs the mini-batch size n and a sequence of degradation thresholds ∆0. Moreover, it assumes access to a generic gradient estimation error function
with the following property:
Assumption 3 Fixed a parameter Θ, a batch size
and a failure probability
1), with probability at least 1
The assumption, as the analysis that will follow, is less precise than Assumption 2, but more general. Indeed, it allows to use the empirical Bernstein bound from Lemma 27 for Softmax and other bounded-score policies. We can prove that the per-iteration performance degradation of Algorithm 3 is bounded by the user-defined threshold ∆with high probability. Of course, when ∆
= 0, this is still a monotonic improvement guarantee.
Theorem 30 Consider Algorithm 3 applied to a smoothing policy, where unbiased policy gradient estimator. Under Assumption 3, for any iteration
provided
, with probability at least
Proof Let the filtration (be defined as in Section 5 and note that
stopping time w.r.t. this filtration. Consider the event
. By Assumption 2,
. Hence, by the same arguments used in the proof of Lemma 13:
This shows that the inner loop of Algorithm 3 always terminates with a finite batch size. By the same optional-stopping argument as in the proof of Theorem 14, ), which means the gradient estimate is unbiased. By the stopping condition, for all k:
with probability at least 1 Theorem 10 and the choice of step size
, with the same probability:
where the last inequality is from (E92).
In the following we discuss some applications of Algorithm 3 to specific safety requirements.
Baseline Constraint.
A common requirement is for the updated policy not to perform (significantly) worse than a known baseline policy (e.g., Garcelon et al., 2020a; Laroche et al., 2019). The safety constraint is thus:
where is the (discounted) performance of the baseline policy and
[0, 1] is a user-defined significance parameter. Equivalently,
), and Algorithm 3 satisfies this safety requirement if we set the degradation threshold as follows:
However, the performance of the current policy must also be estimated from data, and accidentally over-estimating it may result in excessive performance degradation. Hence, we replace it with a lower confidence bound based on the empirical Bernstein inequality (Maurer & Pontil, 2009). See Algorithm 4 for details. Note how the failure probability in line 10 is adjusted w.r.t. Algorithm 3 to account for this additional estimation step. With this small caveat, the analysis of Algorithm 4 can be carried out analogously to the one of Algorithm 3.
Milestone Constraint.
In our numerical simulations we consider the following safety constraint:
which can be enforced by setting the degradation threshold in Algorithm 3 as:
Again, we must replace the unknown performance ) with a lower con-fidence bound. In this case, we also need to overestimate the best historical performance. See Algorithm 5 for details. Note that this safety constraint reduces to monotonic improvement if the significance parameter is set to
= 1, since max
).
In this Appendix, we provide detailed descriptions of the control tasks used in the numerical simulations.
F.1 LQR
The LQR is a classical optimal control problem (Dorato et al., 1994). It models the very general task of controlling a set of variables to zero with the minimum effort. Given a state space and an action space
, the next state is a linear function of current state and action:
where and
. The reward is quadratic in both state and action:
+
(F100) where
and
are positive definite matrices. A linear controller is optimal for this task (Dorato et al., 1994) and can be computed in closed form with dynamic-programming techniques. In our experiments, we always consider shallow Gaussian policies of the form:
where and
0 can be fixed or learned as an additional policy parameter. This version of LQR with Gaussian policies is also called LQG (Linear-Quadratic Gaussian Regulator, Peters & Schaal, 2008). States and actions are clipped in practice when they happen to fall outside S and A, respectively. We have ignored nonlinearities stemming from this fact.
The LQR problem used in Section 7 is 1-dimensional with S = A = [1], A = B = C = D = 1.
F.2 Cart-Pole
This is the CartPole-v1 environment from openai/gym (Brockman et al., 2016). It has 4-dimensional continuous states and finite (two) actions. The goal is to keep a pole balanced by controlling a cart to which the pole is attached. Reward is +1 for every time-step until the pole falls. We set a maximum episode length of 100. See the official documentation for more details (https:// gym.openai.com/envs/CartPole-v1/).
Abbeel, P., Coates, A., Ng, A.Y. (2010). Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 29(13), 1608–1639.
Achiam, J., Held, D., Tamar, A., Abbeel, P. (2017). Constrained policy optimization. ICML (Vol. 70, pp. 22–31). PMLR.
Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G. (2020). Optimality and approximation with policy gradient methods in markov decision processes. COLT (Vol. 125, pp. 64–66). PMLR.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Man´e, D. (2016). Concrete problems in ai safety. arXiv preprint arXiv:1606.06565.
Barto, A.G., Sutton, R.S., Anderson, C.W. (1983). Neuronlike adaptive ele- ments that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern., 13(5), 834–846.
Baxter, J., & Bartlett, P.L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15.
Bedi, A.S., Parayil, A., Zhang, J., Wang, M., Koppel, A. (2021). On the sample complexity and metastability of heavy-tailed policy search in continuous control. CoRR, abs/2106.08414.
Berkenkamp, F. (2019). Safe exploration in reinforcement learning: Theory and applications in robotics (Unpublished doctoral dissertation). ETH Zurich.
Berkenkamp, F., Turchetta, M., Schoellig, A.P., Krause, A. (2017). Safe model- based reinforcement learning with stability guarantees. NIPS (pp. 908– 919).
Bertsekas, D.P. (2011). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3), 310–335.
Bertsekas, D.P., & Shreve, S. (2004). Stochastic optimal control: the discretetime case.
Bhandari, J., & Russo, D. (2019). Global optimality guarantees for policy gradient methods. CoRR, abs/1906.01786.
Bisi, L., Sabbioni, L., Vittori, E., Papini, M., Restelli, M. (2020). Risk-averse trust region optimization for reward-volatility reduction. IJCAI (pp. 4583–4589). ijcai.org.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W. (2016). Openai gym.
Castro, D.D., Tamar, A., Mannor, S. (2012). Policy gradients with variance related risk criteria. ICML. icml.cc / Omnipress.
Chou, P., Maturana, D., Scherer, S.A. (2017). Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. ICML (Vol. 70, pp. 834–843). PMLR.
Chow, Y., Nachum, O., Du´e˜nez-Guzm´an, E.A., Ghavamzadeh, M. (2018). A lyapunov-based approach to safe reinforcement learning. Neurips (pp. 8103–8112).
Ciosek, K., & Whiteson, S. (2020). Expected policy gradients for reinforcement learning. J. Mach. Learn. Res., 21, 52:1–52:51.
Clouse, J.A., & Utgoff, P.E. (1992). A teaching method for reinforcement learning. Machine learning proceedings 1992 (pp. 92–101). Elsevier.
Cohen, A., Yu, L., Wright, R. (2018). Diverse exploration for fast and safe policy improvement. arXiv preprint arXiv:1802.08331.
Dalal, G., Dvijotham, K., Vecer´ık, M., Hester, T., Paduraru, C., Tassa, Y. (2018). Safe exploration in continuous action spaces. CoRR, abs/1801.08757.
Deisenroth, M.P., Neumann, G., Peters, J., et al. (2013). A survey on policy search for robotics. (1–2), 1– 142.
Dorato, P., Cerone, V., Abdallah, C. (1994). Linear-quadratic control: an introduction. Simon & Schuster, Inc.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P. (2016). Bench- marking deep reinforcement learning for continuous control. ICML (Vol. 48, pp. 1329–1338). JMLR.org.
Fruit, R., Lazaric, A., Pirotta, M. (2019). Regret minimization in infinite-horizon finite markov decision processes. Tutorial at ALT’19. Retrieved from https://rlgammazero.github.io/
Furmston, T., & Barber, D. (2012). A unifying perspective of parametric policy search methods for markov decision processes. Advances in neural information processing systems (pp. 2717–2725).
Garcelon, E., Ghavamzadeh, M., Lazaric, A., Pirotta, M. (2020a). Conser- vative exploration in reinforcement learning. AISTATS (Vol. 108, pp. 1431–1441). PMLR.
Garcelon, E., Ghavamzadeh, M., Lazaric, A., Pirotta, M. (2020b). Improved algorithms for conservative exploration in bandits. AAAI (pp. 3962– 3969). AAAI Press.
Garc´ıa, J., & Fern´andez, F. (2015). A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16, 1437–1480.
Gehring, C., & Precup, D. (2013). Smart exploration in reinforcement learn- ing using absolute temporal difference errors. Proceedings of the 2013 international conference on autonomous agents and multi-agent systems (pp. 1037–1044).
Geibel, P., & Wysotzki, F. (2005). Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24, 81–108.
Glynn, P.W. (1986). Stochastic approximation for monte carlo optimization. WSC (pp. 356–365). ACM.
Grimmett, G., & Stirzaker, D. (2020). Probability and random processes. Oxford university press.
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S. (2018). Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. ICML (Vol. 80, pp. 1856–1865). JMLR.org.
Hans, A., Schneegaß, D., Sch¨afer, A.M., Udluft, S. (2008). Safe exploration for reinforcement learning. Esann (pp. 143–148).
Kadota, Y., Kurano, M., Yasuda, M. (2006). Discounted markov decision processes with utility constraints. Computers & Mathematics with Applications, 51(2), 279–284.
Kakade, S. (2001). Optimizing average reward using discounted rewards. International conference on computational learning theory (pp. 605–615).
Kakade, S. (2002). A natural policy gradient. Advances in neural information processing systems (pp. 1531–1538).
Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning..
Kakade, S.M., et al. (2003). On the sample complexity of reinforcement learning (Unpublished doctoral dissertation). University of London London, England.
Kazerouni, A., Ghavamzadeh, M., Abbasi, Y., Roy, B.V. (2017). Conservative contextual linear bandits. NIPS (pp. 3913–3922).
Klenke, A. (2013). Probability theory: a comprehensive course. Springer Science & Business Media.
Kober, J., Bagnell, J.A., Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238– 1274.
Konda, V.R., & Tsitsiklis, J.N. (1999). Actor-critic algorithms. NeurIPS (pp. 1008–1014). The MIT Press.
Laroche, R., Trichelair, P., Des Combes, R.T. (2019). Safe policy improve- ment with baseline bootstrapping. International conference on machine learning (pp. 3652–3661).
Li, B., & Hoi, S.C. (2014). Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3), 35.
Maurer, A., & Pontil, M. (2009). Empirical bernstein bounds and sample- variance penalization. COLT.
Metelli, A.M., Pirotta, M., Calandriello, D., Restelli, M. (2021). Safe pol- icy iteration: A monotonically improving approximate policy iteration approach. Journal of Machine Learning Research, 22(97), 1–83.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., . . . others (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
Moldovan, T.M., & Abbeel, P. (2012). Safe exploration in markov decision pro- cesses. Proceedings of the 29th international coference on international conference on machine learning (pp. 1451–1458).
Nesterov, Y. (1998). Introductory lectures on convex programming volume i: Basic course. Lecture notes.
Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course (Vol. 87). Springer Science & Business Media.
Neu, G., Jonsson, A., G´omez, V. (2017). A unified view of entropy-regularized markov decision processes. CoRR, abs/1705.07798.
Nota, C., & Thomas, P.S. (2020). Is the policy gradient a gradient? AAMAS (pp. 939–947). International Foundation for Autonomous Agents and Multiagent Systems.
Okuda, R., Kajiwara, Y., Terashima, K. (2014). A survey of technical trend of adas and autonomous driving. Technical papers of 2014 international symposium on vlsi design, automation and test (pp. 1–4).
OpenAI (2018). Openai five. https://blog.openai.com/openai-five/.
Pajarinen, J., Thai, H.L., Akrour, R., Peters, J., Neumann, G. (2019). Com- patible natural gradient policy search. arXiv preprint arXiv:1902.02823.
Papini, M., Battistello, A., Restelli, M. (2020). Balancing learning speed and stability in policy gradient via adaptive exploration. AISTATS (Vol. 108, pp. 1188–1199). PMLR.
Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., Restelli, M. (2018). Stochastic variance-reduced policy gradient. ICML (Vol. 80, pp. 4023– 4032). JMLR.org.
Papini, M., Pirotta, M., Restelli, M. (2017). Adaptive batch size for safe policy gradients. Advances in neural information processing systems (pp. 3591–3600).
Paul, S., Kurin, V., Whiteson, S. (2019). Fast efficient hyperparameter tuning for policy gradients. CoRR, abs/1902.06583.
Pecka, M., & Svoboda, T. (2014). Safe exploration techniques for reinforce- ment learning–an overview. International workshop on modelling and simulation for autonomous systems (pp. 357–375).
Peters, J. (2002). Policy gradient methods for control applications (Tech. Rep.). Technical Report TR-CLMC-2007-1, University of Southern California.
Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4), 682–697.
Pirotta, M., Restelli, M., Bascetta, L. (2013). Adaptive step-size for policy gradient methods. Advances in neural information processing systems 26 (pp. 1394–1402).
Pirotta, M., Restelli, M., Bascetta, L. (2015). Policy gradient in lipschitz markov decision processes. Mach. Learn., 100(2-3), 255–283.
Pirotta, M., Restelli, M., Pecorino, A., Calandriello, D. (2013). Safe policy iteration. International conference on machine learning (pp. 307–315).
Puterman, M.L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Recht, B. (2019). A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, 400–407.
Schulman, J., Levine, S., Abbeel, P., Jordan, M.I., Moritz, P. (2015). Trust region policy optimization. ICML (Vol. 37, pp. 1889–1897). JMLR.org.
Shamir, O. (2011). A variant of azuma’s inequality for martingales with subgaussian tails. CoRR, abs/1110.2392.
Shani, L., Efroni, Y., Mannor, S. (2020). Adaptive trust region policy optimiza- tion: Global convergence and faster rates for regularized mdps. AAAI (pp. 5668–5675). AAAI Press.
Shen, Z., Ribeiro, A., Hassani, H., Qian, H., Mi, C. (2019). Hessian aided policy gradient. ICML (Vol. 97, pp. 5729–5738). PMLR.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., . . . others (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419), 1140–1144.
Sutton, R.S., & Barto, A.G. (2018). Reinforcement learning: An introduction. MIT press.
Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y. (2000). Policy gra- dient methods for reinforcement learning with function approximation. Advances in neural information processing systems (pp. 1057–1063).
Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., . . . Vanhoucke, V. (2018). Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332.
Thomas, P.S., da Silva, B.C., Barto, A.G., Giguere, S., Brun, Y., Brunskill, E. (2019). Preventing undesirable behavior of intelligent machines. Science, 366(6468), 999–1004.
Thomas, P.S., Theocharous, G., Ghavamzadeh, M. (2015). High confidence policy improvement. ICML (Vol. 37, pp. 2380–2388). JMLR.org.
Tucker, G., Bhupatiraju, S., Gu, S., Turner, R., Ghahramani, Z., Levine, S. (2018). The mirage of action-dependent baselines in reinforcement learning. International conference on machine learning (pp. 5015–5024).
Turchetta, M., Berkenkamp, F., Krause, A. (2016). Safe exploration in finite markov decision processes with gaussian processes. Advances in neural information processing systems (pp. 4312–4320).
Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czar- necki, W.M., . . . Silver, D. (2019). AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar -mastering-real-time-strategy-game-starcraft-ii/.
Wagner, P. (2011). A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. Advances in neural information processing systems (pp. 2573–2581).
Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4), 229– 256.
Wu, Y., Shariff, R., Lattimore, T., Szepesv´ari, C. (2016). Conservative bandits. ICML (Vol. 48, pp. 1254–1262). JMLR.org.
Xu, P., Gao, F., Gu, Q. (2020). Sample efficient policy gradient methods with recursive variance reduction. ICLR. OpenReview.net.
Yu, J., Aberdeen, D., Schraudolph, N.N. (2006). Fast online policy gradient learning with smd gain vector adaptation. Advances in neural information processing systems (pp. 1185–1192).
Yuan, H., Lian, X., Liu, J., Zhou, Y. (2020). Stochastic recursive momentum for policy gradient methods. CoRR, abs/2003.04302.
Yuan, R., Gower, R.M., Lazaric, A. (2021). A general sample complexity analysis of vanilla policy gradient. CoRR, abs/2107.11433.
Zhang, J., Kim, J., O’Donoghue, B., Boyd, S.P. (2020). Sample efficient reinforcement learning with REINFORCE. CoRR, abs/2010.11364.
Zhao, T., Hachiya, H., Niu, G., Sugiyama, M. (2011). Analysis and improvement of policy gradient estimation. NIPS (pp. 262–270).