My stuff
Safe reinforcement learning for probabilistic reachability and safety specifications: A Lyapunov-based approach

Emerging applications in robotic and autonomous systems, such as autonomous driving and robotic surgery, often involve critical safety constraints that must be satis-fied even when information about system models is limited. In this regard, we propose a model-free safety specification method that learns the maximal probability of safe operation by carefully combining probabilistic reachability analysis and safe reinforcement learning (RL). Our approach constructs a Lyapunov function with respect to a safe policy to restrain each policy improvement stage. As a result, it yields a sequence of safe policies that determine the range of safe operation, called the safe set, which monotonically expands and gradually converges. We also develop an efficient safe exploration scheme that accelerates the process of identifying the safety of unexamined states. Exploiting the Lyapunov shieding, our method regulates the exploratory policy to avoid dangerous states with high confidence. To handle high-dimensional systems, we further extend our approach to deep RL by introducing a Lagrangian relaxation technique to establish a tractable actor-critic algorithm. The empirical performance of our method is demonstrated through continuous control benchmark problems, such as a reaching task on a planar robot arm.

Reachability and safety specifications for robotic and autonomous systems are one of fundamental problems for the verification of such systems. It is difficult to imagine deploying robots, without (safety) verification, in practical environments due to possible critical issues such as collisions and malfunctions. Several reachability analysis techniques have been developed for the safe operation of various types of systems (e.g., [1]–[3]) and applied to quadrotor control [4], legged locomotion [5], obstacle avoidance [6], among others. However, the practicality of these tools is often limited because they require knowledge of system models. The focus of this work is to develop a model-free reinforcement learning method for specifying reachability and safety in a probabilistic manner. Several learning-based safety specification methods have recently been proposed for deterministic dynamical systems without needing complete information about system models. To learn backward reachable sets, Hamilton–Jacobi reachability-based tools were used in conjunction with Gaussian process regression [7] and reinforcement learning [8]. As another safety certificate, a region of attraction was

This work was supported in part by the Creative-Pioneering Researchers Program through SNU, the Basic Research Lab Program through the National Research Foundation of Korea funded by the MSIT(2018R1A4A1059976), and Samsung Electronics.

S. Huh, and I. Yang are with the Department of Electrical and Computer Engineering, Automation and Systems Research Institute, Seoul National University, Seoul 08826, South Korea, {subinh1994,


estimated using Lyapunov-based reinforcement learning [9] and a neural network Lyapunov function [10]. Forward invariance has also been exploited for safety verification by learning control barrier functions [11], [12].

Departing from these tools for deterministic systems, we propose a model-free safety specification method for stochastic systems by carefully combining probabilistic reachability analysis and reinforcement learning. Specifically, our method aims to learn the maximal probability of avoiding the set of unsafe states. Several methods have been developed for computing the probability of safety in various cases via dynamic programming when the system model is known [1], [13]–[15]. To overcome this limitation, our tool uses model-free reinforcement learning for estimating the probability of safety. We further consider safety guarantees during the learning process so that our scheme runs without frequent intervention of a human supervisor who takes care of safety. To attain this property, we employ the Lyapunov-based RL framework proposed in [16], where the Lyapunov function takes the form of value functions, and thus safety is preserved in a probabilistic manner through the Bellman recursion. We revise this safe RL method to enhance its exploration capability. Note that the purpose of exploration in our method is to enlarge or confirm knowledge about safety, while most safe RL schemes encourage exploration to find rewardmaximizing policies within verified safe regions [17]–[19].

The main contributions of this work can be summarized as follows. First, we propose a safe RL method that specifies the probabilistic safety of a given Markov control system without prior information about the system dynamics. Our approach yields a sequence of safe and improving policies by imposing the Lyapunov constraint in its policy improvement stage and establishing a Lyapunov function in the policy evaluation stage. If there is no approximation error, our RLbased safety specification algorithm is guaranteed to run safely throughout the learning process. In such a case, the safe region determined by our approach also monotonically expands in a stable manner, and eventually converges to the maximal safe set. Second, we develop an efficient safe exploration scheme to learn safe or reachable sets in a sample-efficient manner. Safe policies tend to avoid reaching the borders of safe regions, so the “learned” probability of safety at their borders and outside them is likely to be more inaccurate than others. To mitigate the imbalance of knowledge, we select the least-safe policy to encourage exploration. This exploratory policy visits less-safe states so that the safe set becomes more accurate or grows faster. Third, we implement our approach with deep neural networks to alleviate the scalability issue that arises in high-dimensional systems. Converting the Lyapunov constraints to a regularization term, our approach can be implemented in conventional actor-critic algorithms for deep RL. We further show that our method outperforms other baseline methods through simulation studies.

We consider an MDP, defined as a tuple (S, A, p), where S is the set of states, A is the set of actions, and p : S × A × S → [0, 1]is the transition probability function. We also use the notation  Stermand  S′to represent the set of termination states and non-terminal states, respectively. Moreover, a (stochastic) Markov policy,  π : S×A → [0, 1], is a measurable function, and  π(a|s)represents the probability of executing action a given state s. We also let  Πdenote the set of stochastic Markov policies.

A. Probabilistic Reachability and Safety Specifications

We consider the problem of specifying the probability that the state of an MDP will not visit a pre-specified target set G ⊆ Sbefore arriving at a terminal state given an initial state s and a Markov policy  π. For our purpose of safety specification, the target set represents the set of unsafe states. The probability of safety given a policy  πand an initial state s is denoted by  P safes (π). To compute it, we consider the problem of evaluating  1 − P safes (π), which represents the probability of visiting the target set at least once given an initial state s and a Markov policy  π:


where  T ∗is the first time to arrive at a terminal state. Note that  P reachs (π)represents the probability of unsafety. Our goal is to compute the minimal probability of unsafety and specify the following maximal probabilistic safe set with tolerance  α ∈ (0, 1):


This set can be used for safety verification: If the agent is initialized within  S∗(α), we can guarantee safety with probability  1 − αby carefully steering the agent; otherwise, it is impossible to do so.

We now express the probability of unsafety as an expected sum of stage-wise costs by using the technique proposed in [13]. Let  1C : S �→ {0, 1}denote the indicator function of set  C ⊆ Sso that its value is 1 if  s ∈ C; otherwise, 0. Given a sequence of states  {s0, . . . , st}, we observe that


It is easily seen that the sum of �t−1k=0 1Gc(sk)1G(st)along the trajectory is equal to 0 if the trajectory is away from G and 1 if there exists at least one state  stthat is in G. The probability of unsafety under  πis then given by


We introduce an auxiliary state  xt, which is an indicator of whether a trajectory  {s0, · · · , st−1}is fully safe or not. It is defined as


Since  xt+1 = xt1Gc(st), xt+1depends solely on  (st, xt)and  at, so the Markov property holds with respect to the state pair  (st, xt). The problem of computing the minimal probability of unsafety can be formulated as


(1) which is in the form of the standard optimal control problem. Let  V ∗ : S × {0, 1} → Rdenote the optimal value function of this problem, that is,  V ∗(s, x) :=infπ∈Π Eπ[�T ∗−1t=0 xt1G(st) | (s0, x0) = (s, x)]. After computing the optimal value function, we can obtain the maximal probabilistic safe set by simple thresholding:


Note that this set is a superset of  Sπ(α) := {s ∈ S |P reachs (π) ≤ α} = {s ∈ S | V π(s, 1) ≤ α}for any Markov policy  π, where  V π : S×{0, 1}denotes the value function of πdefined by  V π(s, x) := Eπ[�T ∗−1t=0 xt1G(st) | (s0, x0) =(s, x)]. To distinguish  Sπ(α)from  S∗(α), we refer to the former as the (probabilistic) safe set under  π.

B. Safe Reinforcement Learning

Our goal is to compute the minimal probability of unsafety and the maximal probabilistic safe set without the knowledge of state transition probabilities in a safety-preserving manner. We propose an RL algorithm that guarantees the safety of the agent during the learning process for safety specification. More specifically, the sequence  {πk}k=0,1,...generated by the proposed RL algorithm satisfies


for k = 0, 1, . . .. This constraint ensures that


that is, the probabilistic safe set (given  α) monotonically expands. We also use the constraint (2) to perform safe exploration to collect sample data by preserving safety in a probabilistic manner.

To determine the set of safe policies that satisfy (2), we adopt the Lyapunov function proposed in [16] and enhance the approach to incentivize the agent to explore the state space efficiently.

Throughout the section, we assume that every terminal state lies in  S∗(α)and that, at all events, an agent arrives at a terminal state in a finite period. Thus, there exists an integer m such that  Pπ(sm ∈ Sterm; s0 = s) > 0 ∀s ∈ S, ∀π ∈ Π.In Section III-A and III-B, the state space S and the action space A are assumed to be finite. This assumption will be relaxed when discussing the deep RL version in Section III- C.

Let  T πddenote the stationary Bellman operator for the cost function  d(s, x) := x1G(s)


for all  (s, x) ∈ S′ × {0, 1}, and


for all  (s, x) ∈ Sterm × {0, 1}. Note that  T πdis an m-stage contraction with respect to  ∥·∥∞for all  (s, x) ∈ S′×{0, 1}.

A. Lyapunov Safety Specification

We adopt the following definition of Lyapunov functions, proposed in [16]:

Definition 1. A function  L : S × {0, 1} �→ [0, 1]is said to be a Lyapunov function with respect to a Markov policy  πif it satisfies the following conditions:


where  S0is a given subset of  S∗(α)and  d(s, x) := x1G(s).

Inequalities (3a) and (3b) are called the Lyapunov condition and the safety condition, respectively. We can show that if an arbitrary policy  ˜πsatisfies the Lyapunov condition, then the probability of unsafety at  S0does not exceed the threshold  α. To see this, we recursively apply  T πdon both sides of (3b) and use (3a) and the monotonicity of  T πdto obtain that, for any  s ∈ S0,


has a unique fixed point, which corresponds to the probability of unsafety,

Due to the m-stage contraction property,�T ˜πd�mhas a unique fixed point that corresponds to the probability of unsafety,  P reachs (˜π) = V ˜π(s, 1), under  ˜π. Therefore, by the Banach fixed point theorem, we have


Given a Lyapunov function L, consider the set  {˜π |(T ˜πd L)(s, 1) ≤ α ∀s ∈ S0}. Then, any policy  ˜πin this set satisfies the probabilistic safety condition  P reachs (˜π) ≤ αfor all  s ∈ S0by (5). Thus, when  S0is chosen as  Sπk(α), the safety constraint (2) is satisfied. This set of safe policies is called the L-induced policy set.

We can now introduce the Lyapunov safety specification method. For iteration k, we construct the Lyapunov function Lkby using the current policy  πkand update the policy to πk+1taken from the  Lk-induced policy set. Specifically, we set


where  ϵk : S × {0, 1} �→ R≥0is an auxiliary cost function. Following the cost-shaping method of [16], we define the auxiliary cost as the function


where  T πk(s, x)is the expected time for an agent to reach G or  Stermthe first time under policy  πkand initial state (s, x). We refer to  T πk(s, 1)as the first-hitting time for the rest of this article. It is straightforward to check that the Lyapunov condition (3a) is satisfied with  Lk. Furthermore, the function Lksatisfies the safety condition (3b) because, for all  s ∈ S0,


Therefore,  Lkis a Lyapunov function.

In the policy improvement step, we select  πk+1from the Lk-induced policy set so the updated policy is both safe and has an expanded probabilistic safe set.

Proposition 1. Suppose that  πk+1is chosen in  {π |(T πd L)(s, 1) ≤ α ∀s ∈ Sπk(α)}. Then, we have




Proof. The probabilistic safety of  πk+1follows from (5). This also implies that for an arbitrary  s ∈ Sπk(α), we have s ∈ Sπk(α). Therefore, the result follows.

To achieve the minimal probability of unsafety, we choose πk+1as the “safest” one in the  Lk-induced policy, that is,


Note that the value of Lyapunov function is 0 at x = 0, so we need not compute a policy for x = 0

As the MDP model is unknown, we approximate the value function of a policy using sample trajectories. We also use Q-learning to obtain a reliable estimate of stateaction value functions. Let  QVand  QTdenote the Qfunctions for the probability of unsafety and a first-hitting time, respectively. Given  (st, at, st+1)obtained by executing


πk, the Q-functions are updated as follows:


where  τl(s, a)is the learning rate satisfying �l τl(s, a) =∞and �l τ 2l (s, a) < ∞. We can also rewrite (6) as the following linear program associated with Q-functions:


where  QL,kis the Q-value of Lyapunov function given by QL,k(s, a) = QV,k(s, a) + ϵk(1)QL,k(s, a)and  ϵkis the shortened expression of  ϵk(1). The policy  πk+1(·|s)is then updated as the optimal solution of the linear program (8).

Combining the policy evaluation and the policy improvement steps of Q-functions, we construct the Lyapunov safety specification (LSS) as described in Algorithm 1. The convergence property of Q-learning in finite-state, finite-action space is well studied in [20], so we omit the theoretical details here. Under the standard convergence condition for Q-learning, the algorithm obtains a sequence of policies that satisfy Proposition 1.

B. Efficient Safe Exploration

In this subsection, we develop a novel for safe exploration to efficiently solve a probabilistic safety specification problem. We can utilize the Lyapunov constraint to construct a policy that takes potentially dangerous actions with adequate probability and thus assures safe navigation.

We take our motivation from the discovery that if a state is falsely assumed to have a high probability of unsafety,


Fig. 1. An example of safe exploration on a one-dimensional grid world. The confidence level is set to 0.9. Boxes represent states, and arrows toward the left or right symbolize the policies at each state. Unexamined states are shaded; the gray one is not in the target set, but it is considered unsafe. Choosing the policy at  scallows an agent to explore toward  sd (top). Asthe RL agent successfully returns to the safe set after visiting  sd with highprobability,  sdis added to the safe set (bottom).

it is unlikely to correct the misconception without taking exploratory actions. Consider the table of Q-value estimates used in the LSS algorithm. The Q-learning agent is initiated from the blank slate, so it is a safe choice to assume that all unvisited states evolve into the target set with high probability. As a result, the safe policy derived from the algorithm tends to confine an agent inside the current safe set. With enough time, the Q-value table becomes accurate at all states, but this is unattainable in practice. Therefore, it is crucial to explore the unidentified states, and this process involves visiting the exterior of the safe set.

In this regard, we choose the exploratory policy to be the most aggressive among the set of policies that guarantee safety in the safe set. Conversely, the probabilistic safety of the exploratory policy in the safe set is marginally greater than the tolerance. As there is no element G in  Sπs(α), such a policy is likely to bring an agent outside the safe set. The exploratory policy is efficient if used with an experience replay, the state distribution of which may diverge from the true distribution due to the scarcity of samples obtained in the exterior of the safe set. Our exploratory policy can mitigate the approximation error due to the discrepancy.

To illustrate our idea, we show a one-dimensional (1D) grid world consisting of five states  sa, . . . , seand two actions (left, right) as in Fig. 1. We know from experience that moving to the left at  sa, . . . , scguarantees 100% safety. The states  sdand  seare not visited yet, so the probabilities of unsafety at those states are 1. Suppose the agent is in  scand chooses to move left or right with probability  (1−α, α). The probability of unsafety of  πis then no more than  αbecause an agent never reaches  sdor  sewith probability  1−α. Also, if an agent successfully reaches  sdor  seand returns safely, we obtain an accurate estimate of the probability of unsafety and expand the safe set.

A policy suitable for exploration is not usually the safest


policy; therefore, we separate the exploratory policy  πefrom the policy that constructs the safe set, which is denoted by the safety specification-policy (SS-policy)  πs. Unlike the SS-policy, the exploratory policy drives an agent around the boundary of the safe set. To construct  πein a formal way, we exploit a given  πsand the Lyapunov function L defined as in Section III-A. First, consider the following policy optimization problem:


where  s0is an initial state. Note that this is the auxiliary problem merely to construct the exploratory policy with no connection to the original problem (1). As stated above, the exploratory policy should preserve safety confidence in the safe set under the SS-policy, that is,  V πe(s, 1) ≤ α, ∀s ∈Sπs(α). The solution of (9) satisfies this condition because of the Lyapunov constraint, but it can be suboptimal because the constraint in (9) is stronger than the original. However, by using the Lyapunov constraints, we can enjoy the benefit of using dynamic programming to solve (9).

Proposition 2. Let L be the Lyapunov function stated in (9). An optimal solution of (9) can be obtained by the value iteration using the Bellman operator


Specifically, the value function that satisfies  TexpV = Vis the probability of unsafety under such a policy.

Proof. The operator  Texpis a special form of the safe Bellman operator defined in [16], which is a monotone contraction mapping by Proposition 3 in [16]. Thus, there exists a unique fixed point of  Texp. By the definition of the operator, the fixed point corresponds to the policy and solves problem 9.

As Proposition 2 certifies, we can perform the Bellman operation on  V πsiteratively to obtain  πe, which is the solution of (9). However, in the RL domain, it is difficult to reproduce the whole dynamic programming procedure, since each Bellman operation corresponds to a time-consuming Q-value computation. We thus apply the Bellman operation once to obtain  πe(·|s)at iteration number k as


To sum up, we add an exploratory policy to LSS to obtain the exploratory LSS (ESS), as Algorithm 2.

C. Deep RL Implementation

Each policy improvement stages in Algorithm 1 or 2 solves a linear program. This operation is not straightforward for nontabular implementations. Thus, we provide adaptations of the LSS and ESS for parametrized policies, such as neural networks. To apply our approach to high-dimensional environments in this section, we assume that the state and action spaces are continuous, which is the general setup in policy gradient (PG) algorithms. Suppose a generic policy is parameterized with  θ, and we rewrite the policy improvement step of the LSS as


where  πsis the current SS-policy and  QV , QL, and  ϵare the values defined as the previous section with respect to  πs.

We use Lagrangian relaxation [21] to form an unconstrained problem. Ignoring the constraints, the PG minimizes a single objective  Es,a∼π[QV (s, a)]. The Lyapunov condition is state-wise, so the number of constraints is the same as |S|. We can replace the constraints with a single one  maxs∈S�A QL(s, a) (πθ(a|s) − πs(a|s)) da − ϵ ≤ 0. However, one drawback of this formulation is that the Lagrangian multiplier of the max-constraint places excessive weight on the constraint. In practice, the LHS of this max-constraint is likely greater than 0 due to the parameterization errors, resulting in the monotonic increase of the Lagrangian multiplier throughout learning. Therefore, we adopt statedependent Lagrangian multipliers to have


where  λ(s)is the Lagrangian multiplier at state s, and  ρθis the discounted state-visiting probability of  πθ. We can assume that nearby states have similar  λ(s). Thus, we can parameterize  λ(s)as a critic model, as in [22]. Throughout this section, we represent  ωas the parameter of  λ.

Our goal is to find the saddle point of (12), which is a feasible solution of the original problem (11). We apply the gradient descent (ascent) to optimize  θand  ω. The Qvalues that comprise the Lagrangian are, by definition, the functions of the policy parameter  θ, but since we incorporate the actor-critic framework, the Q-functions are approximated with critics independent of  θ. In this regard, we obtain the update rules for the safety specification-actor (SS-actor) and the Lagrangian multiplier associated with it as follows:


where  at ∼ πθs(st)and  aold,tdenotes the sampled action from the policy parametrized with the old  θs.

We apply the same approach to improve the exploratory actor. The unconstrained problem is similar to (12) except for the opposite sign of the primal objective, so we have


where aexp,t ∼ πθe(st), at ∼ πθs(st).

Besides, critic parameters are optimized to minimize the Bellman residual. The scheme is analogous to the Q-learning version, as in (7), but in this case, we express the discount factor  γ. Recall that the Lyapunov Q-function is a weighted sum of the two Q-functions  QVand  QT, one for a probability of unsafety and the other for a first-hitting time, respectively. Letting  φand  ψrepresent the parameters of  QVand  QT, the targets for the critics  Qφand  Qψare defined as


where  at+1is the action sampled from  πθ′s(st+1). The proposed actor-critic algorithm is summarized in Algorithm 3.

In our experiments, we use the double Q-learning technique in [23] to prevent the target  yVfrom being overly greater than the true probability of unsafety. In this case, two critics have independent weights  φ1, φ2, and two target critics pertained to the respective critics. That is,  Qφ′(st+1, at+1)in  yVis replaced with minj=1,2 Qφ′j(st+1, at+1), where  at+1 ∼ πθ′s(st+1). More- over, we adjust the experience replay to alleviate errors in QV. Catastrophic forgetting is the primary concern, since the target set should be precisely specified to obtain safe policies. We fix the ratio of safe samples (i.e.,  st /∈ G) and unsafe samples (i.e.,  st ∈ G) in a minibatch so that the value of  QVis 1 in the identified states of the target set. We explain the ancilliary techniques in Section IV-B.

In this section, we demonstrate our safe learning and safety specification methods using simulated control tasks. We test the validity of our approach in a simple double integrator and further verify our deep RL algorithms with the high-dimensional dynamic system introduced in [24], both of which have a tuple of positions and velocities as a state. To make the environments congruous with our problem setting, a target set is defined as the exterior of a certain bounded region of the state space, a setup that enables the implementation of tabular Q-learning. The description of


environments, including the definition of the target sets, can be found in Appendix B.

In Section III, we stated the theoretical guarantees as follows. First, Lyapunov-based methods obtain a subset of S∗(α). Second, the improved safe set includes the current safe set. Third, the agent ensures safety while running in the environment if the initial state is safe. However, in practice, these guarantees cannot be strictly satisfied, since we determine a safe set with the approximated probability of unsafety. To distinguish the obtainable safe set from the ideal one derived from the true MDP, we represent the estimate of the safe set under  πas


We introduce two metrics to quantify how close well-trained RL agents are to such guarantees. Regarding the accuracy of safety specification, we inspect if a safe set contains the elements of  S∗(α)and if it does not include the unreliable states in  S∗(α)c. We thus consider the ratio of correct specification


and the ratio of false-positive specification



Fig. 2. Safety specification via tabular Q-learning tested on the double integrator. The solid line denotes the average, and the shaded area indicates the confidence interval of 20 random seeds. The baseline, LSS, and ESS are denoted by teal, orange, and blue, respectively.

We also verify safe exploration throughout learning by tracking the proportion of safely terminated episodes among the 100 latest episodes, which is denoted by the average episode safety (AES). A trajectory is considered safe if an agent reaches terminal states without visiting G or stays in Gcuntil the time limit.

Throughout our experiments, we set  α = 0.2, so AES should be no less than 0.8 to guarantee safe navigation. We also improve learning speed by introducing a discount factor  γ < 1, which is equivalent to  p(st+1 ∈ Sterm|st, at). As the key idea of our approach is the separation of the exploratory policy from the SS-policy, we set an unmodified RL method as baseline; that is, the baseline agents are trained to minimize the expected sum of  xt1G(st).

A. Tabular Q-Learning

First, we evaluate our Lyapunov-based safety specifi-cation methods with tabular implementations. For tabular Q-learning, we discretize a continuous action a = (a1, · · · , adim A)into partitions of  A1, · · · , Adim Aequal intervals for each element. In other words, applying the nth action for  amis interpreted as  a = (am,max −am,min) nAm−1 +am,min. Likewise, state space is represented as a finite-dimensional grid. Based on the MDP quantized as above, the true probability of safety is computed via dynamic programming.

We use a double integrator environment to test the tabular cases. To reduce the size of  S∗(α), we modify the integrator to perturb the input acceleration with a certain probability. We compare LSS, ESS, and a baseline Q-learning with no extra techniques to shield unsafe actions. We initialize the Qfunction tables with random values uniformly sampled from the interval [0.99, 1]; that is, the probability of unsafety is estimated as nearly 1 in all states. Therefore, in the tabular setting we impose the assumption that all unvisited states have the probabilistic safeties lower than the threshold. We then perform the policy improvement  102times, each of which comes after  106environment steps.

Fig. 2 summarizes the specification result averaged across 20 random seeds. Both LSS and ESS show monotonic improvement of the approximated safe set ˆSπ(α). Notably, we find evidence of ESS taking advantage of exploration. The  rcof ESS increase faster than those of LSS or the baseline, while the excess of  rfpof ESS is negligible. The


Fig. 3. Safe sets for the integrator problem with  α = 0.2. Each grid point denotes a state (position, velocity). The ground truth  S∗(α)is denoted by yellow in (a). The other figures show the safe set estimated by (b) the baseline, (c) LSS, and (d) ESS. The shaded region represents ˆSπ(α):correctly specified states are marked yellow, and unsafe states misclassified as safe are marked red.

average value of  rcis 44% for ESS, surpassing the baseline of 34%. The effect of ESS culminates at the beginning of the learning process then dwindles because the boundary of ˆSπ(α)becomes unlikely to reach as the set inflates, so the chance of exploration decreases. Ideally, with the appropriate choice of  γ ≈ 1and the learning rate,  rfpis nearly 0. We skip the AES in Fig. 2, since no agent lacks safety confidence. However, the AES might decline without the limit, since an episode is configured to terminate after 200 steps, which restricts the chance of reaching the target set.

We illustrate the safety analysis results of respective methods and the ground-truth probabilistic safe set in Fig. 3. Each approximated safe set is established from the Q-learning table of an agent with the highest rate of correct specification among the 20 random seeds analyzed in Fig. 2. A grid map represents the whole non-target set except for the grid points on the sides, and the approximated safe set is the set of red and yellow points. The size of ˆSπ(α)for ESS is notably larger than that of the baseline or LSS in the cases of both correctly specified states (yellow) and misclassified states (red). However, the false-positive in the safe set estimated by ESS is hardly due to the ESS method but comes from a universal limitation of tabular Q-learning. This can be explained from the observation that the ratio of misclassified states over the whole ˆSπ(α)of ESS is greater than that of the baseline only by 5%; that is, ESS does not particularly overestimate the probability of safety in unsafe states. The ESS Q-learning is expected to obtain an accurate estimate of S∗if the implementation of Q-learning is improved.

B. Deep RL

We present the experimental results in Algorithm 3 using a realistic robotic simulation. We demonstrate that our approach can be coupled with well-established deep RL methods to perform safety specifications efficiently in the continuous state and action space. Details about our deep RL implementation can be found in Appendix A. We consider a Reacher system for safety analysis. In the Reacher, safety constraints are set on the position of the end effector (See Appendix B for details).

We implement the LSS and ESS actor-critic in DDPG [25], and the baseline. For the sake of fairness, all the algorithms use the same actor network weight and the same replay memory at the start of learning. The critics are initialized


Fig. 4. Safety specification via deep RL tested on the Reacher. (a-b) are the results averaged across 10 random seeds, and (c-d) are the best results for various methods. (e) displays the average episode safety swept across all seeds. Color schemes are equivalent to Fig. 2.

randomly, but the bias value for each layer of  QVis set to 1 so that  QV (s, a) = 1for almost all  (s, a) ∈ S × A. This ensures that the ratio of correct specification is 0 at the very beginning. We also optimize only the critics for the first  105steps to reduce the discrepancies between critics and actors. The techniques mentioned in Section III-C are also applied: we fill 20% of each minibatch with the unsafe samples and use double  QVnetworks for critic update.

The Lyapunov-based RL agents require auxiliary cost  ϵ, as in Section II. For the case of a continuous state space, the safe set is not explicitly defined, so  ϵshould be approximated. We first set the denominator of  ϵto  T π(s) ≈ (1 −γ)−1to prevent it from being larger than the true value. To estimate  mins∈S∗(α){α − V π(s, 1)}, we use supplementary memory that remembers the value of  {α−V π(s, 1)}+for s such that  V π(s, 1) ≤ α.s When an episode is terminated, an agent computes  V πfor all the states in the trajectory and find the maximum among the values that satisfy  V π(s, 1) ≤ α. The memory stores the result for the 100 latest trajectories.

We also exploit the two actors of the ESS actor-critic to ensure safe operation. Since it takes time to construct a stable exploratory actor, the agent makes stochastic choices between the two actors in the early stages. The probability of an SS-actor being chosen is 1 at the first gradient step and declines linearly until it becomes 0 after the first half of the learning process. The SS-actor is also utilized as the backup policy; that is, the agent takes the action using  πsif the AES is less than the threshold  1−α, regardless of the policy choice


Fig. 5. Safe sets in the state space of the Reacher. Each grid point denotes a state of the end effector whose position is determined by the angles of the two joints and whose velocity is 0. Given  α = 0.2, the ground truth S∗(α)is denoted by yellow in (a). The other figures show the estimated safe set obtained by (b) the baseline, (c) LSS, and (d) ESS. Color schemes are equivalent to Fig. 3.

scheme described above. To reduce computation time,  λωsis fixed to 0 for the ESS actor-critic.

Fig. 4 summarizes the experimental result. We perform tests on 10 random seeds to take an average (4a, 4b) and to display the ones that attain the greatest  rcamong various methods (4c, 4d). Comparing the average cases, the ESS actor-critic shows improvement in both specification criteria, and is noticeable for false positives. ESS consistently reduces rfpexcept for the first  3×105steps and then achieves 4.10%, while the baseline and LSS settle at 7.30% and 5.22%, respectively. The learning curves of ESS and the baseline are similar at the very start, since ESS does not regularly use the exploratory policy then. The exploratory policy in ESS supplements novel information about the states, which are normally the elements of the target set, and the safe set thus becomes more accurate. On the other hand, those of the baseline stay stagnant because the agent barely falls into an unusual trajectory with the SS-policy. Regarding LSS, we observe that the regularization term in its update rule degrades the overall performance.

As seen by the large confidence interval of ESS in Fig. 4a, the effect of the exploratory policy varies. ESS performs as the description in Section III-B; considering the best cases, ESS attains 77.7% for the correct specification, which is 13.4% above the baseline. The exploratory policies sometimes converge fast and become indifferent to the SS-policies in terms of exploration, resulting in poor performance. Note that the difference in ESS behavior is determined by the approximation error in the critic  QV. Although it is difficult to organize the parametrized critic, we can exploit the potential of ESS by running on multiple seeds and finding the best among them.

In Fig. 5, we further visualize a relevant part of the state space and the safe sets in it. Each grid map displays ˆSπ(α)of the agent whose  rcis the greatest among the 10 random seeds discussed above. The safe set obtained by ESS clearly resembles the true safe set better than the others.

We have proposed a model-free safety specification tool that incorporates a Lyapunov-based safe RL approach with probabilistic reachability analysis. Our method exploits the Lyapunov constraint to construct an exploratory policy that mitigates the discrepancy between state distributions of the experience replay (or the tabular Q-function) and the environment. Another salient feature of the proposed method is that it can be implemented on generic, model-free deep RL algorithms, particularly in continuous state and action spaces through Lagrangian relaxation. The results of our experiments demonstrate that our method encourages visiting the unspecified states, thereby improving the accuracy of specification. By bridging probabilistic reachability analysis and reinforcement learning, this work can provide an exciting avenue for future research in terms of extensions to partially observable MDPs, and model-based exploration and its regret analysis, among others.


A. Deep RL Implementation

In this section, we provide a specific description of the deep RL agents used in our experiments. Table I displays the basic architecture of neural networks, all of which are fully connected and consist of two hidden layers with ReLU as an activation function unless it is an estimator of  QV. The first and second hidden layers have 400 and 300 nodes, respectively. Adam optimizer [26] is used to apply gradient descent. Aside from the techniques stated in Section IV-B, an action is perturbed with Ornstein-Uhlenbeck noise with parameters  µ = 0, θ = 0.1, and  σ = 0.05.



B. Environments

An environment provides a Boolean done signal that declares the termination of an episode strictly equivalent to 1Sterm. When its value is 1, both  QVand  QTat that state are set to 0. If the length of an episode exceeds the time limit before arriving at a terminal state, the environment resets itself, but done is still 0 at that moment. Refer to Table II for the time limit and the discount factor settings.

Randomized integrator. A vanilla double integrator is a system with a 2D state  (x1, x2)and the scalar control u. x1and  x2represent the position and velocity on a 1D line, respectively. The control is an acceleration.

We add a few features to construct a safety specification problem in this environment. First, we set the terminal states as the points near the origin  (x1, x2) ∈ [−0.2, 0.2] ×[−3.75 × 10−3, 3.75 × 10−3]. Next the target set is defined as all the states  (x1, x2) /∈ [−1, 1] × [−0.5, 0.5]. Finally, we restrict admissible action to the range  [−0.5, 0.5], and adjust the dynamics so that the acceleration is scaled to 0.5u/|u| with probability 1/2. Due to the introduction of stochastic behavior, it becomes more difficult to reach the terminal states safely than in the original environment.


Fig. 6. Description of the Reacher environment.

Reacher. Reacher is a simulative planar 2-DOF robot with two arms attached to joints implemented with a Mujoco engine [27]. The joint of the first arm is fixed on the center of the plane, and the joint of the second is connected to the movable end of the first. The objective of the robot is to touch a motionless goal point with its end effector. An observation is thus defined as a vector that contains the angular positions and the angular velocities of the joints as well as the position of the goal. The action is defined as the torques on the joints, each of which is bounded in the range  [−1, 1].

Let the coordinates be defined as in Fig. 6. Specifically, the goal is deployed randomly in the hued area {(x, y)|�x2 + y2 ≤√2l, | arctan y/x| ≤ π/4}, where l is the length of one arm. The exact position changes for each reset. We define the target set as {(x, y)||y| > l}, where (x, y) is the coordinate of the tip.

We derive the probabilistic safe set in Fig. 5 under the assuming no friction. This is not the case in a Mujoco-based simulation, but the effect of such an assumption is minor. Recall that the states displayed in Fig. 5 stand for an end effector with zero velocity. If appropriate control is applied, the robot can avoid reaching the target set by moving toward an arbitrary position near the goal unless it launched from the target set at the beginning.

In our simulation studies, we only assess the agents with the states where the goal point is given by  (−2l, 0), and the angular velocity is  ( ˙θ1, ˙θ2) = (0, 0). We use the Reacher configuration provided by Gym [28].



[1] A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, Nov 2008.

[2] A. Majumdar, R. Vasudevan, M. M. Tobenkin, and R. Tedrake, “Convex optimization of nonlinear feedback controllers via occupation measures,” Int. J. Robot. Res., vol. 33, no. 9, pp. 1209–1230, 2014.

[3] M. Chen, S. L. Herbert, M. S. Vashishtha, S. Bansal, and C. J. Tomlin, “A general system decomposition method for computing reachable sets and tubes,” IEEE Trans. Automat. Contr., vol. 63, no. 11, pp. 3675– 3688, 2018.

[4] J. H. Gillula, G. M. Hoffmann, H. Huang, M. P. Vitus, and C. J. Tomlin, “Applications of hybrid reachability analysis to robotic aerial vehicles,” Int. J. Robot. Res., vol. 30, no. 3, pp. 335–354, 2011.

[5] G. Piovan and K. Byl, “Reachability-based control for the active slip model,” Int. J. Robot. Res., vol. 34, no. 3, pp. 270–287, 2015.

[6] N. Malone, H. T. Chiang, K. Lesser, M. Oishi, and L. Tapia, “Hybrid dynamic moving obstacle avoidance using a stochastic reachable setbased potential field,” IEEE Trans. Robot., vol. 33, no. 5, pp. 1124– 1138, 2017.

[7] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin, “A general safety framework for learning-based control in uncertain robotic systems,” IEEE Trans. Automat. Contr., vol. 64, no. 7, pp. 2737–2752, 2018.

[8] J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. Tomlin, “Bridging Hamilton-Jacobi safety analysis and reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Autom., 2019, pp. 8550–8556.

[9] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Adv. Neural Inf. Process. Syst., 2017.

[10] S. M. Richards, F. Berkenkamp, and A. Krause, “The Lyapunov neural network: Adaptive stability certification for safe learning of dynamical systems,” in Proc. 2nd Conf. on Robot Learn., 2018, pp. 466–476.

[11] L. Wang, E. A. Theodorou, and M. Egerstedt, “Safe learning of quadrotor dynamics using barrier certificates,” in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 2460–2465.

[12] A. Taylor, A. Singletary, Y. Yue, and A. Ames, “Learning for safety-critical control with control barrier functions,” arXiv preprint arXiv:1912.10099, 2019.

[13] S. Summers and J. Lygeros, “Verification of discrete time stochastic hybrid systems: A stochastic reach-avoid decision problem,” Automatica, vol. 46, no. 12, pp. 1951–1961, 2010.

[14] K. Lesser and M. Oishi, “Approximate safety verification and control of partially observable stochastic hybrid systems,” IEEE Trans. Automat. Contr., vol. 62, no. 1, pp. 81–96, 2017.

[15] I. Yang, “A dynamic game approach to distributionally robust safety specifications for stochastic systems,” Automatica, vol. 94, pp. 94–101, 2018.

[16] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A Lyapunov-based approach to safe reinforcement learning,” in Adv. Neural Inf. Process. Syst., 2018, pp. 8103–8112.

[17] M. Turchetta, F. Berkenkamp, and A. Krause, “Safe exploration in finite markov decision processes with gaussian processes,” in Adv. Neural Inf. Process. Syst., 2016, pp. 4312–4320.

[18] M. Alshiekh, R. Bloem, R. Ehlers, B. K¨onighofer, S. Niekum, and U. Topcu, “Safe reinforcement learning via shielding,” in Proc. AAAI Conf. on Artif. Intell., 2018, pp. 2669–2678.

[19] A. Wachi, Y. Sui, Y. Yue, and M. Ono, “Safe exploration and optimization of constrained MDPs using Gaussian processes,” in Proc. AAAI Conf. on Artif. Intell., 2018, pp. 6548–6556.

[20] J. N. Tsitsiklis, “Asynchronous stochastic approximation and q-learning,” Mach. Learn., vol. 16, no. 3, pp. 185–202, 1994.

[21] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA, USA: Athena Scientific, 1999.

[22] S. Bohez, A. Abdolmaleki, M. Neunert, J. Buchli, N. Heess, and R. Hadsell, “Value constrained model-free continuous control,” arXiv preprint arXiv:1902.04623, 2019.

[23] H. van Hasselt, “Double Q-learning,” in Adv. Neural Inf. Process. Syst., 2010, pp. 2613–2621.

[24] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in Int. Conf. on Mach. Learn., vol. 48, 2016, pp. 1329–1338.

[25] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in arXiv preprint arXiv:1509.02971, 2015.

[26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in arXiv preprint arXiv:1412.6980, 2014.

[27] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033.

[28] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.

Designed for Accessibility and to further Open Science