b

DiscoverSearch
About
My stuff
Safe Exploration in Finite Markov Decision Processes with Gaussian Processes
2016·arXiv
Abstract
Abstract

In classical reinforcement learning agents accept arbitrary short term loss for long term gain when exploring their environment. This is infeasible for safety critical applications such as robotics, where even a single unsafe action may cause system failure or harm the environment. In this paper, we address the problem of safely exploring finite Markov decision processes (MDP). We define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior. We develop a novel algorithm, SAFEMDP, for this task and prove that it completely explores the safely reachable part of the MDP without violating the safety constraint. To achieve this, it cautiously explores safe states and actions in order to gain statistical confidence about the safety of unvisited state-action pairs from noisy observations collected while navigating the environment. Moreover, the algorithm explicitly considers reachability when exploring the MDP, ensuring that it does not get stuck in any state with no safe way out. We demonstrate our method on digital terrain models for the task of exploring an unknown map with a rover.

Today’s robots are required to operate in variable and often unknown environments. The traditional solution is to specify all potential scenarios that a robot may encounter during operation a priori. This is time consuming or even infeasible. As a consequence, robots need to be able to learn and adapt to unknown environments autonomously [10, 2]. While exploration algorithms are known, safety is still an open problem in the development of such systems [18]. In fact, most learning algorithms allow robots to make unsafe decisions during exploration. This can damage the platform or its environment.

In this paper, we provide a solution to this problem and develop an algorithm that enables agents to safely and autonomously explore unknown environments. Specifically, we consider the problem of exploring a Markov decision process (MDP), where it is a priori unknown which state-action pairs are safe. Our algorithm cautiously explores this environment without taking actions that are unsafe or may render the exploring agent stuck.

Related Work. Safe exploration is an open problem in the reinforcement learning community and several definitions of safety have been proposed [16]. In risk-sensitive reinforcement learning, the goal is to maximize the expected return for the worst case scenario [5]. However, these approaches only minimize risk and do not treat safety as a hard constraint. For example, Geibel and Wysotzki [7] define risk as the probability of driving the system to a previously known set of undesirable states. The main difference to our approach is that we do not assume the undesirable states to be known a priori. Garcia and Fernández [6] propose to ensure safety by means of a backup policy; that is, a policy that is known to be safe in advance. Our approach is different, since it does not require a backup policy but only a set of initially safe states from which the agent starts to explore. Another approach that makes use of a backup policy is shown by Hans et al. [9], where safety is defined in terms of a minimum reward, which is learned from data.

Moldovan and Abbeel [14] provide probabilistic safety guarantees at every time step by optimizing over ergodic policies; that is, policies that let the agent recover from any visited state. This approach needs to solve a large linear program at every time step, which is computationally demanding even for small state spaces. Nevertheless, the idea of ergodicity also plays an important role in our method. In the control community, safety is mostly considered in terms of stability or constraint satisfaction of controlled systems. Akametalu et al. [1] use reachability analysis to ensure stability under the assumption of bounded disturbances. The work in [3] uses robust control techniques in order to ensure robust stability for model uncertainties, while the uncertain model is improved.

Another field that has recently considered safety is Bayesian optimization [13]. There, in order to find the global optimum of an a priori unknown function [21], regularity assumptions in form of a Gaussian process (GP) [17] prior are made. The corresponding GP posterior distribution over the unknown function is used to guide evaluations to informative locations. In this setting, safety centered approaches include the work of Sui et al. [22] and Schreiter et al. [20], where the goal is to find the safely reachable optimum without violating an a priori unknown safety constraint at any evaluation. To achieve this, the function is cautiously explored, starting from a set of points that is known to be safe initially. The method in [22] was applied to the field of robotics to safely optimize the controller parameters of a quadrotor vehicle [4]. However, they considered a bandit setting, where at each iteration any arm can be played. In contrast, we consider exploring an MDP, which introduces restrictions in terms of reachability that have not been considered in Bayesian optimization before.

Contribution. We introduce SAFEMDP, a novel algorithm for safe exploration in MDPs. We model safety via an a priori unknown constraint that depends on state-action pairs. Starting from an initial set of states and actions that are known to satisfy the safety constraint, the algorithm exploits the regularity assumptions on the constraint function in order to determine if nearby, unvisited states are safe. This leads to safe exploration, where only state-actions pairs that are known to fulfil the safety constraint are evaluated. The main contribution consists of extending the work on safe Bayesian optimization in [22] from the bandit setting to deterministic, finite MDPs. In order to achieve this, we explicitly consider not only the safety constraint, but also the reachability properties induced by the MDP dynamics. We provide a full theoretical analysis of the algorithm. It provably enjoys similar safety guarantees in terms of ergodicity as discussed in [14], but at a reduced computational cost. The reason for this is that our method separates safety from the reachability properties of the MDP. Beyond this, we prove that SAFEMDP is able to fully explore the safely reachable region of the MDP, without getting stuck or violating the safety constraint with high probability. To the best of our knokwledge, this is the first full exploration result in MDPs subject to a safety constraint. We validate our method on an exploration task, where a rover has to explore an a priori unknown map.

In this section, we define our problem and assumptions. The unknown environment is modeled as a finite, deterministic MDP [23]. Such a MDP is a tuple  ⟨S, A(·), f(s, a), r(s, a)⟩with a finite set of states S, a set of state-dependent actions  A(·), a known, deterministic transition model f(s, a), and reward function r(s, a). In the typical reinforcement learning framework, the goal is to maximize the cumulative reward. In this paper, we consider the problem of safely exploring the MDP. Thus, instead of aiming to maximize the cumulative rewards, we define r(s, a) as an a priori unknown safety feature. Although r(s, a) is unknown, we make regularity assumptions about it to make the problem tractable. When traversing the MDP, at each discrete time step, k, the agent has to decide which action and thereby state to visit next. We assume that the underlying system is safety-critical and that for any visited state-action pair,  (sk, ak), the unknown, associated safety feature,  r(sk, ak),must be above a safety threshold, h. While the assumption of deterministic dynamics does not hold for general MDPs, in our framework, uncertainty about the environment is captured by the safety feature. If requested, the agent can obtain noisy measurements of the safety feature,  r(sk, ak), by taking action  ak in state sk. The index t is used to index measurements, while k denotes movement steps. Typically  k ≫ t.

It is hopeless to achieve the goal of safe exploration unless the agent starts in a safe location. Hence, we assume that the agent stays in an initial set of state action pairs,  S0, that is known to be safe a priori. The goal is to identify the maximum safely reachable region starting from  S0, without visiting any unsafe states. For clarity of exposition, we assume that safety depends on states only; that is, r(s, a) = r(s). We provide an extension to safety features that also depend on actions in Fig. 2b.

image

Figure 1: Illustration of the set operators with  S = {¯s1,¯s2}. The set S = {s}can be reached from  s2in one step and from  s1in two steps, while only the state  s1can be reached from  s. Visiting s1 is safe;that is, it is above the safety threshold, is reachable, and there exists a safe return path through  s2.

Assumptions on the reward function Ensuring that all visited states are safe without any prior knowledge about the safety feature is an impossible task (e.g., if the safety feature is discontinuous). However, many practical safety features exhibit some regularity, where similar states will lead to similar values of r.

In the following, we assume that S is endowed with a positive definite kernel function  k(·, ·) and thatthe function  r(·)has bounded norm in the associated Reproducing Kernel Hilbert Space (RKHS) [19]. The norm induced by the inner product of the RKHS indicates the smoothness of functions with respect to the kernel. This assumption allows us to model  r as a GP [21], r(s) ∼ GP(µ(s), k(s, s′)). AGP is a probability distribution over functions that is fully specified by its mean function  µ(s)and its covariance function  k(s, s′). The randomness expressed by this distribution captures our uncertainty about the environment. We assume  µ(s) = 0for all  s ∈ S, without loss of generality. The posterior distribution over  r(·)can be computed analytically, based on t measurements at states  Dt = {s1, . . . , st} ⊆ Swith measurements,  yt = [r(s1) + ω1 . . . r(st) + ωt]T, that are corrupted by zero-mean Gaussian noise,  ωt ∼ N(0, σ2). The posterior is a GP distribution with mean  µt(s) = kt(s)T(Kt + σ2I)−1yt, variance  σt(s) = kt(s, s), and covariance  kt(s, s′) = k(s, s′) − kt(s)T(Kt + σ2I)−1kt(s′), where kt(s) = [k(s1, s), . . . , k(st, s)]T andKtis the positive definite kernel matrix,  [k(s, s′)]s,s′∈Dt. The identity matrix is denoted by  I ∈ Rt×t.

We also assume L-Lipschitz continuity of the safety function with respect to some metric  d(·, ·) on S.This is guaranteed by many commonly used kernels with high probability [21, 8].

Goal In this section, we define the goal of safe exploration. In particular, we ask what the best that any algorithm may hope to achieve is. Since we only observe noisy measurements, it is impossible to know the underlying safety function  r(·)exactly after a finite number of measurements. Instead, we consider algorithms that only have knowledge of  r(·)up to some statistical confidence  ϵ. Based on this confidence within some safe set S, states with small distance to S can be classified to satisfy the safety constraint using the Lipschitz continuity of  r(·). The resulting set of safe states is

image

which contains states that can be classified as safe given the information about the states in S. While (1) considers the safety constraint, it does not consider any restrictions put in place by the structure of the MDP. In particular, we may not be able to visit every state in  Rsafeϵ (S)without visiting an unsafe state first. As a result, the agent is further restricted to

image

the set of all states that can be reached starting from the safe set in one step. These states are called the one-step safely reachable states. However, even restricted to this set, the agent may still get stuck in a state without any safe actions. We define

image

as the set of states that are able to return to a set S through some other set of states, S, in one step. In particular, we care about the ability to return to a certain set through a set of safe states S. Therefore, these are called the one-step safely returnable states. In general, the return routes may require taking more than one action, see Fig. 1. The n-step returnability operator  Rretn (S, S) = Rret(S, Rretn−1(S, S))with  Rret1 (S, S) = Rret(S, S)considers these longer return routes by repeatedly applying the return operator,  Rret in (3), ntimes. The limit  Rret(S, S) = limn→∞ Rretn (S, S)contains all the states that can reach the set S through an arbitrarily long path in S.

image

For safe exploration of MDPs, all of the above are requirements; that is, any state that we may want to visit needs to be safe (satisfy the safety constraint), reachable, and we must be able to return to safe states from this new state. Thus, any algorithm that aims to safely explore an MDP is only allowed to visit states in

image

which is the intersection of the three safety-relevant sets. Given a safe set S that fulfills the safety requirements,  Rret(Rsafeϵ (S), S)is the set of states from which we can return to S by only visiting states that can be classified as above the safety threshold. By including it in the definition of  Rϵ(S),we avoid the agent getting stuck in a state without an action that leads to another safe state to take.

Given knowledge about the safety feature in S up to  ϵaccuracy thus allows us to expand the set of safe ergodic states to  Rϵ(S). Any algorithm that has the goal of exploring the state space should consequently explore these newly available safe states and gain new knowledge about the safety feature to potentially further enlargen the safe set. The safe set after n such expansions can be found by repeatedly applying the operator in (4):  Rnϵ (S) = Rϵ(Rn−1ϵ (S))with  R1ϵ = Rϵ(S). Ultimately, the size of the safe set is bounded by surrounding unsafe states or the number of states in S. As a result, the biggest set that any algorithm may classify as safe without visiting unsafe states is given by taking the limit,  Rϵ(S) = limn→∞ Rnϵ (S).

Thus, given a tolerance level  ϵand an initial safe seed set  S0, Rϵ(S0)is the set of states that any algorithm may hope to classify as safe. Let  Stdenote the set of states that an algorithm determines to be safe at iteration t. In the following, we will refer to complete, safe exploration whenever an algorithm fulfills  Rϵ(S0) ⊆ limt→∞ St ⊆ R0(S0); that is, the algorithm classifies every safely reachable state up to  ϵaccuracy as safe, without misclassification or visiting unsafe states.

We start by giving a high level overview of the method. The SAFEMDP algorithm relies on a GP model of r to make predictions about the safety feature and uses the predictive uncertainty to guide the safe exploration. In order to guarantee safety, it maintains two sets. The first set,  St, contains all states that can be classified as satisfying the safety constraint using the GP posterior, while the second one, ˆSt, additionally considers the ability to reach points in  Stand the ability to safely return to the previous safe set, ˆSt−1. The algorithm ensures safety and ergodicity by only visiting states in ˆSt. In order to expand the safe region, the algorithm visits states in  Gt ⊆ ˆSt, a set of candidate states that, if visited, could expand the safe set. Specifically, the algorithm selects the most uncertain state in  Gt, which is the safe state that we can gain the most information about. We move to this state via the shortest safe path, which is guaranteed to exist (Lemma 2). The algorithm is summarized in Algorithm 1.

Initialization. The algorithm relies on an initial safe set  S0as a starting point to explore the MDP. These states must be safe; that is,  r(s) ≥ h, for all s ∈ S0. They must also fulfill the reachability and returnability requirements from Sec. 2. Consequently, for any two states,  s, s′ ∈ S0, there must exist a path in  S0that connects them:  s′ ∈ Rret(S0, {s}). While this may seem restrictive, the requirement is, for example, fulfilled by a single state with an action that leads back to the same state.

image

Classification. In order to safely explore the MDP, the algorithm must determine which states are safe without visiting them. The regularity assumptions introduced in Sec. 2 allow us to model the safety feature as a GP, so that we can use the uncertainty estimate of the GP model in order to determine a confidence interval within which the true safety function lies with high probability. For every state s, this confidence interval has the form  Qt(s) =�µt−1(s) ± √βtσt−1(s)�, where βt is apositive scalar that determines the amplitude of the interval. We discuss how to select  βtin Sec. 4.

Rather than defining high probability bounds on the values of r(s) directly in terms of  Qt, we consider the intersection of the sets  Qtup to iteration  t, Ct(s) = Qt(s) ∩ Ct−1(s)with  C0(s) = [h, ∞]for safe states  s ∈ S0 and C0(s) = Rotherwise. This choice ensures that set of states that we classify as safe does not shrink over iterations and is justified by the selection of  βtin Sec. 4. Based on these con-fidence intervals, we define a lower bound,  lt(s) = min Ct(s), and upper bound,  ut(s) = max Ct(s),on the values that the safety features r(s) are likely to take based on the data obtained up to iteration t. Based on these lower bounds, we define

image

as the set of states that fulfill the safety constraint on r with high probability by using the Lipschitz constant to generalize beyond the current safe set. Based on this classification, the set of ergodic safe states is the set of states that achieve the safety threshold and, additionally, fulfill the reachability and returnability properties discussed in Sec. 2:

image

Expanders. With the set of safe states defined, the task of the algorithm is to identify and explore states that might expand the set of states that can be classified as safe. We use the uncertainty estimate in the GP in order to define an optimistic set of expanders,

image

where  gt(s) =��{s′ ∈ S \ St | ut(s) − Ld(s, s′) ≥ h}��.The function  gt(s)is positive whenever an optimistic measurement at s, equal to the upper confidence bound,  ut(s), would allow us to determine that a previously unsafe state indeed has value  r(s′)above the safety threshold. Intuitively, sampling s might lead to the expansion of  Stand thereby ˆSt. The set  Gtexplicitly considers the expansion of the safe set as exploration goal, see Fig. 2a for a graphical illustration of the set.

Sampling and shortest safe path. The remaining part of the algorithm is concerned with selecting safe states to evaluate and finding a safe path in the MDP that leads towards them. The goal is to visit states that allow the safe set to expand as quickly as possible, so that we do not waste resources when exploring the MDP. We use the GP posterior uncertainty about the states in  Gtin order to make this choice. At each iteration t, we select as next target sample the state with the highest variance in  Gt, st = argmaxs∈Gt wt(s),where  wt(s) = ut(s) − lt(s). This choice is justified, because while all points in  Gtare safe and can potentially enlarge the safe set, based on one noisy sample we can gain the most information from the state that we are the most uncertain about. This design choice maximizes the knowledge acquired with every sample but can lead to long paths between measurements within the safe region. Given  st, we use Dijkstra’s algorithm within the set ˆSt in orderto find the shortest safe path to the target from the current state,  st−1. Since we require reachability and returnability for all safe states, such a path is guaranteed to exist. We terminate the algorithm when we reach the desired accuracy; that is,  argmaxs∈Gt wt(s) ≤ ϵ.

Action-dependent safety. So far, we have considered safety features that only depend on the states, r(s). In general, safety can also depend on the actions, r(s, a). In this section, we introduce a modified MDP that captures these dependencies without modifying the algorithm. The modified MDP is equivalent to the original one in terms of dynamics, f(s, a). However, we introduce additional action-states  safor each action in the original MDP. When we start in a state s and take action a, we first transition to the corresponding action-state and from there transition to f(s, a) deterministically. This model is illustrated in Fig. 2b. Safety features that depend on action-states,  sa, are equivalent to action-dependent safety features. The SAFEMDP algorithm can be used on this modified MDP without modification. See the experiments in Sec. 5 for an example.

The safety and exploration aspects of the algorithm that we presented in the previous section rely on the correctness of the confidence intervals  Ct(s). In particular, they require that the true value of the safety feature, r(s), lies within  Ct(s)with high probability for all  s ∈ Sand all iterations t > 0. Furthermore, these confidence intervals have to shrink sufficiently fast over time. The probability of r taking values within the confidence intervals depends on the scaling factor  βt. This scaling factor trades off conservativeness in the exploration for the probability of unsafe states being visited. Appropriate selection of  βthas been studied by Srinivas et al. [21] in the multi-armed bandit setting. Even though our framework is different, their setting can be applied to our case. We choose,

image

where B is the bound on the RKHS norm of the function  r(·), δis the probability of visiting unsafe states, and  γtis the maximum mutual information that can be gained about  r(·)from t noisy observations; that is,  γt = max|A|≤t I(r, yA).The information capacity  γthas a sublinear dependence on t for many commonly used kernels [21]. The choice of  βtin (8) is justified by the following Lemma, which follows from [21, Theorem 6]:

Lemma 1. Assume that  ∥r∥2k ≤ B, and that the noise  ωtis zero-mean conditioned on the history, as well as uniformly bounded by  σfor all t > 0. If  βtis chosen as in (8), then, for all t > 0 and all  s ∈ S, it holds with probability at least  1 − δ that r(s) ∈ Ct(s).

This Lemma states that, for  βtas in (8), the safety function r(s) takes values within the confidence intervals C(s) with high probability. Now we show that the the safe shortest path problem has always a solution:

Lemma 2. Assume that  S0 ̸= ∅and that for all states,  s, s′ ∈ S0, s ∈ Rret(S0, {s′}). Then, when using Algorithm 1 under the assumptions in Theorem 1, for all t > 0 and for all states,  s, s′ ∈ ˆSt, s ∈ Rret(St, {s′}).

This lemma states that, given an initial safe set that fulfills the initialization requirements, we can always find a policy that drives us from any state in ˆStto any other state in ˆStwithout leaving the set of safe states,  St. Lemmas 1 and 2 have a key role in ensuring safety during exploration and, thus, in our main theoretical result:

Theorem 1. Assume that  r(·)is L-Lipschitz continuous and that the assumptions of Lemma 1 hold. Also, assume that  S0 ̸= ∅, r(s) ≥ hfor all  s ∈ S0, and that for any two states,  s, s′ ∈ S0, s′ ∈ Rret(S0, {s}). Choose βt as in (8). Then, with probability at least  1 − δ, we have r(s) ≥ h forany s along any state trajectory induced by Algorithm 1 on an MDP with transition function f(s, a). Moreover, let  t∗be the smallest integer such that t∗βt∗γt∗ ≥ C |R0(S0)|ϵ2 ,with  C = 8/ log(1 + σ−2). Then there exists a  t0 ≤ t∗ such that, with probability at least  1 − δ, Rϵ(S0) ⊆ ˆSt0 ⊆ R0(S0).

Theorem 1 states that Algorithm 1 performs safe and complete exploration of the state space; that is, it explores the maximum reachable safe set without visiting unsafe states. Moreover, for any desired accuracy  ϵand probability of failure  δ, the safely reachable region can be found within a finite number of observations. This bound depends on the information capacity  γt, which in turn depends on the kernel. If the safety feature is allowed to change rapidly across states, the information capacity will be larger than if the safety feature was smooth. Intuitively, the less prior knowledge the kernel encodes, the more careful we have to be when exploring the MDP, which requires more measurements.

In this section, we demonstrate Algorithm 1 on an exploration task. We consider the setting in [14], the exploration of the surface of Mars with a rover. The code for the experiments is available at http://github.com/befelix/SafeMDP.

For space exploration, communication delays between the rover and the operator on Earth can be prohibitive. Thus, it is important that the robot can act autonomously and explore the environment without risking unsafe behavior. For the experiment, we consider the Mars Science Laboratory (MSL) [11], a rover deployed on Mars. Due to communication delays, the MSL can travel 20 meters before it can obtain new instructions from an operator. It can climb a maximum slope of  30◦[15, Sec. 2.1.3]. In our experiments we use digital terrain models of the surface of Mars from the High Resolution Imaging Science Experiment (HiRISE), which have a resolution of one meter [12].

As opposed to the experiments considered in [14], we do not have to subsample or smoothen the data in order to achieve good exploration results. This is due to the flexibility of the GP framework that considers noisy measurements. Therefore, every state in the MDP represents a  d × dsquare area with d = 1 m, as opposed to d = 20 m in [14].

At every state, the agent can take one of four actions: up, down, left, and right. If the rover attempts to climb a slope that is steeper than  30◦, it fails and may be damaged. Otherwise it moves deterministically to the desired neighboring state. In this setting, we define safety over state transitions by using the extension introduced in Fig. 2b. The safety feature over the transition from  s to s′ is definedin terms of height difference between the two states,  H(s) − H(s′). Given the maximum slope of α = 30◦ that the rover can climb, the safety threshold is set at a conservative  h = −d tan(25◦). Thisencodes that it is unsafe for the robot to climb hills that are too steep. In particular, while the MDP dynamics assume that Mars is flat and every state can be reached, the safety constraint depends on the a priori unknown heights. Therefore, under the prior belief, it is unknown which transitions are safe.

We model the height distribution, H(s), as a GP with a Matérn kernel with  ν = 5/2. Due to the limitation on the grid resolution, tuning of the hyperparameters is necessary to achieve both safety and satisfactory exploration results. With a finer resolution, more cautious hyperparameters would also be able to generalize to neighbouring states. The lengthscales are set to 14.5 m and the prior standard deviation of heights is 10 m. We assume a noise standard deviation of 0.075 m. Since the safety feature of each state transition is a linear combination of heights, the GP model of the heights induces a GP model over the differences of heights, which we use to classify whether state transitions fulfill the safety constraint. In particular, the safety depends on the direction of travel, that is, going downhill is possible, while going uphill might be unsafe.

Following the recommendations in [22], in our experiments we use the GP confidence intervals Qt(s)directly to determine the safe set  St. As a result, the Lipschitz constant is only used to determine expanders in G. Guaranteeing safe exploration with high probability over multiple steps leads to conservative behavior, as every step beyond the set that is known to be safe decreases the ‘probability budget’ for failure. In order to demonstrate that safety can be achieved empirically using less conservative parameters than those suggested by Theorem 1, we fix  βtto a constant value,  βt = 2, ∀t ≥ 0. This choice aims to guarantee safety per iteration rather than jointly over all the iterations. The same assumption is used in [14].

We compare our algorithm to several baselines. The first one considers both the safety threshold and the ergodicity requirements but neglects the expanders. In this setting, the agent samples the most uncertain safe state transaction, which corresponds to the safe Bayesian optimization framework in [20]. We expect the exploration to be safe, but less efficient than our approach. The second baseline considers the safety threshold, but does not consider ergodicity requirements. In this setting, we expect the rover’s behavior to fulfill the safety constraint and to never attempt to climb steep slopes, but it may get stuck in states without safe actions. The third method uses the unconstrained Bayesian optimization framework in order to explore new states, without safety requirements. In this setting, the agent tries to obtain measurements from the most uncertain state transition over the entire space, rather than restricting itself to the safe set. In this case, the rover can easily get stuck and may also incur failures by attempting to climb steep slopes. Last, we consider a random exploration strategy, which is similar to the-greedy exploration strategies that are widely used in reinforcement learning.

image

Figure 2: Comparison of different exploration schemes. The background color shows the real altitude of the terrain. All algorithms are run for 525 iterations, or until the first unsafe action is attempted. The saturated color indicates the region that each strategy is able to explore. The baselines get stuck in the crater in the bottom-right corner or fail to explore, while Algorithm 1 manages to safely explore the unknown environment. See the statistics in Fig. 2f.

We compare these baselines over an 120 by 70 meters area at  −30.6◦ latitude and  202.2◦ longitude.We set the accuracy  ϵ = σnβ. The resulting exploration behaviors can be seen in Fig. 2. The rover starts in the center-top part of the plot, a relatively planar area. In the top-right corner there is a hill that the rover cannot climb, while in the bottom-right corner there is a crater that, once entered, the rover cannot leave. The safe behavior that we expect is to explore the planar area, without moving into the crater or attempting to climb the hill. We run all algorithms for 525 iterations or until the first unsafe action is attempted. It can be seen in Fig. 2e that our method explores the safe area that surrounds the crater, without attempting to move inside. While some state-action pairs closer to the crater are also safe, the GP model would require more data to classify them as safe with the necessary confidence. In contrast, the baselines perform significantly worse. The baseline that does not ensure the ability to return to the safe set (non-ergodic) can be seen in Fig. 2a. It does not explore the area, because it quickly reaches a state without a safe path to the next target sample. Our approach avoids these situations explicitly. The unsafe exploration baseline in Fig. 2b considers ergodicity, but concludes that every state is reachable according to the MDP model. Consequently, it follows a path that crosses the boundary of the crater and eventually evaluates an unsafe action. Overall, it is not enough to consider only ergodicity or only safety, in order to solve the safe exploration problem. The random exploration in Fig. 2c attempts an unsafe action after some exploration. In contrast, Algorithm 1 manages to safely explore a large part of the unknown environment. Running the algorithm without considering expanders leads to the behavior in Fig. 2d, which is safe, but only manages to explore a small subset of the safely reachable area within the same number of iterations in which Algorithm 1 explores over 80% of it. The results are summarized in Table 2f.

We presented SAFEMDP, an algorithm to safely explore a priori unknown environments. We used a Gaussian process to model the safety constraints, which allows the algorithm to reason about the safety of state-action pairs before visiting them. An important aspect of the algorithm is that it considers the transition dynamics of the MDP in order to ensure that there is a safe return route before visiting states. We proved that the algorithm is capable of exploring the full safely reachable region with few measurements, and demonstrated its practicality and performance in experiments.

Acknowledgement. This research was partially supported by the Max Planck ETH Center for Learning Systems and SNSF grant 200020_159557.

[1] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proc. of the IEEE Conference on Decision and Control (CDC), pages 1424–1431, 2014.

[2] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[3] Felix Berkenkamp and Angela P. Schoellig. Safe and robust learning control with Gaussian processes. In Proc. of the European Control Conference (ECC), pages 2501–2506, 2015.

[4] Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with Gaussian processes. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2016.

[5] Stefano P. Coraluppi and Steven I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.

[6] Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, pages 515–564, 2012.

[7] Peter Geibel and Fritz Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research (JAIR), 24:81–108, 2005.

[8] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, 34(5):2413–2429, 2006.

[9] Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning. In Proc. of the European Symposium on Artificial Neural Networks (ESANN), pages 143–148, 2008.

[10] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: a survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

[11] Mary Kae Lockwood. Introduction: Mars Science Laboratory: The Next Generation of Mars Landers. Journal of Spacecraft and Rockets, 43(2):257–257, 2006.

[12] Alfred S. McEwen, Eric M. Eliason, James W. Bergstrom, Nathan T. Bridges, Candice J. Hansen, W. Alan Delamere, John A. Grant, Virginia C. Gulick, Kenneth E. Herkenhoff, Laszlo Keszthelyi, Randolph L. Kirk, Michael T. Mellon, Steven W. Squyres, Nicolas Thomas, and Catherine M. Weitz. Mars Reconnaissance Orbiter’s High Resolution Imaging Science Experiment (HiRISE). Journal of Geophysical Research: Planets, 112(E5):E05S02, 2007.

[13] Jonas Mockus. Bayesian Approach to Global Optimization, volume 37 of Mathematics and Its Applications. Springer Netherlands, 1989.

[14] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In Proc. of the International Conference on Machine Learning (ICML), pages 1711–1718, 2012.

[15] MSL. MSL Landing Site Selection User’s Guide to Engineering Constraints, 2007. URL http://marsoweb. nas.nasa.gov/landingsites/msl/memoranda/MSL_Eng_User_Guide_v4.5.1.pdf.

[16] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In Modelling and Simulation for Autonomous Systems, pages 357–375. Springer, 2014.

[17] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.

[18] Stefan Schaal and Christopher Atkeson. Learning Control in Robotics. IEEE Robotics & Automation Magazine, 17(2):20–29, 2010.

[19] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[20] Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with Gaussian processes. In Proc. of the European Conference on Machine Learning (ECML), volume 9284, pages 133–149, 2015.

[21] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. of the International Conference on Machine Learning (ICML), 2010.

[22] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In Proc. of the International Conference on Machine Learning (ICML), pages 997–1005, 2015.

[23] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, 1998.

Lemma 3.  ∀s ∈ S, ut+1(s) ≤ ut(s), lt+1(s) ≥ lt(s), wt+1(s) ≤ wt(s).

Proof. This lemma follows directly from the definitions of  ut(s), lt(s), wt(s) and Ct(s).

Lemma 4. n ≥ 1, s ∈ Rretn (S, S) =⇒ s ∈ S ∪ S.

Proof. Proof by induction. Consider  n = 1, then s ∈ Rret(S, S) =⇒ s ∈ S ∪ Sby definition. For the induction step, assume  s ∈ Rretn−1(S, S) =⇒ s ∈ S ∪ S. Now consider  s ∈ Rretn (S, S). We know that

image

Therefore, since  s ∈ Rretn−1(S, S) =⇒ s ∈ S ∪ S and S ⊆ S ∪ S, it follows that  s ∈ S ∪ S and theinduction step is complete.

Lemma 5.  ∀n ≥ 1, s ∈ Rretn (S, S) ⇐⇒ ∃k, 0 ≤ k ≤ n and (a1, . . . , ak), a sequence of k actions, that induces  (s0, s1, . . . , sk)starting at  s0 = s, such that  si ∈ S, ∀i = 0, . . . , k − 1 and sk ∈ S.

Proof.  ( =⇒ ). s ∈ Rretn (S, S)means that either  s ∈ Rretn−1(S, S)or  ∃a ∈ A(s): f(s, a) ∈Rretn−1(S, S). Therefore, we can reach a state in  Rretn−1(S, S)taking at most one action. Repeating this procedure i times, the system reaches a state in  Rretn−i(S, S)with at most i actions. In particular, if we choose i = n, we prove the agent reaches S with at most n actions. Therefore there is a sequence of actions of length k, with  0 ≤ k ≤ n, inducing a state trajectory such that:  s0 = s, si ∈ Rretn−i(S, S) ⊆ S ∪ S for every i = 0, . . . , k − 1 and sk ∈ S.

( ⇐= ). Consider k = 0. This means that  s ∈ S ⊆ Rretn (S, S). In case k = 1 we have that  s0 ∈ Sand that  f(s0, a1) ∈ S. Therefore  s ∈ Rret(S, S) ⊆ Rretn (S, S). For  k ≥ 2we know  sk−1 ∈ Sand  f(sk−1, ak) ∈ S =⇒ sk−1 ∈ Rret(S, S). Similarly  sk−2 ∈ Sand f(sk−2, ak−1) = sk−1 ∈ Rret(S, S) =⇒ sk−2 ∈ Rret2 (S, S). For any  0 ≤ k ≤ nwe can apply this reasoning k times and prove that  s ∈ Rretk (S, S) ⊆ Rretn (S, S).

Lemma 6.  ∀S, S ⊆ S, ∀N ≥ |S|, RretN (S, S) = RretN+1(S, S) = Rret(S, S)

Proof. This is a direct consequence of Lemma 5. In fact, Lemma 5 states that s belongs to  RretN (S, S)if and only if there is a path of length at most N starting from s contained in S that drives the system to a state in S. Since we are dealing with a finite MDP, there are |S| different states. Therefore, if such a path exists it cannot be longer than |S|.

Lemma 7. Given  S ⊆ R ⊆ S and S ⊆ R ⊆ S, it holds that  Rret(S, S) ⊆ Rret(R, R).

Proof. Let  s ∈ Rret(S, S). It follows from Lemmas 5 and 6 that there exists a sequence of actions, (a1, . . . , ak), with  0 ≤ k ≤ |S|, that induces a state trajectory,  (s0, s1, . . . , sk), starting at  s0 = swith  si ∈ S ⊆ R, ∀i = 1, . . . , k − 1and  sk ∈ S ⊆ R. Using the (  ⇐=) direction of Lemma 5 and Lemma 6, we conclude that  s ∈ Rret(R, R).

Lemma 8. S  ⊆ R =⇒ Rreach(S) ⊆ Rreach(R).

Proof. Consider  s ∈ Rreach(S). Then either  s ∈ S ⊆ R or ∃ˆs ∈ S ⊆ R, ˆa ∈ A(ˆs): s = f(ˆs, ˆa), bydefinition. This implies that  s ∈ Rreach(R).

Lemma 9. For any  t ≥ 1, S0 ⊆ St ⊆ St+1 and ˆS0 ⊆ ˆSt ⊆ ˆSt+1

Proof. Proof by induction. Consider  s ∈ S0, S0 = ˆS0by initialization. We known that

image

where the last inequality follows from Lemma 3. This implies that  s ∈ S1or, equivalently, that S0 ⊆ S1. Furthermore, we know by initialization that  s ∈ Rreach( ˆS0). Moreover, we can say that s ∈ Rret(S1, ˆS0), since S1 ⊇ S0 = ˆS0. We can conclude that  s ∈ ˆS1. For the induction step assume that  St−1 ⊆ St and ˆSt−1 ⊆ ˆSt. Let s ∈ St. Then,

image

Furthermore, it follows from Lemma 3 that  lt+1(s′) − Ld(s, s′) ≥ lt(s′) − Ld(s, s′). This implies that  lt+1(s′) − Ld(s, s′) ≥ h. Thus s ∈ St+1. Now consider  s ∈ ˆSt. We known that

image

We also know that  s ∈ Rret(St, ˆSt−1). Since we just proved that  St ⊆ St+1and we assumed ˆSt−1 ⊆ ˆStfor the induction step, Lemma 7 allows us to say that  s ∈ Rret(St+1, ˆSt). All together this allows us to complete the induction step by saying  s ∈ ˆSt+1.

Lemma 10.  S ⊆ R =⇒ Rsafeϵ (S) ⊆ Rsafeϵ (R).

Proof. Consider  s ∈ Rsafeϵ (S), we can say that:

image

This means that  s ∈ Rsafeϵ (R)

Lemma 11. Given two sets  S, R ⊆ S such that S ⊆ R, it holds that:  Rϵ(S) ⊆ Rϵ(R).

Proof. We have to prove that:

image

Let’s start by checking the reachability condition first:

image

Now let’s focus on the recovery condition. We use Lemmas 7 and 10 to say that  s ∈ Rret(Rsafeϵ (S), S)implies that  s ∈ Rret(Rsafeϵ (R), R)and this completes the proof.

Lemma 12. Given two sets  S, R ⊆ S such that S ⊆ R, the following holds:  Rϵ(S) ⊆ Rϵ(R).

Proof. The result follows by repeatedly applying Lemma 11.

Lemma 13. Assume that  ∥r∥2k ≤ B, and that the noise  ωtis zero-mean conditioned on the history, as well as uniformly bounded by  σfor all t > 0. If  βtis chosen as in (8), then, for all t > 0 and

image

Proof. See Theorem 6 in [21].

Lemma 1. Assume that  ∥r∥2k ≤ B, and that the noise  ωtis zero-mean conditioned on the history, as well as uniformly bounded by  σfor all t > 0. If  βtis chosen as in (8), then, for all t > 0 and all  s ∈ S, it holds with probability at least  1 − δ that r(s) ∈ Ct(s).

Proof. See Corollary 1 in [22].

Lemma 14. For all  t ≥ 1and for all  s ∈ ˆSt, ∃s′ ∈ S0 such that s ∈ Rret(St, {s′}).

Proof. We use a recursive argument to prove this lemma. Since  s ∈ ˆSt, we know that  s ∈Rret(St, ˆSt−1). Because of Lemmas 5 and 6 we know  ∃(a1, . . . , aj), with  j ≤ |S|, inducing s0, s1, . . . , sjsuch that  s0 = s, si ∈ St, ∀i = 1, . . . , j − 1and  sj ∈ ˆSt−1. Similarly, we can build another sequence of actions that drives the system to some state in ˆSt−2passing through  St−1 ⊆ Ststarting from  sj ∈ ˆSt−1. By applying repeatedly this procedure we can build a finite sequence of actions that drives the system to a state  s′ ∈ S0passing through  Ststarting from s. Because of Lemmas 5 and 6 this is equivalent to  s ∈ Rret(St, {s′}).

Lemma 15. For all  t ≥ 1and for all  s ∈ ˆSt, ∃s′ ∈ S0 such that s′ ∈ Rret(St, {s}).

Proof. The proof is analogous to the the one we gave for Lemma 14. The only difference is that here we need to use the reachability property of ˆStinstead of the recovery property of ˆSt.

Lemma 2. Assume that  S0 ̸= ∅and that for all states,  s, s′ ∈ S0, s ∈ Rret(S0, {s′}). Then, when using Algorithm 1 under the assumptions in Theorem 1, for all t > 0 and for all states,  s, s′ ∈ ˆSt, s ∈ Rret(St, {s′}).

Proof. This lemma is a direct consequence of the properties of  S0listed above (that are ensured by the initialization of the algorithm) and of Lemmas 14 and 15

Lemma 16. For any  t ≥ 0, the following holds with probability at least  1 − δ: ∀s ∈ St, r(s) ≥ h.

Proof. Let’s prove this result by induction. By initialization we know that  r(s) ≥ hfor all  s ∈ S0. For the induction step assume that for all  s ∈ St−1holds that  r(s) ≥ h. For any s ∈ St, by definition, there exists  z ∈ ˆSt−1 ⊆ St−1 such that

image

This relation holds with probability at least  1 − δbecause we used Lemma 1 to prove it.

Theorem 2. For any state s along any state trajectory induced by Algorithm 1 on a MDP with transition function f(s, a), we have, with probability at least  1 − δ, that r(s) ≥ h.

Proof. Let’s denote as  (st1, st2, . . . , stk)the state trajectory of the system until the end of iteration t ≥ 0. We know from Lemma 2 and Algorithm 1 that the  sti ∈ St, ∀i = 1, . . . , k. Lemma 16 completes the proof as it allows us to say that  r(sti) ≥ h, ∀i = 1, . . . , kwith probability at least 1 − δ.

Lemma 17. For any  t1 ≥ t0 ≥ 1, if ˆSt1 = ˆSt0, then,  ∀tsuch that  t0 ≤ t ≤ t1, it holds that

image

Proof. Since ˆStis not changing we are always computing the enlargement function over the same points. Therefore we only need to prove that the enlargement function is non increasing. We known from Lemma 3 that  ut(s)is a non increasing function of t for all  s ∈ S. Furthermore we know that (S \ St) ⊇ (S \ St+1)because of Lemma 9. Hence, the enlargement function is non increasing and the proof is complete.

Lemma 18. For any  t1 ≥ t0 ≥ 1, if ˆSt1 = ˆSt0, C1 = 8/log(1 + σ−2)and  st = argmaxs∈Gt wt(s), then,  ∀t such that t0 ≤ t ≤ t1, it holds that  wt(st) ≤�

Proof. See Lemma 5 in [22].

Lemma 19. For any  t ≥ 1, if C1 = 8/log(1 + σ−2) and Ttis the smallest positive integer such that Ttβt+Ttγt+Tt ≥ C1ϵ2 and St+Tt = St, then, for any  s ∈ Gt+Tt it holds that  wt+Tt(s) ≤ ϵ

Proof. The proof is trivial because  Ttwas chosen to be the smallest integer for which the right hand side of the inequality proved in Lemma 18 is smaller or equal to  ϵ.

Lemma 20. For any  t ≥ 1, if Rϵ(S0) \ ˆSt ̸= ∅, then, Rϵ( ˆSt) \ ˆSt ̸= ∅.

Proof. For the sake of contradiction assume that  Rϵ( ˆSt) \ ˆSt = ∅. This implies  Rϵ( ˆSt) ⊆ ˆSt. On the other hand, since ˆStis included in all the sets whose intersection defines  Rϵ( ˆSt), we know that, ˆSt ⊆ Rϵ( ˆSt). This implies that  ˆSt = Rϵ( ˆSt).

If we apply repeatedly the one step reachability operator on both sides of the equality we obtain Rϵ( ˆSt) = ˆSt. By Lemmas 9 and 12 we know that

image

This contradicts the assumption that  Rϵ(S0) \ ˆSt ̸= ∅.

Lemma 21. For any  t ≥ 1, if  Rϵ(S0) \ ˆSt ̸= ∅, then, with probability at least  1 − δit holds that ˆSt ⊂ ˆSt+Tt.

Proof. By Lemma 20 we know that  Rϵ(S0) \ ˆSt ̸= ∅. This implies that  ∃s ∈ Rϵ( ˆSt) \ ˆSt. Therefore there exists a  s′ ∈ ˆStsuch that:

image

For the sake of contradiction assume that ˆSt+Tt = ˆSt. This means that  s ∈ S \ ˆSt+Tt and s′ ∈ ˆSt+Tt.Then we have:

image

Assume, for the sake of contradiction, that  s ∈ S \ St+Tt. This means that  s′ ∈ Gt+Tt. We know that for any  t ≤ ˆt ≤ t + Ttholds that ˆSˆt = ˆSt, because ˆSt = ˆSt+Ttand ˆSt ⊆ ˆSt+1for all  t ≥ 1. Therefore we have  s′ ∈ ˆSt+Tt−1such that:

image

We also know that  s ∈ Rϵ( ˆSt+Tt) \ ˆSt+Tt =⇒ s ∈ Rret(Rsafeϵ ( ˆSt+Tt), ˆSt+Tt). We want to use this fact to prove that  s ∈ Rret(St+Tt, ˆSt+Tt−1). In order to do this, we intend to use the result from Lemma 7. We already know that ˆSt+Tt−1 = ˆSt+Tt. Therefore we only need to prove that  Rsafeϵ ( ˆSt+Tt) ⊆ St+Tt. For the sake of contradiction assume this is not true. This means ∃z ∈ Rsafeϵ ( ˆSt+Tt) \ St+Tt. Therefore there exists a  z′ ∈ ˆSt+Ttsuch that:

image

Consequently:

image

Hence  z′ ∈ Gt+Tt. Since we proved before that ˆSt+Tt = ˆSt+Tt−1, we can say that  z′ ∈ ˆSt+Tt−1and that:

image

Therefore  z ∈ St+Tt. This is a contradiction. Thus we can say that  Rsafeϵ ( ˆSt+Tt) ⊆ St+Tt. Hence:

image

In the end the fact that  s ∈ St+Ttand (13) and (16) allow us to conclude that  s ∈ ˆSt+Tt. This contradiction proves the theorem.

Lemma 22.  ∀t ≥ 0, ˆSt ⊆ R0(S0)with probability at least  1 − δ.

Proof. Proof by induction. We know that ˆS0 = S0 ⊆ R0(S0)by definition. For the induction step assume that for some  t ≥ 1holds that ˆSt−1 ⊆ R0(S0). Our goal is to show that  s ∈ ˆSt =⇒ s ∈R0(S0). In order to this, we will try to show that  s ∈ R0( ˆSt−1). We know that:

image

Furthermore we can say that:

image

For any  z ∈ St, we know that  ∃z′ ∈ ˆSt−1such that:

image

This means that  z ∈ St =⇒ z ∈ Rsafe0 ( ˆSt−1), or, equivalently, that  St ⊆ Rsafe0 ( ˆSt−1). Hence, Lemma 7 and (18) allow us to say that  Rret(St, ˆSt−1) ⊆ Rret(Rsafe0 ( ˆSt−1), ˆSt−1). This result, together with (17), leads us to the conclusion that  s ∈ R0( ˆSt−1). We assumed for the induction step that ˆSt−1 ⊆ R0(S0). Applying on both sides the set operator  R0(·), we conclude that  R0( ˆSt−1) ⊆R0(S0). This proves that  s ∈ ˆSt =⇒ s ∈ R0(S0)and the induction step is complete.

Lemma 23. Let  t∗be the smallest integer such that  t∗ ≥ |R0(S0)|Tt∗, then there exists a  t0 ≤ t∗such that, with probability at least  1 − δholds that ˆSt0+Tt0 = ˆSt0.

Proof. For the sake of contradiction assume that the opposite holds true:  ∀t ≤ t∗, ˆSt ⊂ ˆSt+Tt. Thisimplies that ˆS0 ⊂ ˆST0. Furthermore we know that  Ttis increasing in t. Therefore  0 ≤ t∗ =⇒ T0 ≤Tt∗ =⇒ ˆST0 ⊆ ˆSTt∗ . Now if |R0(S0)| ≥ 1we know that:

image

This justifies the following chain of inclusions:

image

This means that for any  0 ≤ k ≤ |R0(S0)|it holds that  | ˆSkTt∗ | > k. In particular, for  k∗ = |R0(S0)|we have  | ˆSk∗Tt∗ | > |R0(S0)|. This contradicts Lemma 22 (which holds true with probability at least 1 − δ).

Lemma 24. Let  t∗ be the smallest integer such that t∗βt∗γt∗ ≥ C1|R0(S0)|ϵ2, then, there is  t0 ≤ t∗ such

that ˆSt0+Tt0 = ˆSt0with probability at least  1 − δ.

Proof. The proof consists in applying the definition of  Ttto the condition of Lemma 23.

Theorem 3. Let  t∗ be the smallest integer such that t∗βt∗γt∗ ≥ C1|R0(S0)|ϵ2 , with C1 = 8/log(1+σ−2),then, there is  t0 ≤ t∗ such that Rϵ(S0) ⊆ ˆSt0 ⊆ R0(S0)with probability at least  1 − δ.

Proof. Due to Lemma 24, we know that  ∃t0 ≤ t∗such that ˆSt0 = ˆSt0+Tt0with probability at least 1 − δ. This implies that  Rϵ(S0) \ ( ˆSt) = ∅with probability at least  1 − δbecause of Lemma 21. Therefore  Rϵ(S0) ⊆ ˆSt. Furthermore we know that ˆSt ⊆ R0(S0)with probability at least  1 − δbecause of Lemma 22 and this completes the proof.

image

Theorem 1. Assume that  r(·)is L-Lipschitz continuous and that the assumptions of Lemma 1 hold. Also, assume that  S0 ̸= ∅, r(s) ≥ hfor all  s ∈ S0, and that for any two states,  s, s′ ∈ S0, s′ ∈ Rret(S0, {s}). Choose βt as in (8). Then, with probability at least  1 − δ, we have r(s) ≥ h forany s along any state trajectory induced by Algorithm 1 on an MDP with transition function f(s, a). Moreover, let  t∗be the smallest integer such that t∗βt∗γt∗ ≥ C |R0(S0)|ϵ2 ,with  C = 8/ log(1 + σ−2). Then there exists a  t0 ≤ t∗ such that, with probability at least  1 − δ, Rϵ(S0) ⊆ ˆSt0 ⊆ R0(S0).

Proof. This is a direct consequence of Theorem 2 and Theorem 3.


Designed for Accessibility and to further Open Science