b

DiscoverSearch
About
My stuff
Safe Exploration in Markov Decision Processes with Time-Variant Safety using Spatio-Temporal Gaussian Process
2018·arXiv
Abstract
Abstract

In many real-world applications (e.g., planetary exploration, robot navigation), an autonomous agent must be able to explore a space with guaranteed safety. Most safe exploration algorithms in the field of reinforcement learning and robotics have been based on the assumption that the safety features are a priori known and time-invariant. This paper presents a learning algorithm called  ST-SAFEMDPfor exploring Markov decision processes (MDPs) that is based on the assumption that the safety features are a priori unknown and time-variant. In this setting, the agent explores MDPs while constraining the probability of entering unsafe states defined by a safety function being below a threshold. The unknown and time-variant safety values are modeled using a spatio-temporal Gaussian process. However, there remains an issue that an agent may have no viable action in a shrinking true safe space. To address this issue, we formulate a problem maximizing the cumulative number of safe states in the worst case scenario with respect to future observations. The effectiveness of this approach was demonstrated in two simulation settings, including one using real lunar terrain data.

The fundamental challenge of decision making in many real-world applications is that safety is often a priori unknown and time-variant. For example, in the case of Mars rovers, the safe space is a priori unknown and is discovered by the agent or manually defined by human operators; that is, potential hazards must be discovered on-the-fly. Furthermore, especially in the lunar polar regions, factors related to safety such as illumination (associated with power generation and visibility) change drastically over time. For the on-board risk-sensitive and reactive decision making, the agent should not depend on the human operators mainly because of the communication delay. Therefore, the agent, such as a lunar rover, that autonomously explores an uncertain environment must learn where it is safe and when a state is safe. In previous work on safe reinforcement learning (Hans et al. 2008; Garcıa and Fern´andez 2015) and safety-constrained decision-making (Fleming and McEneaney 1995; Schwarm and Nikolaou 1999; Blackmore et al. 2010) safety was assumed to be guaranteed by assigning unsafe states a high cost. In other words, safety was assumed to be a priori known and time-invariant. A useful approach to dealing with a priori unknown functions is to use a Gaussian

process (GP, see Rasmussen and Williams (2006)). A GP is well suited for safety-critical problem settings in which an action known to be safe is taken during exploration. This is because the confidence interval (i.e., variance) as well as the mean value of an a priori unknown safety function can be leveraged. Due to being based on the assumption that the function (i.e., smoothness or continuity) is regular, GP-based safe exploration algorithms have generally provided efficient solutions as well as a rigorous theoretical guarantee of safety. Previous work on safe exploration using GPs has addressed both stateless settings (Sui et al. 2015; Sui, Yue, and Burdick 2017) and stateful settings (Turchetta, Berkenkamp, and Krause 2016; Berkenkamp et al. 2017; Wachi et al. 2018). However, in both settings, safety functions were assumed to be time-invariant. In real applications, an actual safety function often changes over time. To the best of our knowledge, current applications of GP-based safe exploration algorithms are in medical care (e.g., clinical treatment) and robotics (e.g., planetary exploration). For example, in the planetary exploration, since the safety features are associated with the planetary environment, the safety function must be treated as time-variant.

When safety is considered to be time-variant, a simple approach is to use spatio-temporal GPs (Hartikainen, Ri- ihim¨aki, and S¨arkk¨a 2011; S¨arkk¨a and Hartikainen 2012). By incorporating the spatio-temporal information into the kernel function, the safety function values are predicted in both spatial and temporal directions. However, we cannot solve a critical issue if we just use spatio-temporal GPs; that is, in the time-variant environment, the true safe space may shrink, and the agent may not be able to take any viable actions. In the time-invariant case, the true safe space will not change, and the predicted safe space will expand or at least remain unchanged. In our problem setting, states that were previously identified as safe may no longer be safe. A conceptual image is shown in Figure 1. In the rest of the paper, safe space indicates predicted safe space.

Contribution To address the above issue, we define a safe space for a time-variant safe scenario and then formulate a min-max problem in which the cumulative number of safe states in the worst case scenario with respect to future observations is maximized. Under the worst case assumption that the safety function is Lipschitz continuous, safety is guaranteed in subsequent steps while reducing the possibility that an agent gets stuck due to having no viable action it can take.

In this paper, we present a novel safety-constrained exploration algorithm called Spatio-Temporal Safe Markov decision process (ST-SAFEMDP). At a high level, our approach models the safety function using a spatio-temporal GP model. It first predicts future safety function values and then uses the GP model to compute the safe space defined for the time-variant safety scenario. Finally, it chooses the next target sample such that the agent is motivated to explore an uncertain state that is likely to expand the safe space while preventing the current safe space from shrinking.

We theoretically analyze and show that  ST-SAFEMDPguarantees safety with high probability. We also demonstrate empirically that our approach safely and efficiently explores MDPs by using a time-variant safety function. Finally, we describe and present the results of two simulations, one performed using a synthetic environment and one performed using a real environment.

We consider a MDP with time-variant safety characterized by a tuple  M = ⟨S, T , A, f(s, a), g(t, s)⟩, with S the set of states {s}, T the set of times {t}, A the set of actions {a}, f : S × A → Sthe (time-invariant) deterministic transition model, and  g : T × S → Rthe safety function. We assume that the safety function is time-variant and not known a priori. At each time step t, the agent must be in a safe state and observes noise-perturbed value of g; hence, it observes yt = g(t, st) + nt, where  ntis the noise. The safety function value  g(t, st)of state  stmust be above some threshold, h ∈ R; that is,  g(t, st) ≥ h.Note that we consider the safety function to depend on only the time and state. To make this problem tractable, we make an assumption on regularity in the form of a GP to represent the safety function; that is, we assume that similar states possess similar levels of safety at similar times.1 The values of g for unvisited states at a future time are predicted using the GP on the basis of previous observations for visited states. To learn the a priori unknown safety function, the agent should explore the state space. In this paper, we focus on finding a good policy for efficiently expanding the safe space in the environment with time-variant and a priori unknown safety.2 Hence, we formulate the problem of finding the optimal next target state as

image

where  yi−1 = [y1, y2, · · · , yi−1], Nis the terminal time step, and  q(i, si; yi−1)represents the number of safe states that are identified on the basis of the set of observations

image

Figure 1: Conceptual image of the problem. In the time- invariant case, the true safe space (white region) does not change. However, in the time-variant case, the true safe space may shrink. Hence, without considering it, the predicted safe space (blue region) may partially overlap the true unsafe region (black region).

yi−1. In the rest of this paper, we denote the objective function as J such that  J(t, st; yi−1) = �Ni=t q(i, si; yi−1). In- tuitively, in this problem formulation, we seek to find the optimal state to visit such that the number of safe states is maximized under the assumption that the set of observations [yt+1, yt+2, · · · , yN]is most unfavorable.

In our algorithm, we suppose that the agent starts in a safe position. We denote by  S0the set of initial safe states, which are assumed to be a priori known. This assumption is consistent with relevant previous work (e.g., Sui et al. (2015)).

This work is based on several previous studies on safe exploration algorithms using GPs (Sui et al. 2015; Sui, Yue, and Burdick 2017; Turchetta, Berkenkamp, and Krause 2016; Berkenkamp et al. 2017; Wachi et al. 2018). Here we focus on the work by Turchetta, Berkenkamp, and Krause (2016) as it is the primary basis for this paper. For better understandability, we append superscript “α” to the variables used for the explanation. Turchetta, Berkenkamp, and Krause (2016) addressed the problem of safely exploring finite MDPs. They aimed to explore MDPs under the constraint of safety, assuming that the a priori unknown function was time-invariant and satisfied regularity conditions expressed using a GP prior; that is, the safety function was modeled as a GP, and specified using the mean and covariance:  gα(s) = GP(µα(s), kα(s, s′)), which is parameterized using a kernel function  kα(s, s′). Observation noise was modeled as  yαt = gα(st) + nt, nt ∼N(0, ω2t ). The posterior over  gαwas analytically calculated on the basis of T measurements for states,  {s1, · · · , sT }, which is also a Gaussian distribution with mean  µαT (s), vari- ance  (σαT (s))2, and covariance  kαT (s, s′). They assumed that the safety function was Lipschitz continuous with Lipschitz constant  Lswith respect to some metric,  ds(s, s′).

Safe space in time-invariant scenario To guarantee safety, Turchetta, Berkenkamp, and Krause (2016) maintained two sets. The first set,  Sαt, contained states that were highly likely to satisfy the safety constraint using the GP posterior. The second one, ˆSαt, additionally considered the abil- ity to reach points in  Sαt(i.e., reachability constraint) and the ability to safely return to the previous identified safe set, ˆSαt−1(i.e., returnability constraint). The algorithm guaran- tees safety by allowing visits only to states in ˆSαt.

Safety is evaluated on the basis of the confidence interval represented as,  Qαt (s) = [µαt−1(s)±β1/2t σαt−1(s)], where  βtis a positive scalar representing the required level of safety. Consider the intersection of  Qtup to iteration t. It is defined as  Cαt (s) = Qαt (s) ∩ Cαt−1(s), Cα0 (s) = [h, ∞], ∀s ∈ S0. The lower and upper bounds on  Cαt (s)are denoted by

t (s) := min Cαt (s)and  uαt (s) := max Cαt (s), respec- tively. The first set,  Sαtis defined using  Ls.

image

if a state is in  Sαt, it may be surrounded by the unsafe states. Given a set X, the set that is reachable from X in one step is Rαreach(X) = X ∪{s ∈ S | ∃s′ ∈ X, a ∈ A : s = f(s′, a)}.An agent may be unable to move to another state due to the lack of safe actions. The set of states from which the agent can return to ¯Xthrough other set of states X in one step is given by

image

Next, the set containing all the states that can reach ¯Xthrough an arbitrary long path in X is defined as ¯Rαret(X, ¯X). Finally, the set of safe states that satisfy the reachability and returnability constraints, ˆSt, is defined as

image

Sampling criteria Turchetta, Berkenkamp, and Krause (2016) defined expanders,  Ξαt, that might increase the num- ber of states that are identified as safe. Then the state with the highest variance in  Ξαtwas selected as the next target sam- ple; that is, the next sample was  st = arg maxs∈Ξαt wαt (s),where  wαt (s) = uαt (s) − lαt (s).

Characterization of safety is dealt with differently here than by Turchetta, Berkenkamp, and Krause (2016) in two aspects. First, we deal with time-variant safety, meaning that we must capture the similarity between times as well as that between states, which motivated us to model the environment using spatio-temporal GPs. Second, suppose that an agent considers the set of states to which a return action is possible at time t. If the environment is time-invariant, the agent only has to ensure that it can return to the states in ˆSt−1because the states identified as safe at time  t − 1are surely still safe at time t. However, in a time-variant environment, the true safe region may shrink; that, in turn, the predicted safe region may also shrink. This means that the agent must consider the states to which return is possible at t + 1, i.e., the states in ˆSt+1, since it is at time t + 1 that a return action is taken. However, this requirement is problematic: ˆStis defined by ˆSt+1. Hence, we approximate ˆSt+1by using a conservative set.

image

Figure 2: Conceptual image of characterization of safety. Green curve and blue band represent mean and confidence interval, respectively. Magenta line represents safety threshold h. Forefront figure on left is for t = 1, and figure on right is for t = 8. A black rectangle represents the time step with no observation.

Spatio-Temporal Gaussian Processes

As mentioned previously, we make an assumption on regularity in the form of a GP to represent the safety function, g(t, s). The GP-based predictions have uncertainty, which is represented by Gaussian distributions. Roughly speaking, an unvisited state is considered safe if the safety function value is almost certainly above a predefined threshold (i.e., with a greater probability than a predefined level). We assume that S and T are endowed with a positive semidefinite kernel function and that the safety function has bounded norm in the associated Reproducing Kernel Hilbert Space (RKHS). Kernel function k is used to capture the similarity between states and that between times; it formalizes the assumption of safety function regularity. These concepts are shown in Figure 2. Incorporating the spatio-temporal information into the kernel function enables attaching larger weights to more recent observations. The safety function can thus be modeled as

image

We define a new variable,  η = {t, s}, and denote the set of ηas H. A GP is fully specified by its mean,  µ(η), and covariance,  k(η, η′). Without loss of generality, let  µ(η) = 0for all  s ∈ Sand  t ∈ T. We model observation noise as y = g(η) + nt, where  nt ∼ N(0, ω2t ). The posterior over g(η)is computed analytically on the basis of T measurements for times and states  HT = {{1, s1}, · · · , {T, sT }}with measurements,  yT = [g(η1) + n1, · · · , g(ηT ) + nT ]⊤. The posterior mean, variance, and covariance are given as:

image

where  kT (η) = [k(η1, η) · · · , k(ηT , η)]⊤and  KTis the positive definite kernel matrix,  [k(η, η′)]η,η′∈HT. For more details on spatio-temporal GPs, we refer the reader to Soh, Su, and Demiris (2012) or Senanayake, Simon Timothy, and Ramos (2016). We further assume that the safety function g is Lipschitz continuous, with Lipschitz constant  Ls(for states) and  Lt(for time), with respect to some metric ds(·, ·)on S and  dt(·, ·)on T . This assumption is naturally satisfied using common kernel functions (Ghosal and Roy 2006). For simplicity, we define L such that  L(s, s′, t, t′) =Lsds(s, s′) + Ltdt(t, t′). For all t, the time step between t − 1and t is assumed to be consistent; that is,  dt(t′, t) =|t − t′| · dt(t − 1, t).

The kernels used in our model are composite kernels, either product kernels,  (ks ⊗ kt)({s, t}, {s′, t′}) =ks(s, s′)kt(t, t′)or additive combinations, (ks ⊕kt)({s, t}, {s′, t′}) = ks(s, s′) + kt(t, t′), where  ksis the kernel for space, and  ktis the one for time. In our simulation, we used a composite kernel such that

image

where ˆksand ˆktare kernel functions for space and time, respectively. For more details on composite kernels, see Krause and Ong (2011) or Duvenaud et al. (2013).

Safe Space in Time-Variant Scenario We use the extended definition of the safe space as given by Turchetta, Berkenkamp, and Krause (2016). Likewise, to guarantee safety (with high probability), we maintain two sets. The first set is  St, which contains all states that satisfy the safety constraint (2) with high probability. The second one, ˆSt, includes reachability and returnability constraints in addition to the safety constraint (see Figure 3(a)).

First, we formally define  St, the set of states that satisfy the safety constraint. A GP model enables an agent to judge safety by providing it a confidence interval having the form Qt(η) = [µt−1(η) ± β1/2t σt−1(η)], where  βtis a positive scalar specifying the required level of safety. In other words, βinherently specifies the probability of violating the safety constraint. For more details on  β, see Srinivas et al. (2010). We then consider the intersection of  Qtup to iteration t; it is recursively defined as  Ct(η) = Qt(η) ∩ Ct−1(η), C0(η) =[h, ∞], ∀s ∈ S0. We denote the lower and upper bounds on  Ct(η)by  lt(η) := min Ct(η)and  ut(η) := max Ct(η), respectively. The first set,  St, is defined using the Lipschitz constant L (i.e.,  Lsand  Lt):

image

Reachability and returnability We define the set of states that are reachable and those to which return is possible. Even if a state is in  St, it may not be reachable without visiting unsafe states first. For the reachable set, we use the same defi-nition as Turchetta, Berkenkamp, and Krause (2016). Given X, the set of states reachable from X in one step is

image

Next, an agent may be “trapped” in a state due to the lack of safe actions. As for the returnable set, we use different definition from  Rαret(X). Given a set X, the set that is return- able to X with one step is given as

image

Finally, the set of safe states is defined. Safe states must 1) satisfy the safety constraint, 2) be reachable from ˆSt−1,

image

Figure 3: (a) Safe space is characterized by three constraints. (b) Conceptual image of the relations between each set. It is problematic to define ˆStusing ˆSt+1; hence, we use  Gt+1t−1instead to define ˆSt.

and 3) be returnable to ˆSt+1; that is, ˆStis denoted as

image

Observe that the returnable set is written as  Rret( ˆSt+1). In a time-variant environment, an agent must be able to visit ˆSt+1since  ˆSt−1and  ˆStmay no longer be safe. However, defining ˆStusing ˆSt+1is obviously problematic.

Approximation of ˆSt+1We cannot directly solve (6), but nonetheless we have to consider returnability in order to avoid the situation in which an agent is unable to take any action at time t. Suppose we have a set of observations, yt−1. For any pair of time steps,  τand  τ ′(< τ), we define a conservative set of safe states,

Gττ ′(yt−1) = {s∈ S | ∃s′ ∈ ˆSτ ′ :

image

Intuitively, G is an extended definition of S in that G takes into account the satisfaction of the safety constraint several steps into the future. Hence, in (6), we make an approximation and replace ˆSt+1by  Gt+1t−1; that is,

image

Under the assumption of Lipschitz continuity,  Gt+1t−1can be identified as safe on the basis of observations until time  t−1, and taking into account  Rret(Gt+1t−1)enables us to avoid the situation in which an agent cannot move to another state. A conceptual image is shown in Figure 3(b).

We start by giving a high level overview of our ST- SAFEMDPalgorithm (see Algorithm 1). At each iteration, the agent updates the safe space (Line 5) by using the GP model. Given the safe set, ˆSt, the agent calculates expanders that might enhance the number of states identified as safe (Line 6). Then, a state with the maximum weighted summation of the mean and confidence interval of the safety function is chosen as the next target sample (Lines 8-12). Finally, the agent observes the safety function value of the current state and updates the GP model (Lines 13 and 14).

image

Our algorithm maximizes the lower bound of the cumulative number of safe states in the next multiple time steps, as shown by the lemmas and theorem below. Hereinafter, we explicitly denote the underlying observations as independent variables. Recall that our objective here is to find a good policy for efficiently expanding the safe space in the environment with time-variant and a priori unknown safety, as expressed by (1). Given that  q(i, si; yi−1)corresponds to | ˆSi(yi−1)|, the following lemmas hold.

Lemma 1. The following inequality holds:

image

Lemma 2. For all i = [t, N], the following inequality holds:

image

These lemmas come from our desire to express the objective function using only  yt−1and  yt. Finally, we obtain the following theorem on the lower bound of the objective function using Lemmas 1 and 2.

Theorem 1. The following equality holds:

image

where  Miis defined as  Mi := | ˆSt ∩ Gi+1t |.

Note that J has a lower bound characterized by only  yt−1and  yt. Proofs of Lemma 1, 2 and Theorem 1 are given in the supplemental material. Now, the problem to solve is, what is the optimal state to obtain the observation,  yt?.

Let us carefully observe  Mi(yt). Once we transform �Ni=t+1 Miinto  Mt+1 ·�Ni=t+1 Mi/Mt+1, we get a clearer view for optimization. First, of the two components of Mt+1, ˆStis deterministic while  Gt+2tis not unknown at this moment (because an agent has not yet obtained the observation at time  t, yt). As for  Mi, ∀i = [t + 2, N]; since they depend on  Mi−1, it is difficult to directly maximize  Mi. However,  Mi/Mt+1represents the ratio of the number of states within ˆStthat satisfy  lt − L(s, s′, t, i) ≥ hwith respect to one that satisfies  lt − L(s, s′, t, t + 2) ≥ h. Therefore, to maximize J, we choose as the next target sample one that we can expect its observation to 1) expand  Mt+1(yt)and 2) enhance  Mi(yt)/Mt+1(yt)(i.e., choose a state whose safety function value is expected to be high).

As a means to achieve the first objective, we utilize expanders (explained shortly) to enhance  Mt+1(yt)when an agent observes the safety function value. For the second one, we choose as the next target sample the one with the maximum value of the weighted summation of the mean and the width of the confidence interval.

Expander As mentioned above, to enhance  Mt+1(yt), we utilize expanders as in Sui et al. (2015) or Turchetta, Berkenkamp, and Krause (2016). We use the uncertainty estimate in the GP to define an optimistic set of expanders,

image

Sampling criteria We assume that an agent observes the safety function value at every visited state.3 This assumption is particularly significant in dealing with an environment in which the safety function value may change while the agent moves to the next target sample. Hence, we now consider the motion of the agent for one time step. We first define the set of candidates to be the next visited state as  Ψt = {s ∈S | ∃a ∈ A : s = f(st−1, a)}.Our goal is to maximize the number of safe states while guaranteeing safety. We do this by using GP posterior uncertainty regarding the states in  Ξt ∩ Ψt. At each iteration t, we select as the next target sample the state in  Ξt ∩ Ψtwith the maximum weighted summation of the mean and the width of the confidence interval:

image

where  wt(η) = ut(η) − lt(η), and p is a weighted coef-ficient. If  Ξt ∩ Ψtis empty, we sample a new state within ˆSt ∩ Ψt. This sampling criteria is justified because we can gain the most information from the state for which we are the most uncertain, and states in  Ξt∩Ψtcan enlarge the safe set. Also, all the states in  Ξt ∩ Ψtare guaranteed to be safe. In addition, by additionally considering the mean value,  µ, the agent is encouraged to visit a state that is likely to be still safe in the next multiple time steps.

image

Table 1: Simulation results for simple grid world (100 times Monte-Carlo simulation).  ST-SAFEMDPachieved higher accuracy, precision, and recall without any failure.

We use the correctness of the confidence intervals,  Ct(η), as the metric for the safety guarantee. That is, the true safety function value,  g(η), must be within  Ct(η). We adjust the conservativeness by tuning scaling factor  βt. Though this paper addresses the planning problem, like Srinivas et al. (2010), who studied appropriate selection of  βtfor the multiarmed bandit problem (i.e., the sampling problem), we use

image

where B is the bound on the RKHS norm of the safety function,  γtis the maximum mutual information that can be gained about  g(·)from t noisy observations (i.e.,  γt =max|A|≤t I(g; yA)), and  δis the probability of visiting unsafe states. For more details on the bounds on  γt, we refer the reader to Srinivas et al. (2010). As for the spatio-temporal kernel function, the existence of an upper bound on the information gain is ensured by the following lemma.

Lemma 3. Let  ksand  ktbe a finite number of kernel functions on S and T , respectively. We denote an arbitrary composite kernel resulting from  ksand  ktusing product kernel operators (i.e.,  ⊗) or additive combination (i.e.,  ⊕) by  kF. Then  kFhas a bound of information gain.

Since the GP-based prediction is stochastic, being in  Stmeans the constraint on safety is fulfilled with high probability. Given that  βis chosen as in (8), we now present a theorem on the safety guarantee.

Theorem 2. Assume that g is L-Lipschitz continuous and satisfies  ∥g∥2k ≤ Band that noise  ntis  σ-sub-Gaussian. Also assume that  S0 ̸= ∅and that  g(s) ≥ hfor all  s ∈ S0. If  βtis chosen as in (8) and the next state is sampled within Ξt ∩ Ψtor ˆSt ∩ Ψt(i.e., the subsets of  St), the probability of entering a safe state is guaranteed to be at least  1 − δ.

Proofs of Lemma 3 and Theorem 2 are given in the supplemental material.

We evaluated our approach using two problem settings. One was a randomly generated environment, and the other was an environment created from real lunar terrain data. We used a simulated lunar environment because the lunar polar regions have been attracting attention as potential sites for a moon base although the illumination conditions can change drastically over time in those regions. For surface explorers relying on solar power, it is critical to traverse places that are illuminated. That is why we simulated lunar exploration rather than Mars surface exploration as was done in related work (Moldovan and Abbeel 2012; Wachi et al. 2018).

Randomly Generated Environment

For the first problem setting, we defined a 20  ×20 square grid and randomly generated a safety function value for each state. For each state (except the boundary ones), the agent could take one of five actions: stay, go up, go down, go left, and go right. We randomly generated an environment with time-variant safety as follows. First, we defined a radial basis function (RBF) kernel with two input dimensions that had a variance of 1 and length-scales of 2. Then, we generated a multivariate normal function such that the mean was zero and the covariance matrix was one of the above RBF. A two-dimensional safety function was then created and stacked with respect to time such that g(t + 1, s) = g(t, s)+Lt·ϕ·g(1, s), ∀s ∈ S, where  ϕis a sample from the uniform distribution,  [−1, 1]. By stacking the safety function with respect to time, we obtained a randomly generated environment with a time-variant safety function.

The agent predicts the safety function by using the GP and a kernel defined as (5). RBFs were used for the four kernels, ks, kt, ˆks, and ˆktin our model. Their length-scales were 2.0, 1.5, 4.0, and 10, and the prior variances for each were 1.0, 1.0, 0.5, and 0.5, respectively. The simulation settings were βt = 2 ∀t ≥ 0, p = 3, Ls = 0.1, and  Lt = 0.1. Baselines We compare the results for our approach with those of four baselines. The first baseline (Random) uses a random exploration strategy in which an agent randomly chooses the next action. The second baseline (Unsafe) uses pure maximization of information gain in which an agent greedily gathers information without any constraint on safety. In the third baseline (Ignore time-variance), an agent assumes that the environment is time-invariant; that is, the agent updates the GP model using only  ks. Last, we consider a baseline (No cross-covariance) in which the agent predicts the safety function value without using the cross-covariance term; that is, the agent uses only  ksand  kt. We performed Monte-Carlo simulation using 100 samples of randomly generated safety values. Metrics We used five metrics for the comparison: root mean square error (RMSE) between the true and predicted safety function values (normalized w.r.t.  ST-SAFEMDP), number of failures, accuracy, precision, and recall. Accuracy, precision, and recall are defined as TP + TNTP + TN + FP + FN,

image

Table 2: Results for simulated lunar surface exploration.  ST-SAFEMDPoutperformed four baselines in terms of safety, accuracy, precision, and recall.

image

Figure 4: Three out of five images of the same region (1.0 km ×1.0 km) centered on de Gerlache Rim (88.664◦S, 68.398◦W). Spatio-temporal map with 200 time steps was created by linearly interpolating the five images.

TPTP + FP, and TPTP + FN, respectively, in which “positive” means safe. Note that “FP” may result in a catastrophic failure, so keeping precision high is quite important. Precision is a measure of the safety guarantee, and recall is a measure of the efficiency of the expansion.

As shown in Table 1, our approach outperformed the baselines in terms of accuracy, precision, recall, and number of failures. RMSE was better for unsafe exploration than for ST-SAFEMDPbecause the agent visited any state with the highest variance without any consideration of safety. Note that, in order to make the total number of actions identical in all cases, the agent was allowed to explore even after failure.

Simulated Moon Surface Exploration For the second problem setting, we used a lunar region about one kilometer square centered on 88.664◦S and 68.398◦W. We used real lunar data (https://goo.gl/6eXmQP) managed by NASA/GSFC/ASU. The resolution was 10 m, and the square grid used was 100  ×100. The simulated environment was created by first obtaining five images of the target region (Three out of the five images are shown in Figure 4) and then linearly interpolating them so that the duration between images was  ∼1.8 hours. The overall mission period was assumed to be  ∼15 days, so the number of images was 200. We assumed that a rover was powered by solar arrays. Safety function g was defined as the luminance value. Each pixel in each image had luminance value information, and the values were normalized such that the maximum value was 1.5 and the minimum value was  −0.5. We set the safety threshold h to  −0.25in consideration of the required electrical power.

The luminance value of each state was predicted using a GP and we used a spatio-temporal kernel defined as (5). We used RBF for all kernel functions. The length-scales for ks, kt, ˆks, and ˆktwere 20 m, 9 hours, 30 m, and 36 hours, respectively. The prior variances for each kernel were 8 m2,

image

Figure 5: Comparison of different exploration schemes. Blue regions represent the (predicted) safe spaces for (a) ST- SAFEMDP, (b) Ignore time-variance case, and the explored space for (c) Unsafe case.

4.5 hours2, 4 m2, and 2.4 hours2, respectively. We assumed a noise standard deviation of 0.001. The parameter settings were  βt = 2 ∀t ≥ 0, p = 3 Ls = 0.05, and  Lt = 0.01.

The five metrics in the first simulation were again used. As shown in Table 2, our algorithm outperformed the four baselines in terms of accuracy, precision, and recall. Our problem formulation aims to enlarge recall while keeping precision quite high. Higher precision and recall mean that ST- SAFEMDPefficiently expanded the safe region by solving the min-max problem while avoiding FP. As shown in Figure 5, exploration was the safest and most efficient with ST- SAFEMDP.

We introduced the novel problem of safely exploring a time-variant and a priori unknown environment. We formulated it as a min-max problem in which the cumulative number of safe states is maximized and then applied our proposed algorithm,  ST-SAFEMDP, which efficiently expands the safe region. An important aspect of this algorithm is that it explicitly considers the lower bound of the cumulative number of safe states and then chooses the state with the maximum value of the weighted summation of the mean and the width of the confidence interval of the safety function. This algorithm incentivizes exploration while guaranteeing safety by exploiting the GP structure of the safety function. Finally, we demonstrated the effectiveness of our proposed approach by theoretical analysis and numerical simulation.

Moving forward, it would be interesting to extend to the reinforcement learning setting in which the cumulative reward is maximized, as well as apply our proposed approach to other problem domains including medical care and robotic manipulation tasks. We believe that our results contribute to bridging the gap between machine learning algorithms and safety-critical applications.

[Berkenkamp et al. 2017] Berkenkamp, F.; Turchetta, M.; Schoellig, A.; and Krause, A. 2017. Safe model-based reinforcement learning with stability guarantees. In Neural Information Processing Systems (NIPS).

[Blackmore et al. 2010] Blackmore, L.; Ono, M.; Bektassov, A.; and Williams, B. C. 2010. A probabilistic particlecontrol approximation of chance-constrained stochastic predictive control. IEEE transactions on Robotics 26:502–517.

[Duvenaud et al. 2013] Duvenaud, D.; Lloyd, J.; Grosse, R.; Tenenbaum, J.; and Zoubin, G. 2013. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning (ICML).

[Fleming and McEneaney 1995] Fleming, W. H., and McE- neaney, W. M. 1995. Risk-sensitive control on an infinite time horizon. SIAM Journal on Control and Optimization 33(6):1881–1915.

[Garcıa and Fern´andez 2015] Garcıa, J., and Fern´andez, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research (JMLR) 16:1437–1480.

[Ghosal and Roy 2006] Ghosal, S., and Roy, A. 2006. Poste- rior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics 34:2413–2429.

[Hans et al. 2008] Hans, A.; Schneegaß, D.; Sch¨afer, A. M.; and Udluft, S. 2008. Safe exploration for reinforcement learning. In European Symposium on Artificial Neural Networks (ESANN).

[Hartikainen, Riihim¨aki, and S¨arkk¨a 2011] Hartikainen, J.; Riihim¨aki, J.; and S¨arkk¨a, S. 2011. Sparse spatio-temporal gaussian processes with general likelihoods. In International Conference on Artificial Neural Networks (ICANN).

[Krause and Ong 2011] Krause, A., and Ong, C. S. 2011. Contextual gaussian process bandit optimization. In Neural Information Processing Systems (NIPS).

[Moldovan and Abbeel 2012] Moldovan, T., and Abbeel, P. 2012. Safe exploration in Markov decision processes. In International Conference on Machine Learning (ICML).

[Rasmussen and Williams 2006] Rasmussen, C. E., and Williams, C. K. 2006. Gaussian processes for machine learning. MIT Press.

[S¨arkk¨a and Hartikainen 2012] S¨arkk¨a, S., and Hartikainen, J. 2012. Infinite-dimensional kalman filtering approach to spatio-temporal gaussian process regression. In International Conference on Artificial Intelligence and Statistics (AISTATS).

[Schwarm and Nikolaou 1999] Schwarm, A. T., and Niko- laou, M. 1999. Chance-constrained model predictive control. AIChE Journal 45(8):1743–1752.

[Senanayake, Simon Timothy, and Ramos 2016] Senanayake, R.; Simon Timothy, O.; and Ramos, F. 2016. Predicting spatio-temporal propagation of seasonal influenza using variational gaussian process regression. In AAAI Conference on Artificial Intelligence (AAAI).

[Soh, Su, and Demiris 2012] Soh, H.; Su, Y.; and Demiris, Y. 2012. Online spatio-temporal gaussian process experts with application to tactile classification. In International Conference on Intelligent Robots and Systems (IROS).

[Srinivas et al. 2010] Srinivas, N.; Krause, A.; Kakade, S. M.; and Seeger, M. 2010. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML).

[Sui et al. 2015] Sui, Y.; Gotovos, A.; Burdick, J. W.; and Krause, A. 2015. Safe exploration for optimization with gaussian processes. In International Conference on Machine Learning (ICML).

[Sui, Yue, and Burdick 2017] Sui, Y.; Yue, Y.; and Burdick, J. W. 2017. Correlational dueling bandits with application to clinical treatment in large decision spaces. In International Joint Conference on Artificial Intelligence (IJCAI).

[Turchetta, Berkenkamp, and Krause 2016] Turchetta, M.; Berkenkamp, F.; and Krause, A. 2016. Safe exploration in finite Markov decision processes with gaussian processes. In Neural Information Processing Systems (NIPS).

[Wachi et al. 2018] Wachi, A.; Sui, Y.; Yue, Y.; and Ono, M. 2018. Safe exploration and optimization of constrained mdps using gaussian processes. In AAAI Conference on Ar-tificial Intelligence (AAAI).

A. Preliminary Lemmas

Lemma 4. For any  X ⊆ S, the following hold:

Proof. This lemma follows directly from the definitions of  Rreachand  Rret.

Lemma 5. For any  j ≥ i ≥ 1, the following hold.

Proof. This lemma follows directly from the definitions of S, G and ˆS; Si, Gi+1i−1, and  ˆSidepend only on  yi−1.

Lemma 6. For any  i ≥ j ≥ 1, the following hold:

Proof. This lemma follows directly from the definition of S and G.

Lemma 7. For any  t ∈ Tand  η ∈ H, the following hold:

image

Proof. This lemma follows directly from the definitions of l and u.

B. Safe Space

Lemma 8. For any  t, |St+1(yt)| ≥ |St+1(yt−1)|.

Proof. By definition of S,

By Lemma 5 (iii), it holds that ˆSt(yt) = ˆSt(yt−1). In addition, by Lemma 7 (i),  lt+1({t, s′}) ≥ lt({t, s′})is satisfied for the lower bound. Hence, the following holds: St+1(yt) ⊇ St+1(yt−1).

image

Lemma 9. For any  t, |Gt+2t (yt)| ≥ |Gt+2t (yt−1)|.

Proof. By definition of G,

By Lemma 5 (iii), it holds that ˆSt(yt) = ˆSt(yt−1). In addition, by Lemma 7 (i),  lt+1({t, s′}) ≥ lt({t, s′})is satisfied for the lower bound. Hence, the following holds: Gt+2t (yt) ⊇ Gt+2t (yt−1).

image

Proof. [of Lemma 1] Consider i = t + 2, then the following holds:

image

= |St+2(yt+1) ∩ Rreach( ˆSt+1(yt)) ∩ Rret(Gt+3t+1(yt+1))|(By Lemma 5 (iii))

≥ |St+2(yt) ∩ Rreach( ˆSt+1(yt)) ∩ Rret(Gt+3t+1(yt))|(By Lemmas 8 and 9)

image

The above inequality is recursively satisfied for all i = [t + 2, N]; that is, q(i, si; yi−1) ≥ q(i, si; yi−2) ≥ · · · ≥ q(i, si; yt).Therefore, from the above inequality,

image

Hence, Lemma 1 holds.

Lemma 10. For all i = [t, N] and j < i, the following inequality holds:

image

Proof. From definition of ˆSi(yj), ˆSi(yj) = Si(yj) ∩ Rreach( ˆSi−1(yj)) ∩ Rret(Gi+1i−1(yj)).From Lemmas 4 and 6 (i), ˆSi(yj) ⊇ Si(yj) ∩ ˆSi−1(yj) ∩ Gi+1i−1(yj)(By Lemma 4)

= ˆSi−1(yj) ∩ Gi+1i−1(yj).(By Lemma 6 (i)) Therefore, Lemma 10 holds.

Proof. [of Lemma 2] By Lemma 10, ˆSi(yt) ⊇ ˆSi−1(yt) ∩ Gi+1i−1(yt).By using the above inequality recursively, we have the following ˆSi(yt) ⊇ ˆSi−1(yt) ∩ Gi+1i−1(yt)

image

= ˆSi−2(yt) ∩ Gi+1i−2(yt)(By Lemma 6 (ii))

image

⊇ ˆSt(yt) ∩ Gi+1t (yt).Therefore, Lemma 2 holds.

Proof. [of Theorem 1] By Lemma 2,

| ˆSi(yt)| ≥ | ˆSt(yt) ∩ Gi+1t (yt)|.Hence, using Lemma 1, we have the following

image

where  Miis defined as  Mi := | ˆSt ∩ Gi+1t |. Therefore, Theorem 1 holds.

C. Bounding of Information Gain of the Spatio-Temporal Kernel Lemma 11. Let  ktbe a kernel function on T with rank at most d. Then,

image

Proof. This lemma is a direct result from Theorem 2 of Krause and Ong (2011).

Lemma 12. Let  ksand  ktbe kernel functions on S and T , respectively. Then, for the additive combination  k = ks ⊕ktdefined on X, it holds that γ(T; ks ⊕ kt; X) ≤ γ(T; ks; S) + γ(T; kt; T ) + 2 log T.

Proof. This lemma is a direct result from Theorem 3 of Krause and Ong (2011).

Proof. [of Lemma 3] For any pair of two kernels,  k1and  k2, the information gain of the composite kernel,  k12defined on  X12is bounded by γ(T; k12; X12) = max{γ(T; k1 ⊕ k2; X⊕12), γ(T; k1 ⊗ k2; X⊗12)},

where  X⊕12and  X⊗12represent the spaces for which  k1 ⊕ k2and  k1 ⊗ k2are defined, respectively. By the above relationship, kF (k1s, k2s, · · · , k1t , k2t , · · · )is guaranteed to have a bound on the information gain.

For example, a composite kernel is defined as  kF = (ks ⊕ kt) ⊕ (ˆks ⊗ ˆkt), where  ksand ˆksare kernels on S, and  ktand ˆktare kernels on T . Let d be the rank (see Lemma 11). The information gain is:

image

D. Probabilistic Safety Guarantee Lemma 13. If the next state is chosen in  St, the probability of entering a safe state is guaranteed to be at least

image

Proof. The probability of entering safe state is denoted as

Under the assumption of Lipschitz continuity with Lipschitz constants  Lsand  Lt, we transform the above formula:

image

The last equality exploits the conditional independence. Note that this probability can be tuned by adjusting  β.

Lemma 14. Suppose that  ∥g∥2k ≤ Band that noise  ntis  σ-sub-Gaussian. If  βtis chosen as in (4), then, for all  t ≥ 1and all s ∈ S, it holds with probability at least  1 − δthat  g(t, s) ∈ Ct(η).

Proof. This lemma follows from Theorem 6 of Srinivas et al. (2010).

Proof. [of Theorem 2] By the definition of  lt; i.e.,  lt(η) = min Ct(η),

image

Because of Lemma 13 and Lemma 14,

Hence, the probability that the agent takes an unsafe action is guaranteed to be at least  1 − δ.

E. Additional Information on the Simulation using Real Lunar Data Recall that we first prepared the total five images (see Figure 6) to create the spatio-temporal safety function. In this section, we provide the additional information on the data used for the simulated moon surface exploration especially in terms of how to prepare the five images.

We used the lunar images that were actually taken by Lunar Reconnaissance Orbiter (LRO); hence, the available number of images for the same region on the Moon is quite limited. That is why we created the spatio-temporal map by the linearly interpolating the images.

image

Figure 6: Five images of the same region (1.0 km  ×1.0 km) centered on de Gerlache Rim (88.664◦S, 68.398◦W). Spatiotemporal map with 200 time steps was created by linearly interpolating the images.

We first downloaded the six images with the following IDs from the Lunar Reconnaissance Orbiter Camera (LROC) website,4 and then clipped the same region (1.0 km  ×1.0 km) centered on de Gerlache rim (88.664◦S, 68.398◦W) from them.

Image 1: M138264280RE

Image 2: M140625419LE + M140625419RE

Image 3: M143000050LE

Image 4: M180899951LE

Image 5: M1100039834RE

In M140625419LE and M140625419RE, since all the targeted region is not in the angle of view as shown in Figure 7, we integrated with each other. Note that the time stamps of the two images are identical (i.e., M140625419LE and M140625419RE were taken by the left and right cameras of LRO almost at the same time).

image

Figure 7: In the above two images, all the target region is not in the angle of view. Hence, we merged them into one image (i.e., Image 2).

Finally, the simulated environment was created by first obtaining five images of the target region and then linearly interpolating them so that the duration between images was  ∼1.8 hours. The overall mission period was assumed to be  ∼15 days, so the number of images was 200.


Designed for Accessibility and to further Open Science