Safe Exploration in Markov Decision Processes with Time-Variant Safety using Spatio-Temporal Gaussian Process

2018·Arxiv

Abstract

Abstract

In many real-world applications (e.g., planetary exploration, robot navigation), an autonomous agent must be able to explore a space with guaranteed safety. Most safe exploration algorithms in the field of reinforcement learning and robotics have been based on the assumption that the safety features are a priori known and time-invariant. This paper presents a learning algorithm called for exploring Markov decision processes (MDPs) that is based on the assumption that the safety features are a priori unknown and time-variant. In this setting, the agent explores MDPs while constraining the probability of entering unsafe states defined by a safety function being below a threshold. The unknown and time-variant safety values are modeled using a spatio-temporal Gaussian process. However, there remains an issue that an agent may have no viable action in a shrinking true safe space. To address this issue, we formulate a problem maximizing the cumulative number of safe states in the worst case scenario with respect to future observations. The effectiveness of this approach was demonstrated in two simulation settings, including one using real lunar terrain data.

Introduction

The fundamental challenge of decision making in many real-world applications is that safety is often a priori unknown and time-variant. For example, in the case of Mars rovers, the safe space is a priori unknown and is discovered by the agent or manually defined by human operators; that is, potential hazards must be discovered on-the-fly. Furthermore, especially in the lunar polar regions, factors related to safety such as illumination (associated with power generation and visibility) change drastically over time. For the on-board risk-sensitive and reactive decision making, the agent should not depend on the human operators mainly because of the communication delay. Therefore, the agent, such as a lunar rover, that autonomously explores an uncertain environment must learn where it is safe and when a state is safe. In previous work on safe reinforcement learning (Hans et al. 2008; Garcıa and Fern´andez 2015) and safety-constrained decision-making (Fleming and McEneaney 1995; Schwarm and Nikolaou 1999; Blackmore et al. 2010) safety was assumed to be guaranteed by assigning unsafe states a high cost. In other words, safety was assumed to be a priori known and time-invariant. A useful approach to dealing with a priori unknown functions is to use a Gaussian

process (GP, see Rasmussen and Williams (2006)). A GP is well suited for safety-critical problem settings in which an action known to be safe is taken during exploration. This is because the confidence interval (i.e., variance) as well as the mean value of an a priori unknown safety function can be leveraged. Due to being based on the assumption that the function (i.e., smoothness or continuity) is regular, GP-based safe exploration algorithms have generally provided efficient solutions as well as a rigorous theoretical guarantee of safety. Previous work on safe exploration using GPs has addressed both stateless settings (Sui et al. 2015; Sui, Yue, and Burdick 2017) and stateful settings (Turchetta, Berkenkamp, and Krause 2016; Berkenkamp et al. 2017; Wachi et al. 2018). However, in both settings, safety functions were assumed to be time-invariant. In real applications, an actual safety function often changes over time. To the best of our knowledge, current applications of GP-based safe exploration algorithms are in medical care (e.g., clinical treatment) and robotics (e.g., planetary exploration). For example, in the planetary exploration, since the safety features are associated with the planetary environment, the safety function must be treated as time-variant.

When safety is considered to be time-variant, a simple approach is to use spatio-temporal GPs (Hartikainen, Ri- ihim¨aki, and S¨arkk¨a 2011; S¨arkk¨a and Hartikainen 2012). By incorporating the spatio-temporal information into the kernel function, the safety function values are predicted in both spatial and temporal directions. However, we cannot solve a critical issue if we just use spatio-temporal GPs; that is, in the time-variant environment, the true safe space may shrink, and the agent may not be able to take any viable actions. In the time-invariant case, the true safe space will not change, and the predicted safe space will expand or at least remain unchanged. In our problem setting, states that were previously identified as safe may no longer be safe. A conceptual image is shown in Figure 1. In the rest of the paper, safe space indicates predicted safe space.

Contribution To address the above issue, we define a safe space for a time-variant safe scenario and then formulate a min-max problem in which the cumulative number of safe states in the worst case scenario with respect to future observations is maximized. Under the worst case assumption that the safety function is Lipschitz continuous, safety is guaranteed in subsequent steps while reducing the possibility that an agent gets stuck due to having no viable action it can take.

In this paper, we present a novel safety-constrained exploration algorithm called Spatio-Temporal Safe Markov decision process (). At a high level, our approach models the safety function using a spatio-temporal GP model. It first predicts future safety function values and then uses the GP model to compute the safe space defined for the time-variant safety scenario. Finally, it chooses the next target sample such that the agent is motivated to explore an uncertain state that is likely to expand the safe space while preventing the current safe space from shrinking.

We theoretically analyze and show that guarantees safety with high probability. We also demonstrate empirically that our approach safely and efficiently explores MDPs by using a time-variant safety function. Finally, we describe and present the results of two simulations, one performed using a synthetic environment and one performed using a real environment.

Problem Statement

We consider a MDP with time-variant safety characterized by a tuple , with S the set of states {s}, T the set of times {t}, A the set of actions {a}, the (time-invariant) deterministic transition model, and the safety function. We assume that the safety function is time-variant and not known a priori. At each time step t, the agent must be in a safe state and observes noise-perturbed value of g; hence, it observes , where is the noise. The safety function value of state must be above some threshold, ; that is, Note that we consider the safety function to depend on only the time and state. To make this problem tractable, we make an assumption on regularity in the form of a GP to represent the safety function; that is, we assume that similar states possess similar levels of safety at similar times.1 The values of g for unvisited states at a future time are predicted using the GP on the basis of previous observations for visited states. To learn the a priori unknown safety function, the agent should explore the state space. In this paper, we focus on finding a good policy for efficiently expanding the safe space in the environment with time-variant and a priori unknown safety.2 Hence, we formulate the problem of finding the optimal next target state as

where is the terminal time step, and represents the number of safe states that are identified on the basis of the set of observations

Figure 1: Conceptual image of the problem. In the time- invariant case, the true safe space (white region) does not change. However, in the time-variant case, the true safe space may shrink. Hence, without considering it, the predicted safe space (blue region) may partially overlap the true unsafe region (black region).

. In the rest of this paper, we denote the objective function as J such that . In- tuitively, in this problem formulation, we seek to find the optimal state to visit such that the number of safe states is maximized under the assumption that the set of observations is most unfavorable.

In our algorithm, we suppose that the agent starts in a safe position. We denote by the set of initial safe states, which are assumed to be a priori known. This assumption is consistent with relevant previous work (e.g., Sui et al. (2015)).

Previous Work

This work is based on several previous studies on safe exploration algorithms using GPs (Sui et al. 2015; Sui, Yue, and Burdick 2017; Turchetta, Berkenkamp, and Krause 2016; Berkenkamp et al. 2017; Wachi et al. 2018). Here we focus on the work by Turchetta, Berkenkamp, and Krause (2016) as it is the primary basis for this paper. For better understandability, we append superscript “” to the variables used for the explanation. Turchetta, Berkenkamp, and Krause (2016) addressed the problem of safely exploring finite MDPs. They aimed to explore MDPs under the constraint of safety, assuming that the a priori unknown function was time-invariant and satisfied regularity conditions expressed using a GP prior; that is, the safety function was modeled as a GP, and specified using the mean and covariance: , which is parameterized using a kernel function . Observation noise was modeled as . The posterior over was analytically calculated on the basis of T measurements for states, , which is also a Gaussian distribution with mean , vari- ance , and covariance . They assumed that the safety function was Lipschitz continuous with Lipschitz constant with respect to some metric, .

Safe space in time-invariant scenario To guarantee safety, Turchetta, Berkenkamp, and Krause (2016) maintained two sets. The first set, , contained states that were highly likely to satisfy the safety constraint using the GP posterior. The second one, , additionally considered the abil- ity to reach points in (i.e., reachability constraint) and the ability to safely return to the previous identified safe set, (i.e., returnability constraint). The algorithm guaran- tees safety by allowing visits only to states in .

Safety is evaluated on the basis of the confidence interval represented as, , where is a positive scalar representing the required level of safety. Consider the intersection of up to iteration t. It is defined as . The lower and upper bounds on are denoted by

and , respec- tively. The first set, is defined using .

if a state is in , it may be surrounded by the unsafe states. Given a set X, the set that is reachable from X in one step is An agent may be unable to move to another state due to the lack of safe actions. The set of states from which the agent can return to through other set of states X in one step is given by

Next, the set containing all the states that can reach through an arbitrary long path in X is defined as . Finally, the set of safe states that satisfy the reachability and returnability constraints, , is defined as

Sampling criteria Turchetta, Berkenkamp, and Krause (2016) defined expanders, , that might increase the num- ber of states that are identified as safe. Then the state with the highest variance in was selected as the next target sam- ple; that is, the next sample was where .

Characterization of Spatio-Temporal Safety

Characterization of safety is dealt with differently here than by Turchetta, Berkenkamp, and Krause (2016) in two aspects. First, we deal with time-variant safety, meaning that we must capture the similarity between times as well as that between states, which motivated us to model the environment using spatio-temporal GPs. Second, suppose that an agent considers the set of states to which a return action is possible at time t. If the environment is time-invariant, the agent only has to ensure that it can return to the states in because the states identified as safe at time are surely still safe at time t. However, in a time-variant environment, the true safe region may shrink; that, in turn, the predicted safe region may also shrink. This means that the agent must consider the states to which return is possible at t + 1, i.e., the states in , since it is at time t + 1 that a return action is taken. However, this requirement is problematic: is defined by . Hence, we approximate by using a conservative set.

Figure 2: Conceptual image of characterization of safety. Green curve and blue band represent mean and confidence interval, respectively. Magenta line represents safety threshold h. Forefront figure on left is for t = 1, and figure on right is for t = 8. A black rectangle represents the time step with no observation.

Spatio-Temporal Gaussian Processes

As mentioned previously, we make an assumption on regularity in the form of a GP to represent the safety function, g(t, s). The GP-based predictions have uncertainty, which is represented by Gaussian distributions. Roughly speaking, an unvisited state is considered safe if the safety function value is almost certainly above a predefined threshold (i.e., with a greater probability than a predefined level). We assume that S and T are endowed with a positive semidefinite kernel function and that the safety function has bounded norm in the associated Reproducing Kernel Hilbert Space (RKHS). Kernel function k is used to capture the similarity between states and that between times; it formalizes the assumption of safety function regularity. These concepts are shown in Figure 2. Incorporating the spatio-temporal information into the kernel function enables attaching larger weights to more recent observations. The safety function can thus be modeled as

We define a new variable, , and denote the set of as H. A GP is fully specified by its mean, , and covariance, . Without loss of generality, let for all and . We model observation noise as , where . The posterior over is computed analytically on the basis of T measurements for times and states with measurements, . The posterior mean, variance, and covariance are given as:

where and is the positive definite kernel matrix, . For more details on spatio-temporal GPs, we refer the reader to Soh, Su, and Demiris (2012) or Senanayake, Simon Timothy, and Ramos (2016). We further assume that the safety function g is Lipschitz continuous, with Lipschitz constant (for states) and (for time), with respect to some metric on S and on T . This assumption is naturally satisfied using common kernel functions (Ghosal and Roy 2006). For simplicity, we define L such that . For all t, the time step between and t is assumed to be consistent; that is, .

The kernels used in our model are composite kernels, either product kernels, or additive combinations, , where is the kernel for space, and is the one for time. In our simulation, we used a composite kernel such that

where and are kernel functions for space and time, respectively. For more details on composite kernels, see Krause and Ong (2011) or Duvenaud et al. (2013).

Safe Space in Time-Variant Scenario We use the extended definition of the safe space as given by Turchetta, Berkenkamp, and Krause (2016). Likewise, to guarantee safety (with high probability), we maintain two sets. The first set is , which contains all states that satisfy the safety constraint (2) with high probability. The second one, , includes reachability and returnability constraints in addition to the safety constraint (see Figure 3(a)).

First, we formally define , the set of states that satisfy the safety constraint. A GP model enables an agent to judge safety by providing it a confidence interval having the form , where is a positive scalar specifying the required level of safety. In other words, inherently specifies the probability of violating the safety constraint. For more details on , see Srinivas et al. (2010). We then consider the intersection of up to iteration t; it is recursively defined as . We denote the lower and upper bounds on by and , respectively. The first set, , is defined using the Lipschitz constant L (i.e., and ):

Reachability and returnability We define the set of states that are reachable and those to which return is possible. Even if a state is in , it may not be reachable without visiting unsafe states first. For the reachable set, we use the same defi-nition as Turchetta, Berkenkamp, and Krause (2016). Given X, the set of states reachable from X in one step is

Next, an agent may be “trapped” in a state due to the lack of safe actions. As for the returnable set, we use different definition from . Given a set X, the set that is return- able to X with one step is given as

Finally, the set of safe states is defined. Safe states must 1) satisfy the safety constraint, 2) be reachable from ,

Figure 3: (a) Safe space is characterized by three constraints. (b) Conceptual image of the relations between each set. It is problematic to define using ; hence, we use instead to define .

and 3) be returnable to ; that is, is denoted as

Observe that the returnable set is written as . In a time-variant environment, an agent must be able to visit since and may no longer be safe. However, defining using is obviously problematic.

Approximation of We cannot directly solve (6), but nonetheless we have to consider returnability in order to avoid the situation in which an agent is unable to take any action at time t. Suppose we have a set of observations, . For any pair of time steps, and , we define a conservative set of safe states,

∈ S | ∃

Intuitively, G is an extended definition of S in that G takes into account the satisfaction of the safety constraint several steps into the future. Hence, in (6), we make an approximation and replace by ; that is,

Under the assumption of Lipschitz continuity, can be identified as safe on the basis of observations until time , and taking into account enables us to avoid the situation in which an agent cannot move to another state. A conceptual image is shown in Figure 3(b).

Approach

We start by giving a high level overview of our ST- algorithm (see Algorithm 1). At each iteration, the agent updates the safe space (Line 5) by using the GP model. Given the safe set, , the agent calculates expanders that might enhance the number of states identified as safe (Line 6). Then, a state with the maximum weighted summation of the mean and confidence interval of the safety function is chosen as the next target sample (Lines 8-12). Finally, the agent observes the safety function value of the current state and updates the GP model (Lines 13 and 14).

Our algorithm maximizes the lower bound of the cumulative number of safe states in the next multiple time steps, as shown by the lemmas and theorem below. Hereinafter, we explicitly denote the underlying observations as independent variables. Recall that our objective here is to find a good policy for efficiently expanding the safe space in the environment with time-variant and a priori unknown safety, as expressed by (1). Given that corresponds to , the following lemmas hold.

Lemma 1. The following inequality holds:

Lemma 2. For all i = [t, N], the following inequality holds:

These lemmas come from our desire to express the objective function using only and . Finally, we obtain the following theorem on the lower bound of the objective function using Lemmas 1 and 2.

Theorem 1. The following equality holds:

where is defined as .

Note that J has a lower bound characterized by only and . Proofs of Lemma 1, 2 and Theorem 1 are given in the supplemental material. Now, the problem to solve is, what is the optimal state to obtain the observation, ?.

Let us carefully observe . Once we transform into , we get a clearer view for optimization. First, of the two components of , is deterministic while is not unknown at this moment (because an agent has not yet obtained the observation at time ). As for ; since they depend on , it is difficult to directly maximize . However, represents the ratio of the number of states within that satisfy with respect to one that satisfies . Therefore, to maximize J, we choose as the next target sample one that we can expect its observation to 1) expand and 2) enhance (i.e., choose a state whose safety function value is expected to be high).

As a means to achieve the first objective, we utilize expanders (explained shortly) to enhance when an agent observes the safety function value. For the second one, we choose as the next target sample the one with the maximum value of the weighted summation of the mean and the width of the confidence interval.

Expander As mentioned above, to enhance , we utilize expanders as in Sui et al. (2015) or Turchetta, Berkenkamp, and Krause (2016). We use the uncertainty estimate in the GP to define an optimistic set of expanders,

Sampling criteria We assume that an agent observes the safety function value at every visited state.3 This assumption is particularly significant in dealing with an environment in which the safety function value may change while the agent moves to the next target sample. Hence, we now consider the motion of the agent for one time step. We first define the set of candidates to be the next visited state as Our goal is to maximize the number of safe states while guaranteeing safety. We do this by using GP posterior uncertainty regarding the states in . At each iteration t, we select as the next target sample the state in with the maximum weighted summation of the mean and the width of the confidence interval:

where , and p is a weighted coef-ficient. If is empty, we sample a new state within . This sampling criteria is justified because we can gain the most information from the state for which we are the most uncertain, and states in can enlarge the safe set. Also, all the states in are guaranteed to be safe. In addition, by additionally considering the mean value, , the agent is encouraged to visit a state that is likely to be still safe in the next multiple time steps.

Table 1: Simulation results for simple grid world (100 times Monte-Carlo simulation). achieved higher accuracy, precision, and recall without any failure.

Theoretical Results

We use the correctness of the confidence intervals, , as the metric for the safety guarantee. That is, the true safety function value, , must be within . We adjust the conservativeness by tuning scaling factor . Though this paper addresses the planning problem, like Srinivas et al. (2010), who studied appropriate selection of for the multiarmed bandit problem (i.e., the sampling problem), we use

where B is the bound on the RKHS norm of the safety function, is the maximum mutual information that can be gained about from t noisy observations (i.e., ), and is the probability of visiting unsafe states. For more details on the bounds on , we refer the reader to Srinivas et al. (2010). As for the spatio-temporal kernel function, the existence of an upper bound on the information gain is ensured by the following lemma.

Lemma 3. Let and be a finite number of kernel functions on S and T , respectively. We denote an arbitrary composite kernel resulting from and using product kernel operators (i.e., ) or additive combination (i.e., ) by . Then has a bound of information gain.

Since the GP-based prediction is stochastic, being in means the constraint on safety is fulfilled with high probability. Given that is chosen as in (8), we now present a theorem on the safety guarantee.

Theorem 2. Assume that g is L-Lipschitz continuous and satisfies and that noise is -sub-Gaussian. Also assume that and that for all . If is chosen as in (8) and the next state is sampled within or (i.e., the subsets of ), the probability of entering a safe state is guaranteed to be at least .

Proofs of Lemma 3 and Theorem 2 are given in the supplemental material.

Experiments

We evaluated our approach using two problem settings. One was a randomly generated environment, and the other was an environment created from real lunar terrain data. We used a simulated lunar environment because the lunar polar regions have been attracting attention as potential sites for a moon base although the illumination conditions can change drastically over time in those regions. For surface explorers relying on solar power, it is critical to traverse places that are illuminated. That is why we simulated lunar exploration rather than Mars surface exploration as was done in related work (Moldovan and Abbeel 2012; Wachi et al. 2018).

Randomly Generated Environment

For the first problem setting, we defined a 20 20 square grid and randomly generated a safety function value for each state. For each state (except the boundary ones), the agent could take one of five actions: stay, go up, go down, go left, and go right. We randomly generated an environment with time-variant safety as follows. First, we defined a radial basis function (RBF) kernel with two input dimensions that had a variance of 1 and length-scales of 2. Then, we generated a multivariate normal function such that the mean was zero and the covariance matrix was one of the above RBF. A two-dimensional safety function was then created and stacked with respect to time such that g(t + 1, s) = , where is a sample from the uniform distribution, . By stacking the safety function with respect to time, we obtained a randomly generated environment with a time-variant safety function.

The agent predicts the safety function by using the GP and a kernel defined as (5). RBFs were used for the four kernels, , , and in our model. Their length-scales were 2.0, 1.5, 4.0, and 10, and the prior variances for each were 1.0, 1.0, 0.5, and 0.5, respectively. The simulation settings were , and . Baselines We compare the results for our approach with those of four baselines. The first baseline (Random) uses a random exploration strategy in which an agent randomly chooses the next action. The second baseline (Unsafe) uses pure maximization of information gain in which an agent greedily gathers information without any constraint on safety. In the third baseline (Ignore time-variance), an agent assumes that the environment is time-invariant; that is, the agent updates the GP model using only . Last, we consider a baseline (No cross-covariance) in which the agent predicts the safety function value without using the cross-covariance term; that is, the agent uses only and . We performed Monte-Carlo simulation using 100 samples of randomly generated safety values. Metrics We used five metrics for the comparison: root mean square error (RMSE) between the true and predicted safety function values (normalized w.r.t. ), number of failures, accuracy, precision, and recall. Accuracy, precision, and recall are defined as TP + TNTP + TN + FP + FN,

Table 2: Results for simulated lunar surface exploration. outperformed four baselines in terms of safety, accuracy, precision, and recall.

Figure 4: Three out of five images of the same region (1.0 km 1.0 km) centered on de Gerlache Rim (88.664S, 68.398W). Spatio-temporal map with 200 time steps was created by linearly interpolating the five images.

TPTP + FP, and TPTP + FN, respectively, in which “positive” means safe. Note that “FP” may result in a catastrophic failure, so keeping precision high is quite important. Precision is a measure of the safety guarantee, and recall is a measure of the efficiency of the expansion.

As shown in Table 1, our approach outperformed the baselines in terms of accuracy, precision, recall, and number of failures. RMSE was better for unsafe exploration than for because the agent visited any state with the highest variance without any consideration of safety. Note that, in order to make the total number of actions identical in all cases, the agent was allowed to explore even after failure.

Simulated Moon Surface Exploration For the second problem setting, we used a lunar region about one kilometer square centered on 88.664S and 68.398W. We used real lunar data (https://goo.gl/6eXmQP) managed by NASA/GSFC/ASU. The resolution was 10 m, and the square grid used was 100 100. The simulated environment was created by first obtaining five images of the target region (Three out of the five images are shown in Figure 4) and then linearly interpolating them so that the duration between images was 1.8 hours. The overall mission period was assumed to be 15 days, so the number of images was 200. We assumed that a rover was powered by solar arrays. Safety function g was defined as the luminance value. Each pixel in each image had luminance value information, and the values were normalized such that the maximum value was 1.5 and the minimum value was . We set the safety threshold h to in consideration of the required electrical power.

The luminance value of each state was predicted using a GP and we used a spatio-temporal kernel defined as (5). We used RBF for all kernel functions. The length-scales for , , and were 20 m, 9 hours, 30 m, and 36 hours, respectively. The prior variances for each kernel were 8 m,

Figure 5: Comparison of different exploration schemes. Blue regions represent the (predicted) safe spaces for (a) ST- , (b) Ignore time-variance case, and the explored space for (c) Unsafe case.

4.5 hours, 4 m, and 2.4 hours, respectively. We assumed a noise standard deviation of 0.001. The parameter settings were , and .

The five metrics in the first simulation were again used. As shown in Table 2, our algorithm outperformed the four baselines in terms of accuracy, precision, and recall. Our problem formulation aims to enlarge recall while keeping precision quite high. Higher precision and recall mean that ST- efficiently expanded the safe region by solving the min-max problem while avoiding FP. As shown in Figure 5, exploration was the safest and most efficient with ST- .

Conclusion

We introduced the novel problem of safely exploring a time-variant and a priori unknown environment. We formulated it as a min-max problem in which the cumulative number of safe states is maximized and then applied our proposed algorithm, , which efficiently expands the safe region. An important aspect of this algorithm is that it explicitly considers the lower bound of the cumulative number of safe states and then chooses the state with the maximum value of the weighted summation of the mean and the width of the confidence interval of the safety function. This algorithm incentivizes exploration while guaranteeing safety by exploiting the GP structure of the safety function. Finally, we demonstrated the effectiveness of our proposed approach by theoretical analysis and numerical simulation.

Moving forward, it would be interesting to extend to the reinforcement learning setting in which the cumulative reward is maximized, as well as apply our proposed approach to other problem domains including medical care and robotic manipulation tasks. We believe that our results contribute to bridging the gap between machine learning algorithms and safety-critical applications.

References

[Berkenkamp et al. 2017] Berkenkamp, F.; Turchetta, M.; Schoellig, A.; and Krause, A. 2017. Safe model-based reinforcement learning with stability guarantees. In Neural Information Processing Systems (NIPS).

[Blackmore et al. 2010] Blackmore, L.; Ono, M.; Bektassov, A.; and Williams, B. C. 2010. A probabilistic particlecontrol approximation of chance-constrained stochastic predictive control. IEEE transactions on Robotics 26:502–517.

[Duvenaud et al. 2013] Duvenaud, D.; Lloyd, J.; Grosse, R.; Tenenbaum, J.; and Zoubin, G. 2013. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning (ICML).

[Fleming and McEneaney 1995] Fleming, W. H., and McE- neaney, W. M. 1995. Risk-sensitive control on an infinite time horizon. SIAM Journal on Control and Optimization 33(6):1881–1915.

[Garcıa and Fern´andez 2015] Garcıa, J., and Fern´andez, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research (JMLR) 16:1437–1480.

[Ghosal and Roy 2006] Ghosal, S., and Roy, A. 2006. Poste- rior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics 34:2413–2429.

[Hans et al. 2008] Hans, A.; Schneegaß, D.; Sch¨afer, A. M.; and Udluft, S. 2008. Safe exploration for reinforcement learning. In European Symposium on Artificial Neural Networks (ESANN).

[Hartikainen, Riihim¨aki, and S¨arkk¨a 2011] Hartikainen, J.; Riihim¨aki, J.; and S¨arkk¨a, S. 2011. Sparse spatio-temporal gaussian processes with general likelihoods. In International Conference on Artificial Neural Networks (ICANN).

[Krause and Ong 2011] Krause, A., and Ong, C. S. 2011. Contextual gaussian process bandit optimization. In Neural Information Processing Systems (NIPS).

[Moldovan and Abbeel 2012] Moldovan, T., and Abbeel, P. 2012. Safe exploration in Markov decision processes. In International Conference on Machine Learning (ICML).

[Rasmussen and Williams 2006] Rasmussen, C. E., and Williams, C. K. 2006. Gaussian processes for machine learning. MIT Press.

[S¨arkk¨a and Hartikainen 2012] S¨arkk¨a, S., and Hartikainen, J. 2012. Infinite-dimensional kalman filtering approach to spatio-temporal gaussian process regression. In International Conference on Artificial Intelligence and Statistics (AISTATS).

[Schwarm and Nikolaou 1999] Schwarm, A. T., and Niko- laou, M. 1999. Chance-constrained model predictive control. AIChE Journal 45(8):1743–1752.

[Senanayake, Simon Timothy, and Ramos 2016] Senanayake, R.; Simon Timothy, O.; and Ramos, F. 2016. Predicting spatio-temporal propagation of seasonal influenza using variational gaussian process regression. In AAAI Conference on Artificial Intelligence (AAAI).

[Soh, Su, and Demiris 2012] Soh, H.; Su, Y.; and Demiris, Y. 2012. Online spatio-temporal gaussian process experts with application to tactile classification. In International Conference on Intelligent Robots and Systems (IROS).

[Srinivas et al. 2010] Srinivas, N.; Krause, A.; Kakade, S. M.; and Seeger, M. 2010. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML).

[Sui et al. 2015] Sui, Y.; Gotovos, A.; Burdick, J. W.; and Krause, A. 2015. Safe exploration for optimization with gaussian processes. In International Conference on Machine Learning (ICML).

[Sui, Yue, and Burdick 2017] Sui, Y.; Yue, Y.; and Burdick, J. W. 2017. Correlational dueling bandits with application to clinical treatment in large decision spaces. In International Joint Conference on Artificial Intelligence (IJCAI).

[Turchetta, Berkenkamp, and Krause 2016] Turchetta, M.; Berkenkamp, F.; and Krause, A. 2016. Safe exploration in finite Markov decision processes with gaussian processes. In Neural Information Processing Systems (NIPS).

[Wachi et al. 2018] Wachi, A.; Sui, Y.; Yue, Y.; and Ono, M. 2018. Safe exploration and optimization of constrained mdps using gaussian processes. In AAAI Conference on Ar-tificial Intelligence (AAAI).

Supplemental Material

A. Preliminary Lemmas

Lemma 4. For any , the following hold:

Proof. This lemma follows directly from the definitions of and .

Lemma 5. For any , the following hold.

Proof. This lemma follows directly from the definitions of S, G and , and depend only on .

Lemma 6. For any , the following hold:

Proof. This lemma follows directly from the definition of S and G.

Lemma 7. For any and , the following hold:

Proof. This lemma follows directly from the definitions of l and u.

B. Safe Space

Lemma 8. For any .

Proof. By definition of S,

By Lemma 5 (iii), it holds that . In addition, by Lemma 7 (i), is satisfied for the lower bound. Hence, the following holds:

Lemma 9. For any .

Proof. By definition of G,

By Lemma 5 (iii), it holds that . In addition, by Lemma 7 (i), is satisfied for the lower bound. Hence, the following holds:

Proof. [of Lemma 1] Consider i = t + 2, then the following holds:

(By Lemma 5 (iii))

The above inequality is recursively satisfied for all i = [t + 2, N]; that is, Therefore, from the above inequality,

Hence, Lemma 1 holds.

Lemma 10. For all i = [t, N] and j < i, the following inequality holds:

Proof. From definition of , From Lemmas 4 and 6 (i), (By Lemma 4)

(By Lemma 6 (i)) Therefore, Lemma 10 holds.

Proof. [of Lemma 2] By Lemma 10, By using the above inequality recursively, we have the following

(By Lemma 6 (ii))

Therefore, Lemma 2 holds.

Proof. [of Theorem 1] By Lemma 2,

Hence, using Lemma 1, we have the following

where is defined as . Therefore, Theorem 1 holds.

C. Bounding of Information Gain of the Spatio-Temporal Kernel Lemma 11. Let be a kernel function on T with rank at most d. Then,

Proof. This lemma is a direct result from Theorem 2 of Krause and Ong (2011).

Lemma 12. Let and be kernel functions on S and T , respectively. Then, for the additive combination defined on X, it holds that

Proof. This lemma is a direct result from Theorem 3 of Krause and Ong (2011).

Proof. [of Lemma 3] For any pair of two kernels, and , the information gain of the composite kernel, defined on is bounded by

where and represent the spaces for which and are defined, respectively. By the above relationship, is guaranteed to have a bound on the information gain.

For example, a composite kernel is defined as , where and are kernels on S, and and are kernels on T . Let d be the rank (see Lemma 11). The information gain is:

D. Probabilistic Safety Guarantee Lemma 13. If the next state is chosen in , the probability of entering a safe state is guaranteed to be at least

Proof. The probability of entering safe state is denoted as

Under the assumption of Lipschitz continuity with Lipschitz constants and , we transform the above formula:

The last equality exploits the conditional independence. Note that this probability can be tuned by adjusting .

Lemma 14. Suppose that and that noise is -sub-Gaussian. If is chosen as in (4), then, for all and all , it holds with probability at least that .

Proof. This lemma follows from Theorem 6 of Srinivas et al. (2010).

Proof. [of Theorem 2] By the definition of ; i.e., ,

Because of Lemma 13 and Lemma 14,

Hence, the probability that the agent takes an unsafe action is guaranteed to be at least .

E. Additional Information on the Simulation using Real Lunar Data Recall that we first prepared the total five images (see Figure 6) to create the spatio-temporal safety function. In this section, we provide the additional information on the data used for the simulated moon surface exploration especially in terms of how to prepare the five images.

We used the lunar images that were actually taken by Lunar Reconnaissance Orbiter (LRO); hence, the available number of images for the same region on the Moon is quite limited. That is why we created the spatio-temporal map by the linearly interpolating the images.

Figure 6: Five images of the same region (1.0 km 1.0 km) centered on de Gerlache Rim (88.664S, 68.398W). Spatiotemporal map with 200 time steps was created by linearly interpolating the images.

We first downloaded the six images with the following IDs from the Lunar Reconnaissance Orbiter Camera (LROC) website,4 and then clipped the same region (1.0 km 1.0 km) centered on de Gerlache rim (88.664S, 68.398W) from them.

• Image 1: M138264280RE

• Image 2: M140625419LE + M140625419RE

• Image 3: M143000050LE

• Image 4: M180899951LE

• Image 5: M1100039834RE

In M140625419LE and M140625419RE, since all the targeted region is not in the angle of view as shown in Figure 7, we integrated with each other. Note that the time stamps of the two images are identical (i.e., M140625419LE and M140625419RE were taken by the left and right cameras of LRO almost at the same time).

Figure 7: In the above two images, all the target region is not in the angle of view. Hence, we merged them into one image (i.e., Image 2).

Finally, the simulated environment was created by first obtaining five images of the target region and then linearly interpolating them so that the duration between images was 1.8 hours. The overall mission period was assumed to be 15 days, so the number of images was 200.

designed for accessibility and to further open science