b

DiscoverSearch
About
My stuff
Safe Exploration for Identifying Linear Systems via Robust Optimization
2017·arXiv
Abstract
Abstract

Safely exploring an unknown dynamical system is critical to the deployment of reinforcement learning (RL) in physical systems where failures may have catastrophic consequences. In scenarios where one knows little about the dynamics, diverse transition data covering relevant regions of state-action space is needed to apply either model-based or model-free RL. Motivated by the cooling of Google’s data centers, we study how one can safely identify the parameters of a system model with a desired accuracy and confidence level. In particular, we focus on learning an unknown linear system with Gaussian noise assuming only that, initially, a nominal safe action is known. Define safety as satisfying specific linear constraints on the state space (e.g., requirements on process variable) that must hold over the span of an entire trajectory, and given a Probably Approximately Correct (PAC) style bound on the estimation error of model parameters, we show how to compute safe regions of action space by gradually growing a ball around the nominal safe action. One can apply any exploration strategy where actions are chosen from such safe regions. Experiments on a stylized model of data center cooling dynamics show how computing proper safe regions can increase the sample efficiency of safe exploration.

Applying reinforcement learning (RL) to control physical systems in complex, failure-prone environments is non-trivial. The advantages of RL is significant: deploying an agent that is able to learn and optimally control a sophisticated physical system reduces the amount of manual tuning, hard-coding and guess work that goes in to designing good controllers in many applications.

However, a key problem in an unknown system is that the RL agent must explore to learn the relevant aspects of the dynamics for optimal control. Furthermore, in a failure-sensitive environment, the agent must explore without triggering unsafe events. This is known as the safe exploration problem and has recently gained prominence in both the research community and the broader public consciousness [10, 1, 22]. This interest has also been driven by some recent successes of RL over traditional control regimes for physical systems such as those for controlling pumps, chillers and cooling towers for data centers [8].

In this paper, we focus a critical aspect of model-based RL control, namely, system identification subject to safety constraints. Specifically, working with linear Gaussian dynamics models—motivated by a data center cooling system control at Google1—we define safety w.r.t. a domain-specific set of linear constraints over state variables that must be satisfied at all times. This is the common notion of safety in industrial applications where certain process variables (system state variables being monitored or controlled) should meet certain target set points, and must not diverge more than a fixed amount from those points. Once a system dynamics model is learned to a sufficient degree, it can be used in a model-based RL or other control algorithm.

Unlike recent work that takes a Bayesian approach to modelling system uncertainty, we adopt a strict uncertainty perspective, and reason with a set of models within which the true system model must lie with high probability. The set itself is defined using PAC-style bounds derived from an exploration data set over state-action space. With this perspective, we compute safe regions of action space where, starting with an initial state, one can take any sequence of actions within the safe region and guarantee satisfaction of the (state) safety constraints on the states at every step of the trajectory.

Our approach assumes the existence of a nominal safe action, namely, an action that is safe in the sense that it, if continuously applied, will eventually drive the system to a safe state no matter what the initial state, and will maintain safety in steady state. A safe region is then computed by gradually growing a ball around the nominal action, or some other action known to be safe. Our experiments use a modified version of the linear-Gaussian dynamics of a data center’s cooling system. We show how our algorithms can be used to efficiently and safely explore while also computing high probability bounds on the estimated model quality.

Given our definition of safety, perhaps the most relevant work is that of Akametalu et al. [14], who require that, during exploration, the states at every time step must reside within a pre-defined region of safety. They focus on computing a subset of the safe states D with the property that, when starting from any initial state in D, one can execute any exploration policy as long as certain safe actions are taken when the state starts drifting out of D. However, computation of D scales exponentially in the state dimension. The transition dynamics are assumed to be partially known but a deterministic disturbance function is unknown and modeled as a Gaussian process (GP). While their safety definition bears a close resemblance to ours, we determine safe regions of action space as a function of the initial state rather than computing safe initial states from which to explore. Furthermore, we do not assume any prior knowledge of the dynamics model and instead treat model uncertainty with PAC-like bounds on the quality of least-squares parameter estimates. Safe exploration with a finite state space, known deterministic transition and an uncertain safety function modeled with a GP is studied by Turchetta et al. [26], who develop an exploration algorithm with provable safety guarantees. Several other lines of work also use GPs for safe exploration [27, 25, 4]. Our approach is not Bayesian, and does not explicitly encounter computational complexity issues often associated with Bayesian inference.

Our approach, using robust optimization of a worst-case safety violation, is somewhat related to robust MDPs (RMDPs) [21, 13]. In RMDPs one assumes the true model lies in a smaller model set Θ defined by an error upper bound on the transition probabilities, then the goal is to optimize a policy that performs best with respect to an adversarial choice of the true transition model in Θ. We also work with a model set Θ where the true model lies with high probability but instead of optimizing a policy (where one can also encode safety constraints) we are interested in finding safe action regions. Recent work [11] considers the RMDP setting where the safety goal is to find a policy that performs better than some baseline policy, with most of the analysis in the finite-state setting, and methods that do not easily scale to continuous, multi-dimensional state spaces. Similar work [9, 15] also tries to improve on a baseline policy. We do not assume a safe baseline policy exists, but instead assume a known safe action (see above).

System identification is a well-studied topic in control theory [23, 18, 3]. Typically it consists of three phases: (1) data gathering; (2) model class selection; and (3) parameter estimation [18]. In this work, we primarily deal with in safe exploration for data gathering and parameter estimation in the context, quantifying the approximation quality of parameter estimates and using this as a stopping criterion for exploration. With regard to model class selection, various model types are studied in the literature, including ARX (autoregressive with exogenous inputs), ARMA (autoregressive moving average) and ARMAX (autoregressive moving average with exogeneous inputs) [5]. Our paper focuses on learning linear ARX models as opposed to non-linear models, such as neural networks [6]. For exploration, the celebrated result on persistently exciting inputs [2] asserts that control input signals that are rich enough—having a certain number of distinct frequencies—provide sufficiently diverse training data to learn the true model. We leverage this insight in our work.

In the context of robust parameter estimation, there is a line of work on bounded noise models in system identification [29, 24, 28]. This paradigm does not assume any statistical form of the noise inherent in state observations, but instead assumes only that the noise is bounded by an unknown constant. Researchers have studied the existence and characteristics of estimators for various optimality concepts [19]. In our setting, we assume a statistical distribution for the noise model, but we compute robust, safe actions with respect to the uncertainty in the model parameters estimated from trajectory data.

We first describe some basic concepts and notation related to sequential decision making. Since our paper’s main results center on linear transition models with Gaussian noise, we address this model class in some detail and outline some corresponding PAC-style bounds for estimating the parameters of these models. These bounds depend on properties of the training data—generally, a time series of system behaviour—and thus do not require that training examples be independently generated.

3.1 Background and Notation

Let  X ⊆ Rdbe the state space and  U ⊆ Rd′the action space. We generally view each dimension of the state as representing some process variable to be monitored or controlled, and each action dimension as some control variable to be set. For example, the state might be a vector of sensor readings while the action might be a vector of hardware parameter configurations.2Let  xt ∈ Xdenote the system state at time t and  xtℓthe value of its  ℓ-th variable, for t ∈ {0, 1, 2, . . .}. Likewise, let  ut ∈ Ube the action taken at time t in state  xt. There is an underlying, unknown  transition model P ∗(xt+1 | xt, ut), i.e., the distribution from which the state at the next time step  xt+1is generated. The transition model is stationary and Markovian (i.e., depends only on the current state  xtand action  ut).

We consider time-based episodes or trajectories of state-action sequences: (x0, u0), . . . , (xT −1, uT −1), (xT). A length T episodes is typically generated by providing  initial state x0to our agent, which executes an exploration policy that selects/executes action  utin response to observing state  xt(and past history), which induces  xt+1according the true system dynamics, for  T −1 stages. The initial state  x0might be exogeneously realized and may differ across episodes. Our goal is to ensure safety during an episode.

We use the following notation throughout. Let  vi:j = vi, . . . , vjdenote a sequence of  j − i+ 1 vectors starting at index i. Let M be a matrix. We use  M′to denote the matrix transpose. We use  ∥M∥Fto denote the Frobenius norm, and for a vector  v, we use ∥v∥to denote the 2-norm. We denote by  M[ℓ, :] theℓ-th row vector of M. For two row vectors v and w, let�v w�be the row vector obtained by concatenating v and w.

3.2 Linear Gaussian Transition Models and Estimation Bounds

Our goal is to identify the underlying model  P ∗. We restrict this model to belong to a class of transition models  {Pθ | θ ∈ Θ}defined by a parameter space Θ. In particular, we assume there is some  θ∗ ∈Θ such that  P ∗=  Pθ∗. We assume Θ corresponds to parameters of a linear-Gaussian model, though our notion of safety and robust optimization formulations apply more generally. Specifically, the next state is given by

image

where  A ∈ Rd×dand  B ∈ Rd×d′. A constant additive bias vector can be incorporated in  Butby adding an extra dimension to  utwith a constant value of 1. We assume a Gaussian random noise vector  ξt ∼ N(0, σ2Id) where  Idis the  d × didentity matrix. We also assume  ξtis independent of  ξt′for any t′ ̸= tand is independent of the previous state  xt−1. For ease of exposition, we omit  σfrom Θ—it can be easily estimated as we describe below. Thus Θ =  {�A B� | A ∈ Rd×d, B ∈ Rd×d′}. We denote by  θ[ℓ,:] = [A[ℓ,:]  B[ℓ,:]] the  ℓ-th row of the concatenated matrix.

Suppose we have existing time series (xi, ui)n−1i=0 data generated by a (possibly randomized) exploration strategy. Let  Xn be the (n-row) data matrix whose ith row is vector  x′i−1. Suppose, furthermore, that the exploration strategy meets the persistence of excitation condition [18]. That is, the matrix of auto-covariances of the actions in the data is positive definite (e.g., independent, uniform random actions for each  t ≥0 satisfy this property). Then the least-squares estimates ˆθ =� ˆA ˆB�, where the predicted labels are the next states, asymptotically converges to the following multivariate Gaussian distribution (see, e.g., [12]):

image

where Q = limn→∞ 1nX′nXn, which we assume is positive definite. Let ∆θ= θ − θ∗and ∆ℓθ = θ[ℓ,:]  − θ∗[ℓ,:]. We can get an elliptical confidence bound on our estimates from the Gaussian distribution in (1). Specifically, the distribution over trajectories—induced by the randomness in linear Gaussian transitions (with respect to  θ∗) and any potential possibly randomness in the exploration strategy—satisfies:

image

where  F −1 is the inverse CDF of the chi-squared distribution with  d + d′ degreesof freedom. Let  λmin(Q) be the smallest eigenvalue of Q, which must be positive since Q is positive definite. We can derive a looser confidence bound using the Frobenius norm of ∆ℓˆθ:

image

To make this bound practical, we can approximate  Q and σ with:

image

Note that each row of ˆθconverges at a rate of  O(1/√n). If we apply a union bound over the  d rows of ˆθ, we can get an approximation bound that holds over the entire parameter matrix�A B�:

image

We define safety as satisfying relevant constraints on the system’s state. This is common, for example, in industrial applications where certain process variables must be kept in a pre-specified range. For instance, in data center cooling, the cold-air temperature (i.e., prior to passing over computing hardware to be cooled) may not be allowed to exceed some upper bound (safety threshold). This limit could could be applied to all temperature sensors, the average, or the median or some other quantile (where aggregations may be defined over various regions of the DC floor).

image

straints apply. We adopt a straightforward safety model in which a single inequality constraint is imposed on each such process variable, all of which must be satisfied:

image

Let  s ∈ Rdbe the vector where  sℓ = sℓif  ℓ ∈ Sotherwise  sℓ = Ufor some upper bound U on  xℓ. We use this formulation for ease of exposition and note that interval constraints on  xℓcan be accommodated by creating another state variable  −xℓwith the desired bound.

image

M within a specific region of model space where the true model resides with high probability, the exploration actions we take for the purposes of system identifcation result in satisfying safety conditions according to M’s dynamics.

Definition 1. Let P ∗ be the true but unknown transition model. Let  s ∈ Rd bea vector of upper bound constraints on the state variables. Let  x0be the initial state. We say that a sequence of actions  u0:T −1is safe in expectation if

image

where xt+1 ∼ P ∗(· | xt, ut) for all t ≥ 0. It issafe with probability at least 1−γ

image

The high probability guarantee is much stronger than safety in expectation. However, in practice it may be more difficult computationally to verify safe actions with high probability; and safety in expectation might suffice if the true model  P ∗ is close to the mean estimated model (e.g., when transitions are nearly deterministic).

If we can obtain a high probability bound on each state at time  τ, i.e. Pr(xτ ≤ s) ≥ 1 − γ/T, then by applying the union bound we can satisfy inequality Eq. (4). We focus on this simpler safety requirement in the sequel. In practice, the union bound can be quite conservative given state dependencies across time.

For practical exploration, we would like to take different sequences of actions (e.g., persistently exciting) within a region of action space that guarantees safety. We assume there is a  nominal action u∗that is known to be safe with high probability when applied consistently (i.e., at each time step).

Assumption 2. Let s be the safety bounds, T the time horizon, and  γ > 0 thesafety confidence parameter. There exists a set of initial states  X0 ⊆ X, and a nominal safe action  u∗ such that for all initial states  x0 ∈ X0,

image

where  xt+1 ∼ P ∗(· | xt, u∗) for all t ≥ 0.

Such safe actions exist in many applications. For instance, in DC cooling it is generally feasible (though expensive) to set cooling fans and water flow to very high values (subject to their own constraints) to ensure cold air temperatures are maintained at or below safety thresholds. It is also the case that the nominal safe action will eventually drive “routine” unsafe states (i.e., those not associated with some significant system failure) into the safe region.

The nominal safe action is also useful for identifying the standard deviation parameter  σof our Gaussian noise—we can execute the safe action for sufficiently long period (in practice, the action can be safe in steady state). For simplicity of analysis and exposition, we take ˆσ = σto be known. Our results can be extended easily if we remove this assumption by replacing  σwith an estimate using a high-probability upper bound.

Definition 3. We say that  V ⊆ Uis a safe action region in expectation (respectively, with probability at least 1  − γ)if, for all  u0:T −1 ∈ VT , u0:T −1is safe in expectation (respectively, safe with probability at least 1  − γ).

A safe action region V represents a region in which we can explore freely, without having to re-compute whether a chosen sequence of actions from this region is safe. In practice, we can continuously select exploratory actions from the safe region. Our safety definition guarantees with high probability that no state in a trajectory of length T is unsafe. However, if we re-run the trajectory many times—i.e. run the same sequence of actions starting at the same initial state—the probability of an unsafe event compounds. In practice, however, the additive noise term for the next state is typically small and bounded that states are unlikely to be unsafe. This is the consequence of assuming a Gaussian noise model—if we draw N samples from a Gaussian, for a sufficiently large N, we are likely to draw a very rare event even if its probability is extremely small.

While we have defined safety with respect to constraint satisfaction across an entire trajectory, we first analyze a simple case, namely, whether an action is one-step safe. Specifically, we assess, if we start in a safe state  x0, whether a given action u results in a safe state  x1 ∼ Pθ∗(· | x0, u).

Suppose we have training data generated by past exploration, and that ∥ˆθ∥F < εholds with probability 1  − α. This ensures that, with probability 1  − α, should we verify u is safe for all models  θ =�A B�within  εof the estimated model ˆθ, that u is safe w.r.t. the true model  θ∗. This worst-case, robust analysis ensures safety even for adversarially chosen  θwithin this  ε-ball.

We can formulate the an optimization problem to test such one-step safety. The optimization in Eqs. (5-6) finds the model parameters  θ(i.e., system dynamics [A, B]) within the high-probability  ε-ball around ˆθsuch that the safety constraint defined by any  sℓis maximally violated (if violation is feasible). Specifically, for each process variable  ℓ ∈ S, we check whether the optimal objective value is greater than  sℓ. If no such value is greater, then u is one-step safe in expectation.

image

ε([A[ℓ,:]  B[ℓ,:]]′ −[ ˆA[ℓ,:] ˆB[ℓ,:]]′). To check safety in expectation, we can re-formulate Eqs. (5-6) as:

image

This solution to this problem is the largest singular value of�x′0 u′�, which is its vector 2-norm, and the maximizing  c∗ is the corresponding right-singular vector. Thus, the optimal objective of the original optimization is

image

The maximizing  A[ℓ, :], B[ℓ,:] of the original problem can also be easily derived.

Notice that one-step safety does not imply subsequent states along the trajectory will be safe, even if the nominal safe action is taken. In fact, a one-step safe action may lead to a state where there are no safe actions.

Example 4 (One-step safety insufficient). Define a system whose true dynamics is given by:

image

Suppose our controller/agent knows  A∗, B∗and the system is noise-free. Let ℓ = 1, so the first state variable is the process variable with safety threshold  s1 = 1.Since  A∗ is in Jordan normal form,  ρ(A∗) = 1/2 < 1. When action  u0is applied we obtain  x1 =�1/4 1�′, which is safe. However for any  u1 =�u1 u2�′,

image

which is unsafe—this means the first action, while safe, ensures that no next action is safe. Thus, unless our safety goals are fully myopic, one-step safety will not suffice.

Despite having an easily checkable criteria, one-step safety is insufficient to guarantee safety over a longer duration. In this section, we focus on algorithms for verifying trajectory safety for a given sequence of actions, including running a fixed action over an extended duration. In particular, we develop an algorithm for finding a maximum ball of safe actions centered around a known safe action. This allows one to run, for example, a sequence of randomly chosen actions within the safe action ball. First, we start off with some basic facts for linear Gaussian trajectory rollouts.

6.1 Linear Gaussian Rollouts

Assuming linear Gaussian transitions, we can rollout the state at time  τ ≥1 given action sequence  u0:τ−1

image

As  τ → ∞, E[xτ] might diverge. To ensure convergence, we need the following Schur stability [16] assumption.

Assumption 5. Let θ = (A, B)be the linear Gaussian parameters of a transition model. We assume the spectral radius of A is less than 1. That is, all eigenvalues lie in the open unit circle

image

Aτ can grow arbitrarily large if  ρ(A) >1, and for  ρ(A) = 1 it may diverge or converge depending on properties of A. Schur stability is a reasonable assumption in many applications since we may be aware of an action that when applied at each time step eventually converges to a safe steady state. As  τ → ∞ the effectof the initial state  x0diminishes since  Aτx0 → 0.

image

be derived as

image

so we have,

image

where  S(τ) = �τt=0 AtA′t and 1is the all-one vector. With Schur stability, the variance is bounded:

Theorem 6.  If ρ(A) < 1, limτ→∞ Var[xτ] exists.

Proof. To show the limit exists, let  ϵ >0. We apply Gelfand’s formula [17] which states  ρ(A) = limk→∞ ∥Ak∥1/kfor any matrix norm  ∥ · ∥. By Gelfand’s, there exists a small enough  ϵ′ >0 and a large enough K such that for all  k ≥ Kthe following two conditions hold,

image

Then we have,

image

For any other  k′ ≥ K and k′ < k,

image

Then we have 1σ2 (Var[xkℓ] − Var[xk′ℓ]) < ϵ, this shows  Var[xτℓ] is a Cauchy sequence which always has a limit.

This result shows that there exist a bound  β(A) such that for all  τ ≥1, Var[xτℓ] < β(A) so the variance does not grow unbounded independently of the executed action sequence. This is reassuring for high probability safety as  xτconverges to a steady state Gaussian distribution with finite variance. Moreover, we can compute high probability upper bounds for any process variable  xτℓat any future time step  τ. It can be computed by noticing from Theorem 6 that xτℓis a Gaussian distribution with variance at most limt→∞ Var[xtℓ]. Again, we can compute the expectation and add the appropriate standard deviation multiplier to get the desired confidence bound. That is, for the desired 1  − γconfidence, choose c such that

image

6.2 Limiting Case with a Fixed Action

We first study trajectory safety with respect to a fixed action u that is applied at each time step and ask whether safety constraints are satisfied in the steady state distribution. Obtaining an easily computable “safety with high probability” test is complicated by the form of the variance term in Eq. 9. For this reason, we address this problem later in Sec. 6.3. In this section, we begin by deriving a simpler procedure for determining safety in expectation for a fixed action.

Recall our earlier assumption that  ρ(A) <1. As above, suppose we have a data set generated by past exporation such that we can guarantee  ∥ˆθ∥F ≤ εwith probability at least 1  − γ. Let  θ =�A B�be any “potential true model” such that  ∥θ − ˆθ∥ ≤ ε and ρ(A) <1. The expected steady state under  θ is

image

Note that  I−Ais invertible and limτ→∞ Aτ = 0since, by assumption,  ρ(A) < 1.Recall our robust optimization objective for safety with respect to the  ℓ-th process variable,

image

image

trick as in one-step safety and optimize for an approximating upper bound. The objective Eq. 10 can be modified by subtracting a constant term:

image

This new formulation Eq. 12 is difficult to optimize, but we can easily maximize an objective that upper bounds Eq. 12 by finding an upper bound on the Frobenius norm of the terms inside the parentheses. (Assume all norms below are Frobenius.)

image

Since  ∥A − ˆA∥ ≤ ε, let A = ˆA − E for ∥E∥ ≤ ε. We use the following result for any non-singular matrix M and small perturbation matrix E [7]:

image

where  κ(M) = ∥M∥·∥M−1∥is the condition number of M. By letting  M = I− ˆA,we obtain:

image

Substituting Ineq. 14 to Ineq. 13 gives:

image

Thus we obtain an upper bound the original optimization Eqs. (10-11) as follows:

image

Using the same trick as in the one-step case, we can solve for the optimal value by finding the largest singular value. This gives the following upper bound on the optimal objective value of the original maximization problem:

image

To check the safety in expectation of the  ℓ-th process variable, in the limit, when we execute action u at every time step, we simply test whether  V ∗( ˆA, ˆB, ε, u) >sℓ. Note that this upper bound test will generally be overly conservative— as a consequence, we propose heuristics later in the section that should be tighter in practice.

6.3 Safety within δ-Ball

Assume, as discussed above, we have determined that u is a safe action for the next T time steps under any transition model  θsuch that  ∥θ − ˆθ∥F ≤ εwith some probability (again, it’s assumed that ∆∥ˆθ∥F ≤ ε). For purposes of system identification, we need to explore and not just execute a fixed action. In this section, we discuss how to determine whether a given region of action space is safe (either in expectation or with high probability). We do this in a way that exploits the notion of a fixed safe action discussed above.

Here we focus on regions of action space defined by a  δ-ball of actions,  δ > 0centered at a fixed safe action u:

image

To verify that region  Bδ(u) is safe (see Def. 3), we must ensure that any sequence of actions (u + z)0:τ−1is safe, where  ∥zt − u∥2 ≤ ε. As above, we express this problem as an adversarial optimization for state variable  ℓ at time τ:

image

We address safety in expectation, i.e., we try to approximately solve the above optimization problem when c = 0. Since u is safe, we can upper bound

(19) by:

image

For the max inside the summation, we can again try to bound

image

Let  A = ˆA − E for ∥E∥ ≤ ε, we have

image

where Ineq. 26 follows from triangle inequality and sub-multiplicative property of Frobenius norm. Substituting this inequality into Ineq. 25, and then into Ineq. 24, gives an upper bound on the objective (19) as a function of the  zt’s. This can be solved again by computing the maximum singular value of the resulting matrix.

However, this bound for testing safety in expectation will generally be quite loose, yielding a conservative test of the safety of  Bδ(u). To compensate, we instead consider a direct approximation of the above optimization.

6.4 Computing Largest Safe δ-ball

Due to the looseness of the closed-form upper bound above, we can instead computationally optimize adversarial choices of A, B, b and  ztthat attempt to violate one of the process variables. We optimize (19-22) using alternating optimizing over  A, B and zt. We start with a random initialization (e.g., uniform random) of these parameters. We first optimize  A given B and zt. Note that by Eq. (19), the value of each process variable is a simple multi-variate polynomial over the parameters of A, including the additive standard deviation multiple giving us the desired Gaussian confidence bound. We can optimize this using gradient ascent (and the resulting local maxima can be improved using random restarts).

We can then fix ˜A, and ˜z1, . . . , ˜zτ. Let ˜Atij= ˜a(t)ij, then (B˜zt)i= �mj=1 Bij ˜ztj. The variance term remains constant, giving:

image

where vect(·) is the reshaping of B into vector form, and  ⊗is the tensor product. One can again solve this by formulating it as a maximum singular value problem.

Finally, suppose ˜A and ˜Bare fixed. Each  ztcan be optimized independently in the summation ˜At[ℓ, :] ˜Bztsubject to  ∥zt − u∥2 ≤ δ. This again can be posed as a maximum singular value problem. Thus, for a fixed  δwe can approximate a lower bound on the optimization, which tells us whether there exists an adversarial example that makes  δunsafe. See Algorithm 1. Since  δis a scalar, we can do a simple “galloping” search to find the largest safe  δ. See Algorithm 2

The SafeBall algorithm also allows us to determine whether a given sequence of actions,  u0:T −1, is safe. In SafeBall we just need to fix  zt = utand update only A and B in each iteration of optimization.

The use of a single ball may not be ideal to get diversity of training data. Instead we can generate new fixed, safe actions from which to build new safe balls using the above algorithms. We maintain a list of safe balls from which we update its safe radius and from which we can create new safe balls. For exploration, we may randomly select a safe ball from the list and run exploration by, say, randomly choosing actions within the ball. In our experiments, we elaborate on this method in more detail and illustrate how it improves sample efficiency with respect to model errors.

We now describe two sets of experiments designed to test our algorithms for safely identifying the true underlying model of system dynamics by exploring within growing safe  δ-balls. In the first set of experiments, we test how the largest safe  δ-ball Bδ(u∗) increases in size as a function of  ε, the model approximation error. In the second, we compare how quickly the true model can be identified

image

by gathering exploration data using: (a) a single safe ball; or (b) multiple safe balls. In both experimental settings, we test for safety in expectation.

We test out methods on the control on fancoil operation on a data center floor—fancoil control dictates the amount of airflow over section of the DC floor, as well as the flow of water (used to cool returning hot air) through the fancoils themselves. Typically, very large number of fancoils are distributed throughout the DC and can have significant interactions. In our experiments, we use a model of fancoil and air temperature dynamics that is a modified, normalized model of a single fancoil (including its interactions with other nearby fancoils) derived from a DC cooling system at Google. The control problem is discrete-time system with time steps of of 30 seconds. There modified model has five state variables (d = 5) actions comprised of two control inputs (d′= 2) corresponding to fan speed and water flow. The “true” model to be identified in our experiments has a spectral radius  ρ(A∗) = 0.90. A single state variable (the first) is the only process variable under control. The nominal safe action  u∗= (1.833, 1.857) is known, based on expert knowledge of the domain. We set the trajectory length to be T = 20, which is a sufficient horizon for, say, model-predictive control in this model. We set tolerance  ω = 0.05 for SafeBall’s galloping search.

In the first experiment, we vary the model error parameter  εand ran the MaxSafeBall algorithm to find the largest safe  δ-ball. We set the input parameter estimates ˆθ =� ˆA ˆB�=�A∗ B∗�= θ∗. That is, the algorithm will optimize for the worst  θ ∈ Bε(θ∗) that (potentially) violates the safety constraints. We use an initial safe radius of  δ0= 0.1 around the safe action  u∗at which to begin the SafeBall galloping search. We varied  εfrom 0 to 0.0425 with more granularity near  ε= 0. For some context on the model estimation error  ε,

image

we note that  ∥θ∗∥F= 1.974; hence the largest relative error for  ε= 0.0425 is ε/∥θ∗∥F ≈ 2.15%. Any relative error greater than 2.15% results in a zero radius safe ball—containing only the nominal safe action. We vary the relative error from 0% up to 2.15%. At  ε= 0, when the true parameters are known, we get the largest safe  δ-ball. Results are shown in Fig. 1. The plot shows a non-linear relationship where the largest computed  δdrops at a faster rate for smaller  εcompared to larger values of  ε. The largest  δdrops to 0.2375 when we reach ε = 0.0425.

For the second experiment we simulated a typical use case of gradually identifying the model parameters while running safe exploration to generate training data. In particular, we compare the sample efficiency of running exploration on one dynamically growing safe ball versus many, simultaneously, dynamically growing safe balls. In the true model we add Gaussian noise with standard deviation  σ= 0.01. We set a fixed constraint on the process variable for both the single ball and the multiple ball runs. We run 100 episodes each with trajectory length of T = 20 before updating ˆθand the safe balls (i.e., we update every 2000 training examples). Each episode starts with the same initial state. In practice, this can be accomplished by running the nominal safe action in between episodes until steady state is reached (driving a system to a fixed state is a common assumption, see e.g. [20]).

When generating multiple safe balls after every 100 episodes, we randomly choose 2 actions  u1, u2on the surface of a randomly chosen safe ball (from an existing set of safe balls). We use MaxSafeBall to find the largest  δ1, δ2, respectively, around those chosen actions, with  δ0 = 0.1 and ε = εtthe theoretical model error as given by the error approximation bound of Eq. 2 with confidence fixed at 1  − α = 0.95. If the largest  δcomputed exceeds a minimum lower bound (we choose 0.2) we include the new ball in our list of safe balls. We limit the number of safe balls at 14 to simplify our computations and reduce clutter in our presentation. We generate exploration actions by randomly choosing a safe ball and uniformly at random drawing a sequence of exploration actions within that ball.

Fig. 2 contrasts the reduction in theoretical model error  εtversus actual model error  εa = ∥ˆθ − θ∗∥under the two exploration regimes. It shows that exploration that generates actions from multiple safe balls is generally more sample efficient (i.e., it achieves the same model error with fewer examples). For instance, after 1000 episodes (n = 20000 examples) multi-ball exploration achieves  εt = 0.0224and  εa= 0.0045 while exploration with a single ball achieves  εt= 0.0305 and εa= 0.0075, which is considerably worse As n increases, both theoretical and actual model error decrease at a slower rate (as expected). Moreover, the gap in sample efficiency between the two regimes gets smaller at n grows.

We note that theoretical model error is conservative estimate of the actual model error. Furthermore, actual model error is a conservative measure when using the model to optimize for a good control policy, which does not necessarily require a fully accurate model. In terms of exploration time, n = 20000 examples corresponds to 166 hours or about 6 days and 22 hours on a single fancoil. However, “practical” identification will proceed much more quickly for two reasons. First, as discussed above, a data center has many fancoils, this exploration time can be reduced significantly by exploring across multiple fancoils simultaneously. Second, we generally do not require accurate system identification across the full range of dynamics: for example, once model confidence is sufficient for specific regions of state-control space such that these points can be shown to never (or “rarely”) be reached under optimal control, further model refinement in unneccesary (i.e., the value of information for more precise identification is low).

Fig. 3 shows the safe ball(s) after being updated for various values of n (i.e. immediately after n examples have been generated) for the two regimes. One can see that the area of action space covered when exploring with many safe balls grows more quickly than with just one ball. In fact, the original safe ball around  u∗is larger in the many balls regime due to the smaller  εtthat results from better sample efficiency. For values of  n ≥50000 the balls in both regimes do not grow as quickly, this corresponds to the model errors starting to decline at a slower rate.

We studied safe exploration for identifying system parameters of a linear Gaussian model. In our approach we defined safety as satisfying linear inequality constraints on the state at each time step of a trajectory. This is motivated by applications involving control of process variables and where violating those constraints may lead to significant negative impacts. We showed how to compute many regions of safe actions when given a nominal safe action. Confined to taking actions within each safe region, any exploration strategy may be applied and theoretical bounds on the approximation quality of an estimated model can be computed online. By computing and exploring within many safe regions, our experiments show we can increase the sample-efficiency of exploration. For future work, we would like to extend safe exploration to not just identify the model as accurately as possible, but to gather just the right kind of transition

image

Figure 1: The largest safe ball around the nominal safe action  u∗ as approximated by MaxSafeBall for various input model errors  ε.

image

Figure 2: Comparing the theoretical and actual model errors under two safe exploration regimes: (1) random actions chosen from up to 14 safe balls or (2) random actions chosen from a single ball.

image

Figure 3: Left column plots show the gradual addition of safe balls (i.e., circles, up to 14) and growth of each ball. Each plot has x and y axis corresponding to the two control inputs. Each circle within a plot indicates a safe action region where any sequence of T actions is safe. Plots on the right shows gradual growth of a single safe ball.

samples that optimizes the quality of the controller.

We like to thank Nevena Lazic, Eehern Wang and Greg Imwalle for useful discussions.

[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man´e. Concrete problems in AI safety. arXiv:1606.06565, 2016.

[2] Karl-Johan ˚Astr¨om and Torsten Bohlin. Numerical identification of linear dynamic systems from normal operating records. In Theory of Self-Adaptive Control Systems, pages 96–111, Teddington, UK, 1967.

[3] Karl-Johan ˚Astr¨om and Pieter Eykhoff. System identification: A survey. Automatica, 7(2):123–162, 1971.

[4] Felix Berkenkamp and Angela P. Schoellig. Safe and robust learning control with Gaussian processes. In European Control Conference (ECC), pages 2496–2501, 2015.

[5] George E. P. Box, Gwilym Jenkins, Gregory C. Reinsel, and Greta M. Ljung. Time Series Analysis: Forecasting and Control (5th Edition). Wiley, Hoboken, NJ, 2015.

[6] S. Chen, Billings S. A., and P. M. Grant. Non-linear system identification using neural networks. International Journal of Control, 51:1191–1214, 1990.

[7] James Demmel. The componentwise distance to the nearest singular matrix. SIAM Journal on Matrix Analysis and Applications, 13:10–19, 1992.

[8] Jim Gao. Machine learning applications for data center optimization. Google white paper, 2014.

[9] Javier Garc´ıa and Fernando Fern´andez. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45(1):515–564, 2012.

[10] Javier Garc´ıa and Fernando Fern´andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437– 1480, 2015.

[11] Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems 29 (NIPS-16), pages 2298–2306, Barcelona, Spain, 2016.

[12] William H. Greene, editor. Econometric Analysis (8th Edition). Pearson, New York, NY, 2018.

[13] G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):1–21, 2005.

[14] Anayo K. Akametalu, Shahab Kaynama, Jaime Fisac, Melaine N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proceedings of the IEEE Conference on Decision and Control, pages 1424–1431, Los Angeles, CA, 2014.

[15] Rogier Koppejan and Shimon Whiteson. Neuroevolutionary reinforcement learning for generalized control of simulated helicopters. Evolutionary Intelligence, 4(4):219–241, 2011.

[16] Joseph P. LaSelle, editor. The Stability and Control of Discrete Processes. Springer-Verlag, New York, NY, 1986.

[17] Peter D. Lax, editor. Functional Analysis. Wiley, New York, 2002.

[18] Lennart Ljung, editor. System Identification: Theory for the User (2nd Edition). Prentice Hall, Upper Saddle River, New Jersey, 1999.

[19] Mario Milanese and Antonio Vicino. Optimal estimation theory for dynamic systems with set membership uncertainty: An overview. In Mario Milanese, John Norton, H´el`ene Piet-Lahanier, and ´Eric Walter, editors, Bounding Approaches to System Identification, chapter 3, pages 5–27. Springer, Boston, MA, 1996.

[20] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In International Conference on Machine Learning, pages 1451–1458, Edinburgh, Scotland, 2012.

[21] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(1):780– 798, 2005.

[22] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In Modelling and Simulation for Autonomous Systems, pages 357–375, Rome, Italy, 2014.

[23] Rik Pintelon and Johan Schoukens, editors. System Identification: A Frequency Domain Approach (2nd Edition). Wiley, Hoboken, NJ, 2012.

[24] Fred C. Schweppe. Recursive state estimation: Unknown but bounded errors and system inputs. IEEE Transactions on Automatic Control, 13:22–28, 1968.

[25] Yanan Sui, Alkis Gotovos, Joel W. Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pages 997–1005, Lille, France, 2015.

[26] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems 29 (NIPS-16), pages 4312–4320, Barcelona, Spain, 2016.

[27] Matteo Turchetta, Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. arXiv:1705.08551, 2017.

[28] Eric Walter and H´el`ene Piet-Lahanier. Estimation of parameter bounds from bounded-error data: a survey. Mathematics and Computers in Simulation, 32:449–468, 1990.

[29] Hans S. Witsenhausen. Sets of possible states of linear systems given perturbed observations. IEEE Transactions on Automatic Control, 13:556– 558, 1968.


Designed for Accessibility and to further Open Science