Applying reinforcement learning (RL) to control physical systems in complex, failure-prone environments is non-trivial. The advantages of RL is significant: deploying an agent that is able to learn and optimally control a sophisticated physical system reduces the amount of manual tuning, hard-coding and guess work that goes in to designing good controllers in many applications.
However, a key problem in an unknown system is that the RL agent must explore to learn the relevant aspects of the dynamics for optimal control. Furthermore, in a failure-sensitive environment, the agent must explore without triggering unsafe events. This is known as the safe exploration problem and has recently gained prominence in both the research community and the broader public consciousness [10, 1, 22]. This interest has also been driven by some recent successes of RL over traditional control regimes for physical systems such as those for controlling pumps, chillers and cooling towers for data centers [8].
In this paper, we focus a critical aspect of model-based RL control, namely, system identification subject to safety constraints. Specifically, working with linear Gaussian dynamics models—motivated by a data center cooling system control at Google—we define safety w.r.t. a domain-specific set of linear constraints over state variables that must be satisfied at all times. This is the common notion of safety in industrial applications where certain process variables (system state variables being monitored or controlled) should meet certain target set points, and must not diverge more than a fixed amount from those points. Once a system dynamics model is learned to a sufficient degree, it can be used in a model-based RL or other control algorithm.
Unlike recent work that takes a Bayesian approach to modelling system uncertainty, we adopt a strict uncertainty perspective, and reason with a set of models within which the true system model must lie with high probability. The set itself is defined using PAC-style bounds derived from an exploration data set over state-action space. With this perspective, we compute safe regions of action space where, starting with an initial state, one can take any sequence of actions within the safe region and guarantee satisfaction of the (state) safety constraints on the states at every step of the trajectory.
Our approach assumes the existence of a nominal safe action, namely, an action that is safe in the sense that it, if continuously applied, will eventually drive the system to a safe state no matter what the initial state, and will maintain safety in steady state. A safe region is then computed by gradually growing a ball around the nominal action, or some other action known to be safe. Our experiments use a modified version of the linear-Gaussian dynamics of a data center’s cooling system. We show how our algorithms can be used to efficiently and safely explore while also computing high probability bounds on the estimated model quality.
Given our definition of safety, perhaps the most relevant work is that of Akametalu et al. [14], who require that, during exploration, the states at every time step must reside within a pre-defined region of safety. They focus on computing a subset of the safe states D with the property that, when starting from any initial state in D, one can execute any exploration policy as long as certain safe actions are taken when the state starts drifting out of D. However, computation of D scales exponentially in the state dimension. The transition dynamics are assumed to be partially known but a deterministic disturbance function is unknown and modeled as a Gaussian process (GP). While their safety definition bears a close resemblance to ours, we determine safe regions of action space as a function of the initial state rather than computing safe initial states from which to explore. Furthermore, we do not assume any prior knowledge of the dynamics model and instead treat model uncertainty with PAC-like bounds on the quality of least-squares parameter estimates. Safe exploration with a finite state space, known deterministic transition and an uncertain safety function modeled with a GP is studied by Turchetta et al. [26], who develop an exploration algorithm with provable safety guarantees. Several other lines of work also use GPs for safe exploration [27, 25, 4]. Our approach is not Bayesian, and does not explicitly encounter computational complexity issues often associated with Bayesian inference.
Our approach, using robust optimization of a worst-case safety violation, is somewhat related to robust MDPs (RMDPs) [21, 13]. In RMDPs one assumes the true model lies in a smaller model set Θ defined by an error upper bound on the transition probabilities, then the goal is to optimize a policy that performs best with respect to an adversarial choice of the true transition model in Θ. We also work with a model set Θ where the true model lies with high probability but instead of optimizing a policy (where one can also encode safety constraints) we are interested in finding safe action regions. Recent work [11] considers the RMDP setting where the safety goal is to find a policy that performs better than some baseline policy, with most of the analysis in the finite-state setting, and methods that do not easily scale to continuous, multi-dimensional state spaces. Similar work [9, 15] also tries to improve on a baseline policy. We do not assume a safe baseline policy exists, but instead assume a known safe action (see above).
System identification is a well-studied topic in control theory [23, 18, 3]. Typically it consists of three phases: (1) data gathering; (2) model class selection; and (3) parameter estimation [18]. In this work, we primarily deal with in safe exploration for data gathering and parameter estimation in the context, quantifying the approximation quality of parameter estimates and using this as a stopping criterion for exploration. With regard to model class selection, various model types are studied in the literature, including ARX (autoregressive with exogenous inputs), ARMA (autoregressive moving average) and ARMAX (autoregressive moving average with exogeneous inputs) [5]. Our paper focuses on learning linear ARX models as opposed to non-linear models, such as neural networks [6]. For exploration, the celebrated result on persistently exciting inputs [2] asserts that control input signals that are rich enough—having a certain number of distinct frequencies—provide sufficiently diverse training data to learn the true model. We leverage this insight in our work.
In the context of robust parameter estimation, there is a line of work on bounded noise models in system identification [29, 24, 28]. This paradigm does not assume any statistical form of the noise inherent in state observations, but instead assumes only that the noise is bounded by an unknown constant. Researchers have studied the existence and characteristics of estimators for various optimality concepts [19]. In our setting, we assume a statistical distribution for the noise model, but we compute robust, safe actions with respect to the uncertainty in the model parameters estimated from trajectory data.
We first describe some basic concepts and notation related to sequential decision making. Since our paper’s main results center on linear transition models with Gaussian noise, we address this model class in some detail and outline some corresponding PAC-style bounds for estimating the parameters of these models. These bounds depend on properties of the training data—generally, a time series of system behaviour—and thus do not require that training examples be independently generated.
3.1 Background and Notation
Let be the state space and
the action space. We generally view each dimension of the state as representing some process variable to be monitored or controlled, and each action dimension as some control variable to be set. For example, the state might be a vector of sensor readings while the action might be a vector of hardware parameter configurations.
Let
denote the system state at time t and
the value of its
-th variable, for
. Likewise, let
be the action taken at time t in state
. There is an underlying, unknown
), i.e., the distribution from which the state at the next time step
is generated. The transition model is stationary and Markovian (i.e., depends only on the current state
and action
We consider time-based episodes or trajectories of state-action sequences: (). A length T episodes is typically generated by providing
to our agent, which executes an exploration policy that selects/executes action
in response to observing state
(and past history), which induces
according the true system dynamics, for
1 stages. The initial state
might be exogeneously realized and may differ across episodes. Our goal is to ensure safety during an episode.
We use the following notation throughout. Let denote a sequence of
+ 1 vectors starting at index i. Let M be a matrix. We use
to denote the matrix transpose. We use
to denote the Frobenius norm, and for a vector
to denote the 2-norm. We denote by
-th row vector of M. For two row vectors v and w, let
be the row vector obtained by concatenating v and w.
3.2 Linear Gaussian Transition Models and Estimation Bounds
Our goal is to identify the underlying model . We restrict this model to belong to a class of transition models
defined by a parameter space Θ. In particular, we assume there is some
Θ such that
=
. We assume Θ corresponds to parameters of a linear-Gaussian model, though our notion of safety and robust optimization formulations apply more generally. Specifically, the next state is given by
where and
. A constant additive bias vector can be incorporated in
by adding an extra dimension to
with a constant value of 1. We assume a Gaussian random noise vector
) where
is the
identity matrix. We also assume
is independent of
for any
and is independent of the previous state
. For ease of exposition, we omit
from Θ—it can be easily estimated as we describe below. Thus Θ =
. We denote by
:] = [
:]
:]] the
-th row of the concatenated matrix.
Suppose we have existing time series (data generated by a (possibly randomized) exploration strategy. Let
-row) data matrix whose ith row is vector
. Suppose, furthermore, that the exploration strategy meets the persistence of excitation condition [18]. That is, the matrix of auto-covariances of the actions in the data is positive definite (e.g., independent, uniform random actions for each
0 satisfy this property). Then the least-squares estimates ˆ
, where the predicted labels are the next states, asymptotically converges to the following multivariate Gaussian distribution (see, e.g., [12]):
where Q = lim, which we assume is positive definite. Let ∆
=
and ∆
:]
:]. We can get an elliptical confidence bound on our estimates from the Gaussian distribution in (1). Specifically, the distribution over trajectories—induced by the randomness in linear Gaussian transitions (with respect to
) and any potential possibly randomness in the exploration strategy—satisfies:
where is the inverse CDF of the chi-squared distribution with
of freedom. Let
) be the smallest eigenvalue of Q, which must be positive since Q is positive definite. We can derive a looser confidence bound using the Frobenius norm of ∆
To make this bound practical, we can approximate
Note that each row of ˆconverges at a rate of
). If we apply a union bound over the
, we can get an approximation bound that holds over the entire parameter matrix
We define safety as satisfying relevant constraints on the system’s state. This is common, for example, in industrial applications where certain process variables must be kept in a pre-specified range. For instance, in data center cooling, the cold-air temperature (i.e., prior to passing over computing hardware to be cooled) may not be allowed to exceed some upper bound (safety threshold). This limit could could be applied to all temperature sensors, the average, or the median or some other quantile (where aggregations may be defined over various regions of the DC floor).
straints apply. We adopt a straightforward safety model in which a single inequality constraint is imposed on each such process variable, all of which must be satisfied:
Let be the vector where
if
otherwise
for some upper bound U on
. We use this formulation for ease of exposition and note that interval constraints on
can be accommodated by creating another state variable
with the desired bound.
M within a specific region of model space where the true model resides with high probability, the exploration actions we take for the purposes of system identifcation result in satisfying safety conditions according to M’s dynamics.
be the true but unknown transition model. Let
a vector of upper bound constraints on the state variables. Let
be the initial state. We say that a sequence of actions
is safe in expectation if
safe with probability at least 1
The high probability guarantee is much stronger than safety in expectation. However, in practice it may be more difficult computationally to verify safe actions with high probability; and safety in expectation might suffice if the true model is close to the mean estimated model (e.g., when transitions are nearly deterministic).
If we can obtain a high probability bound on each state at time , i.e. Pr(
, then by applying the union bound we can satisfy inequality Eq. (4). We focus on this simpler safety requirement in the sequel. In practice, the union bound can be quite conservative given state dependencies across time.
For practical exploration, we would like to take different sequences of actions (e.g., persistently exciting) within a region of action space that guarantees safety. We assume there is a that is known to be safe with high probability when applied consistently (i.e., at each time step).
Assumption 2. Let s be the safety bounds, T the time horizon, and safety confidence parameter. There exists a set of initial states
, and a nominal safe action
such that for all initial states
where
Such safe actions exist in many applications. For instance, in DC cooling it is generally feasible (though expensive) to set cooling fans and water flow to very high values (subject to their own constraints) to ensure cold air temperatures are maintained at or below safety thresholds. It is also the case that the nominal safe action will eventually drive “routine” unsafe states (i.e., those not associated with some significant system failure) into the safe region.
The nominal safe action is also useful for identifying the standard deviation parameter of our Gaussian noise—we can execute the safe action for sufficiently long period (in practice, the action can be safe in steady state). For simplicity of analysis and exposition, we take ˆ
to be known. Our results can be extended easily if we remove this assumption by replacing
with an estimate using a high-probability upper bound.
Definition 3. We say that is a safe action region in expectation (respectively, with probability at least 1
if, for all
is safe in expectation (respectively, safe with probability at least 1
A safe action region V represents a region in which we can explore freely, without having to re-compute whether a chosen sequence of actions from this region is safe. In practice, we can continuously select exploratory actions from the safe region. Our safety definition guarantees with high probability that no state in a trajectory of length T is unsafe. However, if we re-run the trajectory many times—i.e. run the same sequence of actions starting at the same initial state—the probability of an unsafe event compounds. In practice, however, the additive noise term for the next state is typically small and bounded that states are unlikely to be unsafe. This is the consequence of assuming a Gaussian noise model—if we draw N samples from a Gaussian, for a sufficiently large N, we are likely to draw a very rare event even if its probability is extremely small.
While we have defined safety with respect to constraint satisfaction across an entire trajectory, we first analyze a simple case, namely, whether an action is one-step safe. Specifically, we assess, if we start in a safe state , whether a given action u results in a safe state
Suppose we have training data generated by past exploration, and that ∆ˆ
holds with probability 1
. This ensures that, with probability 1
, should we verify u is safe for all models
within
of the estimated model ˆ
, that u is safe w.r.t. the true model
. This worst-case, robust analysis ensures safety even for adversarially chosen
within this
We can formulate the an optimization problem to test such one-step safety. The optimization in Eqs. (5-6) finds the model parameters (i.e., system dynamics [A, B]) within the high-probability
-ball around ˆ
such that the safety constraint defined by any
is maximally violated (if violation is feasible). Specifically, for each process variable
, we check whether the optimal objective value is greater than
. If no such value is greater, then u is one-step safe in expectation.
:]
:]]
[ ˆ
:] ˆ
:]]
). To check safety in expectation, we can re-formulate Eqs. (5-6) as:
This solution to this problem is the largest singular value of, which is its vector 2-norm, and the maximizing
is the corresponding right-singular vector. Thus, the optimal objective of the original optimization is
The maximizing :] of the original problem can also be easily derived.
Notice that one-step safety does not imply subsequent states along the trajectory will be safe, even if the nominal safe action is taken. In fact, a one-step safe action may lead to a state where there are no safe actions.
Example 4 (One-step safety insufficient). Define a system whose true dynamics is given by:
Suppose our controller/agent knows and the system is noise-free. Let
, so the first state variable is the process variable with safety threshold
Since
is in Jordan normal form,
. When action
is applied we obtain
, which is safe. However for any
which is unsafe—this means the first action, while safe, ensures that no next action is safe. Thus, unless our safety goals are fully myopic, one-step safety will not suffice.
Despite having an easily checkable criteria, one-step safety is insufficient to guarantee safety over a longer duration. In this section, we focus on algorithms for verifying trajectory safety for a given sequence of actions, including running a fixed action over an extended duration. In particular, we develop an algorithm for finding a maximum ball of safe actions centered around a known safe action. This allows one to run, for example, a sequence of randomly chosen actions within the safe action ball. First, we start off with some basic facts for linear Gaussian trajectory rollouts.
6.1 Linear Gaussian Rollouts
Assuming linear Gaussian transitions, we can rollout the state at time 1 given action sequence
As ] might diverge. To ensure convergence, we need the following Schur stability [16] assumption.
be the linear Gaussian parameters of a transition model. We assume the spectral radius of A is less than 1. That is, all eigenvalues lie in the open unit circle
can grow arbitrarily large if
1, and for
) = 1 it may diverge or converge depending on properties of A. Schur stability is a reasonable assumption in many applications since we may be aware of an action that when applied at each time step eventually converges to a safe steady state. As
of the initial state
diminishes since
be derived as
so we have,
where is the all-one vector. With Schur stability, the variance is bounded:
Theorem 6.
Proof. To show the limit exists, let 0. We apply Gelfand’s formula [17] which states
) = lim
for any matrix norm
. By Gelfand’s, there exists a small enough
0 and a large enough K such that for all
the following two conditions hold,
Then we have,
For any other
Then we have , this shows
] is a Cauchy sequence which always has a limit.
This result shows that there exist a bound ) such that for all
1,
) so the variance does not grow unbounded independently of the executed action sequence. This is reassuring for high probability safety as
converges to a steady state Gaussian distribution with finite variance. Moreover, we can compute high probability upper bounds for any process variable
at any future time step
. It can be computed by noticing from Theorem 6 that
is a Gaussian distribution with variance at most lim
]. Again, we can compute the expectation and add the appropriate standard deviation multiplier to get the desired confidence bound. That is, for the desired 1
confidence, choose c such that
6.2 Limiting Case with a Fixed Action
We first study trajectory safety with respect to a fixed action u that is applied at each time step and ask whether safety constraints are satisfied in the steady state distribution. Obtaining an easily computable “safety with high probability” test is complicated by the form of the variance term in Eq. 9. For this reason, we address this problem later in Sec. 6.3. In this section, we begin by deriving a simpler procedure for determining safety in expectation for a fixed action.
Recall our earlier assumption that 1. As above, suppose we have a data set generated by past exporation such that we can guarantee
∆ˆ
with probability at least 1
. Let
be any “potential true model” such that
1. The expected steady state under
Note that is invertible and lim
since, by assumption,
Recall our robust optimization objective for safety with respect to the
-th process variable,
trick as in one-step safety and optimize for an approximating upper bound. The objective Eq. 10 can be modified by subtracting a constant term:
This new formulation Eq. 12 is difficult to optimize, but we can easily maximize an objective that upper bounds Eq. 12 by finding an upper bound on the Frobenius norm of the terms inside the parentheses. (Assume all norms below are Frobenius.)
Since . We use the following result for any non-singular matrix M and small perturbation matrix E [7]:
where is the condition number of M. By letting
we obtain:
Substituting Ineq. 14 to Ineq. 13 gives:
Thus we obtain an upper bound the original optimization Eqs. (10-11) as follows:
Using the same trick as in the one-step case, we can solve for the optimal value by finding the largest singular value. This gives the following upper bound on the optimal objective value of the original maximization problem:
To check the safety in expectation of the -th process variable, in the limit, when we execute action u at every time step, we simply test whether
. Note that this upper bound test will generally be overly conservative— as a consequence, we propose heuristics later in the section that should be tighter in practice.
6.3 Safety within δ-Ball
Assume, as discussed above, we have determined that u is a safe action for the next T time steps under any transition model such that
with some probability (again, it’s assumed that ∆
). For purposes of system identification, we need to explore and not just execute a fixed action. In this section, we discuss how to determine whether a given region of action space is safe (either in expectation or with high probability). We do this in a way that exploits the notion of a fixed safe action discussed above.
Here we focus on regions of action space defined by a -ball of actions,
centered at a fixed safe action u:
To verify that region ) is safe (see Def. 3), we must ensure that any sequence of actions (
is safe, where
. As above, we express this problem as an adversarial optimization for state variable
We address safety in expectation, i.e., we try to approximately solve the above optimization problem when c = 0. Since u is safe, we can upper bound
(19) by:
For the max inside the summation, we can again try to bound
Let
where Ineq. 26 follows from triangle inequality and sub-multiplicative property of Frobenius norm. Substituting this inequality into Ineq. 25, and then into Ineq. 24, gives an upper bound on the objective (19) as a function of the ’s. This can be solved again by computing the maximum singular value of the resulting matrix.
However, this bound for testing safety in expectation will generally be quite loose, yielding a conservative test of the safety of ). To compensate, we instead consider a direct approximation of the above optimization.
6.4 Computing Largest Safe δ-ball
Due to the looseness of the closed-form upper bound above, we can instead computationally optimize adversarial choices of A, B, b and that attempt to violate one of the process variables. We optimize (19-22) using alternating optimizing over
. We start with a random initialization (e.g., uniform random) of these parameters. We first optimize
. Note that by Eq. (19), the value of each process variable is a simple multi-variate polynomial over the parameters of A, including the additive standard deviation multiple giving us the desired Gaussian confidence bound. We can optimize this using gradient ascent (and the resulting local maxima can be improved using random restarts).
We can then fix ˜A, and ˜. Let ˜
= ˜
, then (
=
. The variance term remains constant, giving:
where vect() is the reshaping of B into vector form, and
is the tensor product. One can again solve this by formulating it as a maximum singular value problem.
Finally, suppose ˜are fixed. Each
can be optimized independently in the summation ˜
subject to
. This again can be posed as a maximum singular value problem. Thus, for a fixed
we can approximate a lower bound on the optimization, which tells us whether there exists an adversarial example that makes
unsafe. See Algorithm 1. Since
is a scalar, we can do a simple “galloping” search to find the largest safe
. See Algorithm 2
The SafeBall algorithm also allows us to determine whether a given sequence of actions, , is safe. In SafeBall we just need to fix
and update only A and B in each iteration of optimization.
The use of a single ball may not be ideal to get diversity of training data. Instead we can generate new fixed, safe actions from which to build new safe balls using the above algorithms. We maintain a list of safe balls from which we update its safe radius and from which we can create new safe balls. For exploration, we may randomly select a safe ball from the list and run exploration by, say, randomly choosing actions within the ball. In our experiments, we elaborate on this method in more detail and illustrate how it improves sample efficiency with respect to model errors.
We now describe two sets of experiments designed to test our algorithms for safely identifying the true underlying model of system dynamics by exploring within growing safe -balls. In the first set of experiments, we test how the largest safe
) increases in size as a function of
, the model approximation error. In the second, we compare how quickly the true model can be identified
by gathering exploration data using: (a) a single safe ball; or (b) multiple safe balls. In both experimental settings, we test for safety in expectation.
We test out methods on the control on fancoil operation on a data center floor—fancoil control dictates the amount of airflow over section of the DC floor, as well as the flow of water (used to cool returning hot air) through the fancoils themselves. Typically, very large number of fancoils are distributed throughout the DC and can have significant interactions. In our experiments, we use a model of fancoil and air temperature dynamics that is a modified, normalized model of a single fancoil (including its interactions with other nearby fancoils) derived from a DC cooling system at Google. The control problem is discrete-time system with time steps of of 30 seconds. There modified model has five state variables (d = 5) actions comprised of two control inputs (= 2) corresponding to fan speed and water flow. The “true” model to be identified in our experiments has a spectral radius
) = 0.90. A single state variable (the first) is the only process variable under control. The nominal safe action
= (1.833, 1.857) is known, based on expert knowledge of the domain. We set the trajectory length to be T = 20, which is a sufficient horizon for, say, model-predictive control in this model. We set tolerance
’s galloping search.
In the first experiment, we vary the model error parameter and ran the MaxSafeBall algorithm to find the largest safe
-ball. We set the input parameter estimates ˆ
. That is, the algorithm will optimize for the worst
) that (potentially) violates the safety constraints. We use an initial safe radius of
= 0.1 around the safe action
at which to begin the SafeBall galloping search. We varied
from 0 to 0.0425 with more granularity near
= 0. For some context on the model estimation error
,
we note that = 1.974; hence the largest relative error for
= 0.0425 is
15%. Any relative error greater than 2.15% results in a zero radius safe ball—containing only the nominal safe action. We vary the relative error from 0% up to 2.15%. At
= 0, when the true parameters are known, we get the largest safe
-ball. Results are shown in Fig. 1. The plot shows a non-linear relationship where the largest computed
drops at a faster rate for smaller
compared to larger values of
. The largest
drops to 0.2375 when we reach
For the second experiment we simulated a typical use case of gradually identifying the model parameters while running safe exploration to generate training data. In particular, we compare the sample efficiency of running exploration on one dynamically growing safe ball versus many, simultaneously, dynamically growing safe balls. In the true model we add Gaussian noise with standard deviation = 0.01. We set a fixed constraint on the process variable for both the single ball and the multiple ball runs. We run 100 episodes each with trajectory length of T = 20 before updating ˆ
and the safe balls (i.e., we update every 2000 training examples). Each episode starts with the same initial state. In practice, this can be accomplished by running the nominal safe action in between episodes until steady state is reached (driving a system to a fixed state is a common assumption, see e.g. [20]).
When generating multiple safe balls after every 100 episodes, we randomly choose 2 actions on the surface of a randomly chosen safe ball (from an existing set of safe balls). We use MaxSafeBall to find the largest
, respectively, around those chosen actions, with
the theoretical model error as given by the error approximation bound of Eq. 2 with confidence fixed at 1
95. If the largest
computed exceeds a minimum lower bound (we choose 0.2) we include the new ball in our list of safe balls. We limit the number of safe balls at 14 to simplify our computations and reduce clutter in our presentation. We generate exploration actions by randomly choosing a safe ball and uniformly at random drawing a sequence of exploration actions within that ball.
Fig. 2 contrasts the reduction in theoretical model error versus actual model error
under the two exploration regimes. It shows that exploration that generates actions from multiple safe balls is generally more sample efficient (i.e., it achieves the same model error with fewer examples). For instance, after 1000 episodes (n = 20000 examples) multi-ball exploration achieves
and
= 0.0045 while exploration with a single ball achieves
= 0.0305 and
= 0.0075, which is considerably worse As n increases, both theoretical and actual model error decrease at a slower rate (as expected). Moreover, the gap in sample efficiency between the two regimes gets smaller at n grows.
We note that theoretical model error is conservative estimate of the actual model error. Furthermore, actual model error is a conservative measure when using the model to optimize for a good control policy, which does not necessarily require a fully accurate model. In terms of exploration time, n = 20000 examples corresponds to 166 hours or about 6 days and 22 hours on a single fancoil. However, “practical” identification will proceed much more quickly for two reasons. First, as discussed above, a data center has many fancoils, this exploration time can be reduced significantly by exploring across multiple fancoils simultaneously. Second, we generally do not require accurate system identification across the full range of dynamics: for example, once model confidence is sufficient for specific regions of state-control space such that these points can be shown to never (or “rarely”) be reached under optimal control, further model refinement in unneccesary (i.e., the value of information for more precise identification is low).
Fig. 3 shows the safe ball(s) after being updated for various values of n (i.e. immediately after n examples have been generated) for the two regimes. One can see that the area of action space covered when exploring with many safe balls grows more quickly than with just one ball. In fact, the original safe ball around is larger in the many balls regime due to the smaller
that results from better sample efficiency. For values of
50000 the balls in both regimes do not grow as quickly, this corresponds to the model errors starting to decline at a slower rate.
We studied safe exploration for identifying system parameters of a linear Gaussian model. In our approach we defined safety as satisfying linear inequality constraints on the state at each time step of a trajectory. This is motivated by applications involving control of process variables and where violating those constraints may lead to significant negative impacts. We showed how to compute many regions of safe actions when given a nominal safe action. Confined to taking actions within each safe region, any exploration strategy may be applied and theoretical bounds on the approximation quality of an estimated model can be computed online. By computing and exploring within many safe regions, our experiments show we can increase the sample-efficiency of exploration. For future work, we would like to extend safe exploration to not just identify the model as accurately as possible, but to gather just the right kind of transition
Figure 1: The largest safe ball around the nominal safe action as approximated by MaxSafeBall for various input model errors
Figure 2: Comparing the theoretical and actual model errors under two safe exploration regimes: (1) random actions chosen from up to 14 safe balls or (2) random actions chosen from a single ball.
Figure 3: Left column plots show the gradual addition of safe balls (i.e., circles, up to 14) and growth of each ball. Each plot has x and y axis corresponding to the two control inputs. Each circle within a plot indicates a safe action region where any sequence of T actions is safe. Plots on the right shows gradual growth of a single safe ball.
samples that optimizes the quality of the controller.
We like to thank Nevena Lazic, Eehern Wang and Greg Imwalle for useful discussions.
[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man´e. Concrete problems in AI safety. arXiv:1606.06565, 2016.
[2] Karl-Johan ˚Astr¨om and Torsten Bohlin. Numerical identification of linear dynamic systems from normal operating records. In Theory of Self-Adaptive Control Systems, pages 96–111, Teddington, UK, 1967.
[3] Karl-Johan ˚Astr¨om and Pieter Eykhoff. System identification: A survey. Automatica, 7(2):123–162, 1971.
[4] Felix Berkenkamp and Angela P. Schoellig. Safe and robust learning control with Gaussian processes. In European Control Conference (ECC), pages 2496–2501, 2015.
[5] George E. P. Box, Gwilym Jenkins, Gregory C. Reinsel, and Greta M. Ljung. Time Series Analysis: Forecasting and Control (5th Edition). Wiley, Hoboken, NJ, 2015.
[6] S. Chen, Billings S. A., and P. M. Grant. Non-linear system identification using neural networks. International Journal of Control, 51:1191–1214, 1990.
[7] James Demmel. The componentwise distance to the nearest singular matrix. SIAM Journal on Matrix Analysis and Applications, 13:10–19, 1992.
[8] Jim Gao. Machine learning applications for data center optimization. Google white paper, 2014.
[9] Javier Garc´ıa and Fernando Fern´andez. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45(1):515–564, 2012.
[10] Javier Garc´ıa and Fernando Fern´andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437– 1480, 2015.
[11] Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems 29 (NIPS-16), pages 2298–2306, Barcelona, Spain, 2016.
[12] William H. Greene, editor. Econometric Analysis (8th Edition). Pearson, New York, NY, 2018.
[13] G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):1–21, 2005.
[14] Anayo K. Akametalu, Shahab Kaynama, Jaime Fisac, Melaine N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proceedings of the IEEE Conference on Decision and Control, pages 1424–1431, Los Angeles, CA, 2014.
[15] Rogier Koppejan and Shimon Whiteson. Neuroevolutionary reinforcement learning for generalized control of simulated helicopters. Evolutionary Intelligence, 4(4):219–241, 2011.
[16] Joseph P. LaSelle, editor. The Stability and Control of Discrete Processes. Springer-Verlag, New York, NY, 1986.
[17] Peter D. Lax, editor. Functional Analysis. Wiley, New York, 2002.
[18] Lennart Ljung, editor. System Identification: Theory for the User (2nd Edition). Prentice Hall, Upper Saddle River, New Jersey, 1999.
[19] Mario Milanese and Antonio Vicino. Optimal estimation theory for dynamic systems with set membership uncertainty: An overview. In Mario Milanese, John Norton, H´el`ene Piet-Lahanier, and ´Eric Walter, editors, Bounding Approaches to System Identification, chapter 3, pages 5–27. Springer, Boston, MA, 1996.
[20] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In International Conference on Machine Learning, pages 1451–1458, Edinburgh, Scotland, 2012.
[21] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(1):780– 798, 2005.
[22] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In Modelling and Simulation for Autonomous Systems, pages 357–375, Rome, Italy, 2014.
[23] Rik Pintelon and Johan Schoukens, editors. System Identification: A Frequency Domain Approach (2nd Edition). Wiley, Hoboken, NJ, 2012.
[24] Fred C. Schweppe. Recursive state estimation: Unknown but bounded errors and system inputs. IEEE Transactions on Automatic Control, 13:22–28, 1968.
[25] Yanan Sui, Alkis Gotovos, Joel W. Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pages 997–1005, Lille, France, 2015.
[26] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems 29 (NIPS-16), pages 4312–4320, Barcelona, Spain, 2016.
[27] Matteo Turchetta, Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. arXiv:1705.08551, 2017.
[28] Eric Walter and H´el`ene Piet-Lahanier. Estimation of parameter bounds from bounded-error data: a survey. Mathematics and Computers in Simulation, 32:449–468, 1990.
[29] Hans S. Witsenhausen. Sets of possible states of linear systems given perturbed observations. IEEE Transactions on Automatic Control, 13:556– 558, 1968.