By exploring and interacting with an environment, reinforcement learning can determine the optimal policy with respect to the long-term rewards given to an agent [1], [2]. Whereas the idea of determining the optimal policy in terms of a cost over some time horizon is standard in the controls literature [3], reinforcement learning is aimed at learning the long-term rewards by exploring the states and actions. As such, the agent dynamics is no longer explicitly taken into account, but rather is subsumed by the data.
If no information about the agent dynamics is available, however, an agent might end up in certain regions of the state space that must be avoided while exploring. Avoiding such
This work was sponsored in part by the U.S. National Science Foundation under Grant No. 1531195. The work of M. Ohnishi was supported in part by the Scandinavia-Japan Sasakawa Foundation under Grant GA17-JPN-0002 and the Travel Grant of the School of Electrical Engineering, Royal Institute of Technology.
M. Ohnishi is with the School of Electrical Engineering, Royal Institute of Technology, 11428 Stockholm, Sweden, the Georgia Robotics and Intelligent Systems Laboratory, Georgia Institute of Technology, Atlanta, GA 30332 USA, and also with the RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan (e-mail: motoya@kth.se).
L. Wang and M. Egerstedt are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: liwang@gatech.edu; magnus@gatech.edu).
G. Notomista is with the School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30313 USA (e-mail: g.notomista@gatech.edu).
regions of the state space is referred to as safety. Safety includes collision avoidance, boundary-transgression avoidance, connectivity maintenance in teams of mobile robots, and other mandatory constraints, and this tension between exploration and safety becomes particularly pronounced in robotics, where safety is crucial.
In this paper, we address this safety issue, by employing model learning in combination with barrier certificates. In particular, we focus on learning for systems with discrete-time nonstationary (or time-varying) agent dynamics. Nonstationarity comes, for example, from failures of actuators, battery degradations, or sudden environmental disturbances. The result is a method that adapts to nonstationary agent dynamics and, under certain conditions, ensures recovery of safety in the sense of Lyapunov stability even after violations of safety due to the nonstationarity occur. We also propose discrete-time barrier certificates that guarantee global optimality of solutions to the barrier-certified policy optimization, and we use the learned model for barrier certificates.
Over the last decade, the safety issue has been addressed under the name of safe learning, and plenty of solutions have been proposed [4]–[13]. To ensure safety while exploring, an initial knowledge of the agent dynamics, initial safe policy or a teacher advising the agent is necessary [4], [14]. To obtain a model of the agent dynamics, human operators may maneuver the agent and record its trajectories [12], [15]. It is also possible that an agent continues exploring without entering the states with low long-term risks (e.g., [11], [16]). Due to the inherent uncertainty, the worst case scenario (e.g., possible lowest rewards) is typically taken into account [13], [17] and the set of safe policies can be expanded by exploring the states [4], [5]. To address the issue of this uncertainty for nonlinear-model estimation tasks, Gaussian process regression [18] is a strong tool, and many safe learning studies have taken advantage of its property (e.g., [4], [6], [7], [10], [13]).
Nevertheless, when the agent dynamics is nonstationary and the long-term rewards vary accordingly, the assumptions often made in the safe learnings literature no longer hold, and violations of safety become inevitable. In such cases, we wish to ensure that the agent is at least successfully brought back to the set of safe states and the negative effect of an unexpected violation of safety is mitigated. Moreover, the long-term rewards must also be learned in an adaptive manner. These are the core motivations of this paper.
To constrain the states within a desired safe region while exploring, we employ control barrier functions (cf. [19]–[24]). When the exact model of the agent dynamics is available, control barrier certificates ensure that an agent remains in the set of safe states for all time by constraining the instantaneous control input at each time. Also, an agent outside of the set of safe states is forced back to safety (Proposition III.1). A useful property of control barrier certificates is that they modify polices only when violations of safety are truly imminent [22].
If no nominal model (or simulation) of the possibly nonstationary agent dynamics is available, on the other hand, violations of safety are inevitable. Therefore, we wish to adaptively learn the agent dynamics, and eventually bring the agent back to safety. To this end, we propose a learning framework for a possible nonstationary agent dynamics, which recovers safety in the sense of Lyapunov stability under some conditions. This learning framework ties adaptive algorithms with control barrier certificates by focusing on set-theoretical aspects and monotonicity (or non-expansivity). By augmenting the state with the estimate of agent dynamics, Lyapunov stability with respect to the set of augmented safe states is guaranteed (Theorem IV.1). Also, to efficiently enforce control barrier certificates, we employ adaptive sparse optimization techniques to extract dynamic structures (e.g., control-affine dynamics) by identifying truly active structural components (see Section III-C and IV-B).
In addition, the long-term rewards need to be adaptively estimated when the agent dynamics is nonstationary. To this end, we reformulate the action-value function approximation problem so that, even if the action-value function varies, it can be adaptively estimated in the same functional space by employing an adaptive supervised learning algorithm in the space. Consequently, resetting the learning whenever the agent dynamics varies becomes unnecessary. Moreover, we present a barrier-certified policy update strategy by employing control barrier functions to effectively constrain policies. Because the global optimality of solutions to the constrained policy optimization is necessary to ensure the greedy improvement of a policy, we propose a discrete-time control barrier certificate that ensures the global optimality under some mild conditions (see Section IV-C and Theorem IV.4 therein). This is an improvement of the previously proposed discrete-time control barrier certificate [24].
To validate and clarify our learning framework, we first conduct experiments of quadrotor simulations. Then, we conduct real-robotics experiments on a brushbot, whose dynamics is unknown, highly complex and nonstationary, to test the efficacy of our framework in the real world (see Section V). This is challenging due to many uncertainties and lack of simulators often used in applications of reinforcement learning in robotics (see [25] for example).
In this section, we present some of the related work and the system model considered in this paper. Throughout, and
are the sets of real numbers, nonnegative integers and positive integers, respectively. Let
be the norm induced by the inner product
in an inner-product space H . In particular, define
y for L-dimensional real vectors x
, and
, where
stands for transposition. We define [x;y] as
, and let xn
and un
, for nx,nu
, denote the state and the control input at time instant n
, respectively.
A. Related Work
The primary focus of this paper is the safety issue while exploring. Typically, some initial knowledges, such as an initial safe policy and a model of the agent dynamics, are required to address the safety issue while exploring; therefore, model learning is often employed together. We introduce some related work on model learning and kernel-based action-value function approximation.
1) Model Learning for Safe Maneuver: The recent work in
[13], [7], and [4] assumes an initial conservative set of safe policies, which is gradually expanded as more data become available. These approaches are designed for stationary agent dynamics, and Gaussian processes (GPs) are employed to obtain the confidence interval of the model. To ensure safety, control barrier functions and control Lyapunov functions are employed in [13] and [4], respectively. On the other hand, the work in [10] uses a trajectory optimization based on the receding horizon control and model learning by GPs, which is computationally expensive when the model is highly nonlinear.
In this paper, we aim at tying adaptive model learning algorithms and control barrier certificates by focusing on set-theoretical aspects and monotonicity (or non-expansivity). Hence, we employ an adaptive filter with monotone approximation property, which shares similar ideas with stable online learning for adaptive control based on Lyapunov stability (c.f. [26]–[29], for example).
2) Learning Dynamic Structures in Reproducing Kernel
Hilbert Spaces: An approach that learns dynamics in reproducing kernel Hilbert spaces (RKHSs) so that the resulting model satisfies the Euler-Lagrange equation was proposed in [30], while our paper proposes a learning framework that adaptively captures control-affine structure in RKHSs to efficiently enforce control barrier certificates.
3) Reinforcement Learning in Reproducing Kernel Hilbert
Spaces: We introduce, briefly, ideas of existing action-value function approximation techniques. Given a policy U , the action-value function Q
associated with the policy
is defined as
where is the discount factor, (xn
is a trajectory of the agent starting from x0 = x, and R
is the immediate reward. It is known that the action-value function follows the Bellman equation (c.f. [2, Equation (66)]):
For robotics applications, where the states and controls are continuous, some form of function approximators is required to approximate the action-value function (and/or policies). Nonparametric learning such as a kernel method is often desirable when a priori knowledge about a suitable set of basis functions for learning is unavailable. Kernel-based reinforcement learning has been studied in the literature, e.g., [31], [31]–[44]. Due to the property of reproducing kernels, the framework of linear learning algorithms is directly applied to nonlinear function estimation tasks in a possibly infinite-dimensional functional space, namely a reproducing kernel Hilbert space.
Definition II.1 ( [45, page 343]). Given a nonempty set Z and H which is a Hilbert space defined in Z , the function of z is called a reproducing kernel of H if
1) for every w as a function of z
belongs to H , and
2) it has the reproducing property, i.e., the following holds for every w and every
:
If H has a reproducing kernel, H is called a Reproducing Kernel Hilbert Space (RKHS).
One of the examples of kernels is the Gaussian kernel
It is well-known that the Gaussian reproducing kernel Hilbert space has universality [46], i.e., any continuous function on every compact subset of can be approximated with an arbitrary accuracy. Another widely used kernel is the polynomial kernel
.
In contrast to these existing approaches, we explicitly define a so-called reproducing kernel Hilbert space (RKHS) so that adaptive supervised learning of action-value functions can be conducted in the same space without having to reset the learning. Consequently, we can also conduct an action-value function approximation in the same RKHS even after the agent dynamics changes or policies are updated (See the remark below Theorem IV.3 and Section V-A.2). The GP SARSA can also be reproduced by employing a GP in the explicitly defined RKHS as is discussed in Appendix I. Specifically, in this paper, a possibly nonstationary agent dynamics is considered as detailed below.
B. System Model
In this paper, we consider the following discrete-time deterministic nonlinear model of the nonstationary agent dynamics,
where p : , f :
, g :
are continuous. Hereafter, we regard
as the same as
under the one-to-one correspondence between z := [x;u
and
if there is no confusion.
We consider an agent with dynamics given in (II.3), and the goal is to find an optimal policy which drives the agent to a desirable state while remaining in the set of safe states (or the safe set) defined as
where B : . An optimal policy is a policy
that attains an optimal value Q
for every state x
. Note that the value associated with a policy varies when the dynamics is nonstationary, and that a quadruple (xn,un,xn
xn,un)) is available at each time instant n.
With these preliminaries in place, we can present our safe learning framework.
Under possibly nonstationary dynamics, our safe learning framework adaptively estimates the long-term rewards to update policies with safety constraints. Also, recovery of safety in the sense of Lyapunov stability during exploration is guaran- teed under certain conditions. Define as
p(x,u)+ f(x)+g(x)u, and suppose that the estimator of
at time instant n, denoted by ˆ
, is approximated by the model parameter hn
in the linear form as
Here, k(znis the output of basis functions at zn. If the model parameter is accurately estimated (or the exact agent dynamics is available), the safe set C becomes forward invariant and asymptotically stable by enforcing control barrier certificates at each time instant n.
A. Discrete-time Control Barrier Functions
The idea of control barrier functions is similar to Lyapunov functions; they require no explicit computations of the forward reachable set while ensuring certain properties by constraining the instantaneous control input. Particularly, control barrier functions guarantee that an agent starting from the safe set remains safe (i.e., forward invariance), and that an agent outside of the safe set is forced back to safety (i.e., Lyapunov stability with respect to the safe set). To make barrier certificates compatible with model learning and reinforcement learning, we employ the discrete-time control barrier certificates.
Definition III.1 ( [24, Definition 4]). A map B : is a discrete-time exponential control barrier function if there exists a control input un
such that
Note that we intentionally removed the condition B(x00 originally presented in [24, Definition 4]. Then, the forward invariance and asymptotic stability with respect to the safe set are ensured by the following proposition.
Proposition III.1. The set C defined in (II.4) for a valid discrete-time exponential control barrier function B : is forward invariant when B(x0
0, and is asymptotically stable when B(x0) < 0.
Proof. See Appendix A.
Proposition III.1 implies that an agent remains in the safe set defined in (II.4) for all time if B(x00 and the inequality (III.1) are satisfied, and the agent outside of the safe set is brought back to safety.
The main motivations of using control barrier functions are given below:
Fig. III.1. An illustration of the monotone approximation property. The estimate hn monotonically approaches to the set of optimal vectors h
by sequentially minimizing the distance between hn and
argminh
is the cost function at time instant n.
a). Little modifications of policies: control barrier functions modify polices only when violations of safety are imminent. Consequently, an inaccurate or rough estimation of the model causes less negative effect on (model-free) reinforcement learning.
b). Asymptotic stability of the safe set: the agent outside of the safe set is brought back to the safe set. In addition to Proposition III.1, this robustness property is analyzed in [19]. This property together with the adaptive model learning algorithm presented in the next subsection is particularly important when the safety is violated due to the nonstationarity of the agent dynamics. Under a possibly nonstationary agent dynamics, we can no longer guarantee that the current estimate of the model parameter is sufficiently accurate to enforce the inequality (III.1) or forward invariance of C . Nevertheless, we are still able to show that safety is recovered in the sense of Lyapunov stability under certain conditions by adaptively learning the model.
B. Adaptive Model Learning Algorithms with Monotone Approximation Property
At each time instant, an input-output pair (zn, where zn := [xn;un] and
xn
xn for model learning is available. Under possibly nonstationary agent dynamics, it is vital for the model parameter estimation to be stable even after the agent dynamics changes. In this paper, we employ an adaptive algorithm with monotone approximation property. Note this approach shares a similar idea with stable online learning based on Lyapunov-like conditions.
Suppose that the estimate of model parameter at time instant n is given by hn . Given a cost function
at time instant n, we update the parameter hn so as to satisfy the strictly monotone approximation property
hn
hn
:= argminh
if hn
/0, where /0 is the empty set. Then, if
is nonempty and if hn
, it follows that
hn
hn
. This is illustrated in Figure III.1. Under mild conditions, we can also design algorithms (e.g., the adaptive projected subgradient method [47]) that satisfy
hn
hn
dist2(hn
, for
Fig. III.2. An illustration of Lyapunov stability of the system for the augmented state with respect to the forward invariant set
all h, and for some
0, where dist(hn
inf
hn
. (See [47] for more detailed argu- ments for example.)
At each time instant, we use the current estimate of the model to constrain control inputs so that they satisfy
for some margin 0, where ˆxn
is the predicted output of the current estimate hn at xn and un. Then, under certain conditions, we can guarantee Lyapunov stability of the system for the augmented state [xn;hn
with respect to the forward invariant set
as illustrated in Figure III.2. In Sections IV-A and V, we will theoretically and experimentally show that the system for the augmented state is stable on the set of augmented safe states.
To efficiently constrain policies by using control barrier functions, the learned model is preferred to be affine in control. (see Section IV-C and Theorem IV.4 therein.) As such, outputs of the learned model should have preferred dynamic structures while capturing the true agent dynamics.
C. Leaning Dynamic Structure via Sparse Optimizations
Control-affine dynamics is given by (II.3) with p = 0, where 0 denotes the null function. Therefore, the simplest way is to learn the agent dynamics with the constraint p = 0. In practice, however, it is unrealistic to assume that p = 0 due to the effects of frictions and other disturbances. Instead, as long as the term p is negligibly small, we can consider p to be a system noise added to a control-affine dynamics. To encourage the term p to be as small as possible while capturing the true input-output relations of the agent dynamics, we use adaptive sparse optimization techniques. In particular, motivated by the monotone approximation property due to convexity of the formulations, we use (sparse) kernel adaptive filters for the systems with nonlinear dynamics. Specifically, we take the following steps to extract the control-affine structure:
1) Assume for simplicity that nx = 1. We suppose that p , f
, and g
, where
and
are RKHSs, and g(x) =
.
2) Let be the RKHS associated with the reproducing kernel
,and
the set of constant functions on U . Estimate the function
in the RKHS
(see Section IV-B and Theorem IV.2 therein).
3) Define the cost so as to promote sparsity of the model parameter. If the underlying true dynamics is affine in control, a control-affine model (i.e., the estimate of p denoted by ˆp becomes null) is expected to be extracted. The resulting control-affine part of the estimated dynamics is used in combination with control barrier certificates in order to efficiently constrain policies while and after learning an optimal policy. (see Theorem IV.1 and Theorem IV.4 for more details.)
D. Barrier-certified Policy Update
Lastly, we present the barrier-certified policy update strategy. To update policies, we use the long-term rewards that needs to be adaptively estimated for systems with possibly nonstationary agent dynamics.
1) Adaptive Action-value Function Approximation in
RKHSs: Again, motivated by the monotone approximation property (see Corollary IV.1) and the flexibility of nonparametric learning that requires no fixed set of basis functions, we employ kernel-based adaptive algorithms to estimate the action-value function. One of the issues arising when applying a kernel-based method to an action-value function approximation is that the output of the action-value function Qxn,un
associated with a policy
, where
is assumed to be an RKHS, is unobservable. Nevertheless, we know that the action-value function follows the Bellman equation (II.2). Hence, by defining a function
:
, where
, as
the Bellman equation in (II.2) is solved via iterative nonlinear function estimation with the input-output pairs {([xn;un;xnxn
xn,un
. In fact, the function
is an element of a properly constructed RKHS
(see Section IV-C and Theorem IV.3 therein). Because the domain of
is defined as
instead of Z , the RKHS
does not depend on the agent dynamics. Therefore, we do not have to reset learning even after the dynamics changes or the policy is updated, and we can analyze convergence and/or monotone approximation property of an action-value function approximation in the same RKHS (see Section V- A.2, for example).
2) Policy Update: For a current policy , assumethat the action-value function Q
with respect to
at time instant n is available. Given a discrete-time exponential control barrier function B and 0
1, the barrier certified safe control space is define as
From Proposition III.1, the set C defined in (II.4) is forward invariant and asymptotically stable if un xn) for all n
. Then, the updated policy
given by
is well-known (e.g., [48], [49]) to satisfy that QQ
, where Q
is the action-value function with respect to
. In practice, we use the estimate of Q
because the exact function Q
is unavailable. For example, the action-value function is estimated over Nf
iterations, and the policy is updated every Nf iterations.
In the previous section, we presented our barrier-certified adaptive reinforcement learning framework. In this section, we present theoretical analysis of our framework to further strengthen the arguments.
A. Safety Recovery: Adaptive Model Learning and Control Barrier Certificates
The monotone approximation property of model parameters is closely related to Lyapunov stability. In fact, by augmenting the state vector with the model parameter, we can construct a Lyapunov function which guarantees stability with respect to the safe set under certain conditions.
We first make following assumptions.
Assumption IV.1. 1) Finite-dimensional model parameter: the dimension of model parameter h remains finite, and is r . 2) Boundedness of the basis functions: all of the basis functions (or kernel functions) are bounded over X .
3) Lipschitz continuity of the control barrier function: the control barrier function B is Lipschitz continuous over X with Lipschitz constant .
4) Validity of barrier certificates: there exists a control input un satisfying for a sufficiently small
0 that
where ˆxnis the predicted output of the current estimate hn at xn and un.
5) Appropriate cost functions: if hn argminh
, where
is the continuous cost function at time instant n, then
xn
ˆxn
.
6) Model learning with monotone approximation property: model parameter hn is updated as hnTn(hn), where Tn :
is continuous and has monotone approximation property: if hn
, then dist2(hn
and
hn
hn
dist2(hn
, for all h
, and for some
0. If hn
, then hn
hn.
7) Data consistency: The set is nonempty.
Remark IV.1 (On Assumption IV.1.1). Assumption IV.1.1 is made so that Lyapunov stability can be analyzed in an Euclidean space and is reasonable if polynomial kernels are employed for learning or if the input space is compact.
Remark IV.2 (On Assumptions IV.1.2 and IV.1.3). Assumptions IV.1.2 and IV.1.3 ensure that the predicted value of the barrier function is close to its true value if the current estimate of model parameter is close to the true parameter.
Remark IV.3 (On Assumption IV.1.4). Assumption IV.1.4 implies that we can enforce barrier certificates for the current estimate of the dynamics with a sufficiently small margin . This assumption is necessary to implicitly bound the growth of B(xn
and to robustly enforce barrier certificates whenever hn
. Although this assumption is somewhat restrictive, it is still reasonable if the initial estimate does not largely deviate from the true dynamics.
Remark IV.4 (On Assumption IV.1.5). Assumption IV.1.5 implies that the set or equivalently the cost
is designed so that the predicted output ˆxn
for hn
is sufficiently close to the true output xn
. Such a cost can be easily designed. This assumption is necessary to render the set
forward invariant.
Remark IV.5 (On Assumptions IV.1.6 and IV.1.7). To apply theories of Lyapunov stability, Assumption IV.1.6 is needed to make sure that the dynamical system for the augmented state is continuous. Moreover, the cost (or the set ) is designed so that hn
or dist2(hn
. See the work in [47] for a class of algorithms that satisfy this property, for example. Unless there exist some adversarial data (or inappropriate costs) that do not reflect the true agent dynamics, Assumption IV.1.7 is valid and ensures that the set of augmented safe states is nonempty.
Let the augmented state be [x;h. Then, the following theorem states that the system for the augmented state is (asymptotically) stable with respect to the set of augmented safe states even after a violation of safety due to the abrupt and unexpected change of the agent dynamics occurs.
Theorem IV.1. Suppose that a triple (xn,un,xnis available at time instant n + 1. Suppose also that a control input un satisfying (IV.1) is employed for all n
. Then, under Assumption IV.1, the system for the augmented state is stable with respect to the set of augmented safe states
. If, in addition, hn
for all n
such that [xn;hn
, then the system is uniformly globally asymptotically stable with respect to
.
Proof. See Appendix B.
Remark IV.6 (On Theorem IV.1). Theorem IV.1 implies that how much the current estimate gets closer to the true dynamics depends on how much the next state of the agent is deviated from the predicted next state. Therefore, both barrier certifi-cates and model learning work together to guarantee stability. If the model learning algorithm satisfies Assumption IV.1, then Theorem IV.1 claims that safety is recovered successfully. When GPs or kernel ridge regressions are employed for model learning, for example, introducing forgetting factors or letting the sample size grow as time advances will make the algorithms adaptive to time-varying systems; in such cases, we need to make sure that the algorithms satisfy Assumption IV.1 to guarantee safety recovery. Numerical simulations about safety recovery is given in Section V-A.
If the agent dynamics keeps changing or if we know that
there are multiple modes for dynamics, then we may have separate model learning processes as proposed in [50], and the augmented state can be regarded as following a hybrid system. Hence, stability should be analyzed under additional assumptions in this case. We leave such an analysis as a future work.
B. Structured Model Learning
We have seen that, by employing a model learning with monotone approximation property under Assumption IV.1, the agent is stabilized on the set of augmented safe states even after an abrupt and unexpected change of the agent dynamics. Here, we show that a control-affine dynamics can be learned via sparse optimizations satisfying monotone approximation property in a properly defined RKHS. We assume that nx = 1 for simplicity (we can employ nx approximators if nx > 1).
First, we show that the space (see Section III-C) is an RKHS.
Lemma IV.1. The space is an RKHS associated with the reproducing kernel
, with the inner product defined as
.
Proof. See Appendix C.
Then, the following lemma implies that can be approximated in the sum space of RKHSs denoted by
.
Lemma IV.2 ( [51, Theorem 13]). Let and
be two RKHSs associated with the reproducing kernels
and
. Then the completion of the tensor product of
and
, denoted by
, is an RKHS associated with the reproducing kernel
.
From Lemmas IV.1 and IV.2, we can now assume that ˆf and ˆ˜g
, where ˆ˜g is an estimate of ˜g(x,u) := g(x)u. As such,
can be approximated in the RKHS
. Therefore, we can employ a kernel adaptive filter working in the sum space
.
Second, the following theorem ensures that can be uniquely decomposed into p, f, and ˜g in the RKHS
.
Theorem IV.2. Assume that X and U have nonempty interiors. Assume also that is a Gaussian RKHS. Then,
is the direct sum of
, and
, i.e., the intersection of any two of the RKHSs
, and
is {0}.
Proof. See Appendix D.
Remark IV.7 (On Theorem IV.2). Because only the control-affine part of the learned model is used in combination with barrier certificates (see Assumption IV.2 and Theorem IV.4) and the term p is assumed to be a system noise added to the control-affine dynamics, the unique decomposition is crucial; if the unique decomposition does not hold, the term p may be able to estimate the overall dynamics, including the control-affine terms.
By using a sparse optimization for the coefficient vector hn , we wish to extract a structure of the model; from Theorem IV.2, the term ˆpn is expected to drop off when the true agent dynamics is affine in control.
In order to use the learned model in combination with control barrier functions, each entry of the vector ˆgn(xn) is required. Assume, without loss of generality, that {ei(this is always possible for
/0 by transforming coordinates of the control inputs and reducing the dimension nu if necessary). Then, the ith entry of the vector ˆgn(xn) is given by ˆgn(xn)ei = ˆ˜gn(xn,ei). As such, we can use the learned model to constrain control inputs efficiently by using control barrier functions for explorations as well as policy updates. We analyze an adaptive action-value function approximation with barrier-certified policy updates in the next subsection.
C. Adaptive Action-value Function Approximation with Barrier-certified Policy Updates
In this subsection, we analyze the proposed adaptive action-value function approximation with barrier-certified policy updates.
We showed in Section III-D.1 that the Bellman equation in (II.2) is solved via iterative nonlinear function estimation with the input-output pairs {([xn;un;xnxn
xn,un
. The following theorem states that the function
defined in (III.2) can be estimated in a properly constructed RKHS.
Theorem IV.3. Suppose that is an RKHS associated with the reproducing kernel
. Define, for
(0,1),
Then, the operator U : defined by U
z;w
, is bijective. Moreover,
is an RKHS with the inner product defined by
(IV.2)
The reproducing kernel of the RKHS is given by
Proof. See Appendix E.
From Theorem IV.3, we can use any kernel-based method by assuming that the action-value function is in . The estimate of Q
denoted by ˆQ
is obtained by U
, where ˆ
is the estimate of
. For instance, suppose that the estimate of
for an input [z;w] at time instant n is given by
where hQn is the model parameter, and k([z;w]) :=
z;w],[˜z1; ˜w1
z;w],[˜z2; ˜w2
z;w],[˜zr; ˜wr
for {˜zj
˜w j
and for
de-fined by (IV.3). Then, the estimate of Q
for an input z at time instant n is given by
Remark IV.8 (On Theorem IV.3). As discussed in Appendix I, the GP SARSA is reproduced by applying a GP in the space , although the GP SARSA or other kernel-based action-value function approximation is ad-hoc and designed for estimating the action-value function associated with a fixed policy under a stationary agent dynamics.
When the parameter hQn for the estimator ˆis monoton-ically approaching to an optimal point hQ
in the Euclidean norm sense, so is the model parameter for the action-value function because the same parameter is used to estimate
and Q
. Suppose we employ a method which monotonically brings ˆ
closer to an optimal function
in the Hilber- tian norm sense. Then, the following corollary implies that an estimator of the action-value function also satisfies the monotonicity.
Corollary IV.1. Let z;w]) := ˆQ
ˆQ
and
z;w
, where ˆQ
. Then, if ˆ
is approaching to
, i.e.,
, it follows that
ˆQ
ˆQ
Proof. See Appendix F.
Note that the use of action-value functions enables us to use random control inputs instead of the target policy for exploration, and we require no models of the agent dynamics for policy updates as discussed below.
To obtain analytical solutions for (III.3), we follow the arguments in [37]. Suppose that ˆQis given by (IV.4). We define the reproducing kernel
of
as the tensor kernel given by
where is, for example, defined by
Then, (III.3) becomes
where the target value being maximized is linear to u at x. Therefore, if the set is convex, an optimal solution to (IV.6) is guaranteed to be globally optimal, ensuring the greedy improvement of the policy.
As pointed out in [24], is not a convex set in general. Instead, we consider a convex subset of S (x) under the following moderate assumptions:
Assumption IV.2. 1) The set U is convex.
2) Existence of Lipschitz continuous gradient of the barrier function: Given
xn +t( ˆfn(xn)+ ˆgn(xn
there exists a constant 0 such that the gradient of the discrete-time exponential control barrier function B, denoted by
, satisfies
Then, the following theorem holds.
Theorem IV.4. Under Assumptions IV.1.3 and IV.2, assume also that xn
ˆfn(xn)+ ˆgn(xn)un +xn
. Then, inequality (III.1) is satisfied at time instant n
if un satisfies the following:
Moreover, (IV.7) defines a convex constraint for un.
Proof. See Appendix G.
Remark IV.9. When ˆgn(xn
0 and U admits suffi-ciently large value of each entry of un, there always exists a un that satisfies (IV.7).
Theorem IV.4 essentially implies that, even when the gradient of B along the shift of xn decreases steeply, inequality (III.1) holds if (IV.7) is satisfied. From Theorem IV.4, the set ˆxn), defined as
is convex under Assumption IV.2.
As witnessed in the literatures (e.g., [22]), an agent might encounter deadlock situations, where the constrained control keeps the agent remain in the same state, when control barrier certificates are employed. It is even possible that there is no safe control driving the agent from those states. However, an elaborative design of control barrier functions remedies this issue, as shown in the following example.
Example IV.1. If the agent is nonholonomic, turning inward safe regions when approaching their boundary might be infeasible. To reduce the risk of such deadlock situations, control barrier functions may be designed as
B(x) = ˜Batan2
˜B
˜B
x
where the state x = [x;y;consists of the X position x, the Y position y, and the orientation
of an agent from the world frame,
˜B
is the original safe region, and
is a strictly increasing function. If this control barrier function exists, then the agent is forced to turn inward the original
Fig. IV.1. An illustration of how a nonholonomic agent avoids deadlocks. When the orientation of the agent is not considered (i.e., ˜B(x) is the barrier function), there might be no safe control driving the agent from those states as the left figure shows. By taking into account the orientation (i.e., B(x) is the barrier function), the agent turns inward the safe region before reaching its boundaries as the right figure shows.
safe region before reaching its boundaries because the control barrier function also depends on and takes larger value when the agent is facing inward the safe region. An illustration of this example is given in Figure IV.1.
Resulting barrier-certified adaptive reinforcement learning framework is summarized in Algorithm 1.
For the sake of reproducibility and for clarifying each contribution, we first validate the proposed learning framework on simulations of vertical movements of a quadrotor, which has been used in the safe learnings literature under stationarity assumption (e.g., [7]). Then, we test the proposed learning framework on a real robot called brushbot, whose dynamics is unknown, highly complex and nonstationary1. The experiments on the brushbot was conducted at the Robotarium, a remotely accessible robot testbed at Georgia institute of technology [52].
A. Validations of the Safe Learning Framework via Simula- tions of a Quadrotor
In this experiment, we empirically validate Theorem IV.1 (i.e., Lyapunov stability of the set of augmented safe states after an unexpected and abrupt change of the agent dynamics) and the motivations of using an online kernel method working in the RKHS (see Section IV-C) for action-value function approximation. We also test the proposed framework for simulated vertical movements of a quadrotor. We use parametric model for the agent dynamics and nonparametric model for the action-value function in this experiment. The discrete-time dynamics of the vertical movement of a quadrotor is given by
where denotes the time interval, xn and ˙xn are the vertical position and the vertical velocity of the quadrotor at time instant n, respectively. When the weight of the quadrotor is 0.027kg, the nominal model is given by h1 = 1, h2 = 9.81, and h3 = 1/0.027. Let the time interval
t be 0.02 seconds for the simulations, and the maximum input 2
027
81.
Control barrier certificates are used to limit the region of exploration to the area: x , and we employ the following two barrier functions:
and we use the barrier-certificate parameter 01 (see (III.1)) in this experiment. Note that the safe set is equivalently expressed by
and the barrier functions satisfy Assumption IV.2.2 with the Lipschitz constant 0. The immediate reward is given by
where the constant is added to prevent the resulting value of explored states from becoming negative, i.e., lower than the value outside of the safe set.
Fig. V.1. Trajectories of the vector of the GP-based learning and the adaptive model learning algorithm with barrier certificates from n = 1000 to n = 10000. The trajectory of the adaptive model learning algorithm converges to the forward invariant set
. GP-based learning seems slowly approaching to the safe set while safety recovery is not theoretically supported in the current settings.
1) Stability of the Safe Set: In terms of safety recovery,
we compare a GP-based approach, which tends to be less adaptive to time-varying systems, and a set-theoretical adaptive model learning algorithm with monotone approximation property. Random explorations by uniformly random control inputs are conducted for the first 20 seconds corresponding to 1000 iterations under the dynamics h1;9.81;1/0.027]. Then, we change the simulated dynamics and observe if the quadrotor is stabilized on the set of augmented safe states. To clearly visualize the difference between the GP-based approach and the adaptive model learning algorithm, we let the new agent dynamics be h
1;9.81;5/0.027], which is an extreme situation where the maximum input generates very large acceleration.
We define the update rule of model learning as
which satisfies the monotone approximation property2, where is the step size. In this experiment, we used
6. For the GP-based learning, on the other hand, we let the noise variance of the output be 0.01, and let the prior covariance of the parameter vector h be 25I.
The trajectories of the vector [x;h2;h3] of the GP-based learning and the adaptive model learning algorithm from n = 1000 to n = 10000 are plotted in Figure V.1. We can observe that the trajectory of the adaptive model learning algorithm converges to the forward invariant set . GPbased learning seems to be slowly approaching to the safe set while safety recovery is not theoretically supported in the current settings.
TABLE V.1 SUMMARY OF THE PARAMETER SETTINGS OF THE SIMULATED VERTICAL MOVEMENTS OF A QUADROTOR (KERNEL ADAPTIVE FILTER)
2) Adaptive Action-value Function Approximation: We also
validate our action-value function approximation framework by employing a GP (i.e., the GP SARSA) and a kernel adaptive filter in the same RKHS. The parameter settings for the kernel adaptive filter are summarized in Table V.1. Please refer to Appendix H for the notations that are not in the main text. Six Gaussian kernels with different scale parameters are employed for the kernel adaptive filter (i.e., M = 6. See also Appendix H for more detail about multikernel adaptive filter). For the GP SARSA, we employ a Gaussian kernel with scale parameter 3, which achieved sufficiently good performance, and let the noise variance of the output be 10
(i.e.,
10
I. See Appendix I.). Other parameters are the same as those of the kernel adaptive filter. In addition, we also test the GP SARSA in another settings, where the kernel function is added in the first 600 iterations (i.e., dimension of the parameter becomes r = 600) and is not newly added after 600 iterations. We call this as the GP SARSA 2 for convenience in this section. We employ an adaptive model learning algorithm for all of the three reinforcement learning approaches, and update policies every 1000 iterations. For the comparison purpose, we do not reset learning even when the policy is updated3. Random explorations by uniformly random control inputs are conducted for the first 200 seconds corresponding to 10000 iterations under the dynamics h
1;9.81;1/0.027], and the dynamics changes to h
1;11.81;0.9/0.027] (i.e., additional downward accelerations and degradations of batteries, for example) at time instant n = 2500. We evaluate the policy obtained at time instant n = 10000 for five times with different initial states, and we also conduct 15 runs for learning. For each policy evaluation, the initial position x follows the uniform distribution while the velocity ˙x = 0.
The learning curves of the normalized mean squared errors (NMSEs) of action-value function approximation, which are averaged over 15 runs and smoothed, are plotted in Figure V.2 for the GP SARSA, the kernel adaptive filter and the GP SARSA 2. From Figure V.2, we can observe that both the GP SARSA and the kernel adaptive filter show no large degradations of the NMSE even after the dynamics changes or the policy is updated, while the GP SARSA 2 stops improving the NMSE after the policy is updated (and the dynamics is changed). Because no kernel function is newly added after
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 Iteration
Fig. V.2. The learning curves of the normalized mean squared errors (NMSEs) of action-value function approximation for the GP SARSA, kernel adaptive filter, and the GP SARSA 2.
TABLE V.2 THE EXPECTED VALUES OF THE GP SARSA AND THE KERNEL ADAPTIVE
the first 600 iterations, the GP SARSA 2 could not adapt to the new policy or new dynamics.
The expected values EV
for the GP SARSA, the kernel adaptive filter and the GP SARSA 2 associated with the policies obtained at time instant n = 10000 are shown in Table V.2. (expectation is taken over the 15
5 runs, i.e., 15 runs for learning, each of which includes five policy evaluations). Recall that V
is defined in (II.1).
Among the 15 runs for the kernel adaptive filter, we extracted the seventh run, which was successful. The left figure of Figure V.3 illustrates the action-value function at time instant n = 10000 of the seventh run for the kernel adaptive filter, and the right figure of Figure V.3 plots the trajectory of the optimal policy obtained at time instant n = 10000 for the seventh run. The simulated quadrotor was relocated at time instant n = 11000,12000,13000, and n = 14000, and both the position and the velocity of the simulated quadrotor went to zeros successfully.
3) Discussion: The control barrier certificates with an adaptive model learning algorithm recovered safety even for an extreme situation where the control inputs start generating very large acceleration. As long as model learning algorithm satisfies Assumption IV.1, safety recovery is guaranteed.
Reinforcement learning with the GP SARSA and kernel adaptive filter in the RKHS worked sufficiently well. If no kernel functions are newly added, GP-based learnings cannot adapt to the new policies or agent dynamics. Therefore, we need to sequentially add new kernel functions or use a sparse adaptive filter to prune redundant kernel functions (see also Appendix H for a sparse adaptive filter). We mention that identifying the RKHS
enabled us to employ GPs for nonstationary agent dynamics without having to reset learnings. Consequently, we can effectively reuse the previous estimation of the target function if the new target function is close to the previous one.
Our safe learning framework validated by these simulations
-3
-2
-1
Fig. V.3. The left figure illustrates the action-value function over the position x and the velocity ˙x at n = 10000 and at the control input which cancels out the acceleration added to the quadrotor. Positive velocities for negative positions and negative velocities for positive positions have higher values. The right figure shows the trajectory of the optimal policy obtained at time instant n = 10000 for the seventh run, which was a successful run among the 15 runs. Dashed lines indicate the time when the quadrotor was relocated. Both the position and the velocity of the simulated quadrotor went to zeros successfully.
Fig. V.4. A picture of the brushbot used in the experiment. Vibrations of the two motors propagate to the two brushes, driving the brushbot. Control inputs are of two dimensions each of which corresponds to the rotational speed of a motor.
is now ready to be applied to a real robot called brushbot as presented below.
B. Real-Robotics Experiments on the Brushbot
Next, we apply our safe learning framework, which was validated by simulations, to the brushbot, which has highly nonlinear, nonholonomic and nonstationary dynamics (see Figure V.4). The objective of this experiment is to find a policy driving the brushbot to the origin, while restricting the region of exploration. The experiment is conducted at the Robotarium, a remotely accessible robot testbed at Georgia institute of technology [52].
1) Experimental Condition: The experimental conditions for model learning, reinforcement learning, control barrier functions and their parameter settings are presented below.
a) Model learning: The state x = [x;y;consists of the X position x, Y position y and the orientation
of the brushbot from the world frame. The exact positions and the orientation are recorded by motion capture systems every 0.3 seconds. A control input u is of two dimensions each of which corresponds to the rotational speed of a motor. To improve the learning efficiency and reduce the total learning time required, we identify the most significant dimension and reduce the dimensions to learn. The sole input variable of p, f and g for the shifts of x and y, is assumed to be
. The shift of
is assumed to be constant over the state, and hence depends on nothing but control inputs (see Section V-B.1.d). The brushbot used in the present study is nonholonomic, i.e., it can only go forward, and positive control inputs basically drive the brushbot in the same way as negative control inputs. As such, we use the rotational speeds of the motors as the control inputs. Moreover, to eliminate the effect of static frictions on the model, we assume that the zero control input given to the algorithm actually generates some minimum control inputs u
to the motors, i.e., the actual maximum control inputs to the motors are given by umax
, where umax is the maximum control input fed to the algorithm.
b) Reinforcement learning: The state for action-value function approximation consists of the distance x;y
from the origin and the orientation
atan2(y,x) which is wrapped to the interval
. The immediate reward is given by
where the constant is added to prevent the resulting value of explored states from becoming negative, namely, lower than the value outside of the region of exploration.
c) Discrete-time control barrier certificates: Control barrier certificates are used to limit the region of exploration to the rectangular area: x xmax,xmax
ymax,ymax], where xmax > 0 and ymax > 0. Because the brushbot can only go
forward, we employ the following four barrier functions:
(see Example IV.1 for the motivations of using the above control barrier functions). Note that those functions satisfy Assumption IV.2.2 and the Lipschitz constant is zero except at around
. (Although we can employ globally Lipschitz functions for more rigorous treatment, we use the above functions for simplicity.) d) Parameter settings: The parameter settings are summarized in Table V.3. Please refer to Appendix H for the notations that are not in the main text. Five Gaussian kernels with different scale parameters
are employed in action-value function approximation (i.e., M = 5. See also Appendix H for more detail about multikernel adaptive filter), and six Gaussian kernels are employed in model learning for x and y (i.e., M = 6). In model learning for
, we define
and
as sets of constant functions. The kernels of
and
are weighed by
1 in model learning (see Lemma H.1 in Appendix H). e) Procedure: The time interval (duration of one iteration) for learning is 0.3 seconds, and random explorations are conducted for the first 300 seconds corresponding to 1000 iterations. While exploring, the model learning algorithm adaptively learns a model whose control-affine terms, i.e., ˆfn(x)+ ˆgn(x)u, is used in combination with barrier certificates. Although barrier functions employed in the experiment reduce deadlock situations, the brushbot is forced to turn inward the region of exploration when a deadlock is detected. Note that the barrier certificates are intentionally violated in such a case. The policy is updated every 50 seconds. After 300 seconds, we stop learning a model and the action-value function, and the policy replaces random explorations. The brushbot is forced to stop when it enters into the circle of radius 0.2 centered at the origin. When the brushbot is driven close to the origin and enters this circle, it is pushed away from the origin to see if it returns to the origin again (see Figure V.10). 2) Results: Figure V.5 plots ˆpn([x;0;0]), ˆfn(x), ˆg
and ˆg
for x and y at n = 1000. Here ˆg
is the estimate of g
at time instant n. Recall that these functions only depend on
in this experiment to improve the learning efficiency. For the shift of
, the estimators are constant over the state, and the result is ˆg
38, ˆg
77 and ˆpn([x;0;0]) = ˆfn(x) = 0 at n = 1000. As can be seen in Figure V.5, ˆpn([x;0;0]) is almost zero and so is ˆfn(x), implying that the proposed algorithm successfully dropped off irrelevant structural components of a model. Figure V.6 plots the trajectory of the brushbot while exploring (i.e., X,Y positions from n = 0 to n = 1000). It is observed that the brushbot remained in the region of exploration (x
and y
) most of the time. Moreover, the values of barrier functions Bi
, for the whole trajectory are plotted in Figure V.7. Even though
Fig. V.5. Estimated output of the model estimator at u = [0;0] and n = 1000 over the orientation . Irrelevant structures such as ˆpn and ˆfn dropped off successfully.
some violations of safety are seen in the figure, the brushbot returned to the safe region before large violations occurred. Despite unknown, highly complex and nonstationary system, the proposed safe learning framework was shown to work efficiently.
Figure V.8 plots the trajectories of the optimal policy learned by the brushbot. Once the optimal policy replaced random explorations, the brushbot returned to the origin until n = 1016 as the first figure shows. The brushbot was pushed by a sweeper at time instant n = 1031,1075,1101,1128,1181 and n = 1230, and the trajectories of the brushbot after being pushed at n = 1031,1075,1101 are also shown in Figure V.8. Dashed lines in the last figure indicate the time when the brushbot is pushed away. Given relatively short learning time and the fact that no simulator was used, the brushbot learned the desirable behavior sufficiently well.
Figure V.9 plots the shape of ˆQx;y
;0],[0;0]) over X,Y positions at n = 1000. It is observed that when the control input is zero (i.e., when the brushbot basically does not move), the vicinity of the origin has the highest value, which is reasonable.
Finally, Figure V.10 shows two trajectories of the brushbot returning to the origin by using the action-value function saved at n = 1000. After being pushed away from the origin, the brushbot successfully returned to the origin again.
3) Discussion: One of the challenges of the experiments is that no initial data or simulators were available. Despite the fact that the brushbot with highly complex system had to learn an optimal policy while dealing with safety by employing an adaptive model learning algorithm, the proposed learning framework worked well in the real world. Brushbot is powered by brushes, and its dynamics highly depends on the conditions of the floor and brushes. The possible changes of the agent dynamics thus lead to some violations of safety. Nevertheless, our learning framework recovered safety quickly. In addition, the agent learned a good policy within a quite short period. One reason of those successes of adaptivity and data-efficiency is the convex-analytic formulations.
On the other hand, because no initial nominal model or policy is available and our framework is fully adaptive, i.e., we do not collect data to conduct batch model learning and/or reinforcement learning, we need to reduce the dimensions of input vectors to speed-up and robustify learning. This can be
Fig. V.6. The left figure shows the trajectory of the brushbot while exploring, and the right figure shows X,Y positions over iterations. The region of exploration is limited to x . The brushbot remained in the region most of the time.
Fig. V.7. The values of four control barrier functions employed in the experiment for the whole trajectory. Even though some violations of safety were seen, the brushbot returned to the safe region before large violations occurred. The nonholonomic brushbot adaptively learned a model to turn inward the region of exploration before reaching the boundaries of the region of exploration.
an inherent limitation of our framework.
The learning framework presented in this paper successfully tied model learning, reinforcement learning, and barrier certificates, enabling barrier-certified reinforcement learning for unknown, highly nonlinear, nonholonomic, and possibly nonstationary agent dynamics. The proposed model learning algorithm captures a structure of the agent dynamics by employing a sparse optimization. The resulting model has preferable structure for preserving efficient computations of barrier certificates. In addition, recovery of safety after an
Fig. V.8. Trajectories of the optimal policy learned by the brushbot. The optimal policy replaced random explorations at n = 1000, and the brushbot returned to the origin until n = 1016 (first figure). The brushbot was pushed by a sweeper at time instant n = 1031,1075,1101,1128,1181, and n = 1230. Dashed lines in the last figure indicate the time when the brushbot was pushed away. The brushbot learned the desirable behavior sufficiently well.
Fig. V.9. The shape of the action-value function over X,Y positions at the control input u = [0;0] and n = 1000. The vicinity of the origin has the highest value when the control input is zero.
unexpected and abrupt change of the agent dynamics was guaranteed by employing barrier certificates and a model learning algorithm with monotone approximation property under certain conditions. For possibly nonstationary agent dynamics, the action-value function approximation problem was appropriately reformulated so that kernel-based methods, including kernel adaptive filter, can be directly applied in an RKHS. Lastly, certain conditions were also presented to render the set of safe policies convex, thereby guaranteeing the global optimality of solutions to the policy update to ensure the greedy improvement of a policy. The experimental result shows the efficacy of the proposed learning framework in the real world.
See [24, Proposition 4] for the proof of forward invariance. The set is asymptotically stable as
where the inequality holds from [24, Proposition 1].
From Assumptions IV.1.1, IV.1.2, IV.1.5, and from the facts that the estimated output is linear to the model parameter at a fixed input and that xn
ˆxn
0, we obtain
for some bounded 0. From Assumptions IV.1.3, we also obtain that
Fig. V.10. Two trajectories of the brushbot returning to the origin by using the action-value function saved at n = 1000. Red arrows show the trajectories. After being pushed away from the origin, the brushbot successfully returned to the origin again.
Therefore, from Assumptions IV.1.6 and IV.1.7, and from 0, we obtain for h
dist(hn
hn
that
xn
ˆxn
xn
ˆxn
dist2(hn
If B(xnˆxn
, then we obtain
This inequality also holds in case when B(xnˆxn
. Because of the continuity of the cost function
and the barrier function B (Assumptions IV.1.3 and IV.1.5), the set
is closed. We show that there exists a Lyapunov function V
with respect to the closed set
for the augmented state [x;h]. A candidate function is given by
0 if [x;h
min
dist2
Since min
dist2
0,
x;h
, where
is the boundary of the set
, from Assumption IV.1.3, the function V
is continuous. It also holds that V
x;h]) > 0 when [x;h
. Under Assumption IV.1.6, we obtain
for all n . To show that the first inequality holds, we first show
(a) For B(xn0: from (IV.1), (B.2), and 0 <
1, we obtain B(ˆxn
and B(xn
dist2(hn
dist2(hn
from which it follows that
(b) For B(xn) < 0 and B(xn0: it is straightforward to see that
(c) For B(xn) < 0 and B(xn0: from (IV.1), (B.2), and 0
1, we obtain B(ˆxn
xn
and B(xn
B(xn
dist2(hn
dist2(hn
from which it
follows that
min(B(xn
min(B(xn),0) = B(xn
xn
If hn , under Assumption IV.1.6, we obtain
dist2(hn
dist2(hn
from which it follows that
dist2(hn
dist2(hn
dist2(hn
dist2(hn
and
and the first inequality of (B.3) holds. The inequality also holds for hn . Moreover, if [xn;hn
, then hn remains in
because of monotonic approximation property. From (B.2), the control barrier certificate (III.1) is thus ensured with a control input satisfying (IV.1) under Assumption IV.1.4, and the set
is forward invariant. Therefore, the system for the augmented state is stable with respect to the set
. If hn
for all n
such that [xn;hn
, it follows that
and [53, Theorem 1] applies, i.e., the system for the augmented state is uniformly globally asymptotically stable with respect to the set .
APPENDIX D PROOF OF THEOREM IV.2 The following lemmas are used to prove the theorem.
Lemma D.1 ( [54, Theorem 2]). Let be any set with nonempty interior. Then, the RKHS associated with the Gaussian kernel for an arbitrary scale parameter
0 does not contain any polynomial on X , including the nonzero constant function.
Lemma D.2. Assume that and
have nonempty interiors. Then, the intersection of the RKHS
associated with the kernel
, and the RKHS
is {0}, i.e.,
Proof. It is obvious that the function , is an element of both of the RKHSs (vector spaces)
and
. Therefore, it is sufficient to show that there exists u
satisfying that
uint), uint
int(U ), where int(U ) denotes the interior of U , for any
. Assume that
0 for some v
. From [51, Theorem 3], the RKHS
is expressed as
span
, which is finite dimension, implying that any function in
is linear. Since there exists u = uint
for some
0, it is proved that
Lemma D.3 ( [55, Proposition 1.3]). If for given vector spaces
and
, then
{0}, i.e.,
Lemma D.4. Given and
, let
,
, and H be associated with the Gaussian kernels
spectively, for an arbitrary 0. Then, by regarding a function in
as a function over the input space
, it holds that
Proof. has the reproducing kernel defined by
This verifies the claim.
We are now ready to prove Theorem IV.2.
Proof of Theorem IV.2. By Lemmas D.2 and D.3, it is derived that . By Lemmas D.1, D.3, and D.4, it holds that
and
{0}.
We show that the operator U : , which maps
to a function
z;w
where
, is bijective. Because the mapping U is surjective by definition, we show it is also injective. For any
,
and
from which the linearity holds. Therefore, it is sufficient to show that ker(U) = 0 [56]. For any ker(U), we obtain
which implies that 0.
Next, we show that is an RKHS. The space
with the inner product defined in (IV.2) is isometric to the RKHS HQ, and hence is a Hilbert space. Because
, it is true that
z;w
. Moreover, it holds that
z;w
˜z; ˜w
z;w],[˜z; ˜w]),
and that
[z;w])
z)
w)
=
(z)
(w) =
([z;w])
Therefore, is the reproducing kernel with which the RKHS
is associated.
is the gradient of the scaler field B [57]. Let xt)xn +txn
xn +t( ˆfn(xn) + ˆgn(xn)un), where t
parameterizes the line path between xn and xn
, then dB
ˆfn(xn)+ ˆgn(xn)un). Therefore, for any path A from xn
to ˆxnxn + ˆfn(xn)+ ˆgn(xn)un, it holds under Assumption IV.2.2 that
( ˆfn(xn)+ ˆgn(xn)un)dt
The inequality implies that B(ˆxnxn) is greater than or equal to that in the case when
decreases along the line path at the maximum rate. Therefore, when (IV.7) is satisfied, it holds from (G.1) that
which is the control barrier certificate defined in (IV.1). Hence, (III.1) is satisfied by the same argument as in the proof of Theorem IV.1 under Assumption IV.1.3. Equation (IV.7) can be rewritten as xn)
x ( ˆfn(xn)+ ˆgn(xn)un)
2
The first term in the left hand side of (G.2) is affine to un, the second term is the combination of a concave function and an affine function of un, which is concave. Therefore, the left hand side of (G.2) is a concave function, and the inequality (G.2) defines a convex constraint under Assumption IV.2.1.
APPENDIX H KERNEL ADAPTIVE FILTER WITH MONOTONE APPROXIMATION PROPERTY Kernel adaptive filter [58] is an adaptive extension of the kernel ridge regression [59], [60] or GPs. Multikernel adaptive filter [61] exploits multiple kernels to conduct learning in the sum space of RKHSs associated with each kernel. Let M be the number of kernels employed. Here, we only discuss the case that the dimension of the model parameter h is fixed, for simplicity. Denote, by
˜zm
{1,2,...,M},rm
, the time-dependent set of functions, referred to as a dictionary, at time instant n for the mth kernel
. The current estimator ˆ
is evaluated at the current input zn, in a linear form, as
where hn := [h1;h2
;hM
h1;h2;
;hr
r :
rm, is the coefficent vector, and k(zn) := [k1(zn);k2(zn
;kM(zn
, km(zn) :=
zn, ˜zm
zn, ˜zm
zn, ˜zm
. To obtain a sparse model parameter, we define the cost at time instant n as
where , and
which is a set of coefficient vector h satisfying instantaneous-error-zero with a precision parameter . Here,
is the output at time instant n, and the
-norm regularization
hi| with a parameter
0 promotes sparsity of h. The update rule of the adaptive proximal forward-backward splitting [62], which is an adaptive filter designed for sparse optimizations, for the cost (H.1) is given by
where is the step size, I is the identity operator, and
where sgnis the sign function. Then, the strictly monotone approximation property [62]:
hn
hn
:= argminh
, holds if hn
/0.
Dictionary Construction: If the dictionary is insufficient, we can employ two novelty conditions when adding the kernel functions zn
to the dictionary: (i) the maximum-dictionary-size condition
and (ii) the large-normalized-error condition
By using sparse optimizations, nonactive structural components represented by some kernel functions can be removed, and the dictionary is refined as time goes by. To effectively achieve a compact representation of the model, it might be required to appropriately weigh the kernel functions to include some preferences on a structure of the model. The following lemma implies that the resulting kernels are still reproducing kernels.
Lemma H.1 ( [63, Theorem 2]). Let be the reproducing kernel of an RKHS
. Then,
for an arbitrary
0 is the reproducing kernel of the RKHS
with the inner product
.
APPENDIX I COMPARISON TO PARAMETRIC APPROACHES AND THE GP SARSA
If the suitable set of basis functions for approximating action-value functions is available, we can adopt a parametric approach for action-value function approximation. Suppose that an estimate of the action-value function at time instant n is given by ˆQ, where
is fixed for all time. In this parametric case, given an input-output
Then, stable tracking is achieved if the step size is properly selected, even after the dynamics or the policy is changed.
On the other hand, when employing a kernel-based learning, it is not trivial how to update the estimate in a theoretically formal manner. Because the output of the action-value function is not directly observable, the expansion zn) (where
is the reproducing kernel of the RKHS containing the action-value function) cannot be validated by the representer theorem [64] any more. By defining the RKHS
as in Theorem IV.3, however, we can view an action-value function approximation as the supervised learning in the RKHS
, and can overcome the aforementioned issue. We mention that when an adaptive filter is employed in the RKHS
, we do not have to reset learning even after policies are updated or the dynamics changes, since the domain of
is
instead of Z . The example below indicates that our approach is general.
As discussed in Section II-A, the least squares temporal difference algorithm has been extended to kernel-based methods including the GP SARSA [37]. Given a set of input data {znzn := [xn;un], Nd
, the posterior mean mQ and variance
of ˆQ
at a point z
are given by
If we employ a GP for learning in
defined in Theorem IV.3, the posterior mean m
and variance
of ˆ
at a point
;w
are given by
where kNd :;w
z0;z1
;w
zNd
;zNd])], and the (i, j) entry of K
is
zi
;zi],[zj
;zj]). Then, the posterior mean mQ and variance
of ˆQ
at a point z
are given by
M. Ohnishi thanks all of those who have given him insightful comments on this work, including the members of the Georgia Robotics and Intelligent Systems Laboratory. The authors thank all of the anonymous reviewers for their constructive suggestions.
[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 1998.
[2] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, 2009.
[3] D. Liberzon, Calculus of variations and optimal control theory: a concise introduction. Princeton University Press, 2011.
[4] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Proc. NIPS, 2017.
[5] F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes,” in Proc. CDC, 2016, pp. 4661–4666.
[6] J. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, and M. Toussaint, “Safe exploration for active learning with Gaussian processes,” in Proc. ECML PKDD, 2015, pp. 133–149.
[7] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with Gaussian processes,” in Proc. CDC, 2014, pp. 1424–1431.
[8] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multiagent, reinforcement learning for autonomous driving,” arXiv preprint arXiv:1610.03295, 2016.
[9] H. B. Ammar, R. Tutunov, and E. Eaton, “Safe policy search for lifelong reinforcement learning with sublinear regret,” in Proc. ICML, 2015, pp. 2361–2369.
[10] D. A. Niekerk, B. V. and B. Rosman, “Online constrained model-based reinforcement learning,” in Proc. AUAI, 2017.
[11] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proc. ICML, 2017.
[12] P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learning in reinforcement learning,” in Proc. ICML, 2005, pp. 1–8.
[13] L. Wang, E. A. Theodorou, and M. Egerstedt, “Safe learning of quadrotor dynamics using barrier certificates,” in IEEE Proc. ICRA, 2018, pp. 2460–2465.
[14] J. Garcıa and F. Fern´andez, “A comprehensive survey on safe reinforcement learning,” J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437–1480, 2015.
[15] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[16] P. Geibel, “Reinforcement learning for MDPs with constraints,” in Proc. ECML, vol. 4212, 2006, pp. 646–653.
[17] S. P. Coraluppi and S. I. Marcus, “Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes,” Automatica, vol. 35, no. 2, pp. 301–309, 1999.
[18] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning. MIT press Cambridge, 2006, vol. 1.
[19] X. Xu, P. Tabuada, J. W. Grizzle, and A. D. Ames, “Robustness of control barrier functions for safety critical control,” in Proc. IFAC, vol. 48, no. 27, 2015, pp. 54–61.
[20] P. Wieland and F. Allg¨ower, “Constructive safety using control barrier functions,” in Proc. IFAC, vol. 40, no. 12, 2007, pp. 462–467.
[21] P. Glotfelter, J. Cort´es, and M. Egerstedt, “Nonsmooth barrier functions with applications to multi-robot systems,” IEEE Control Systems Letters, vol. 1, no. 2, pp. 310–315, 2017.
[22] L. Wang, A. D. Ames, and M. Egerstedt, “Safety barrier certificates for collisions-free multirobot systems,” IEEE Trans. Robotics, 2017.
[23] A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,” IEEE Trans. Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2017.
[24] A. Agrawal and K. Sreenath, “Discrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation,” in Proc. RSS, 2017.
[25] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[26] V. M. Janakiraman, X. L. Nguyen, and D. Assanis, “A Lyapunov based stable online learning algorithm for nonlinear dynamical systems using extreme learning machines,” in IEEE Proc. IJCNN, 2013, pp. 1–8.
[27] M. French and E. Rogers, “Non-linear iterative learning by an adaptive Lyapunov technique,” International Journal of Control, vol. 73, no. 10, pp. 840–850, 2000.
[28] M. M. Polycarpou, “Stable adaptive neural control scheme for nonlinear systems,” IEEE Trans. Automatic Control, vol. 41, no. 3, pp. 447–451, 1996.
[29] K. J. ˚Astr¨om and B. Wittenmark, Adaptive control. Courier Corporation, 2013.
[30] C. A. Cheng and H. P. Huang, “Learn the Lagrangian: A vector-valued RKHS approach to identifying Lagrangian systems,” IEEE Trans. Cybernetics, vol. 46, no. 12, pp. 3247–3258, 2016.
[31] D. Ormoneit and P. Glynn, “Kernel-based reinforcement learning in average-cost problems,” IEEE Trans. Automatic Control, vol. 47, no. 10, pp. 1624–1636, 2002.
[32] X. Xu, D. Hu, and X. Lu, “Kernel-based least squares policy iteration for reinforcement learning,” IEEE Trans. Neural Networks, vol. 18, no. 4, pp. 973–992, 2007.
[33] G. Taylor and R. Parr, “Kernelized value function approximation for reinforcement learning,” in Proc. ICML, 2009, pp. 1017–1024.
[34] W. Sun and J. A. Bagnell, “Online Bellman residual and temporal difference algorithms with predictive error guarantees,” in Proc. IJCAI, 2016.
[35] Y. Nishiyama, A. Boularias, A. Gretton, and K. Fukumizu, “Hilbert space embeddings of POMDPs,” in Proc. UAI, 2012.
[36] S. Grunewalder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton, “Modelling transition dynamics in MDPs with RKHS embeddings,” in Proc. ICML, 2012.
[37] Y. Engel, S. Mannor, and R. Meir, “Reinforcement learning with Gaussian processes,” in Proc. ICML, 2005, pp. 201–208.
[38] A. Barreto, D. Precup, and J. Pineau, “Practical kernel-based reinforcement learning,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 2372–2441, 2016.
[39] A. S. Barreto, D. Precup, and J. Pineau, “Reinforcement learning using kernel-based stochastic factorization,” in Proc. NIPS, 2011, pp. 720–728.
[40] B. Kveton and G. Theocharous, “Kernel-based reinforcement learning on representative states.” in Proc. AAAI, 2012.
[41] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, “Reinforcement learning via kernel temporal difference,” in IEEE Proc. EMBC, 2011.
[42] J. Reisinger, P. Stone, and R. Miikkulainen, “Online kernel selection for Bayesian reinforcement learning,” in Proc. ICML, 2008, pp. 816–823.
[43] Y. Cui, T. Matsubara, and K. Sugimoto, “Kernel dynamic policy programming: Applicable reinforcement learning to robot systems with high dimensional states,” Neural Networks, vol. 94, pp. 13–23, 2017.
[44] H. Van H., J. Peters, and G. Neumann, “Learning of non-parametric control policies with high-dimensional state features,” in Artificial Intelligence and Statistics, 2015, pp. 995–1003.
[45] N. Aronszajn, “Theory of reproducing kernels,” Trans. Amer. Math. Soc., vol. 68, no. 3, pp. 337–404, May 1950.
[46] I. Steinwart, “On the influence of the kernel on the consistency of support vector machines,” J. Mach. Learn. Res., vol. 2, pp. 67–93, 2001.
[47] I. Yamada and N. Ogura, “Adaptive projected subgradient method for asymptotic minimization of sequence of nonnegative convex functions,” Numerical Functional Analysis and Optimization, vol. 25, no. 7&8, pp. 593–617, 2004.
[48] M. L. Puterman and S. L. Brumelle, “On the convergence of policy iteration in stationary dynamic programming,” Mathematics of Operations Research, vol. 4, no. 1, pp. 60–69, 1979.
[49] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific Belmont, MA, 2005, vol. 1, no. 3.
[50] C. D. McKinnon and A. P. Schoellig, “Experience-based model selection to enable long-term, safe control for repetitive tasks under changing conditions,” in IEEE Proc. IROS, 2018, pp. 2977–2984.
[51] A. Berlinet and A. C. Thomas, Reproducing kernel Hilbert spaces in probability and statistics. Kluwer, 2004.
[52] D. Pickem, P. Glotfelter, L. Wang, M. Mote, A. Ames, E. Feron, and M. Egerstedt, “The robotarium: A remotely accessible swarm robotics research testbed,” in IEEE Proc. ICRA, 2017, pp. 1699–1706.
[53] Z. P. Jiang and Y. Wang, “A converse Lyapunov theorem for discrete-time systems with disturbances,” Systems & Control Letters, vol. 45, no. 1, pp. 49–58, 2002.
[54] H. Q. Minh, “Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory,” Constructive Approximation, vol. 32, no. 2, pp. 307–338, 2010.
[55] R. A. Ryan, Introduction to tensor products of Banach spaces. Springer Science & Business Media, 2013.
[56] G. Strang, Introduction to linear algebra. Wellesley-Cambridge Press Wellesley, MA, 1993, vol. 3.
[57] L. V. Ahlfors, “Complex analysis: an introduction to the theory of analytic functions of one complex variable,” New York, London, p. 177, 1953.
[58] W. Liu, J. Pr´ıncipe, and S. Haykin, Kernel adaptive filtering. New Jersey: Wiley, 2010.
[59] K. R. M¨uller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf, “An introduction to kernel-based learning algorithms,” IEEE Trans. Neural Networks, vol. 12, no. 2, pp. 181–201, 2001.
[60] B. Sch¨oelkopf and A. Smola, Learning with kernels. MIT Press, Cambridge, 2002.
[61] M. Yukawa, “Multikernel adaptive filtering,” IEEE Trans. Signal Processing, vol. 60, no. 9, pp. 4672–4682, Sept. 2012.
[62] Y. Murakami, M. Yamagishi, M. Yukawa, and I. Yamada, “A sparse adaptive filtering using time-varying soft-thresholding techniques,” in Proc. IEEE ICASSP, 2010, pp. 3734–3737.
[63] M. Yukawa, “Adaptive learning in Cartesian product of reproducing kernel Hilbert spaces,” IEEE Trans. Signal Processing, vol. 63, no. 22, pp. 6037–6048, Nov. 2015.
[64] G. Kimeldorf and G. Wahba, “Some results on Tchebycheffian spline functions,” Journal of Mathematical Analysis and Applications, vol. 33, no. 1, pp. 82–95, 1971.