b

DiscoverSearch
About
My stuff
Uniform Error Bounds for Gaussian Process Regression with Application to Safe Control
2019·arXiv
Abstract
Abstract

Data-driven models are subject to model errors due to limited and noisy training data. Key to the application of such models in safety-critical domains is the quantification of their model error. Gaussian processes provide such a measure and uniform error bounds have been derived, which allow safe control based on these models. However, existing error bounds require restrictive assumptions. In this paper, we employ the Gaussian process distribution and continuity arguments to derive a novel uniform error bound under weaker assumptions. Furthermore, we demonstrate how this distribution can be used to derive probabilistic Lipschitz constants and analyze the asymptotic behavior of our bound. Finally, we derive safety conditions for the control of unknown dynamical systems based on Gaussian process models and evaluate them in simulations of a robotic manipulator.

The application of machine learning techniques in control tasks bears significant promises. The identification of highly nonlinear systems through supervised learning techniques [1] and the automated policy search in reinforcement learning [2] enables the control of complex unknown systems. Nevertheless, the application in safety-critical domains, like autonomous driving, robotics or aviation is rare. Even though the data-efficiency and performance of self-learning controllers is impressive, engineers still hesitate to rely on learning approaches if the physical integrity of systems is at risk, in particular, if humans are involved. Empirical evaluations, e.g. for autonomous driving [3], are available, however, this might not be sufficient to reach the desired level of reliability and autonomy.

Limited and noisy training data lead to imperfections in data-driven models [4]. This makes the quantification of the uncertainty in the model and the knowledge about a model’s ignorance key for the utilization of learning approaches in safety-critical applications. Gaussian process models provide this measure for their own imprecision and therefore gained attention in the control community [5, 6, 7]. These approaches heavily rely on error bounds of Gaussian process regression and are therefore limited by the strict assumptions made in previous works on GP uniform error bounds [8, 9, 10, 11].

The main contribution of this paper is therefore the derivation of a novel GP uniform error bound, which requires less prior knowledge and assumptions than previous approaches and is therefore applicable to a wider range of problems. Furthermore, we derive a Lipschitz constant for the samples of GPs and investigate the asymptotic behavior in order to demonstrate that arbitrarily small error bounds can be guaranteed with sufficient computational resources and data. The proposed GP bounds are employed to derive safety guarantees for unknown dynamical systems which are controlled based on a GP model. By employing Lyapunov theory [12], we prove that the closed-loop system - here we take a robotic manipulator as example - converges to a small fraction of the state space and can therefore be considered as safe.

The remainder of this paper is structured as follows: We briefly introduce Gaussian process regression and discuss related error bounds in Section 2. The novel proposed GP uniform error bound, the probabilistic Lipschitz constant and the asymptotic analysis are presented in Section 3. In Section 4 we show safety of a GP model based controller and evaluate it on a robotic manipulator in Section 5.

2.1 Gaussian Process Regression and Uniform Error Bounds

Gaussian process regression is a Bayesian machine learning method based on the assumption that any finite collection of random variables1 yi ∈ Rfollows a joint Gaussian distribution with prior mean 0 and covariance kernel  k : Rd × Rd → R+ [13]. Therefore, the variables  yiare observations of a sample function  f : X ⊂ Rd → Rof the GP distribution perturbed by zero mean Gaussian noise with variance  σ2n ∈ R+,0. By concatenating N input data points  xiin a matrix  XNthe elements of the GP kernel matrix  K(XN, XN)are defined as  Kij = k(xi, xj), i, j = 1, . . . , Nand  k(XN, x)denotes the kernel vector, which is defined accordingly. The probability distribution of the GP at a point x conditioned on the training data concatenated in  XNand  yNis then given as a normal distribution with mean  νN(x) = k(x, XN)(K(XN, XN) + σ2nIN)−1yN and variance σ2N(x, x′) = k(x, x′) − k(x, XN)(K(XN, XN) + σ2nIN)−1k(XN, x′).

A major reason for the popularity of GPs and related approaches in safety critical applications is the existence of uniform error bounds for the regression error, which is defined as follows.

Definition 2.1. Gaussian process regression exhibits a uniformly bounded error on a compact set  X ⊂ Rd if there exists a function  η(x) such that

image

If this bound holds with probability of at least  1 − δfor some  δ ∈ (0, 1), it is called a probabilistic uniform error bound.

2.2 Related Work

For many methods closely related to Gaussian process regression, uniform error bounds are very common. When dealing with noise-free data, i.e. in interpolation of multivariate functions, results from the field of scattered data approximation with radial basis functions can be applied [14]. In fact, many of the results from interpolation with radial basis functions can be directly applied to noise-free GP regression with stationary kernels. The classical result in [15] employs Fourier transform methods to derive an error bound for functions in the reproducing kernel Hilbert space (RKHS) attached to the interpolation kernel. By additionally exploiting properties of the RKHS a uniform error bound with increased convergence rate is derived in [16]. Typically, this form of bound crucially depends on the so called power function, which corresponds to the posterior standard deviation of Gaussian process regression under certain conditions [17]. In [18], a  Lperror bound for data distributed on a sphere is developed, while the bound in [19] extends existing approaches to functions from Sobolev spaces. Bounds for anisotropic kernels and the derivatives of the interpolant are developed in [20]. A Sobolev type error bound for interpolation with Matérn kernels is derived in [21]. Moreover, it is shown that convergence of the interpolation error implies convergence of the GP posterior variance.

Regularized kernel regression is a method which extends many ideas from scattered data interpolation to noisy observations and it is highly related to Gaussian process regression as pointed out in [17]. In fact, the GP posterior mean function is identical to kernel ridge regression with squared cost function [13]. Many error bounds such as [22] depend on the empirical  L2covering number and the norm of the unknown function in the RKHS attached to the regression kernel. In [23], the effective dimension of the feature space, in which regression is performed, is employed to derive a probabilistic uniform error bound. The effect of approximations of the kernel, e.g. with the Nyström method, on the regression error is analyzed in [24]. Tight error bounds using empirical  L2covering numbers are derived under mild assumptions in [25]. Finally, error bounds for general regularization are developed in [26], which depend on regularization and the RKHS norm of the function.

Using similar RKHS-based methods for Gaussian process regression, probabilistic uniform error bounds depending on the maximal information gain and the RKHS norm have been developed in [8]. These constants pose a high hurdle which has prevented the rigorous application of this work in control and typically heuristic constants without theoretical foundations are applied, see e.g. [27]. While regularized kernel regression allows a wide range of observation noise distributions, the bound in [8] only holds for bounded sub-Gaussian noise. Based on this work an improved bound is derived in [9] in order to analyze the regret of an upper confidence bound algorithm in multi-armed bandit problems. Although these bounds are frequently used in safe reinforcement learning and control, they suffer from several issues. On the one hand, they depend on constants which are very difficult to calculate. While this is no problem for theoretical analysis, it prohibits the integration of these bounds into algorithms and often estimates of the constants must be used. On the other hand, they suffer from the general problem of RKHS approaches: The space of functions, for which the bounds hold, becomes smaller the smoother the kernel is [19]. In fact, the RKHS attached to a covariance kernel is usually small compared to the support of the prior distribution of a Gaussian process [28].

The latter issue has been addressed by considering the support of the prior distribution of the Gaussian process as belief space. Based on bounds for the suprema of GPs [29] and existing error bounds for interpolation with radial basis functions, a probabilistic uniform error bound for Kriging (alternative term for GP regression for noise-free training data) is derived in [30]. However, the uniform error of Gaussian process regression with noisy observations has not been analyzed with the help of the prior GP distribution to the best of our knowledge.

While probabilistic uniform error bounds for the cases of noise-free observations and the restriction to subspaces of a RKHS are widely used, they often rely on constants which are hard to determine and are typically limited to unnecessarily small function spaces. The inherent probability distribution of GPs, which is the largest possible function space for regression with a certain GP, has not been exploited to derive uniform error bounds for Gaussian process regression with noisy observations. Under the weak assumption of Lipschitz continuity of the covariance kernel and the unknown function, a directly computable probabilistic uniform error bound is derived in Section 3.1. We demonstrate how Lipschitz constants for unknown functions directly follow from the assumed distribution over the function space in Section 3.2. Finally, we show that an arbitrarily small error bound can be reached with sufficiently many and well-distributed training data in Section 3.3.

3.1 Exploiting Lipschitz Continuity of the Unknown Function

In contrast to the RKHS based approaches in [8, 9], we make use of the inherent probability distribution over the function space defined by Gaussian processes. We achieve this through the following assumption.

Assumption 3.1. The unknown function  f(·)is a sample from a Gaussian process  GP(0, k(x, x′))and observations  y = f(x) + ϵare perturbed by zero mean i.i.d. Gaussian noise  ϵwith variance  σ2n.

This assumption includes abundant information about the regression problem. The space of sample functions F is limited through the choice of the kernel  k(·, ·)of the Gaussian process. Using Mercer’s decomposition [31]  φi(x), i = 1, . . . , ∞of the kernel  k(·, ·), this space is defined through

image

which contains all functions that can be represented in terms of the kernel  k(·, ·). By choosing a suitable class of covariance functions  k(·, ·), this space can be designed in order to incorporate prior knowledge of the unknown function  f(·). For example, for covariance kernels  k(·, ·)which are universal in the sense of [32], continuous functions can be learned with arbitrary precision. Moreover, for the squared exponential kernel, the space of sample functions corresponds to the space of continuous functions on X, while its RKHS is limited to analytic functions [28]. Furthermore, Assumption 3.1 defines a prior GP distribution over the sample space F which is the basis for the calculation of the posterior probability. The prior distribution is typically shaped by the hyperparameters of the covariance kernel  k(·, ·), e.g. slowly varying functions can be assigned a higher probability than functions with high derivatives. Finally, Assumption 3.1 allows Gaussian observation noise which is in contrast to the bounded noise required e.g. in [8, 9].

In addition to Assumption 3.1, we need Lipschitz continuity of the kernel  k(·, ·)and the unknown function  f(·). We define the Lipschitz constant of a differentiable covariance kernel  k(·, ·) as

image

Since most of the practically used covariance kernels  k(·, ·), such as squared exponential and Matérn kernels, are Lipschitz continuous [13], this is a weak restriction on covariance kernels. However, it allows us to prove continuity of the posterior mean function  νN(·)and the posterior standard deviation  σN(·), which is exploited to derive a probabilistic uniform error bound in the following theorem. The proofs for all following theorems can be found in the supplementary material.

Theorem 3.1. Consider a zero mean Gaussian process defined through the continuous covariance kernel  k(·, ·)with Lipschitz constant  Lkon the compact set X. Furthermore, consider a continuous unknown function  f : X → Rwith Lipschitz constant  Lfand  N ∈ Nobservations  yisatisfying Assumption 3.1. Then, the posterior mean function  νN(·)and standard deviation  σN(·)of a Gaussian process conditioned on the training data  {(xi, yi)}Ni=1are continuous with Lipschitz constant  LνNand modulus of continuity  ωσN (·) on X such that

image

Moreover, pick  δ ∈ (0, 1), τ ∈ R+ and set

image

Then, it holds that

image

The parameter  τis in fact the grid constant of a grid used in the derivation of the theorem. The error on the grid can be bounded by exploiting properties of the Gaussian distribution [8] resulting in a dependency on the number of grid points. Eventually, this leads to the constant  β(τ)defined in (6) since the covering number  M(τ, X)is the minimum number of points in a grid over X with grid constant  τ. By employing the Lipschitz constant  LνNand the modulus of continuity  ωσN (·), whichare trivially obtained due Lipschitz continuity of the covariance kernel  k(·, ·), as well as the Lipschitz constant  Lf, the error bound is extended to the complete set X, which results in (8).

Note, that most of the equations in Theorem 3.1 can be directly evaluated. Although our expression for  β(τ)depends on the covering number of X, which is in general difficult to calculate, upper bounds can be computed trivially. For example, for a hypercubic set  X ⊂ Rdthe covering number can be bounded by

image

where r is the edge length of the hypercube. Furthermore, (4) and (5) depend only on the training data and kernel expressions, which can be calculated analytically in general. Therefore, (8) can be computed for fixed  τand  δif an upper bound for the Lipschitz constant  Lfof the unknown function  f(·)is known. Prior bounds on the Lipschitz constant  Lfare often available for control systems, e.g. based on simplified first order physical models. However, we demonstrate a method to obtain probabilistic Lipschitz constants from Assumption 3.1 in Section 3.2. Therefore, it is trivial to compute all expressions in Theorem 3.1 or upper bounds thereof, which emphasizes the high applicability of Theorem 3.1 in safe control of unknown systems.

Moreover, it should be noted that  τcan be chosen arbitrarily small such that the effect of the constant  γ(τ)can always be reduced to an amount which is negligible compared to�β(τ)σN(x). Even conservative approximations of the Lipschitz constants  LνNand  Lfand a loose modulus of continuity  ωσN (τ)do not affect the error bound (8) much since (6) grows merely logarithmically with diminishing  τ. In fact, even the bounds (4) and (5), which grow in the order of O(N) and  O(N12 ), respectively, as shown in the proof of Theorem 3.3 and thus are unbounded, can be compensated such that a vanishing uniform error bound can be proven under weak assumptions in Section 3.3.

3.2 Probabilistic Lipschitz Constants for Gaussian Processes

If little prior knowledge of the unknown function  f(·)is given, it might not be possible to directly derive a Lipschitz constant  Lfon X. However, we indirectly assume a certain distribution of the derivatives of  f(·)with Assumption 3.1. Therefore, it is possible to derive a probabilistic Lipschitz constant  Lffrom this assumption, which is described in the following theorem.

Theorem 3.2. Consider a zero mean Gaussian process defined through the covariance kernel  k(·, ·)with continuous partial derivatives up to the fourth order and partial derivative kernels

image

Let  L∂ikdenote the Lipschitz constants of the partial derivative kernels  k∂i(·, ·)on the set X with maximal extension  r = maxx,x′∈X ∥x − x′∥. Then, a sample function  f(·)of the Gaussian process is almost surely continuous on X and with probability of at least  1 − δL, it holds that

image

is a Lipschitz constant of  f(·) on X.

Note that a higher differentiability of the covariance kernel  k(·, ·)is required compared to Theorem 3.1. The reason for this is that the proof of Theorem 3.2 exploits the fact that the partial derivative k∂i(·, ·)of a differentiable kernel is again a covariance function, which defines a derivative Gaussian process [33]. In order to obtain continuity of the samples of these derivative processes, the derivative kernels  k∂i(·, ·)must be continuously differentiable [34]. Using the metric entropy criterion [34] and the Borell-TIS inequality [35], we exploit the continuity of sample functions and bound their maximum value, which directly translates into the probabilistic Lipschitz constant (11).

Note that all the values required in (11) can be directly computed. The maximum of the derivative kernels  k∂i(·, ·)as well as their Lipschitz constants  L∂ik can be calculated analytically for many kernels. Therefore, the Lipschitz constant obtained with Theorem 3.2 can be directly used in Theorem 3.1 through application of the union bound. Since the Lipschitz constant  Lfhas only a logarithmic dependence on the probability  δL, small error probabilities for the Lipschitz constant can easily be achieved. Remark 3.1. The work in [36] derives also estimates for the Lipschitz constants. However, they only take the Lipschitz constant of the posterior mean function, which neglects the probabilistic nature of the GP and thereby underestimates the Lipschitz constants of samples of the GP.

3.3 Analysis of Asymptotic Behavior

In safe reinforcement learning and control of unknown systems an important question regards the existence of lower bounds for the learning error because they limit the achievable control performance. It is clear that the available data and constraints on the computational resources pose such lower bounds in practice. However, it is not clear under which conditions, e.g. requirements of computational power, an arbitrarily low uniform error can be guaranteed. The asymptotic analysis of the error bound, i.e. investigation of the bound (8) in the limit  N → ∞can clarify this question. The following theorem is the result of this analysis.

Theorem 3.3. Consider a zero mean Gaussian process defined through the continuous covariance kernel  k(·, ·)with Lipschitz constant  Lkon the set X. Furthermore, consider an infinite data stream of observations  (xi, yi)of an unknown function  f : X → Rwith Lipschitz constant  Lfand maximum absolute value ¯f ∈ R+on X which satisfies Assumption 3.1. Let  νN(·)and  σN(·)denote the mean and standard deviation of the Gaussian process conditioned on the first N observations. If there exists a  ϵ > 0such that the standard deviation satisfies  σN(x) ∈ O�log(N)− 12 −ϵ�, ∀x ∈ X, then it

holds for every  δ ∈ (0, 1) that

image

In addition to the conditions of Theorem 3.1 the absolute value of the unknown function is required to be bounded by a value ¯f. This is necessary to bound the Lipschitz constant  LνNof the posterior mean function  νN(·)in the limit of infinite training data. Even if no such constant is known, it can be derived from properties of the GP under weak conditions similarly to Theorem 3.2. Based on this restriction, it can be shown that the bound of the Lipschitz constant  LνNgrows at most with rate O(N) using the triangle inequality and the fact that the squared norm of the observation noise  ∥ϵ∥2 follows a χ2N distribution with probabilistically bounded maximum value [37]. Therefore, we pick  τ(N) ∈ O(N −2)such that  γ(τ(N)) ∈ O(N −1)and  β(τ(N)) ∈ O(log(N))which implies (12).

The condition on the convergence rate of the posterior standard deviation  σN(·)in Theorem 3.3 can be seen as a condition for the distribution of the training data, which depends on the structure of the covariance kernel. In [38, Corollary 3.2], the condition is formulated as follows: Let  Bρ(x)denote a set of training points around x with radius  ρ > 0, then the posterior variance converges to zero if there exists a function  ρ(N)for which  ρ(N) ≤ k(x, x)/Lk ∀N, limN→∞ ρ(N) = 0and  limN→∞��Bρ(N)(x)�� = ∞holds. This is achieved, e.g. if a constant fraction of all samples lies on the point x. In fact, it is straightforward to derive a similar condition for the uniform error bounds in [8, 9]. However, due to their dependence on the maximal information gain, the required decrease rates depend on the covariance kernel  k(·, ·)and are typically higher. For example, the posterior standard deviation of a Gaussian process with a squared exponential kernel must satisfy  σN(·) ∈ O�log(N)− d2 −2�for [8] and  σN(·) ∈ O�log(N)− d+12 �for [9].

Safety guarantees for dynamical systems, in terms of upper bounds for the tracking error, are becoming more and more relevant as learning controllers are applied in safety-critical applications like autonomous driving or robots working in close proximity to humans [39, 40, 4]. We therefore show how the results in Theorem 3.1 can be applied to control safely unknown dynamical systems. In Section 4.1 we propose a tracking control law for systems which are learned with GPs. The stability of the resulting controller is analyzed in Section 4.2.

4.1 Tracking Control Design

Consider the nonlinear control affine dynamical system

image

with state  x = [x1 x2]⊺ ∈ X ⊂ R2and control input  u ∈ U ⊆ R. While the structure of the dynamics (13) is known, the function  f(·)is not. However, we assume that it is a sample from a GP with kernel  k(·, ·). Systems of the form (13) cover a large range of applications including Lagrangian dynamics and many physical systems.

The task is to define a policy  π : X → Ufor which the output  x1tracks the desired trajectory  xd(t)such that the tracking error  e = [e1 e2]⊺ = x − xdwith  xd = [xd ˙xd]⊺vanishes over time, i.e.  limt→∞∥e∥ = 0. For notational simplicity, we introduce the filtered state  r = λe1 + e2, λ ∈ R+.

A well-known method for tracking of control affine systems is feedback linearization [12], which aims for a model-based compensation of the non-linearity  f(·)using an estimate ˆf(·)and then applies linear control principles for the tracking. The feedback linearizing policy reads as

image

where the linear control law  νis the PD-controller

image

with control gain  kc ∈ R+. This results in the dynamics of the filtered state

image

Assuming training data of the real system  yi = f(xi) + ϵ, i = 1, . . . , N, ϵ ∼ N(0, σ2n)are available, we utilize the posterior mean function  νN(·)for the model estimate ˆf(·). This implies, that observations of  ˙x2are corrupted by noise, while x is measured free of noise. This is of course debatable, but in practice measuring the time derivative is usually realized with finite difference approximations, which injects significantly more noise than a direct measurement.

4.2 Stability Analysis

Due to safety constraints, e.g. for robots interacting with humans, it is usually necessary to verify that the model ˆf(·)is sufficiently precise and the parameters of the controller  kc, λare chosen properly. These safety certificates can be achieved if there exists an upper bound for the tracking error as defined in the following.

Definition 4.1 (Ultimate Boundedness). The trajectory x(t) of a dynamical system  ˙x = f(x, u) isglobally ultimately bounded, if there exist a positive constants  b ∈ R+such that for every  a ∈ R+, there is a  T = T(a, b) ∈ R+ such that

image

Since the solutions x(t) cannot be computed analytically, a stability analysis is necessary, which allows conclusions regarding the closed-loop behavior without running the policy on the real system [12].

Theorem 4.1. Consider a control affine system (13), where  f(·)admits a Lipschitz constant  Lfon  X ⊂ Rd. Assume that  f(·)and the observations  yi, i = 1, . . . , N, satisfy the conditions of Assumption 3.1. Then, the feedback linearizing controller (14) with ˆf(·) = νN(·)guarantees with probability  1 − δthat the tracking error e converges to

image

with  β(τ) and γ(τ)defined in Theorem 3.1.

Based on Lyapunov theory, it can be shown that the tracking error converges if the feedback term  |kcr|dominates the model error  |f(·) − ˆf(·)|. As Theorem 3.1 bounds the model error, the set for which holds  |kcr| >�β(τ)σN(x) + γ(τ)can be computed. It can directly be seen, that the ultimate bound can be made arbitrarily small, by increasing the gains  λ, kcor with more training points to decrease  σN(·). Computing the set B allows to check whether the controller (14) adheres to the safety requirements.

We evaluate our theoretical results in two simulations.2 In Section 5.1, we investigate the effect of applying Theorem 3.2 to determine a probabilistic Lipschitz constant for an unknown synthetic

image

Figure 1: Snapshots of the state trajectory (blue) as it approaches the desired trajectory (green). In low uncertainty areas (yellow background), the set B (red) is significantly smaller then in high uncertainty areas (blue background).

image

Figure 2: When the ultimate bound (red) is large, the tracking error (blue) increases due to the less precise model.

system. Furthermore, we analyze the effect of unevenly distributed training samples on the tracking error bound from Theorem 4.1. In Section 5.2, we apply the feedback linearizing controller (14) to a tracking problem with a robotic manipulator.

5.1 Synthetic System with Unknown Lipschitz Constant Lf

As an example for a system of form (13), we consider  f(x) = 1 − sin(x1) + 11+exp(−x2). Based on a uniform grid over  [0 3]×[−3 3]the training set is formed of 81 points with  σ2n = 0.01. The reference trajectory is a circle  xd(t) = 2 sin(t)and the controller gains are  kc = 2and  λ = 1. We choose a probability of failure  δ = 0.01, δL = 0.01and set  τ = 10−8. The state space is the rectangle X = [−6 4] × [−4 4]. A squared exponential kernel with automatic relevance determination is utilized, for which  Lk and maxx,x′∈X k(x, x′)is derived analytically for the optimized hyperparameters. We make use of Theorem 3.2 to estimate the Lipschitz constant  Lf, and it turns out to be a conservative bound (factor  10 ∼ 100). However, this is not crucial, because  τcan be chosen arbitrarily small and  γ(τ)is dominated by�β(τ)ωσN (τ). As Theorems 3.1 and 3.2 are subsequently utilized in this example, a union bound approximation can be applied to combine  δ and δL.

The results are shown in Figs. 1 and 2. Both plots show, that the safety bound here is rather conservative, which also results from the fact that the violation probability was set to 1%.

5.2 Robotic Manipulator with 2 Degrees of Freedom

We consider a planar robotic manipulator in the  z1-z2-plane with 2 degrees of freedom (DoFs), with unit length and unit masses / inertia for all links. For this example, we consider  Lfto be known and extend Theorem 3.1 to the multidimensional case using the union bound. The state space is here four dimensional  [q1 ˙q1 q2 ˙q2]and we consider  X = [−π π]4. The 81training points are distributed

image

Figure 3: The task space of the robot (left) shows the robot is guaranteed to remain in B (red) after a transient phase. Hence, the remaining state space X \ B (green) can be considered as safe. The joint angles and velocities (right) converge to the desired trajectories (dashed lines) over time.

in  [−1 1]4and the control gain is  kc = 7, while other constants remain the same as in Section 5.1. The desired trajectories for both joints are again sinusoidal as shown in Fig. 3 on the right side. The robot dynamics are derived according to [41, Chapter 4].

Theorem 3.1 allows to derive an error bound in the joint space of the robot according to Theorem 4.1, which can be transformed into the task space as shown in Fig. 3 on the left. Thus, based on the learned (initially unknown) dynamics, it can be guaranteed, that the robot will not leave the depicted area and can thereby be considered as safe.

Previous error bounds for GPs are not applicable to this practical setting, because they i) do not allow the observation noise on the training data to be Gaussian [8], which is a common assumption in robotics, ii) utilize constants which cannot be computed efficiently (e.g. maximal information gain in [42]) or iii) make assumptions difficult to verify in practice (e.g. the RKHS norm of the unknown dynamical system [6]).

This paper presents a novel uniform error bound for Gaussian process regression. By exploiting the inherent probability distribution of Gaussian processes instead of the reproducing kernel Hilbert space attached to the covariance kernel, a wider class of functions can be considered. Furthermore, we demonstrate how probabilistic Lipschitz constants can be estimated from the GP distribution and derive sufficient conditions to reach arbitrarily small uniform error bounds. We employ the derived results to show safety bounds for a tracking control algorithm and evaluate them in simulation for a robotic manipulator.

Armin Lederer gratefully acknowledges financial support from the German Academic Scholarship Foundation.

[1] P. M. Nørgård, O. Ravn, N. K. Poulsen, and L. K. Hansen, Neural Networks for Modelling and Control of Dynamical Systems - A Practicioner’s Handbook. London: Springer, 2000.

[2] M. P. Deisenroth, “A Survey on Policy Search for Robotics,” Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1–142, 2013.

[3] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, F. Mujica, A. Coates, and A. Y. Ng, “An Empirical

Evaluation of Deep Learning on Highway Driving,” pp. 1–7, 2015. [Online]. Available: http://arxiv.org/abs/1504.01716

[4] J. Umlauft, Y. Fanger, and S. Hirche, “Bayesian Uncertainty Modeling for Programming by Demonstration,” in Proceedings of the IEEE Conference on Robotics and Automation, 2017, pp. 6428–6434.

[5] T. Beckers, D. Kuli´c, and S. Hirche, “Stable Gaussian Process based Tracking Control of Euler–Lagrange Systems,” Automatica, vol. 103, no. 23, pp. 390–397, 2019.

[6] F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe Learning of Regions of Attraction for Uncertain, Nonlinear Systems with Gaussian Processes,” in Proceedings of the IEEE Conference on Decision and Control, 2016, pp. 4661–4666.

[7] Y. Fanger, J. Umlauft, and S. Hirche, “Gaussian Processes for Dynamic Movement Primitives with Application in Knowledge-based Cooperation,” in Proceedings of the IEEE Conference on Intelligent Robots and Systems, 2016, pp. 3913–3919.

[8] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, 2012.

[9] S. R. Chowdhury and A. Gopalan, “On Kernelized Multi-armed Bandits,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 844–853.

[10] J. Umlauft, L. Pöhler, and S. Hirche, “An Uncertainty-Based Control Lyapunov Approach for Control-Affine Systems Modeled by Gaussian Process,” IEEE Control Systems Letters, vol. 2, no. 3, pp. 483–488, 2018.

[11] J. Umlauft, T. Beckers, and S. Hirche, “Scenario-based Optimal Control for Gaussian Process State Space Models,” in Proceedings of the European Control Conference, 2018.

[12] H. K. Khalil, Nonlinear Systems; 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 2002.

[13] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: The MIT Press, 2006.

[14] H. Wendland, Scattered Data Approximation. Cambridge University Press, 2004.

[15] Z. M. Wu and R. Schaback, “Local Error Estimates for Radial Basis Function Interpolation of Scattered Data,” IMA Journal of Numerical Analysis, vol. 13, no. 1, pp. 13–27, 1993.

[16] R. Schaback, “Improved Error Bounds for Scattered Data Interpolation by Radial Basis Functions,” Mathematics of Computation, vol. 68, no. 225, pp. 201–217, 2002.

[17] M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur, “Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences,” pp. 1–64, 2018. [Online]. Available: http://arxiv.org/abs/1807.02582

[18] S. Hubbert and T. M. Morton, “Lp-Error Estimates for Radial Basis Function Interpolation on the Sphere,” Journal of Approximation Theory, vol. 129, no. 1, pp. 58–77, 2004.

[19] F. J. Narcowich, J. D. Ward, and H. Wendland, “Sobolev Error Estimates and a Bernstein Inequality for Scattered Data Interpolation via Radial Basis Functions,” Constructive Approximation, vol. 24, no. 2, pp. 175–186, 2006.

[20] R. Beatson, O. Davydov, and J. Levesley, “Error Bounds for Anisotropic RBF Interpolation,” Journal of Approximation Theory, vol. 162, no. 3, pp. 512–527, 2010.

[21] A. M. Stuart and A. L. Teckentrup, “Posterior Consistency for Gaussian Process Approximations of Bayesian Posterior Distributions,” Mathematics of Computation, vol. 87, no. 310, pp. 721– 753, 2018.

[22] S. Mendelson, “Improving the Sample Complexity using Global Data,” IEEE Transactions on Information Theory, vol. 48, no. 7, pp. 1977–1991, 2002.

[23] T. Zhang, “Learning Bounds for Kernel Regression using Effective Data Dimensionality,” Neural Computation, vol. 17, no. 9, pp. 2077–2098, 2005.

[24] C. Cortes, M. Mohri, and A. Talwalkar, “On the Impact of Kernel Approximation on Learning Accuracy,” Proceedings of 13th International Conference on Artificial Intelligece and Statistics, vol. 9, pp. 113–120, 2010.

[25] L. Shi, “Learning Theory Estimates for Coefficient-based Regularized Regression,” Applied and Computational Harmonic Analysis, vol. 34, no. 2, pp. 252–265, 2013.

[26] L. H. Dicker, D. P. Foster, and D. Hsu, “Kernel Ridge vs. Principal Component Regression: Minimax Bounds and the Qualification of Regularization Operators,” Electronic Journal of Statistics, vol. 11, no. 1, pp. 1022–1047, 2017.

[27] F. Berkenkamp, A. P. Schoellig, M. Turchetta, and A. Krause, “Safe Model-based Reinforcement Learning with Stability Guarantees,” in Advances in Neural Information Processing Systems, 2017.

[28] A. van der Vaart and H. van Zanten, “Information Rates of Nonparametric Gaussian Process Methods,” Journal of Machine Learning Research, vol. 12, pp. 2095–2119, 2011.

[29] R. Adler and J. Taylor, Random Fields and Geometry. Springer Science & Business Media, 2007.

[30] W. Wang, R. Tuo, and C. F. J. Wu, “On Prediction Properties of Kriging: Uniform Error Bounds and Robustness,” Journal of the American Statistical Society, pp. 1–38, 2019.

[31] J. Mercer, “Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, no. 441-458, pp. 415–446, 1909.

[32] I. Steinwart, “On the Influence of the Kernel on the Consistency of Support Vector Machines,” Journal of Machine Learning Research, vol. 2, pp. 67–93, 2001.

[33] S. Ghosal and A. Roy, “Posterior Consistency of Gaussian Process Prior for Nonparametric Binary Regression,” The Annals of Statistics, vol. 34, no. 5, pp. 2413–2429, 2006.

[34] R. M. Dudley, “The Sizes of Compact Subsets of Hilbert Space and Continuity of Gaussian Processes,” Journal of Functional Analysis, vol. 1, no. 3, pp. 290–330, 1967.

[35] M. Talagrand, “Sharper Bounds for Gaussian and Empirical Processes,” The Annals of Probability, vol. 22, no. 1, pp. 28–76, 1994.

[36] J. González, Z. Dai, P. Hennig, and N. D. Lawrence, “Batch Bayesian Optimization via Local Penalization,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2016, pp. 648–657.

[37] B. Laurent and P. Massart, “Adaptive Estimation of a Quadratic Functional by Model Selection,” The Annals of Statistics, vol. 28, no. 5, pp. 1302–1338, 2000.

[38] A. Lederer, J. Umlauft, and S. Hirche, “Posterior Variance Analysis of Gaussian Processes with Application to Average Learning Curves,” 2019. [Online]. Available: http://arxiv.org/abs/1906.01404

[39] J. Umlauft, T. Beckers, M. Kimmel, and S. Hirche, “Feedback Linearization using Gaussian Processes,” in Proceedings of the IEEE Conference on Decision and Control, 2017, pp. 5249– 5255.

[40] J. Umlauft, A. Lederer, and S. Hirche, “Learning Stable Gaussian Process State Space Models,” in Proceedings of the American Control Conference, 2017, pp. 1499–1504.

[41] R. M. Murray, Z. Li, and S. Shankar Sastry, A Mathematical Introduction to Robotic Manipulation. CRC Press, 1994.

[42] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design,” in Proceedings of the International Conference on Machine Learning, 2010, pp. 1015–1022.

[43] S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor, “Regret Bounds for Gaussian Process Bandit Problems,” Journal of Machine Learning Research, vol. 9, pp. 273–280, 2010.

Proof of Theorem 3.1. We first prove the Lipschitz constant of the posterior mean  νN(x)and the modulus of continuity of the standard deviation  σN(x), before we derive the bound of the regression error. The norm of the difference between the posterior mean  νN(x)evaluated at two different points is given by

image

with

image

Due to the Cauchy-Schwarz inequality and the Lipschitz continuity of the kernel we obtain

image

which proves Lipschitz continuity of the mean  νN(x). In order to calculate a modulus of continuity for the posterior standard deviation  σN(x)observe that the difference of the variance at two points x, x′ ∈ Xcan be expressed as

image

Since the standard deviation is positive semidefinite we have

image

and hence, we obtain

image

Therefore, it is sufficient to bound the difference of the variance at two points  x, x′ ∈ Xand take the square root of the resulting expression. Due to the Cauchy-Schwarz inequality and Lipschitz continuity of  k(·, ·)the absolute value of the difference of the variance can be bounded by

image

On the one hand, we have

image

due to Lipschitz continuity of  k(x, x′). On the other hand we have

image

The modulus of continuity  ωσN (τ)follows from substituting (23) and (24) in (22) and taking the square root of the resulting expression. Finally, we prove the probabilistic uniform error bound by exploiting the fact that for every grid  Xτ with |Xτ|grid points and

image

it holds with probability of at least  1 − |Xτ|e−β(τ)/2 that [8]

image

Choose  β(τ) = 2 log�|Xτ |δ �, then

image

holds with probability of at least  1 − δ. Due to continuity of  f(x), νN(x) and σN(x) we obtain

image

Moreover, the minimum number of grid points satisfying (25) is given by the covering number M(τ, X). Hence, we obtain

image

where

image

In order to proof Theorem 3.2, several auxiliary results are necessary, which are derived in the following. The first lemma concerns the expected supremum of a Gaussian process.

Lemma B.1. Consider a Gaussian process with a continuously differentiable covariance function  k(·, ·)and let  Lkdenote its Lipschitz constant on the set X with maximum extension r = maxx,x′∈X ∥x − x′∥. Then, the expected supremum of a sample function f(x) of this Gaussian process satisfies

image

Proof. We prove this lemma by making use of the metric entropy criterion for the sample continuity of some version of a Gaussian process [34]. This criterion allows to bound the expected supremum of a sample function f(x) by

image

where  N(ϱ, X) is the ϱ-packing number of X with respect to the covariance pseudo-metric

image

Instead of bounding the  ϱ-packing number, we bound the  ϱ/2-covering number, which is known to be an upper bound. The covering number can be easily bounded by transforming the problem of covering X with respect to the pseudo-metric  dk(·, ·)into a coverage problem in the original metric of X. For this reason, define

image

which is continuous due to the continuity of the covariance kernel  k(·, ·). Consider the inverse function

image

Continuity of  ψ(·)implies  ϱ = ψ(ψ−1(ϱ)). In particular, this means that we can guarantee dk(x, x′) ≤ ϱ2if  ∥x − x′∥ ≤ ψ−1( ϱ2). Due to this relationship it is sufficient to construct an uniform grid with grid constant  2ψ−1( ϱ2)in order to obtain a  ϱ/2-covering net of X. Furthermore, the cardinality of this grid is an upper bound for the  ϱ/2-covering number, i.e.

image

Therefore, it follows that

image

Due to the Lipschitz continuity of the covariance function, we can bound  ψ(·) by

image

Hence, the inverse function satisfies

image

and consequently

image

holds, where the ceil operator is resolved through the addition of 1. Substituting this expression in the metric entropy bound (35) yields

image

As shown in [43] this integral can be bounded by

image

which proves the lemma.

Based on the expected supremum of Gaussian process it is possible to derive a high probability bound for the supremum of a sample function.

Lemma B.2. Consider a Gaussian process with a continuously differentiable covariance function  k(·, ·)and let  Lkdenote its Lipschitz constant on the set X with maximum extension r = maxx,x′∈X ∥x − x′∥. Then, with probability of at least  1 − δLthe supremum of a sample function f(x) of this Gaussian process is bounded by

image

Proof. We prove this lemma by exploiting the wide theory of concentration inequalities to derive a bound for the supremum of the sample function f(x). We apply the Borell-TIS inequality [35]

image

Due to Lemma B.1 we have

image

Finally, we exploit the fact that the derivative of a sample function is a sample function from another Gaussian process to prove the high probability Lipschitz constant in Theorem 3.2.

Proof of Theorem 3.2. Continuity of the sample function f(x) follows directly from [33, Theorem 5]. Furthermore, this theorem guarantees that the derivative functions ∂∂xi f(x)are samples from derivative Gaussian processes with covariance functions

image

Therefore, we can apply Lemma B.2 to each of the derivative processes and obtain with probability of at least  1 − δLd

image

where

image

and  L∂ik is the Lipschitz constant of derivative kernel  k∂i(x, x′). Applying the union bound over all partial derivative processes i = 1, . . . , d finally yields the result.

Proof of Theorem 3.3. Due to Theorem 3.1 with  βN(τ) = 2 log�M(τ,X)π2N 23δ �and the union bound over all N > 0 it follows that

image

with probability of at least  1 − δ/2. A trivial bound for the covering number can be obtained by considering a uniform grid over the cube containing X. This approach leads to

image

where  r = maxx,x′∈X ∥x − x′∥. Therefore, we have

image

Furthermore, the Lipschitz constant  LνNis bounded by

image

due to Theorem 3.1. Since the Gram matrix  K(XN, XN)is positive semidefinite and  f(·)is bounded by ¯f, we can bound��(K(XN, XN) + σ2nIN)−1yN�� by

image

where  ξNis a vector of N i.i.d. zero mean Gaussian random variables with variance  σ2n. Therefore, it follows that ∥ξN∥2σ2n ∼ χ2N. Due to [37], with probability of at least  1 − exp(−ηN) we have

image

image

with probability of at least  1 − δ/2. Hence, the Lipschitz constant of the posterior mean function νN(·)satisfies with probability of at least  1 − δ/2

image

Since  ηNgrows logarithmically with the number of training samples N, it holds that  LνN ∈ O(N)with probability of at least  1 − δ/2. The modulus of continuity  ωσN (·)of the posterior standard deviation can be bounded by

image

because  ∥(K(XN, XN) + σ2nIN)−1∥ ≤ 1σ2n . Due to the union bound (52) holds with probability of at least  1 − δ with

image

This function must converge to  0 for N → ∞in order to guarantee a vanishing regression error. This is only ensured if  τ(N)decreases faster than  O((N log(N))−1). Therefore, set  τ(N) ∈ O(N −2) inorder to guarantee

image

However, this choice of  τ(N)implies that  βN(τ(N)) ∈ O(log(N))due to (54). Since there exists an  ϵ > 0 such that σN(x) ∈ O�log(N)− 12 −ϵ�, ∀x ∈ Xby assumption, we have

image

which concludes the proof.

Lyapunov theory provides the following statement [12].

Lemma D.1. A dynamical system  ˙x = f(x, u)is globally ultimately bounded to a set  B ⊂ X, containing the origin, if there exists a positive definite (so called Lyapunov) function,  V : X → R+,0,for which ˙V (x) < 0, for all x ∈ X \ B.

This allows to proof Theorem 4.1 as following.

Proof of Theorem 4.1. Consider the Lyapunov function  V (x) = 12r2

˙V (x) = ∂V∂r ˙r = r�f(x) − ˆf(x) − kcr�≤ |r||f(x) − νN(x)| − kc|r|2 ≤ 0 ∀|r| > f(x) − νN(x)kcBased on Theorem 3.1, the model error is bounded with high probability, which allows to conclude

image

The global ultimate boundedness of the closed-loop system, is thereby shown according to Lemma D.1.

image

Simulations are performed in MATLAB 2019a on a i5-6200U CPU with 2.3GHz and 8GB RAM. The simulation in Sec. 5.1 took 77s and used 1 MB of workspace memory. The simulation in Sec. 5.2 took 39s and used 134 MB of workspace memory. The code is available as supplementary material.


Designed for Accessibility and to further Open Science