b

DiscoverSearch
About
My stuff
Online Constrained Model-based Reinforcement Learning
2020·arXiv
Abstract
Abstract

Applying reinforcement learning to robotic systems poses a number of challenging problems. A key requirement is the ability to handle continuous state and action spaces while remaining within a limited time and resource budget. Additionally, for safe operation, the system must make robust decisions under hard constraints. To address these challenges, we propose a model based approach that combines Gaussian Process regression and Receding Horizon Control. Using sparse spectrum Gaussian Processes, we extend previous work by updating the dynamics model incrementally from a stream of sensory data. This results in an agent that can learn and plan in real-time under non-linear constraints. We test our approach on a cart pole swing-up environment and demonstrate the benefits of online learning on an autonomous racing task. The environment’s dynamics are learned from limited training data and can be reused in new task instances without retraining.

Reinforcement learning has become an important approach to the planning and control of autonomous agents in complex environments. However, recent interest in reinforcement learning is yet to be reflected in robotics applications; possibly due to their specific challenges.

In robotic systems, reinforcement learning methods must deal with continuous, and potentially high-dimensional, state and control spaces. For example, the dynamics of autonomous vehicles are most naturally described in terms of continuous variables like position, velocity, orientation and steering angle. Data efficiency is also critical since collecting experience, or evaluating control policies, with a robot in the loop can be costly and time consuming. Poorly modeled dynamics, due to a slow learning rate, may result in unstable trajectories and dangerous collisions. The system also needs to handle physical constraints — from joint and actuator limitations to the boundaries defined by obstacles. Finally, in fastpaced applications rapid decisions, requiring real-time planning, must be made.

In this paper we propose a method which tackles all of the above challenges simultaneously. Our method achieves this through the marriage of two key components:

1. a learned dynamics model based on sparse spectrum Gaussian processes (GPs), and

2. a planner based on receding horizon control (RHC), and the structure exploiting interior point solver FORCES (Domahidi and Jerez, 2014).

Sparse spectrum GPs allow us to update the learned model online and the RHC framework provides the ability plan in real-time while naturally handling constraints. This means that we can learn and plan online in constrained environments where data collection is expensive. We believe that these are important features for real world applications of reinforcement learning, particularly in safety critical systems. Our proposed method will be referred to as Gaussian Process-Receding Horizon Control (GP-RHC hereafter).

The remainder of the paper is structured as follows. In section 2 we provide an overview of related approaches in model-based reinforcement learning. The receding horizon control framework is presented in section 3. This is followed, in section 4, by a discussion on the application of Gaussian process regression to model learning. In section 5 we apply GP-RHC to a cart pole swing-up environment and a challenging autonomous racing task. The results demonstrate that (a) models can be learned quickly from limited data, (b) complex non-linear constraints can be handled in real-time, and (c) online updates improve the rate of learning and result in more consistent performance.

GP-RHC falls into the class of model-based reinforcement learning methods. These are generally the methods of choice in robotics primarily due to their impressive data efficiency (Kober et al., 2013). Model-based approaches can be broadly classified according to (a) how the dynamics model is learned and, (b) the choice of planner.

PILCO (Deisenroth and Rasmussen, 2011), for example, combines policy search for planning with a Gaussian process model of the dynamics. The policy search relies on analytic gradients of closed form solutions to the long term expected cost. This requires the cost function and policy to take specific functional forms, making it diffi-cult or impossible to incorporate general constraints. In a similar vein, Kim et al. (2004) use policy search with locally weighted linear regression to learn a controller for helicopter flight.

As an alternative, trajectory optimization based on differential dynamic programming is often used for planning. Methods such as PDDP (Pan and Theodorou, 2014) and AGP-iLQR (Boedecker et al., 2014) make use of this idea by combining dynamics models learned by locally weighted projection regression with either an iterative linear quadratic regulator or Gaussian as the planner. These planners can take simple box constraints into account but cannot handle general non-linear constraints.

Our work is most closely related to the RL-RCO method (Andersson et al., 2015) which leverages sparse Gaussian processes for learning the dynamics and trajectory optimization based sequential quadratic programming. We improve upon RL-RCO by proposing an approach which allows the dynamics model to be updated online as the agent interacts with the environment. Furthermore, in contrast to the work of Andersson et al. (2015), we present results that highlight the ability to handle non-linear constraints.

We show how these enhancements improve data effi-ciency, learning rate and constraint handling, rendering GP-RHC overall more applicable to realistic scenarios.

In this paper, we consider an agent operating in a continuous environment described by a set of differential equations,  ˙x = f(x, u), where  x ∈ Rnrepresents the agent’s state and  u ∈ Rmthe control signal. The objective of the agent is to select a control u to minimize a cost function (1a), while conforming to the system dynamics, and additional constraints (1b)-(1d).

We address this overall problem using receding horizon control where the idea is to plan over a finite horizon by iteratively solving an optimization problem. Given a measurement of the current state  ˆx0, a trajectory through the state-control space is calculated, and the first step of the control signal is applied to the system. The horizon is then shifted forward and the process repeats. Formally, at each time step the agent must minimize the cost function

image

where  t0is the current time, T denotes the length of the planning horizon, L is an intermediate cost function and h is a terminal cost function. In order to ensure that the trajectory is feasible, the optimization is subject to the constraints:

image

Inequalities (1b) and (1c) specify box constraints on the control and state spaces (described by upper and lower bounds), and can represent operational limits on actuators or joints. On the other hand, (1d) can express non-linear, non-convex constraints through the function g. This can be used, for example, to define road boundaries or obstacles in an autonomous driving task.

The receding horizon formulation differs from other reinforcement learning approaches in a few important ways. First, instead of explicitly maintaining a representation of a policy or value function, the control is recalculated (over the shifting horizon) at each time step. This has the advantage of not requiring a representation for the entire problem a priori, but incurs the additional computational burden of repeatedly replanning. Second, hard constraints can be explicitly specified and handled through (1d). This allows agents to safely learn by avoiding dangerous regions of the state and control spaces.

Without some simplifying assumptions we cannot solve problem (1) directly. However, efficient approximate solutions can be found by following the “first discretize, then optimize” approach described in the sections below.

3.1 DIRECT MULTIPLE SHOOTING

Using direct multiple shooting (Bock and Plitt, 1984), problem (1) can be transformed into a structured non-linear program (NLP). First, the time horizon  [t0, t0 +T] is partitioned into N equal subintervals  [tk, tk+1]for k = 0, . . . , N − 1. Then, taking a piecewise constant approximation of the control signal over each interval, a sequence of initial value problems,

image

can be set up for the state trajectory. Here, the variables xkhave been added as initial values. Each of these problems can then be integrated to obtain a discretized trajectory. However, in order to enforce continuity over the planning horizon, matching constrains,

image

are placed on  xkat the boundary of each subinterval. The functions  Fkrepresent the solutions to the initial value problems (2) at time  tk+1.

Finally, the cost function is discretized over each time interval, resulting in the following NLP:

image

subject to

image

In principle, any non-linear programming method can be used to solve the above problem (Nocedal and Wright, 2006). However, the computational burden of solving an NLP at each time step is a severe limitation in real-time applications. To address this issue, there has been much interest in efficient optimization for receding horizon control (Domahidi et al., 2012; Vukov et al., 2013) . A particularly successful approach has been the combination of sequential quadratic programming (SQP) with structure exploiting stage-wise solvers (Kouzoupis et al., 2015).

3.2 SEQUENTIAL QUADRATIC PROGRAMMING

In the SQP framework, the NLP (3) is linearized about a given nominal trajectory  w = [x0, u0, . . . , xN]. The trajectory can then be improved according to the update (Nocedal and Wright, 2006),

image

where  αis the step size and  ∆w∗denotes the solution to the following quadratic program:

image

subject to the constraints

image

The objective function (5a) is a quadratic approximation of the discretized cost function (3a) and the constraints correspond to linearizations of (3b) to (3f).

Provided that  Qkis positive semidefinite and  Rkis strictly positive definite, the trajectory can be shown to converge to a local optimum by iteratively linearizing and optimizing (Nocedal and Wright, 2006). This is guaranteed for least squares cost functions popular in tracking and stabilization tasks. When a more general cost function is desired a positive definite approximation to the Hessian can be obtained using BFGS updates.

In this paper we use FORCES (Domahidi and Jerez, 2014) as the stage-wise solver for the quadratic program (5). FORCES is an interior-point method tailored for problems arising in RHC. In particular, the block diagonal structure of the Hessian (5a) and the fact that the states are only directly coupled to the previous time step are exploited to give linear computational complexity in the planning horizon.

It is often unnecessary to iterate to convergence before a reasonable improvement is found. In real-time settings this is important because we need to maintain a balance between efficiency and accuracy. An example of this trade-off can be seen in figure 1. Given an initial trajectory, the car must maximize its progress along the race track.

image

Figure 1: An example of planned trajectories in an au- tonomous racing task. The agent’s objective is to maximize its progress along the racing track over the planning horizon. Note how early termination yields a solution trajectory that is still near-optimal over a short horizon.

The complete algorithm is presented in algorithm 1. In the next section we discuss how sparse online Gaussian process can be used to learn the non-linear dynamics model (3c).

Typically, trajectory optimization based on receding horizon control relies on an analytic model of the system’s dynamics. However, in the setting of reinforcement learning the dynamics are unknown and must be learned by interacting with the environment. In order to effectively plan over the horizon, an accurate model of the system’s dynamics needs to be learned quickly.

Gaussian process (GP) regression is a non-parametric, Bayesian approach to model learning that has demonstrated impressive data-efficiency on both simulated and real robotic systems (Deisenroth and Rasmussen, 2011). Unfortunately, GPs do not scale well to large data sets, limiting their applicability in practice. There are, however, a number of approximation schemes that signifi-cantly reduce computational costs (Qui˜nonero-Candela and Rasmussen, 2005). In this paper, we use a sparse spectrum approximation of the kernel function (L´azaro- Gredilla et al., 2010). Sparse spectrum GPs were chosen

Data: number of features D, initial training data foreach episode do

train GPs on accumulated data (minimize (11)); linearise dynamics and constraints about  x0; solve (5) until convergence; while not terminal do

image

because (a) online data can be incorporated through in- cremental updates (Gijsberts and Metta, 2013), and (b) the learned model can be efficiently linearized to form the sequential quadratic programs.

4.1 GAUSSIAN PROCESS REGRESSION

Given an input set of N state-control pairs  ˜x = (x, u) ∈Rn+mand the resulting (possibly noisy) state transitions ∆x ∈ Rn, GP regression can be used to learn a model of the underlying dynamics. Following Deisenroth and Rasmussen (2011), we train conditionally independent GPs on each component of the state transition vector, so for the remainder of this section we will represent target data as  y ∈ R.

Formally, a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006). To completely specify a GP we need to choose a mean function  m(˜x)and a covariance function  k(˜x, ˜x′), parametrized by a set of hyperparameters. A common, but flexible choice for the covariance function is the squared exponential kernel

image

where the signal variance  σ2and characteristic length scales  liconstitute the kernel’s set of hyperparameters. When modelling noisy targets, an additive noise term with variance  σ2nis included in the covariance function.

Since we model the relative rather than absolute state transitions, we choose a zero mean prior. This means that in the absence of data, the state is expected to re-

main unchanged regardless of the control input.

After defining the input matrix  X = [˜x1, . . . , ˜xN]and the corresponding target vector  y = [y1, . . . , yN]⊺, the posterior predictive distribution for a set of test points, X∗, is a multivariate Gaussian with mean

image

and covariance

image

where  Q = K(X, X) + σ2nI, and K(X, X) denotes the matrix of covariances evaluated at all pairs of input points. Computing the predictive mean and covariance are O(N) and  O(N 2)respectively.

A common approach for learning the hyperparameters and the noise variance  σ2nis to maximize the marginal likelihood (Rasmussen and Williams, 2006). The negative log marginal likelihood, given by

image

can be minimized using gradient based optimizers. Since Q needs to be inverted each time the log marginal likelihood is evaluated, which is an  O(N 3)operation in general, hyperparameter inference represents the main bottleneck for GP regression.

4.2 SPARSE SPECTRUM APPROXIMATION

In this section, we assume that the covariance function is stationary, i.e. that  k(˜x, ˜x′)is a function of  r = ˜x − ˜x′. In this case, Bochner’s theorem (Rudin, 2011) states that k(r) can be represented as the Fourier transform,

image

of a positive finite measure  µ. If  µ(ω)has a density, then it is called the power spectrum  S(ω)of the covariance function and, by the Wiener-Khintchine theorem (Carl- son et al., 2009),  S(ω)is the Fourier dual of k(r). In particular, this means that S is proportional to some probability measure p over  Rn+m, and so equation (9) can be rewritten as an expectation

image

where  αis the constant of proportionality.

To approximate the expectation, we draw D sample frequencies  ω1, . . . , ωDfrom p and take averages. Since the power spectrum is symmetric about zero, we also include  ω−j = −ωjfor each sample frequency, in order to guarantee that k(r) is real valued for all r. This results in the sparse spectrum approximation (L´azaro-Gredilla et al., 2010) of the covariance function:

image

In the particular case of the squared exponential kernel  α2 = σ2, and the frequencies are drawn from the normal distribution  N(0, Λ−1), where  Λ =diag([l21, . . . , l2n+m]).

For convenience, we define the feature mapping  φ :Rn+m → R2Dby,

image

In order to make use of the approximation (10), the matrix inversion lemma is applied to equations (6) and (7), giving

image

where  A = φ(X)⊺φ(X) + σ2nI, and  φ(X)is the matrix obtained by applying  φto each column of X. Instead of inverting the  N × Nmatrix Q, we now only require the inverse of the  2D × 2Dmatrix A, which constitutes a significant saving in computational cost if  D ≪ N. Importantly, the size of A is independent of the number of training points, which makes it amenable to incremental updates (Gijsberts and Metta, 2013).

Applying the same idea to the negative log likelihood (8) gives the expression

image

Again, the smaller size of A results in reduced computational complexity for hyperparameter inference. In particular, each step of the gradient based optimization is O(ND2).

4.3 INCREMENTAL UPDATES

To incrementally handle a stream of data, the matrix A and the vector  b = Φ⊺yneed to be updated in real time. Given a new sample,  (˜x, y), the updates are computed according to the rules:

image

Since A remains positive semidefinite after each update, we do not need to store it explicitly. Instead, we can keep track of its upper triangular Cholesky factor. This allows us to make use of fast, numerically stable rank-1 Cholesky updates (Gijsberts and Metta, 2013).

In the following sections we evaluate the performance of GP-RHC on a cart-pole swing up task and an autonomous racing scenario. In all the experiments GPRHC was able to run in real-time with appropriate choices of the sparsity D and planning horizon N. The choice of D essentially involves a trade-off between model accuracy and the computational costs of both the learning and the prediction routines. Empirically, we found that a value between 20 and 100 allowed us to learn a sufficiently accurate dynamics model while remaining within the real-time constraints. Similarly, the choice of N involves a balance between computational costs and the quality of the controller. Ideally, one would choose a large value of N so that the controller can optimally react to future changes in dynamics or constraints.

5.1 CARTPOLE SWINGUP

Cartpole experiments are a common benchmark in both reinforcement learning and control theory (Kober et al., 2013). The basic set-up consists of a cart with an attached pendulum running along a track. In the swing-up task, the pendulum is initially pointing downwards and the objective is to apply horizontal forces to the cart in order to swing the pendulum up and balance it above the cart in the center of the track. This is a relatively difficult control problem as the dynamics are fairly non-linear. Additionally, a long planning horizon is required because the cart must be pushed back and forth in order to develop enough momentum to swing the pendulum up.

The state of the system,  x = [x, v, θ, ω], is described by the position of the cart, the velocity of the cart, the angle of the pendulum and its angular velocity. A horizontal force u in the range of  −10N to 10N can be applied to the cart at a sampling rate of 0.025s. The cost function is given by a least squares objective penalizing the distance from the set point  [0, 0, π, 0].

In the first experiment we compare GP-RHC against PILCO (Deisenroth and Rasmussen, 2011) and the ground-truth analytical model. To initialize each trial, 80 data points were collected from an episode with random control inputs. For the sparse spectrum approximation 50 sample frequencies were drawn and a planning horizon of 50 time steps was used. After each episode the GP models were retrained on all the preceding data using 10 random restarts to avoid poor local minima. PILCO can potentially have problems with least squares costs (Deisenroth, 2010) so we used (the preferred) saturating cost function and post-processed the results. Figure 2 shows that GP-RHC is competitive with PILCO both in terms of sample efficiency and overall performance.

image

Figure 2: Median cost of the unconstrained cartpole swing-up task (using the saturated cost function). The lower and upper bounds of the confidence envelope represents the first and third quantiles respectively. The results were aggregated over 10 runs.

In the second experiment the length of the track is limited; constraining the cart’s position to between  −2m and 2m. Since PILCO does not handle constraints we just compare GP-RHC to the analytical model. The GP models were initialized with 20 points of training data collected from an episode with random inputs.

As seen in figures 2 and 3, GP-RHC learns to solve the cartpole swing-up task in just a few episodes. In particular, by updating the dynamics model online we can achieve performance comparable to the optimal analytic model at least a full episode earlier. Another noticeable feature is the reduced variability in cost. GP-RHC, with online updates, performs more consistently and is less susceptible to variations in the initial training data. Finally, table 1 shows that fewer constraint violations can be expected when using online updates. In fact, after the first episode only a single constraint violation was encountered, in contrast to the method without updates which incurred significantly more violations. These results indicate that GP-RHC enables fast learning and planning in safety critical conditions.

image

Figure 3: Median cost of the constrained cartpole swing- up task (using the saturated cost function). The lower and upper bounds of the confidence envelope represents the first and third quantiles respectively. The results were aggregated over 22 runs.

Table 1: Percentage of experiments terminated due to constraint violations

image

5.2 AUTONOMOUS RACING

In this section we apply GP-RHC to the autonomous racing of 1:43 scale remote control cars. Liniger et al. (2015) originally investigated this problem using an analytical model of the cars incorporated into a contouring control framework (see section 5.2.2). The objective is to maximize progress along the race course (depicted in figure 6) while remaining within the track boundaries.

5.2.1 BICYCLE MODEL

Following Liniger et al. (2015), the cars are modeled using a bicycle model. The cars are treated as rigid bodies and symmetry is used to approximate the pairs of front and back tyres as single wheels. Pitch and roll dynamics are neglected so only in-plane motion is considered. In our experiments the additional complexity of tyre dynamics is ignored by assuming a no-slip model.

The state space is described by the vector  [x, y, v, φ], where x and y denote the position of the car, v the longitudinal velocity of the car, and  φthe car’s orientation. The control signal consists of the PWM duty cycle of the electric drive train motor and the steering angle of the front wheels. The duty cycle is constrained to the interval [0, 1] and the steering angle cannot exceed 18 degrees.

5.2.2 CONTOURING CONTROL

Contouring control was originally designed for industrial applications like machine tool control and laser profil-ing (Lam et al., 2010). The objective of the controller is to track a given reference path while maximizing some measure of progress. Often these are competing interests and we need to find a balance between speed and tracking accuracy. In contrast to standard tracking approaches, the reference path is described only in terms of spatial coordinates. By specifying velocities and orientations the contouring controller is free to determine how the path is followed.

Here we assume that the reference path is given by an arc length parametrized curve

image

where l is the total length of the path. In our experiments, the center line of the race track is used as the reference path. To find an arc length parametrization, the center line is interpolated by a cubic spline, using the method described by Wang et al. (2002).

Let  pk = [xk, yk]denote the position of the car at time tk. Then, the contouring error,

image

is defined as the normal deviation from the path  γ, where s∗kis the value of the path parameter which minimizes the distance between the point  pkand the path, and  n(s∗k)is the unit normal to  γat  s∗k.

Calculating the contouring error requires us to determine the value of  s∗kat each point along the planned trajec- tory. This is too computationally intensive to use as a cost function in the iterative SQP framework. To address this issue, approximations to  s∗kat each point are intro- duced into the state. The dynamics are then augmented by the equation

image

where  skdenotes the approximation to  s∗kat time  tk, and vkis a virtual control input. Since the path is parameterized by arc length,  vkcan be thought of as the velocity of the car along the center line. For the auxiliary state  skto be a useful approximation we introduce a lag error term εldefined as the distance between the points  γ(s∗k)and γ(sk)along the reference path.

image

Figure 4: Contouring error  εc, lag error  εland their respective approximations  ˜εcand  ˜εl.

Since neither the contouring nor lag error can be used directly in the cost function, approximations defined only in terms of  pkand  skare made. The approximate contouring error  ˜εckand approximate lag error  ˜εlkare defined as the orthogonal and tangential component of the error between the points  pkand  γ(sk),

image

where  t(sk)is the unit tangent to  γat  sk. It is clear from figure 4 that  ˜εckapproaches  εck, and  skapproaches  s∗kas the lag error is reduced. Therefore, in order to get a good approximation of  s∗kthe lag error  ˜εlis heavily penalized in the cost function (13).

Using the approximate contouring and lag errors an intermediate cost function can be defined

image

The term  −α∆tvcan be thought of as a reward for progressing along the track and the weights  qcand  αrepresent the relative importance of fast progress and accurate path tracking.

5.2.3 TRACK CONSTRAINTS

To ensure that the car remains within the track, limits are placed on the x and y components of the car’s state. Each point on the planned trajectory is constrained to lie within two half spaces defined by the left and right track boundaries (see figure 5). The relevant tangent lines are found by projecting the point  γ(sk)(an approximation to the closest point on the center line) onto the track boundaries.

image

Figure 5: Track constraints (pink lines) corresponding to one point (pink dot) along the planned trajectory. The central dot is the closest point lying on the center line. Tangent lines are found by projecting this point onto the track boundaries.

The planned trajectory is often along the limit of these constraints so in order to avoid infeasibility problems in practice, the constraints are softened by adding slack variables. By penalizing the slack variables heavily in the cost function the original solution of the hard constrained problem is recovered where it would admit a solution.

5.2.4 RESULTS

Initially, 70 points of data were collected from a demonstrated trajectory around a simple oval track. 100 sample frequencies were chosen for the sparse spectrum approximation. A planning horizon of 20 time steps was used at a sampling rate of 0.03s. We found that we could plan in real time by limiting the number of SQP iterations to 30. The race track is shown in figure 6 along with the driven trajectory. The car initially starts at rest the point [0, 0] and the race is completed after one full lap. The track is 7.23m long (measured at the center line).

image

Figure 6: Driven trajectory of the racing car with veloc- ity profile given by GP-RHC. The model was initially trained on 70 points of data on a simple oval track, and was updated online during this task.

Table 2: Lap Times

image

Using a learned dynamics model, GP-RHC quickly accelerates from rest and drives trajectories that satisfy the complex track constraints (see figure 6). The sparse spectrum Gaussian processes are very data efficient, learning a sufficiently accurate model of the car’s dynamics from just 70 data points. In particular, the training data was collected by driving around a much simpler oval track yet the learned dynamics were able to generalize well enough to effectively navigate the sharp left turn in the new race course. Table 2 shows the lap times of GPRHC with and without online updates compared to the baseline analytic model. By updating the model online we improve the lap time by more than 0.1s (a relative improvement of about 3.86 percent). This represents a significant saving considering the high speeds and small length scales involved in the problem. In fact, naively scaling up the domain results in a 311m track with a lap time improvement of about 4.56 seconds. This corresponds to a 16.4km/h increase in average speed around the track when using online updates.

In addition to the racing task we also consider an obstacle avoidance problem depicted in figure 7. Static obstacles, represented by the blue cars, are included by adjusting the track constraints. Liniger et al. (2015) determine these adjustments using a high-level planner based on dynamic programming. In our experiments, we manually specify the obstacle constraints but a high level planner could be employed in principle. Since the dynamics are independent of the constraints and cost function, GPRHC can implicitly take advantage of all the information gained in the racing task by simply reusing the learned model with no further learning. This could be useful in safety critical tasks where costly trials can avoided by safely learning a model of the dynamics in a simpler, safer environment.

image

Figure 7: The driven trajectory in an obstacle avoidance task. The car is able to avoid dangerous collisions by planning around the obstacles.

In this paper we introduce GP-RHC for online learning and planning in continuous environments with non-linear constraints. This is achieved by combining receding horizon control for planning with data efficient sparse spectrum Gaussian processes for model learning. We show that incorporating online data results in faster convergence to optimal behaviour while significantly reducing the number of constraint violations during learning — an important feature for safety critical applications. We demonstrate our approach on a complex autonomous racing task, showing that GP-RHC enables learning from only few training points, and the ability to apply the learned model to new tasks with no additional training. This method provides a promising approach to deploying online reinforcement learning algorithms on complex systems such as robots.

In future work plan to apply GP-RHC to real robotic systems such as quadrotors or manipulators. To achieve this goal, a possible improvement to the current method would be the ability to handle input noise. GP regression methods typically assume that the training inputs are noise free. However, in real robotic systems, sensors and filtering algorithms can introduce noise into the state estimation. This issue could be addressed, by incorporating ideas from Mchutchon and Rasmussen (2011) for example, to make GP-RHC more robust.

We thank the reviewer for their helpful insights and feedback.

Olov Andersson, Fredrik Heintz, and Patrick Doherty. Model-based reinforcement learning in continuous environments using real-time constrained optimization. In Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI15), 2015.

Hans Georg Bock and Karl-Josef Plitt. A multiple shoot- ing algorithm for direct solution of optimal control problems. In Proceedings of the IFAC world congress, 1984.

Joschka Boedecker, Jost Tobias Springenberg, Jan W¨ulfing, and Martin Riedmiller. Approximate real-time optimal control based on sparse gaussian process models. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–8. IEEE, 2014.

A.B. Carlson, P. Crilly, and P.B. Crilly. Communication Systems. McGraw-Hill Education, 2009.

Marc Deisenroth and Carl E Rasmussen. Pilco: A model- based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.

Marc Peter Deisenroth. Efficient reinforcement learning using Gaussian processes. KIT Scientific Publishing, 2010.

Alexander Domahidi and Juan Jerez. FORCES Professional. embotech GmbH (http://embotech. com/FORCES-Pro), July 2014.

Alexander Domahidi, Aldo U Zgraggen, Melanie N Zeilinger, Manfred Morari, and Colin N Jones. Ef-ficient interior point methods for multistage problems arising in receding horizon control. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 668–674. IEEE, 2012.

Arjan Gijsberts and Giorgio Metta. Real-time model learning using incremental sparse spectrum gaussian process regression. Neural Networks, 41:59–69, 2013.

H. J. Kim, Michael I. Jordan, Shankar Sastry, and An- drew Y. Ng. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16, pages 799–806. MIT Press, 2004.

Jens Kober, J Andrew Bagnell, and Jan Peters. Rein- forcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.

D Kouzoupis, A Zanelli, Helfried Peyrl, and Hans Joachim Ferreau. Towards proper assessment of qp algorithms for embedded model predictive control. In Control Conference (ECC), 2015 European, pages 2609–2616. IEEE, 2015.

Denise Lam, Chris Manzie, and Malcolm Good. Model predictive contouring control. In Decision and Control (CDC), 2010 49th IEEE Conference on, pages 6137– 6142. IEEE, 2010.

Miguel L´azaro-Gredilla, Joaquin Qui˜nonero-Candela, Carl Edward Rasmussen, and An´ıbal R FigueirasVidal. Sparse spectrum gaussian process regression. The Journal of Machine Learning Research, 11:1865– 1881, 2010.

Alexander Liniger, Alexander Domahidi, and Manfred Morari. Optimization-based autonomous racing of 1:43 scale rc cars. Optimal Control Applications and Methods, 36(5):628–647, 2015.

Andrew Mchutchon and Carl E. Rasmussen. Gaussian process training with input noise. In Advances in Neural Information Processing Systems 24, pages 1341– 1349. Curran Associates, Inc., 2011.

J. Nocedal and S. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer New York, 2006.

Yunpeng Pan and Evangelos Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems, pages 1907– 1915, 2014.

Joaquin Qui˜nonero-Candela and Carl Edward Ras- mussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Carl Edward Rasmussen and Christopher K I Williams. Gaussian processes for machine learning. 2006.

Walter Rudin. Fourier analysis on groups. John Wiley & Sons, 2011.

Milan Vukov, Alexander Domahidi, Hans Joachim Fer- reau, Manfred Morari, and Moritz Diehl. Autogenerated algorithms for nonlinear model predictive control on long and on short horizons. In 52nd IEEE Conference on Decision and Control, pages 5113– 5118. IEEE, 2013.

Hongling Wang, Joseph Kearney, and Kendall Atkinson. Arc-length parameterized spline curves for real-time simulation. In In in Proc. 5th International Conference on Curves and Surfaces, pages 387–396, 2002.


Designed for Accessibility and to further Open Science