Meta-Learning Priors for Efficient Online Bayesian Regression

2018·Arxiv

Abstract

Abstract

Gaussian Process (GP) regression has seen widespread use in robotics due to its generality, simplicity of use, and the utility of Bayesian predictions. The predominant implementation of GP regression is a nonparameteric kernel-based approach, as it enables fitting of arbitrary nonlinear functions. However, this approach suffers from two main drawbacks: (1) it is computationally inefficient, as computation scales poorly with the number of samples; and (2) it can be data inefficient, as encoding prior knowledge that can aid the model through the choice of kernel and associated hyperparameters is often challenging and unintuitive. In this work, we propose ALPaCA, an algorithm for efficient Bayesian regression which addresses these issues. ALPaCA uses a dataset of sample functions to learn a domain-specific, finite-dimensional feature encoding, as well as a prior over the associated weights, such that Bayesian linear regression in this feature space yields accurate online predictions of the posterior predictive density. These features are neural networks, which are trained via a meta-learning (or “learning-to-learn”) approach. ALPaCA extracts all prior information directly from the dataset, rather than restricting prior information to the choice of kernel hyperparameters. Furthermore, by operating in the weight space, it substantially reduces sample complexity. We investigate the performance of ALPaCA on two simple regression problems, two simulated robotic systems, and on a lane-change driving task performed by humans. We find our approach outperforms kernel-based GP regression, as well as state of the art meta-learning approaches, thereby providing a promising plug-in tool for many regression tasks in robotics where scalability and data-efficiency are important.

1 Introduction

Gaussian Process (GP) regression has become a workhorse in robotics due to its generality, simplicity of use, and the utility of Bayesian predictions. As such, it has been used in robotic applications as diverse as model identification for control and reinforcement learning [1,2], contact modeling [3], terrain and environment modeling [4,5], prediction of human behavior [6,7], and trajectory optimization [8]. Moreover, it has been used effectively for essential tools in robotics such as Bayesian filtering [9], SLAM [10], and Bayesian optimization [11].

While GP regression is a useful tool, standard approaches have several well-known limitations. GP regression is predominantly performed via kernel-based methods [12], but incorporating physical priors into these models is difficult, and done implicitly through the choice of kernel function. For example, in modeling a physical system with a GP equipped with the widely-used squared exponential (SE) kernel, the system designer may vary the prior variance, and the length scale of the kernel, which controls smoothness [12]. These choices of priors are neither expressive nor intuitive enough for complex applications in robotics. Moreover, kernel-based GP regression algorithms scale cubically with the amount of data [13]. This high sample complexity prevents application beyond relatively small problems [14].

Several works have attempted to incorporate prior knowledge into GP models and to remove the limitations imposed by kernel methods. Methods mapping input data to alternative feature spaces which are then used as inputs to kernel GP models have been shown to allow incorporation of prior knowledge, and reduce the complexity by projecting to a lower dimensional space [15]. For example, Hinton and Salakhutdinov [16] use a Deep Belief Network to learn a feature representation in an unsupervised fashion, but this separates the process of representation learning and regression, which may be suboptimal. Manifold GPs [17] jointly learn a neural network transformation of the data such that GP methods applied to this feature space achieve good predictions. However, this approach relies on the system designer having access to data generated from the exact system being modeled, and it is unclear how well this approach will perform on a regression problem that differs from that used to generate the training dataset. Good performance on a general class of tasks (as opposed to one specific task within this family) is necessary for effective autonomous operation in unstructured environments. Moreover, these approaches still suffer from the poor sample complexity associated with kernel methods. A more extensive discussion of methods for incorporating prior knowledge is presented in Section 5.

A variety of sparse and approximate methods have been proposed to reduce the sample complexity of kernel methods, but these at best scale quadratically [18,19]. A popular approach to avoiding the complexity of kernel methods is to instead turn to random features for approximating kernel functions, and solving the linear regression problem using these features instead [20]. The approach taken in this work is similar, but we learn features that are well-suited to representing posterior densities, based on available prior information.

Contributions. In this work, we propose ALPaCA (Adaptive Learning for Probabilistic Connectionist Architectures), an algorithm for incorporating prior information in to GP regression that increases computational and sample efficiency, and provides Bayesian posteriors reflecting confi-dence in predictions. We propose to learn a neural network model (i.e. connectionist architecture) offline that is trained to be able to rapidly incorporate online information to predict posterior distributions. Online adaptation is achieved by performing Bayesian linear regression on the last layer of the network, and thus our approach is capable of providing confidence intervals for predictions. Fundamentally, our network learns priors over the last layer weights, as well as network weights, such that it can optimally predict the posterior over the label, conditioned on the online samples observed so far. This is performed by leveraging the “learning-to-learn” approach of meta-learning [21], where model training is done to minimize divergence between the predicted and true posterior, for any amount of observed online context data. By performing regression on meta-learned basis functions, ALPaCA achieves complexity during inference that is linear in the amount of data. As such, ALPaCA offers a general, plug-in alternative to kernel GP regression which can improve scalability and online data-efficiency in a wide variety of robotics applications.

Organization. In Section 2, we provide a formal problem statement. We present background information on Bayesian linear regression and GP regression in Section 3. In Section 4 we introduce the ALPaCA algorithm for generic supervised learning tasks. We contrast our approach to the literature on neural network regression models, current meta-learning approaches, as well as popular GP models in Section 5. Finally in Section 6, we demonstrate ALPaCA on simple problems that are standard benchmarks in meta-learning [21], as well as high-dimensional robotic systems involving contact, and prediction of human decision making in a driving task. ALPaCA is shown to outperform kernel GP regression, as well as meta-learning approaches such as MAML [21].

2 Problem Formulation

In this work, we are focused on the problem of Bayesian function regression. In this setting, we start with a prior over possible functions, observe samples from an unknown test function, and use these samples to compute a posterior predictive density over the unknown function.

Formally, let denote an arbitrary, possibly nonlinear and nonsmooth function with latent parameters . We consider a setting where f is fixed and captures problem specific structure, while is uncertain and explains intra-domain variation. For example, in the context of prediction of human driving behavior, uncertainty in the state evolution stems from uncertainty in the driver’s intent, while the physics governing the car’s dynamics are known. We assume the samples of f observed online may be corrupted with additive Gaussian noise. That is, we observe a sequence of samples (x, y) such that

where ) denotes the multivariate Gaussian distribution, and is the noise covariance, and is unknown.

Suppose we observe a collection of sample points coming from an unknown function with latent parameter . Given a prior over model parameters ), and this context data , the posterior predictive density over y given x can be computed as

Computing this posterior predictive density, however, requires having analytic expressions for ), and even then, computing the posterior ) exactly may remain intractable. Instead, we propose using a surrogate model for which computing the posterior predictive density is analytically tractable, and optimizing this surrogate model offline such that its posterior preditictive density is close to the true posterior predictive density for all likely settings of resulting observations

Let ) represent the posterior predictive density under our surrogate model, param- eterized by represent a sample from this density q. We consider a scenario where data comes in as a stream: that is, at each timestep, the agent is presented with an input , and after estimating the output ˆ, the agent is provided with the true value . An example of this is a Markovian dynamical system, in which an agent wishes to predict the distribution over the next state, and at the next time step, this state is observed.

In this context, the problem we wish to solve can be written as

where ) denotes the Kullback-Leibler (KL) divergence. Note that this objective is minimizing the conditional KL divergence. As such, the objective is to minimize this divergence in expectation over . Directly optimizing this objective requires access to Instead of assuming access to these distributions directly, we only require a dataset of samples drawn from different settings of latent parameter . Specifically, we will write resent a dataset of samples generated via sampling (iid) that is, a set of samples from one function instantiation. Let represent a collection of these datasets. Note that by sampling ), sampling uniformly from D, and the truncating elements, we can approximate the expectations in (3) through a Monte Carlo scheme.

3 Bayesian Regression

In this work, we utilize Bayesian multivariate linear regression as a tool for computing as it allows for analytic computation of the posterior. In this section, we offer a brief review of the technique. For a more in-depth treatment, we refer the reader to [12], [22], or [23]. Let denote the independent and dependent variables respectively. We will consider regression over basis functions is a coefficient matrix and ) denotes the multivariate Gaussian distribution and denotes the noise variance. We will assume is known throughout this work. The case is which this assumption is relaxed is discussed in the appendix. Let where is the number of data points. Stacking samples , we may equivalently write . Given this formulation, the conditional probability of the data Y is

where tr() denotes the trace operation. A natural conjugate prior for where MN denotes the matrix normal distribution, and is a precision matrix (the inverse of the variance). The matrix normal distribution is equivalent to the multivariate normal distribution with mean vec( ¯) and variance , where vec( ¯) represents the vectorization of ¯denotes the Kronecker product. The posterior, conditioned on

Given the above, the posterior predictive distribution is

where

For a derivation of this term, the reader may refer to [22]. Note that several common forms of linear regression, such as ridge regression, are special cases of this general formulation [23].

Bayesian linear regression has strong connections to kernel methods for GP regression. With the choice of Gaussian priors discussed here, Bayesian linear regression implicitly defines a GP with mean and kernel functions that are functions of the prior on the weights, p(K), and the basis functions used.

Note that the terms involving the basis functions are always weighted inner products of and ) (with the same weighting, which we will denote ). Thus, we define the kernel function ). In contrast to Bayesian linear regression, kernel-based GP regression

Fig. 1: Overview of the ALPaCA algorithm. Rather than compute the intractable posterior ) directly, we approximate it through a surrogate model for which computing the posterior ) is analytically tractable via Bayesian linear regression. Offline, we optimize the parameters of this surrogate model on a dataset D of functions drawn from a prior distribution, aligning the surrogate posterior with the true empirical posterior in the dataset through a meta-learning procedure. Blue nodes are observed offline, green are online inputs, and purple the targets. White nodes are computed, and gray nodes are not observed.

employs the well-known “kernel trick” to replace occurrences of inner products with the kernel function [12]. In this way, it can effectively operate in an infinite dimensional feature space such as that defined by the popular squared exponential kernel. This added model capacity comes at the cost of poor computational complexity as the number of samples in the dataset grows.

4 Approach

In this work, we choose to represent the approximate density ) as a neural network model with a Bayesian last layer, which allows employing Bayesian linear regression to compute ) analytically. Section 4.1 provides details of this model and other preliminaries. In Section 4.2, we outline the full ALPaCA algorithm.

4.1 Algorithmic Preliminaries

denotes the random variable distributed according to In this work we will learn a parametric model for ˆ

where is a neural network with output dimensionality and weights is a matrix which may be thought of as the last layer of the neural network. Learning this model based on a dataset for fixed is relatively simple, and is in line with the majority of the literature on neural network regression models. However, we wish to learn a prior distribution over K such that given data sampled from (1) with the test parameter , we can generate accurate Gaussian posteriors. Specifically, we wish to learn the parameters ¯of a matrix normal prior p(K) = ), so that we may apply tools from Bayesian linear regression to analytically compute posterior distributions given online data.

While a fully Bayesian approach would also aim to estimate the posterior over the weights w, we will assume a delta prior over these weights, and thus they are not updated online. However, note that in comparison to previous work that utilizes networks with Bayesian output layers [24,25], we will not train the network and learn the prior over K in separate phases, and instead they are learned jointly. This is an important distinction: as opposed to simply learning a network to produce good predictions in expectation (with no context data), and adding in a Bayesian last layer afterwards, we aim to learn a network that is composed of basis functions capable of representing the posterior distribution given any context data. Given fixed basis functions, learning the prior terms in a data-driven fashion is known as empirical Bayes [26]. In contrast to the empirical Bayes approach, we also learn the basis functions in addition to the matrix normal prior. Thus, in summary, the parameters of the posterior predicitve model we wish to optimize are

Optimization Objective. We will denote the process generating , and label y as ), for online test parameters . Note that this may equivalently be written as p(y | is conditionally independent of . Making explicit the dependence of the posterior on the prior terms, the inner expectation of (3), wherein the expectation over

and t is temporarily ignored, may be written

We can then write the above divergence, dropping the terms that do not affect the optimization over the prior terms, as

Rewriting this negative log likelihood using (7), discarding terms not relevant to the optimization problem over the prior terms, and making the time dependency explicit, we obtain

Thus, (3) may be written as

where are as in (5), (6), and (8). These terms are functions of the prior terms, ¯and w, as well as . Here, we write ). This equivalence between minimizing KL divergence and maximizing log likelihood is well-known [23], but we emphasize it to improve clarity regarding the expectations and role of the prior terms in the likelihood.

4.2 ALPaCA Overview

Offline Training. Given (10), we are now able to write the full optimization problem we wish to solve. We sample samples each uniformly from D. Let j denote the index of

these datasets. For each of these datasets, we sample a horizon ) and clip each dataset to this horizon to obtain contains datasets with ), evaluating the loss on this minibatch approximates the expectations in (10). We use the superscript (j) to denote the terms within each dataset . Note that

Since it does not depend on the optimization variables, log det may be ignored and we may write the Monte Carlo estimate of (10) as

Thus, our problem formulation is

where contain data from dataset j, from timesteps 1. The full algorithm is detailed in Algorithm 1. The semidefinite constraint on is easily enforced by optimizing instead over , the Cholesky decomposition (). By using this Cholesky decomposition and substituting in the equality constraints, (12) is made unconstrained, and may be solved by stochastic gradient descent. Note that in practice, rather than evaluating the loss only on (minimize the average loss evaluated on

Online Phase. After learning the prior parameters ( ¯) and the weights for the basis functions w, ALPaCA computes the online posterior by performing Bayesian linear regression using the data observed online . Since the number of basis functions used may be large for complex functions, we introduce a recursive Bayesian update that improves computational efficiency. This result is well known in linear system identification and online regression [27]. Let Then, the online updates may be written as

where ). The first update comes from the Woodbury identity [27]. Importantly, this recursive update rule decreases the complexity from the original formulation’s ) (resulting from inverting directly) to ). The full recursive online algorithm is presented in Algorithm 2.

Note that is not updated online, in contrast to some other approaches to meta-learning such as MAML [21]. Moving from gradient updates to least squares yields a simpler algorithm online, which is useful for debugging or verifying autonomous systems. Because ALPaCA performs GP regression in the weight space as opposed to computing the kernel functions, it achieves linear complexity in the amount of data. In particular, for n context data points and m test data points, ALPaCA has complexity O(n + m). One interpretation of ALPaCA is as an approximation method for kernel-based GP regression, similar to randomized kernels [20]. However, instead of randomizing basis functions to approximate well-known kernels, we compute basis functions to maximize predictive power with respect to the prior implicit in the training data.

5 Related Work

Thus far we have introduced ALPaCA in the context of GP regression. In this section, we discuss connections to other related works in meta-learning and sample efficient regression.

Meta-Learning. In this work, we present a Bayesian perspective on meta-learning. Recently, model-agnostic meta-learning (MAML) presented a simple approach to meta-learning with impressive performance. MAML was originally introduced from the perspective of learning an initial set of neural network weights that could efficiently reach the desired weights for a task [21], given a handful of gradient steps on samples from that task. Subsequent work then recast this as a form of hierarchical Bayes [28]. Specifically, indexing tasks with j, the marginal likelihood of the data X is written

where denote data points from within task denotes the post meta-update weights for task and loss function j, and w denote the initial weights. They note that MAML estimates the likelihood of the data drawn from task j with a point estimate of , computed via gradient descent. In the case of linear regression, Santos [29] showed that early stopping of gradient descent to compute results in a Maximum a Posteriori (MAP) estimate of , regularized by where Q depends on the gradient step size and the number of iterations. In contrast to these formulations, we assume a delta prior on the network weights, and so the weights are not updated online. Additionally, our formulation of a Bayesian last layer allows the posterior (given the context data) to be computed directly. Because we perform this analytical Bayesian update, we do not suffer the effects of the regularizing term, and instead may exactly compute the Gaussian approximation of the posterior. Generally, in the case of linear filtering, recursive least squares approaches have been shown to outperform gradient-based approaches [30]. These results from the linear case may explain in part why ALPaCA makes more efficient use of the online data provided, and can perform faster updates.

The interpretation of MAML as hierarchical Bayes allowed the authors to incorporate Laplace approximation into the weight update [28], and the authors refer to the resulting algorithm as LLAMA. This approximation results in the point estimate of the updated weights being replaced with a second-order approximation of the posterior. Thus, sampling based methods can be used to approximate the posterior distribution of the data. More recently, PLATIPUS [31] and Bayesian MAML [32] have used variational approaches to move from point estimates to Bayesian posteriors. We wish to emphasize that these approaches are capable of capturing a larger family of distributions (such as multimodal distributions), while we achieve improved sample efficiency and better efficiency in terms of computation. We believe these trade-offs are justified and necessary in many robotics applications.

Efficient Stochastic Process Regression. In this work we have considered meta-learning for GP regression. A relaxation of this problem formulation to estimation of general stochastic processes was recently presented in [19]. This approach is based on amortized variational inference (as in PLATIPUS [31]). In particular, it is a form of conditional variational autoencoder (CVAE) [33,34], where the conditioning variable is a function of the context data. As with other autoencoder-based approaches, this method requires sampling-based estimates of the moments of the distribution. In contrast, in ALPaCA, mean and variance may be computed analytically. Moreover, [19] was only tested on problems with two dimensional output spaces, and it is unclear how well it performs for larger problems, whereas ALPaCA performs well even for high dimensional systems (such as the 12D hopper system). Within the context of meta-learning for GPs, [35] took a similar approach to the one presented here. However, while they are learning hyperparameter priors based on a distribution over tasks, they do not jointly learn a set of basis functions capable of accurately representing posterior distributions.

Fig. 2: Comparison of ALPaCA (left) compared to ALPaCA without meta-training (center) and Gaussian process regression (right) on a simple sinusoid problem from [21]. The plots on each row contain 0, 1, 2, 3, and 5 context samples respectively. Confidence intervals are 95%.

Bayesian Output Layer Models. The approach of affixing a Bayesian last layer to a standard neural network has seen application across machine learning. In [24], the authors use this network structure with an arbitrary prior for Bayesian optimization, and [36] use it to improve exploration efficiency in reinforcement learning. In the context of meta-learning, [25] use a similar approach to the one presented here. However, they first train their network without considering training it with respect to meta-updated last layers and losses, as we have here. They then perform meta-learning by updating the last layer using approximate methods. In contrast to this approach, we learn network weights such that they form useful basis functions for estimating the posterior, for any amount of data accrued online. Our work also shares common features with recent work on meta-learning with closed form solvers [37], in which the authors do meta-learning with a ridge regression last layer. Generally speaking, these approaches all make arbitrary assumptions regarding the output layer prior, whereas we map our prior information (available via the training data) to a prior over weights.

6 Experiments

In this section, we investigate the performance of ALPaCA on a variety of regression problems, ranging from simple toy problems which help in building intuition for the algorithm, to challenging prediction and dynamics modeling problems. We compare ALPaCA against several other algorithms that are capable of rapid online adaptation. First, as an ablation experiment, we compare to ALPaCA without the meta-training. This is to say, the network is trained to predict the prior

2 ALPaCA ALPaCA (no meta) GPR

Fig. 3: Comparison of ALPaCA (left) compared to ALPaCA without meta-training (center) and Gaussian process regression (right) on a function exhibiting discrete switching between 1 for 5]. Three switching points are randomly sampled. The plots on each row contain 0, 5, 10, 15, and 25 context samples.

for varying datasets, but is not explicitly conditioned on random selections of data. This is roughly equivalent to standard methods for training neural networks with Bayesian last layers [24,36]. In these works, a network is typically trained to minimize e.g. MSE loss over all of the data, and then the prior over the last layer is chosen arbitrarily. For the dynamical systems that we investigate, we also compare to the predictive performance of ALPaCA with no online updating. This network is just trained in expectation over the sampled values of the model parameters . For our toy examples, we compare to GP regression. We chose a zero-mean GP with a squared-exponential kernel, which is often the default choice in GP modeling [11]. The three above approaches are compared in terms of negative log likelihood to compare the accuracy of their posterior predictions. In our experiments we use tanh nonlinearities, due to favorable properties in terms of variance smoothness and the behavior of the variance far from the observed data. This is discussed in more detail in the appendix.

In addition to these comparisons, we also compare to MAML [21], which has rapidly become one of the most common approaches for meta-learning. Indeed, MAML was recently used for online system identification [38] on relatively complex dynamical systems. Because MAML generates point predictions rather than full posteriors, we compare in terms of MSE loss (for ALPaCA, we compute MSE loss on the mean of the posterior). We were not able to compare to PLATIPUS [31] or Neural Processes [19] because to our knowledge, at the time of writing, the code is not publicly available. Throughout our experiments, plotted confidence intervals are 95%.

Fig. 4: Negative Log Likelihood (center) and Mean Squared Error (right) for the pendulum system. The image on the left shows the limits of the uncertainty over the pendulum length.

6.1 Toy Experiments

We first investigate two simple regression examples that allow both quantitative and qualitative evaluation of performance. First, we investigate a regression problem over a family of sinusoids, where the amplitude and phase were sampled uniformly from [0] respectively. This sinusoid problem was used as a simple test problem in [21]. We also evaluate performance on a new, simple benchmark that consists of a family of two-valued functions that begin at x < 2.5, end at y = 1 for x > 2.5, and switch three times between these values for These switching points are sampled uniformly from this range. This function was designed to be challenging relative to its apparent simplicity for our approach. Because we are performing basis function regression, the discrete switches between 1 and 1 which can occur anywhere within the range specified above, are hard to capture. This is also challenging for GP approaches, as the structure of the function is hard to capture via a single global smoothness parameter in the kernel. For example, in [39], the authors learn separate GPs for each piecewise continuous segment, as opposed to loosening their smoothness priors.

The outputs of ALPaCA, ALPaCA without meta-training, and Gaussian process regression (GPR) on these two problems are plotted in figures 2 and 3. First, note that ALPaCA is capable of making predictions that incorporate structural knowledge of the problem from a single datapoint, compared to the local predictions that are common for GPs with the SE kernel. Secondly, on the sinusoid problem, the first sample allows ALPaCA to correctly reduce uncertainty at points at a distance of a scalar multiple of a half wavelength from the sampled point. Moreover, one sample is enough to have a good estimate of the sinusoid nearly everywhere, and within five samples the estimated variance has dropped to nearly zero. This reduction in variance is observed for ALPaCA without meta-training as well, but its predictions are incorrect, highlighting the utility of our meta-learning approach. Further experiments with varying noise covariance are presented in the appendix.

On the step function, ALPaCA can capture the discrete steps as well as or better than GP methods. Indeed, ALPaCA does not exhibit the large overcorrection below the function that GPR exhibits. Again, ALPaCA without meta-training reduces the predicted variance incorrectly. In contrast, the predicted confidence intervals generated by ALPaCA appear to be well-calibrated.

Fig. 5: Negative Log Likelihood (center) and Mean Squared Error (right) for the hopper system. The image on the left depicts bounds on the uncertain torso size of the hopper.

ALPaCA with and without meta-training successfully learn that the function always takes a value of 1 and 1 for 5 and above 2.5 respectively. This automatic incorporation of prior information is a powerful tool in the context of robotics, where a wealth of physical information is available but often hard to encode into standard GP regression.

The average negative log likelihood and the mean squared error (based on the mean prediction) of the test set as a function of the number of context samples is plotted for both problems in the appendix. We find that while the prior (no samples observed) for ALPaCA and ALPaCA without meta-training is nearly identical, ALPaCA nearly monotonically improves in performance as more samples are gathered online. We wish to note that GPR performance could likely be improved by better choice of prior, but as we discussed above, this tuning of the prior is often difficult for complex systems. We also compare ALPaCA to MAML in terms of MSE. We find that we outperform MAML for all context data sizes, but this difference in performance is especially acute for a small number of context datapoints. Indeed, because MAML performs gradient steps on context data, it performs poorly with a single sample.

6.2 Dynamical Systems

We investigate the performance of ALPaCA in fitting the transition model for dynamical systems. In these tasks, we wish to fit a model to use the input ) to predict () represents the state, action, and next state of a dynamical system. The process generating these (x, y) pairs is given by the true dynamics of the system, here represents model parameters. We generate a dataset of trajectories under different settings of sampled from a prior, and study whether ALPaCA can capture this structured uncertainty to enable sample efficient Bayesian regression online.

We test ALPaCA on the pendulum and the hopper from OpenAI’s Gym [40]. For these examples, the dynamics parameters were varied. For the pendulum, the mass and length were sampled uniformly from [0.5, 1.5]. The hopper’s foot friction was sampled uniformly from [1.7, 2.0] and the torso size was sampled uniformly from [0.02, 0.08]. Plotted rollouts for these systems are provided in the appendix.

Fig. 6: Negative Log Likelihood (center) and Mean Squared Error (right) for the lane change experiment. The image on the left shows the view for the drivers, and the aim of the task (reproduced with permission from [41]).

The performance of ALPaCA on the pendulum and the hopper are plotted in figures 4 and 5. Again, when the amount of context data is low, MAML provides little performance increase. ALPaCA, in contrast, achieves rapid performance improvement within the first few samples. These results also highlight an issue with approaches such as MAML, where taking more gradient steps can lead to overfitting when the amount of context data is small. The selection of number of gradient steps to take with a MAML-style approach, as well as the batch size to be used for online updates, are hyperparameters that are relatively difficult to tune. By computing the posterior directly via recursive least squares, ALPaCA avoids this ambiguity in the number of gradient steps to take. Kernel-based methods for GP regression are known to be prohibitively slow for high dimensional dynamical systems; the time required to evaluate ALPaCA is compared to GPR are plotted and discussed in the appendix.

6.3 Vehicle Lane Change

Finally, we investigate the performance of ALPaCA on human lane change experimentsexperiments were based on the dataset presented in [41]. This data is based on 19 pairs of participants, driving against each other. The objective of the task was to switch lanes with each other without communicating verbally, and without colliding. Further details on use of this dataset are presented in the appendix. Here, we treat the system as an autonomous dynamical system, and use ALPaCA to predict the transitions.

In this experiment, the underlying parameter is essentially capturing the driving style of the human. ALPaCA aims to identify the underlying style of the drivers from observations of the first few transitions of a particular trial to reduce uncertainty over future predictions. Numerical results are presented in figure 6. Interestingly, the rate of performance improvement on this problem was slower than previous problems. On previous problems, rapid performance improvement occurred early in the episode, which tailed off toward the end of the episode. A possible explanation is that the transition data is often bimodal, with a mode corresponding to driving straight and a mode corresponding to switching lanes. In these experiments, ALPaCA performance improves nearly linearly. Surprisingly, MAML does not experience an initial performance decrease as in previous experiments, and shows rapid improvement that quickly slows.

While building models to identify dynamics parameters as above is reasonably simple, identifying latent variables such as driver intent and style are difficult to do via non-data-driven models. Critically, we did not specify in any way that we wish to identify human driving behavior over the course of the episode. Generally, pre-specification of the latent variables to identify is not necessary. ALPaCA, if trained from real operational data, will learn to identify existing latent parameters.

7 Discussion and Conclusions

In this work we have presented ALPaCA, a sample efficient probabilistic neural network model that enables efficient adaptive online learning. We have demonstrated that ALPaCA outperforms kernel GP regression, which is a tool used throughout robotics and engineering as a whole, as well as MAML [21] (which itself has outperformed many meta-learning approaches). In contrast to kernel GP regression, our approach enables incorporation of an arbitrarily large amount of prior information, and scales to large data regimes. In contrast to MAML, ALPaCA maintains a full analytic posterior, which is useful for applications such as Bayesian optimization and stochastic optimal control. Moreover, it is simple to implement and a plug-and-play tool for any regression problem. We believe its blend of high-capacity regression models, together with sample efficiency, has the potential to improve robotic autonomy in many scenarios.

Limitations. While the ALPaCA model has shown potential for efficient online Bayesian learning, it has several limitations. First and foremost, the use of analytic update formulas come with the standard limitations of Gaussian models – in particular, the assumption of Gaussian noise. Indeed, in this paper we assume is know. We discuss the case where this assumption is relaxed in the appendix. Relative to MAML-style models [21], in which all parameters of the model are being adapted online, ALPaCA has lower model capacity. Our experiments demonstrate that the reduction in capacity results in an improvement in stability (via near-monotonic MSE improvement relative to MAML), but it may be possible to achieve both of these objectives. Finally, fundamental machine learning considerations such as data requirements and meta-overfitting have largely not been addressed in this paper, or in the meta-learning literature broadly.

Future Work. The literature on kernel methods for GP regression is large, and it was impossible to compare to all of the flavors of GP methods that are used in practice. We believe further comparisons of the relative merits of ALPaCA and GP methods are useful future work. Beyond this, several avenues of future work exist. We detail them here briefly, and an extended discussion is presented in the appendix. First, because GPs are closed under addition, our method may be combined with kernel GP methods. Additionally, ALPaCA is simply extended to tracking time-varying using forgetting methods, which we have not discussed in this work. Generally speaking, many of the tools previously developed within basis function system identification may be applied with ALPaCA. Finally, characterization of the data requirements of meta-learning algorithms, as well as meta-overfitting are needed.

Acknowledgments

This work was supported by the Office of Naval Research YIP program (Grant N00014-17-1-2433), by DARPA under the Assured Autonomy program, and by the Toyota Research Institute (TRI). This article solely reflects the opinions and conclusions of its authors and not ONR, DARPA, TRI or any other Toyota entity. James Harrison was supported in part by the Stanford Graduate Fellowship and the National Sciences and Engineering Research Council (NSERC).

References

1. M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” International Conference on Machine Learning (ICML), 2011.

2. F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” Neural Information Processing Systems (NIPS), 2017.

3. M. Bauza and A. Rodriguez, “A probabilistic data-driven model for planar pushing,” IEEE International Conference on Robotics and Automation (ICRA), 2017.

4. S. Vasudevan, F. Ramos, E. Nettleton, and H. Durrant-Whyte, “Gaussian process modeling of large-scale terrain,” Journal of Field Robotics, 2009.

5. S. T. OCallaghan and F. T. Ramos, “Gaussian process occupancy maps,” International Journal of Robotics Research, 2012.

6. J. M. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical models for human motion,” IEEE Transactions on Pattern Analysis & Machine Intelligence, 2008.

7. R. Urtasun, D. J. Fleet, and P. Fua, “3D people tracking with Gaussian process dynamical models,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

8. M. Mukadam, X. Yan, and B. Boots, “Gaussian process motion planning,” IEEE International Conference on Robotics and Automation (ICRA), 2016.

9. J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation mod- els,” Autonomous Robots, 2009.

10. B. Ferris, D. Fox, and N. Lawrence, “WiFi-SLAM using Gaussian process latent variable models,” International Joint Conference on Artificial Intelligence (IJCAI), 2007.

11. J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” Neural Information Processing Systems (NIPS), 2012.

12. C. E. Rasmussen, Gaussian processes in machine learning. Springer, 2004.

13. J. Hensman, N. Fusi, and N. D. Lawrence, “Gaussian processes for big data,” Uncertainty in Artificial Intelligence (UAI), 2013.

14. A. J. Smola and P. L. Bartlett, “Sparse greedy Gaussian process regression,” Neural Information Processing Systems (NIPS), 2001.

15. D. Mackay, “Introduction to Gaussian process,” Neural Networks and Machine Learning, 1998.

16. G. E. Hinton and R. R. Salakhutdinov, “Using deep belief nets to learn covariance kernels for Gaussian processes,” Neural Information Processing Systems (NIPS), 2008.

17. R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth, “Manifold gaussian processes for regression,” International Joint Conference on Neural Networks (IJCNN), 2016.

18. E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” Neural Information Processing Systems (NIPS), 2006.

19. M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. A. Eslami, and Y. W. Teh, “Neural processes,” International Conference on Machine Learning (ICML), 2018.

20. A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” Neural Information Processing Systems (NIPS), 2008.

21. C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” International Conference on Machine Learning (ICML), 2017.

22. T. Minka, “Bayesian linear regression,” MIT Technical Report, 2000.

23. K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

24. J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, Prabhat, and R. Adams, “Scalable Bayesian optimization using deep neural networks,” International Conference on Machine Learning (ICML), 2015.

25. M. Bauer, M. Rojas-Carulla, J. B. Swiatkowski, B. Scholkopf, and R. E. Turner, “Discriminative k-shot learning using probabilistic models,” arXiv:1706.00326, 2017.

26. C. N. Morris, “Parametric empirical bayes inference: theory and applications,” Journal of the American Statistical Association, 1983.

27. L. Ljung and T. S¨oderstr¨om, Theory and practice of recursive identification. MIT press, 1983.

28. E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths, “Recasting gradient-based meta-learning as hierarchical bayes,” International Conference on Learning Representations (ICLR), 2018.

29. R. J. Santos, “Equivalence of regularization and truncated iteration for general ill-posed problems,” Linear algebra and its applications, 1996.

30. J. Cioffi and T. Kailath, “Fast, recursive-least-squares transversal filters for adaptive filtering,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.

31. C. Finn, K. Xu, and S. Levine, “Probabilistic model-agnostic meta-learning,” Neural Information Processing Systems (NIPS), 2018.

32. T. Kim, J. Yoon, O. Dia, S. Kim, Y. Bengio, and S. Ahn, “Bayesian model-agnostic meta-learning,” Neural Information Processing Systems (NIPS), 2018.

33. D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” International Conference on Learning Representations (ICLR), 2014.

34. X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Conditional image generation from visual attributes,” European Conference on Computer Vision (ECCV), 2016.

35. K. Yu, V. Tresp, and A. Schwaighofer, “Learning Gaussian processes from multiple tasks,” International Conference on Machine Learning (ICML), 2005.

36. K. Azizzadenesheli, E. Brunskill, and A. Anandkumar, “Efficient exploration through Bayesian deep Q-networks,” arXiv:1802.04412, 2018.

37. L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” arXiv:1805.08136, 2018.

38. I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn, “Learning to adapt: Meta-learning for model-based control,” arXiv:1803.11347, 2018.

39. A. Svensson and T. B. Sch¨on, “A flexible state-space model for learning nonlinear dynamical systems,” Automatica, 2017.

40. G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI gym,” arXiv:1606.01540, 2016.

41. E. Schmerling, K. Leung, W. Vollprecht, and M. Pavone, “Multimodal probabilistic model-based planning for human-robot interaction,” IEEE International Conference on Robotics and Automation (ICRA), 2018.

42. Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” Neural Information Processing Systems (NIPS), 2016.

43. T. B. Sch¨on, A. Wills, and B. Ninness, “System identification of nonlinear state-space models,” Automatica, 2011.

44. J. D. Hamilton, Time series analysis. Princeton university press, 1994.

45. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, 1997.

46. A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” International Conference on Machine Learning (ICML), 2016.

47. R. Kulhav`y, “Restricted exponential forgetting in real-time identification,” Automatica, 1987.

48. M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel, “Continuous adaptation via meta- learning in nonstationary and competitive environments,” International Conference on Learning Representations (ICLR), 2018.

49. L. Birg´e and P. Massart, “Gaussian model selection,” Journal of the European Mathematical Society, 2001.

50. A. Shah, A. Wilson, and Z. Ghahramani, “Student-t processes as alternatives to Gaussian processes,” Artificial Intelligence and Statistics (AISTATS), 2014.

51. B. Parsa, K. Rajasekaran, F. Meier, and A. G. Banerjee, “A hierarchical Bayesian linear regression model with local features for stochastic dynamics approximation,” arXiv:1807.03931, 2018.

52. X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” arXiv:1710.06537, 2017.

53. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347, 2017.

A Further Algorithmic Details and Extensions

A.1 Choice of Nonlinearity

The choice of neural network nonlinearity strongly affects the behavior of the variance far from datapoints. Moreover, choice of smooth vs. non-smooth activation affects the smoothness of the mean, but has an especially prevalent effect on the smoothness of the variance. In this work we have chosen to use the tanh activation function, which is bounded. As a result, and contrary to what is observed with a ReLU activation, the variance far from data reaches some prior value, similar to the behavior of a GP. This same design choice was discussed by [24] and [42]. Whether or not very large variance far from data is desirable is left to the system designer.

A.2 Dependent Samples

We have previously assumed independence of data x when conditioned on . In the case of dynamical systems, this is not true. Generally, the data generating process may be written as a hidden Markov model (HMM), in which the underlying state, x, has Markovian dynamics, and we receive observations y. In this case, we use the term HMM to refer to a process that has a possibly continuous state space, as opposed to more common usage which assumes a finite state space. In this case, the joint distribution of the dataset and the current x, y data and label pair may be written

However, in the online case, we do not observe the state and thus computation of the posterior requires that the state be jointly estimated. This problem of simultaneous system identification and state estimation is common in the system identification literature [43]. In this work, we will assume that the state is fully observed without noise, and thus . The assumption of full state observations is common to make the problem tractable without expensive filtering jointly over the non-Markovian observations and the model parameters [43]. Alternatively, it is possible with observation noise to simply perform prediction in the space of observations.

A common method for time series prediction (in which the underlying model is an HMM, but predictions are made for the observation) is an autoregressive approach, in which the next observation is predicted as a function of some finite number of previous observations [44]. Such an approach, which effectively corresponds to concatenating several previous observations before input to , may easily be applied in the context of ALPaCA. Moreover, a wide variety of approaches within the literature on autoregressive models are likely applicable within the ALPaCA framework. Extension to recurrent neural network models (e.g. [45]) is possible, but this would substantially increase training complexity. Indeed, recurrent models for meta-learning are an existing topic of interest within the meta-learning community [46], and so combination with those works is possible.

A.3 Time-Varying

The method we have presented in this work meta-trains basis functions for online Bayesian regression. However, ALPaCA is agnostic to the choice of online regression algorithm. As an example of the potential utility of this, consider the case in which is time-varying. In this case, off-the-shelf forgetting algorithms such as [47] may be used for the online phase with no other modifications. There is a vast literature on system identification methods based on regression over basis functions, much of which can be applied within the ALPaCA framework. This problem was addressed in the context of meta learning by [48], who used a MAML-based approach that jointly learned an adaptation rate. In contrast, an exponential forgetting method is considerably simpler.

A.4 Combination with Existing GP Regression Methods

One of the primary uses of GP regression in robotics is for algorithms which aim to guarantee safety online [2]. As such, the strong locality of squared exponential kernels is a selling point, and the ability of ALPaCA to induce further structure in the regression problem is not desirable. However, it is possible to incorporate SE kernel behavior into ALPaCA. Because Gaussian processes are closed under addition, one could combine ALPaCA directly with a kernel-based GP model via convex combination of means and variances. This is a slightly unusual model, but provides a spectrum of intermediate models between kernel-based GP regression and ALPaCA models. Moreover, if safety is critical, Gaussian models have a variety of model selection algorithms [49]. These approaches may be used to detect if the predictions from an ALPaCA model are far out of distribution (corresponding to a misspecified prior), which can be used to inform weights on GP models.

A.5 Known Noise Covariance

In this paper we have addressed the problem of Bayesian linear regression with known noise covariance, . This assumption is in line with the majority of the literature on GP regression. Relaxing this assumption changes the posterior distribution over K, which is a Normal-Inverse-Wishart distribution as opposed to a Matrix Normal. As a result, the posterior predictive density for y is a multivariate t distribution. In the case of a scalar output, this regression problem was discussed in [50], however the relative performance improvement of this approach was primarily for noisy systems with a relatively small amount of data. This process has been discussed by numerous others in the statistics and machine learning literature.

While Bayesian regression with the multivariate t distribution is tractable in relatively simple cases, ALPaCA requires backpropagation through the posterior predictive distribution. This distribution involves a gamma function for the univariate case, which may add numerical instability. An approach is that of [51], in which they approximate the gamma with Stirling’s approximation. However, the general multivariate t distribution involves the multivariate gamma distribution, for which approximation methods are more difficult. For example, in [39] the authors turn to MCMC methods for identification of the hyperpriors for the Normal Inverse Wishart (equivalent to multivariate t in this case).

A.6 Data Acquisition.

There are two approaches to gathering the data D to train ALPaCA. First, given a process wherein is being sampled from pairs are observed, this data may simply be used for meta training. A simple example is predicting driver actions, wherein one can gather data from observing driver state transitions for different drivers. Alternatively, with access to f and a belief over may train ALPaCA via sampling from the belief, followed by sequentially sampling x and y. This case is more common when the system is relatively well known. For example, this is often the case for robotic system identification, in which the governing dynamics of the system are known but there is uncertainty with respect to the parameters.

In the online/streaming setting, data gathered online may be used to improve the model in multiple ways. First, it may simply be added to the training set, and the model may be retrained. Second, if the training is done via simulation of f, online data may be used to update the prior belief over via filtering algorithms that may be too expensive to run online.

B Further Experimental Details and Results

B.1 Toy Experiments

For both problems, datapoints are sampled uniformly between 5 and 5. For the sinusoid problem, the network was composed of two hidden layers of 128 units each, and 16 basis functions were used. For the step function, the same network architecture was used, but we used 128 basis functions. We did not notice a significant increase in computation time in the online phase for this number of basis functions, but training was slower than for the sinusoid problem. For both problems, 05. We also investigated both the sinusoid and the step with larger noise values, and these are plotted in figures 9 and 10.

In our comparison to MAML [21], the performance of MAML (in terms of MSE) was slightly worse than the reported values. However, ALPaCA performs better than the values reported in the original paper as well. This mismatch is likely due to differences in network training time and hyperparameter tuning. In this paper we did not perform experiments to optimize hyperparameters.

B.2 Dynamical Systems

The pendulum and hopper systems were based on default implementations in OpenAI’s gym [40]. For the pendulum, we use , while for the hopper, we use

This was chosen somewhat arbitrarily, as these problems have deterministic transitions (based on a physics simulator). Intuitively, is a hyperparameter that can either be identified from data, or chosen arbitrarily. When chosen arbitrarily, it controls the tradeoff between the variance term and normalized MSE term. For the hopper the different state elements varied with different orders of magnitude, and thus was chosen to match.

For the pendulum, the system state consists of angle and angular velocity. The angle was sampled uniformly between 0 and 2at the start of each episode, and the angular velocity was sampled uniformly between 8 and 8 (default initialization). The system was simulated in open-loop (zero action) to generate data. For this problem, two hidden layers of 128 units each were used, and 16 basis functions. Rollouts of the system are plotted in figure 12. In this figure, ALPaCA is compared to ALPaCA with no meta updating during training, and ALPaCA with no online updating. Note that ALPaCA without online adaptation is roughly equivalent to “dynamics randomization” [52], which has become a common tool to increase robustness of learned policies.

For the hopper, a policy was trained via Proximal Policy Optimization [53] with the dynamics randomization described in the body of the paper. This was utilized to generate hopping behavior. Using this policy was necessary, as open loop simulation of the hopper results in the system just collapsing to the ground. Rollouts of ALPaCA and MAML are plotted in figures 13 and 15. In this experiment, two hidden layers of 128 units were used, with 32 basis functions. For both the pendulum and the hopper, performance did not noticeably change with a larger number of basis functions. A comparison of time required to evaluate ALPaCA versus GPR is plotted in figure 11. Note that on the high dimensional hopper system, GPR rapidly becomes unsuitable for real time usage. While this performance could be improved with sparsification techniques, these are likely also unsuitable for use online. These timing curves were generated using a computer with a 3.6GHz octo-core AMD Ryzen 1800X CPU.

B.3 Vehicle Lane Change

For this experiment, the lane change dataset from [41] was used. This dataset consists of 1105 episodes of humans pairs driving against each other (on a driving simulator), wherein the goal is to switch lanes with each other without colliding. The participants were not allowed to communicate. A total of 19 pairs of people are captured in the dataset. The data is captured at a frequency of 10Hz. While episodes are nominally 50 timesteps long, some end earlier than this. As a result, we cut all trajectories to the minimum length observed in the data, which was 33 timesteps. Again, two hidden layers of 128 units each were used, and 32 basis functions. Performance stayed relatively constant for a greater number of basis functions, but degraded for 16. The noise was set as

These values were estimated by training the ALPaCA with , calculating the MSE of each element on trajectories in a validation set when conditioned on the entire trajectory, and setting the diagonal elements of to be approximately these values.

The prediction task was, given the position and velocity of the vehicles, predict the next position and velocity of both vehicles. Predicting the position is relatively easy as the velocity may be used for first order approximations, but the velocity (specifically, the change in velocity) at the next timestep captures the actions taken by the drivers. As such, the task was effectively predicting human decision-making. We found performance was substantially improved by predicting the state difference between subsequent timesteps. The dataset was split into a train set of 1000 examples and a test set of 100. For details on the specific initial conditions of the dataset, as well as the mechanics of the simulator, we refer the reader to [41]. The initial conditions were chosen to make it ambiguous to the drivers whether passing in front of or behind the other vehicle is preferable, and so the dataset captures humans reasoning in real time under uncertainty.

Fig. 7: Negative Log Likelihood (left) and Mean Squared Error (right) for the sinusoid problem.

Fig. 8: Negative Log Likelihood (left) and Mean Squared Error (right) for the step function problem.

0

0

0

0

Fig. 9: Results for ALPaCA for the sinusoid problem with varying . The noise covariance is fixed for each column, the rows show 0 – 10 samples.

Fig. 10: Results for ALPaCA for the step function problem with varying . The noise covariance is fixed for each column, the rows show 0 – 20 samples, in steps of 2 samples per row.

Fig. 11: Computation time for GPR and ALPaCA on the pendulum environment (left) and the hopper environment (right).

Fig. 12: Rollouts for the pendulum problem for ALPaCA, ALPaCA without meta-updates during training, and ALPaCA without meta-updates during execution. Rollouts were generated by sampling K and recursively evaluating the dynamics with the same value of K. The two curves (red and blue) are the two elements of the state, and the dotted black and solid black line denote the true rollout.

Fig. 13: Rollouts for the hopper environment (dotted black) and predicted rollouts from dynamics models sampled from the posterior maintained by ALPaCA (blue). As can be seen, as ALPaCA incorporates online experience into its posterior, the uncertainty in the dynamics is reduced, with predicted trajectories clustering closer to the true trajectory.

0.10

Fig. 14: Rollouts for the hopper environment (dotted black) and predicted rollouts from MAML (blue).

�6

�4

�6

�4

�6

�4

�8

�6

�4

Fig. 15: Rollouts for the lane-change example environment, each row corresponding to a different trial in the test set. Dotted black shows the whole trajectory of both cars. 10 timestep predictions by ALPaCA (left), and MAML (right) at timesteps 0, 10, and 20 are overlaid. These predictions are conditioned on the trajectory of both cars up to that point. The ground truth over the time period of each prediction is highlighted in solid black. ALPaCA consistently reduces uncertainty and improves the quality of its predictions as more data is observed. While this happens in some cases with MAML, in many cases MAML fails to adapt, and furthermore, gives no indication of its uncertainty.

designed for accessibility and to further open science