Gaussian processes (GPs) [Rasmussen and Williams, 2006] are probability distributions over functions with many properties which make them convenient for use in Bayesian models. They can be used as both priors and approximate posteriors in a wide range of models where some kind of function needs to be learned, such as classification, regression, or state-space models. While GPs are usually formulated as having a single output, multiple interacting outputs can be considered as well [Álvarez et al., 2012]. This is crucial for building deep Gaussian process models where we want to be able to learn more complex intermediate mappings than just simple 1D warpings.
The most well-known drawback of GPs is that exact implementations scale with the number of training points N as in time. Multioutput GPs (MOGPs) have the additional problem of also needing to consider the covariances between all P outputs, leading to
scaling. Fortunately, the development of variational inducing variable approximations has reduced the cost of expensive matrix manipulations to
, where
, and has allowed for minibatch training [Titsias, 2009; Hensman et al., 2013]. These techniques have also been extended for use in MOGPs [Álvarez et al., 2010] and provided large improvements. Using interdomain approximations [Lázaro-Gredilla and Figueiras-Vidal, 2009], a wide range of efficient inference schemes can be implemented for MOGPs, which can reduce the computational complexity to
.
In this work, we give a comprehensive introduction to sparse variational Gaussian process models, interdomain approximations, and multioutput Gaussian processes and show how they fit together in a unified description. We then present a unified software framework for specifying inter-domain approximations and multioutput models, which we have implemented in GPflow [Matthews et al., 2017]. The flexible specification of interdomain approximations is particularly important for allowing the implementation of a wide range of efficient inference schemes for MOGPs, but our framework applies equally to single-output GPs. Our immediate practical goal is to flexibly allow convolutional Gaussian process [van der Wilk et al., 2017] structure in deep GPs [Damianou and Lawrence, 2013; Salimbeni and Deisenroth, 2017], which requires image-to-image GPs [Blomqvist et al., 2018; Dutordoir et al., 2020]. More generally, we hope that by removing obstacles to implementing and using multioutput GPs, more researchers will experiment with architectues that contain them, and that more practitioners will apply them to problems that can benefit from them.
Contents
1 Introduction 1
2 Scalar Gaussian processes 2 2.1 Gaussian process notation and manipulation . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Exact inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Approximate inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.4 Inducing point posteriors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.5 Variational inference for Gaussian processes . . . . . . . . . . . . . . . . . . . . . . 6
3 Interdomain Gaussian processes 7 3.1 Interdomain observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Interdomain inducing variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Multioutput Gaussian processes 9 4.1 Multioutput Gaussian process priors . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2 Approximate inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.3 Example: Linear Model of Coregionalization . . . . . . . . . . . . . . . . . . . . . . 12
5 Software framework 14 5.1 Desiderata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.2 GPflow model setup and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.3 Implementation of SVGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6 Implementation of example methods in our framework 20 6.1 Single-output Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.2 Multiscale interdomain inducing variables . . . . . . . . . . . . . . . . . . . . . . . 21
6.3 Image convolutional Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . 21
6.4 Variational Fourier features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6.5 Linear Model of Coregionalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7 Uncertain inputs 23 7.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
8 Conclusion 26
Gaussian processes [Rasmussen and Williams, 2006] are most commonly discussed and used as distributions over scalar-valued functions on some input space X, i.e. . Gaussian processes are particularly convenient as a distribution over functions for two reasons: 1) we can manipulate them by only considering the distribution of the function’s outputs at input points of interest, and 2) the distribution of function outputs is analytically tractable, since it is Gaussian. These two properties make it easy to use GPs to specify Bayesian models that involve function learning, and to develop (approximate) inference algorithms. The main advantage of specifying a distribution over functions directly, rather than specifying a distribution over parameters, is that we can use non-parametric models, i.e. models with an infinite number of basis functions (and corresponding parameters). This property is particularly important for obtaining uncertainty estimates which remain large in regions unconstrained by any data.
2.1 Gaussian process notation and manipulation
We denote a random function distributed as a GP as . The properties of the GP are determined by the mean function
and covariance function
. We denote an evaluation of the function
at an input point x as
. Several inputs
are grouped as X, and the evaluations at these inputs are denoted as
. If
, then
. A Gaussian process is defined by the property that its function values at arbitrary locations (here denoted X) are jointly Gaussian distributed as
We usually take without loss of generality.
Throughout this document, we will manipulate a whole function as an infinitely long vector. Instead of having a finite number of elements addressed by a discrete index, we have an element for each continuous-valued input. From this view, f(x) picks a single element out of the infinitely long vector
, while f(X) picks a subset of points. We will abuse notation by denoting the Gaussian process distribution over
as a density
. Strictly speaking, it is not possible to write a density over an infinitely long vector. However, finite-dimensional marginals of the GP are well defined by densities (eq. 1), so we consider
to be an object that gives us the correct finite-dimensional Gaussian distribution when we marginalise out all but a finite set of points.
We denote this process as
where denotes the infinitely long vector
excluding f(X). Since we work with finite datasets, we will always evaluate the GP at a finite number of locations, and therefore only need to consider a finite-dimensional marginal.
We will also often factorise into finite and infinite parts, e.g.
This splits the distribution over the infinitely long into the finite-dimensional Gaussian p(f(X)) (eq. 1), and the infinitely long conditional
that includes all function values except f(X). The conditional
is given by
where . This expression is identical to the conditional of the whole process
. In this latter case, the random variable f(X) is included in
on the left hand side of the conditioning and has a deterministic (delta function) relationship to the conditioned value. We explicitly denote
in eq. 4 so we do not double-count random variables on the right hand side of the equality.
2.2 Exact inference
We are interested in Bayesian models that use GPs as prior distributions. Common uses are regression, classification, or unsupervised learning (e.g. density estimation). Initially, our running example will be regression with arbitrary likelihoods. In our setup, we model N observations that are produced by evaluating the GP at input locations
. We denote our model as
where we assume that we can evaluate the likelihood density .
When performing inference, we are firstly interested in the posterior for making predictions, and secondly in the marginal likelihood for learning hyperparameters. One of the early reasons for interest in GPs was that closed-form solutions for these quantities exist when the likelihood is Gaussian, . We can obtain the posterior for any predictive point using Bayes’ rule, and the marginal likelihood through marginalising over the prior:
where . To find the predictive distribution for a specific set of inputs
, we need to marginalise the posterior GP from eq. 6 to remove all points but
:
In GPflow, this model is implemented as GPR (Gaussian Process Regression). Note that you must include the likelihood variance when predicting actual observations, rather than the unobserved function values [Rasmussen and Williams, 2006].
Despite having closed-form solutions to the quantities of interest, actual computation using these equations quickly becomes intractable for even moderately sized datasets. The cause is the computational cost for computing the inverse and determinant of K. The source of this intractability is unusual. Usually, Bayesian models have tractable priors and likelihoods, and any intractability comes from the fact that there is no closed-form solution for the normalising constant of the posterior (the marginal likelihood). The computational expense comes from needing to use methods like MCMC that avoid dealing with normalising constants. In GPs, however, it is the prior p(f(X)) that is intractable, as this density already contains the costly inversion and determinant calculations.
Non-Gaussian likelihoods present an additional challenge, due to the loss of conjugacy, which allowed the closed-form computation of eqs. 6 and 7. To summarise, any approximate method for inference needs to deal with 1) the cost of computing the prior, and therefore also the posterior, 2) likelihoods that are not conjugate to the Gaussian distribution. We will address this in the following section.
2.3 Approximate inference
Many approximations that address the two problems above have been proposed over the years. These can largely be categorised in terms of approximations to the model (see Quiñonero-Candela and Rasmussen [2005] for a review), and approximations to the posterior. When approximating Gaussian process models, we need to be careful that the approximation retains the desirable properties of the original model. Van der Wilk [2019, ch. 2] discussed desiderata that approximations to non-parametric models should satisfy, and argued that it was particularly important to maintain the uncertainty estimates afforded by using a non-parametric model. For this reason, we favour posterior approximations.
Broadly speaking, there are currently two general inference frameworks for posterior approximations to GPs that address the two problems above: expectation propagation (EP) variants [Minka, 2001; Csató and Opper, 2002; Bui et al., 2017], and variational inference (VI) methods [Blei et al., 2017; Titsias, 2009; Hensman et al., 2013; Matthews et al., 2016]. Both methods provide procedures for picking an approximate posterior from a family of tractable ones, and both methods use the same family of tractable approximate posteriors [Bui et al., 2017]. Currently, there is no consensus on which method is strictly better. In terms of pure predictive performance, neither method consistently outperforms the other. Bui et al. [2017] show that for approximations with few inducing points, interpolations between EP and VI often perform best, while Hensman et al. [2015b] provide favourable evidence for VI in classification. Bauer et al. [2016] and Van der Wilk [2019] compare EP (in its form as the FITC model approximation [Snelson and Ghahramani, 2006; Quiñonero-Candela and Rasmussen, 2005]) and VI more closely in terms of the quality of posterior approximation. They show that EP sometimes provides a posterior approximation with significantly different properties to the true posterior. On the other hand, VI was shown to approximate the true posterior in a predictable way. Moreover, VI can easily be extended to importance-weighted variational inference which has been shown to converge to the true marginal likelihood [Domke and Sheldon, 2018]. In this work, we focus on variational inference, mainly because of its practical convenience, and its strong theoretical guarantees including satisfying the desiderata from Van der Wilk [2019].
Variational inference works by minimising the KL divergence from the true posterior
to a distribution in the tractable family
. While the KL divergence cannot directly be computed, we can minimise it by maximising a lower bound L to
the log marginal likelihood, which has the KL as its gap:
The lower bound L depends on the variational approximation , as well as the training data and any hyperparameters. Minimising the KL as a means to choosing the approximate posterior provides some important properties. Firstly, since the algorithm requires only optimisation, it is guaranteed to converge. Secondly, subject to optimisation difficulties, we can only improve the approximation by expanding the family of approximating distributions [Bauer et al., 2016; Matthews, 2017]. Finally, variational bounds are relatively easy to specify for a wide range of models, and very general algorithms exist to optimise them. These algorithms are based on stochastic evaluation of the bound [Hoffman et al., 2013] and its gradients through either score function estimators [Ranganath et al., 2014] or the reparameterisation trick [Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014].
2.4 Inducing point posteriors
The variational lower bound L defined in eq. 9 depends on the family of approximate posteriors from which we will choose our approximation . We need a family that contains distributions that are convenient to manipulate mathematically, as well as computationally tractable. Additionally, we would like the family to be large enough to contain a distribution very close to the true posterior. We choose the family of Gaussian process posteriors introduced by Titsias [2009].
To construct a family of posteriors, we have to parameterise a distribution over functions, for which Gaussian processes are a convenient choice. We start by specifying the joint distribution over M function values at the input locations . We collect the function values f(Z) in the variable
, and specify a free mean and variance:
Next, we specify the distribution for all other points using the prior conditioned on u. Together, this gives a Gaussian process on
:
where and
. This family of posteriors is computationally tractable, as only a smaller
matrix
has to be inverted for making predictions.
To gain more intuition into this family of posteriors, we provide an alternative construction, similar to the unifying view from Bui et al. [2017]. We can consider the family of approximate posteriors from eq. 11 to be the collection of all the posteriors that we can get from observing M function values through an arbitrary Gaussian likelihood. To show this, we derive eq. 11 by setting up a regression problem akin to eq. 6 only with the likelihood . We get
where Z is the marginal likelihood for this surrogate regression problem that normalises . Following eq. 6, we obtain the approximate posterior from eq. 11
where and
.
This view provides a different way to intuitively understand the range of posteriors we allow. Following this interpretation, the variational method tries to find a set of M observations which, together with the right likelihood variance
, give a posterior as similar as possible to the true posterior in terms of KL. This is often possible, particularly in the case where the data contain redundant information in a specific region of the input space. A concrete example of this is where we have many noisy observations in a concentrated region. In this situation, a few very precise observations can give the same posterior as many noisy observations.
We can also immediately see that when M = N and a Gaussian likelihood is used in the model, the true posterior is contained within the family of distributions. Burt et al. [2019] show that in this case an arbitrarily good posterior can be found as , with
and the exact rate depending on the expected eigenvalue decay of
.
It is also worth pointing out that the approximate posterior has error bars that behave in the same way as the full model. That is, both the full model, and our approximate posterior predict with an infinite number of basis functions, allowing the posterior to be uncertain in regions unconstrained by the data. This is not the case for many parametric approximations (see Quiñonero-Candela and Rasmussen [2005] or Van der Wilk [2019, ch. 2]).
Finally, it is worth pointing out that the choice of this family of posteriors was not accidental. This family allows the KL divergence between the and
to be finite. Many distributions over functions, including many Gaussian processes, would give an infinite KL. By choosing a family consisting of distributions that are all posteriors with respect to the prior of the model, we can guarantee a finite KL [Matthews, 2017].
2.5 Variational inference for Gaussian processes
Now that we have specified the family of approximate posteriors, we need to find the variational lower bound. Matthews et al. [2016] showed that, despite technically not being able to deal with densities over functions, variational inference in GP models proceeds very similarly to the well-understood parametric case. The key observation is that the KL divergence between the approximate distribution over functions (eq. 11) and the true posterior can be computed from the KL between a finite marginal distribution on function values:
We can informally see that this makes sense, since we can include any other set of function values (denoted ) without changing the total KL divergence:
Substituting this KL divergence into eq. 9 and applying Bayes’ rule to the true posterior gives the bound discussed by Hensman et al. [2013] and Hensman et al. [2015b]:
The original variational GP method by Titsias [2009] is a special case for regression, where q(u) is analytically optimised, and substituted in. The bound takes the same form as familiar bounds for parametric models, with an expected log-likelihood term and a KL to the prior. Conveniently, the final bound is no more complex than that of parametric models, despite the technical difficulties of dealing with distributions over functions.
To evaluate the bound, we need to take expectations over the approximate posterior of a single function value, given by the univariate Gaussian distribution:
To reduce the cost of summing over large numbers of data points, Hensman et al. [2013] suggested using minibatches to obtain an unbiased estimate of the sum. Other likelihoods can be handled by estimating the expected likelihood using Monte Carlo (unbiased), or Gauss-Hermite quadrature (small bias, but zero variance) [Hensman et al., 2015b].
When working with Gaussian processes, we are usually concerned with observations and predictions of the function itself. However, point observations may not be the only properties of a function that we are interested in, or that we can observe. For example, when modelling the trajectory of an object, its velocity can be observed as well as location, which gives additional observations of the function’s derivative [O’Hagan, 1992; Rasmussen and Williams, 2006, ch. 9.4]. Alternatively, in Bayesian quadrature, we are interested in learning about the integral of a function from point observations [Rasmussen and Ghahramani, 2003]. These quantities rely on a transformation of
to a different domain, and we will name them “interdomain” variables, following Lázaro-Gredilla and Figueiras-Vidal [2009]. We start by discussing how observations of these interdomain quantities can be incorporated into an exact posterior. However, we are primarily interested in using interdomain transformations to define inducing variables for specifying GP approximations, which we will discuss afterwards. Here, we give an informal overview of interdomain methods, with a more formal discussion of some difficulties surrounding measurability being left to Matthews et al. [2016] and Matthews [2017].
3.1 Interdomain observations
The earlier examples of derivative or integral observations can be seen as observations of after having been passed through some linear operator
. In other words, L takes the entire function
and returns a real value. We modify the likelihood from
to
, where
is the linear operator for the nth observation.
A common example for the linear operator is an integral operator, as used by Lázaro-Gredilla and Figueiras-Vidal [2009]:
We can see that indeed maps the entire function
to a single real value. A different example of a linear operator which we can express, is the derivative of the dth input dimension as
In this case, the linear operator only depends on a small neighbourhood around . Since an operator is allowed to depend on the entire function
, it is also free to only depend on a part of it while fitting within the operator framework. In fact, even regular point observations fit within this framework, where the linear operator simply depends on a single point:
We will now derive the posterior for these operator observations. To ease notation, we redefine the operator L to return all N quantities needed for the likelihood terms. This makes , where
. To find the posterior, we first need to find the joint distribution between variables on
, and
, which the likelihood depends on. Given that the dependence between
and
is linear, and that the prior on
is Gaussian, the joint will be Gaussian as well. Keeping in mind the usual shorthand of
referring to function values at arbitrary inputs, we obtain
where the covariances and
are defined as
These covariances can be evaluated by using the definition of , taking advantage of the linearity of L, and then taking expectations over
:
where with we denote the linear operator that works on the
parameter of
. A specific example is when all elements in L use the integral operator of eq. 18, which gives
We now have all the required elements for obtaining the posterior for interdomain observations. Using the fact that is jointly Gaussian, everything reduces to Gaussian conditioning as for eq. 6. We obtain the posterior
which differs only from the point observation posterior (eq. 6) through the use of the interdomain covariances and
.
3.2 Interdomain inducing variables
Apart from being quantities of interest for observation or prediction, interdomain variables can also be used to specify approximate posteriors. Álvarez and Lawrence [2009] used interdomain variables tailored to a particular model construction, while Lázaro-Gredilla and Figueiras-Vidal [2009] discussed interdomain approximations for a given squared exponential kernel. Both these early approaches used a FITC style approach, while Álvarez et al. [2010] was the first to use an interdomain posterior within the variational framework.
The construction of approximate posteriors using interdomain variables follows the exact same procedure as using inducing points. In eq. 12 we showed that we specified our family of approximate posteriors by conditioning on M arbitrary point observations. By changing the covariances as we did in eq. 27, we can specify our approximate posterior through conditioning on interdomain variables instead. The final form of the approximate posterior (eq. 11) is changed only through and
being evaluations of the interdomain covariances of eq. 22. The variational lower bound remains unchanged, with the
term now being between the interdomain variables [Matthews et al., 2016; Matthews, 2017].
There are many possible interdomain inducing variables that can be used to construct approximate posteriors. In certain cases, an appropriate choice can make a crucial difference to the effectiveness of an approximation, particularly when interdomain variables can be tailored to a particular kernel of interest. Álvarez et al. [2010] construct a kernel through a deterministic transformation of a white noise process, and want to create an approximation by learning about the white noise process. Normal inducing points are completely ineffective at representing knowledge about white noise, but well-chosen interdomain variables can capture the useful low-frequency characteristics of a realisation. Similarly, computationally tractable convolutional GPs rely on tailored interdomain variables [van der Wilk et al., 2017]. Well-chosen interdomain variables can also be used to induce computationally favourable structure in for general Matérn kernels [Hensman et al., 2017].
In the following section we introduce multioutput GPs and discuss how interdomain approximations allow such models to scale to large numbers of data points as well as large numbers of outputs. In section 5 we then introduce our software framework that allows kernels and interdomain variables to be specified in a flexible manner, which allows all combinations to be specified with minimal code duplication.
So far, we have discussed approximations for scalar GP models. Multioutput Gaussian processes (MOGPs) (see Álvarez et al. [2012] for a review) are powerful models that can represent useful correlations between related outputs. Taking these correlations into account allows the information from one output to be used in the prediction of another. This helps in multi-task learning, when predicting multiple outputs using a single model. Convolutional layers in deep convolutional Gaussian process models [Blomqvist et al., 2018; Dutordoir et al., 2020] also exhibit multioutput structure, where outputs for patches in different image locations are correlated. While the additional correlations are helpful for the model, they present an even larger computational challenge than single-output models. In this section, we will introduce MOGPs and discuss how we can do efficient inference in the variational inference framework, leading up to the discussion of our software framework in section 5.
We are interested in models that learn a multioutput function . We take the input space X to be
, but much carries over to the general case. We denote by
the pth output of
at the input x. A prior over such functions is a multioutput Gaussian process [Álvarez et al., 2012] if the distribution of a vector of values
is Gaussian distributed. Note that for each element we specify the input as well as which output
we want to consider. As with single-output GPs, multioutput GPs are fully specified by their kernel (we assume zero mean functions). Kernels for multioutput GPs can be seen as matrix-valued (or operator-valued in general)
where a covariance matrix for all outputs is returned [Micchelli and Pontil, 2005]. This view has allowed in-depth analysis of kernel methods for learning vector- and even function-valued functions [Kadri et al., 2016]. An alternative view is that the matrix-valued kernel obeys the same positive definiteness properties of a single-output kernel where the index of the output is treated simply as another input. In other words, we can think of multioutput kernels as functions on the input space X extended by the index of the output:
This means that the collection on random variables f from eq. 28 will have the distribution
This insight shows that when we observe an arbitrary collection of different output dimensions of for different inputs, which we will refer to as heterotopic data, we can simply use the same procedure and software as for single-output GPs. While this view is conceptually simple, it obscures structure that can be taken advantage of for convenient and efficient implementation.
The “output as an input” view is particularly ill-suited for dealing with data where we observe (nearly) all output dimensions of for each input (homotopic data), both in terms of mathematical notation and software implementation. In these situations, we will instead use
to denote the stacked outputs for all N inputs in X. We can obtain the covariance of p(f(X)) by taking the
block of K to be
, using the matrix-valued kernel of eq. 29:
We will sometimes index into f(X) and K as multidimensional arrays, in their unstacked form. Specifically, we denote
The true index given n and p can be found to be . This notation provides several benefits. Firstly, when dealing with all outputs, it is inconvenient to explicitly denote the output dimensions as inputs. Secondly, always considering all outputs may introduce structure in K, which can be computationally exploited (we will see examples later). Finally, in software terms, we want to organise a collection of vector-valued outputs and their covariances in high-dimensional arrays, which makes their grouping explicit, which simplifies later use.
We will now discuss several examples of multioutput kernels. In order to take advantage of details that help efficient implementation, we take the view of evaluating a vector-valued function at particular inputs, rather than the “output as an input” view. This will then lead to a software framework which a) allows easy specification of the most efficient implementation, and b) does not compromise on unified software interfaces.
4.1 Multioutput Gaussian process priors
Álvarez et al. [2012] discussed different choices for priors on the correlations between multiple outputs. Here we briefly review these as well as convolutional GPs.
4.1.1 Linear model of coregionalization
A simple way of introducing correlations in the outputs is to construct our multioutput function from a linear transformation
of L independent functions
as:
with and
. This is known as the Linear Model of Coregionalization (LMC) [Journel and Huijbregts, 1978], and implies the covariance
This construction can be viewed as a form of input-dependent PCA or Factor Analysis, where we have a linear relationship between a latent and observed space. We can choose L < P to obtain a low-dimensional explanation.
4.1.2 Convolution processes
The LMC only allows for a limited kind of correlations across outputs. It cannot capture, for example, outputs which are simply delayed versions of each other, i.e. . Such more complicated relationships between outputs can still be expressed linearly by constructing
through a convolution of some latent process
. The resulting convolution process (CP) can exhibit relationships like time-lags, and more general linear dependence on past observations. Such models have been investigated in different forms over the years [Higdon, 2002; Boyle and Frean, 2005; Álvarez and Lawrence, 2009; Álvarez et al., 2009]. We follow the description by Álvarez et al. [2010], which constructs
from a convolution of
as
which, when taking the prior on as before, results in the covariance
is usually parameterised in a way that makes the integral tractable, and adds a number of parameters that balances flexibility against susceptibility to overfitting.
4.1.3 Image convolutional Gaussian processes
Gaussian processes with convolutional structure [van der Wilk et al., 2017] have been proposed for learning functions on images. They work by learning functions on a patch-by-patch basis, after which all patch responses are combined to form a single response for the image. This is similar to convolutional neural networks, with an exact correspondence in a particular infinite limit [van der Wilk, 2019]. The construction starts with a , operating on image patches of size
. An image response is obtained by summing the response to all P patches
where is the pth patch in the image x, and
is the number of patches for a
image. The construction can be phrased similarly to a convolution processes (section 4.1.2) by taking
, and integrating over the space of patches rather than images.
This construction is easily adapted to create an image-to-image MOGP, by simply returning all patch responses, rather than summing them. We obtain the resulting function which has one output per patch. This construction was discussed by Dutordoir et al. [2020] to create a convolutional layer for use in a deep convolutional Gaussian process model. Since we construct all outputs using the same patch response function
, the correlation of all outputs is given by the covariance function of
:
4.2 Approximate inference
Exact inference with MOGPs is complicated by the same issues that were discussed in section 2.2. The computational scaling in particular is worse, since if we observe P outputs for all N inputs, we will obtain a covariance matrix, resulting in
time scaling. In the same way as for single-output GPs, these problems can be addressed in the variational framework (section 2.3). In the following, we will consider general likelihoods for MOGP priors of the form
, where the vector-valued observation
can depend on all outputs of
for the input
. A simple but useful example to keep in mind is where we use a correlated Gaussian as the likelihood, e.g.:
To find the variational bound for MOGP models, we have to take this modified likelihood into account, as well as any changes that come from using multioutput variational posteriors. Luckily, much of the single-output case carries through. As before, we define the approximate posterior through the prior conditioned on the inducing observations u (i.e. ), together with a free-from density q(u):
Denoting the prior conditonal is similar to the single-output case, but requires slightly more sophisticated notation. To obtain a similar expression, we will make the following conventions:
• In multioutput scenarios, we will use to refer to the number of elements in u, i.e.
. This is an important distinction for the interdomain inducing variables.
• We take the kernel to be a matrix-valued kernel, as in eq. 29, i.e.
.
• The covariance matrices and
are obtained by the same stacking as eq. 34. This makes
, and
.
• We index into these matrices in the same way as in eq. 35.
These conventions allow us to write the approximate posterior at x as
We can now write the multioutput variational lower bound with a variational expectation over the P-dimensional q(f(x)) to take the multivariate nature of the new likelihood into account:
This bound is written for homotopic data, as it assumes that the likelihood depends on all P dimensions of f(x). Heterotopic can be dealt with in the same way by simply marginalising out unobserved variables. For example, if we only observe a subset of outputs described by the index set at the input
with the correlated Gaussian likelihood from eq. 42, we can integrate out all elements in
that are not in
. The likelihood will then also only depend on a subset of the outputs in
. This only changes the variational expectation in the ELBO:
Integrating out the unused random variables is done by selecting the relevant rows and columns of and
.
In the discussion above, we have used the vector of inducing outputs u without defining exactly which random variables they correspond to. This general discussion only assumes that we have a covariance of u as given by . Creating efficient multioutput methods depends strongly on the choice of u, as this can induce structure in
which can make it easier to invert. In the next section, we will discuss various choices of u and their computational benefits.
4.3 Example: Linear Model of Coregionalization
In the previous section, we stated the variational bound for multioutput GP models, and that its computational efficiency depended on the structure in . Here, we discuss variants of sparse approximations to the Linear Model of Coregionalization (section 4.1.1). The main design choice in the approximation is the definition of u, which determines the matrices
and
. We will show how different choices lead to approximations with different computational characteristics, both in terms of the ease with which
can be inverted, as well as how many kernel elements need to be computed. This section discusses examples in mathematical terms, while section 6.5 delves into the implementation.
4.3.1 Inducing points
We start with by introducing two formulations of inducing points for MOGPs. In the same way as in single-output GPs, these inducing points are general, and work with every multioutput kernel.
We can choose our inducing variables in an unstructured way, by specifying input-output pairs
using the “output as an input” formulation. This gives the following u and corresponding covariance matrices:
Alternatively, we can define inducing points for MOGPs as specifying M inducing inputs, and using all corresponding outputs as inducing variables:
In this case, we have inducing variables for M inducing inputs. This “vector-valued inducing point” approach defines more inducing variables for fewer inducing inputs, which reduces the number of parameters to be optimised. This may be beneficial even when dealing with heterotopic data, particularly if the input distribution is the same for all outputs.
Both these methods yield dense and
matrices for which no additional simplifying assumptions can be made. However, this method does immediately work for all multioutput kernels. We include the vector-valued inducing point approach in GPflow (section 6.5.1) as the most basic and general method.
4.3.2 Latent inducing points
For the LMC, approximate inference can be made more efficient by a more judicious choice of inducing variables. In section 4.1.1, we constructed the LMC from L independent latent processes . If we choose evaluations of
as our inducing variables, we can take advantage of the prior independence of
over
. We collect all outputs of
for M inputs in Z, to obtain
inducing variables. Thanks to the prior independence,
will have a block diagonal structure, meaning that the largest matrix that has to be inverted only has size
. We specify the approximation as
We impose an additional mean-field restriction on q(u) by choosing , where
. This gains extra computational efficiency at the cost of some accuracy. We can now rewrite eq. 44 to take advantage of this block-diagonal structure. We write
for the inverse of the
block of
.
where we implicitly sum over repeated indices (Einstein notation). Furthermore, by noting that , we can write eqs. 51 and 52 as a linear transformation of the predictive distributions of
. If we denote the predictive density at the input x as q(g(x)) =
(note that
is diagonal), we have
4.3.3 Latent inducing points for the Intrinsic Model of Coregionalisation
We can further gain computational benefits by constraining the kernels of each to be equal, which is known as the Intrinsic Model of Coregionalisation (IMC). In terms of modelling quality, this constraint may be beneficial or a hindrance depending on the characteristics of the dataset. However, it is much faster than the more flexible model of the previous section: instead of L separate inversions, only a single
inversion has to be performed. Our predictive mean and covariance become
We can again rewrite these terms as in eq. 53, only in this case the computation of and
would require only a single inversion.
4.3.4 Computational complexity
The choice of inducing variable has a large impact on the computational complexity of the method. We compare the methods of the previous three sections in table 1. For the methods in sections 4.3.2 and 4.3.3, we have two ways to compute the required quantities: 1) first calculate followed by a summation over
, 2) first sum over
to obtain
and
, and then sum over
. While method 2 is more efficient than method 1, this choice influences the complexity of implementation, which we discuss in section 6.5.
Table 1: Computational complexity of computing for different choices of inducing points (IP) for the LMC model. We explicitly include the dependence on the number of predictions N that need to be made.
In this section, we will present our unified implementation of interdomain and multioutput Gaussian processes, following the mathematical derivation that we described so far. Our framework is implemented as a large extension to GPflow [Matthews et al., 2017]. We start with describing the desiderata which guided our design choices, followed by a description of the code architecture. We will specifically refer to locations in the code where key items are implemented, so the reader can gain an idea for how other kernels and interdomain variables can be implemented.
5.1 Desiderata
Our contribution to GPflow maintains its limited scope of implementing only variational approximations to GP models. While many of the abstractions described here are useful for other methods (e.g. EP), we maintain the narrow focus to ensure high-quality code. Even with this limitation, the previous sections illustrated that a wide variety of models can be approximated with a wide variety of approximate posteriors. The goal of our software framework is to facilitate the implementation of as many combinations of model and approximation structure as possible, in a computationally efficient manner. We summarise this goal using three desiderata:
1) Efficiency: Our framework must allow the most computationally efficient code to be implemented, by exploiting any mathematical structure in the model and approximation.
2) Modularity: Using different combinations of modelling and approximation assumptions should require a minimum of effort and code-duplication.
3) Extensibility: Our framework should allow new model and approximation variations to be implemented without modifying core GPflow source, following the open/closed design principle [Martin, 1996].
The mathematical framework discussed in the previous sections requires us to make three choices that influence the overall method. The kernel and likelihood determine model properties, while the choice of interdomain variables u determines properties of the approximation. We will achieve modularity and extensibility by expressing the design choices as class hierarchies, which specify the most efficient code path through multiple dispatch.
5.2 GPflow model setup and examples
We focus our implementation around the variational bound from eqs. 16 and 45 [Hensman et al., 2013; Hensman et al., 2015b], as it works for general kernels, inducing variables, and likelihoods, in contrast to more specialised methods like Titsias [2009]. Our contribution extends the existing SVGP (Sparse Variational GP) implementation in GPflow [Matthews et al., 2017]. SVGP requires that three main components are specified:
1. the kernel, which determines the prior on ,
2. the inducing variables u, and
3. the likelihood.
Each is passed as an object to the SVGP, presenting a flexible interface for users to combine different kernels, inducing variables, and likelihoods. A model can be trained to the dataset (X, Y) by minimising its objective function with respect to its variational and hyperparameters, using:
We now review several examples that show how our interface can be used, and that it allows for the creation of models that flexibly mix and match different components.
Single-output and inducing points The most straightforward use case is simple single-output regression with inducing points on a dataset where X.shape == (N, D) and Y.shape == (N, 1). This model can be set up with listing 1.
Single-output and Fourier features The properties of the inducing variables greatly influence the computational properties of the method. Variational Fourier Features [Hensman et al., 2017], for example, reduce the cost of inverting , and allow precomputation of many costly quantities. We can use them by simply changing the inducing variable that is passed into SVGP in listing 1:
with everything else remaining the same. We give the complete example in the GPflow documentation [2019c].
Convolutions and inducing patches We can create a GP model with convolutional structure [van der Wilk et al., 2017] by using a Convolutional kernel:
where a set of N images is represented as a matrix X of shape . While using “default” InducingPoints is possible, van der Wilk et al. [2017] argue that tailoring the inducing variables to the convolutional structure of the kernel leads to a much more efficient method. We achieve this by instead using InducingPatches as inducing variables:
A fully-working example can be found in the GPflow documentation [2019a].
Multioutput kernels Multioutput models also fit into this framework, and can be specified using a multioutput kernel. The usual InducingPoints can still be used to specify a posterior, as described in section 4.3.1, resulting in the code:
Alternatively, we can use inducing points defined in to take advantage of the most efficient code path (section 4.3.2):
See the GPflow documentation [2019b] for more examples on multi-output GPs in GPflow.
5.3 Implementation of SVGP
The main role of SVGP is to specify the computation of the training objective, and to allow for future predictions once the model is trained. As stated earlier, we train our model by maximising the ELBO (eqs. 16 and 45), which we reproduce here:
We need to implement SVGP and its components in a way that satisfies our desiderata of efficiency, modularity, and extensibility. The example use-cases in the previous section determine the interface
Figure 1: Flow of elbo().
of SVGP to ensure modularity. To satisfy extensibility, we need to ensure that SVGP does not rely on properties that are specific to a particular kernel or inducing variable, and that new behaviour can be implemented without modifications to GPflow. We will separately discuss each of the four components that form the ELBO: 1) the KL term, 2) the predictive distribution q(f(x)), 3) the expectation of the log-likelihood over the predictive distribution, and 4) the sum of the expectations over the log-likelihood. Although slightly hidden from the implementation of SVGP, we will also discuss how the covariance matrices and
are computed, which the KL term and predictive distribution rely on.
We see a skeleton of the implementation of SVGP in listing 2. The interface of the constructor permits the usage described in the previous section, by accepting a kernel, inducing variable, and likelihood. Optionally, a user can specify a mean function, and a different parameterisation of the variational parameters (using whiten, see section 5.3.3). The core of SVGP computes the objective function in SVGP.elbo(), where we see the separation of the different terms of the ELBO:
• KL divergence The first quantity we compute in elbo() is the term (eq. 56), i.e. the KL divergence between the approximate posterior and prior of the inducing outputs. This can be calculated in closed form as both distributions are Gaussian. The method that computes this quantity, kullback_leiblers.prior_kl() takes the inducing variable and kernel objects, in addition to the specification of the variational posterior through q_mu and q_sqrt. We pass the inducing variable and kernel in as objects so we can specialise the computational path to take advantage of any resulting structure in
.
• Predictive distribution The predictive distribution q(f(x)) is computed through a call to the function conditional() inside predict_f(). The method will return the mean fmean and the variance fvar of the posterior predictive GP evaluated at points Xnew, following the general equation of a conditional GP eq. 13. As with the KL divergence, we want to tailor the code path to the combination of inducing variable and kernel that is being used, so we pass these objects in to the function as parameters.
• Expected log-likelihood The expected log-likelihood is encapsulated in classes inheriting from Likelihood, which only requires the predictive distribution (fmean and fvar), and is therefore independent of any model choices once conditional() has been called.
• Minibatch sum In the last line of elbo() we compute the sum of the expected loglikelihoods of the data points in the batch and rescale its contribution to the overall ELBO to compensate for the minibatching.
We will now delve into the implementation of these terms in more detail.
5.3.1 Multiple dispatch
Both for computing the KL divergence (prior_kl()) and for computing the predictive density (conditional()), the code path depends on the type of inducing variables and kernel we use. One possible approach for customising the code paths inside these functions would be to have branches for each pair that is implemented, e.g.:
While this implementation would allow the most efficient code paths to be specified, it is not extensible. We want to be able to implement specific code for new combinations of kernel and inducing variable without having to edit core GPflow code.
To provide this functionality, we rely on multiple dispatch (we use the Python implementation by Rocklin [2014]). When a multiple dispatch function is called, the runtime types of its arguments are used to determine which specific code path to run. This is a runtime equivalent of function overloading in statically typed programming languages. Multiple dispatch allows new code paths to be registered as separate functions, which can be placed anywhere in code, therefore enabling extensibility. A code path for a new pair of types is registered using the dispatcher.register(*types) decorator. For example, in GPflow we register various implementations of conditional():
When conditional() is called, the appropriately typed implementation from above is run. The inheritance structure of the inducing variable and kernel classes is respected, with the most spe-cific type that has an implementation being preferred, and parent types providing default implementations. For example, if a new inducing variable NewInducingVariable inherits from InducingPoints, conditional() will call the second implementation in listing 3, unless a more specific implementation is registered with @conditional.register(NewInducingVariable, kernels. Kernel).
The key benefits of implementing SVGP in this way, is that the model is completely general to any kernel, inducing variable, and likelihood. Any code that is specific to a particular combination (e.g. for efficiency) can still be specified, but external to the model and even the GPflow package itself (e.g. GPflow documentation [2019c]). This achieves our desiderata of efficiency, and extensibility. In the next sections, we will describe the details of each component of the ELBO: covariances (section 5.3.2), prior_kl() (section 5.3.3), conditional() (section 5.3.4), and variational_expectations() (section 5.3.5).
5.3.2 Covariances Kuu & Kuf
Although not directly visible in SVGP, both prior_kl() and conditional() need to compute the covariances and
. Since their expressions (eq. 22) depend on both the type of the kernel and inducing variables, we also implement their interface with multiple dispatch (listing 4).
Listing 4: Three definitions for interdomain covariances, using multiple dispatch for extensibility.
5.3.3 Kullback–Leibler divergences
To compute the KL divergence between the approximate posterior process and the prior process
, we only need to compute the finite-dimensional KL divergence at the inducing points [Matthews et al., 2016]:
Due to the possible structure in , we use multiple dispatch on the inducing variables and kernel. This gives a general interface for prior_kl() of:
For unstructured , GPflow provides a default gauss_kl() function to compute eq. 58. In the formulation above, it requires the
matrix, which is computed with the multiple dispatch function Kuu, together with the variational parameters m and S. We parameterise S through its lower-triangular Cholesky factorisation, i.e.
, which makes
, which is computationally cheaper.
We can alternatively compute the KL without direct access to by whitening the inducing variables. In SVGP this is indicated by setting whiten=True. Instead of conditioning on u, we use
Table 2: Shapes of the covariance returned by conditionals for different values of full_cov and full_output_cov. The returned mean always has shape .
a transformed version v designed to have an identity covariance in the prior:
where . It is clear that this construction implies the correct distribution p(u) =
. If we represent the variational posterior over v instead of u, then
Since the prior over v is now a centered standard Normal, this simplifies the KL divergence:
Whitening can make optimisation or MCMC sampling more efficient [Hensman et al., 2015a]. In the GPflow source code a whiten flag indicates whether we are passing () or (m, S).
5.3.4 Conditionals
The computation of the approximate posterior distribution q(f(X)) is the most complex component to be specified. It is required for computing the expected log-likelihood (section 5.3.5), as well as for making predictions. In its most complete form, q(f(X)) is specified by covariances between every output for every input, giving elements in total. Computing the expected log-likelihood never requires covariances between outputs for different inputs, and if the likelihood factorises over the output dimensions, only the P marginal variances are needed. Covariances between outputs for different inputs are generally only needed for plotting. This gives four different predictive covariances that could be computed, determined by whether covariances are needed for different inputs (full_cov={True|False}), and outputs (full_output_cov={True|False}).
We bring all this functionality into the single function conditional(), which deals with computing q(f(x)) with all four output covariance shapes. From the point of use in SVGP, as described in section 5.2, we need a fixed signature for conditional(), which we specify as:
The return values of conditional() are the mean and covariance of function values for the N input points in Xnew. The return shape of the covariance depends on the value of full_cov and full_output_cov, summarised in table 2. We choose the layout of the returned covariances based on what is most convenient for follow-on operations. Since these are usually matrix multiplications or Cholesky decompositions, the tailing dimensions always contain the matrix of covariances that are requested.
In principle, conditional() only needs to implement eq. 44. However, different combinations of priors and inducing variables induce different structure in and
, which can be exploited for computational efficiency. As described earlier, we will use multiple dispatch to allow the most efficient code path to be specified, while keeping a consistent interface. We dispatch on the specific inducing_variable and kernel that are passed to conditional(). In the following sections, we describe three different conditionals that are implemented in GPflow, each making a different assumption about
and
. In section 6, we discuss various models (including the LMC from section 4.3) that can be implemented efficiently using these and other custom conditionals.
5.3.5 Likelihood
The expected log-likelihood is computed by Likelihood.variational_expectations(Fmu, Fvar, Y), which implements the integral
This function is separated from all other components as it depends only on the variational predictive distribution . This is computed by conditional(), and represented through its mean
(passed as Fmu) and covariance
(passed as Fvar).
For certain likelihoods this expectation can be calculated in closed form. This is implemented by inheriting from the Likelihood base class, and implementing the closed-form expression in Likelihood.variational_expectations(Fmu, Fvar, Y). If this not possible, GPflow can fall back to either Gauss-Hermite quadrature or Monte Carlo estimation. If is diagonal (i.e. the outputs under the variational posterior are independent), or if the likelihood is independent over dimensions as
, the multidimensional integral becomes equivalent to a sum over single dimensional integrals, which makes Gauss-Hermite or Monte Carlo estimation particularly effective. These can be used by inheriting from the classes Likelihood or MonteCarloLikelihood, respectively.
In this section, we wish to demonstrate the efficiency, modularity, and extensibility of our framework by discussing several example methods. We pay particular attention to demonstrating the use of multiple dispatch, and show how the appropriate Kuu(), Kuf(), and conditional() functions can be implemented.
6.1 Single-output Gaussian processes
We begin with the most simple example of a single-output GP with an approximate posterior that induces no structure in . Listing 1 sets up an example of such a model with an inducing point posterior, with Kuu() and Kuf() being defined in listing 4. To complete the implementation we need to define a version of conditional() and register it using @conditional.register(...). Since there is no structure in
, we need to define a version of conditional that simply implements eq. 17 in accordance with the shapes in table 2. We register the function using the following code:
We register the function as a conditional with the dispatcher for general Kernel and Inducing Variable objects, using @conditional.register(...). This will make this code path the default for all kernels and inducing variables that inherit from the base classes Kernel and Inducing Variable, unless more specialised implementations are provided. The main work of implementing eq. 17 is done in base_conditional(). We pass in and
as matrices, after they have been evaluated using the multiple dispatched functions Kuu() and Kuf(). To keep base_conditional() simple, it does not implement full_output_cov=True. To ensure that the returned shapes variances are consistent with other multioutput conditionals, we use the helper function _expand _independent_outputs().
6.2 Multiscale interdomain inducing variables
In section 3 we discussed the use of integral transformations of a function as inducing variables. Lázaro-Gredilla and Figueiras-Vidal [2009] proposed using
being a Gaussian, or a Gaussianwindowed sinusoid. When combined with the SquaredExponential kernel, the integrals for the required covariances (eqs. 23 and 24) have closed form expressions. We included the Gaussian integral transformation as the Multiscale inducing variable in GPflow. While these inducing variables do not provide special structure in
and
, they do allow the posterior to represent longer-range correlations with fewer inducing variables. In terms of code, this method can be implemented by simply specifying the appropriate covariances:
Using the @method.register(...) line, we ensure that this code is specific to the combination of the Multiscale inducing variable, and the SquaredExponential kernel. Since Multiscale and SquaredExponential inherit from InducingPoints and Kernel respectively, the same version of conditional(...) will be called as in the previous section.
6.3 Image convolutional Gaussian processes
Similarly to Multiscale inducing variables, the inducing patches for the convolutional Gaussian process [van der Wilk et al., 2017] can be implemented using by only specifying Kuu() and Kuf(), while relying on the default conditional. We define InducingPatch for specifying inducing variables in (see section 4.1.3), and define multiple dispatch Kuu and Kuf functions for use with the Convolutional kernel:
Image-to-image convolutional Gaussian processes [Dutordoir et al., 2020] also fit within this framework, but require custom conditionals to implement the covariances between the different outputs.
6.4 Variational Fourier features
Hensman et al. [2017] define inducing variables in the spectral domain of the process which lead to periodic functions for . More importantly, this careful choice leads to a “low-rank plus diagonal” form of
, i.e.
where and
, where
is the mean-squared differentiability of the functions in the prior.
This method naturally fits into GPflow’s interdomain and multiple dispatch framework. We can define a new type of feature, FourierFeatures, which inherits from InducingVariable. It would be mathematically possible and an immediate fit into the existing framework if we implement a new Kuu() method which computes the full (non-structured) , and register it within the multiple-dispatch framework.
This approach, however, is sub-optimal and would not exploit any of the known structure (diagonal + low rank) of . To take advantage of this, we need to route the model through more specialised code for both the KL divergence and the conditional. We let the covariance method for
return an object that captures the structure of
(in this case, a TensorFlow LinearOperator). We then register a specialised conditional method that can deal with subsequent computations in the computationally most efficient way given its access to
and
.
The covariances can be implemented closely following the equations in Hensman et al., 2017, e.g. for the Matérn 1/2 kernel: Note that different orders of Matérn kernels have different structures in their matrix, so we have to separately register one implementation for each combination of FourierFeatures inducing variable and one of the Matérn kernels. Similarly, we need to register separate implementations for
. The conditional, however, only needs to know that the solve operation is efficient. A sample implementation is given in the GPflow documentation [2019c].
6.5 Linear Model of Coregionalization
6.5.1 Multioutput inducing points
When moving to specifying approximate posteriors for multioutput kernels, we have to deal with all the possible covariance structures described in table 2. We start with the inducing point method for multioutput GPs, which was described in section 4.3.1 and applied to the LMC. This method takes all output values from M inputs as inducing variables. The single-output conditional (section 6.1) does not compute the required covariances for full_output_cov=True, so we will need define a new conditional().
To implement the predictive equation (eq. 44), Kuu() needs to compute all covariances, while Kuf() needs to compute the
covariances between u and the predictions. We let these functions simply compute and return the covariances defined in the kernel. Since this implementation of the InducingPoints inducing variable is general to all multioutput kernels, we register the implementation for all kernels that inherit from the multioutput kernel base class MultioutputKernel:
We now require an implementation of conditional() that can use these differently shaped covariances. When full_cov and full_output_cov are equal, i.e. we only care about the marginals, or we want the complete joint over both the inputs and outputs, we can rely on the base_conditional() that we used earlier after a simple reshape. For the other cases, we implement fully_correlated_conditional(). As with the implementation of the covariances, we define this conditional() to be the default for all multioutput kernels when InducingPoints are used. For this example, we can assume that P is equal to L.
In order to use this code with the LMC, all we have to do is define the covariance of the prior in an object that inherits from the multioutput base class MultioutputKernel:
6.5.2 Latent inducing points
In section 4.3.2, we discussed that placing the inducing variables in the process reduced the computational cost of the method.
The previous implementation focused on generality. Here we want to take advantage of additional structure. The IndependentLatent kernel allows inference with inducing variables Fallback {Shared|Separate}InducingVariables. These inducing variables give a block-diagonal structure in , which is represented as a
array (section 4.3.2). To take advantage of this we define a custom conditional for these combinations of inducing variables and kernels. The exact details are abstracted away in independent_interdomain_conditional(*) which takes advantage of the block-diagonal structure.
The previous implementation of LMC using the default MultioutputKernel and InducingPoints leads to a scaling of . In this section, using the specialised codepath gives
(see section 4.3.4).
7.1 Background
So far we have discussed a set of models for which all inputs are known and fixed variables: they form the features of the model. This is the typical setting for most supervised learning algorithms. In certain applications, however, the GP models have to deal with inputs which are uncertain, and typically described by a distribution
, rather than a single point. A well- known example of this type of model is the GP-LVM [Lawrence, 2004], where we are interested in learning the joint posterior of two variables: the distribution of the input locations
of the GP function and the GP mapping
itself. More recent models such as the deep Gaussian process (DGP) [Damianou and Lawrence, 2013] and the conditional density estimation model [Dutordoir et al., 2018] also require dealing with distributional or uncertain inputs.
Uncertain inputs are omnipresent in GP models and need to be accounted for in our software tool. In the following sections we discuss the optimisation bounds of the GP-LVM and DGP models, and show how they are implemented in GPflow. The focus of this section is to show that even when constructing these more complex models, we can still follow our desiderata of efficiency, modularity and extensibility. This leads to a software package that can be used to extend the functionality or modelling assumptions of DGPs by re-implementing only certain bits of the computational graph while staying inside the framework.
7.2 Examples
7.2.1 Latent Variable Model
The GP-LVM [Lawrence, 2004] is an unsupervised algorithm which learns for each of the given points in the dataset an associated input distribution
, together with a GP mapping, so that
noise. This is different from the typical regression or classifica-tion settings, where both the input and outputs are known and the only quantity to infer is the underlying mapping.
Deriving the bound for the GP-LVM is similar as in section 2.5: the approximate GP is defined by a set of inducing variables, which are learned by minimising the KL between the true and the approximate posterior. To learn the approximate posterior of the inputs in the GP-LVM we follow the same procedure, and define an approximate posterior over the uncertain inputs . Following Bayes’ rule, as in section 2.5 we obtain the lower bound
Most of the elements of this bound are already discussed in section 5.3, and stay unchanged in the case of uncertain inputs. However, the additional unknown variables and associated approximate posterior introduce two new components to the bound.
The first component is a KL term (for each datapoint) between the prior and approximate posterior of the uncertain inputs . As with the KL term on the inducing variables, given reasonable choices for the form of distributions this quantity can be computed in closed form.
The second component is an additional expectation of the data-fit term with respect to q(x). In general, solving this expectation analytically is not possible (as is the case for the inner expectation over for most likelihoods). To deal with this we resort to Monte Carlo and sample
to estimate the quantity. Using re-parameterised samples [Kingma and Welling, 2013; Rezende et al., 2014] from the approximate posterior distribution allows us to get unbiased gradient estimates of the ELBO and optimise for the parameters of q(x), simultaneously with the other parameters as before.
For certain kernels and forms of the expectation can be computed analytically by making use of the so-called kernel expectations [Damianou et al., 2016]. While this approach adds less stochasticity to the bound, it increases the computational costs and limits the set of kernels that can be used.
7.2.2 Deep Gaussian processes
A deep Gaussian process consists of multiple GPs which are hierarchically chained to form a compositional function, . At each level the input of the current function is given by the output of the previous one. The uncertain inputs in a DGP arise from the fact that the hidden state that is used as the input to the next GP is given by the output of a GP, which is itself a distribution.
Imposing an independent GP prior over each of the components and hidden states
, the joint density of a DGP can be written as
where we define . The form of
leads to different types of DGPs. The original DGP formulation added noise between the layers, setting
to
[Damianou and Lawrence, 2013]. In Salimbeni and Deisenroth [2017] a noise-less transition is assumed, leading to
.
As was the case for shallow GP models (see section 2.3 and section 4.2), we construct an approximate posterior GP by conditioning the prior GP on a set of inducing variables. The inducing variables , for which we have a set for each layer, are then integrated out with respect to their posterior
, leading to an approximate posterior of the form
where the mean and the variance are given by
The crucial difference between this approximate posterior and eq. 44 is that here the inputs to the mean and variance are the outputs of the previous layer , and thus uncertain variables. As was the case in the GP-LVM model, integrating out the inputs to account for their uncertainty is not possible due to the non-linear mapping of the inputs in the kernel. However, when appropriately propagating the uncertainty through the layers of the DGP, it is possible to compute a sample at each layer:
Using the approximate posterior for every GP in the composition, we can lower-bound the log marginal likelihood to get the ELBO for a DGP:
In this bound, compared to the shallow GP case (eq. 16), we are dealing with a variational expectation over the output of the last layer and a sum over the KL terms of the inducing variables of every layer. Evaluating the KLs in this bound is straightforward given the Gaussianity of both prior and approximate posterior of
. The required computation is identical to the one required in the shallow case. The expectation over the final layer is approximated using Monte Carlo. Following the procedure of eq. 69, we obtain a sample from
which is used to get an unbiased estimate of the variational expectation, as described in Salimbeni and Deisenroth [2017].
7.3 Implementation
Implementing bounds for models with uncertain inputs, e.g. the GP-LVM (eq. 65) and DGP (eq. 70) is not significantly harder, as most of the required components are identical to the building blocks of the SVGP model discussed in section 5.3.
The most notable difference between the standard SVGP model and models with uncertain inputs is that for the SVGP, we compute the mean and variance of at known locations
, whereas for models with uncertain inputs, we evaluate the posterior GP at samples from the source of uncertainty. This can be the distribution of the inputs
in the case of the GP-LVM or the output distribution of the previous layer
for DGPs.
To accommodate this use-case, GPflow provides a method called sample_conditional, whose role is to compute a sample from the approximate posterior q(f(x)). The method has the same signature as conditional and in most cases will use conditional to compute the mean and covariance
of the approximate posterior, so that it can construct a sample
, with
. However, for certain types of kernels and inducing variables it is possible to compute a valid sample from the posterior in a more efficient way, which is why sample_conditional itself makes use of multiple dispatch. As was the case for conditional (see section 5.3.4 and section 5.3.1), thanks to multiple dispatch, specialised code can be executed depending on the types of the arguments to the method in order to compute the sample in the most efficient way.
Modularity and extensibility The sample_conditional method computes a sample from the GP’s approximate posterior, which can then be propagated through the layers of a DGP to obtain a valid sample of the composite function. A user can alter the behaviour of a layer in a DGP by defining a new kernel and inducing variable, together with the corresponding sample_conditional. This makes GPflow easy to extend and modular, as evidenced by e.g. Blomqvist et al. [2018] and Dutordoir et al. [2020] who implemented deep convolutional (multi-output) GPs following this approach, and Salimbeni et al. [2019] who implemented a DGP with latent variable layers.
In this work we presented the current state of Gaussian process models. We reviewed current modelling assumptions such as multi-output GPs and discussed methods for efficient inference: sparse variational inference and inter-domain inducing variables. We presented this in a unified mathematical framework, starting from single-output models up to multi-output inter-domain GPs. Throughout the text it becomes clear that efficient inference in GP models relies strongly on choosing useful inducing variables in a way that can take computational advantage of independence properties of the prior. To this end, we developed a software framework in GPflow that allows for the flexible specification of both multi-output priors and inducing variables.
Many thanks to Felix Leibfried for repeatedly reading drafts and providing excellent feedback. Also thanks to Kaspar Märtens for providing feedback on early versions of the manuscript. Thanks to the TensorFlow Probability team and the GPflow community for their work and support.
M. A. Álvarez and N. D. Lawrence (2009). ‘Sparse convolved Gaussian processes for multi-output regression’. In: Advances in Neural Information Processing Systems 21 (NIPS 2008).
M. A. Álvarez, D. Luengo and N. D. Lawrence (2009). ‘Latent force models’. In: Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics.
M. A. Álvarez, D. Luengo, M. Titsias and N. D. Lawrence (2010). ‘Efficient multioutput Gaussian processes through variational inducing kernels’. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010).
M. A. Álvarez, L. Rosasco and N. D. Lawrence (2012). ‘Kernels for vector-valued functions: a review’. In: Foundations and Trends in Machine Learning.
M. Bauer, M. van der Wilk and C. E. Rasmussen (2016). ‘Understanding probabilistic sparse Gaussian process approximations’. In: Advances in Neural Information Processing Systems 29 (NIPS 2016).
D. M. Blei, A. Kucukelbir and J. D. McAuliffe (2017). ‘Variational inference: a review for statisti- cians’. In: Journal of the American Statistical Association.
K. Blomqvist, S. Kaski and M. Heinonen (2018). ‘Deep convolutional Gaussian processes’. In: arXiv:1810.03052.
P. Boyle and M. Frean (2005). ‘Dependent Gaussian processes’. In: Advances in Neural Information Processing Systems 17 (NIPS 2004).
T. D. Bui, J. Yan and R. E. Turner (2017). ‘A unifying framework for Gaussian process pseudo- point approximations using power expectation propagation’. In: Journal of Machine Learning Research.
D. Burt, C. E. Rasmussen and M. Van Der Wilk (2019). ‘Rates of convergence for sparse variational Gaussian process regression’. In: Proceedings of the 36th International Conference on Machine Learning.
L. Csató and M. Opper (2002). ‘Sparse on-line Gaussian processes’. In: Neural Computation.
A. C. Damianou and N. D. Lawrence (2013). ‘Deep Gaussian processes’. In: Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2013).
A. C. Damianou, M. K. Titsias and N. D. Lawrence (2016). ‘Variational inference for latent vari- ables and uncertain inputs in Gaussian processes’. In: Journal of Machine Learning Research.
J. Domke and D. R. Sheldon (2018). ‘Importance weighting and variational inference’. In: Advances in Neural Information Processing Systems 31.
V. Dutordoir, H. Salimbeni, J. Hensman and M. Deisenroth (2018). ‘Gaussian process conditional density estimation’. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
V. Dutordoir, M. van der Wilk, A. Artemev and J. Hensman (2020). ‘Bayesian Image Classi- fication with Deep Convolutional Gaussian Processes’. In: Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics (AISTATS 2020).
GPflow documentation (2019a). Convolutional Gaussian Processes. https://gpflow.readthedocs. io/en/develop/notebooks/advanced/convolutional.html.
GPflow documentation (2019b). Multi-output Gaussian Processes in GPflow. https://gpflow. readthedocs.io/en/develop/notebooks/advanced/multioutput.html.
— (2019c). Variational Fourier features in GPflow. https://gpflow.readthedocs.io/en/ develop/notebooks/advanced/variational_fourier_features.html.
J. Hensman, A. G. de G. Matthews Maurizio Filippone and Z. Ghahramani (2015a). ‘MCMC for variationally sparse Gaussian processes’. In: Advances in Neural Information Processing Systems 28.
J. Hensman, N. Durrande and A. Solin (2017). ‘Variational Fourier features for Gaussian processes’. In: Journal of Machine Learning Research.
J. Hensman, N. Fusi and N. D. Lawrence (2013). ‘Gaussian processes for big data’. In: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI 2013).
J. Hensman, A. G. d. G. Matthews and Z. Ghahramani (2015b). ‘Scalable variational Gaussian process classification’. In: Journal of Machine Learning Research.
D. Higdon (2002). ‘Space and space-time modeling using process convolutions’. In: Quantitative Methods for Current Environmental Issues.
M. D. Hoffman, D. M. Blei, C. Wang and J. Paisley (2013). ‘Stochastic variational inference’. In: Journal of Machine Learning Research.
A. G. Journel and C. J. Huijbregts (1978). Mining geostatistics. Academic press London.
H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy and J. Audiffren (2016). ‘Operator- valued kernels for learning from functional response data’. In: Journal of Machine Learning Research.
D. P. Kingma and M. Welling (2013). ‘Auto-encoding variational Bayes’. In: arXiv preprint arXiv:1312.6114.
N. D. Lawrence (2004). ‘Gaussian process latent variable models for visualisation of high dimen- sional data’. In: Advances in Neural Information Processing Systems 16 (NIPS 2003).
M. Lázaro-Gredilla and A. Figueiras-Vidal (2009). ‘Inter-domain Gaussian processes for sparse inference using inducing features’. In: Advances in Neural Information Processing Systems 22 (NIPS 2009).
R. C. Martin (1996). ‘The open-closed principle’. In: More C++ gems.
A. G. d. G. Matthews (2017). ‘Scalable Gaussian process inference using variational methods’. PhD thesis. University of Cambridge.
A. G. d. G. Matthews, J. Hensman, R. Turner and Z. Ghahramani (2016). ‘On sparse variational methods and the Kullback-Leibler divergence between stochastic processes’. In: Journal of Machine Learning Research.
A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani and J. Hensman (2017). ‘GPflow: a Gaussian process library using TensorFlow’. In: Journal of Machine Learning Research.
C. A. Micchelli and M. Pontil (2005). ‘On learning vector-valued functions’. In: Neural Computation.
T. P. Minka (2001). ‘Expectation propagation for approximate Bayesian inference’. In: Proceedings of the Seventeenth conference on Uncertainty in Artificial Intelligence (UAI 2001). Morgan Kaufmann Publishers Inc.
A. O’Hagan (1992). ‘Some Bayesian numerical analysis’. In: Bayesian Statistics.
J. Quiñonero-Candela and C. E. Rasmussen (2005). ‘A unifying view of sparse approximate Gaus- sian process regression’. In: Journal of Machine Learning Research.
R. Ranganath, S. Gerrish and D. M. Blei (2014). ‘Black box variational inference’. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS 2014).
C. E. Rasmussen and Z. Ghahramani (2003). ‘Bayesian Monte Carlo’. In: Advances in Neural Information Processing Systems 15 (NIPS 2002).
C. E. Rasmussen and C. K. I. Williams (2006). Gaussian processes for machine learning. 2006. The MIT Press, Cambridge, MA, USA.
D. J. Rezende, S. Mohamed and D. Wierstra (2014). ‘Stochastic backpropagation and approximate inference in deep generative models’. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014).
M. Rocklin (2014). Multiple Dispatch. https: / /multiple - dispatch .readthedocs .io /en / latest/.
H. Salimbeni and M. Deisenroth (2017). ‘Doubly stochastic variational inference for deep Gaussian processes’. In: Advances in Neural Information Processing Systems 30 (NIPS 2017).
H. Salimbeni, V. Dutordoir, J. Hensman and M. P. Deisenroth (2019). ‘Deep gaussian processes with importance-weighted variational inference’. In: Proceedings of the 36st International Conference on Machine Learning (ICML 2019).
E. Snelson and Z. Ghahramani (2006). ‘Sparse Gaussian processes using pseudo-inputs’. In: Advances in Neural Information Processing Systems 18 (NIPS 2005).
M. Titsias (2009). ‘Variational learning of inducing variables in sparse Gaussian processes’. In: Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics (AISTATS 2009).
M. Titsias and M. Lázaro-Gredilla (2014). ‘Doubly stochastic variational Bayes for non-conjugate inference’. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014).
M. van der Wilk (2019). ‘Sparse Gaussian process approximations and applications’. PhD thesis. University of Cambridge.
M. van der Wilk, C. E. Rasmussen and J. Hensman (2017). ‘Convolutional Gaussian processes’. In: Advances in Neural Information Processing Systems 30 (NIPS 2017).