Doubly Sparse Variational Gaussian Processes

2020·Arxiv

Abstract

Abstract

The use of Gaussian process models is typically limited to datasets with a few tens of thousands of observations due to their complexity and memory footprint. The two most commonly used methods to overcome this limitation are 1) the variational sparse approximation which relies on inducing points and 2) the state-space equivalent formulation of Gaussian processes which can be seen as exploiting some sparsity in the precision matrix. We propose to take the best of both worlds: we show that the inducing point framework is still valid for state space models and that it can bring further computational and memory savings. Furthermore, we provide the natural gradient formulation for the proposed variational parameterisation. Finally, this work makes it possible to use the state-space formulation inside deep Gaussian process models as illustrated in one of the experiments.

1 Introduction

Gaussian processes (GPs) provide a very powerful framework for statistical modelling in low data regimes (i.e., when the number of observations N is small). However, they typically scale as in computational complexity and in memory which makes them impractical for datasets containing more than a few thousand observations. This limitation has received a lot of attention, especially in the machine learning community (Rasmussen and Williams, 2006), and two frameworks have distinguished themselves so far.

The first approach focuses on state-space models (SSM) and exploits the underlying Markov property for computational efficiency. The Markov structure results in

Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

1900 1905 1910 1915 1920 1925 1930 1935 1940 0.5

Figure 1: Illustration of the proposed method for a regression task with a Matérn kernel. For each inducing location , we introduce three inducing variables, i.e., . Figure also shows samples from the model, where the inducing variables are represented by a quadratic Taylor expansion. For more details on the experiment settings see Section 4.1.

sparse precision matrices (Grigorievskiy et al., 2017; Durrande et al., 2019), or enables filtering algorithms (Kalman, 1960; Särkkä and Solin, 2019; Solin et al., 2018). These approaches lead to inference algorithms with O(N) complexity. Several model types can be tackled with this approach, such as Gaussian Markov random fields or linear stochastic differential equations. The second approach is the sparse variational Gaussian process (SVGP). It relies on the assumption that the dataset contains some redundant in- formation, so that the GP posterior can be approximated using a smaller set of inducing points (i.e. pseudo-inputs): with . Variational Inference (VI), which consists in minimising the Kullback–Leibler divergence between the approximate and the true posterior, is then used to find appropriate values for z, for and for the model parameters (Titsias, 2009; Hensman et al., 2013).

In this paper, we focus on GPs with 1-dimensional input and propose a novel inference method that brings together the merits of both sparse GP approximation and SSM representation: for a state-space model with state dimension d, we introduce d-dimensional inducing variables and associate each element of this vector to a state component. For example, a GP f with a Matérn covariance is Markovian if you consider the state space . Given some inducing locations we approximate the posterior , by (see Figure 1). Using the state space components as inducing features brings several advantages. First, the proposed method scales linearly both with the number of data and with the number of inducing points (whereas SVGP is quadratic in the later). Second, the number of variational parameters that need optimising is , which scales favourably compared to as respectively required by SVGP and VI for SSM (Durrande et al., 2019). Finally, our approach allows for mini-batching as well as the use of SSM layers in deep-GP models.

In developing our approach, we quickly identified that optimisation of the objective function was cumbersome using standard gradient approaches. We developed a natural gradient approach for our method based on Salimbeni et al. (2018). To do this efficiently requires the formulation of the compact exponential family form of a Markov structured Gaussian distribution, with novel mathematical and computational operators.

2 Background

Here we introduce the basics of GP models (Rasmussen and Williams, 2006) as well as the two main techniques for dealing with large datasets: sparse variational inference and inference in the state-space formulation. They both lead to sparse algorithms, albeit in a different way.

2.1 Gaussian processes

In the classic GP regression setting we are given a dataset , where corresponds to the evaluation of a latent function corrupted with observation noise task is to predict for some input location Under the assumptions that f is a Gaussian process also follows a multivariate normal distribution , the prediction problem can be solved analytically and results in a Gaussian distribution:

where the matrix K is defined as is vector such that . The matrix inverse required in the prediction yields a computational complexity of which is prohibitively expensive for large datasets.

2.2 Variational inter-domain approximations

Variational inference in probabilistic models turns inference into an optimisation problem (Jordan et al., 1999). The sparse variational Gaussian process, originally introduced by Titsias (2009) and more rigorously defined by Matthews et al. (2016), approximates the posterior distribution by a distribution q(f) that depends on ‘inducing points’:

The inducing variables u are typically assumed to be normally distributed (i.e., moments corresponding to the variational parameters.

The above approach has been generalised by introducing the idea of inter-domain features (Alvarez and Lawrence, 2009; Lázaro-Gredilla and Figueiras-Vidal, 2009) which replaces the conditioning Eq. (3) by with a linear operator . Choosing allows to recover the classic inducing points but more interesting behaviour can be obtained by using other operators such as the integrals or convolutions (Hensman et al., 2018; van der Wilk et al., 2017). To enhance readability, we denote by the vector of size M with entries and by its distribution.

The distribution of the conditioned GP is

with and . Taking the expectation of Eq. (4) under gives a closed form expression for

The variational lower bound to the log-marginal likelihood is then given by:

The evaluation of L can be shown to scale as . The linear scaling with the size of the training set allows to apply this approximation to large datasets (Hensman et al., 2013). In some cases, it is shown to approximate the posterior process with high accuracy at a low computational cost in the large data regime (Burt et al., 2019). Finally, it provides a well-defined objective, amenable to gradient-based optimisation that works well in practice (Bauer et al., 2016).

2.3 Inference in the state-space formulation

A large class of Gaussian processes defined on can be written as linear stochastic differential equations (SDEs). This class of kernels is described in depth in Chapter 12 of Särkkä and Solin (2019) and includes, for example, all Matérn (k odd), harmonic oscillators, and all sum and products of such kernels. Also, many kernels can be approximated by an element of this class, see for example Särkkä and Piché (2014) for the RBF kernel.

Here we focus on the class of GPs with stationary kernels which can be represented as a linear time-invariant (LTI) SDE of the form:

The state-space vector is given by evaluations of the process and its derivatives s(t) = is a white noise process with spectral density , are the feedback, noise effect and observation matrix of the system.

The marginal distribution of the solution of this LTISDE evaluated at any ordered set follows a discrete-time linear system:

where the state transition matrices , noise covariance matrices , and state stationary covariance matrix can be computed analytically. Denoting the matrix exponential and

By following the standard Kalman recursions (Särkkä, 2013), inference in conjugate models using the state-space formulation of the GP prior scales linearly with the number of data points and cubically with the state-space dimension, i.e., . Equally efficient approximations have been derived in the non-conjugate case (Durrande et al., 2019; Nickisch et al., 2018), by exploiting the block-tridiagonal structure of the precision (Grigorievskiy et al., 2017).

3 Doubly sparse inference

In this section we introduce the idea of combining the inter-domain inducing features with the state-space GP formulations, which results in what we dub a “doubly sparse variational GP” approximation (SVGP). Although the scope of our approach is broader, we formulate our idea under the classic GP regression setting with factorising likelihood:

3.1 State-space inducing features

We restrict our analysis to GPs with an LTI-SDE representation and we choose as our inter-domain features evaluated at ordered inducing inputs This has two immediate desirable consequences:

Property 1: The sequence of inducing states u = is Markovian and its distribution is mul- tivariate normal . This means that has a block-band structure and that the sufficient statistics are the block-tridiagonal elements (with block size d) and returns them as a vector. This property can be summarised as:

where are the natural parameters of the distribution.

Property 2: The posterior of the function evaluation at a point conditioned on the inducing variables, depends only on the closest left and right inducing states:

where the indices and correspond to the indices of the closest lower and upper neighbor of in z, i.e., . This follows directly from and the Markovian property of which ensures that . The graphical model associated with these properties is given in Figure 2. In practice, we get the statistics of in time using the state-space parameters:

where . The matrices depend on the statistics of the prior state transitions between time points triplet () and are given in Appendix A.1.

Finally, we define the marginal posterior on as , and obtain the posterior predictions analytically as

3.2 Optimal variational distribution for inducing state-space features

Figure 2: (a) Graphical model representing the Markovian joint prior of states indexed at (black arrows). (b) Graphical model for marginal posterior highlighting the statistical properties of the variational posterior: is a Markovian sharing the same structure as (red arrows); only depends on the two nearest inducing states, i.e., (blue arrows). The conditional dependencies , are shown in light blue.

We continue our analysis by investigating the form of the approximating distribution on the inducing states . By following the approach of (Op- per and Archambeau, 2009), we show that at the optimum the variational distribution has a precision with the same block-tridiagonal structure as the prior precision . More specifically, we start from the variational loss in Eq. (6) and expand the terms of the KL that depend on the posterior covariance

where contains the terms in KL that do not depend on . At the optimal covariance , the gradient of the loss w.r.t. the variational parameters is zero, i.e. , which leads to:

The term contributes only to the block-tridiagonal band of the precision. The posterior prediction for each data point only depends on the marginal covariance of the two neighboring inducing states from Eq. (14). So, each likelihood term for data contributes to the posterior precision at the location , on which depends, which is also in the band. That is, the optimal has the same sparsity pattern in the precision as . So, we choose to parameterise , by its mean and by a lower triangular matrix with two blockdiagonals which can be interpreted as the Cholesky factor of the block-tridiagonal . This parameterisation results in a storage footprint instead of had we chosen the more general mean, covariance parameterisation.

3.3 Doubly sparse variational GP inference:

We denote by SVGP the algorithm that performs sparse variational inference with inducing state-space features while restricting the variational distribution to be in the class of multivariate normal distributions with block-tridiagonal banded precisions. Graphical models summarising the corresponding prior and approximate posterior assumptions are given in Figure 2.

SVGP has the following computational advantages: (1) both the KL divergence and the pairwise marginal posterior predictions of contiguous inducing states (a pairwise Kalman-like smoothing of ) can be evaluated in ; (2) Given these pairwise marginals on inducing states, the marginal posterior predictions of function evaluations at the data points evaluated in parallel in . Overall, the evaluation of the variational loss (and of its gradient) has complexity , which compares favourably against alternative variational methods as shown in Table 1. For efficient implementation of the variational loss and the gradients based on banded precision parameterised Gaussian distributions we refer to (Durrande et al., 2019).

SVGP further inherits two properties of the SVGP approximation. First, marginal posterior predictions for each data point can be computed independently which allows to perform stochastic optimisation of an estimator of the loss evaluated on random minibatches of size of the data, reducing the complexity of an evaluation of the objective to . Second because the inducing inputs z are decoupled from the data inputs VGP can be used as a GP -layer in a deep (compositional) architecture as in (Sal- imbeni and Deisenroth, 2017). Both properties are not available to alternative state-space approaches to GP models.

3.4 Natural gradient updates

To learn the variational parameters and the parameters of the model we resort to gradient based optimisation. The gradient of the objective w.r.t. a parameter is defined as subject to constraint . Intuitively, it is the direction of steepest descent with respect to the Euclidean norm of , that maximally reduces the loss.

Table 1: Complexity and capabilities of variational inference algorithms for GP regression.

However, the Euclidean norm can be deceiving when optimising over distributions: small changes in parameters can induce large changes in the distributions, and changing the parameterisation of the distribution typically leads to different optimisation performance. Natural gradients solve this problems by substituting the constraint on the Euclidean norm by (Amari, 1998). Such a constraint can be shown to induce a quadratic norm in the parameter space with curvature given by the Fisher information matrix . The direction of steepest descent w.r.t. this norm is given by

Conveniently, for distributions in the exponential family, the Fisher matrix takes a rather simple form for any parameterisation (Malagò and Pistone, 2015):

where are the natural parameters, the expectation parameters and the parameterisation of our choice. For the variational distribution defined in Section 3.2 these parameters are equivalent to:

It is clear that the banded structure of the sufficient statistics is reflected in both natural and expectation parameters. This allows us to derive efficient updates as in (Salimbeni et al., 2018) using the banded operators introduced by Durrande et al. (2019):

There is one caveat: the covariance term is a full matrix, but for an efficient update we need to transfer between using only the elements in the band. This requires a novel operator (the reverse of in (Durrande et al., 2019, Sec. 4.1)), which maps the band of a covariance to the Cholesky factor of its banded precision. Detailed transformations between , along with the algorithm for the operator can be found in the supplementary materials.

In practice, the natural gradient update is recommended when the likelihood is Gaussian since it en- sures convergence in very few steps (typically one step, see (Salimbeni et al., 2018)). We also observed that it performs well in the non-conjugate case, especially in the first iterations of the optimisation where it can very quickly move to areas of interest. These two properties are illustrated on simple examples in Appendix C.1.

4 Experiments

4.1 Solar irradiance

The aim of this first experiment is to visually illustrate the predictive power of the proposed methodology on a simple GP regression example. We consider here the solar irradiance datasetand choose the model is a centered GP with Matérn covariance and is i.i.d. . We compare three (approximate) posteriors for this model: (a) the classic SVGP, (b) the proposed SVGP, and (c) the exact GPR posterior. For (a) and (b), a grid of 60 inducing locations is used, and the kernel parameters, the noise variance and the variational parameters are estimated by maximising the ELBO. For (c) the kernel parameters and the noise variance are estimated by maximising the model log-marginal likelihood.

As shown in Figure 3, the proposed method is much more accurate than SVGP on this example, and its predictions are extremely similar to the exact GPR model. One the one hand, this experimental setting may be seen to the advantage of our method since having the same number of inducing inputs means that there are three times more inducing variables in the SVGP model. On the other, the computational burden is much smaller for SVGP and this added flexibility actually comes with a reduction of the computational cost. This calls for a more thorough investigation that explores how the ELBO of SVGP and SVGP compare with respect to the numbers of inducing variables, variational parameters and the execution time. This is what we do in the next section on a larger dataset.

4.2 Conjugate regression on time series

In this section we illustrate the computational and storage savings our method entails against the classical

Figure 3: Model comparison on the solar dataset. The SVGP (panel a) fails at capturing the high frequency variations, whereas the proposed SVGP (panel b) results in predictions that are extremely similar to the ones obtained with exact inference (panel c).

SVGP algorithm on a conjugate regression problem (see Appendix D for a non-conjugate example). The data consists of an uttered vowel from a female speaker (Hillenbrand et al., 1995, file w01ae.wav) of length N = 4879, sampled at 16kHz. A vowel is a typical quasi-periodic signal where the pitch (or fundamental frequency) is fixed but the repeated pattern varies through time.

We encode our assumptions about this sound by constructing quasi-periodic kernels. Such kernels can be obtained as the product of a periodic kernel and a Matérn kernel whose lengthscale controls the rate of change of the periodic pattern (Solin and Särkkä, 2014). We construct periodic kernels of varying complexity as weighted sums of cosine kernels in harmonic ratio of frequencies where is the fundamental frequency and the magnitude of each harmonic component. Each harmonic increases the state dimension by 2, so the resulting kernel has a state di- mension of d = 2J.

We then perform approximate inference in settings where we both vary the model complexity (using 1 to 4 harmonics) and the flexibility of the variational distribution by increasing M in powers of 2 (from 16 to 512). For each setting, we optimise the variational parameters and use the divergence as a performance metric. We also compare the execution time of the evaluation of the gradient of L with respect to the variational parameters. We use similar implementations of SVGP and SVGP as in the previous section.

Results are displayed in Figure 4. As a function of the number of inducing points M, the KL decreases much faster for SVGP (a). This is because each inducing state contains more information about the process than an inducing evaluation. Both methods reduce the KL following a similar trend, with a small advantage for SVGP when compared against the actual number of inducing variables (b). However, as summarised in Table 1, storage and computational complexities of SVGP grow linearly with M but are respectively quadratic and cubic in M for SVGP. Strikingly, the scaling for the gradient evaluation means the cost of adding extra inducing points in a SVGP model is independent of the size of the dataset. For large N, it is thus possible to increase the number of inducing variable with very little impact on the computational time since the latter is dominated by the term as demonstrated by the vertical lines in (d). Note that this is not the case for SVGP which scales as

A visual illustration of an inference is given in Figure 5 (N = 4541, M = 318), where we have also removed part of the signal to demonstrate the out-of-sample predictive ability of SVGP. Our method correctly interpolates between the periodic patterns at both ends of the missing data region with highest predictive uncertainty in the middle of this region.

4.3 Additive regression

This experiment illustrates three core capabilities of the proposed SVGP algorithm: (i) it can deal with large datasets; (ii) it allows minibatching with rather large batch-size; (iii) it is not restricted to problems with 1-dimensional inputs. We also show that the proposed method is competitive in terms of performance.

The airline delay dataset consists of flight details (route distance, airtime, aircraft age, etc.) for every commercial flight in the USA for the year 2008. We use the same c = 8 covariates x as in (Hensman et al., 2013) to predict the delay y of the aircraft at landing.

We perform regression from under the modelling assumption that the delay is additive, i.e. f(x) = , where . We set GP priors over each function , where are Matérn kernels. We propose a mean-field approximation to the posterior over processes where each process is approximated using our Sparameterisation. Details are given in Appendix C.3.

Figure 4: Comparison of sparse variational GP regression using classic inducing points and state-space features on an audio time series (see Fig. 5). Each line corresponds to a given inference method (line style) and model complexity (color) and shows the KL divergence for a an increasing number of inducing points. The proposed method is more accurate for a given number of inducing points, requires less variational parameters to reach a good accuracy, and is extremely fast (especially for large M).

Figure 5: Vowel waveform with missing data and fit using SVGP. Black dots are the data. Blue line and shaded area correspond to the posterior predictions. Red vertical lines are the locations of the inducing points.

We compare the MSE obtained with two different optimisation schemes: the first one optimises all parameters with Adam (Kingma and Ba, 2015), while the second uses natural gradient for the variational parameters and Adam for the remaining ones. All learning rates are set to a constant: and . Given the size of the dataset, we use minibatches of 10k points for the two optimisers when

As illustrated in Table 2, natural gradient provides the best performances when the number of observations is small, but as expected it suffers from the minibatching on larger datasets. On the other hand, Adam performs similarly, even when it only has access to sub-samples of the data. The proposed approach has similar accuracy to the state of the art (Hensman et al., 2018). A graphical version of the results in this table is given in Appendix (Figure 9).

4.4 Time warping with deep GPs

In this section we demonstrate the ability of SVGP to perform variational inference in a deep-GP model (Damianou and Lawrence, 2013), where inference is performed by following the approach presented in Salimbeni and Deisenroth (2017). We consider the problem of data alignment and focus on reproducing the results from (Kaiser et al. (2018, Sec. 4.1), see Appendix C.4). The dataset consists of two times series generated by a three layer model and further corrupted by additive Gaussian noise so that . The function f is a sine wave shared across the two observed series indexed by k = {1, 2}. The functions are time-warping functions, with being the identity and a quadratic function; while are output distortions applied to f, with the hyperbolic tangent and the identity. Parts of the observed sequences have been removed at different locations to assess the generalisation performance of the model. Compared to the setting of Kaiser et al. (2018), we double both the frequency of the true f and the number of observations.

The goal here is to infer all five functions under the true model structure. We place GP priors on all five functions with Matérn kernels. We fur-

Table 2: Predictive mean squared errors (MSEs) and negative log predictive densities (NLPDs) with one standard deviation on the airline arrival delays experiment.

Figure 6: Data alignment with SVGP layers. The top three panels show samples from the inferred functions; the bottom two show the observed data y (coloured dots) and posterior predictions for the whole sequence for each output. Black circles indicate missing data.

ther introduce linear mean functions on all priors to avoid pathologies while propagating samples through the layers (Salimbeni and Deisenroth, 2017). We use M = [50, 100, 50] inducing points at each layer with the inducing inputs placed on a linear grid and kept fixed (all functions within a layer share the inducing input locations). We use natural gradients to learn the variational parameters of the approximate distributions in each layer, with the learning rate set to . The hyper-parameters of the kernel, the mean functions and the likelihood noise are learnt using Adam (Kingma and Ba, 2015) with exponential decaying learning rate (initialised from

Results of the inference are shown in Figure 6. The inferred functions (first and third row) share the same characteristics as the corresponding ground truth functions. The function f is also recovered with the correct dampening (second row). More interestingly, we see that due to the time warping of the first layer, the model has successfully learnt to reconstruct each observed sequence from different sections of f. Finally, in the lower two panels we report the model predictions highlighting how the learnt model in rightfully more uncertain in the missing data region. Uncertainty decomposition across layers is discussed in Appendix C.4.

5 Discussion and conclusions

The proposed doubly sparse variational Gaussian processes combines the variational sparse approximations for Gaussian processes with the state-space representation of the process. It inherits its appealing tractability from the variational approach, but has the representative power and computational scalability of state-space representations. Unlike other state-space GP methods, it is readily applicable to deep GP settings and supports mini-batch stochastic training. We showed that our framework can be used to approximate functions with more than one input variable while preserving the computational gain of state-space models.

To ease the optimisation of our variational objective, we derived natural gradient updates for the class of multivariate normal distribution with banded precisions. Although the objective is non-convex, this leads to few shot inference in the conjugate setting and empirically improves optimisation in non-conjugate settings. To further improve the applicability of SVGP, differ-ent sub-optimal variational parameterisations could be used for leading to better behaved objectives or additional scalability improvement, albeit at the cost of reduced expressivity. Another route of improvement could consist in making a further steady-state approximation to the posterior to reduce the scaling with the GP state dimension from cubic to quadratic as in (Solin et al., 2018). As in (Nickisch et al., 2018), further computational gains could be achieved by using interpolations when we compute the SSM parameters.

References

Mauricio Alvarez and Neil D. Lawrence. Sparse con- volved Gaussian processes for multi-output regression. In Advances in Neural Information Processing

Systems, pages 57–64, 2009.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.

Matthias Bauer, Mark van der Wilk, and Carl E. Ras- mussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural

Information Processing Systems, pages 1533–1541, 2016.

David R. Burt, Carl E. Rasmussen, and Mark van der Wilk. Rates of convergence for sparse variational Gaussian process regression. In International

Conference on Machine Learning, 2019.

Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.

Nicolas Durrande, Vincent Adam, Lucas Bordeaux, Ste- fanos Eleftheriadis, and James Hensman. Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In Artificial Intelligence

and Statistics, pages 2780–2789, 2019.

GPflow. VGP model in gpflow. https://github.com/ GPflow/GPflow/blob/develop/gpflow/model.

Alexander Grigorievskiy, Neil D. Lawrence, and Simo Särkkä. Parallelizable sparse inverse formulation Gaussian processes (SpInGP). In International Workshop on Machine Learning for Signal Processing, pages 1–6, 2017.

James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282–290, 2013.

James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–

52, 2018.

James Hillenbrand, Laura A. Getty, Michael J. Clark, and Kimberlee Wheeler. American English vowels dataset, 1995.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine

learning, 37(2):183–233, 1999.

Markus Kaiser, Clemens Otte, Thomas Runkler, and Carl Henrik Ek. Bayesian alignments of warped multi-output Gaussian processes. In Advances in Neural Information Processing Systems, pages 6995–

Rudolph E. Kalman. A new approach to linear fil- tering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82:35–45, 1960.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International

Conference on Learning Representations, 2015.

Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In Neural Information

Processing Systems, 2009.

Luigi Malagò and Giovanni Pistone. Information geom- etry of the Gaussian distribution in view of stochastic optimization. In Foundations of Genetic Algorithms,

pages 150–162, 2015.

Alexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Journal of Machine Learning Research, 51:231–239, 2016.

Hannes Nickisch, Arno Solin, and Alexander Grig- orevskiy. State space Gaussian processes with nonGaussian likelihood. In International Conference on Machine Learning, pages 3789–3798, 2018.

Manfred Opper and Cédric Archambeau. The vari- ational Gaussian approximation revisited. Neural

computation, 21(3):786–792, 2009.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The MIT Press, Cambridge, MA, USA, 2006.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information

Processing Systems, pages 4588–4599, 2017.

Hugh Salimbeni, Stefanos Eleftheriadis, and James Hensman. Natural gradients in practice: Nonconjugate variational inference in Gaussian process models. In Artificial Intelligence and Statistics,

pages 689–697, 2018.

Simo Särkkä. Bayesian filtering and smoothing, volume 3. Cambridge University Press, 2013.

Simo Särkkä and Robert Piché. On convergence and accuracy of state-space approximations of squared exponential covariance functions. In 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2014.

Simo Särkkä and Arno Solin. Applied stochastic

differential equations. Cambridge University Press, 2019.

Arno Solin and Simo Särkkä. Explicit link between periodic covariance functions and state space models.

In Artificial Intelligence and Statistics, pages 904– 912, 2014.

Arno Solin, James Hensman, and Richard E. Turner. Infinite-horizon Gaussian processes. In Advances in Neural Information Processing Systems, pages 3486–

3495, 2018.

Michalis Titsias. Variational learning of inducing vari- ables in sparse Gaussian processes. In Artificial

Intelligence and Statistics, pages 567–574, 2009.

Mark van der Wilk, Carl E. Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems,

Appendices

A Details on the doubly sparse variational Gaussian process

A.1 Conditional for Markovian Gaussian processes

We consider a stationary Markovian GP with state dimension d and denote by its evaluation on the triplet . We here detail the derivation of

Derivation from the joint precision

with

and P

Derivation from the joint covariance

Another derivation using the covariance approach. One can write down the joint density

with

and get

Both implementations reveal the overall scaling of the conditional statistics.

A.2 Sampling from the variational posterior process

We here describe a method to jointly sample from the posterior . Such a sample can be obtained by first sampling from the prior process at inputs [x, z]:

Then sampling from the marginal posterior at z:

And finally construct:

which is a sample from the marginal posterior q(s(x)).

This methods allows to generate samples in complexity , which is the time required to jointly sample . It was used to produce the posterior samples in the deep- GP experiment.

B Multivariate Gaussian distributions with banded precision: parameterisations, link functions and natural gradients

Here we present different parameterisations of a multivariate Gaussian distribution with block- tridiagonal precision matrices along with the link functions between them and describe how we use these to compute natural gradient updates of our variational objective.

B.1 Distribution parameterisations and link functions

In Section 3.2 we have defined the variational distribution approximating the posterior on inducing states to be

where denotes the precision matrix with block-tridiagonal structure and the Cholesky factor of the precision. We denote with the above parameterisation. In the following table we present the identities that allow us to transfer back and forth from the default parameterisation to the natural parameters and to the expectation parameters

Table 3: Transformations between the different parameterisations of the Gaussian distribution with block-tridiagonal precision.

Note that the expectation parameter is a full matrix since it involves the inverse of a banded matrix, which is not necessarily banded. However, since the sufficient statistics of the distribution are , we only need to compute the elements in the band.

B.2 Natural gradient update

When maximising our objective with respect to , the parameters of our variational distribution, we perform a sequence of natural gradient updates:

In an exponential family with natural parameters and expectation parameters , the Fisher information is given by

which is a Jacobian-vector product allowing for an efficient implementation using automatic differentiation libraries. The computation of is achieved using the chain rule:

B.3 Inverse of the subset inverse, and its reverse mode derivatives

A banded positive semi-definite (PSD) matrix Q has a Cholesky factor that is lower triangular with the same lower bandwidth. However, its inverse is in most cases dense. If one is only interested in computing the entries of that are located in the band of Q (denoted by band), efficient subset inverse algorithms are available (Durrande et al., 2019). We are now interested in the mathematical inverse of this operation, which we call reverse to avoid confusion with the matrix inverse operation.

We consider symmetric matrices with lower bandwidth r

Such a matrix has banded Cholesky factor

Computing the subset inverse is (Takahashi, 1973). We present below an algorithm that performs the

reverse subset inverse operation with complexity

B.3.1 Derivation of the forward evaluation

The following derivation shows how to compute, for each index i, the column given the sub-block independently.

To simplify notations, we introduce the following intervals denote by C the full covariance

First, we have that the sub-covariance only depends on

Second, we have that the following expression for sub-covariance

Using, the matrix inversion lemma, we get the following expression for

By construction, the first column of is out of the matrix band (it is a null vector). Because is lower triangular, the product also has a null vector as its first column, and so has the last term in Eq. 29.

Therefore, keeping only the first column, we end up with the identity

which is a system of r equations with , that we can solve analytically getting first then

This derivation is summarised in Algorithm 1:

B.3.2 Derivation of the reverse mode differentiation

We manually derive the reverse mode differentiation of the reverse inverse subset algorithm introduced in the previous section.

We refer the reader to Giles (2008) for a brief introduction to reverse mode differentiation and to Durrande et al. (2019) for derivations of reverse mode differentiation of the subset inverse algorithm for banded matrices.

In a nutshell, in a chain with A, B matrices, the reverse mode derivative of operation of is the operation propagating the reverse mode sensitivity into sensitivity . Given the differential identity and the general differential relation at follows that , therefore we identify

Algorithm 1 being parallel, we can compute the contribution of each column of L to C separately. First we relate

We identify

Then we relate

And we identify

Putting everything together, we have

Algorithm 2 summarises the derivations of the reverse mode sensitivity reverse subset inverse operation.

B.4 Fast implementation

In Algorithms 1 and 2, we computed the independently for each i. This can be achieved by first computing the Cholesky factors of each and then solving . The direct Cholesky factorization of all would incur a total complexity of . However one can use a recursive algorithm achieving the same goal in complexity

Given the Cholesky factor , we can compute the Cholesky factor of as follows:

with

The Cholesky factor is then readily obtained by removing the last row and column of the matrix in Eq. 34. The bottom right entry of the matrix in Eq. 34 corresponds to a Cholesky downdate that can be computed with cost (Seeger, 2004), so that, starting from index n, all Cholesky factors can be computed with complexity

C Experimental details

C.1 Illustration of the efficiency of the natural gradient update

We describe in this section a simple experiment to compare the behaviour of various optimisers. Given a regular grid X of points equally spaced on [0, 1] and a centred GP f with a Matérn 3/2 covariance (unit variance and length-scale ), we generate two datasets at random as follow:

Note that the first model has a conjugate likelihood whereas the second one does not. We can then fit SVGP models with 50 inducing points (fixed to a regular grid on [0, 1]). We show in Figure 7 the optimisations traces we obtained when optimising the variational parameters (all other model parameters being fixed to their nominal values), for different samples of the datasets. It can be seen that natural gradients provide a striking advantage in the conjugate case but that it also behaves favourably, especially in the first few iterations, in the non-conjugate case.

Figure 7: Optimisation traces for the ELBO of our SVGP method. Three optimisers (Natural gradients, Adam and LBFGS) are compared on 10 datasets that are generated at random. The left pannel correspond to a dataset and a model with a conjugate likelihood, whereas the right one is for a non conjugate likelihood.

C.2 Details on Section 4.2: conjugate regression on time-series

The full sound waveform used in this experiment is shown in Figure 8(top).

To model this signal, we used the following stationary kernels that have an equivalent SDE representations of state-dimension 2J, where J is the number of harmonic components:

where denotes the fundamental frequency of the pitched sound. The variance of the Matérnkernel was set to one and its only free parameter is its length-scale. We used a Gaussian likelihood with variance . Since

Figure 8: Audio time-series used in Section 4.2

we focused on inference, all parameters were initially fitted to the data by maximising the marginal likelihood available in closed form in this conjugate setting.

We increased the number of inducing points in powers of 2, placing them on an homogenous grid from the start to the end of the time support. Inducing points locations were not learned.

For both SVGP and SVGP, we only learned the variational parameters. We used LBFGS for both SVGP and S

C.3 Details on Section 4.3: additive regression

The generative model is as follow:

We propose a mean-field approximation to the posterior over processes as in (Adam et al., 2016). where each process is approximated using our doubly sparse parameterisation with inducing states evaluated at component specific inputs

Figure 9: Comparison of predictive MSE on the airline delays dataset when training SVGP with various optimisers (the distribution of errors is across the 10 splits).

C.4 Details on Section 4.4: time warping with deep Gaussian process

Setting

The function f is a sine wave of the form and is shared across the two observed series. The functions are time-warping functions, with and . The functions are output distortions with . To generate the two time series we uniformly sample 1000 points in [0, 1] and subsequently pass them to the 2-layer model that we described. The Gaussian additive noise has standard deviation . We removed observations in the intervals [0.55, 0.6] & [0.85, 0.9] for the first time series and in the interval of [0.40, 0.50] for the second time series.

Uncertainty decomposition across layers in the deep GP experiment

As can be seen in Figure 6, there is almost no uncertainty in the first layer. The reason for this is two fold. First, we initialised the kernels on the first layer to have very long lengthscales (initial value of 4), and small variance (initial value of 0.01) to bias the inference towards smooth functions. Second, the low posterior uncertainty in the intermediate layers is a known consequence of variational approach used in Salimbeni and Deisenroth (2017), where the variational distribution factorises across layers (we use the same approximation). This pathology has been recently explored in Ustyuzhaninov et al. (2019) and can be remedied by explicitly imposing a conditional dependency between the layers in the variational distribution. We conducted the same experiment with a higher observation noise level (instead of ) and report the result in Figure 10. Changing the noise level has no effect to the uncertainty in the intermediate layers but has a significant effect in the model’s ability to learn the correct functions, as the model prefers to explain these attributes by measurement noise.

Figure 10: Data alignment with SVGP layers. Left: original experiment from Section 4.4 with same experiment with

D Empirical comparison to alternative SSM based approximate inference methods

We empirically compare our SVGP approach to alternative methods to perform approximate inference in GP models based on their state space representation.

We consider a simple classification task similar from the GPMLv4.2 toolbox demo gpml-matlab-master/doc/demoState.m with N=5000 (see Figure 11) and using Maternkernel with fixed hyperparameters. For the SVGP method, we choose M = 50 inducing points on a homogenous grid and we use gradient based optimisation (L-BFGS). We run inference and report the NLPDs and execution time:

For SVGP, we run gradient based optimisation (using L-BFGS). We are equally accurate as the other methods yet much faster.

Figure 11: Classification data (N=5000) and SVGP fit. The ground truth used to generate the data is shown in blue. Blue dots represent the binary data (with additional noise introduced for visibility). The posterior process is shown in red. Inducing point locations are shown in green.

References