b

DiscoverSearch
About
My stuff
Doubly Sparse Variational Gaussian Processes
2020·arXiv
Abstract
Abstract

The use of Gaussian process models is typically limited to datasets with a few tens of thousands of observations due to their complexity and memory footprint. The two most commonly used methods to overcome this limitation are 1) the variational sparse approximation which relies on inducing points and 2) the state-space equivalent formulation of Gaussian processes which can be seen as exploiting some sparsity in the precision matrix. We propose to take the best of both worlds: we show that the inducing point framework is still valid for state space models and that it can bring further computational and memory savings. Furthermore, we provide the natural gradient formulation for the proposed variational parameterisation. Finally, this work makes it possible to use the state-space formulation inside deep Gaussian process models as illustrated in one of the experiments.

Gaussian processes (GPs) provide a very powerful framework for statistical modelling in low data regimes (i.e., when the number of observations N is small). However, they typically scale as  O(N 3)in computational complexity and  O(N 2)in memory which makes them impractical for datasets containing more than a few thousand observations. This limitation has received a lot of attention, especially in the machine learning community (Rasmussen and Williams, 2006), and two frameworks have distinguished themselves so far.

The first approach focuses on state-space models (SSM) and exploits the underlying Markov property for computational efficiency. The Markov structure results in

Proceedings of the 23rdInternational Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

image

1900 1905 1910 1915 1920 1925 1930 1935 1940 0.5

image

Figure 1: Illustration of the proposed method for a regression task with a Matérn 5/2kernel. For each inducing location  zi, we introduce three inducing variables, i.e.,  f(zi), f ′(zi) and f ′′(zi). Figure also shows samples from the model, where the inducing variables are represented by a quadratic Taylor expansion. For more details on the experiment settings see Section 4.1.

sparse precision matrices (Grigorievskiy et al., 2017; Durrande et al., 2019), or enables filtering algorithms (Kalman, 1960; Särkkä and Solin, 2019; Solin et al., 2018). These approaches lead to inference algorithms with O(N) complexity. Several model types can be tackled with this approach, such as Gaussian Markov random fields or linear stochastic differential equations. The second approach is the sparse variational Gaussian process (SVGP). It relies on the assumption that the dataset  D = {xi, yi}Ni=1contains some redundant in- formation, so that the GP posterior  f|{f(xi) = yi}Ni=1can be approximated using a smaller set of inducing points (i.e. pseudo-inputs):  f|{f(zi) = ui}Mi=1with M ≪ N. Variational Inference (VI), which consists in minimising the Kullback–Leibler divergence between the approximate and the true posterior, is then used to find appropriate values for z, for and for the model parameters (Titsias, 2009; Hensman et al., 2013).

In this paper, we focus on GPs with 1-dimensional input and propose a novel inference method that brings together the merits of both sparse GP approximation and SSM representation: for a state-space model with state dimension d, we introduce d-dimensional inducing variables and associate each element of this vector to a state component. For example, a GP f with a Matérn 5/2covariance is Markovian if you consider the state space (f(x), f ′(x), f ′′(x)). Given some inducing locations  zi,we approximate the posterior  f|{f(xi) = yi}Ni=1, by f|{f(zj) = uj,1, f ′(zj) = uj,2, f ′′(zj) = uj,2}Mj=1(see Figure 1). Using the state space components as inducing features brings several advantages. First, the proposed method scales linearly both with the number of data and with the number of inducing points (whereas SVGP is quadratic in the later). Second, the number of variational parameters that need optimising is  O(Md2), which scales favourably compared to O(M 2) and O(Nd2)as respectively required by SVGP and VI for SSM (Durrande et al., 2019). Finally, our approach allows for mini-batching as well as the use of SSM layers in deep-GP models.

In developing our approach, we quickly identified that optimisation of the objective function was cumbersome using standard gradient approaches. We developed a natural gradient approach for our method based on Salimbeni et al. (2018). To do this efficiently requires the formulation of the compact exponential family form of a Markov structured Gaussian distribution, with novel mathematical and computational operators.

Here we introduce the basics of GP models (Rasmussen and Williams, 2006) as well as the two main techniques for dealing with large datasets: sparse variational inference and inference in the state-space formulation. They both lead to sparse algorithms, albeit in a different way.

2.1 Gaussian processes

In the classic GP regression setting we are given a dataset  D = {xn, yn}n=1...N ∈ (X, R)N, where  yncorresponds to the evaluation of a latent function corrupted with observation noise  yn = f(xn) + ϵn and thetask is to predict  f(x∗)for some input location  x∗ ∈ X.Under the assumptions that f is a Gaussian process f(·) ∼ GP(0, k(·, ·)) and that ϵalso follows a multivariate normal distribution  ϵ ∼ N(0, σ2I), the prediction problem can be solved analytically and results in a Gaussian distribution:

image

where the matrix K is defined as  Kij = k(xi, xj) andk∗is vector such that  k∗i = k(x∗, xi). The matrix inverse required in the prediction yields a computational complexity of  O(N 3)which is prohibitively expensive for large datasets.

2.2 Variational inter-domain approximations

Variational inference in probabilistic models turns inference into an optimisation problem (Jordan et al., 1999). The sparse variational Gaussian process, originally introduced by Titsias (2009) and more rigorously defined by Matthews et al. (2016), approximates the posterior distribution  p(f(·)|D)by a distribution q(f) that depends on ‘inducing points’:

image

The inducing variables u are typically assumed to be normally distributed (i.e.,  qu = N(µu, Σuu)) with themoments corresponding to the variational parameters.

The above approach has been generalised by introducing the idea of inter-domain features (Alvarez and Lawrence, 2009; Lázaro-Gredilla and Figueiras-Vidal, 2009) which replaces the conditioning  {f(zi) = ui} inEq. (3) by  {Ψi[f(·)] = ui}with  Ψa linear operator RX → R. Choosing  Ψi : f → f(zi)allows to recover the classic inducing points but more interesting behaviour can be obtained by using other operators such as the integrals or convolutions (Hensman et al., 2018; van der Wilk et al., 2017). To enhance readability, we denote by  Ψ[f]the vector of size M with entries  Ψi[f],and by  pΨits distribution.

The distribution of the conditioned GP is

image

with  (kΨ(x))i = cov(f(x), Ψi[f])and  (KΨΨ)i,j =cov(Ψi[f], Ψj[f]). Taking the expectation of Eq. (4) under  qu(u)gives a closed form expression for  q(·):

image

The variational lower bound to the log-marginal likelihood is then given by:

image

The evaluation of L can be shown to scale as  O(NM 2 +M 3). The linear scaling with the size of the training set allows to apply this approximation to large datasets (Hensman et al., 2013). In some cases, it is shown to approximate the posterior process with high accuracy at a low computational cost in the large data regime (Burt et al., 2019). Finally, it provides a well-defined objective, amenable to gradient-based optimisation that works well in practice (Bauer et al., 2016).

2.3 Inference in the state-space formulation

A large class of Gaussian processes defined on  X ⊆ Rcan be written as linear stochastic differential equations (SDEs). This class of kernels is described in depth in Chapter 12 of Särkkä and Solin (2019) and includes, for example, all Matérn k/2(k odd), harmonic oscillators, and all sum and products of such kernels. Also, many kernels can be approximated by an element of this class, see for example Särkkä and Piché (2014) for the RBF kernel.

Here we focus on the class of GPs with stationary kernels which can be represented as a linear time-invariant (LTI) SDE of the form:

image

The state-space vector  s(t) ∈ Rdis given by evaluations of the process and its derivatives s(t) = [f(t), f (1)(t), . . . , f (d−1)(t)]⊤. ε(t) ∈ Rris a white noise process with spectral density  Qc. F ∈ Rd×d, L ∈ Rd×r, H ∈ R1×dare the feedback, noise effect and observation matrix of the system.

The marginal distribution of the solution of this LTISDE evaluated at any ordered set  x = [x1, . . . , xN]⊤ ∈RN follows a discrete-time linear system:

image

where the state transition matrices  An,n+1 ∈ Rd×d, noise covariance matrices  Qn,n+1 ∈ Rd×d, and state stationary covariance matrix  P0can be computed analytically. Denoting  Φthe matrix exponential and

image

By following the standard Kalman recursions (Särkkä, 2013), inference in conjugate models using the state-space formulation of the GP prior scales linearly with the number of data points and cubically with the state-space dimension, i.e.,  O(Nd3). Equally efficient approximations have been derived in the non-conjugate case (Durrande et al., 2019; Nickisch et al., 2018), by exploiting the block-tridiagonal structure of the precision (Grigorievskiy et al., 2017).

In this section we introduce the idea of combining the inter-domain inducing features with the state-space GP formulations, which results in what we dub a “doubly sparse variational GP” approximation (S2VGP). Although the scope of our approach is broader, we formulate our idea under the classic GP regression setting with factorising likelihood: p(y, f(·)|x) =p(f(·)) �n p(yn|f(xn)).

3.1 State-space inducing features

We restrict our analysis to GPs with an LTI-SDE representation and we choose  Ψi : f → s(zi) =[f(zi), . . . , f (d−1)(zi)]⊤as our inter-domain features evaluated at ordered inducing inputs  z = [z1, . . . , zM]⊤.This has two immediate desirable consequences:

Property 1: The sequence of inducing states u = {Ψi[f]}Mi=1is Markovian and its distribution is mul- tivariate normal  pΨ = N(0, Q−1Ψ ). This means that QΨhas a block-band structure and that the sufficient statistics are  t(u) = [u, btd[uu⊤]], where btd extractsthe block-tridiagonal elements (with block size d) and returns them as a vector. This property can be summarised as:

image

where  θ = [QΨµΨ, −1/2 btd[QΨ]]are the natural parameters of the distribution.

Property 2: The posterior of the function evaluation at a point  xnconditioned on the inducing variables, depends only on the closest left and right inducing states:

image

where the indices  n− ∈ �1, M − 1�and  n+ = n− + 1correspond to the indices of the closest lower and upper neighbor of  xnin z, i.e.,  z1 < · · · < zn− <xn < zn+ < · · · < zM. This follows directly from f(xn) = Hs(xn)and the Markovian property of {u1, . . . , un−, s(xn), un+, . . . , uM}which ensures that p(s(xn)|u) = p(s(xn)|un−, un+). The graphical model associated with these properties is given in Figure 2. In practice, we get the statistics of  p(s(xn)|un−, un+)in time  O(d3)using the state-space parameters:

image

where  vn = [un−; un+]. The matrices  Pn, Tndepend on the statistics of the prior state transitions between time points triplet (zn−, xn, zn+) and are given in Appendix A.1.

Finally, we define the marginal posterior on  vnas q(vn) = N(µvn, Σvnvn), and obtain the posterior predictions analytically as

image

3.2 Optimal variational distribution for inducing state-space features

image

Figure 2: (a) Graphical model representing the Markovian joint prior of states indexed at  {xn, zm}(black arrows). (b) Graphical model for marginal posterior q(s2, u) = p(s2|u)q(u)highlighting the statistical properties of the variational posterior:  (1) qu(u)is a Markovian sharing the same structure as  pΨ(u)(red arrows); (2) s2only depends on the two nearest inducing states, i.e.,  p(s2|u) = p(s2|u1, u2)(blue arrows). The conditional dependencies  p(si|u), are shown in light blue.

We continue our analysis by investigating the form of the approximating distribution on the inducing states qu = N(µu, Σuu). By following the approach of (Op- per and Archambeau, 2009), we show that at the optimum the variational distribution  quhas a precision with the same block-tridiagonal structure as the prior precision  QΨ. More specifically, we start from the variational loss in Eq. (6) and expand the terms of the KL that depend on the posterior covariance  Σuu:

image

where  c(µu, pΨ)contains the terms in KL that do not depend on  Σuu. At the optimal covariance  Σ∗uu, the gradient of the loss w.r.t. the variational parameters is zero, i.e.  ∇ΣuuL(q)|Σ∗uu = 0, which leads to:

image

The term  QΨcontributes only to the block-tridiagonal band of the precision. The posterior prediction for each data point  q(f(xn))only depends on the marginal covariance  Σvnvnof the two neighboring inducing states from Eq. (14). So, each likelihood term for data  yncontributes to the posterior precision at the location zn−, zn+, on which  q(f(xn))depends, which is also in the band. That is, the optimal  quhas the same sparsity pattern in the precision as  pΨ. So, we choose to parameterise  qu = N(µu, Q−1u ), by its mean  µuand by a lower triangular matrix  Luwith two blockdiagonals which can be interpreted as the Cholesky factor of the block-tridiagonal  Qu. This parameterisation results in a  O(Md2)storage footprint instead of  O(M 2d2)had we chosen the more general mean, covariance parameterisation.

3.3 Doubly sparse variational GP inference:

image

We denote by S2VGP the algorithm that performs sparse variational inference with inducing state-space features while restricting the variational distribution to be in the class of multivariate normal distributions with block-tridiagonal banded precisions. Graphical models summarising the corresponding prior and approximate posterior assumptions are given in Figure 2.

S2VGP has the following computational advantages: (1) both the KL divergence and the pairwise marginal posterior predictions of contiguous inducing states (a pairwise Kalman-like smoothing of  qu) can be evaluated in  O(Md3); (2) Given these pairwise marginals on inducing states, the marginal posterior predictions of function evaluations at the data points  q(f(xn)) can beevaluated in parallel in  O(Nd3). Overall, the evaluation of the variational loss (and of its gradient) has complexity  O((N +M)d3) 1, which compares favourably against alternative variational methods as shown in Table 1. For efficient implementation of the variational loss and the gradients based on banded precision parameterised Gaussian distributions we refer to (Durrande et al., 2019).

S2VGP further inherits two properties of the SVGP approximation. First, marginal posterior predictions q(f(xn))for each data point can be computed independently which allows to perform stochastic optimisation of an estimator of the loss evaluated on random minibatches of size  Nbof the data, reducing the complexity of an evaluation of the objective to  O((Nb + M)d3). Second because the inducing inputs z are decoupled from the data inputs  x, S2VGP can be used as a GP -layer in a deep (compositional) architecture as in (Sal- imbeni and Deisenroth, 2017). Both properties are not available to alternative state-space approaches to GP models.

3.4 Natural gradient updates

To learn the variational parameters and the parameters of the model we resort to gradient based optimisation. The gradient of the objective w.r.t. a parameter  ξis defined as  ∇ξL = limϵ→0 arg minδ 1ϵ L(ξ +δ)subject to constraint  ||δ|| = ϵ. Intuitively, it is the direction of steepest descent with respect to the Euclidean norm of  δ, that maximally reduces the loss.

Table 1: Complexity and capabilities of variational inference algorithms for GP regression.

image

However, the Euclidean norm can be deceiving when optimising over distributions: small changes in parameters can induce large changes in the distributions, and changing the parameterisation of the distribution typically leads to different optimisation performance. Natural gradients solve this problems by substituting the constraint on the Euclidean norm by KL[qu(ξ) ∥ qu(ξ + δ)] = ϵ(Amari, 1998). Such a constraint can be shown to induce a quadratic norm in the parameter space with curvature given by the Fisher information matrix  Fξ. The direction of steepest descent w.r.t. this norm is given by ˜∇ξL = (∇ξL)F −1ξ .

Conveniently, for distributions in the exponential family, the Fisher matrix takes a rather simple form for any parameterisation  ξ(Malagò and Pistone, 2015):

image

where  θare the natural parameters,  ηthe expectation parameters and  ξthe parameterisation of our choice. For the variational distribution  qudefined in Section 3.2 these parameters are equivalent to:

image

It is clear that the banded structure of the sufficient statistics is reflected in both natural and expectation parameters. This allows us to derive efficient updates as in (Salimbeni et al., 2018) using the banded operators introduced by Durrande et al. (2019):

image

There is one caveat: the covariance term  (L−⊤u L−1u ) inηis a full matrix, but for an efficient update we need to transfer between  ξ[band] ⇄ η[band]using only the elements in the band. This requires a novel operator (the reverse of  I[·]in (Durrande et al., 2019, Sec. 4.1)), which maps the band of a covariance to the Cholesky factor of its banded precision. Detailed transformations between  θ[band] ⇄ ξ[band] ⇄ η[band], along with the algorithm for the operator can be found in the supplementary materials.

In practice, the natural gradient update is recommended when the likelihood is Gaussian since it en- sures convergence in very few steps (typically one step, see (Salimbeni et al., 2018)). We also observed that it performs well in the non-conjugate case, especially in the first iterations of the optimisation where it can very quickly move to areas of interest. These two properties are illustrated on simple examples in Appendix C.1.

4.1 Solar irradiance

The aim of this first experiment is to visually illustrate the predictive power of the proposed methodology on a simple GP regression example. We consider here the solar irradiance dataset2and choose the model y(x) = f(x) + ε where fis a centered GP with Matérn 3/2covariance and  εis i.i.d.  N(0, σ2). We compare three (approximate) posteriors for this model: (a) the classic SVGP, (b) the proposed S2VGP, and (c) the exact GPR posterior. For (a) and (b), a grid of 60 inducing locations is used, and the kernel parameters, the noise variance  σ2and the variational parameters are estimated by maximising the ELBO. For (c) the kernel parameters and the noise variance are estimated by maximising the model log-marginal likelihood.

As shown in Figure 3, the proposed method is much more accurate than SVGP on this example, and its predictions are extremely similar to the exact GPR model. One the one hand, this experimental setting may be seen to the advantage of our method since having the same number of inducing inputs means that there are three times more inducing variables in the S2VGP model. On the other, the computational burden is much smaller for S2VGP and this added flexibility actually comes with a reduction of the computational cost. This calls for a more thorough investigation that explores how the ELBO of SVGP and S2VGP compare with respect to the numbers of inducing variables, variational parameters and the execution time. This is what we do in the next section on a larger dataset.

4.2 Conjugate regression on time series

In this section we illustrate the computational and storage savings our method entails against the classical

image

Figure 3: Model comparison on the solar dataset. The SVGP (panel a) fails at capturing the high frequency variations, whereas the proposed S2VGP (panel b) results in predictions that are extremely similar to the ones obtained with exact inference (panel c).

SVGP algorithm on a conjugate regression problem (see Appendix D for a non-conjugate example). The data consists of an uttered vowel from a female speaker (Hillenbrand et al., 1995, file w01ae.wav) of length N = 4879, sampled at 16kHz. A vowel is a typical quasi-periodic signal where the pitch (or fundamental frequency) is fixed but the repeated pattern varies through time.

We encode our assumptions about this sound by constructing quasi-periodic kernels. Such kernels can be obtained as the product of a periodic kernel  kpand a Matérn 1/2kernel  k1/2whose lengthscale controls the rate of change of the periodic pattern (Solin and Särkkä, 2014). We construct periodic kernels of varying complexity as weighted sums of cosine kernels in harmonic ratio of frequencies  kJp (τ) = �Jj=1 γ2j cos(2πf0jτ)where  f0is the fundamental frequency and  γj controlsthe magnitude of each harmonic component. Each harmonic increases the state dimension by 2, so the resulting kernel  kJ(τ) = kJp (τ)k1/2(τ)has a state di- mension of d = 2J.

We then perform approximate inference in settings where we both vary the model complexity (using 1 to 4 harmonics) and the flexibility of the variational distribution by increasing M in powers of 2 (from 16 to 512). For each setting, we optimise the variational parameters and use the divergence  KL[q(s(·))|p(s(·)|y)]as a performance metric. We also compare the execution time of the evaluation of the gradient of L with respect to the variational parameters. We use similar implementations of SVGP and S2VGP as in the previous section.

Results are displayed in Figure 4. As a function of the number of inducing points M, the KL decreases much faster for S2VGP (a). This is because each inducing state contains more information about the process than an inducing evaluation. Both methods reduce the KL following a similar trend, with a small advantage for SVGP when compared against the actual number of inducing variables (b). However, as summarised in Table 1, storage and computational complexities of S2VGP grow linearly with M but are respectively quadratic and cubic in M for SVGP. Strikingly, the scaling  O((N+M)d3)for the gradient evaluation means the cost of adding extra inducing points in a S2VGP model is independent of the size of the dataset. For large N, it is thus possible to increase the number of inducing variable with very little impact on the computational time since the latter is dominated by the O(Nd3)term as demonstrated by the vertical lines in (d). Note that this is not the case for SVGP which scales as  O(NM 2 + M 3).

A visual illustration of an inference is given in Figure 5 (N = 4541, M = 318), where we have also removed part of the signal to demonstrate the out-of-sample predictive ability of S2VGP. Our method correctly interpolates between the periodic patterns at both ends of the missing data region with highest predictive uncertainty in the middle of this region.

4.3 Additive regression

This experiment illustrates three core capabilities of the proposed S2VGP algorithm: (i) it can deal with large datasets; (ii) it allows minibatching with rather large batch-size; (iii) it is not restricted to problems with 1-dimensional inputs. We also show that the proposed method is competitive in terms of performance.

The airline delay dataset consists of flight details (route distance, airtime, aircraft age, etc.) for every commercial flight in the USA for the year 2008. We use the same c = 8 covariates x as in (Hensman et al., 2013) to predict the delay y of the aircraft at landing.

We perform regression from  Rc → Runder the modelling assumption that the delay is additive, i.e. f(x) = �i fi(x(i)), where  x(i) ∈ R. We set GP priors over each function  fi ∼ GP(0, ki), where  kiare Matérn 3/2kernels. We propose a mean-field approximation to the posterior over processes  q(f1, . . . , fc) = �i q(i)(fi),where each process is approximated using our S2VGPparameterisation. Details are given in Appendix C.3.

image

Figure 4: Comparison of sparse variational GP regression using classic inducing points and state-space features on an audio time series (see Fig. 5). Each line corresponds to a given inference method (line style) and model complexity (color) and shows the KL divergence for a an increasing number of inducing points. The proposed method is more accurate for a given number of inducing points, requires less variational parameters to reach a good accuracy, and is extremely fast (especially for large M).

image

Figure 5: Vowel waveform with missing data and fit using S2VGP. Black dots are the data. Blue line and shaded area correspond to the posterior predictions. Red vertical lines are the locations of the inducing points.

We compare the MSE obtained with two different optimisation schemes: the first one optimises all parameters with Adam (Kingma and Ba, 2015), while the second uses natural gradient for the variational parameters and Adam for the remaining ones. All learning rates are set to a constant:  γnatgrads = 0.01and  γAdam = 0.01. Given the size of the dataset, we use minibatches of 10k points for the two optimisers when  N ≥ 106.

As illustrated in Table 2, natural gradient provides the best performances when the number of observations is small, but as expected it suffers from the minibatching on larger datasets. On the other hand, Adam performs similarly, even when it only has access to sub-samples of the data. The proposed approach has similar accuracy to the state of the art (Hensman et al., 2018). A graphical version of the results in this table is given in Appendix (Figure 9).

4.4 Time warping with deep GPs

In this section we demonstrate the ability of S2VGP to perform variational inference in a deep-GP model (Damianou and Lawrence, 2013), where inference is performed by following the approach presented in Salimbeni and Deisenroth (2017). We consider the problem of data alignment and focus on reproducing the results from (Kaiser et al. (2018, Sec. 4.1), see Appendix C.4). The dataset consists of two times series generated by a three layer model and further corrupted by additive Gaussian noise so that  yk = gk(f(ak(t)))+ε, whereε ∼ N(0, σ2I). The function f is a sine wave shared across the two observed series indexed by k = {1, 2}. The functions  akare time-warping functions, with  a1being the identity and  a2a quadratic function; while gkare output distortions applied to f, with  g1the hyperbolic tangent and  g2the identity. Parts of the observed sequences have been removed at different locations to assess the generalisation performance of the model. Compared to the setting of Kaiser et al. (2018), we double both the frequency of the true f and the number of observations.

The goal here is to infer all five functions  a1, a2, f, g1, g2under the true model structure. We place GP priors on all five functions with Matérn 3/2kernels. We fur-

Table 2: Predictive mean squared errors (MSEs) and negative log predictive densities (NLPDs) with one standard deviation on the airline arrival delays experiment.

image

Figure 6: Data alignment with S2VGP layers. The top three panels show samples from the inferred functions; the bottom two show the observed data y (coloured dots) and posterior predictions for the whole sequence for each output. Black circles indicate missing data.

ther introduce linear mean functions on all priors to avoid pathologies while propagating samples through the layers (Salimbeni and Deisenroth, 2017). We use M = [50, 100, 50] inducing points at each layer with the inducing inputs placed on a linear grid and kept fixed (all functions within a layer share the inducing input locations). We use natural gradients to learn the variational parameters of the approximate distributions in each layer, with the learning rate set to γnatgrads = 0.001. The hyper-parameters of the kernel, the mean functions and the likelihood noise are learnt using Adam (Kingma and Ba, 2015) with exponential decaying learning rate (initialised from  γAdam = 0.001).

image

Results of the inference are shown in Figure 6. The inferred  ak and gkfunctions (first and third row) share the same characteristics as the corresponding ground truth functions. The function f is also recovered with the correct dampening (second row). More interestingly, we see that due to the time warping of the first layer, the model has successfully learnt to reconstruct each observed sequence from different sections of f. Finally, in the lower two panels we report the model predictions highlighting how the learnt model in rightfully more uncertain in the missing data region. Uncertainty decomposition across layers is discussed in Appendix C.4.

The proposed doubly sparse variational Gaussian processes combines the variational sparse approximations for Gaussian processes with the state-space representation of the process. It inherits its appealing tractability from the variational approach, but has the representative power and computational scalability of state-space representations. Unlike other state-space GP methods, it is readily applicable to deep GP settings and supports mini-batch stochastic training. We showed that our framework can be used to approximate functions with more than one input variable while preserving the computational gain of state-space models.

To ease the optimisation of our variational objective, we derived natural gradient updates for the class of multivariate normal distribution with banded precisions. Although the objective is non-convex, this leads to few shot inference in the conjugate setting and empirically improves optimisation in non-conjugate settings. To further improve the applicability of S2VGP, differ-ent sub-optimal variational parameterisations could be used for  quleading to better behaved objectives or additional scalability improvement, albeit at the cost of reduced expressivity. Another route of improvement could consist in making a further steady-state approximation to the posterior to reduce the scaling with the GP state dimension from cubic to quadratic as in (Solin et al., 2018). As in (Nickisch et al., 2018), further computational gains could be achieved by using interpolations when we compute the SSM parameters.

Mauricio Alvarez and Neil D. Lawrence. Sparse con- volved Gaussian processes for multi-output regression. In Advances in Neural Information Processing

Systems, pages 57–64, 2009.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.

Matthias Bauer, Mark van der Wilk, and Carl E. Ras- mussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural

Information Processing Systems, pages 1533–1541, 2016.

David R. Burt, Carl E. Rasmussen, and Mark van der Wilk. Rates of convergence for sparse variational Gaussian process regression. In International

Conference on Machine Learning, 2019.

Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.

Nicolas Durrande, Vincent Adam, Lucas Bordeaux, Ste- fanos Eleftheriadis, and James Hensman. Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In Artificial Intelligence

and Statistics, pages 2780–2789, 2019.

GPflow. VGP model in gpflow. https://github.com/ GPflow/GPflow/blob/develop/gpflow/model.

Alexander Grigorievskiy, Neil D. Lawrence, and Simo Särkkä. Parallelizable sparse inverse formulation Gaussian processes (SpInGP). In International Workshop on Machine Learning for Signal Processing, pages 1–6, 2017.

James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282–290, 2013.

James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–

52, 2018.

James Hillenbrand, Laura A. Getty, Michael J. Clark, and Kimberlee Wheeler. American English vowels dataset, 1995.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine

learning, 37(2):183–233, 1999.

Markus Kaiser, Clemens Otte, Thomas Runkler, and Carl Henrik Ek. Bayesian alignments of warped multi-output Gaussian processes. In Advances in Neural Information Processing Systems, pages 6995–

image

Rudolph E. Kalman. A new approach to linear fil- tering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82:35–45, 1960.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International

Conference on Learning Representations, 2015.

Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In Neural Information

Processing Systems, 2009.

Luigi Malagò and Giovanni Pistone. Information geom- etry of the Gaussian distribution in view of stochastic optimization. In Foundations of Genetic Algorithms,

pages 150–162, 2015.

Alexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Journal of Machine Learning Research, 51:231–239, 2016.

Hannes Nickisch, Arno Solin, and Alexander Grig- orevskiy. State space Gaussian processes with nonGaussian likelihood. In International Conference on Machine Learning, pages 3789–3798, 2018.

Manfred Opper and Cédric Archambeau. The vari- ational Gaussian approximation revisited. Neural

computation, 21(3):786–792, 2009.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The MIT Press, Cambridge, MA, USA, 2006.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information

Processing Systems, pages 4588–4599, 2017.

Hugh Salimbeni, Stefanos Eleftheriadis, and James Hensman. Natural gradients in practice: Nonconjugate variational inference in Gaussian process models. In Artificial Intelligence and Statistics,

pages 689–697, 2018.

Simo Särkkä. Bayesian filtering and smoothing, volume 3. Cambridge University Press, 2013.

Simo Särkkä and Robert Piché. On convergence and accuracy of state-space approximations of squared exponential covariance functions. In 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2014.

Simo Särkkä and Arno Solin. Applied stochastic

differential equations. Cambridge University Press, 2019.

Arno Solin and Simo Särkkä. Explicit link between periodic covariance functions and state space models.

In Artificial Intelligence and Statistics, pages 904– 912, 2014.

Arno Solin, James Hensman, and Richard E. Turner. Infinite-horizon Gaussian processes. In Advances in Neural Information Processing Systems, pages 3486–

3495, 2018.

Michalis Titsias. Variational learning of inducing vari- ables in sparse Gaussian processes. In Artificial

Intelligence and Statistics, pages 567–574, 2009.

Mark van der Wilk, Carl E. Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems,

image

A Details on the doubly sparse variational Gaussian process

A.1 Conditional for Markovian Gaussian processes

We consider a stationary Markovian GP with state dimension d and denote by  (u−, s, u+)its evaluation on the triplet  (zn−, t, zn+). We here detail the derivation of  p(s|v = [u−, u+])

Derivation from the joint precision

image

with

image

and P  = [P1, P2] = T M = [T M1, T M2] given by

image

Derivation from the joint covariance

Another derivation using the covariance approach. One can write down the joint density

image

with

image

and get

image

Both implementations reveal the overall  O(d3)scaling of the conditional statistics.

A.2 Sampling from the variational posterior process

We here describe a method to jointly sample from the posterior  q(s(·)) at inputs x. Such a sample can be obtained by first sampling from the prior process at inputs [x, z]:

image

Then sampling from the marginal posterior at z:

image

And finally construct:

image

which is a sample from the marginal posterior q(s(x)).

This methods allows to generate samples in complexity  O((N + M)d3), which is the time required to jointly sample  sp, up. It was used to produce the posterior samples in the deep- GP experiment.

B Multivariate Gaussian distributions with banded precision: parameterisations, link functions and natural gradients

Here we present different parameterisations of a multivariate Gaussian distribution with block- tridiagonal precision matrices along with the link functions between them and describe how we use these to compute natural gradient updates of our variational objective.

B.1 Distribution parameterisations and link functions

In Section 3.2 we have defined the variational distribution approximating the posterior on inducing states to be

image

where  Qudenotes the precision matrix with block-tridiagonal structure and  Luthe Cholesky factor of the precision. We denote with  ξ : {mu, Lu}the above parameterisation. In the following table we present the identities that allow us to transfer back and forth from the default parameterisation  ξto the natural parameters θ : {θ1, θ2}and to the expectation parameters  η : {η1, η2}.

Table 3: Transformations between the different parameterisations of the Gaussian distribution with block-tridiagonal precision.

image

Note that the expectation parameter  η2is a full matrix since it involves the inverse of a banded matrix, which is not necessarily banded. However, since the sufficient statistics of the distribution are  t(u) = [u, btd[uu⊤]], we only need to compute the elements in the band.

B.2 Natural gradient update

When maximising our objective  L(ξ)with respect to  ξ, the parameters of our variational distribution, we perform a sequence of natural gradient updates:

image

In an exponential family with natural parameters  θand expectation parameters  η, the Fisher information is given by

image

which is a Jacobian-vector product allowing for an efficient implementation using automatic differentiation libraries. The computation of ∂L∂η is achieved using the chain rule: ∂L∂η = ∂L∂ξ∂ξ∂η.

B.3 Inverse of the subset inverse, and its reverse mode derivatives

A banded positive semi-definite (PSD) matrix Q has a Cholesky factor  LQthat is lower triangular with the same lower bandwidth. However, its inverse  Q−1is in most cases dense. If one is only interested in computing the entries of  Q−1that are located in the band of Q (denoted by bandQ[·]), efficient subset inverse algorithms are available (Durrande et al., 2019). We are now interested in the mathematical inverse of this operation, which we call reverse to avoid confusion with the matrix inverse operation.

image

We consider symmetric matrices with lower bandwidth r

image

Such a matrix has banded Cholesky factor

Computing the subset inverse is  O(nr2)(Takahashi, 1973). We present below an algorithm that performs the

reverse subset inverse operation with complexity  O(nr2).

B.3.1 Derivation of the forward evaluation  bandQ[C] → LQ

The following derivation shows how to compute, for each index i, the column  Li:i+r,igiven the sub-block  Ci:i+r,i:i+rindependently.

To simplify notations, we introduce the following intervals  n = [1 : i − 1], o = [i : i + r] and p = [i + r + 1 : n]. Wedenote by C the full covariance  C = Q−1 = L−T L−1.

First, we have that the sub-covariance  Cop,oponly depends on  Lop,op

image

Second, we have that the following expression for sub-covariance  Co,o

image

Using, the matrix inversion lemma, we get the following expression for  C−1o,o

image

By construction, the first column of  Lpois out of the matrix band (it is a null vector). Because  Lo,ois lower triangular, the product  LpoLTo,o also has a null vector as its first column, and so has the last term in Eq. 29.

Therefore, keeping only the first column, we end up with the identity

image

which is a system of r equations with  r unknown Lo,i, that we can solve analytically getting first  Li,i =�[C−1o,o]1,1then  Lo,i = [C−1o,o]:,1/Li,i

This derivation is summarised in Algorithm 1:

B.3.2 Derivation of the reverse mode differentiation

We manually derive the reverse mode differentiation of the reverse inverse subset algorithm introduced in the previous section.

We refer the reader to Giles (2008) for a brief introduction to reverse mode differentiation and to Durrande et al. (2019) for derivations of reverse mode differentiation of the subset inverse algorithm for banded matrices.

In a nutshell, in a chain  A → B → ... → c ∈ Rwith A, B matrices, the reverse mode derivative of operation of  f : A → Bis the operation propagating the reverse mode sensitivity ¯B = dcdBinto sensitivity dcdA. Given the differential identity  dc = �ij ∂c∂Bij dBij = Tr[ ¯BT dB]and the general differential relation at  A: dB = XAdAYA, itfollows that  dc = Tr[Y TA XAdA], therefore we identify  ¯AT = YA ¯BT XA.

Algorithm 1 being parallel, we can compute the contribution of each column of L to C separately. First we relate l(i) to v(i):

image

We identify  ¯v(i):

image

Then we relate  v(i) to c(i),

image

And we identify  ¯c(i):

image

Putting everything together, we have

Algorithm 2 summarises the derivations of the reverse mode sensitivity ¯C of thereverse subset inverse operation.

image

B.4 Fast implementation

In Algorithms 1 and 2, we computed the  (c(i))−1eindependently for each i. This can be achieved by first computing the Cholesky factors  s(i)of each  c(i)and then solving  v(i) = (s(i))−T (s(i))−1e. The direct Cholesky factorization of all  c(i)would incur a total complexity of  O(nr3). However one can use a recursive algorithm achieving the same goal in complexity  O(nr2).

Given the Cholesky factor  s(i) of c(i), we can compute the Cholesky factor of  c(i−1) as follows:

image

with

The Cholesky factor  s(i−1) is then readily obtained by removing the last row and column of the matrix in Eq. 34. The bottom right entry of the matrix in Eq. 34 corresponds to a Cholesky downdate that can be computed with cost  r2(Seeger, 2004), so that, starting from index n, all Cholesky factors  s(1), . . . , s(n)can be computed with complexity  O(nr2).

C Experimental details

C.1 Illustration of the efficiency of the natural gradient update

We describe in this section a simple experiment to compare the behaviour of various optimisers. Given a regular grid X of  103 points equally spaced on [0, 1] and a centred GP f with a Matérn 3/2 covariance (unit variance and length-scale  ℓ = 0.1), we generate two datasets at random as follow:

image

Note that the first model has a conjugate likelihood whereas the second one does not. We can then fit S2VGP models with 50 inducing points (fixed to a regular grid on [0, 1]). We show in Figure 7 the optimisations traces we obtained when optimising the variational parameters (all other model parameters being fixed to their nominal values), for different samples of the datasets. It can be seen that natural gradients provide a striking advantage in the conjugate case but that it also behaves favourably, especially in the first few iterations, in the non-conjugate case.

image

Figure 7: Optimisation traces for the ELBO of our S2VGP method. Three optimisers (Natural gradients, Adam and LBFGS) are compared on 10 datasets that are generated at random. The left pannel correspond to a dataset and a model with a conjugate likelihood, whereas the right one is for a non conjugate likelihood.

C.2 Details on Section 4.2: conjugate regression on time-series

The full sound waveform used in this experiment is shown in Figure 8(top).

To model this signal, we used the following stationary kernels that have an equivalent SDE representations of state-dimension 2J, where J is the number of harmonic components:

image

where  f0denotes the fundamental frequency of the pitched sound. The variance of the Matérn1/2kernel was set to one and its only free parameter is its length-scale. We used a Gaussian likelihood with variance  σ2. Since

image

Figure 8: Audio time-series used in Section 4.2

we focused on inference, all parameters were initially fitted to the data by maximising the marginal likelihood available in closed form in this conjugate setting.

We increased the number of inducing points in powers of 2, placing them on an homogenous grid from the start to the end of the time support. Inducing points locations were not learned.

For both SVGP and S2VGP, we only learned the variational parameters. We used LBFGS for both SVGP and S2VGP.

C.3 Details on Section 4.3: additive regression

The generative model is as follow:

We propose a mean-field approximation to the posterior over processes  g(f1, . . . , fc) = �i q(i)(fi)as in (Adam et al., 2016). where each process is approximated using our doubly sparse parameterisation with inducing states u(i) = fi(z(i))evaluated at component specific inputs  z(i).

image

Figure 9: Comparison of predictive MSE on the airline delays dataset when training S2VGP with various optimisers (the distribution of errors is across the 10 splits).

C.4 Details on Section 4.4: time warping with deep Gaussian process

Setting

The function f is a sine wave of the form  f(x) = (0.75(1 − tanh(10 ∗ 2πx/15)) + 0.25) sin(10 ∗ 2πx)and is shared across the two observed series. The functions  akare time-warping functions, with  a1(x) = xand  a2(x) = x2. The functions  gkare output distortions with  g1(x) = tanh(x) and g2(x) = x. To generate the two time series we uniformly sample 1000 points in [0, 1] and subsequently pass them to the 2-layer model that we described. The Gaussian additive noise has standard deviation  σ = 0.05. We removed observations in the intervals [0.55, 0.6] & [0.85, 0.9] for the first time series and in the interval of [0.40, 0.50] for the second time series.

Uncertainty decomposition across layers in the deep GP experiment

As can be seen in Figure 6, there is almost no uncertainty in the first layer. The reason for this is two fold. First, we initialised the kernels on the first layer to have very long lengthscales (initial value of 4), and small variance (initial value of 0.01) to bias the inference towards smooth functions. Second, the low posterior uncertainty in the intermediate layers is a known consequence of variational approach used in Salimbeni and Deisenroth (2017), where the variational distribution factorises across layers (we use the same approximation). This pathology has been recently explored in Ustyuzhaninov et al. (2019) and can be remedied by explicitly imposing a conditional dependency between the layers in the variational distribution. We conducted the same experiment with a higher observation noise level (σ = 0.1instead of  σ = 0.05) and report the result in Figure 10. Changing the noise level has no effect to the uncertainty in the intermediate layers but has a significant effect in the model’s ability to learn the correct functions, as the model prefers to explain these attributes by measurement noise.

image

Figure 10: Data alignment with S2VGP layers. Left: original experiment from Section 4.4 with  σ = 0.05. Right:same experiment with  σ = 0.1.

D Empirical comparison to alternative SSM based approximate inference methods

We empirically compare our S2VGP approach to alternative methods to perform approximate inference in GP models based on their state space representation.

We consider a simple classification task similar from the GPMLv4.2 toolbox demo gpml-matlab-master/doc/demoState.m with N=5000 (see Figure 11) and using Matern3/2kernel with fixed hyperparameters. For the S2VGP method, we choose M = 50 inducing points on a homogenous grid and we use gradient based optimisation (L-BFGS). We run inference and report the NLPDs and execution time:

image

For S2VGP, we run gradient based optimisation (using L-BFGS). We are equally accurate as the other methods yet much faster.

image

Figure 11: Classification data (N=5000) and S2VGP fit. The ground truth used to generate the data is shown in blue. Blue dots represent the binary data (with additional noise introduced for visibility). The posterior process is shown in red. Inducing point locations are shown in green.

image


Designed for Accessibility and to further Open Science