b

DiscoverSearch
About
My stuff
Randomly Projected Additive Gaussian Processes for Regression
2019·arXiv
Abstract
Abstract

Gaussian processes (GPs) provide flexible distributions over functions, with inductive biases controlled by a kernel. However, in many applications Gaussian processes can struggle with even moderate input dimensionality. Learning a low dimensional projection can help alleviate this curse of dimensionality, but introduces many trainable hyperparameters, which can be cumbersome, especially in the small data regime. We use additive sums of kernels for GP regression, where each kernel operates on a different random projection of its inputs. Surprisingly, we find that as the number of random projections increases, the predictive performance of this approach quickly converges to the performance of a kernel operating on the original full dimensional inputs, over a wide range of data sets, even if we are projecting into a single dimension. As a consequence, many problems can remarkably be reduced to one dimensional input spaces, without learning a transformation. We prove this convergence and its rate, and additionally propose a deterministic approach that converges more quickly than purely random projections. Moreover, we demonstrate our approach can achieve faster inference and improved predictive accuracy for high-dimensional inputs compared to kernels in the original input space.

Gaussian processes (GPs) are flexible Bayesian nonparametric models with well-calibrated predictive uncertainties. Gaussian processes can also naturally encode inductive biases, such as smoothness or periodicity, through a choice of kernel function (Rasmussen and Williams, 2006). Gaussian processes have been especially impactful in the small-data regime, where careful uncertainty representation is particularly crucial, strong priors provide useful biases where learning is difficult, and exact inference is tractable. Additionally, Gaussian processes have been most successfully applied to low-dimensional input (predictor) spaces, such as time series, and spatiotemporal regression problems (e.g., Wilson and Adams, 2013; Duvenaud, 2014; Herlands et al., 2019). In these settings, canonical kernels — such as the RBF or Mat´ern kernels — provide reasonable distance metrics over pairs of data instances; for example, if we are modelling CO2concentrations indexed by time, then CO2levels at times which are close together in  ℓ2 or ℓ1 distancewill be treated as highly correlated under these kernels.

For higher dimensional problems, these standard distance metrics become less compelling. For example, with an RBF kernel, the fraction of data space with high covariance with a given point decreases exponentially with dimension. Additionally, in many online settings where Gaussian processes are used as regression models, such as Bayesian optimization, there is exponential regret with dimensionality (Srini- vas et al., 2010; Bull, 2011). Furthermore, scalable Gaussian processes which have a high degree of accuracy often only apply for up to a few input dimensions (e.g., Wilson and Nickisch, 2015; Gilboa et al., 2013).

To help circumvent such issues, there are two popular approaches. The first approach is to learn a projection into a lower dimensional space, such as through deep kernel learning (Wilson et al., 2016). While such approaches are highly flexible, they introduce many hyperparameters to train, which can be burdensome and impractical in the small data regime. Alternatively, additive Gaussian processes (Duvenaud, 2014; Kandasamy et al., 2015; Hastie and Tib- shirani, 1986) instead consider a sum of kernels, with each kernel operating on subsets of the input dimensions. This structure can both help reduce the effective dimensionality of the problem, and provide a useful inductive bias with compelling sample complexity (Stone et al., 1985). However, while assuming a fully additive decomposition of an untransformed space can provide a useful inductive bias for many real data sets, it is often too restrictive (Li et al., 2016). Moreover, methods for learning additive structure, as with standard projection approaches, are either computationally expensive or require learning a large number of parameters, which may overfit or hurt uncertainty estimation.

In this work, we show how to dramatically reduce the input dimensionality of a given problem, while retaining or even improving predictive accuracy, without having to learn projections. Specifically, our contributions are as follows:

We propose a novel learning-free algorithm for constructing additive GPs based on sequences of multiple random projections (RPA-GP). This results in a Turning Band style (Matheron, 1973) approximation to a high-dimensional kernel.

We prove that RPA-GP converges to a full-degree inverse multiquadratic kernel as the number of projections increase at a rate of  O(J−1/2)where J is the number of projections.

We propose a deterministic algorithm (DPA-GP) to minimize projection redundancy and achieve faster convergence to the limiting kernel.

We demonstrate the surprising result that RPA-GP and DPA-GP converge very quickly to the regression accuracy of a kernel operating on the full dimensional inputs, over a wide range of regression problems, even for projections into a single input dimension.

We show in a large empirical study that fully additive GPs can also perform competitively with GPs using standard kernels, but are outperformed by DPA-GP with automatic relevance determination on the original input space, particularly on large data sets and high dimensional data sets.

We additionally demonstrate that by exploiting the additive structure of RPA-GP, we alleviate the curse of dimensionality for structured kernel interpolation (SKI) (Wilson and Nickisch, 2015), enabling lineartime training and constant-time predictions over a wide range of problems, including problems with over 1000 input dimensions.

We provide GPyTorch code (Gard- ner et al., 2018) for all models at

https://github.com/idelbrid/ Randomly-Projected-Additive-GPs.

The high level idea of random projections to compose additive kernels has been considered in geostatistics under the name the turning band method (TBM) (Matheron, 1973), for 2 and 3-dimensional simulation. However, the execution and details are very different from what we consider here. Our paper analyzes and demonstrates how learning-free additive projections can be promising for regression in high dimensional input spaces.

RPA-GP and DPA-GP are a step towards alleviating the curse of dimensionality for Gaussian processes, while retaining a pleasingly tractable and lightweight representation. We focus our experiments on regression, since regression is the basic foundation for many popular procedures involving Gaussian processes, such as Bayesian optimization (Moˇckus, 1975), and model based reinforcement learning

(Deisenroth and Rasmussen, 2011; Engel et al., 2005), and is in itself a widespread application for Gaussian processes (Williams and Rasmussen, 1996; van Beers and Kleijnen, 2004).

We briefly review Gaussian process regression and structured kernel interpolation (SKI) (Wilson and Nickisch, 2015). For more details on Gaussian processes, we refer the reader to Rasmussen and Williams (2006).

2.1. Gaussian process regression

Formally, a Gaussian process f is a stochastic process over an index set X (typically elements of X are in  Rd) taking on real values. Therefore, it can be interpreted as a prior over functions from X to R. The process evaluated at any finite collection of points is distributed according to a multivariate normal distribution. That is, for any  x1, ..., xn ∈ X,f = [f(x1), ..., f(xn)] ∼ N(mX, KX,X). Accordingly, a Gaussian process is fully determined by its prior mean function  m : X �→ Rand covariance kernel function k : X × X �→ R. The prior mean function is often chosen to be 0 in the case where we have limited knowledge of f. Therefore, a Gaussian process is almost entirely determined by k. Standard identities of the multivariate Gaussian distribution can be applied to find the posterior predictive distribution under a Gaussian observation model given data

image

where

image

The computational bottleneck in computing the posterior distribution is solving the linear system  (KX,X +σ2In)−1y.Standard approaches use the Cholesky decomposition, which requires  O(n3)computations.

The log marginal likelihood

image

is used for model comparison and optimization. Typically, one parameterizes the kernel with some number of hyperparameters which are tuned by maximizing the marginal likelihood. This maximization provides automatic regularization because the determinant  |KX,X + σ2In|penalizes quickly varying functions. The computational bottleneck in computing the marginal likelihood is the determinant, which has the standard computational cost of  O(n3)from the Cholesky decomposition.

2.2. Structured kernel interpolation

Structured kernel interpolation (SKI) (Wilson and Nickisch, 2015) uses an approximation to the kernel  KX,Xthat permits fast matrix-vector multiplications, which are used to compute the log marginal likelihood and predictive distributions. Specifically, let U be a regular grid of inducing points. Wilson and Nickisch (2015) let W be a matrix of local cubic interpolation weights from U to X (Keys, 1981). The SKI kernel approximation of base kernel matrix  KX,Xis

image

The interpolation matrix W is sparse, having only  4dnonzero elements per row. The matrix  KU,Ucan have Toeplitz (if d = 1) or Kronecker (if d > 1) structure, either of which permit fast matrix-vector multiplications with  KU,U(Saatc¸i, 2012) and thus also the approximate KX,X, due to the sparse interpolation in SKI. The linear solve  (KSKIX,X + σ2In)−1ycan then be efficiently computed using linear conjugate gradients, which proceeds by iterative matrix-vector multiplications of  KSKIX,X + σ2In. The log determinant can be computed using stochastic Lanczos quadrature (Dong et al., 2017), which similarly only requires iterative matrix-vector multiplications of  KSKIX,X + σ2In.

However, fixing the number of inducing points in each dimension, the size of the grid grows exponentially with dimension. Therefore, inference using SKI is intractable generally for dimension d > 5.

GPs with kernels that fully decompose additively1, i.e.

image

for some sub-kernels  {ki}di=1, are considered Generalized Additive Models (GAM) (Hastie and Tibshirani, 1986). We refer to the resulting GP as a GAM GP throughout this paper. Here, we denote the ith component of vector  x as xi withoutbold face to indicate that it is a scalar value.

The GAM GP implicitly assumes there are only first-order interactions in the modeled function. This assumption may be reasonable inductive bias in some cases, but it is often too strong (Li et al., 2016; Duvenaud et al., 2011). It is natural, then, to consider additive combinations of sets of features. Unfortunately, the space of subsets of features is a power set and therefore grows exponentially. Therefore, learning additive combinations of kernels on subsets of features is difficult. Extant approaches to learning additive kernel structure can be divided roughly into enumeration methods, where sub-kernels consider every combination of feature interactions up to a degree, search methods, where possible decompositions are traversed by a search algorithm, and projection-pursuit where a projected-additive GP is learned by iteratively optimizing projection directions from regression residuals.

Hierarchical Kernel Learning (Bach, 2009) is an enumeration method in which one constructs the sum of kernels in a hull of possible kernels. Duvenaud et al. (2011) compute the sum of kernels over every possible feature combination in  O(d2)time by using the Newton-Girard formulae. Du- venaud et al. (2013) define a grammar over kernels and use discrete search to optimize kernel structure. Qamar and Tokdar (2014) uses a sampling approach to search through additive decompositions. In Bayesian optimization, it is especially beneficial to learn additive structure where no features are overlapped between sub-kernels. Gardner et al. (2017) perform MCMC sampling over such kernel structures as a search method. Similarly, Wang et al. (2017) perform a Gibbs sampling procedure to search over feature partitions. Enumeration methods inherently incorporate a very large number of sub-kernels, which can be expensive to compute for high dimensions. Search methods, on the other hand, are burdened by searching over a combinatorial space.

Projection pursuit, introduced by Friedman and Stuetzle (1981) and adapted to the Gaussian process setting by Saatc¸i (2012); Gilboa et al. (2013), is different in that one learns projected-additive GPs. That is, the GP is an additive combination of low-dimensional kernels defined on linear projections of data whose directions are sequentially optimized. If a large number of projections are used, the sequential optimization of directions with respect to the marginal likelihood can be computationally expensive, and the large number of parameters learned by optimization may result in overfitting and poor uncertainty estimation (Li et al., 2016).

A GP using a single non-additive random projection has been briefly considered with promising preliminary results (Wang et al., 2016). However, we find that such methods can be dramatically improved through sequences of additive random and deterministic projections, and investigate this surprising and practically significant result. Additionally, Guhaniyogi and Dunson (2016) use a GP over random projections of high-dimensional data having low-dimensional manifold structure. However, their work does not explore additive Gaussian processes and also relies on a model average of many GPs to account for variation in the random projections.

Composing 1-dimensional stochastic processes along ran- dom directions to approximate higher-dimensional stochas- tic processes has been used in the geostatistics community under the name the “turning bands method” and has since been studied in detail for simulation of 2 to 3-dimensional processes (Matheron, 1973; Mantoglou and Wilson, 1982; Mantoglou, 1987; Lantu´ejoul, 2013). Work has been devoted to describing the 1-dimensional covariances associated with common covariances (Christakos, 1987; Gneit- ing, 1998), quantifying the approximation error (Mantoglou, 1987), and even choosing well-spread directions (Freulon and Lantuejoul, 1993; Lantu´ejoul, 2013). Yet, this direction of work has not been explored for higher dimensional GPs, nor for Gaussian process regression.

Rather than directly learning additive structure, we project data onto randomly drawn directions and impose additive structure on a GP defined over the projections. As a result, we bypass the need to search over or enumerate all possible sub-kernels, and the burden of training many hyperparameters in a learned projection.

Formally, let n be the number of data points, d be number of dimensions, and J be the number sub-kernels. Denoting the degree of kernel  j as Dj, we define the randomly projected additive kernel as

image

We refer to a GP with covariance kernel  krpas a randomlyprojected additive GP (RPA-GP). Matrices  {P (j)}Jj=1de- fine the directions of the projections. The parameters {αj}Jj=1 determine the amount of variance each sub-kernel contributes and may be either learned or set as a constant value 1/J.

If the sub-kernel degrees  Djare large enough relative to sample size n, the Johnson-Lindenstrauss Lemma guarantees that the  ℓ2distances between points are approximately preserved with high probability (Sarlos, 2006). Alternatively, we have a similar guarantee if data lie on a low-dimensional manifold (Baraniuk and Wakin, 2009). Therefore, if we use RBF sub-kernels, each sub-kernel is a good approximation of the high-dimensional RBF kernel. Moreover, having multiple random projections increases the like-

0.00 0.15 0.30 0.45 0.60 0.75 0.90 1.05

Figure 1. Contour plots of 2-dimensional kernels. From left to right: RBF, GAM RBF, RPA-GP with 16 projections, and DPA-GP with 16 projections. With enough additive projections, we attain approximately spherical covariance, and choosing well-placed directions facilitates convergence.

lihood of drawing a good random projection but does not increase kernel dimensionality, similar to the method presented in Ahmed (2004). However, if the sub-kernel degrees are small, we have sufficient flexibility given enough projections. RPA-GP forms a distribution over linear combinations of ridge functions, which are defined as functions that are invariant in all but 1 direction. Since linear combinations of ridge functions are dense in the set of continuous functions, we are able to approximate any continuous function arbitrarily well given a rich enough set of directions (Cheney and Light, 2009).

4.1. The expected kernel

We now analyze projected-additive kernels by studying the functional form of the covariance in comparison to the RBF kernel. We limit our analysis to the most challenging case of one-dimensional additive projections, i.e. when each matrix  P (j)is a vector  ηj, though analysis is similar for higher dimensional projections. Further, we assume unit length-scale, which can always be achieved by appropriate scaling of data. For brevity, we defer proofs to the appendix.

Clearly, an additive (GAM) covariance kernel does not decay to zero as the  ℓ2distance between points goes to infinity. For example, a GAM kernel with RBF additive components is lower bounded by d−1dalong each axis. Conversely, in expectation, the covariance of a randomly projected additive kernel with RBF components decays to zero in any given direction; if  ηjare drawn from an isotropic distribution, an additive randomly projected kernel converges to a high-dimensional kernel as  J → ∞. This is made formal in the following proposition.

Proposition 1. Let  φ: R �→ [−1, 1]be a 1-dimensional kernel, and let  (ηj : j ≥ 1)be an i.i.d. sequence of random variables in  Rd drawn from a common isotropic distribution D. Then, for some expected kernel  kexpected : R �→ [0, 1], forany  τ ∈ Rd, almost surely

image

For certain choices of sub-kernel  φand distribution D, we can recover familiar kernels.

Corollary 1. If  φ(x) = e− 12 x2 and η1 ∼ N(0, Id), then

image

Corollary 2. If  φ(x) = cos(x) and η1 ∼ N(0, Id), then

image

The expected kernel in (5) is a rational quadratic kernel with parameter  α = 1/2, also known as the inverse multiquadratic kernel. It is especially relevant because in this work we focus on the case when the base kernel is RBF. Note that the spectral density provides a standard way to derive sub-kernels associated with higher-dimensional kernels (Mantoglou, 1987).

We can also derive an  O(1/√J)convergence rate.

Proposition 2. Let  φ, kexpectedbe as in Proposition 1. Let {ηj}Jj=1be a sequence of random variables drawn i.i.d. from an isotropic distribution. Let  δ > 0. Then, with probability at least  1 − δ, we have simultaneously for all pairs of points  τi,k, i, k ∈ [n],

image

Empirically, we see convergence to the performance of a kernel operating on the original space at a much greater rate. This empirical result is intuitive because even if the resulting kernel after additive random projections is not a multiquadratic kernel, it may still be a good kernel for the data.

4.2. Reducing projection redundancy

As shown in Proposition 2, sampling directions purely randomly converges at the “slow” simple Monte Carlo rate of  O(1/√J). Ideally, we would space directions equally. However, even in only 3-dimensions, this is only possible for certain values of J = 3, 15, ... (Mantoglou, 1987). In higher dimensions, the problem is highly nontrivial. One solution is to numerically maximize a measure of distance between points, such as the antipodal separation distance

image

which directly measures the minimal angle between directions. However, because maximizing  δis difficult, we instead minimize the loss

image

Minimizing  ℓhas the effect of increasing the separation distance  δbetween directions, though an optimizer of  ℓdoes not necessarily coincide with an optimizer of  δ unless d ≥ J.Additionally, given sufficiently large J, a set of directions {ηj}Jj=1 that maximize  ℓis a spherical t-design with t = 4 (Womersley, 2018), thus guaranteeing optimal order rate decay of worst-case error for quadrature of smooth functions. If  J ≤ d, orthogonal directions minimize  δand  ℓ, so we simply use Gram-Schmidt orthogonalization. Otherwise, we minimize  ℓusing gradient descent. We refer to projected-additive GPs with directions chosen by this method as Diverse Projected-Additive GPs (DPA-GP). We visualize the DPA-GP and other kernels in Figure 1.

4.3. Applying length-scales before projection

If each sub-kernel of an additive kernel learns its own length-scale, it is not clear that the additive kernel approximates an expected kernel. Additionally, if the number of additive kernels J is large, learning separate length-scales introduces many hyperparameters which are also only indirectly related to the original inputs.

Alternatively, we propose applying automatic relevance determination scaling directly on the original input space before the data are projected to a low-dimensional space. To learn the length scales, we efficiently propagate gradients through the projections with automatic differentiation. We define

image

where  A = diag(σ−1). When αj = 1/J for all j and eachkjhas unit length-scales, the theory of section 4.1 readily applies, while permitting flexible treatment of length-scales.

In Section 5, we make the empirical discovery that this ARD approach provides significant performance gains. To distinguish this parameterization from others, we refer to such a model with the -ARD suffix.

4.4. Scaling to large data sets with high dimension with

image

In section 2.2, we described how SKI enables scalable GPs, but is constrained to input dimensions of about d < 5, if no kernel structure is exploited. However, if a kernel decomposes additively by groups of dimensions, it is possible to generalize the applicability of SKI to much higher dimensional spaces. Suppose that a kernel decomposes additively as

image

where  x(j) denotes a group of dimensions of point x. Then, the Gram matrix K corresponding to this kernel decomposes similarly, so a matrix-vector multiplication can be performed with each kernel separately as  Kv = �Jj=1 K(j)v.Since we assume that each sub-kernel  kjis low-degree, each matrix-vector multiplication  K(j)vcan be computed efficiently using SKI. In particular, in the case that each sub-kernel is 1-dimensional, inference with such a kernel using SKI has complexity O(Jc(n + m log m)), where m is the number of inducing points for each projection and c is the number of iterations of linear conjugate gradients. Typically, c ≪ nis sufficient to reach convergence within machine precision, so inference is approximately linear in n (Wil- son and Nickisch, 2015). We demonstrate this asymptotic scaling empirically in Section 5.4.

We evaluate RPA-GP and DPA-GP on a wide array of regression tasks. We compare the predictive accuracy of the proposed methods to GPs with RBF and GAM kernels (section 5.1), study the effect of increasing number of projections (section 5.2), compare predictive accuracy under various assumptions via synthetic data sets and on very high-dimensional data (section 5.3), and demonstrate the superior asymptotic scaling of RPA-GP with SKI over traditional inference (section 5.4). We implement all models using GPyTorch (Gardner et al., 2018) and provide code at https://github.com/idelbrid/ Randomly-Projected-Additive-GPs.

5.1. Benchmarks on UCI data sets

To evaluate RPA-GP, we compare each method’s regression performance on a large number of UCI data sets. For each model, and for each data set, we perform 10-fold cross validation twice to accurately measure the performance of stochastic methods. For each fold, we normalize the features and target function by mean and standard deviation as computed on the training folds, so predicting the mean results in  ≈ 1RMSE. For each fold, we fit the kernel hyperparameters by maximizing the log-marginal likelihood. We use Adam (Kingma and Ba, 2014) with learning rate 0.1 for at most 1000 iterations, stopping if log marginal likelihood improves by less than 0.0001 over 20 iterations, smoothing with a moving average. We use a smoothed box prior over the likelihood noise parameter to encourage numerical stability. We compute the root mean squared error on the held-out fold for each model, data set, and fold, and report the RMSE in Figure 2. To reiterate, RBF-ARD is a standard RBF GP with ARD; GAM-GP is an additive GP with RBF subkernels and ARD. RPA-GP-1, DPA-GP, and DPA-GP-ARD are additive across 20 1-dimensional projections. RPA-GP-1 uses Gaussian random projections; DPA-GP minimizes the objective in Equation 7; and DPA-GP-ARD performs the pre-projection scaling method described in Section 4.3.

We report results for additional models tested in the appendix. In particular, using SKI for scalable inference, enabled by additive projections, results in essentially identical performance as exact inference. Additionally, an IMQ kernel achieves error similar to an RBF kernel. Further, we find as expected that including sub-kernels of slightly higher degree can improve performance.

5.2. Convergence of random projected-additive GP accuracy

To study the sensitivity of RPA-GP to the number of projections and the behavior as  J → ∞, we perform an ablation study for several data sets. For each data set, we perform 10-fold cross validation (twice) with 1-degree random projected-additive GPs and vary the number of projections. We show representative plots in Figure 3. In these experiments, the benefit of DPA-GP-ARD also becomes obvious, as we see it is able to converge much more quickly than the other approaches. By the time J = 20, represented in Figure 2, the various projection approaches are more comparable.

5.3. Comparisons to fully-additive kernel

As observed in section 5.1, GAM GP performs surprisingly well in comparison to the standard RBF-ARD kernel, despite its limited model class. This result is noteworthy in its own right, since the GAM GP ignores everything but first order interactions between inputs. To our knowledge it is not known that such a parsimonious representation can achieve comparable results to a kernel acting on the full inputs over this significant range of experiments.

The GAM kernel is also related to DPA-GP. Using J = d 1-dimensional projections, DPA-GP is equivalent to a GAM with a random rotation applied to input data (though the ability to tune model expressiveness by controlling J is a key benefit to a random projection approach). As a result, we expect the methods to perform similarly if the target function does not vary mostly in axis-aligned directions.

To understand the differences between GAM GP and our proposed techniques, we consider additional empirical tests on synthetic data and very high-dimensional data sets.

Synthetic regression tasks: In Figure 6, we test each method as we increase the number of data points on additive

image

Figure 2. The proposed methods and GAM-GP perform surprisingly well compared to RBF-GP. DPA-GP-ARD is able to match the performance of RBF-GP even for large data sets, where the flexibility of GAM-GP begins to be a limiting factor.

image

Figure 3. Representative test RMSE of RPA-GP and DPA-GP as the number of projections vary compared to full-dimensional RBF and inverse multiquadratic (IMQ) kernels. Shaded regions are 2 times the standard deviation over cross-validation, and lines are the average RMSE. For clarity, we only show the variation for DPA-GP-ARD. In general, there is a fast convergence to the performance of RBF and IMQ kernels, and DPA-GP consistently improves upon RPA-GP by a small amount, and applying length-scales before (DPA-GP-ARD) projection dramatically increases performance.

synthetic data. We see that GAM, as expected, performs better for target functions that are truly additive. Conversely, GAM can perform poorly if the function is not additive. For target functions that are rotation-invariant, all of GAM, DPA-GP, and DPA-GP-ARD perform equivalently as expected. However, if irrelevant features are introduced, which we conjecture is a frequent occurrence in real regression problems, DPA-GP-ARD performs better than GAM. Irrelevant features introduce noise into the projections, but ARD prunes irrelevant features and effectually reduces the input dimension, wherein DPA-GP-ARD still uses d additive components, but GAM then effectually uses < d additive components. Hence, DPA-GP-ARD provides a better approximation of a non-additive kernel.

High dimensional regression tasks: We construct regression data sets of three different sizes from the Olivetti faces data set, following Wilson et al. (2016) and Hinton and Salakhutdinov (2008). We uniformly subsample images, uniformly sample a rotation in  [−90, 90], crop the rotated images, and use the rotations as regression targets. DPA-GP-ARD outperforms RBF and GAM GPs with when n is small compared to d. Results with n = 400 images are presented in Figure 4. We additionally test on three genomics data sets, finding that DPA-GP-ARD and GAM GP generally perform comparably and provide figures in the appendix.

image

Figure 4. Comparative performance on the Olivetti face orientation regression task. With high dimensionality and a non-additive target function, DPA-GP-ARD outperforms alternatives, though the number of projections J must be somewhat high.

image

Figure 5. Training run time of RPA-GP with SKI compared to canonical Cholesky decomposition-based inference on synthetic data with 100 input dimensions and varying numbers of points following  x ∼ N(0, Id), y = �100i=1 sin(xi) + ϵ. We note that runtime behaviour is somewhat data dependent, due to changes in conditioning of the kernel matrix, but the scaling of SKI remains approximately linear, compared to the cubic scaling of the Cholesky decomposition.

5.4. Scaling to large data sets

To demonstrate the asymptotic computational complexity of RPA-GP with SKI, we train both RPA-GP with SKI and a GP with RBF kernel using Cholesky-based inference for 120 Adam iterations on data sets constructed by drawing  X ∼

image

d = 100 and a varying number data points. We use RPA-GP with 20 1-dimensional projections and 512 inducing points per projection. We run this experiment on a MacBook Air with a 1.8 GHz Intel i5 processor with 8 GB of RAM. The results are presented in Figure 5.

Note that it is infeasible to run SKI without RPA-GP with a reasonable number of inducing points on data sets with this number of input dimensions; even if d = 6 and we have 100 inducing points in each dimension, the resulting 1 trillion inducing points represented as 32-bit floating point numbers cannot be stored in memory.

We proposed novel learning-free algorithms to construct additive Gaussian processes by using sums of low-dimensional kernels operating over random (RPA-GP) projections. We demonstrated the remarkable result that these approaches achieve the performance of kernels operating over the full-dimensional input space even when projecting into a one-dimensional space and without learning the projections.

Moreover, we proved that by taking enough additive random projections, RPA-GP converges to the inverse multiquadratic kernel and proposed a novel deterministic algorithm (DPA-GP) to reduce projection redundancy that indeed improves regression performance. We demonstrated that a fully-additive GP also achieves remarkable empirical performance and illustrated that, while such a model is superior if its assumption holds, combining DPA-GP with automatic relevance determination directly on input features (DPA-GP-ARD) performs as well as fully additive GPs on very high-dimensional data and better on large data sets. Finally, as an added benefit, we demonstrated that by exploiting the additive structure of RPA-GP, we essentially reduce inference from a d dimensional problem to J 1-dimensional problems, enabling the application of SKI (Wil- son and Nickisch, 2015) and thereby reducing standard GP computational complexity from  O(n3)to O(Jc(n + m)), where J is the number of random projections, c is the number of linear conjugate gradients iterations, and m is the number of inducing points. These results are of great practical significance: we have shown that GP regression, which is the backbone of many procedures, can often be effectively reduced to one-dimension without requiring the training of a projection. These approaches also naturally generalize the applicability of popular scalable inference procedures, such as SKI, which have been conventionally constrained to lower dimensional spaces.

In short, we demonstrate the pleasing result that a range of regression problems can be reduced to a single input dimension, while retaining or even improving accuracy, without having to learn a projection. In a single dimension, methods become much easier to analyze and scale, leading a rich variety of future research directions.

AGW and IAD were supported by Amazon Research Award, Facebook Research, NSF IIS-1563887, and NSF IIS-1910266. We thank Alex Smola for helpful discussions.

Yousuf Shamim Ahmed. Multiple random projection for fast, approximate nearest neighbor search in high dimensions. University of Toronto, 2004.

image

Figure 6. Top left: Average RMSE on an additive function. GAM is best as expected. Top right: Average RMSE on a very non-additive function: a smooth relaxation of the XOR function. GAM performs worse than DPA methods. Bottom left: Average RMSE synthetic data with a rotation-invariant function. All methods perform equally. Bottom right: Average RMSE synthetic data with half irrelevant features. Plain DPA-GP cannot distinguish relevant features from irrelevant ones after projection, but DPA-GP-ARD improves on both DPA-GP and GAM.

Francis Bach. High-dimensional non-linear variable selec- tion through hierarchical kernel learning. arXiv preprint arXiv:0909.0844, 2009.

Richard G Baraniuk and Michael B Wakin. Random projec- tions of smooth manifolds. Foundations of computational mathematics, 9(1):51–77, 2009.

Adam D Bull. Convergence rates of efficient global opti- mization algorithms. Journal of Machine Learning Research, 12(Oct):2879–2904, 2011.

Elliott Ward Cheney and William Allan Light. A course in approximation theory, volume 101. American Mathematical Soc., 2009.

George Christakos. Stochastic simulation of spatially cor- related geo-processes. Mathematical Geology, 19(8): 807–831, 1987.

Marc Deisenroth and Carl E Rasmussen. Pilco: A model- based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.

Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and Andrew G Wilson. Scalable log determinants for gaussian process kernel learning. In Advances in Neural Information Processing Systems, pages 6330–6340, 2017.

David Duvenaud. Automatic model construction with Gaussian processes. PhD thesis, University of Cambridge, 2014.

David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Ghahramani Zoubin. Structure discovery in nonparametric regression through compositional kernel search. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1166–1174, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr. press/v28/duvenaud13.html.

David K Duvenaud, Hannes Nickisch, and Carl E Ras- mussen. Additive gaussian processes. In Advances in neural information processing systems, pages 226–234, 2011.

Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with gaussian processes. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 201–208, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351. 1102377. URL http://doi.acm.org/10.1145/ 1102351.1102377.

Xavier Freulon and Christian Lantuejoul. Revisiting the turning bands method. Acta Stereologica, 1993.

Jerome H Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American statistical Association, 76(376):817–823, 1981.

Jacob Gardner, Chuan Guo, Kilian Weinberger, Roman Garnett, and Roger Grosse. Discovering and exploiting additive structure for bayesian optimization. In Artificial Intelligence and Statistics, pages 1311–1319, 2017.

Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Advances in Neural Information Processing Systems, pages 7576–7586, 2018.

Elad Gilboa, Yunus Saatc¸i, and John Cunningham. Scaling multidimensional gaussian processes using projected additive approximations. In International Conference on Machine Learning, pages 454–461, 2013.

Tilmann Gneiting. Closed form solutions of the twodimensional turning bands equation. Mathematical Geology, 30(4):379–390, 1998.

Rajarshi Guhaniyogi and David B Dunson. Compressed gaussian process for manifold regression. The Journal of Machine Learning Research, 17(1):2472–2497, 2016.

Trevor Hastie and Robert Tibshirani. Generalized additive models. Statist. Sci., 1(3):297–310, 08 1986. doi: 10. 1214/ss/1177013604. URL https://doi.org/10. 1214/ss/1177013604.

William Herlands, Daniel B. Neill, Hannes Nickisch, and Andrew Gordon Wilson. Change surfaces for expressive multidimensional changepoints and counterfactual prediction. Journal of Machine Learning Research, 20(99):1– 51, 2019. URL http://jmlr.org/papers/v20/ 17-352.html.

Geoffrey E Hinton and Ruslan R Salakhutdinov. Using deep belief nets to learn covariance kernels for gaussian processes. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1249–1256. Curran Associates, Inc., 2008.

Kirthevasan Kandasamy, Jeff Schneider, and Barnab´as P´oczos. High dimensional bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295–304, 2015.

Robert Keys. Cubic convolution interpolation for digital im- age processing. IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 1981.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Christian Lantu´ejoul. Geostatistical simulation: models and algorithms. Springer Science & Business Media, 2013.

Chun-Liang Li, Kirthevasan Kandasamy, Barnabas Poc- zos, and Jeff Schneider. High dimensional bayesian optimization via restricted projection pursuit models. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Ar-tificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 884–892, Cadiz, Spain, 09–11 May 2016. PMLR. URL http:// proceedings.mlr.press/v51/li16e.html.

Aristotelis Mantoglou. Digital simulation of multivariate two-and three-dimensional stochastic processes with a spectral turning bands method. Mathematical Geology, 19(2):129–149, 1987.

Aristotelis Mantoglou and John L Wilson. The turning bands method for simulation of random fields using line generation by a spectral method. Water Resources Research, 18(5):1379–1394, 1982.

G. Matheron. The intrinsic random functions and their appli- cations. Advances in Applied Probability, 5(3):439–468, 1973. ISSN 00018678. URL http://www.jstor. org/stable/1425829.

Jonas Moˇckus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.

Shaan Qamar and Surya T Tokdar. Additive gaussian pro- cess regression. arXiv preprint arXiv:1411.7009, 2014.

Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 2. MIT Press Cambridge, MA, 2006.

Yunus Saatc¸i. Scalable inference for structured Gaussian process models. PhD thesis, Citeseer, 2012.

Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 143–152. IEEE, 2006.

Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 1015–1022, USA, 2010. Omnipress. ISBN 978-1-60558-907-7. URL http://dl.acm. org/citation.cfm?id=3104322.3104451.

Charles J Stone et al. Additive regression and other nonpara- metric models. The annals of Statistics, 13(2):689–705, 1985.

W. C. M. van Beers and J. P. C. Kleijnen. Kriging inter- polation in simulation: a survey. In Proceedings of the 2004 Winter Simulation Conference, 2004., volume 1, page 121, Dec 2004. doi: 10.1109/WSC.2004.1371308.

Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale bayesian optimization in high-dimensional spaces. In AISTATS, 2017.

Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Feitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.

Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for regression. In Advances in neural information processing systems, pages 514–520, 1996.

Andrew Wilson and Ryan Adams. Gaussian process ker- nels for pattern discovery and extrapolation. In International Conference on Machine Learning, pages 1067– 1075, 2013.

Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured gaussian processes (kiss-gp). In International Conference on Machine Learning, pages 1775–1784, 2015.

Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.

Robert S Womersley. Efficient spherical designs with good geometric properties. In Contemporary Computational Mathematics-A Celebration of the 80th Birthday of Ian Sloan, pages 1243–1285. Springer, 2018.

A. Proofs

A.1. Proof of Proposition 1

Proof. Because  η1is drawn from an isotropic distribution, we can let  τ = [x, 0, 0, ..., 0]without loss of generality. Then, define the expected kernel value at  τas the expectation

image

Then, by the Law of Large Numbers, the empirical mean 1J�Jj=1 k(η⊤j τ)converges to the expectation kexpected(||τ||2)almost surely as  J → ∞.

A.2. Proof of Corollary 1

Proof. Let  r = ||τ||2for simplicity.

Suppose  η1 ∼ N(0, Id)m which implies  η11 ∼ N(0, 1). Then,

image

A.3. Proof of Corollary 2

Proof. Let  r = ||τ||2.

Suppose  η1 ∼ N(0, Id)which implies  η11 ∼ N(0, 1). Then,

image

A.4. Proof of Proposition 2

Proof. From Proposition 1, we have an empirical mean of 1-dimensional kernels approximating its mean. We can apply concentration inequalities. We choose Bernstein’s inequality to explain the effect of the projection variance on convergence:

image

Letting the right-hand size equal  δ > 0and solving for  ϵ, we have

image

for  b = −4/(3J) log(1/δ), c = −2v(τ)/J log(1/δ). Then,

image

Finally, to derive the uniform convergence bound, applying union bounds to equation 8 and following similar simplifi-cation steps leads to the bound

image

B. Supplementary Results

We show the regression performance of additional models from Section 5.1 in Table 1.

image

Table 1. Average RMSE on UCI regression data sets with 2 times standard deviation. For each data set, we bold the best model and models whose means are not statistically significantly different according to a 1-sided t-test against the best model. Models are described in Table B. RBF-ARD and IMQ-ARD are generally the best models and perform similarly. A single random projection is handily beaten by all additive random projection methods. Among 1-dimensional random projections, there are slight benefits to using a diverse projected additive (DPA) GP. There is a large benefit to applying length-scales before projection (-ARD). There is little to no performance loss using SKI. Finally, from RPA-GP-2 and RPA-GP-3, there is a benefit for adding more random projections and sub-kernels of higher-degrees. The last experiment, DPA-GP-ARD-SKI on gas was not completed in time, but we fully believe it will continue the pattern.

image

Table 2. Summary of each evaluated model in Table 1.

image

Figure 7. Top: RBF-ARD GP, GAM GP, RPA-GP, and DPA-GP-ARD average RMSE on the high dimensional CoEPrA data sets. DPA-GP-ARD is competitive with GAM on these data sets for the optimal number of projections. Middle: training negative marginal log likelihood for the same models. For each case, marginal likelihood greatly favors DPA-GP-ARD with few projections though the optimal DPA-GP-ARD model by marginal likelihood need not be the optimal model by RMSE. Bottom: RBF-ARD GP, GAM GP, RPA-GP, and DPA-GP-ARD average RMSE on Olivetti face orientation data set. RBF-ARD attributes all data to noise for n = 400, 1200, and DPA-GP-ARD achieves low error with less data than either GAM or RBF-ARD.

C. Tests on very high-dimensional data sets

Above in Figure 7 are the full set of plots corresponding to tests on the very high-dimensional data sets.


Designed for Accessibility and to further Open Science