Multi-class Gaussian Process Classification with Noisy Inputs

2020·Arxiv

Abstract

Abstract

It is a common practice in the supervised machine learning community to assume that the observed data are noise-free in the input attributes. Nevertheless, scenarios with input noise are common in real problems, as measurements are never perfectly accurate. If this input noise is not taken into account, a supervised machine learning method is expected to perform sub-optimally. In this paper, we focus on multi-class classification problems and use Gaussian processes (GPs) as the underlying classifier. Motivated by a dataset coming from the astrophysics domain, we hypothesize that the observed data may contain noise in the inputs. Therefore, we devise several multi-class GP classifiers that can account for input noise. Such classifiers can be efficiently trained using variational inference to approximate the posterior distribution of the latent variables of the model. Moreover, in some situations, the amount of noise can be known before-hand. If this is the case, it can be readily introduced in the proposed methods. This prior information is expected to lead to better performance results. We have evaluated the proposed methods by carrying out several experiments, involving synthetic and real data. These data include several datasets from the UCI repository, the MNIST dataset and a dataset coming from astrophysics. The results obtained show that, although the classification error is similar across methods, the predictive distribution of the proposed methods is better, in terms of the test log-likelihood, than the predictive distribution of a classifier based on GPs that ignores input noise.

1. Introduction

Multi-class classification problems involve predicting a class label y that can take values in a discrete set of C labels {1, . . . , C} with C > 2 (Murphy, 2012). For this task, one should use the information contained in the input attributes the dimensionality of the data. That is, we assume input attributes in the real line. In order to infer the relation between y and x it is assumed that one can use some training data in the form of . Multi-class classification problems arise in a huge variety of fields, from industry to science. Most of the times, however, it is common to have datasets whose inputs x are the result of experimental measurements. These measurements are unavoidably contaminated with noise, as a consequence of measurement error. Furthermore, in some situations, the errors in the measurement of the explaining attributes can be well determined. See (Barford, 1985) for further details. Incorporating this inductive bias, or prior knowledge about the particular characteristics of the inputs of the classification problem is expected to lead to better results when training a classifier to predict y given x. Conversely, ignoring errors or noise in the input measurements is expected to lead to sub-optimal results when this is indeed the case. In this research work we have empirically validated this hypothesis on several commonly used datasets extracted from the UCI repository (Bay et al., 2000), as well as on the well-known MNIST data set (LeCun et al., 1998). Even for these commonly used datasets in the machine learning community, we find better prediction results by assuming the presence (and learning the amount) of noise in some of the corresponding input variables, as we will show later.

While inference tasks on data with noisy attributes have been considered since long time in the context of regression —see for example, (Press et al., 2007), or more recently (Mchutchon and Rasmussen, 2011), in the context of Gaussian processes— the specific case of multi-class classification has received much less attention from the literature, with a few exceptions (S´aez et al., 2014). Taking into account the presence of noise in the input is, as we show below, potentially essential to better modeling the conditional distribution p(y|x) giving rise to the observed labels. Considering input noise is also expected to have an important impact on the classification of points that are not far from the decision boundaries, since those are regions of the input space in which the data is more susceptible of being misclassified (at least in the case of additive noise with finite variance). Needless to say, both improving the estimated underlying predictive distribution and the better confidence in the classification of difficult points are two desirable properties for real-world problems in generic science applications, as for example in the case of the medical domain or in astrophysics applications (Gal, 2016).

Gaussian Processes (GPs) are machine learning methods that are inherently specified within a Bayesian framework (Rasmussen and Williams, 2006). Therefore, they can deliver probabilistic outputs that allow to extract uncertainty in the predictions made. This uncertainty is specified in terms of a predictive distribution for the target variable and may arise both from the modeling of intrinsic noise and also due to the lack of observed data. GPs are also non-parametric models. Therefore, their expressiveness grows as the number of data points in the training set, of size N, increases. GPs are, however, expensive to train since their complexity is in ). Specifically, they require the inversion of a covariance matrix of size . These methods also suffer from the difficulty of requiring approximate techniques to compute the posterior distribution of the latent variables of the model in the case of classification tasks. This posterior distribution is required to compute the predictive distribution for the target variable. Nevertheless, in spite of this difficulties, GPs have been successfully used to address multi-class classification problems and have been shown to be competitive with other approaches such as support vector machines or neural networks (H.-C.Kim and Ghahramani, 2006; Hern´andez-Lobato et al., 2011; Villacampa-Calvo and Hern´andez-Lobato, 2017; Hensman et al., 2015c; Henao and Winther, 2012).

To alleviate the problems of scalability of GPs several approximations have been proposed in the literature. Among them, the most popular ones are based on using inducing points representations (Snelson and Ghahramani, 2006; Titsias, 2009b). These techniques consist in introducing a set of inducing points that are carefully chosen to approximate the full GP model. Specifically, the locations of the inducing points are optimized during training alongside with any other GP hyper-parameter, by maximizing a an estimate of the marginal likelihood (Villacampa-Calvo and Hern´andez-Lobato, 2017; Hensman et al., 2015c). The complexity of these approximations is in ), which is significantly better than . Importantly, sparse approximations based on inducing points can be combined with stochastic optimization techniques, which allow to scale GPs to very large datasets (Villacampa-Calvo and Hern´andez-Lobato, 2017; Hensman et al., 2015b). Nevertheless, in spite of this, to our knowledge, all current methods for multi-class GPs classification assume noiseless input attributes. In this paper we extend the framework of multi-class GP classification to account for this type of noise.

Our work is motivated by a concrete multi-class classification problem and from a dataset coming from astrophysics, dealing with measurements, by the Fermi-LAT instrument operated by NASA, of point-like sources of photons in the gamma ray energy range all over the sky (https://fermi.gsfc.nasa.gov/). The experimental measurements obtained from such a system are unavoidably contaminated with noise and, in practice, the level of noise in some of the explaining attributes is known and can be well determined. As it turns out, at present, a significant fraction of those point-like sources are not associated to or labeled as known astrophysical gamma rays sources as for example pulsars, blazars or quasars Ab- dollahi et al. (2019). It is thus of paramount importance for the physics related to those measurements to know whether those point-like sources belong to the standard astrophysical classes, or instead whether they are part of other more exotic kind of sources. As a study-case exercise, though, we train a fully supervised classifier using exclusively the sources which do have labels already. Further details of the dataset are given when presenting our experimental results. These show that a GP classifier which considers input noise can obtain better predictive distributions in terms of the test log-likelihood in this dataset.

To account for input noise in the context of multi-class GP classification, we describe several methods. A first approach is based on considering the actual noiseless input attributes (which we denote by x) as a latent variable, to then perform approximate Bayesian inference and compute their posterior distribution. The required computations are approximated using variational inference (Blei et al., 2017) combined with Monte Carlo methods and stochastic optimization. A second approach considers a first order Taylor expansion of the predictive mean of the GPs contained in the multi-class GP classifier (typically, one different GP per each potential class label), following (Mchutchon and Rasmussen, 2011). Under this linear approximation the input noise is simply translated into output noise, which is incorporated in the modeling process. The variance of the output noise is determined by the slope of the GP predictive mean, which can be obtained using automatic differentiation tools. Variational inference is also used in this second method to approximate the required computations. The two methods described to account for input noise in the context of multi-class GP classification are validated on several experiments, including synthetic data, datasets from the UCI repository, MNIST, and the dataset related to astrophysics that motivated this work described above. These experiments give empirical claims supporting our hypothesis that our methods can effectively deal with noise in the inputs. In particular, we have consistently observed that the predictive distribution of the methods proposed is better, in terms of the test log-likelihood, than the one of a standard multi-class GP clas-sifier that ignores input noise. The prediction error of the proposed method is, however, similar.

The rest of the manuscript is organized as follows: We first introduce the fundamentals about multi-class GP classification and sparse GPs in Section 2. Section 3 describes the proposed models and methods to account for input noise in the observed attributes. Related work about input noise in GPs and other machine learning methods is described in Section 4. Section 5 illustrates the empirical performance of the proposed methods. Finally, Section 6 gives the conclusions of this work.

2. Multi-class Gaussian Process Classiﬁcation

In this section we describe how Gaussian processes (GPs) can be used to address multi-class classification problems. We consider first a noiseless input setting. Next, in the following section, we describe how noisy inputs can be incorporated into the model. Assume a dataset consisting of N instances with the observed explaining attributes and the target class labels, where 2 is the number of classes. The task of interest is to make predictions about the label of a new instance given the observed data X and y.

Following the representation introduced by Kim and Ghahramani (2006) for multi-class classification with Gaussian processes, we assume that each class label has been obtained with the labeling rule:

where , are different latent functions, each one of them corresponding to a different class label. Therefore the class label has been obtained simply by considering the latent function with the largest value at the data point Under this labeling rule the likelihood of the value of each latent function at a training point is given by:

where Θ() is a Heaviside step function. Other likelihood functions such as the soft-max likelihood arise simply by considering and marginalizing Gumbell noise around the latent functions Maddison et al., 2014). Here we instead consider Gaussian noise around each , as described later on. To account for labeling errors, we consider that the actual class label associated to could have been flipped with probability to some other class label (Hern´andez-Lobato et al., 2011; Hensman et al., 2015c). Under this setting, the likelihood becomes:

In order to address multi-class classification with GPs, a GP prior is assumed for each latent function Rasmussen and Williams, 2006). That is, where ) is a covariance function, with hyper-parameters Popular examples of covariance functions include the squared exponential covariance function. Namely,

where ] is an indicator function and are the hyper-parameters. More precisely, is the amplitude parameter, are the length-scales and is the level of additive Gaussian noise around Rasmussen and Williams, 2006) for further details. In practice, the hyper-parameters will be different for each latent function ignored here their dependence on the latent function ) for the sake of readability.

In order to make predictions about the potential class label of a new data point one would like to compute the posterior distribution of This distribution summarizes which values of the latent functions are compatible with the observed data. The posterior distribution can be computed using Bayes’ rule as follows:

where ) is a multi-variate Gaussian distribution with zero mean and covariance matrix ). The denominator in the previous expression, , is just a normalization constant and is known as the marginal likelihood. It can be maximized to obtain the good values for the model hyper-parameters . Note that in this setting, we assume independence among the latent functions ), since the prior factorizes as

A practical problem, however, is that the non-Gaussian likelihood in (3) makes infeasible the exact computation of the marginal likelihood, so one has to make use of approximate inference methods to approximate the posterior in (5). One of the most widely used methods for approximate inference is variational inference (VI), which will be explained in detail in the following sections. See (Titsias, 2009b; Hensman et al., 2015b,c) for further details.

2.1 Sparse Gaussian Processes

A difficulty of the method described so far is that, even if the likelihood was Gaussian and exact inference was tractable, the cost of computing the posterior distribution would be in is the training set size. The reason for this cost is the need of inverting the prior covariance matrices Rasmussen and Williams, 2006) for further details. As a consequence, multi-class GPs classification would only be applicable on a dataset of at most a few thousand data points.

A popular and successful approach to reduce the previous cost, is to consider an approximation based on sparse Gaussian processes (Titsias, 2009a). Under a sparse setting, a set of M pseudo-inputs or inducing points is introduced associated to each latent function ). These points will lie in the same space as the training data. Namely, , and their locations will be specified during training, simply by maximizing an estimate of the marginal likelihood. Associated to these inducing points consider some inducing outputs ). The process values at each , are characterized by and can then be obtained from the predictive distribution of a GP as follows:

where matrix of covariances of ) between the process values at the observed data points X and the inducing points . Similarly, covariance matrix. Each entry in this matrix contains the covariances among the process values at the inducing points . Under the sparse approximation, the prior for each simply the Gaussian process prior. Namely,

Importantly, now one only has to invert the matrices and compute the product . Therefore, the training cost will be in ), which is significantly better if

In practice, the inducing point values will be unknown and they are treated as latent variables of the model. An approximate Gaussian posterior distribution will be specified for them. Namely, . This uncertainty about be readily introduced in (6) simply by marginalizing these latent variables. The result is a Gaussian distribution for with extra variance due to the randomness of

where

In the following sections we describe how to compute the parameters of each approximate distribution , using variational inference.

3. Multi-class GP Classiﬁcation with Input Noise

In this section we describe the proposed approaches for dealing with noisy inputs in the context of multi-class GP classification. Let us consider that ˜X is the matrix of noisy observations in which the data patterns are contaminated with additive Gaussian noise with some mean and some variance. Again, consider that X is the matrix of noiseless inputs. That is:

where ˜is a particular observation and diagonal matrix. That is, we assume independent additive Gaussian noise for the inputs. For the moment, we consider that the variance of the additive noise is known beforehand. Later on, we describe how this parameter can be inferred from the training data. Recall that we assume input attributes in the real line.

3.1 Modeling the Input Noise using Latent Variables

A first approach for taking into account noisy inputs ˜X in the context of GP multi-class classification is based on making approximate Bayesian inference about the actual noiseless inputs Importantly, these variables will be latent. The observed variables will be the ones contaminated with Gaussian noise ˜. With this goal, note that the assumption made in (11) about the generation of ˜provides a likelihood function for the actual observation. That is,

In order to make inference about the noiseless observation , we need to specify a prior distribution for that variable. In practice, however, the actual prior distribution for specific of each classification problem and unknown, in general. Therefore, we set the prior for to be a multi-variate Gaussian with a broad variance. Namely,

where I is the identity matrix and s is chosen to have a large value, i.e., s = 1, 000, in order to make it similar to a non-informative uniform distribution. This prior has shown good results in our experiments.

3.1.1 Joint and Posterior Distribution

The first step towards making inference about is to describe the joint distribution of all the variables of the model (observed and latent). This distribution is given by

where ˜is the matrix of noisy observations, matrix of actual noiseless inputs, y is the vector of observed labels, matrix of process values at the actual noiseless inputs (instead of at each ˜is the matrix of process values at the inducing points.

The posterior distribution of the latent variables, i.e., X, F and U is obtained using Bayes’ rule:

Again, as in the case of standard multi-class classification where there is noise in the inputs, computing this posterior distribution is intractable and approximate inference will be required. To approximate this distribution we employ variational inference, as described in the next section.

3.1.2 Approximate Inference using Variational Inference

We will use variational inference as the approximate inference method Jordan et al. (1999). The posterior approximation that will target the exact posterior (15) is specified to be:

where

with a diagonal matrix. This posterior approximation assumes independence among the different GPs of the model and the actual noiseless inputs X.

To enforce that q looks similar to the target posterior distribution, variational inference minimizes the Kullback-Leibler divergence between q and the exact posterior p, given by the distribution in (15). This is done indirectly by maximizing the evidence lower bound L. See (Jordan et al., 1999) for further details. The evidence lower bound (ELBO) is given by

where KL() is the Kullback-Leibler divergence and where we have used the fact that the factors of the form ) described in (6) and present in both the joint distribution and in q will cancel. One problem that arises when computing the previous expression is that the first expectation in (18), )], does not have a closed form solution. It can, however, be computed using a one dimensional quadrature and Monte Carlo methods combined with the reparametrization trick (Hensman et al., 2015a; Kingma and Welling, 2014). Concerning the other factors, the second expectation, )], is the expecta- tion of the logarithm of a Gaussian distribution, so it can be computed in a closed form. We can also evaluate analytically the Kullback Leibler divergences in the lower bound, )), since they involve Gaussian distributions. Importantly, the ELBO in (18) is expressed as a sum across the observed data points. This means that this objective is suitable for being optimized using mini-batches. The required gradients can be obtained using automatic differentiation techniques such as those implemented in frameworks such as Tensorflow (Abadi et al., 2015). Appendix A describes the details about how to obtain an unbiased noisy estimate of the ELBO, L. One last remark is that it is possible to show that

where KL[q||p] is the Kullback-Leibler (KL) divergence between q and the target posterior distribution p in (15) (Jordan et al., 1999). After maximizing the ELBO, L, it is expected that the KL term is fairly small and hence log . Therefore, L can be maximized to find good values for the model hyper-parameters. This is expected to maximize log ), which will correspond to a type-II maximum likelihood approach (Rasmussen and Williams, 2006). The locations of the inducing points , are found by maximizing L, an estimate of the marginal likelihood, as in (Villacampa-Calvo and Hern´andez-Lobato, 2017; Hensman et al., 2015c). Note, however, that these correspond to parameters of the posterior approximation q, defined in (16), and not of the described probabilistic model (Titsias, 2009a).

3.1.3 Predictions

After the maximization of the lower bound (18), the approximate distribution q is fitted to the actual posterior. The predictive distribution for the class label of a new instance can be approximated by replacing the exact posterior by the posterior approximation in the exact predictive distribution. Namely,

where ) is the posterior distribution of the actual attributes of the new instance given the observed attributes ˜. This posterior is the normalized product of the prior times the likelihood. Therefore, it can be computed in closed form and is a Gaussian. Namely,

where ), and where is a diagonal matrix with the variances of the Gaussian noise around

In general, the integral in (20) is intractable. However, we can generate samples of simply by drawing from ) to then compute a Monte Carlo approximation:

where S is the number of samples, generated

The only remaining thing is how to compute the integral in the right hand side of (22). It turns out that the integral with respect to each can be computed analytically using (8). The integral with respect to can be approximated using a one-dimensional quadrature. In particular, under the likelihood function in (3), the approximation of the predictive distribution becomes:

where ) is the cumulative probability of a standard Gaussian distribution and

for the variance of the matrix of covariances between the values of the covariances of ) among the inducing points . Note that the integral over ) has no closed form solution but it can be computed using one-dimensional quadrature.

3.2 Amortized Approximate Inference

A limitation of the method described is that the approximate posterior distribution over the noiseless inputs, ), demands storing in memory a number of parameters that is in O(N), where N is the number of observed instances. Of course, in the big data regime, i.e., when N is in the order of thousands or even millions the memory resources needed to store those parameters can be too high. To alleviate this problem, we propose to use amortized approximate inference to reduce the number of parameters that need to be store in memory (Kingma and Welling, 2014; Shu et al., 2018).

Amortized variational inference assumes that the parameters of each distribution i.e., the mean and the diagonal covariance matrix, can be obtained simply as a non-linear function that depends on the observed data instance (˜). This non-linear function is set to be a neural network whose parameters are optimized during training. That is,

where both ) are obtained as the output of a neural network with parameters (we use a one-hot encoding for the label Therefore, one only has to store in memory the neural network which has a fixed number of parameters. This number of parameters does not depend on N. The neural network can be adjusted simply by maximizing w.r.t the evidence lower bound described in (18). The computational cost of the method is not changed, since the cost for the feed-forward pass of the network to obtain ) is constant.

Amortized approximate inference introduces the inductive bias that points that are located in similar regions of the input space should have similar parameters in the corresponding posterior approximation ). Of course, this has the benefit property of reducing the number of parameters of the approximate distribution q. A second advantage is, however, that the neural network can provide a beneficial regularization that is eventually translated into better generalization results (Shu et al., 2018). More precisely, our experiments show that amortized variational inference sometimes provides better results than using an approximate distribution q that has separate parameters per each data point ˜

Besides using this neural network to compute the parameters of ), the model is not changed significantly. Prediction is done in the same way, and hyper-parameter optimization is also carried out by maximizing the evidence lower bound.

3.3 First Order Approximation

In this section we describe an alternative method to account for input noise in the context of multi-class classification with GPs. This method is inspired on work already done for regression problems (Mchutchon and Rasmussen, 2011). Consider the relation between the noisy and the noiseless input measurements given by (11). Now, consider a Taylor expansion of a latent function ) around the noiseless measurement

where ˜is the noisy observation. Since we do not have access to the noiseless input vector , we simply approximate it with the noisy one ˜. Note that this last expression involves the derivatives of the GP. Although they can be showed to be again GPs (see (Rasmussen and Williams, 2006) for further details), we here decide to approximate these derivatives with the derivatives of the mean of GP, as in (Mchutchon and Rasmussen, 2011). This approximation corresponds to ignoring the uncertainty about the derivative. Let denote the d-dimensional vector corresponding to the derivative of the mean of the GP with respect to each input dimension at ˜. If we expand the right hand side of (27) up to the first order terms, we get a linear model. Namely,

Therefore, the input noise can be understood as output noise whose variance is proportional to square of the derivative of the mean value of

The model just described can be combined with the framework for sparse GPs described in Section 2.1 to give an alternative posterior predictive distribution for described in (8). That is,

with

where are the parameters of ), the covariance matrices are evaluated at the noisy measurements ˜X and

Therefore, is a diagonal matrix whose entries account for the extra output noise that results from the corresponding input noise. This makes sense, since the input noise is expected to have a small effect in those regions of the input space in which the latent function is expected to be constant. Importantly, in the sparse setting

This partial derivative can be easily obtained automatically in modern frameworks for implementing multi-class GP classifiers such as Tensorflow (Abadi et al., 2015). The expression in (29) can replace (8) in a standard sparse multi-class GP classifier to account for input noise. One only has to provide the corresponding input noise variances Approximate inference in such a model can be carried out using variational inference, as described in (Hensman et al., 2015c).

Figure 1 shows the predictive distribution of the model described, which we refer to as (NIMGP), for each latent function in a three class toy classification problem described in Section 5.1. The figure on the left shows the predictive distribution obtained when the extra term in the predictive variance that depends on the slope of the mean is ignored in (29). The figure on the right shows the resulting predictive distribution when that term is taken into account. We can observe that the produced effect is to increase the variance of the predictive distribution by some amount that is proportional to the squared value of slope of the predictive mean. This is particularly noticeable in the case of the latent function corresponding to class number 2. A bigger variance in the predictive distribution for the latent function will correspond to less confidence in the predictive distribution for the class label. See Section 5.1 for further details.

Figure 1 also shows the learned locations of the inducing points for each latent function (displayed at the bottom of each image). We observe that they tend to be placed uniformly in the case of the latent functions corresponding to class labels 0 and 1. However, in the case of the latent function corresponding to class label 2, they concentrate in specific regions of the input space. Namely, in those regions in which the latent function changes abruptly.

Figure 1: For the NIMGPmodel, the predictive mean GP and variances (right) and original variances (left), for each latent function, for a classification problem with three classes shown in red, green and blue. The learned locations of the inducing points, , are shown in a row at the bottom of each figure. Best seen in color.

3.4 Learning the Level of Noise in the Inputs

The previous sections assumed that the variance of Gaussian noise associated to each input dimension, i.e., the diagonal matrix , is known before-hand. This is the case of many practical problems in which the error associated to the measurement instrument that is used to obtain the observed attributes ˜is well-known. However, in some certain situations it can be the case that the variance of the error is unknown. In this case, it may still be possible to infer this level of error from the observed data.

Typically, in this case, one will assume that the level of noise is the same across all observed data instances. That is, . This has the advantage of reducing the number of parameters that have to be inferred. To estimate V one can simply treat this parameter as a hyper-parameter of the model. Its value can be estimated simply by type-II maximum likelihood, as any other hyper-parameter of the GP (Rasmussen and Williams, 2006). Under this setting, one simply maximizes the marginal likelihood of the model, i.e., the denominator in Bayes’ theorem w.r.t the parameter of interest. This is precisely the approach followed in (Mchutchon and Rasmussen, 2011) for regression problems.

Because evaluating the marginal likelihood is infeasible in the models described so far, one has to use an approximation. This approximation can be the evidence lower bound described in (18), which will be similar to the marginal likelihood if the approximate distribution q is an accurate posterior approximation. The maximization of (18) w.r.t V can be simply done with no extra computational cost using again stochastic optimization techniques.

In general, we will assume that is known for each data instance. If that is not the case, we will infer that level of noise using the method described in this section.

3.5 Summary of the Proposed Methods to Deal with Input Noise

Below we briefly the different methods described to deal with input noise in the context of multi-class GP classification:

• NIMGP: This is the method described in Section 3.1 which uses latent variables to model the original noiseless inputs. It relies on a Gaussian distribution for the actual observation with the mean being a random variable representing the noiseless input. We assume a non-informative Gaussian prior for the noiseless input. The joint posterior distribution that involves non-gaussian likelihood factors for the process values is approximated using variational inference. Quadrature and Monte Carlo methods are combined with the reparametrization trick to obtain a noisy estimate of the lower bound which is optimized using stochastic methods.

• A limitation of the previous method is the need of storing parameters for each data instance. This is a disadvantage in big data applications. To circumvent this problem, amortized approximate inference is employed in this method, as explained in Section 3.2. We use a neural network to compute the parameters of posterior approximation for the noiseless attributes of each data instance. This network reduces the number of parameters of the approximate distribution and also regularizes the model. Approximate inference is carried out as in NIMGP.

• : This is the method described in Section 3.3. In this method we adapt for classification with sparse GPs an already proposed method for regression that accounts for noisy inputs using GPs. This method is based on propagating the noise found in the inputs to the variance of the GPs predictive distribution. For this, a local linear approximation of the GP is used at each input location. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP predictive mean.

4. Related Work

In the literature there are some works dealing with GPs and input points corrupted with noise in the context of regression problems (Mchutchon and Rasmussen, 2011). For this, a local linear approximation of the GP at each input point is performed. When this is done, the input noise is translated into output noise proportional to the squared gradient of the GP posterior mean. Therefore, the mentioned paper simplifies the problem of modeling input noise by assuming that the input measurements are deterministic and by inflating the corresponding output variance to compensate for the extra noise. When this operation is performed, it leads to output noise variance varying across the input space. This property is defined as heteroscedasticity. The model presented in the that paper can hence be described as a heteroscedastic GP model. In our work we have extended the approach of those authors to address multi-class classification problems since they only considered regression problems. Furthermore, we have compared such a method (NIMGP) with another approach that uses latent variables to model the noiseless inputs (NIMGP) and that, in principle, does not rely on a linear approximation of the GP. We hypothesize that NIMGP circumvents the bias of the linear approximation and give some empirical evidence supporting this claim. Note that we also consider sparse GPs instead of standard GPs, which make our approach more scalable to datasets with a larger number of instances. These differences are common between our proposed methods and most of the techniques described in this section.

As described in the previous paragraph, one of the proposed methods can be understood as a GP model in which the level of output noise depends on the input location, i.e., the data can be considered to be heteroscedastic. Several works have tried to address such a problem in the context of GPs and regression tasks. In particular, Goldberg et al. (1998) introduce a second GP to deal with the output noise level as a function of the input location. This approach uses a Markov chain Monte Carlo method to approximate the posterior noise variance, which is time-consuming. An interesting extension to the described work tries to circumvent the limitations found in the method of Goldberg et al. (1998) by replacing the Monte Carlo method with an approximative most likely noise approach (Kersting et al., 2007). This other work learns both the hidden noise variances and the kernel parameters simultaneously. Furthermore, it significantly reduces the computational cost. Again, the domain of application is different from ours since only regression problems are addressed.

Other related work concerning heteroscedasticity in the context of regression problems is the one of L´azaro-Gredilla and Titsias (2011). This approach also relies on variational inference for approximate inference in the context of GPs. These authors explicitly take into account the input noise, and model it with GP priors. By using an exponential transformation of the noise process, they specify the variance of the output noise. Exact inference in the heteroscedastic GP is intractable and the computations need to be approximated using variational inference. Variational inference in such a model has equivalent cost to an analytically tractable homoscedastic GP. Importantly, this work focuses exclusively on regression and ignores classification tasks. Furthermore, no GP sparse approximation is considered by these authors. Therefore, the problems addressed cannot contain more than a few thousand instances.

Copula processes are another alternative to deal with input noise in the context of regression tasks and GPs (Wilson and Ghahramani, 2010). In this case, approximate inference is carried out using the Laplace approximation and Markov chain Monte Carlo methods. There also exists in the literature an approach for online heteroscedastic GP regression that tackles the incorporation of new measurements in constant runtime and makes the computation cheaper by considering online sparse GPs (Bijl et al., 2017). This approach has proven to be effective in a practical applications considering, e.g., system identification.

The work of Mchutchon and Rasmussen (2011) has been employed in several practical applications involving machine learning regression problems. For example, in a problem concerning driving assistant systems (Armand et al., 2013), where the velocity profile that the driver follows is modeled as the vehicle decelerates towards a stop intersection. Another example application can be found in the context of Bayesian optimization (Nogueira et al., 2016; Oliveira et al., 2017), where a GP models an objective function that is being optimized. In this scenario, the input space is contaminated with i.i.d Gaussian noise. Two real applications are considered: safe robot grasping and safe navigation under localization uncertainty (Nogueira et al., 2016; Oliveira et al., 2017).

Another approach that considers input noise in the context of regression problems in arbitrary models is that of B´ocsi and Csat´o (2013). This approach corrects the bias caused by the integration of the noise. The correction is proportional to the Hessian of the learned model and to the variance of the input noise. The advantage of the method is that it works for arbitrary regression models and the disadvantage is that it does not improve prediction for high-dimensional problems, where the data are implicitly scarce, and the estimated Hessian is considerably flattened.

Input dependent noise has also been taken into account in the context of binary classi-fication with GPs by Hern´andez-lobato et al. (2014). In particular, these authors describe the use of an extra GP to model the variance of additive Gaussian noise around a latent function that is used for binary classification. This latent function is modeled again using a GP. Importantly, in this work the level of noise is expected to depend on privileged information. These are extra input attributes that are only available at training time, but not at test time. The goal is to exploit that privileged information to obtain a better classifier during the training phase. Approximate inference is done in this case using expectation propagation instead of variational inference (Minka, 2001). The experiments carried out show that privileged information is indeed useful to obtain better classifiers based on GPs.

To tackle input noise in the context multi-class classification, when one does does not rely on the use of GPs, some decomposition strategies can be used. Specifically, it is possible to decompose the problem into several binary classification subproblems, reducing the complexity and, hence, dividing the effects caused by the noise into each of the subproblems (S´aez et al., 2014). There exist several of these decomposition strategies, being the one-vs-one scheme a method that can be applied for well known binary classification algorithms. The results obtained by these authors show that the one-vs-one decomposition leads to better performances and more robust classifiers than other decompositions. A problem of these decompositions, however, is that they can lead to ambiguous regions in which it is not clear what class label should be predicted (Bishop, 2006). Furthermore, they do not learn an underlying true multi-class classifier and rely on binary classifiers to solve the multi-class problem, which is expected to be sub-optimal.

Random input attributes that follow a Gaussian distribution have been considered in the context of GP regression in (Girard et al., 2003). That work addresses the problem of learning in such a setting by using a Gaussian approximation that matches the mean and the variance of the GP predictive distribution when the input attributes are noisy. The required computations are tractable for some particular covariance functions such as the squared exponential. In that work, however, no inference has been made about the noiseless inputs. The approach described has been applied in the context of solving control problems using GPs (Deisenroth and Rasmussen, 2011) and in the context of deep Gaussian processes in which the input to a hidden layer is the output of the previous layer, which is random (Bui et al., 2016). An alternative approximation based on Kalman filters that is potentially more accurate has also been considered in the literature in the context of sequential state estimation (Ko et al., 2007) and in the context of GPs with arbitrary non-linear likelihood functions (Steinberg and Bonilla, 2014).

Other approaches to deal with input noise involve using robust features in the context of multi-class SVMs (Rabaoui et al., 2008) or enhancements of fuzzy models (Ge and Wang, 2007). The first work can be understood as a pre-processing step in which robust features are generated and used for training with the goal of reducing the effect of background noise in a sound recognition system. These robust features are, however, specific of the application domain conspired. Namely, sounds recognition. They are not expected to be useful in other classification problems. The second work focuses exclusively on linear regression models and hence cannot address multi-class classification problems, as the ones considered in our work.

5. Experiments

In this section we carry out several experiments to evaluate the performance of the proposed method for multi-class GP classification with input noise. More precisely, we compare the performance of a standard multi-class Gaussian process that does not consider noise in the inputs (MGP) and the three proposed methods. Namely, the approach described in Section 3.1 (NIMGP), the variant described in Section 3.2 where the parameters of the Gaussian posterior approximation are computed using a neural network (NIMGPthe method proposed in Section 3.3 (NIMGP), which is based on the work of Mchutchon and Rasmussen (2011), and uses a first order approximation to account for input noise. The experiments considered include both in synthetic and real data. All the experiments carried out (except those related to the MNIST dataset) involve 100 repetitions and we report average results. These are detailed in the following sections.

All the methods described have been implemented in Tensorflow (Abadi et al., 2015). The source code to reproduce all the experiments carried out is available online at https: //github.com/cvillacampa/GPInputNoise. In these experiments, for each GP, we have employed a squared exponential covariance function with automatic relevance determination (Rasmussen and Williams, 2006). All hyper-parameters, including the GP amplitude parameter, the length-scales and the level of additive Gaussian noise have been tuned by maximizing the ELBO. The class noise level has been set equal to 10, and kept fixed during training as in (Hensman et al., 2015c). Unless indicated differently, we have set the number of inducing points for the sparse gaussian process to the minimum of 100 and 5% of the total number of points. For the optimization of the ELBO we have used the ADAM optimizer with learning rate equal to 0.01, the number of epochs has been set to 750 and the mini-batch size to 50 (Kingma and Ba, 2015). This number of epochs seems to guarantee the convergence of the optimization process. All other ADAM parameters have been set equal to their default value. In NIMGPthe neural network has 50 hidden units and one hidden layer. The activation function is set to be ReLu. Finally, the number of Monte Carlo samples used to approximate the predictive distribution in NIMGP and NIMGPis set to 300.

5.1 Illustrative Toy Problem

Before performing a fairly complete study on more realistic examples, we show here the results of the three proposed methods on a one-dimensional synthetic dataset with three classes. This dataset is simple enough so that in can be analyzed in detail and the optimal predictive distribution provided by the Bayes classifier can be computed in closed-form. The dataset is generated by sampling the latent functions from the GP prior using specific hyper-parameter values and then applying the labeling rule in (1). The input locations have been generated randomly by drawing from a uniform distribution in the [3] interval. Then, we add a Gaussian noise to each observation to generate ˜, with standard deviation We consider 1, 000 training instances and the number of inducing points M = 100. The mini-batch size is set to 200 points. Since in this experiment the variance of the input noise is known beforehand we do not infer its value from the observed data and rather specify its actual value in each method. The number of test points is 1, 000. In NIMGPthe neural network is set to have 2 hidden layers with 50 units each.

We have trained each method on this dataset and compared the resulting predictive distribution with that of the optimal Bayes classifier, which we can compute in closed form since we know the generating process of the labels. Figure 2 shows, for each method, as a function of the input value x, the predicted distribution for each of the three classesobserved labels for each data point are shown at the top of each figure as small vertical bars in green, blue and red colors, depending on the class label. We observe that each method produces decision boundaries that agree with the optimal ones (i.e., all methods predict the class label that has the largest probability according to the optimal Bayes classifier). However, the predictive distributions produced differ from one method to another. More precisely, the first method, i.e., MGP (top-left), which ignores the input noise, does produce a predictive distribution that is significantly different from the optimal one, especially in

Figure 2: Comparison of the predictive distribution computed by each method and the optimal Bayes prediction (shown as a dashed line). The observed labels are shown as small vertical bars at the top of the plot in green, blue and red. (Top-left) Model without input noise (MGP), (top-right) NIMGP model, (bottom-left) NIMGPand (bottom-right) NIMGP. While the decision boundaries of each method agree with the optimal ones, NIMGP is the method that more accurately computes the predictive probabilities of each class label. By contrast MGP produces predictions that are too confident.

regions of the input space that are close to the decision boundaries. This method produces a predictive distribution that is too confident. The closest predictive distribution to the optimal one is obtained by NIMGP (top-right), followed by NIMGP(bottom-left). Finally, NIMGP(bottom-right) seems to improve the results of MGP, but is far from NIMGP.

Figure 2 shows very clearly the advantage of including the input noise when estimating the predictive distribution p(c|x) for each class label c = 1, 2, 3, given a new test point x. Indeed, the proposed models reproduce more closely the optimal predictive distribution of the Bayes classifier. By contrast, the model that ignores the input noise, i.e., MGP fails to produce an accurate predictive distribution. Therefore, we expect to obtained better results for the proposed methods in terms of the log-likelihood of the test labels.

Table 1 shows the prediction error of each method and the corresponding negative test log-likelihood (NLL). Standard deviations are estimated using the bootstrap. Importantly, NLL can be understood as a measure of the quality of the predictive distribution. The smaller the better. We observe that while the prediction error of each method is similar (since the decision boundaries produced are similar) the NLL of the proposed methods is significantly better than the one provided by MGP, the method that ignores input noise. These results highlight the importance of accurately modeling the input noise to obtain better predictive distributions, which can play a critical role when one is interested in the confidence on the decisions to be made in terms of the classifier output.

Table 1: Test error and negative test log-likelihood (NLL) for the one-dimensional toy ex- periment.

5.2 Synthetic Experiments

Next, we compare the methods on 100 synthetic two-dimensional classification problems with three classes. As in the case of the one-dimensional dataset described above, these problems are generated by sampling the latent functions from a GP prior with the squared exponential function and then applying the labeling rule in (1). The GP hyper-parameters employed are . The input vectors are chosen uniformly in the box [. Then, we add three different levels of random noise to each observation . The interest of these experiments is to evaluate the proposed methods in a controlled setting in which the expected optimal model to explain the observed data is a multi-class GP classifier and we can control the level of input noise in the data. In the next section we carry out experiments with real datasets. The number of training and test instances, inducing points, mini-batch size and parameters of the ADAM optimizer are the same as in the previous experiment. Again, since the level of injected noise is known in these experiments, we directly codify this information in each method. Figure 3 shows a sample dataset before and after the noise injection.

The average results obtained on the 100 datasets, for each method, are displayed in Tables 2 and 3. We observe that in terms of the negative test log-likelihood, all the proposed methods improve over MGP, i.e., the standard GP multi-class classifier that ignores input noise. Among the proposed methods, the best performing one is NIMGP, closely followed by NIMGP. Therefore, these experiments highlight the benefits of using a neural network to compute the parameters of the posterior approximation ). Specifically, there is no performance degradation. The method that is based on the first order approximation, , also improves over MGP, but the differences are smaller. In terms of the prediction error the differences are much smaller. However, in spite of this, the proposed

Figure 3: Sample synthetic classification problem with three class labels. (left) Input data and associated class label before the noise injection in the observed attributes. (right) Input data and associated class labels after the noise injection. The variance of the Gaussian noise is set equal to 0.1. Best seen in color.

methods improve the results of MGP. Among them, again NIMGP is the best method, except when . In that case, NIMGPperforms best, but the differences are very small. Summing up, the results obtained agree with the ones obtained in the one-dimensional problem. Namely, the proposed approaches significantly improve the quality of the predictive distribution in terms of the neg. test log-likelihood. The prediction error is, however, similar.

Table 2: Average Neg. Test Log Likelihood for each method on the synthetic problems.

Table 3: Average Test Error for each method on the synthetic problems.

5.3 Experiments on Datasets Extracted from the UCI Repository

Another set of experiments evaluates the proposed methods on 8 different multi-class datasets extracted from the UCI repository (Dua and Graff, 2017). Table 4 displays the characteristics of the datasets considered. For each of these datasets, we consider 100 splits into train and test, containing 90% and 10% of the data respectively. Unlike in the previous experiments, in these datasets, a GP multi-class classifier need not be optimal. Furthermore, the input attributes may already be contaminated with additive noise. To assess the benefits of considering such noise during the learning process, we have considered four different setups. In each one, we inject Gaussian noise in the observed attributes with different variances. Namely, 0.0, 0.1, 0.25 and 0.5. This will allow to evaluate the different methods in a setting in which there may be or may be not input noise due to the particular characteristics of the problem addressed (i.e. when the variance of the injected noise is equal to 0.0), and also for increasing levels of input noise (i.e., when the variance of the injected noise is equal to 0.1, 0.25 and 0.5). This noise injection process is done after a standardization step in which the observed attributes are normalized to have zero mean and unit variance in the training set. All methods are trained for 1, 000 epochs using a mini-batch size of 50. In these experiments, in each of the proposed methods, the level of noise is learned during the training process by maximizing the ELBO. The reason for this is that the actual level of noise in the input attributes need not be equal to the injected level of noise.

Table 4: Characteristics of the datasets extracted from the UCI repository.

Table 5 shows the average results obtained for each method in terms of the negative test log-likelihood, for each level of noise considered. We also report the mean rank for each method across datasets and splits. If a method is always the best one, it will receive a mean rank equal to 1. Conversely, if a method is always worst, it will receive a mean rank equal to 4. Therefore, in general lower is better. We can see that on average, the proposed methods improve over MGP and the method that works the best (according to the mean rank) is NIMGP, even for the case where we do not introduce noise in the inputs. This suggests that these datasets have already some noise in the inputs. Also, as we increase the noise level the mean rank for MGP and NIMGP is worsen and NIMGPboth improve. The fact that NIMGPgive better results than NIMGP as we increase the level of noise indicates that NIMGP may be over-fitting the training data and that the use of the neural network and the first order approximation may act as a regularizer that alleviates this (Shu et al., 2018).

Table 6 shows the average results obtained for each method in terms of the prediction error. In this case, we do not observe big differences among the different methods. Moreover, the methods that take into account the noise in the inputs do not improve the prediction error as we increase the noise level. These small differences in terms of the error can be due to each method having similar decision boundaries, even when we obtain better predictive distributions in terms of the test log-likelihood, as illustrated in Section 5.1.

Table 5: Average neg. test log likelihood for the experiments on the UCI datasets.

Table 6: Average Test Error for experiments on UCI datasets

5.4 Experiments on the MNIST Dataset

In this section we consider a dataset in which sparse GPs are needed in order to train a multi-class classifier based on GPs. Namely, the MNIST dataset (LeCun et al., 1998). This dataset has 10 different class labels and 60, 000 training instances lying in a 28 dimensional space. The test set has 10, 000 data instances. We consider a similar setup to the previous experiments and inject noise in the inputs (after a standardization to guarantee zero mean and unit standard deviation on the input attributes) with variances equal to 0.0, 0.1, 0.25 and 0.5. The mini-batch size is set to 200 and the number of training epochs is set to 350. This number of epochs seems to be large enough to guarantee the convergence of the different methods evaluated. In the case of NIMGPthe neural network has 250 units and two hidden layers. We also slightly modified the neural network so that at the beginning of the training process it outputs as the mean the noisy observed attributes ˜fed at the input. The number of Monte Carlo samples used to approximate the predictive distribution in NIMGP and NIMGPis set to 500. We use a bigger number of Monte Carlo samples in these experiments because of the bigger size of the classification problem (10 classes) and the input dimensionality, i.e., 784 dimensions. All the computations are sped-up by using a TESLA P100 GPU for training.

The results obtained in these experiments, for each method, are displayed in Table 7. The results obtained are similar to others reported in the literature (Villacampa-Calvo and Hern´andez-Lobato, 2017; Hensman et al., 2015c). We observe that the proposed methods always outperform the MGP, i.e., the multi-class GP that ignores the input noise. Both in terms of the test error and the negative test log-likelihood. In this case, however, the gains are small. We believe this is because this dataset is particularly challenging for GPs using the squared exponential covariance function (Van der Wilk et al., 2017). Table 8 shows the average training time employed on each epoch for each method. We observe that MGP, NIMGP and NIMGPhave similar training times. However, NIMGPsignificantly larger amount of training time on each epoch. This is an unexpected result probably due to the over-head of having to compute gradients of the GP predictive mean, for each latent function. Recall that the input dimensionality of this dataset is high (784 dimensions) and also the number of class labels (10 class labels).

Table 7: Average test error and neg. test log-likelihood (NLL) of each method on the MNIST dataset.

5.5 Experiments on a Dataset Coming from Astrophysics

In this last experimental section we describe the results obtained by each method on a three class dataset coming from the astrophysics domain. Importantly, in this data the errors in

Table 8: Average time per epoch, in seconds, for each method on the MNIST dataset.

some of the observed inputs are available at training time. Therefore, it is suited to be analyzed using the methods proposed in our paper. As briefly commented in the introduction, the dataset consists of a series of attributes measured for a set of point-like astrophysical sources located all over the sky which have already been identified (distinguished from the diffuse background of photon emission) by the Fermi-LAT collaboration. Such catalogue of sources is fully public and can be downloaded from (Collaboration, 2019), while a detailed description can be found in (Abdollahi et al., 2019).

Figure 4: The astrophysics dataset considered in this work, showing two of the six attributes included in the input with their associated error bars, for the three classes of sources.

Among the available attributes of the sources, there is the position in the sky, the flux of photons in different energy bins, the significance of detection (according to a specified test statistics), the variability of the flux over some period of time, characteristics of the energy spectrum, and many others. Among these, we have only taken into account the following attributes: 1) the photon flux between 1 GeV and 100 GeV, 2) the detection significance (in number of sigmas), 3) the curvature significance , 4) the pivot energy , 5) the index and 6) the index . Of these attributes, the flux, the index and the index with associated estimated error bars, whereas the rest, by definition, do not. The choice of these attributes among all the available ones is motivated from a physics point of view, where it has been shown (See (Abdollahi et al., 2019)) that some combinations of them offer a promising discrimination power among the classes of point-like sources considered. Concerning the latter, the catalogue offers as well several classes of sources. Here, for simplicity, we have chosen three classes: pulsars (in the catalogue, psr), blazars (bll) and quasars (fsrq), which are also the most abundant classes among all the catalogue. The final dataset, after preprocessing and preselectioncontain 454 sources (points), of which 184 are pulsars, 168 are blazars and 102 are quasars. Figure 4 shows two of the six attributes included in the input of the models alongside with the corresponding error bars. From that figure, one can already observe a good separability of the classes, which improves as the other attributes are taken into account. In general, the pulsars class has larger values of with respect to the other two classes, whereas blazars are separable from quasars in the direction, but also in the pivot energy direction.

As in the previous experiments, we have estimated the prediction performance of each method on this dataset. For this, we have generated 100 splits of the data into training and test partitions with 90% and 10% of the instances, respectively. We have evaluated the test error rate and negative test log-likelihood on each partition, for each method. The average results obtained are displayed in Table 9. As expected, we obtain a significant improvement in the test log-likelihood when using the proposed methods, which improve on the baseline MGP, which does not take into account the input noise. On the other hand, and consistently as well with the previous experiments, the test error rate is similar for all methods, which is an indication that the decision boundaries among the classes are well captured already by the MGP model.

Table 9: Average test error and neg. test log-likelihood (NLL) of each method on the astrophysics dataset.

6. Conclusions

Multi-class classification problems involve estimating a predictive distribution for the class label given the observed data attributes. Multi-class Gaussian process classifiers are kernel machines that can be used to address these problems with the benefit that they will take into account uncertainty in the estimation process. Often, the supervised machine learning community assumes that the observed data are noise-free in the explaining inputs. Notwithstanding, in some scenarios the measurement of the explaining variables is contaminated with noise. Therefore, input noise is often common in many problems of interest. If this input noise is not modeled correctly, the quality of the resulting predictive distribution can be sub-optimal.

In this paper we have proposed several multi-class GP classifiers that can account for input noise. All these classifiers can be efficiently trained using variational inference to approximate the posterior distribution of the latent variables of the model. They also allow to specify manually, or to infer from the observed data, the level of input noise. Two approaches are based on introducing extra latent variables in the model to account for noisy inputs. One method is, however, based on a linear approximation of the GPs of the classifier. Under this approximation input noise is directly translated into output noise.

The inductive bias described is expected to lead to better performance results in practical datasets. To show this, we have evaluated the proposed methods on several experiments, involving synthetic and real data. These data include several datasets from the UCI repository, the MNIST data set and also a dataset coming from astrophysics. We have compared the results of the proposed methods with those of a standard multi-class GP classifier that ignores input noise. The experiments show that the predictive distribution of the proposed methods is significantly better in terms of the test log-likelihood. The classification error is, however, similar. This means that the decision boundaries of the classifier are not significantly changed. Only the class prediction probabilities.

We have also evaluated the proposed methods in a real dataset involving astrophysics data with success, both in terms of error rate and negative test log-likelihood. These experiments add empirical evidence to the hypothesis that if we model input noise the results of multi-class classification using GPs can be enhanced. Summing up, our results indicate that if one is interested in obtaining accurate predictive distributions, it is of vital importance to take into account any potential input noise that has contaminated the explaining variables of the multi-class classification problem.

We believe that the reason for the prediction error being similar is due to the fact that it only depends on the decision boundaries. These boundaries are fully determined by the class label with the largest posterior probability, as estimated by the corresponding method. Therefore, an accurate decision boundary estimation does not require an accurate class posterior probability estimation. However, if one is concerned about the quality of the predictive distribution and the uncertainty in the predictions made by the method, an accurate class posterior probability estimation is strictly required. The proposed methods significantly improve the quality of class posterior probability estimation, while providing similar prediction errors.

Acknowledgements

We would like to thank M. A. S´anchez-Conde, J. Coronado and V. Gammaldi for pointing our attention to the dataset that motivated this work, as well as for the discussions concerning the data extraction. We thank as well E. Fern´andez-Mart´ınez, A. Su´arez and C. M. Ala´ız-Gud´ın for useful discussions and feedback about the work. BZ especially acknowledges the hospitality of the Machine Learning group of UAM during the development of this project. BZ is supported by the Programa Atracci´on de Talento de la Comunidad de Madrid under grant n. 2017-T2/TIC-5455, from the Spanish MINECOs Centro de Excelencia Severo Ochoa Programme via grant SEV-2016-0597, and from the Comunidad de Madrid project SI1-PJI-2019-00294, of which BZ is the P.I. The authors gratefully acknowledge the use of the facilities of Centro de Computaci´on Cient´ıfica (CCC) at Universidad Aut´onoma de Madrid. The authors also acknowledge financial support from Spanish Plan Nacional I+D+i, grants TIN2016-76406-P and TEC2016-81900-REDT.

Appendix A. Stochastic Approximation of the Lower Bound

In this section we describe how to compute an stochastic estimate of the ELBO, described in (18). The stochasticity of the approximation arises from (i) using mini-batches of data to approximate the data-dependent term, and from (ii) approximating the corresponding expectations using Monte Carlo sampling. For this, we use the fact that the approximate distribution q is reparametrizable in the sense that it allows to separate, in each random sample, the dependence on the distribution parameters and the randomness.

Consider the first term in the ELBO and a mini-batch of data instances B. Then,

results in an un-biased estimate of the first term in (18).

Consider now the term )]. The expectation with respect to the q distribution can be computed analytically in the case of the random variables U. For this, we only have to use (8). Furthermore, the expectation with respect to the posterior distribution of can be approximated accurately using a one-dimensional quadrature. In particular,

where Φ() is the cumulative probability of a standard Gaussian distribution and

for the variance of the covariance vector between ) evaluated at the covariance matrix among the values of . All these values and matrices can be easily computed given the corresponding covariance functions are the mean and covariance parameters of ), respectively. See (17) for further details.

It remains now to approximate the expectation of )] with respect to This is done by using a Monte Carlo approximation combined with the reparametrization trick (Kingma and Welling, 2014). More precisely, we generate a single sample from by separating the randomness and the dependence on the parameters of

where is a diagonal matrix whose entries contain the square root of the diagonal entries of is the mean of ) for further details. Let us define ˆ(

where the right hand side is given by (35) in which we have replaced . The consequence is that

where the right hand side of is an unbiased estimate of the left hand side. The second term in (18) can be approximated using a mini-batch and the corresponding expectation can be computed analytically. In particular,

The third term in (18) is the Kullback-Leibler divergence between Gaussian distributions. This is given by,

where are the parameters of is the covariance matrix of

The fourth term in (18) can be approximated using a mini-batch and the corresponding Kullback-Leibler divergence can be computed analytically. In particular,

Note that all these estimates are unbiased. In practice, the stochastic estimate of the lower bound can be easily codified in a framework such as Tensorflow (Abadi et al., 2015), and the corresponding unbiased estimate of the gradient can be computed using automatic differentiation.

Bibliography

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y.Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Man´e, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vi´egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

S. Abdollahi et al. Fermi Large Area Telescope Fourth Source Catalog. 2019.

A. Armand, D. Filliat, and J. Ibanez-Guzman. Modelling stop intersection approaches using Gaussian processes. In IEEE Conference on Intelligent Transportation Systems, pages 1650–1655, 2013.

N. C. Barford. Experimental measurements: precision, error and truth. Wiley, 1985.

S. D. Bay, D. F. Kibler, M. J. Pazzani, and P. Smyth. The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD explorations, 2:81–85, 2000.

H. Bijl, T. B. Sch¨on, J.-W. van Wingerden, and M. Verhaegen. System identification through online sparse Gaussian process regression with input noise. IFAC Journal of Systems and Control, 2:1–11, 2017.

C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statis- ticians. Journal of the American Statistical Association, 112:859–877, 2017.

B. Attila B´ocsi and L. Csat´o. Hessian corrected input noise models. In International Conference on Artificial Neural Networks, pages 1–8, 2013.

T. D. Bui, D. Hern´andez-Lobato, J. M. Hernandez-Lobato, Y. Li, and R. E. Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International conference on machine learning, pages 1472–1481, 2016.

Fermi-LAT Collaboration. 4th fgl catalogue. Available at https: // heasarc. gsfc. nasa. gov/ W3Browse/ fermi/ fermilpsc. html , 2019.

. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465–472, 2011.

D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics. uci.edu/ml.

Y. Gal. Uncertainty in deep learning. PhD thesis, PhD thesis, University of Cambridge, 2016.

H.-W. Ge and S.-T. Wang. Dependency between degree of fit and input noise in fuzzy linear regression using non-symmetric fuzzy triangular coefficients. Fuzzy Sets and Systems, 158: 2189–2202, 2007.

A. Girard, C. E. Rasmussen, J. Qui˜nonero Candela, and R. Murray-Smith. Gaussian process priors with uncertain inputs application to multiple-step ahead time series forecasting. In Advances in neural information processing systems, pages 545–552, 2003.

P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Advances in neural information processing systems, pages 493–499, 1998.

H.-C.Kim and Z. Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:1948– 1959, 2006.

R. Henao and O. Winther. Predictive active set selection methods for Gaussian processes. Neurocomputing, 80:10–18, 2012.

J. Hensman, A. G. de G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for varia- tionally sparse Gaussian processes. pages 1648–1656, 2015a.

J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015b.

J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015c.

D. Hern´andez-Lobato, J. M. Hern´andez-Lobato, and P. Dupont. Robust multi-class Gaus- sian process classification. In Advances in Neural Information Processing Systems 24, pages 280–288, 2011.

D. Hern´andez-lobato, V. Sharmanska, K. Kersting, C. H. Lampert, and N. Quadrianto. Mind the nuisance: Gaussian process classification using privileged noise. In Advances in Neural Information Processing Systems, pages 837–845. 2014.

M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, (2):183–233, 1999.

K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most likely heteroscedastic Gaussian process regression. In International conference on Machine learning, pages 393–400, 2007.

H.-C. Kim and Z. Ghahramani. Bayesian Gaussian process classification with the EMEP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28: 1948–1959, 2006.

D. P. Kingma and J. Ba. ADAM: a method for stochastic optimization. In Inrernational Conference on Learning Representations, pages 1–15, 2015.

D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.

J. Ko, D. J. Kleint, D. Fox, and D. Haehnelt. Gp-ukf: Unscented kalman filters with Gaussian process prediction and observation models. In International Conference on Intelligent Robots and Systems, pages 1901–1907, 2007.

M. L´azaro-Gredilla and M. K. Titsias. Variational heteroscedastic Gaussian process regres- sion. pages 841–848, 2011.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to docu- ment recognition. Proceedings of the IEEE, 86:2278–2324, 1998.

C. J. Maddison, D. Tarlow, and T. Minka. sampling. In Advances in Neural Information Processing Systems, pages 3086–3094. 2014.

A. Mchutchon and C. E. Rasmussen. Gaussian process training with input noise. In Advances in Neural Information Processing Systems 24, pages 1341–1349, 2011.

T. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence, pages 362–36, 2001.

K. Murphy. Machine Learning: a Probabilistic Perspective. The MIT Press, 2012.

J. Nogueira, R. Martinez-Cantin, A. Bernardino, and L. Jamone. Unscented Bayesian optimization for safe robot grasping. In International Conference on Intelligent Robots and Systems, pages 1967–1972, 2016.

R. Oliveira, L. Ott, V. Guizilini, and F. Ramos. Bayesian optimisation for safe navigation under localisation uncertainty. arXiv preprint arXiv:1709.02169, 2017.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical recipes: The art of scientific computing (3rd ed.). Cambridge University Press, 2007.

A. Rabaoui, H. Kadri, Z. Lachiri, and N. Ellouze. Using robust features with multi-class SVMs to classify noisy sounds. In International Symposium on Communications, Control and Signal Processing, pages 594–599, 2008.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.

J. A. S´aez, M. Galar, J. Luengo, and F. Herrera. Analyzing the presence of noise in multi- class problems: alleviating its influence with the one-vs-one decomposition. Knowledge and information systems, 38(1):179–206, 2014.

R. Shu, H. H Bui, S. Zhao, M. J. Kochenderfer, and S. Ermon. Amortized inference regularization. In Advances in Neural Information Processing Systems, pages 4393–4402, 2018.

E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264, 2006.

D. M. Steinberg and E. V. Bonilla. Extended and unscented Gaussian processes. In Advances in Neural Information Processing Systems, pages 1251–1259, 2014.

M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009a.

M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009b.

M. Van der Wilk, C. E. Rasmussen, and J. Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems, pages 2849–2858, 2017.

C. Villacampa-Calvo and D. Hern´andez-Lobato. Scalable multi-class Gaussian process classification using expectation propagation. In International Conference on Machine Learning-Volume, pages 3550–3559, 2017.

A. G. Wilson and Z. Ghahramani. Copula processes. In Advances in Neural Information Processing Systems, pages 2460–2468, 2010.