Improving Output Uncertainty Estimation and Generalization in Deep Learning via Neural Network Gaussian Processes

2017·Arxiv

Abstract

We propose a simple method that combines neural networks and Gaussian processes. The proposed method can estimate the uncertainty of outputs and flexibly adjust target functions where training data exist, which are advantages of Gaussian processes. The proposed method can also achieve high generalization performance for unseen input configurations, which is an advantage of neural networks. With the proposed method, neural networks are used for the mean functions of Gaussian processes. We present a scalable stochastic inference procedure, where sparse Gaussian processes are inferred by stochastic variational inference, and the parameters of neural networks and kernels are estimated by stochastic gradient descent methods, simultaneously. We use two real-world spatio-temporal data sets to demonstrate experimentally that the proposed method achieves better uncertainty estimation and generalization performance than neural networks and Gaussian processes.

1 Introduction

Neural networks (NNs) have achieved state-of-the-art results in a wide variety of supervised learning tasks, such as image recognition [25, 39, 37], speech recognition [38, 17, 10] and machine translation [3, 9]. However, NNs have a major drawback in that output uncertainty is not well estimated. NNs give point estimates of outputs at test inputs.

Estimating the uncertainty of the output is important in various situations. First, the uncertainty can be used for rejecting the results. In real-world applications such as medical diagnosis, we should avoid automatic decision making for difficult examples, and instead ask human experts or conduct other examinations to achieve high reliability. Second, the uncertainty can be used to calculate risk. In some domains, it is important to be able to estimate the probability of critical issues occurring, for example with self-driving cars or nuclear power plant systems. Third, the uncertainty can be used as input to other machine learning tasks. For example, the uncertainty of speech recognition results helps to improve machine translation performance in automatic speech translation systems [32]. The uncertainty would also be helpful for active learning [24] and reinforcement learning [7].

We propose a simple method that makes it possible for NNs to estimate output uncertainty. With the proposed method, NNs are used for the mean functions of Gaussian processes (GPs) [36]. GPs are used as prior distributions over smooth nonlinear functions, and the uncertainty of the output can be estimated with Bayesian inference. GPs perform well in various regression and classification tasks [43, 4, 30, 33].

Figure 1: True values (red), mean function values (green) and prediction values (blue) provided by GPs with zero mean functions (a) and GPs with nonzero mean functions (b). The blue area is the 95% confidence interval of the prediction, and the red points indicate the training samples.

Combining NNs and GPs gives us another advantage. GPs exploit local generalization, where generalization is achieved by local interpolation between neighbors [5]. Therefore, GPs can adjust target functions rapidly in the presence of training data, but fail to generalize in regions where there are no training data. On the other hand, NNs have good generalization capability for unseen input configurations by learning multiple levels of distributed representations, but require a huge amount of training data. Since GPs and NNs achieve generalization in different ways, the proposed method can improve generalization performance by adopting both of their advantages.

Zero mean functions are usually used since GPs with zero mean functions and some specific kernels can approximate an arbitrary continuous function given enough training data [29]. However, GPs with zero mean functions predict zero outputs far from training samples. Figure 1(a) shows the predictions of GPs with zero mean functions and RBF kernels. When trained with two samples, the prediction values are close to the true values if there are training samples, but far from the true values if there are none. On the other hand, when GPs with appropriate nonzero mean functions are used as in Figure 1(b), the prediction approximates the true values even when there are no training samples. Figure 1 shows that GPs rapidly adjust the prediction when there are training data regardless of the mean function values.
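The behavior described above can be reproduced with a minimal sketch of exact GP regression. The target function, the imperfect mean function, the unit-variance RBF kernel and the noise level below are illustrative assumptions, not the setting used for Figure 1.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Unit-variance RBF kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, mean_fn, noise_var=1e-4):
    """Exact GP posterior mean and variance for a given prior mean function."""
    K = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = rbf(x_test, x_train)
    # The GP only has to model the residual left over by the mean function.
    residual = y_train - mean_fn(x_train)
    mean = mean_fn(x_test) + K_star @ np.linalg.solve(K, residual)
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    return mean, var

def true_fn(x):
    """Hypothetical target function."""
    return 0.5 * x + 1.0

def zero_mean(x):
    return np.zeros_like(x)

def rough_mean(x):
    """Hypothetical imperfect mean function (stands in for a trained NN)."""
    return 0.5 * x + 0.8

x_train = np.array([-1.0, 1.0])          # two training samples
y_train = true_fn(x_train)
x_test = np.array([1.0, 10.0])           # one point at the data, one far away

mean_zero, var_zero = gp_predict(x_train, y_train, x_test, zero_mean)
mean_nonzero, var_nonzero = gp_predict(x_train, y_train, x_test, rough_mean)

print("true values:      ", true_fn(x_test))
print("zero mean GP:     ", mean_zero, var_zero)
print("nonzero mean GP:  ", mean_nonzero, var_nonzero)
```

In this sketch both models match the training sample at x = 1, while at x = 10 the zero-mean GP falls back to zero and the nonzero-mean GP falls back to the mean function value.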

The proposed method gives NNs more flexibility via GPs. In general, the risk of overfitting increases as the model flexibility increases. However, since the proposed method is based on Bayesian inference, where nonlinear functions with GP priors are integrated out, the proposed method can help alleviate overfitting.

To retain the high generalization capability of NNs with the proposed method, a large amount of training data is required. The computational complexity of the exact inference of GPs is cubic in the number of training samples, which is prohibitive for large data. We present a scalable stochastic inference procedure for the proposed method, where sparse GPs are inferred by stochastic variational inference [15], and NN parameters and kernel parameters are estimated by stochastic gradient descent methods, simultaneously. By using stochastic optimization, the parameters are updated efficiently without analyzing all the data at each iteration, where a noisy estimate of the gradient of the objective function is used. The inference algorithm also enables us to handle massive data even when they cannot be stored in memory.

2 Related work

Bayesian NNs are the most common way of introducing uncertainty into NNs, where distributions over the NN parameters are inferred. A number of Bayesian NN methods have been proposed including Laplace approximation [28], Hamiltonian Monte Carlo [31], variational inference [18, 13, 7, 26, 41], expectation propagation [21], stochastic backpropagation [16], and dropout [23, 12] methods. Our proposed method gives the output uncertainty of NNs with a different approach, where we conduct point estimation for the NN parameters, but the NNs are combined with GPs. Therefore, the proposed method incorporates the high generalization performance of NNs and the high flexibility of GPs, and can handle large-scale data by using scalable NN stochastic optimization and GP stochastic variational inference.

Although zero mean functions are often used for GPs, nonzero mean functions, such as polynomial functions [6], have also been used. When the mean functions are linear in the parameters, the parameters can be integrated out, which leads to another GP [34]. However, scalable inference algorithms for GPs with flexible nonlinear mean functions like NNs have not been proposed.
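As a worked illustration of the linear case, the standard result can be written as follows. The basis functions h, the coefficients b and the prior moments b_0 and B are symbols introduced here for illustration only.

```latex
% Mean function linear in the parameters b, with a Gaussian prior on b:
f(\mathbf{x}) = \mathbf{h}(\mathbf{x})^\top \mathbf{b} + r(\mathbf{x}),
\qquad r \sim \mathcal{GP}\big(0,\, k(\mathbf{x}, \mathbf{x}')\big),
\qquad \mathbf{b} \sim \mathcal{N}(\mathbf{b}_0, \mathbf{B}).

% Integrating out b gives another GP with a modified mean and kernel:
f \sim \mathcal{GP}\big(\mathbf{h}(\mathbf{x})^\top \mathbf{b}_0,\;
k(\mathbf{x}, \mathbf{x}') + \mathbf{h}(\mathbf{x})^\top \mathbf{B}\, \mathbf{h}(\mathbf{x}')\big).
```

This closed form relies on linearity in b, so it is not available when the mean function is a NN.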

NNs and GPs are closely related. NNs with a hidden layer converge to GPs in the limit of an infinite number of hidden units [31]. A number of methods combining NNs and GPs have been proposed. Deep GPs [11] use GPs for each layer in NNs, where local generalization is exploited since their inputs are kernel values. GP regression networks [44] combine the structural properties of NNs with the nonparametric flexibility of GPs for accommodating input dependent signal and noise correlations. Manifold GPs [8] and deep NN based GPs [20] use NNs for transforming the input features of GPs. Deep kernel learning [45] uses NNs to learn kernels for GPs. The proposed method is different from these methods since it incorporates the outputs of NNs into GPs.

3 Proposed method

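Suppose that we have a set of input and output pairs, $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n$ is the $n$th input, and $y_n$ is its output. Output $y_n$ is assumed to be generated by a nonlinear function $f(\mathbf{x}_n)$ with Gaussian noise. Let $\mathbf{f} = (f_n)_{n=1}^{N}$ be the vector of function values on the observed inputs, $f_n = f(\mathbf{x}_n)$. Then, the probability of the output $\mathbf{y} = (y_n)_{n=1}^{N}$ is given by

$$p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \beta^{-1}\mathbf{I}), \tag{1}$$

where $\beta$ is the observation precision parameter. For the nonlinear function, we assume a GP model,

$$f(\mathbf{x}) \sim \mathcal{GP}\big(g(\mathbf{x}; \boldsymbol{\phi}),\, k(\mathbf{x}, \mathbf{x}'; \boldsymbol{\theta})\big), \tag{2}$$

where $g(\cdot\,; \boldsymbol{\phi})$ is the mean function with parameters $\boldsymbol{\phi}$, and $k(\cdot, \cdot\,; \boldsymbol{\theta})$ is the kernel function with kernel parameters $\boldsymbol{\theta}$. We use a NN for the mean function, and we call (2) NeuGaP, which is a simple and new model that fills a gap between the GP and NN literatures. By integrating out the nonlinear function f, the likelihood is given by

$$p(\mathbf{y} \mid \mathbf{X}) = \mathcal{N}(\mathbf{y} \mid \mathbf{g}, \mathbf{K}_{NN} + \beta^{-1}\mathbf{I}), \tag{3}$$

where $\mathbf{K}_{NN}$ is the $N \times N$ covariance matrix defined by the kernel function, $\mathbf{K}_{NN}(n, n') = k(\mathbf{x}_n, \mathbf{x}_{n'})$, and $\mathbf{g} = (g(\mathbf{x}_n))_{n=1}^{N}$ is the vector of the output values of the NN on the observed inputs. The parameters in GPs are usually estimated by maximizing the marginal likelihood (3). However, the exact inference is infeasible for large data since the computational complexity is $O(N^3)$ due to the inversion of the covariance matrix.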
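To make the cost of exact inference concrete, the following is a minimal sketch of the log marginal likelihood of a GP whose mean function is a small NN. It assumes a Gaussian marginal with mean given by the NN outputs and covariance given by the kernel matrix plus noise. The one-hidden-layer tanh network, the unit-variance RBF kernel, and all parameter values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def nn_mean(X, W1, b1, w2, b2):
    """Hypothetical one-hidden-layer tanh network used as the mean function."""
    return np.tanh(X @ W1 + b1) @ w2 + b2

def log_marginal_likelihood(X, y, mean, beta):
    """log N(y | mean, K_NN + I / beta), computed with a Cholesky factor."""
    N = len(y)
    C = rbf_kernel(X, X) + np.eye(N) / beta
    L = np.linalg.cholesky(C)             # the O(N^3) step
    alpha = np.linalg.solve(L, y - mean)  # L^{-1} (y - mean)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))  # equals -0.5 * log|C|
            - 0.5 * N * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)

# Randomly initialised NN parameters (five hidden units).
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
w2, b2 = rng.normal(size=5), 0.0
beta = 10.0

g = nn_mean(X, W1, b1, w2, b2)
lml = log_marginal_likelihood(X, y, g, beta)
print("log marginal likelihood:", lml)
```

The Cholesky factorization is the cubic-cost step that motivates the sparse approximation.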

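To reduce the computational complexity while keeping the desirable properties of GPs, we employ sparse GPs [40, 35, 15]. With a sparse GP, inducing inputs $\mathbf{Z} = (\mathbf{z}_m)_{m=1}^{M}$, and their outputs $\mathbf{u} = (u_m)_{m=1}^{M}$, are introduced. The basic idea behind sparse inducing point methods is that when the number of inducing points $M \ll N$, computation can be reduced to $O(M^2 N)$. The inducing outputs u are assumed to be generated by the nonlinear function of NeuGaP (2) taking the inducing inputs Z as inputs. By marginalizing out the nonlinear function, the probability of the inducing outputs is given by

$$p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{g}_M, \mathbf{K}_{MM}), \tag{4}$$

where $\mathbf{g}_M = (g(\mathbf{z}_m))_{m=1}^{M}$ is the vector of the NN output values on the inducing inputs, and $\mathbf{K}_{MM}$ is the $M \times M$ covariance matrix evaluated between all the inducing inputs, $\mathbf{K}_{MM}(m, m') = k(\mathbf{z}_m, \mathbf{z}_{m'})$. The output values at the observed inputs f are assumed to be conditionally independent of each other given the inducing outputs u, so that we have

$$p(\mathbf{f} \mid \mathbf{u}) = \prod_{n=1}^{N} p(f_n \mid \mathbf{u}) = \prod_{n=1}^{N} \mathcal{N}(f_n \mid \mu_n, \tilde{k}_n), \tag{5}$$

where

$$\mu_n = g(\mathbf{x}_n) + \mathbf{k}_{Mn}^\top \mathbf{K}_{MM}^{-1}(\mathbf{u} - \mathbf{g}_M), \qquad \tilde{k}_n = k(\mathbf{x}_n, \mathbf{x}_n) - \mathbf{k}_{Mn}^\top \mathbf{K}_{MM}^{-1} \mathbf{k}_{Mn}. \tag{6}$$

Here, $\mathbf{k}_{Mn}$ is the M-dimensional column vector of the covariance function evaluated between observed and inducing inputs, $\mathbf{k}_{Mn}(m) = k(\mathbf{x}_n, \mathbf{z}_m)$. Equation (5) is obtained in the same way as the derivation of the predictive mean and variance of test data points in standard GPs.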
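The conditional distribution of the function values given the inducing outputs can be sketched as below, assuming the usual GP conditioning formulas with a nonzero mean function. The linear stand-in for the NN mean, the unit-variance RBF kernel, the jitter term and the input locations are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def mean_fn(X):
    """Hypothetical stand-in for the NN mean function."""
    return X.sum(axis=1)

def sparse_conditional(X, Z, u, g_X, g_Z, jitter=1e-8):
    """Mean and variance of p(f_n | u) for every row of X."""
    K_MM = rbf_kernel(Z, Z) + jitter * np.eye(len(Z))
    K_MN = rbf_kernel(Z, X)                 # column n is k_Mn
    A = np.linalg.solve(K_MM, K_MN)         # column n is K_MM^{-1} k_Mn
    mean = g_X + A.T @ (u - g_Z)
    var = 1.0 - np.sum(K_MN * A, axis=0)    # k(x, x) = 1 for this kernel
    return mean, var

# Five inducing inputs and arbitrary inducing outputs.
Z = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.0]])
u = np.array([1.0, -0.5, 0.3, 2.0, 0.0])

# Two observed inputs that coincide with inducing inputs, and one far away.
X = np.vstack([Z[:2], [[50.0, 50.0]]])

mean, var = sparse_conditional(X, Z, u, mean_fn(X), mean_fn(Z))
print("conditional mean:", mean)
print("conditional var: ", var)
```

At inputs that coincide with inducing inputs the conditional mean reproduces the inducing outputs with near-zero variance, while far from all inducing inputs it reverts to the mean function with the prior variance.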

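The lower bound of the log marginal likelihood of the sparse GP to be maximized is

$$\log p(\mathbf{y}) \geq \mathbb{E}_{q(\mathbf{u})}\big[\log p(\mathbf{y} \mid \mathbf{u})\big] - \mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big), \tag{7}$$

where q(u) = N(m, S) is the variational distribution of the inducing points, and Jensen’s inequality is applied [42]. The log likelihood of the observed output y given the inducing points u is as follows,

$$\log p(\mathbf{y} \mid \mathbf{u}) = \log \mathbb{E}_{p(\mathbf{f} \mid \mathbf{u})}\big[p(\mathbf{y} \mid \mathbf{f})\big] \geq \mathbb{E}_{p(\mathbf{f} \mid \mathbf{u})}\big[\log p(\mathbf{y} \mid \mathbf{f})\big] = \sum_{n=1}^{N} \Big( \log \mathcal{N}(y_n \mid \mu_n, \beta^{-1}) - \frac{\beta}{2}\tilde{k}_n \Big), \tag{8}$$

where Jensen’s inequality is applied, $\mu_n$ and $\tilde{k}_n$ are the mean and variance of $f_n$ given u in (5), and the lower bound of log p(y|u) is decomposed into terms for each training sample. By using (7) and (8), the lower bound of log p(y) is given by

$$\log p(\mathbf{y}) \geq \sum_{n=1}^{N} \Big\{ \log \mathcal{N}\big(y_n \mid g(\mathbf{x}_n) + \mathbf{k}_{Mn}^\top \mathbf{K}_{MM}^{-1}(\mathbf{m} - \mathbf{g}_M),\, \beta^{-1}\big) - \frac{\beta}{2}\tilde{k}_n - \frac{\beta}{2}\mathrm{tr}(\mathbf{S}\boldsymbol{\Lambda}_n) \Big\} - \mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big) \equiv \mathcal{L}, \tag{9}$$

where

$$\boldsymbol{\Lambda}_n = \mathbf{K}_{MM}^{-1} \mathbf{k}_{Mn} \mathbf{k}_{Mn}^\top \mathbf{K}_{MM}^{-1}, \tag{10}$$

and KL(q(u)||p(u)) is the KL divergence between two Gaussians, which is calculated by

$$\mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big) = \frac{1}{2}\Big[ \log|\mathbf{K}_{MM}| - \log|\mathbf{S}| + \mathrm{tr}(\mathbf{K}_{MM}^{-1}\mathbf{S}) + (\mathbf{m} - \mathbf{g}_M)^\top \mathbf{K}_{MM}^{-1} (\mathbf{m} - \mathbf{g}_M) - M \Big]. \tag{11}$$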
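The bound can be evaluated numerically as in the sketch below. It assumes the standard sparse variational GP bound with a nonzero prior mean: a per-sample Gaussian log likelihood, two variance penalty terms, and the Gaussian KL divergence. The linear stand-in for the NN mean, the unit-variance RBF kernel, the jitter and the toy data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def mean_fn(X):
    """Hypothetical stand-in for the NN mean function."""
    return 0.5 * X[:, 0]

def lower_bound(X, y, Z, g_X, g_Z, m, S, beta, jitter=1e-8):
    """Variational lower bound on log p(y) for q(u) = N(m, S)."""
    M = len(Z)
    K_MM = rbf_kernel(Z, Z) + jitter * np.eye(M)
    K_MN = rbf_kernel(Z, X)
    A = np.linalg.solve(K_MM, K_MN)             # column n is K_MM^{-1} k_Mn
    mu = g_X + A.T @ (m - g_Z)                  # predictive mean under q
    k_tilde = 1.0 - np.sum(K_MN * A, axis=0)    # conditional variance
    trace_term = np.sum(A * (S @ A), axis=0)    # a_n^T S a_n for each n
    log_lik = 0.5 * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * (y - mu) ** 2
    data_term = np.sum(log_lik - 0.5 * beta * k_tilde - 0.5 * beta * trace_term)
    d = m - g_Z
    kl = 0.5 * (np.linalg.slogdet(K_MM)[1] - np.linalg.slogdet(S)[1]
                + np.trace(np.linalg.solve(K_MM, S))
                + d @ np.linalg.solve(K_MM, d) - M)
    return data_term - kl

def exact_log_marginal(X, y, g_X, beta):
    """Exact log N(y | g, K_NN + I / beta), for comparison on small data."""
    C = rbf_kernel(X, X) + np.eye(len(y)) / beta
    r = y - g_X
    return (-0.5 * r @ np.linalg.solve(C, r) - 0.5 * np.linalg.slogdet(C)[1]
            - 0.5 * len(y) * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(30, 1))
y = np.sin(2.0 * X[:, 0]) + mean_fn(X) + 0.2 * rng.normal(size=30)
Z = np.linspace(-2.0, 2.0, 5)[:, None]
beta = 25.0

# Start from q(u) equal to the prior p(u), so the KL term vanishes.
m0 = mean_fn(Z)
S0 = rbf_kernel(Z, Z) + 1e-8 * np.eye(len(Z))
bound = lower_bound(X, y, Z, mean_fn(X), mean_fn(Z), m0, S0, beta)
exact = exact_log_marginal(X, y, mean_fn(X), beta)
print("lower bound:", bound, " exact:", exact)
```

Because the data term is a sum over training samples, a minibatch sum rescaled by the ratio of data set size to minibatch size gives a noisy estimate of it.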

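The NN parameters and kernel parameters are updated efficiently by maximizing the lower bound (9) using stochastic gradient descent methods. The parameters in the variational distribution, m and S, are updated efficiently by using stochastic variational inference [19]. We alternately iterate the stochastic gradient descent and stochastic variational inference for each minibatch of training data.

With stochastic variational inference, the parameters of variational distributions are updated based on the natural gradients [1], which are computed by multiplying the gradients by the inverse of the Fisher information matrix. The natural gradients provide faster convergence than standard gradients by taking account of the information geometry of the parameters. In the exponential family, the natural gradients with respect to natural parameters correspond to the gradients with respect to expectation parameters [14]. The natural parameters of Gaussian N(m, S) are $\boldsymbol{\lambda}_1 = \mathbf{S}^{-1}\mathbf{m}$ and $\boldsymbol{\lambda}_2 = -\frac{1}{2}\mathbf{S}^{-1}$. Its expectation parameters are $\boldsymbol{\eta}_1 = \mathbf{m}$ and $\boldsymbol{\eta}_2 = \mathbf{m}\mathbf{m}^\top + \mathbf{S}$. We take a step in the natural gradient direction by employing

$$\boldsymbol{\lambda}^{(t+1)} = \boldsymbol{\lambda}^{(t)} + \ell_t\, \mathbf{G}(\boldsymbol{\lambda}^{(t)})^{-1} \frac{\partial \mathcal{L}}{\partial \boldsymbol{\lambda}} = \boldsymbol{\lambda}^{(t)} + \ell_t \frac{\partial \mathcal{L}}{\partial \boldsymbol{\eta}}, \tag{12}$$

where $\mathbf{G}(\boldsymbol{\lambda})^{-1}\, \partial \mathcal{L} / \partial \boldsymbol{\lambda}$ is the natural gradient of the objective function with respect to the natural parameter, $\mathbf{G}(\boldsymbol{\lambda})$ is the Fisher information, and $\ell_t$ is the step length at iteration t. The update rules for the proposed model are given by

$$\boldsymbol{\lambda}_1^{(t+1)} = \boldsymbol{\lambda}_1^{(t)} + \ell_t \Big( \beta N \mathbf{K}_{MM}^{-1} \mathbf{k}_{Mn} \big(y_n - g(\mathbf{x}_n)\big) + \mathbf{H}_n \mathbf{g}_M - \boldsymbol{\lambda}_1^{(t)} \Big), \qquad \boldsymbol{\lambda}_2^{(t+1)} = \boldsymbol{\lambda}_2^{(t)} + \ell_t \Big( -\frac{1}{2}\mathbf{H}_n - \boldsymbol{\lambda}_2^{(t)} \Big), \tag{13}$$

where $\mathbf{H}_n = \beta N \mathbf{K}_{MM}^{-1} \mathbf{k}_{Mn} \mathbf{k}_{Mn}^\top \mathbf{K}_{MM}^{-1} + \mathbf{K}_{MM}^{-1}$, n is the index of the sampled training sample, $\mathbf{K}_{MM}$ is the covariance matrix of the inducing inputs, $\mathbf{k}_{Mn}$ is the covariance vector between the nth input and the inducing inputs, and $\mathbf{g}_M$ is the vector of NN outputs on the inducing inputs. The factor N makes the single-sample term an unbiased estimate of the sum over all training samples in (9).

We can use minibatches instead of a single training sample to update the natural parameters.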
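A minibatch version of the natural gradient update can be sketched as follows. The update is derived here from the standard sparse variational GP bound with a nonzero prior mean, so it should be read as one consistent instantiation under that assumption rather than the authors' exact rule. The linear stand-in for the NN mean, the kernel, the constant step length and the toy data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def mean_fn(X):
    """Hypothetical stand-in for the NN mean function."""
    return 0.5 * X[:, 0]

def natural_gradient_step(lam1, lam2, X_b, y_b, g_b, Z, g_Z, beta, step,
                          scale, jitter=1e-8):
    """One step on the natural parameters (S^{-1} m, -S^{-1} / 2) of q(u).

    scale = N / (minibatch size) rescales the minibatch sum to the full data.
    """
    K_inv = np.linalg.inv(rbf_kernel(Z, Z) + jitter * np.eye(len(Z)))
    A = K_inv @ rbf_kernel(Z, X_b)            # column n is K_MM^{-1} k_Mn
    H = beta * scale * A @ A.T + K_inv
    target1 = beta * scale * A @ (y_b - g_b) + H @ g_Z
    lam1 = lam1 + step * (target1 - lam1)
    lam2 = lam2 + step * (-0.5 * H - lam2)
    return lam1, lam2

def to_mean_cov(lam1, lam2):
    """Convert natural parameters back to the mean m and covariance S."""
    S = np.linalg.inv(-2.0 * lam2)
    return S @ lam1, S

rng = np.random.default_rng(0)
N = 30
X = rng.uniform(-2.0, 2.0, size=(N, 1))
y = np.sin(2.0 * X[:, 0]) + mean_fn(X) + 0.2 * rng.normal(size=N)
Z = np.linspace(-2.0, 2.0, 5)[:, None]
g_X, g_Z, beta = mean_fn(X), mean_fn(Z), 25.0

# Stochastic updates from q(u) = N(0, I) with an arbitrary constant step.
lam1, lam2 = np.zeros(5), -0.5 * np.eye(5)
for t in range(50):
    idx = rng.choice(N, size=10, replace=False)
    lam1, lam2 = natural_gradient_step(lam1, lam2, X[idx], y[idx], g_X[idx],
                                       Z, g_Z, beta, step=0.1, scale=N / 10)
m_sgd, S_sgd = to_mean_cov(lam1, lam2)

# With the full batch and a unit step, one update reaches the optimum of the
# bound with respect to q(u) for fixed NN and kernel parameters.
lam1_opt, lam2_opt = natural_gradient_step(np.zeros(5), -0.5 * np.eye(5),
                                           X, y, g_X, Z, g_Z, beta,
                                           step=1.0, scale=1.0)
m_opt, S_opt = to_mean_cov(lam1_opt, lam2_opt)
print("stochastic m:", m_sgd)
print("optimal m:   ", m_opt)
```

Each update is a convex combination of the current natural parameters and a data-dependent target, so the covariance S stays positive definite for step lengths between zero and one.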

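The output distribution given test input $\mathbf{x}_*$ is calculated by

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \iint p(y_* \mid f_*)\, p(f_* \mid \mathbf{u})\, q(\mathbf{u})\, df_*\, d\mathbf{u} = \mathcal{N}\big(y_* \mid g(\mathbf{x}_*) + \mathbf{k}_{M*}^\top \mathbf{K}_{MM}^{-1}(\mathbf{m} - \mathbf{g}_M),\, \sigma_*^2\big),$$

where $\mathbf{k}_{M*}$ is the covariance function column vector evaluated between test input $\mathbf{x}_*$ and inducing inputs, $\mathbf{k}_{M*}(m) = k(\mathbf{x}_*, \mathbf{z}_m)$, and

$$\sigma_*^2 = \beta^{-1} + k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_{M*}^\top \mathbf{K}_{MM}^{-1} \mathbf{k}_{M*} + \mathbf{k}_{M*}^\top \mathbf{K}_{MM}^{-1} \mathbf{S}\, \mathbf{K}_{MM}^{-1} \mathbf{k}_{M*}.$$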
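Prediction at test inputs can be sketched as below, assuming a Gaussian predictive distribution obtained by combining the observation noise, the conditional of the function value given the inducing outputs, and q(u) = N(m, S). The linear stand-in for the NN mean, the unit-variance RBF kernel, and the values of m, S and the precision are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def mean_fn(X):
    """Hypothetical stand-in for the NN mean function."""
    return 0.5 * X[:, 0]

def predict(X_star, Z, g_star, g_Z, m, S, beta, jitter=1e-8):
    """Predictive mean and variance of y* at each row of X_star."""
    K_MM = rbf_kernel(Z, Z) + jitter * np.eye(len(Z))
    K_Ms = rbf_kernel(Z, X_star)
    A = np.linalg.solve(K_MM, K_Ms)            # column j is K_MM^{-1} k_M*
    mean = g_star + A.T @ (m - g_Z)
    var = (1.0 / beta                          # observation noise
           + 1.0 - np.sum(K_Ms * A, axis=0)    # conditional variance of f*
           + np.sum(A * (S @ A), axis=0))      # uncertainty about u
    return mean, var

Z = np.linspace(-2.0, 2.0, 5)[:, None]
g_Z = mean_fn(Z)
m = g_Z + np.array([0.1, -0.2, 0.3, 0.0, 0.1])   # arbitrary variational mean
S = 0.01 * np.eye(5)                             # arbitrary variational cov
beta = 25.0

X_star = np.array([[0.0], [100.0]])    # at an inducing input, and far away
mean, var = predict(X_star, Z, mean_fn(X_star), g_Z, m, S, beta)

# 95% interval under the Gaussian predictive distribution.
lower = mean - 1.96 * np.sqrt(var)
upper = mean + 1.96 * np.sqrt(var)
print("mean:", mean, "var:", var)
print("95% interval:", list(zip(lower, upper)))
```

Far from the inducing inputs the prediction reverts to the mean function with the prior variance plus noise, which is where the generalization of the NN takes over.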

4 Experiments

Data We evaluated our proposed method by using two real-world spatio-temporal data sets. The first data set is the Comprehensive Climate (CC) data set, which consists of monthly climate reports for North America [2, 27]. We used 19 variables for 1990, such as month, latitude, longitude, carbon dioxide and temperature, which were interpolated on a degree grid with 125 locations. The second data set is the U.S. Historical Climatology Network (USHCN) data set, which consists of monthly climate reports at 1218 locations in the U.S. for 1990. We used the following seven variables: month, latitude, longitude, elevation, precipitation, minimum temperature, and maximum temperature.

The task was to estimate the distribution of a variable given the values of the other variables as inputs; there were 19 tasks in CC data, and seven tasks in USHCN data. We evaluated the performance in terms of test log likelihoods. We also used mean squared errors to evaluate point estimate performance. We randomly selected some locations as test data. The remaining data points were randomly split into 90% training data and 10% validation data. With CC data, we used 20%, 50% and 80% of locations as test data, and their training data sizes were 1081, 657 and 271, respectively. With USHCN data, we used 50%, 90% and 95% of locations as test data, and their training data sizes were 6597, 1358 and 609, respectively.

Comparing Methods We compared the proposed method with GPs and NNs. The GPs were sparse GPs inferred by stochastic variational inference. The GPs correspond to the proposed method with a zero mean function. With the proposed method and GPs, we used RBF kernels for the kernel function and 100 inducing points. We set the step size for the stochastic variational inference as a function of the epoch t. With the NNs, we used three-layer feed-forward NNs with five hidden units, and we optimized the NN parameters and precision parameter by maximizing the Gaussian likelihood $\sum_{n=1}^{N} \log \mathcal{N}(y_n \mid g(\mathbf{x}_n; \boldsymbol{\phi}), \beta^{-1})$ using Adam [22]. The proposed method used NNs with the same structure for the mean function, where the NN parameters were first optimized by maximizing the likelihood, and then variational, kernel and NN parameters were estimated by maximizing the variational lower bound (9) using the stochastic variational inference and Adam. The locations of the inducing inputs were initialized by k-means results. For all the methods, we set the minibatch size at 64, and used early stopping based on the likelihood on a validation set.
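Initializing the inducing inputs with k-means can be sketched as below. The plain Lloyd iterations, the random initial centers, the iteration count and the synthetic inputs are assumptions for illustration; the text does not specify which k-means variant was used.

```python
import numpy as np

def kmeans_init(X, num_inducing, num_iters=20, seed=0):
    """Return num_inducing cluster centers of X to use as inducing inputs."""
    rng = np.random.default_rng(seed)
    # Start from distinct training inputs chosen at random.
    centers = X[rng.choice(len(X), size=num_inducing, replace=False)].copy()
    for _ in range(num_iters):
        # Assign every input to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members; keep empty ones fixed.
        for j in range(num_inducing):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))          # synthetic stand-in for training inputs
Z = kmeans_init(X, num_inducing=100)
print("inducing inputs:", Z.shape)
```

Placing the inducing inputs at cluster centers spreads them over the regions where training inputs are dense, which gives the subsequent optimization a reasonable starting point.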

Results Tables 1 and 2 show the test log likelihoods with different missing value rates with CC and USHCN data, respectively. The proposed method achieved the highest average likelihoods with both data sets. The NN performed poorly when many values were missing (Table 1, 80% missing). On the other hand, since a GP is a nonparametric Bayesian method, where the effective model complexity is automatically adjusted depending on the number of training samples, the GPs performed better than the NNs with the many missing value data. When the number of missing values was small (Table 1, 20% missing), the NN performed better than the GP. The proposed method achieved the best performance with different missing value rates by combining the advantages of NNs and GPs. Tables 3 and 4 show the test mean squared errors with different missing value rates. The proposed method achieved the lowest average errors with both data sets. This result indicates that combining NNs and GPs also helps to improve the generalization performance. Table 5 shows the computational time in seconds.

Figure 2 shows the prediction with its confidence interval obtained by the proposed method, GP and NN. The NN gives fixed confidence intervals at all test points, and some true values are located outside the confidence intervals. On the other hand, the proposed method flexibly changes confidence intervals depending on the test points. The confidence intervals with the GP differ across different test points as with the proposed method. However, they are wider than those of the proposed method, since the mean functions are fixed at zero.

Table 1: Test log likelihoods provided by the proposed method, GP, and NN with CC data. The bottom row shows the values averaged over all variables. Values in a bold typeface are statistically better (at the 5% level) than those in normal typeface as indicated by a paired t-test.

Table 2: Test log likelihoods provided by the proposed method, GP, and NN with USHCN data.

Table 3: Test mean squared error provided by the proposed method, GP, and NN with CC data.

Table 4: Test mean squared error provided by the proposed method, GP, and NN with USHCN data.

Table 5: Computational time for inference in seconds.

Figure 2: Prediction of held-out CO2 values using training data with 80% missing values at a test location with CC data. The horizontal axis is month, the vertical axis is CO2, the blue bar is the 95% confidence interval of the prediction, and the red point is the true value.

5 Conclusion

In this paper, we proposed a simple method for combining neural networks and Gaussian processes. With the proposed method, neural networks are used as the mean function of Gaussian processes. We presented a scalable learning procedure based on stochastic gradient descent and stochastic variational inference. With experiments using two real-world spatio-temporal data sets, we demonstrated that the proposed method achieved better uncertainty estimation and generalization performance than neural networks and Gaussian processes. There are several avenues that can be pursued as future work. In our experiments, we used feed-forward neural networks. We would like to use other types of neural networks, such as convolutional and recurrent neural networks. Moreover, we plan to analyze the sensitivity with respect to the structure of the neural networks, the number of inducing points and the choice of kernels. Finally, the mean function of neural networks could be inferred using Bayesian methods.

References

[1] S.-I. Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.

[2] M. T. Bahadori, Q. R. Yu, and Y. Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems, pages 3491–3499, 2014.

[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[4] D. Barber and C. K. Williams. Gaussian processes for Bayesian classification via hybrid Monte Carlo. In Advances in Neural Information Processing Systems, pages 340–346, 1997.

[5] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[6] B. Blight and L. Ott. A Bayesian approach to model inadequacy for polynomial regression. Biometrika, 62(1):79–88, 1975.

[7] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, pages 1613–1622, 2015.

[8] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian processes for regression. In Proceedings of the International Joint Conference on Neural Networks, pages 3338–3345. IEEE, 2016.

[9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.

[10] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20 (1):30–42, 2012.

[11] A. Damianou and N. Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.

[12] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 1050–1059, 2016.

[13] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.

[14] J. Hensman, M. Rattray, and N. D. Lawrence. Fast variational inference in the conjugate exponential family. In Advances in Neural Information Processing Systems, pages 2888–2896, 2012.

[15] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 282–290. AUAI Press, 2013.

[16] J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, pages 1861–1869, 2015.

[17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[18] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM, 1993.

[19] M. D. Hoffman, D. M. Blei, C. Wang, and J. W. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[20] W. Huang, D. Zhao, F. Sun, H. Liu, and E. Chang. Scalable Gaussian process regression using deep neural networks. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 3576–3582, 2015.

[21] P. Jylänki, A. Nummenmaa, and A. Vehtari. Expectation propagation for neural networks with sparsity-promoting priors. Journal of Machine Learning Research, 15(1):1849–1901, 2014.

[22] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[23] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[24] A. Krause and C. Guestrin. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In Proceedings of the 24th International Conference on Machine Learning, pages 449–456. ACM, 2007.

[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[26] C. Louizos and M. Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In Proceedings of the 33rd International Conference on Machine Learning, pages 1708–1716, 2016.

[27] A. C. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe. Spatial-temporal causal modeling for climate change attribution. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 587–596. ACM, 2009.

[28] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3): 448–472, 1992.

[29] C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec): 2651–2667, 2006.

[30] A. Naish-Guzman and S. B. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems, pages 1057–1064, 2007.

[31] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[32] H. Ney. Speech translation: Coupling of recognition and translation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520. IEEE, 1999.

[33] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(Oct):2035–2078, 2008.

[34] A. O’Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B, 1:1–42, 1978.

[35] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

[36] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[38] F. Seide, G. Li, and D. Yu. Conversational speech transcription using context-dependent deep neural networks. In Interspeech, pages 437–440, 2011.

[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[40] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, volume 18, pages 1257–1264, 2006.

[41] S. Sun, C. Chen, and L. Carin. Learning structured weight uncertainty in Bayesian neural networks. In Artificial Intelligence and Statistics, pages 1283–1292, 2017.

[42] M. K. Titsias. Variational learning of inducing variables in sparse gaussian processes. In AISTATS, volume 5, pages 567–574, 2009.

[43] C. K. Williams and C. E. Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520, 1996.

[44] A. G. Wilson, Z. Ghahramani, and D. A. Knowles. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning, pages 599–606, 2012.

[45] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 370–378, 2016.
