Quantile Regularization: Towards Implicit Calibration of Regression Models

2020·Arxiv

Abstract

Recent works have shown that most deep learning models are poorly calibrated, i.e., they may produce overconfident predictions that are wrong. It is therefore desirable to have models that produce reliable predictive uncertainty estimates. Several approaches have been proposed recently to calibrate classification models. However, there is relatively little work on calibrating regression models. We present a method for calibrating regression models based on a novel quantile regularizer defined as the cumulative KL divergence between two CDFs. Unlike most of the existing approaches for calibrating regression models, which are based on post-hoc processing of the model’s output and require an additional dataset, our method is trainable in an end-to-end fashion without requiring an additional dataset. The proposed regularizer can be used with any training objective for regression. We also show that post-hoc calibration methods like Isotonic Calibration sometimes compound miscalibration, whereas our method provides consistently better calibration. We provide empirical results demonstrating that the proposed quantile regularizer significantly improves calibration for regression models trained using approaches such as Dropout VI and Deep Ensembles.

1 Introduction

Calibration measures how well a model’s confidence in its predictions matches the correctness of those predictions. For example, a binary classifier is considered perfectly calibrated if, among all predictions with probability score 0.9, 90% are correct [1]. Likewise, consider a Bayesian regression model that produces credible intervals. In this setting, the model is considered perfectly calibrated if the 90% credible interval contains 90% of the test points [2]. Unfortunately, modern deep neural networks are known to be poorly calibrated [1].

While there has been a significant amount of recent work on calibrating classification models [1, 3], relatively little work exists on calibrating regression models. Recently, [2] proposed a post-hoc method for calibrating regression models. Their approach is inspired by Platt scaling [4], commonly used for calibrating classification models. However, post-hoc methods like [2] rely on the availability of large quantities of labeled i.i.d. data that is needed to achieve well-calibrated models.

In this work, we introduce quantile regularization, a method that can be trained in an end-to-end manner, unlike the post-hoc calibration methods that require large quantities of labeled data. The regularizer we propose is defined as the cumulative KL divergence between two CDFs. Moreover, our method is broadly applicable: it can be used in any regression model that produces a predictive mean and predictive variance, by augmenting its training objective with the proposed regularizer. Before describing our approach, we first provide a brief overview of calibration approaches proposed for classification and regression models.

1.1 Classification Calibration

The notion of calibration was first considered in the meteorology literature [5, 6, 7] and saw one of its first prominent uses in the machine learning literature in [4], in the context of support vector machines (SVMs), in order to obtain probabilistic predictions from SVMs, which are non-probabilistic models. There has been renewed interest in calibration, especially for classification models, after [1] showed that modern classification networks are not well-calibrated.

Currently there are three main notions of calibration for classification [8, 9, 10], and these are listed below. For the rest of this section, assume $X$, $Y$ to be random variables on spaces $\mathcal{X}$ and $\mathcal{Y} = \{1, 2, \dots, K\}$, $P$ to be their true joint distribution, and $g$ to be the model that outputs a probability distribution on $\mathcal{Y}$. Therefore, we can represent the model as $g : \mathcal{X} \to \Delta_K$, where $\Delta_K$ denotes the set of probability distributions on $\mathcal{Y}$. The three notions are as follows:

Most calibration methods [4, 11, 12, 1, 13, 10] for classification models are post-hoc: they learn a calibration mapping using an additional dataset to recalibrate an already trained model. Recent work has shown that some of these popular post-hoc methods are either themselves miscalibrated or sample-inefficient [8], and that they do not actually help the model output well-calibrated probabilities.

An alternative to post-hoc processing is to ensure that the model outputs well-calibrated probabilities during training itself. These are implicit calibration methods. Such an approach does not require an additional dataset to learn the calibration mapping. While almost all post-hoc calibration mechanisms can be seen as density estimation methods, existing implicit calibration methods are of various types. Several heuristics like Mixup [14, 15] and Label Smoothing [16, 17] that were part of high-performance deep networks for classification were later shown empirically to achieve calibration. [18] show that their optimization method improves calibration. [19] found that penalizing high-confidence predictions acts as a regularizer. A more principled way of achieving calibration is to minimize a loss function that is tailored for calibration [3]. This is similar in spirit to our proposed approach, which does so for regression models.

Figure 1: Calibration plot showing that Quantile Regularization brings the calibration curve closer to the identity function (the ideal line).

1.2 Regression Calibration

There has been relatively little work on regression calibration. Among the early approaches, [20] were the first to address this issue by proposing a framework for calibration. However, they do not provide any procedure to correct a miscalibrated model. Recently, [2] proposed Quantile Calibration, which intuitively says that the $p$ credible interval predicted by the model should contain the target variable with probability $p$. They also propose a post-hoc recalibration method based on isotonic regression [21], which is a well-known recalibration technique for classification models. Recently, [22] proposed a much stronger notion of calibration called distributional calibration, which guarantees that among all instances whose predicted PDF has mean $\mu$ and standard deviation $\sigma$, the actual distribution of the target variable has mean $\mu$ and standard deviation $\sigma$. This can be seen as the regression analog of joint calibration for classification (Sec. 1.1). They too propose a post-hoc recalibration method, based on Gaussian processes. Among other work, [23] consider a different setting where neural networks for classification are used for regression problems, and show that temperature scaling [24, 1] and their proposed method based on empirical prediction intervals improve calibration. Again, these are post-hoc methods.

1.3 Quantile Calibration and Isotonic Regression

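The notion of calibration that we consider in this work is quantile calibration. Isotonic regression is currently used for quantile calibration [2]. However, as discussed in Section 2.1, isotonic regression has the following disadvantages: it is prone to overfitting, it needs the training data (or a separate recalibration dataset) to fit the calibration mapping, and it results in non-smooth CDFs.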

Considering these shortcomings, we propose an end-to-end trainable loss function for quantile calibration. Our approach leverages a novel regularizer that is defined as a cumulative KL divergence (KL divergence of two CDFs). With our approach, the smoothness of the PDF/CDF is maintained for well-calibrated probabilities. Moreover, our approach eliminates the need for a separate calibration dataset. To the best of our knowledge, this is the first trainable loss function for any notion of calibration in regression setting.

The rest of the paper is organized as follows. Section 2 sets up the notation and background and presents the problem setting formally. In Section 3, we present our proposed method. Section 4 discusses the experimental analysis. In Section 5, we conclude and briefly discuss avenues for future work.

2 Background and Definitions

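Throughout the paper, $X$ and $Y$ denote random variables on spaces $\mathcal{X}$ and $\mathcal{Y} = \mathbb{R}$, $P$ denotes their true joint distribution, and $\{(x_i, y_i)\}_{i=1}^{N}$ denotes i.i.d. samples from this distribution. We assume that the CDFs of random variables are invertible.

Any probabilistic regression model can be seen as a conditional CDF, which gives a distribution function on $\mathcal{Y}$ for each instance from the input space $\mathcal{X}$. We represent the model as $F : \mathcal{X} \to (\mathcal{Y} \to [0, 1])$, so that $F(x)$ is the predicted CDF for input $x$ and $[F(x)](y)$ is its value at $y$.

Assume $F$ is the predicted distribution function corresponding to the true distribution function $G$. Ideally, we want to predict the true distribution, i.e., $F = G$. This is equivalent to saying that $G \circ F^{-1}(p) = p$ for all $p \in [0, 1]$. Based on this, [20] propose the following definition.

Definition 1 (Complete Probabilistic Calibration). Given a model $F : \mathcal{X} \to (\mathcal{Y} \to [0, 1])$ and the true underlying model $G : \mathcal{X} \to (\mathcal{Y} \to [0, 1])$, the model $F$ is said to be probabilistically calibrated completely iff for every sequence $\{x_t\}_{t=1}^{T}$,

$$\frac{1}{T} \sum_{t=1}^{T} [G(x_t)] \circ [F(x_t)]^{-1}(p) \to p \quad \text{for all } p \in [0, 1] \text{ as } T \to \infty.$$

Since $G$ is unknown, [2] propose the following sufficient condition for the above definition, which is useful in practice.

Definition 2 (Quantile Calibration). Given a model $F : \mathcal{X} \to (\mathcal{Y} \to [0, 1])$ and random variables $X, Y$ jointly distributed as $P$, the function $F$ is said to be quantile calibrated iff

$$P\big([F(X)](Y) \le p\big) = p \quad \text{for all } p \in [0, 1].$$

The key to understanding the above definition is the random variable under consideration, $[F(X)](Y)$. Note that $[F(X)](Y)$ is the cumulative probability that the model predicts at the random pair $X, Y$ whose underlying distribution is $P$.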

The importance of such a definition is that we get calibrated confidence/credible intervals, which are critical for reliable uncertainty estimates. Its usefulness was demonstrated empirically in [2], who developed a post-hoc calibration method using the above notion of quantile calibration.
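To illustrate why quantile calibration yields trustworthy credible intervals, the following minimal sketch simulates a Gaussian predictive model and measures how often its central 90% interval contains the target. The data-generating process, the function names, and the overconfident variant (predicted standard deviation halved) are assumptions made for this example only.

```python
import math
import random


def normal_cdf(y, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at y; plays the role of [F(x)](y)."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def interval_coverage(pit_values, level=0.9):
    """Fraction of targets inside the central `level` credible interval.

    A target lies in the central interval exactly when its value [F(x)](y)
    falls between (1 - level) / 2 and (1 + level) / 2.
    """
    lo, hi = (1.0 - level) / 2.0, (1.0 + level) / 2.0
    inside = sum(1 for s in pit_values if lo <= s <= hi)
    return inside / len(pit_values)


rng = random.Random(0)
n = 20000
# Hypothetical data: y | x ~ N(x, 1), so the true noise scale is 1.
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x + rng.gauss(0.0, 1.0) for x in xs]

# Well-specified model: predicts N(x, 1).
pit_good = [normal_cdf(y, x, 1.0) for x, y in zip(xs, ys)]
# Overconfident model: predicts N(x, 0.5^2), so its intervals are too narrow.
pit_over = [normal_cdf(y, x, 0.5) for x, y in zip(xs, ys)]

coverage_good = interval_coverage(pit_good)
coverage_over = interval_coverage(pit_over)
print(coverage_good, coverage_over)
```

In this toy setting the well-specified model covers close to 90% of the targets, while the overconfident one covers far fewer, which is the kind of gap quantile calibration is meant to close.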

Existing calibration approaches can be divided into two types.

2.1 Post-hoc calibration

The objective of post-hoc calibration is to calibrate a miscalibrated model $F$ by learning a mapping $R : [0, 1] \to [0, 1]$ such that $R \circ F$ is a calibrated model. One such mapping can be obtained from the definition of calibration itself: setting $R(p) = P([F(X)](Y) \le p)$ makes $R \circ F$ a quantile calibrated model. Recently, [9] referred to an analogous mapping in the context of classification as the canonical calibration mapping. We use the same name for it in our regression setting.

Proposition 1. For any model $F$, given the canonical calibration mapping $R(p) = P([F(X)](Y) \le p)$, the model $R \circ F$ is quantile calibrated.

The proof of this proposition can be found in Appendix A1.

With this insight, and using the fact that the mapping $R$ is monotonically increasing, [2] use isotonic regression to learn this mapping on the training dataset itself, without any separate dataset, claiming that this does not overfit much. Given $\{(a_i, b_i)\}_{i=1}^{N}$ with $a_1 \le a_2 \le \dots \le a_N$, isotonic regression finds fitted values $\hat{b}_1, \dots, \hat{b}_N$ by minimizing the following objective:

$$\min_{\hat{b}_1 \le \hat{b}_2 \le \dots \le \hat{b}_N} \sum_{i=1}^{N} (b_i - \hat{b}_i)^2.$$

In isotonic calibration [2], given training data $\{(x_i, y_i)\}_{i=1}^{N}$, the recalibration dataset is

$$\Big\{ \big( [F(x_i)](y_i),\ \hat{P}([F(x_i)](y_i)) \big) \Big\}_{i=1}^{N}, \qquad \hat{P}(p) = \frac{1}{N} \big| \{ j : [F(x_j)](y_j) \le p \} \big|,$$

and the isotonic calibration mapping is fit on this recalibration dataset. However, this approach can be prone to overfitting. One way to see why isotonic calibration can overfit is that the recalibration dataset already satisfies the monotonicity constraint, because $\hat{P}$ is non-decreasing in $p$. So, to minimize the loss, the calibration mapping passes through every point of the recalibration dataset exactly. Also, it is a non-parametric method, which can overfit when given little data. Therefore, [2] used the training data itself in order to have plenty of data to learn the calibration mapping. Consequently, to recalibrate a pre-trained model, one needs the training data with which the model was trained. Another disadvantage is that the isotonic mapping is a piecewise linear monotonic function, with which we have to compose our predicted CDF at test time. This results in non-smooth CDFs, which may not be desirable.
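The overfitting argument can be checked numerically. The sketch below implements isotonic regression with the pool-adjacent-violators procedure and fits it on a recalibration dataset, assuming (as in [2]) that the targets are the empirical CDF of the values $[F(x_i)](y_i)$. The function names and the toy inputs are hypothetical.

```python
def pav_isotonic(targets):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `targets`
    (assumed to be already ordered by their inputs)."""
    blocks = []  # each block is [sum of values, count]
    for b in targets:
        blocks.append([b, 1])
        # Merge backwards while the block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def recalibration_dataset(pit_values):
    """Pairs (s_i, P_hat(s_i)) sorted by s_i, where P_hat is the empirical CDF
    of the values s_i = [F(x_i)](y_i)."""
    n = len(pit_values)
    ordered = sorted(pit_values)
    return [(s, sum(1 for t in ordered if t <= s) / n) for s in ordered]


# Hypothetical values of [F(x_i)](y_i) from a miscalibrated model.
pit = [0.05, 0.40, 0.42, 0.45, 0.50, 0.55, 0.58, 0.95]
data = recalibration_dataset(pit)
targets = [b for _, b in data]
fitted = pav_isotonic(targets)
# The targets are already non-decreasing, so the fit reproduces them exactly.
interpolates = fitted == targets

# On generic noisy targets, the procedure pools violating neighbours instead.
noisy_fit = pav_isotonic([0.1, 0.5, 0.3, 0.9])
```

Because the empirical-CDF targets are monotone by construction, the monotonicity constraint never becomes active and the fitted mapping interpolates the recalibration data, leaving no smoothing to guard against overfitting.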

2.2 Implicit Calibration

In contrast to post-hoc calibration, implicit calibration ensures that the model is well-calibrated by having a strong inductive bias towards model parameters that yield well-calibrated predictions. Our approach can be seen as the regression analog of [3], who designed a trainable loss function for classification by kernelizing the calibration error, and of [19], who penalize low-entropy (overconfident) softmax outputs.

3 Quantile Regularization

Recall that, in quantile calibration, we want $P([F(X)](Y) \le p) = p$ for all $p \in [0, 1]$. Note that both the left-hand and the right-hand sides can be seen as CDFs of random variables. Let $R(p) = P([F(X)](Y) \le p)$ and $S(p) = p$. Then $R$ is the CDF of $[F(X)](Y)$, while $S$ is the CDF of Uniform[0,1]. So quantile calibration essentially requires the two CDFs to be equal. Equivalently, for a perfectly quantile calibrated model, $[F(X)](Y)$ follows the Uniform[0,1] distribution. Our approach is based on this equivalence: we penalize the model if the r.v. $[F(X)](Y)$ deviates from Uniform[0,1]. This property can be used to design a calibration measure that is optimized as part of the training loss, yielding a well-calibrated model during training itself.

One possible divergence that one could use is the KL divergence. The KL divergence between a distribution on $[0, 1]$ and the uniform distribution is equal to the negative differential entropy of that distribution. This would give a very interpretable way of achieving calibration, namely maximizing the differential entropy of $[F(X)](Y)$. However, in practice, this would require using the Beta kernel [25] for density estimation and then computing the entropy. Therefore, we use another divergence that results in a loss function that is simpler to train.
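The relation between the KL divergence to the uniform distribution and differential entropy can be verified in one line. In the sketch below, $q$ denotes the density of $[F(X)](Y)$ on $[0, 1]$ and $u \equiv 1$ the Uniform[0,1] density; both symbols are introduced here for illustration only.

```latex
\mathrm{KL}(q \,\|\, u)
  = \int_0^1 q(s) \log \frac{q(s)}{u(s)} \, ds
  = \int_0^1 q(s) \log q(s) \, ds
  = -h(q),
\qquad
h(q) = -\int_0^1 q(s) \log q(s) \, ds .
```

Driving this KL divergence to zero therefore amounts to maximizing the differential entropy $h(q)$, whose maximum over densities on $[0, 1]$ is $0$ and is attained by the uniform density.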

3.1 Cumulative KL divergence

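Cumulative KL divergence (CKL) [26] is based on cumulative residual entropy (CRE) [27]. We derive an analytical closed-form expression for the CKL between a distribution with support on [0, 1] and Uniform[0,1], and use this divergence for our calibration method.

Definition 3 (Cumulative Residual Entropy). Let $S$ be a non-negative r.v. with CDF $F_S$ and survival function $\bar{F}_S = 1 - F_S$. The cumulative residual entropy of $S$ is defined as

$$\mathcal{E}(S) = -\int_0^{\infty} \bar{F}_S(s) \log \bar{F}_S(s) \, ds.$$

Definition 4 (Cumulative KL divergence). Let $S, T$ be non-negative r.v.s with CDFs $F_S, F_T$ and survival functions $\bar{F}_S, \bar{F}_T$. The cumulative KL divergence between $S$ and $T$ is defined as

$$\mathrm{CKL}(S \,\|\, T) = \int_0^{\infty} \bar{F}_S(s) \log \frac{\bar{F}_S(s)}{\bar{F}_T(s)} \, ds - \big( E[S] - E[T] \big).$$

The cumulative KL divergence has properties similar to those of the standard KL divergence. In particular, for any CDFs $F_S, F_T$, we have $\mathrm{CKL}(S \,\|\, T) \ge 0$, with equality iff $F_S = F_T$.

Proposition 2. Consider a random variable $S$ with support on $[0, 1]$ and CDF $F_S$, and let $T \sim$ Uniform[0,1]. Then the CKL in terms of the cumulative residual entropy is as follows:

$$\mathrm{CKL}(S \,\|\, T) = -\mathcal{E}(S) + E\big[(1 - S) \log(1 - S)\big] + \frac{1}{2}.$$

The proof of the above proposition can be found in Appendix A1.

Proposition 3. Given samples $s_1, \dots, s_N$ of $S$, let $s_{(1)} \le s_{(2)} \le \dots \le s_{(N)}$ denote the ordered samples. Then the following is a consistent estimator of the above expression:

$$\widehat{\mathrm{CKL}} = \sum_{i=1}^{N-1} \big( s_{(i+1)} - s_{(i)} \big) \Big( 1 - \frac{i}{N} \Big) \log \Big( 1 - \frac{i}{N} \Big) + \frac{1}{N} \sum_{i=1}^{N} (1 - s_i) \log(1 - s_i) + \frac{1}{2}. \tag{4}$$

The proof of the above proposition can be found in Appendix A1.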
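A small numerical sketch can make the estimator concrete. It assumes the estimator consists of three parts: the empirical survival function plugged into the integral of $\bar{F}_S \log \bar{F}_S$ (the negative empirical cumulative residual entropy, computed from ordered samples), the sample mean of $(1 - s) \log(1 - s)$, and the constant $1/2$. This form follows from the definitions of CRE and CKL with a Uniform[0,1] reference; the function name and the test distributions are hypothetical.

```python
import math
import random


def ckl_to_uniform(samples):
    """Estimate CKL(S || Uniform[0,1]) from samples of S in [0, 1]."""
    n = len(samples)
    ordered = sorted(samples)
    # Term 1: integral of surv * log(surv), with surv the empirical survival
    # function, which equals 1 - i/n between the i-th and (i+1)-th order
    # statistics. This is the negative empirical cumulative residual entropy.
    neg_cre = 0.0
    for i in range(1, n):
        surv = 1.0 - i / n
        neg_cre += (ordered[i] - ordered[i - 1]) * surv * math.log(surv)
    # Term 2: sample mean of (1 - s) * log(1 - s), taking 0 * log 0 = 0.
    mean_term = sum((1.0 - s) * math.log(1.0 - s) for s in samples if s < 1.0) / n
    return neg_cre + mean_term + 0.5


rng = random.Random(0)
uniform = [rng.random() for _ in range(5000)]
skewed = [u ** 3 for u in uniform]  # mass concentrated near 0, far from uniform

est_uniform = ckl_to_uniform(uniform)
est_skewed = ckl_to_uniform(skewed)
print(est_uniform, est_skewed)
```

For uniform samples the estimate is close to zero, while the skewed samples give a clearly positive value, which is the behaviour a calibration penalty needs.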

3.2 Calibration loss function

In our case, the random variable is $[F(X)](Y)$, where $F$ is the model. Given i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{N}$ in the training data, we need to generate the samples $s_i = [F(x_i)](y_i)$ to compute the expression given in Eq. 4.

Note that we want to make this part of the training procedure to achieve implicit calibration. However, we are faced with a challenge here: we need ordered samples to compute the first summation in Eq. 4, whereas sorting is not a differentiable operation. There are many differentiable approximations to the sorting operation. We use NeuralSort [28] in our experiments for its simplicity. The algorithm for computing the loss function is summarized in Algorithm 1.
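As a rough illustration of the sorting relaxation, the sketch below computes the NeuralSort relaxed permutation matrix of [28] in plain Python and uses it to obtain soft order statistics. It is not the training code: in practice this would run inside an automatic differentiation framework, and the temperature `tau`, the function names, and the example values are assumptions made here.

```python
import math


def neural_sort_matrix(s, tau=1.0):
    """Relaxed permutation matrix of NeuralSort [28] for a list of scores `s`.

    Row i (0-based) is softmax(((n - 1 - 2 * i) * s - A @ 1) / tau), where
    A[j][k] = |s[j] - s[k]|. As tau -> 0 the rows approach the one-hot rows
    of the permutation that sorts `s` in descending order.
    """
    n = len(s)
    row_sums = [sum(abs(sj - sk) for sk in s) for sj in s]  # A @ 1
    matrix = []
    for i in range(n):
        logits = [((n - 1 - 2 * i) * s[j] - row_sums[j]) / tau for j in range(n)]
        top = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


def soft_sort_ascending(s, tau=1.0):
    """Differentiable surrogate for sorted(s): apply the relaxed permutation
    (descending order) and reverse it to get ascending order statistics."""
    p_hat = neural_sort_matrix(s, tau)
    descending = [sum(p * v for p, v in zip(row, s)) for row in p_hat]
    return descending[::-1]


pit = [0.70, 0.10, 0.95, 0.40]  # hypothetical values of [F(x_i)](y_i)
relaxed = soft_sort_ascending(pit, tau=0.01)
print(relaxed)
```

The soft order statistics would take the place of the hard-sorted samples in the first summation of Eq. 4, so that gradients can flow back to the model parameters. A larger `tau` gives smoother gradients at the cost of a blurrier sort.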

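The overall loss function with quantile regularization is as follows. Given training data $D = \{(x_i, y_i)\}_{i=1}^{N}$, let $\theta$ be the parameters of the model, $\mathrm{NLL}(\theta; D)$ the negative log-likelihood, and $\mathrm{QR}(\theta; D)$ the calibration loss computed by Algorithm 1. The overall objective is

$$\mathcal{L}(\theta) = \mathrm{NLL}(\theta; D) + \lambda \, \mathrm{QR}(\theta; D),$$

where $\lambda \ge 0$ controls the strength of the regularization.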

3.3 Sharpness with Calibrated Predictions

Note that calibration alone is not sufficient for predictions to be accurate; sharpness is needed too. Our method can be seen as naturally achieving both desiderata. While the usual negative log-likelihood (NLL) ensures that the predictions are sharp, the quantile regularizer ensures that those predictions are calibrated too, with $\lambda$ controlling the strength of the regularization. As our experiments show, the RMSE and NLL scores do not worsen much even for large values of $\lambda$.

4 Experiments

We evaluate our approach on various regression datasets in terms of the calibration error as well as other standard metrics, such as root-mean-squared error (RMSE) and negative log-likelihood (NLL). We experiment with two base models, MC Dropout [29] and Deep Ensembles [30], by augmenting their objective functions with our proposed quantile regularizer.

4.1 Metrics

Quantile Calibration Error

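Given any model $F$, we define the calibration error as follows:

$$\mathrm{CE}(F) = \int_0^1 \Big( P\big([F(X)](Y) \le p\big) - p \Big)^2 \, dp.$$

Let us choose $m$ equidistant points $0 \le p_1 < p_2 < \dots < p_m \le 1$. Given a test set $\{(x_i, y_i)\}_{i=1}^{n}$ whose predictions are $F(x_i)$, an $m$-bin estimator of the above integral gives (up to a constant factor) the following metric used in [2]:

$$\widehat{\mathrm{CE}} = \sum_{j=1}^{m} (p_j - \hat{p}_j)^2, \qquad \hat{p}_j = \frac{1}{n} \big| \{ i : [F(x_i)](y_i) \le p_j \} \big|.$$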
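The metric can be sketched in a few lines. The version below assumes the unweighted form of the metric in [2], i.e. the sum over $m$ equidistant levels $p_j = j/m$ of the squared gap between $p_j$ and the observed fraction of values $[F(x_i)](y_i)$ not exceeding $p_j$. The grid choice, the function name, and the toy inputs are assumptions for illustration.

```python
def quantile_calibration_error(pit_values, m=100):
    """Sum over m equidistant levels p_j of (p_j - p_hat_j)^2, where p_hat_j is
    the fraction of values [F(x_i)](y_i) that are <= p_j."""
    n = len(pit_values)
    error = 0.0
    for j in range(1, m + 1):
        p_j = j / m
        p_hat_j = sum(1 for s in pit_values if s <= p_j) / n
        error += (p_j - p_hat_j) ** 2
    return error


# Evenly spread values: every level p_j is matched by its observed frequency.
even = [(i + 0.5) / 1000 for i in range(1000)]
# Degenerate values: half of the targets fall below every predicted quantile
# and half above, as for an extremely overconfident model.
extreme = [0.0] * 500 + [1.0] * 500

err_even = quantile_calibration_error(even)
err_extreme = quantile_calibration_error(extreme)
print(err_even, err_extreme)
```

A value near zero indicates that the observed frequencies track the nominal levels; dividing the sum by $m$ would turn it into a Riemann-sum approximation of the integral above.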

Table 1: base and QR stand for the model trained without and with Quantile Regularization, respectively. As we can see, the calibration error is reduced, while RMSE/NLL stay close to or better than those of the base model.

4.2 Models

4.2.1 Heteroscedastic MC Dropout

We integrate our quantile regularizer with the heteroscedastic MC dropout approach [31] where, for each instance $x$, a neural network with Dropout predicts a mean $\mu(x)$ and a variance $\sigma^2(x)$ and is trained with the Gaussian likelihood $\mathcal{N}(y \mid \mu(x), \sigma^2(x))$. At test time, we keep dropout enabled, perform $T$ stochastic forward passes, and use a Normal distribution computed from these passes as our prediction. The dropout rate is set to 0.25 and we perform $T = 10$ forward passes.
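One common way to turn $T$ stochastic forward passes into a single Normal prediction is moment matching of the equally weighted Gaussian mixture. The sketch below assumes this rule; the exact parameters used for the Normal distribution in the experiments are not recoverable from the text, so the rule, the function name, and the numbers are assumptions.

```python
def aggregate_passes(means, variances):
    """Moment-match T Gaussian predictions N(mu_t, var_t) with one Gaussian.

    Returns the mean and variance of the equally weighted mixture:
    mean = average of mu_t,
    variance = average of (var_t + mu_t^2) - mean^2.
    """
    t = len(means)
    mean = sum(means) / t
    second_moment = sum(v + mu * mu for mu, v in zip(means, variances)) / t
    return mean, second_moment - mean * mean


# Hypothetical outputs of T = 4 stochastic forward passes for one input.
mu_t = [1.0, 3.0, 2.0, 2.0]
var_t = [0.5, 0.5, 1.0, 1.0]
pred_mean, pred_var = aggregate_passes(mu_t, var_t)
print(pred_mean, pred_var)
```

The resulting variance is the average predicted variance plus the spread of the predicted means across passes, so disagreement between passes widens the predictive distribution.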

4.2.2 Deep Ensembles

We also test our quantile regularizer using deep ensembles [30], as they also provide uncertainty estimates. We fix the ensemble size to 5, and each network is trained with adversarial training, with the perturbation size set relative to the range of the input features along each dimension, as suggested in [30].

4.3 Hyperparameters

We use the same hyperparameter settings for all the models and all the datasets. In particular, we use a two-hidden-layer network with 128 units, a learning rate of 1e-2 with the Adam optimizer, identical to [2], a batch size of 512, and 100 epochs.

Table 2: base and QR stand for the model trained without and with Quantile Regularization, respectively. As we can see, the calibration error is reduced, while RMSE/NLL stay close to or better than those of the base model.

4.4 UCI datasets

We experiment with the following datasets (size-of-data, num-input-features): AirFoil (1503, 6), Boston Housing (506, 13), Concrete Strength (1030, 8), Fish Toxicity (908, 7), Kin8nm (8192, 9), Protein Structure (45730, 10), Red Wine (1599, 12), White Wine (4898, 12), Yacht Hydrodynamics (308, 6), and Year Prediction MSD (515345, 91). The dataset sizes range from 308 to 515345 and input feature dimensions range from 6 to 91. Every dataset except Year Prediction MSD is split into 5 splits; for Year Prediction MSD, there is a single pre-defined split where we train on 463715 points and test on 51630 points. Each experiment is repeated 5 times and averages are reported.

Table 1 reports calibration error, RMSE, NLL, and recalibration error when trained with and without quantile regularization. As shown in the table, the calibration error is smaller for the MC Dropout model when it is trained with Quantile Regularization. In 7/10 cases, even the NLL is better. Any degradation in RMSE is almost negligible.

Table 2 reports calibration error, RMSE, NLL, and recalibration error with Deep Ensembles as the underlying model. With Deep Ensembles as the base model, quantile regularization decreases the calibration error in 9/10 cases.

Figure 2 shows how calibration error, RMSE, and NLL change as we vary $\lambda$. We emphasize that there is no significant change in NLL/RMSE. Note that, while reporting results in the tables, we always use the same fixed value of $\lambda$.

Table 3 and Table 4 compare results when isotonic calibration is applied. With Dropout-VI as the base model, post-processing worsens the calibration error in 5/10 cases. These are instances where overfitting of isotonic regression manifests. Isotonic calibration works well on large datasets like Kin8nm, Protein Structure, and Year Prediction MSD; one possible justification is that there is plenty of data to recalibrate in these cases. With Deep Ensembles we see the same phenomenon: post-hoc processing can increase the calibration error. Here it is even worse, because the calibration error increases in 7/10 cases. The amount of miscalibration is much larger for Deep Ensembles than for the MC Dropout model. However, note that for very large datasets like Year Prediction MSD, isotonic calibration does perform well, as anticipated. Another thing to note is that the increase in miscalibration is smaller when the model is trained with Quantile Regularization.

Table 3: Base Model is Dropout-VI model. Marked entries indicate the datasets for which isotonic recalibration increases the calibration error, especially on smaller datasets.

Table 4: Base Model is Dropout-VI model. Marked entries indicate the datasets for which isotonic recalibration increases the calibration error, especially on smaller datasets.

5 Conclusion and Future work

Although there is significant empirical evidence that calibrated models produce more reliable uncertainty estimates and generalize well, there is relatively little theoretical understanding as to why calibrated models are superior. Properties of calibrated classification models were studied in [32]; however, an in-depth analysis of the properties of calibrated regression models is currently lacking. Also, as mentioned in [22], Quantile Calibration is based on marginal probabilities. A stronger notion is Distributional Calibration. An interesting avenue of future work is to design trainable loss functions for the notion of Distributional Calibration.

Figure 2: Calibration error reduces gradually as we increase $\lambda$ on the Concrete Strength, Protein Structure, and Year Prediction MSD datasets.

Appendix

A1 : Proofs

Proof of proposition 1

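Proof. To show that $R \circ F$ is quantile calibrated, we have to show that $P\big(R([F(X)](Y)) \le q\big) = q$ for all $q \in [0, 1]$. Since we are assuming that $R$ is an invertible function, it is surjective, so every $q$ can be written as $q = R(p)$ for some $p \in [0, 1]$. So, an equivalent way of showing this is that $P\big(R([F(X)](Y)) \le R(p)\big) = R(p)$ for all $p \in [0, 1]$. Because $R$ is monotonically increasing and invertible, $R([F(X)](Y)) \le R(p)$ holds iff $[F(X)](Y) \le p$, and therefore $P\big(R([F(X)](Y)) \le R(p)\big) = P\big([F(X)](Y) \le p\big) = R(p)$, by the definition $R(p) = P([F(X)](Y) \le p)$.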

Proof of proposition 2

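Proof. Let $\bar{F}_S = 1 - F_S$ denote the survival function of $S$ and $\mathcal{E}(S) = -\int_0^1 \bar{F}_S(s) \log \bar{F}_S(s) \, ds$ its cumulative residual entropy. Note that the expectation and survival function of $T \sim$ Uniform[0, 1] are $E[T] = 0.5$ and $\bar{F}_T(s) = 1 - s$ for $s \in [0, 1]$. Substituting these into the definition of the cumulative KL divergence gives

$$\mathrm{CKL}(S \,\|\, T) = -\mathcal{E}(S) - \int_0^1 \bar{F}_S(s) \log(1 - s) \, ds - E[S] + \frac{1}{2}.$$

Writing $\bar{F}_S(s) = E[\mathbb{1}\{S > s\}]$ and exchanging expectation and integral, we get $\int_0^1 \bar{F}_S(s) \log(1 - s) \, ds = E\big[\int_0^S \log(1 - s) \, ds\big] = -E\big[(1 - S) \log(1 - S)\big] - E[S]$. Substituting this back, the $E[S]$ terms cancel and we obtain $\mathrm{CKL}(S \,\|\, T) = -\mathcal{E}(S) + E\big[(1 - S) \log(1 - S)\big] + \frac{1}{2}$.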

Proof of proposition 3

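Proof. The first summation is the integral of $\bar{F}_S \log \bar{F}_S$ with the survival function replaced by the empirical survival function of the ordered samples, which is the empirical estimate of the negative cumulative residual entropy [27]. The expectation can be replaced by the sample mean, which is consistent by the law of large numbers. So, overall, we have a consistent estimator.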

References

[1] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321–1330. JMLR.org, 2017.

[2] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.

[3] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pages 2805–2814, 2018.

[4] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.

[5] Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.

[6] Allan H Murphy. Scalar and vector partitions of the probability score: Part i. two-state situation. Journal of Applied Meteorology, 11(2):273–282, 1972.

[7] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007.

[8] Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, pages 3787–3798, 2019.

[9] Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B Schön. Evaluating model calibration in classification. arXiv preprint arXiv:1902.06977, 2019.

[10] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems, pages 12295–12305, 2019.

[11] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Icml, volume 1, pages 609–616. Citeseer, 2001.

[12] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 694–699, 2002.

[13] Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial Intelligence and Statistics, pages 623–631, 2017.

[14] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[15] Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, pages 13888–13899, 2019.

[16] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

[17] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, pages 4696–4705, 2019.

[18] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143, 2019.

[19] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

[20] Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.

[21] Tom Fawcett and Alexandru Niculescu-Mizil. Pav and the roc convex hull. Machine Learning, 68(1):97–106, 2007.

[22] Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distribution calibration for regression. arXiv preprint arXiv:1905.06023, 2019.

[23] Gil Keren, Nicholas Cummins, and Björn Schuller. Calibrated prediction intervals for neural network regressors. IEEE Access, 6:54033–54041, 2018.

[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[25] Song Xi Chen. Beta kernel estimators for density functions. Computational Statistics & Data Analysis, 31(2):131–145, 1999.

[26] S Baratpour and A Habibi Rad. Testing goodness-of-fit for exponential distribution based on cumulative residual entropy. Communications in Statistics-Theory and Methods, 41(8):1387–1396, 2012.

[27] Murali Rao, Yunmei Chen, Baba C Vemuri, and Fei Wang. Cumulative residual entropy: a new measure of information. IEEE transactions on Information Theory, 50(6):1220–1228, 2004.

[28] Aditya Grover, Eric Wang, Aaron Zweig, and Stefano Ermon. Stochastic optimization of sorting networks via continuous relaxations. arXiv preprint arXiv:1903.08850, 2019.

[29] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.

[30] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pages 6402–6413, 2017.

[31] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 1:3, 2016.

[32] Ira Cohen and Moises Goldszmidt. Properties and benefits of calibrated classifiers. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 125–136. Springer, 2004.