A key challenge facing deep neural networks is that they do not produce reliable confidence estimates, which are important for applications such as safe reinforcement learning (Berkenkamp et al., 2017), guided exploration (Malik et al., 2019), and active learning (Gal et al., 2017).
We consider the setting where the test data follows the same distribution as the training data (i.e., we do not consider adversarial examples designed to fool the network (Szegedy et al., 2014)); even in this setting, confidence estimates produced by deep neural networks are notoriously unreliable (Guo et al., 2017). One intuition for this shortcoming is that unlike traditional supervised learning algorithms, deep learning models typically overfit the training data (Zhang et al., 2017). As a consequence, the confidence estimates of deep neural networks are flawed even for test data from the training distribution since, by construction, they overestimate the likelihood of the training data.
A promising approach to addressing this challenge is temperature scaling (Platt, 1999). This approach takes as input a trained neural network —i.e., whose parameters
have already been fit to a training dataset
—which produces unreliable probabilities
. Then, this approach rescales these confidence estimates based on a validation dataset to improve their “calibration”. More precisely, this approach fits confidence estimates of the form
where is a temperature scaling parameter that is fit based on the validation dataset. The goal is to choose
to minimize calibration error, which roughly speaking measures the degree to which the reported error rate differs from the actual error rate.
The key insight is that in the temperature scaling approach, only a single parameter is fit to the validation data—thus, unlike fitting the original neural network, the temperature scaling algorithm comes with generalization guarantees based on traditional statistical learning theory.
Despite the improved generalization guarantees, these confidence estimates still do not come with theoretical guarantees. We are interested in producing confidence sets that satisfy statistical guarantees while being as small as possible. Given a test input , a confidence set
Table 1: ImageNet images with varying ResNet confidence set sizes. The confidence set sizes are on the top. The true label is on the left-hand side. Incorrectly labeled images are boxed in red.
(parameterized by ) should contain the true label y for at least a
fraction of cases:
Since we are fitting a parameter T to based on , we additionally incur a probability of failure due to the randomness in
. In other words, given
, we aim to obtain probably approximately correct (PAC) confidence sets
satisfying the guarantee
Indeed, techniques from statistical learning theory (Vapnik, 1999) can be used to do so (Vovk, 2013).
There are a number of reasons why confidence sets can be useful. First, they can be used to inform safety critical decision making. For example, consider a doctor who uses prediction tools to help perform diagnosis. Having a confidence set would both help the doctor estimate the confidence of the prediction (i.e., smaller confidence sets imply higher confidence), but also give a sense of the set of possible diagnoses. Second, having a confidence set can be useful for reasoning about safety since they contain the true outcome with high probability. For instance, robots may use a confidence set over predicted trajectories to determine whether it is safe to act with high probability. As a concrete example, consider a self-driving car that uses a deep neural network to predict the path that a pedestrian might take. We require that the self-driving car avoid the pedestrian with high probability, which it can do by avoiding all possible paths in the predicted confidence set.
Contributions. We propose an algorithm combining calibrated prediction and statistical learning theory to construct PAC confidence sets for deep neural networks (Section 3). We propose instantiations of this framework in the settings of classification, regression, and learning models for reinforcement learning (Section 3.6). Finally, we evaluate our approach on three benchmarks: ResNet (He et al., 2016) for ImageNet (Russakovsky et al., 2015), a model (Held et al., 2016) learned for a visual object tracking benchmark (Wu et al., 2013), and a probabilistic dynamics model (Chua et al., 2018) learned for the half-cheetah environment (Brockman et al., 2016) (Section 4). Examples of ImageNet images with different sized ResNet confidence sets are shown in Table 1. As can be seen, our confidence sets become larger and the images become more challenging to classify. In addition, we show predicted confidence sets for ResNet in Table 2, as well as predicted confidence sets for the visual object tracking model in Table 3.
Related work. There has been work on constructing confidence sets with theoretical guarantees. Oftentimes, these guarantees are asymptotic rather than finite sample (Steinberger & Leeb, 2016; 2018). Alternatively, there has been work focused on predicting confidence sets with a given expected size (Denis & Hebiri, 2017).
More relatedly, there has been recent work on obtaining PAC guarantees. For example, there has been some work specific prediction tasks such as binary classification (Lei, 2014; Wang & Qiao,
� � � � � � � � � �
Table 2: Confidence sets of ImageNet images with varying ResNet confidence set sizes. The predicted confidence set is shown to the right of the corresponding input image. The true label is shown in red, and the predicted label is shown with a hat. See Table 5 in Appendix D for more examples.
Table 3: Visualization of confidence sets for the tracking dataset (Wu et al., 2013), including the ground truth bounding box (white), the bounding box predicted by the original neural network (Held et al., 2016) (red), and the bounding box produced using our confidence set predictor (green). We have overapproximated the predicted ellipsoid confidence set with a box. Our bounding box contains the ground truth bounding box with high probability. See Table 9 in Appendix D for more examples.
2018). There has also been work in the setting of regression (Lei et al., 2018; Barber et al., 2019). However, in this case, the confidence sets are fixed in size—i.e., they do not depend on the input x (Barber et al., 2019). Furthermore, they make stability assumptions about the learning algorithm (though they achieved improved rates by doing so) (Lei et al., 2018; Barber et al., 2019).
The most closely related work is on conformal prediction (Papadopoulos, 2008; Vovk, 2013). Like our approach, this line of work provides a way to construct confidence sets from a given confidence predictor, and provided PAC guarantees for the validity of these confidence sets. Indeed, with some work, our generalization bound Theorem 1 can be shown to be equivalent to Theorem 1 in Vovk (2013). In contrast to their approach, we proposed to use calibrated prediction to construct confi-dence predictors that can suitably be used with deep neural networks. Furthermore, our approach makes explicit the connections to temperature scaling and as well as to generalization bounds from statistical learning theory (Vapnik, 1999). In addition, unlike our paper, they do not explicitly provide an efficient algorithm for constructing confidence sets. Finally, we also propose an extension to the case of learning models for reinforcement learning.
Finally, we build on a long line of work on calibrated prediction, which aims to construct “calibrated” probabilities (Murphy, 1972; DeGroot & Fienberg, 1983; Platt, 1999; Zadrozny & Elkan, 2001; 2002; Naeini et al., 2015; Kuleshov & Liang, 2015). Roughly speaking, probabilities are calibrated if events happen at rates equal to the predicted probabilities. This work has recently been applied to obtaining confidence estimates for deep neural networks (Guo et al., 2017; Kuleshov et al., 2018; Pearce et al., 2018), including for learned models for reinforcement learning (Malik et al., 2019). However, these approaches do not come with PAC guarantees.
Our goal is to estimate confidence sets that are as small as possible, while simultaneously ensuring that they are probably approximately correct (PAC) (Valiant, 1984). Essentially, a confidence set is correct if it contains the true label. More precisely, let X be the inputs and Y be the labels, and let D be a distribution over . A confidence set predictor is a function
such that
is a set of labels; we denote the set of all confidence set predictors by C. For a given example
, we say C is correct if
. Then, the error of C is
Finally, consider an algorithm A that takes as input a validation set consisting of n i.i.d. samples
, and outputs a confidence set predictor
. Given
, we say that A is probably approximately correct (PAC) if
Our goal is to design an algorithm A that satisfies (2) while constructing confidence sets C(x) that are as “small in size” as possible on average. The size of C(x) depends on the domain. For classifi-cation, we consider confidence sets that are arbitrary subsets of labels , and we measure the size by
—i.e., the number of labels in C(x). For regression, we consider confidence sets that are intervals
, and we measure size by
—i.e., the length of the predicted interval. Note that there is an intrinsic tradeoff between satisfying (2) and average size of C(x)—larger confidence sets are more likely to satisfy (2).
Our algorithm is formulated in the empirical risk framework. Typically, this framework refers to empirical risk minimization. In our setting, such an algorithm would take as input (i) a parametric family of confidence set predictors , where
is the parameter space, and (ii) a training set
of n i.i.d. samples
, and output the confidence set predictor
, where
minimizes the empirical risk:
Here, is the indicator function, and the empirical risk
in an estimate of the confidence set error (1) based on the validation set
.
However, our algorithm does not minimize the empirical risk. Rather, recall that our goal is to minimize the size of the predicted confidence sets given a PAC constraint on the true risk based on the given PAC parameters
and the number of available validation samples
. Thus, the risk shows up as a constraint in the optimization problem, and the objective is instead to minimize the size of the predicted confidence sets:
At a high level, the value is chosen to enforce the PAC constraint, and is based on generalization bounds from statistical learning theory (Valiant, 1984). Furthermore, following the temperature scaling approach (Platt, 1999), the parameter space
is chosen to be as small as possible (in particular, one dimensional) to enable good generalization. Finally, our choice of size metric S follows straightforwardly based on our choice of parameter space. In the remainder of this section, we describe the choices of (i) parameter space
, (ii) size metric
, and (iii) confidence level
in more detail, as well as how to solve (3) given these choices.
3.1 CHOICE OF PARAMETER SPACE Θ
Probability forecasters. Our construction of the parameteric family of confidence set predictors assumes given a probability forecaster
, where
is a space of probability distributions over Y. Given such an f, we use f(y | x) to denote the probability of label y under distribution f(x). Intuitively, f(y | x) should be the probability (or probability density) that y is the true label for a given input x—i.e.,
. For example, in classification, we can choose
to be the space of categorical distributions over Y, and f may be a neural network whose last layer is a softmax layer with |Y| outputs. Then,
. Alternatively, in regression, we can choose
to be the space of Gaussian distributions, and f may be a neural network whose last layer outputs the values
of a Gaussian distribution. Then,
, where
, and
is the Gaussian density function with mean
and variance
.
Training a probability forecaster. To train a probability forecaster, we use a standard approach to calibrated prediction that combines maximum likelihood estimation with temperature scaling. 2 First, we consider a parametric model family , where
is the parameter space. Note that
can be high-dimensional—e.g., the weights of a neural network model. Given a training set
of m i.i.d. samples
, the maximum likelihood estimate (MLE) of
is
We could now use as the probability forecaster. However, the problem with directly using
is that because
may be high-dimensional, it often overfits the training data
. Thus, the probabilities are typically overconfident compared to what they should be.
To reduce their confidence, we use the temperature scaling approach to calibrate the predicted probabilities (Platt, 1999; Guo et al., 2017). Intuitively, this approach is to train an MLE estimate using exactly the same approach used to train , but using a single new parameter
. The key idea is that this time, the model family is based on the parameters
from (4). In other words, the “shape” of the probabilities forecast by
are preserved, but their exact values are shifted.
Then, we have the following MLE for :
Note that is estimated based on a second training set
. Because we are only fitting a single parameter, this training set can be much smaller than the training set
used to fit
.
Parametric family of confidence set predictors. Finally, given a probability forecaster f, we consider one dimensional parameter space ; in an analogy to the temperature scaling technique for calibrated prediction, we denote this parameter by
. In particular, we assume a confidence probability predictor f is given, and consider
In other words, is the set of y with high probability given x according to f. Considering this scalar parameter space, we denote the minimum of (3) by
.
3.2 CHOICE OF SIZE METRIC S(T)
To choose the size metric S(T), we note that for our chosen parametric family of confidence set predictors, smaller values correspond to uniformly smaller confidence sets—i.e.,
Thus, we can simply choose the size metric to be
This choice minimizes the size of the confidence sets produced by our algorithm.
3.3 CHOICE OF CONFIDENCE LEVEL α(n, ϵ, δ)
Naive approach based on VC generalization bound. A naive approach to choosing is to do so based on the VC dimension generalization bound (Vapnik, 1999). It is not hard to show that the problem of estimating
is equivalent to a binary classification problem, and that the VC dimension of
for this problem is 1. Thus, the VC dimension bound implies that for all
,
The details of this equivalence are given in Appendix B.2. Then, suppose we choose
With this choice, for the solution of (3) with
, the constraint in (3) ensures that
. Together with the VC generalization bound (7), we have
Direct generalization bound. In fact, we can get better choices of by directly bounding generalization error. For instance, in the realizable setting (i.e., we always have
), we can get rates of
instead of
(Kearns & Vazirani, 1994); see Appendix A.2 for details. We can achieve these rates by choosing
, but then, the PAC guarantees we obtain may actually be stronger than desired (i.e., for
). Intuitively, we can directly prove a bound that interpolates between the realizable setting and the VC generalization bound—in particular:
Theorem 1. For any , and
, we have
where is the solution to (3) with
. 3
We give a proof in Appendix B.2. Based on Theorem 1, we can choose
3.4 THEORETICAL GUARANTEES
We have the following guarantee, which follows straightforwardly from Theorem 1: Corollary 1. Let be the solution to (3) for
chosen according to (8). Then, we have
In other words, our algorithm is probably approximately correct.
3.5 PRACTICAL IMPLEMENTATION
Our algorithm for estimating a confidence set predictor is summarized in Algorithm 1. The algorithm solves the optimization problem (3) using the choices of
, and
described in the preceding sections. There are two key implementation details that we describe here.
Computing . To compute
, we need to solve (8). A straightforward approach is to enumerate all possible choices of
. There are two optimizations. First, the objective is monotone increasing in k, so we can enumerate k in ascending order until the constraint no longer holds. Second, rather than re-compute the left-hand side of the constraint
, we can accumulate the sum as we iterate over k. We can also incrementally compute
, and
. For numerical stability, we perform these computations in log space.
Solving (3). To solve (3), note that the constraint in (3) is equivalent to
Also, note that is an integer due to the definition of
in (8). Thus, we can interpret (9) as saying that E(x, y; T) = 1 for at most
of the points
.
In addition, note that E(x, y; T) decreases monotonically as becomes larger. Thus, we can sort the points
in ascending order of
, and require that only the first
points (x, y) in this list satisfy E(x, y; T) = 1. In particular, letting
be the
st point, (9) is equivalent to
In other words, this constraint says that T must satisfy . Finally, the solution
to (3) is the smallest T that satisfies (10), which is the T that makes (10) hold with equality—i.e.,
We have assumed ; if not, we decrement
until this holds.
3.6 PROBABILITY FORECASTERS FOR SPECIFIC TASKS
We briefly discuss the architectures we use for probability forecasters for various tasks. We give details, including how we measure the sizes of predicted confidence sets , in Appendix C. We consider three tasks: classification, regression, and model-based reinforcement learning. For classification, we use the standard approach of using a soft-max layer to predict label probabilities f(y | x). For regression, we also use a standard approach where the neural network predicts both the mean
and covariance
of a Gaussian distribution
; then, f(y | x) =
is the probability density of y according to this Gaussian distribution.
Finally, for model-based reinforcement learning, our goal is to construct confidence sets over trajectories predicted using a learned model of the dynamics. We consider unknown dynamics
Figure 1: Results on ResNet for ImageNet with n = 20000. Default parameters are and
. We plot the median and min/max confidence set sizes. (a) Ablation study; C is “calibrated predictor” (i.e., use
instead of
), and D is “direct bound” (i.e., use Theorem 1 instead of the VC generalization bound). (b) Restricted to correctly vs. incorrectly labeled images. (c) Varying
. (d) Varying
.
mapping a state-action pair (x, u) to a distribution over states
, and consider a known (and fixed) policy
mapping a given state x to a distribution over actions
. Then, we let
denote the (unknown) closed-loop dynamics.
Next, we consider a forecaster of the form
, and our goal is to construct confidence sets for the predictions of f. However, we want to do so for not just for one-step predictions, but for predictions over a time horizon
. In particular, given initial state
, we can sample
by letting
and sequentially sampling
for each
. Then, our goal is to construct a confidence set that contains
with high probability (over both the randomness in an initial state distribution
and the randomness in
).
To do so, we construct and use a forecaster based on f. In principle, this task is a special case of multivariate regression, where the inputs are X (i.e., the initial state
) and the outputs are
(i.e., a predicted trajectory
). However, the variance
predicted by our probability forecaster is only for a single step, and does not take into account the fact that x is itself uncertain. Thus, we use a simple heuristic where we accumulate variances over time. More precisely, we construct (i) the predicted mean
by
and
for
, and (ii) the predicted variances
by
We use a probability forecaster to construct confidence sets.
We describe our experiments on ImageNet (a classification task), a visual object tracking benchmark (a regression task), and the half-cheetah environment (a model-based reinforcement learning task). We give additional results in Appendix D.
Figure 2: Confidence set sizes for an object tracking benchmark (Wu et al., 2013); we use n = , and
. (a) Ablation study similar to Figure 3. In (b) and (c), we show how the confidence set sizes produced using our algorithm vary with respect to
and
, respectively.
ResNet for ImageNet. We use our algorithm to compute confidence sets for ResNet (He et al., 2016) on ImageNet (Russakovsky et al., 2015), for , and n = 20000 validation images. We show the results in Figure 1. In (a), we compare our approach to an ablation. In particular, C refers to performing an initial temperature scaling step to calibrate the neural network predictor (i.e., using
instead of
), and (ii) D refers to using Theorem 1 instead of the VC generalization bound. Thus, C + D refers to our approach. As can be seen, using calibrated predictor produces a noticeable reduction in the maximum confidence set size.
We also compared to the ablation C—i.e., using the VC generalization bound. However, we were unable to obtain valid confidence sets for our choice of and
—i.e., (3) is infeasible. That is, using Theorem 1 outperforms using the VC generalization bound since the VC bound is too loose to satisfy the PAC criterion for our choice of parameters. In addition, in Table 6 in Appendix D, we show results for larger choices of
and
; these results show that our approach substantially outperforms the ablation based on the VC bound even when the VC bound produces valid confidence sets.
In (b), we show the confidence set sizes for images correctly vs. incorrectly labeled by ResNet. As expected, the sizes are substantially larger for incorrectly labeled images. Finally, in (c) and (d), we show how the sizes vary with and
, respectively. As expected, the dependence on
is much more pronounced (note that
is log-scale).
Visual object tracking. We apply our confidence set prediction algorithm to a 2D visual singleobject tracking task, which is a multivariate regression problem. Specifically, the input space X consists of the previous image, the previous bounding box (in ), and the current image. The output space
is a current bounding box. We use the regression-based tracker from Held et al. (2016), and retrain the regressor neural network to predict the mean and variance of a Gaussian distribution. More precisely, our object tracking model predicts the mean and variance of each bounding box parameter—i.e.,
. Given this bounding box forecaster
, we calibrate and estimate a confidence set predictor as described in Section 3.6.
We use the visual object tracking benchmark from Wu et al. (2013) to train and evaluate our con-fidence set predictor. This benchmark consists of 99 video sequences labeled with ground truth bounding boxes. We randomly split these sequences to form the training set for calibration, validation set for confidence set estimation, and test set for evaluation. For each sequence, a pair of two adjacent frames constitute a single example. Our training dataset contains 20,882 labeled examples, each consisting of of a pair of consecutive images and ground truth bounding boxes. The validation set for confidence set estimation and test set contain 22,761 and 22,761 labeled examples, respectively. Figure 2 shows the sizes of the predicted confidence sets; the sizes are measured as described in Section 3.6 for regression tasks. As for ResNet, we omit results for the VC bound ablation since n is too small to get a bound. The trends are similar to the ones for ResNet.
Half-cheetah. We use our algorithm to compute confidence sets for a probabilistic neural network dynamics model (Chua et al., 2018) for the half-cheetah environment (Brockman et al., 2016), for time steps, and n = 5000 validation rollouts. When using temperature scaling to calibrate
to obtain
, we calibrate each dimension of time steps independently (i.e., we fit H parameters, where H is time horizon). We show the results in Figure 3.
Figure 3: Results on the dynamics model for the half-cheetah with n = 5000. Default parameters are and
. (a) Ablation study; A is “accumulated variance” (i.e., for each
{1, ..., 20}, use
instead of
), and C and D are as for ResNet. We plot the median and min/max confidence set sizes (see Section 3.6), averaged across
. (b) Same ablations, but with per time step size. We plot the average size of the confidence set for the predicted state
on step t, as a function of
. (c) Varying
, and (d) varying
.
In (a), we compare to two ablations. The labels C and D are as for ResNet; in addition, A refers to using the accumulated variance instead of the one-step predicted variances
. Thus, A + C + D is our approach. As before, we omit results for the ablation using the VC generalization bound since n is so small that the bound does not hold for any k for the given
and
. In (b), we show the same ablations over the entire trajectory until t = 20. As can be seen, using the calibrated predictor produces a large gain; these gains are most noticeable in the tails. Using the accumulated confidence produces a smaller, but still significant, gain. In (c) and (d), we show how the sizes vary with
and
, respectively. The trends are similar those for ResNet.
We have proposed an algorithm for constructing PAC confidence sets for deep neural networks. Our approach leverages statistical learning theory to obtain theoretical guarantees on the predicted confidence sets. These confidence sets quantify the uncertainty of deep neural networks. For instance, they can be used to inform safety-critical decision-making, and to ensure safety with highprobability in robotics control settings that leverage deep neural networks for perception. Future work includes extending these results to more complex tasks (e.g., structured prediction), and handling covariate shift (e.g., to handle policy updates in reinforcement learning).
This work was support in part by NSF CCF-1910769 and by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under Contract No. FA8750-18-C-0090. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Air Force Research Laboratory (AFRL), the Defense Advanced Research Projects Agency (DARPA), the Department of Defense, or the United States Government.
Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928, 2019.
Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pp. 908–918, 2017.
Marko Bohanec and Vladislav Rajkovic. Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, 1988.
Christopher P Bonafide, A Russell Localio, John H Holmes, Vinay M Nadkarni, Shannon Stemler, Matthew MacMurchy, Miriam Zander, Kathryn E Roberts, Richard Lin, and Ron Keren. Video analysis of factors associated with response time to physiologic monitor alarms in a childrens hospital. JAMA pediatrics, 171(6):524–531, 2017.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learn- ing in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.
Paulo Cortez and Alice Maria Goncalves Silva. Using data mining to predict secondary school student performance. 2008.
Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.
Christophe Denis and Mohamed Hebiri. Confidence sets with expected sizes for multiclass classifi- cation. The Journal of Machine Learning Research, 18(1):3571–3598, 2017.
Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183– 1192. JMLR. org, 2017.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.
H Altay Guvenir, Burak Acar, Gulsen Demiroz, and Ayhan Cekin. A supervised machine learning algorithm for arrhythmia analysis. In Computers in Cardiology 1997, pp. 433–436. IEEE, 1997.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog- nition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
David Held, Sebastian Thrun, and Silvio Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, pp. 749–765. Springer, 2016.
Michael J Kearns and Umesh Virkumar Vazirani. An introduction to computational learning theory. MIT press, 1994.
Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pp. 3474–3482, 2015.
Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.
Jing Lei. Classification with confidence. Biometrika, 101(4):755–769, 2014.
Jing Lei, Max GSell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution- free predictive inference for regression. Journal of the American Statistical Association, 113 (523):1094–1111, 2018.
Ali Malik, Volodymyr Kuleshov, Jiaming Song, Danny Nemer, Harlan Seymour, and Stefano Er- mon. Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning, pp. 4314–4323, 2019.
Allan H Murphy. Scalar and vector partitions of the probability score: Part i. two-state situation. Journal of Applied Meteorology, 11(2):273–282, 1972.
Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated prob- abilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. In Tools in artificial intelligence. IntechOpen, 2008.
T Pearce, M Zaki, A Brintrup, and A Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In 35th International Conference on Machine Learning, ICML 2018, volume 9, pp. 6473–6482, 2018.
John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likeli- hood methods. Advances in large margin classifiers, 1999.
J Ross Quinlan. Combining instance-based and model-based learning. In Proceedings of the Tenth International Conference on International Conference on Machine Learning, pp. 236–243. Morgan Kaufmann Publishers Inc., 1993.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
Lukas Steinberger and Hannes Leeb. Leave-one-out prediction intervals in linear regression models with many variables. arXiv preprint arXiv:1602.05801, 2016.
Lukas Steinberger and Hannes Leeb. Conditional predictive inference for high-dimensional stable algorithms. arXiv preprint arXiv:1809.01412, 2018.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfel- low, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Du- mitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
Vladimir N Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.
Vladimir Vovk. Conditional validity of inductive conformal predictors. Machine learning, 92(2-3): 349–376, 2013.
Wenbo Wang and Xingye Qiao. Learning confidence sets using support vector machines. In Advances in Neural Information Processing Systems, pp. 4929–4938, 2018.
Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In In Proceedings of the Eighteenth International Conference on Machine Learning. Citeseer, 2001.
Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass proba- bility estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699. ACM, 2002.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
A DISCUSSION OF ALGORITHM DESIGN CHOICES
A.1 USEFULNESS OF TEMPERATURE SCALING
In this section, we discuss why temperature scaling can help improve the predicted confidence sets.
A concern is that temperature scaling does not change the ordering of label probabilities. Thus, we
may expect that temperature scaling does not affect the predicted confidence sets. However, this
fact only holds when considering a single input x—i.e., the ordering of the probabilities p(y | x)
for is not changed by temperature scaling. Indeed, the order of confidences for labels for
different inputs can change. For a concrete example, consider two inputs x and , and the case
Y = {0, 1, 2}. Assume that the label probabilities are
Now, if we take temperature very large, then the labels become roughly
As a consequence, there are confidence sets that are achievable when using that are not achievable
when using f. In particular, the confidence sets
can be achieved using (e.g., with
). However, it is impossible to achieve these
confidence sets using f for any choice of T, since if , then it must be the case that
. Intuitively, we expect calibrated prediction to improve the ordering of probabili-
ties across different inputs. Our experiments support this intuition, since they show that empirically,
using calibrated predictors produces confidence sets of smaller size.
A.2 USEFULNESS OF DIRECT BOUND
One key design choice is to use a specialized generalization bound that directly provides PAC guar-
antees on our confidence sets rather than simply applying the VC dimension bound. The easiest way
to determine which bound is better is to examine which one produces a smaller confidence set. In our
approach, the size of the confidence set decreases monotonically with the choice of in
(3). Thus, the bound that produces larger is better. Recall that the VC dimension bound produces
whereas our direct bound produces (for k = 0)
Directly comparing these two choices of is difficult, but our experiments show empirically that
using the direct bound outperforms using the indirect bound.
A more direct way to compare the two approaches is to instead ask how large n needs to be to
achieve . For
, it is easy to check that we need
Thus, we need n to be at least (and possibly greater, to account for the log(2n)
term). In contrast, for our direct bound, corresponds to the case k = 0. To achieve k = 0, it
suffices to have n satisfying . Using
, it suffices to have n satisfying
Figure 4: Sample complexity of different bounds; we fix . Left: Sample complexity of VC bound and direct bound when k = 0. Right: Sample complexity of direct bound for varying k.
In other words, n only needs to be . For small
(e.g.,
), we need
fewer samples to achieve the same size confidence set (i.e., with choice ). In Figure 4
(right), we compute the exact values of n needed to get as a function of
for each
bound (fixing ). As expected, our bound requires substantially smaller n.
Finally, in Figure 4 (right), we compare the magnitude of n needed to achieve larger values of
using our direct bound; for simplicity, we actually consider larger values of k (where ), but
the qualitative insights are the same. As can be seen, even for large k, (e.g., k = 50), the number of
samples increases, but not substantially.
B THEORETICAL GUARANTEES
B.1 ASSUMPTIONS
We make two additional technical assumptions in Theorem 1, both of which are standard. First, we
assume that f is measurable; this assumption holds for all models used in practice, including neural
networks (e.g., it holds as long as f is continuous).
Second, letting , where
, be defined by
, we
assume that the distribution induced by
on R has continuous cumulative distribution function
(CDF). More precisely, letting be the measure defining D, then
is defined by the measure
where is the inverse of
in the sense that
for all
. Then,
we assume that the CDF corresponding to is continuous. This second assumption is standard in
statistical learning theory (Kearns & Vazirani, 1994). Essentially, it says that for any , the
probability that must equal zero. This assumption should hold unless p(x, y) or
f(y | x) are degenerate in some way. Furthermore, we can detect this case. In particular, the failure
mode corresponds to the case that we see multiple points with the same value . Thus,
choosing would include all these points, so the realized error rate
is larger
than desired for . In this case, we can simply choose a slightly larger
to avoid this problem.
B.2 PROOF OF THEOREM 1
At a high level, our proof proceeds in three steps. First, we show that a confidence set predictor
can be encoded as a binary classifier . Second, we show that a PAC bound for
implies a
PAC bound for (where in both cases, the unknown parameter is
). Third, we prove PAC
bounds on the error of ; by the second step, these bounds complete our proof.
Encoding as a binary classifier
. We begin by showing how the problem of learning a PAC
confidence set predictor reduces to the problem of learning a PAC binary classifier
. First,
we show that for any , the confidence set predictor
can be encoded as a binary classifier
. Consider any parameter
. Recall that we use the model f(y | x) to construct the
confidence set predictor
Now, define the map by
, where
, and define the
binary classifier by
Here, I[s] is the indicator function, which returns one if a statement s is true and zero otherwise. We
claim that
To see this claim, note that
as claimed.
PAC bound for implies PAC bound for
. Next, we show that a PAC bound for
implies
a PAC bound for . More precisely, we design a data distribution
and loss
, and show that (i)
the distribution of (trained to optimize
) is the same as the distribution of
(constructed using
our algorithm), and (ii) a PAC bound for (where
is trained on data from
) implies a PAC
bound for . We show that as a consequence, a PAC bound on
implies a PAC bound on
.
We begin by constructing and
. To this end, recall that D is a given distribution over
.
We define a data distribution over
, where
and
, as follows. The
first component of is the distribution over
induced by
from D, and the second component
is the distribution over that places all probability mass on 1. Formally,
exists assuming
is
measurable, so the induced distribution exists; for all our choices of f (i.e., categorical or Gaussian),
this property is satisfied. Then,
where is the measure encoding
, and
is the measure encoding D. Furthermore, we define
to be the 0-1 loss
. Finally, let
be chosen using our
algorithm—i.e.,
for any , and let
be chosen similarly for
—i.e.,
Now, we show (i) above. In particular, we claim that has the same distribution as
, where
and
are random datasets. To this end, define
by
Note that
from which it follows that
By construction of , the random variables
and
have the same distribution; thus, it follows
that the random variables and
have the same distribution as well. Since
, it follows that
has the same distribution as
, as claimed.
Next, we show (ii) above. In particular, we claim that a PAC bound for —i.e.,
implies a PAC bound for —i.e.,
where the true losses are
Note that it suffices to show that the true loss for equals the true loss for
—i.e.,
since this equation (together with the PAC bound for ) implies
as desired. To see the claim, note that
Now, using the change of variables , we have
Then, using (12), we have
as claimed.
Finally, combining (i) and (ii), we have
where the first equality follows since (i) says that (where
) has the same distribution
as (where
), and the second inequality follows by (ii).
Generalization bound. Finally, we prove the PAC bound
for , where
; for conciseness, we have dropped the dependence
of on
. By the previous step, this bound implies the theorem statement. To this end, we first
simplify the left-hand side of the inequality (13). In particular, let be the smallest T for which
; such a
exists by our assumption that
has continuous density function.
First, we claim that implies
. Assuming
, then
Assuming , we can similarly show that
. It follows that
As a consequence, (13) is equivalent to
Next, recall that must satisfy
, where
Assuming , and using
, it follows that
As a consequence, we have
By our definition of , the event in the final expression says that the sum of n i.i.d. Bernoulli
random variables Bernoulli
is at most k. Thus, this event follows a distribution
Binomial, so
as claimed. The theorem statement follows.
C DETAILS ON PROBABILITY FORECASTERS FOR SPECIFIC TASKS
In this section, we describe architectures for probability forecasters for classification, regression,
and model-based reinforcement learning.
Classification. For the case Y = {1, ..., Y }, we choose the probability forecaster f to be a neural
network with a softmax output. Then, we can compute a given confidence set
by explicitly enumerating . We measure the size of
as
.
Regression. For the case Y = R, we choose the probability forecaster f to be a neural network that
outputs the parameters of a Gaussian distribution. Then, we have
This choice generalizes to by having f output the parameters
(where
is the set of d dimensional symmetric positive definite matrices) of a d dimensional Gaussian
distribution. Note that is an ellipsoid
, where
and
is
the unit sphere in ; in particular,
, where
is the eigendecomposition of
We measure the size of as
, where
is the Frobenius norm.
Model-based reinforcement learning. In model-based reinforcement learning, the goal is to predict
trajectories based on a model of the dynamics. We consider an MDP with states , actions
, an unknown distribution over initial states
, and unknown dynamics
mapping a state-action pair to a distribution over states
. We assume a fixed,
known policy , mapping a state
to a distribution over actions
. The (unknown)
closed-loop dynamics are .
Given initial state and time horizon
, we can sample a trajectory
by setting
and sequentially sampling
for
. Our goal is to predict a confidence set
that contains
with high probability (according to both the randomness in initial states and in f). This
problem is a multivariate regression problem with inputs X and outputs .
We assume given a probability forecaster trained to predict the
distribution over next states—i.e., . Given initial state
and time
horizon , we construct the mean trajectory
by setting
and letting
.
To account for the fact that the variances accumulate over time, we sum them together to obtain the
predicted variances —i.e.,
Then, we use the probability forecast (where we think of
as
a vector in and
as a block diagonal matrix in
) to construct confidence
sets.
Finally, we describe how we measure the size of a predicted confidence set . In
particular, note that has the form
Figure 5: Comparison to baselines that do not have theoretical guarantees. In (a) and (b), we show results for ImageNet, and in (c) and (d), we show results for the half-cheetah. In (a) and (c), we show the empirical error in the confidence set sizes; the dotted line denotes , our target confidence set error. In (b) and (d), we show the sizes of the constructed confidence sets.
i.e., is the confidence set for the state
reached after t time steps. Then, we measure the
size of the confidence set for each component (for
) individually, and take
the average. As in the case of regression, is an ellipsoid
; then,
the size of is
.
An additional detail is that when we calibrate this forecaster, we calibrate each component
individually—i.e., we use H calibration parameters .
D ADDITIONAL RESULTS
D.1 COMPARISON TO ADDITIONAL BASELINES
We compare to two baselines that do not have theoretical guarantees. We assume given a probability
forecaster f(y | x). Then, given an input , we construct the confidence set to satisfy
More precisely, we first rank the labels in decreasing order of f(y | x), to obtain a list
. Then, we choose the smallest k such that (14) holds for
.
Intuitively, if the probabilities f(y | x) are correct (i.e., f(y | x) is the true probability of y given
x), then this confidence set should contain the true label y with high probability.
For regression, we cannot explicitly rank labels , but they are monotonically decreasing
away from the mean. Then, assuming is Gaussian, we take an
ellipsoid of shape around
with minimum radius that captures
of the probability
mass of f(y | x). More precisely, we choose
Figure 6: Confidence set sizes for two neural network architectures trained on ImageNet; for both, we use and
. Left: AlexNet (Krizhevsky, 2014); here, the empirical confidence set error of our approach C + D is 0.0066. Right: GoogLeNet (Szegedy et al., 2015); here, the empirical confidence set error of our approach is 0.0061.
where as before. Note that unlike our algorithm, the threshold
is not a learned parameter, but is computed independently for each new input x. We can solve
for efficiently by changing basis to convert f(y | x) to a standard Gaussian distribution, and
then using the error function to compute the cutoff that includes the desired probability mass.
In Figure 5, we compare the confidence sets constructed using this approach with (i) the forecaster
without any calibration, and (ii) the calibrated forecaster
. We plot both the
confidence set sizes and the empirical error rates. For the latter, recall that a confidence set predictor
C is correct if , where L(C) the true error rate. However, we cannot measure L(C);
instead, we approximate it on a held-out test set —i.e.,
, where
Intuitively, is the fraction of inputs
such that the predicted confidence set
for x does not contain y. We say a confidence set C is empirically valid when .
Recall that our algorithm guarantees correctness with probability at least , where
.
As can be seen, the baseline approaches are not empirically valid in all cases. In one case—namely,
the baseline with the calibrated forecaster on ImageNet—the confidence sets are almost empirically
valid. However, in this case, the confidence sets are much larger than those based on our approach,
despite the fact that the error rate of our confidence sets are empirically valid. Thus, our algorithms
outperform the baselines in all cases.
D.2 RESULTS ON ADDITIONAL IMAGENET NEURAL NETWORK ARCHITECTURES
We apply our approach to two additional neural network architectures for ImageNet: AlexNet
(Krizhevsky, 2014) and GoogLeNet (Szegedy et al., 2015). Our results are shown in Figure 6.
As can be seen, calibration reduces the confidence set sizes for AlexNet, but actually increases the
confidence set sizes for GoogleNet. Thus, both calibrated and uncalibrated models may need to be
considered when constructing confidence set predictors. Also, we find that confidence set sizes are
correlated with classification error—the test errors for AlexNet, GoogleNet, and ResNet are 47.83%,
29.41%, and 21.34%, respectively, and their confidence set sizes decrease in the same order.
D.3 RESULTS ON ADDITIONAL CLASSIFICATION DATASETS
We apply our approach to three small classification datasets: an Arrhythmia detection dataset (Gu-
venir et al., 1997), a car evaluation dataset (Bohanec & Rajkovic, 1988), and a medical alarm dataset
(Bonafide et al., 2017). The confidence set sizes are shown in Figure 7. We choose larger values of
and since we cannot obtain confidence sets that satisfy the PAC criterion with smaller
and
when
the number of validation examples n is too small. For all three datasets, the empirical confidence
set error is smaller than the specified error ; thus, the constructed confidence sets are empirically
valid. For these datasets, the confidence set sizes of our approach C + D and our approach without
calibration D are similar, most likely due to the small number of class labels.
Figure 7: Confidence set sizes for three additional classification benchmarks: (a) the arrhythmia detection dataset (Guvenir et al., 1997); here, , and the empirical confidence set error of our approach C + D is 0.0435, (b) the car evaluation dataset (Bohanec & Rajkovic, 1988); here,
, and the empirical confidence set error of our approach C + D is 0.0172, and (c) the CHOP alarm dataset (Bonafide et al., 2017); here, n = 1000,
, and the empirical confidence set error of our approach C+D is 0.0159. (d) The fractions of actionable and false alarms with a confidence set {0} (i.e., only contains false alarm).
We additionally ran our approach on a medical dataset where classification decisions are safety
critical; thus, correct predicted confidence sets are required. In particular, we use the Children’s
Hospital of Philadelphia (CHOP) alarm dataset (Bonafide et al., 2017). This dataset consists of vital
signs from 100 patients around one year of age. One of the vital signs is the oxygen level of the
blood, and a medical device generates an alarm if the oxygen level is below a specified level. The
labels indicate whether the generated alarm is true (y = 1) or false (y = 0). We use n = 1000,
, and
. The empirical confidence set error of our approach is
.
The key question is how many false alarms can be reliably detected using machine learning to help
reduce alarm fatigue. We consider an approach where we use the predicted confidence sets to detect
false alarms. In particular, we first train a probability forecaster , where Y = {0, 1}, to
predict the probability that an alarm is true, and then construct a calibrated confidence set predictor
based on this forecaster. We consider an alarm to be false if the predicted confidence
set is —i.e., according to our confidence set predictor, the alarm is definitely false. Then,
our PAC guarantee says that the alarm is actually false with probability at least . In summary,
we suppress an alarm if . Using our approach, 176/630 (i.e., 27.94%) of false alarms
are suppressed, while only 13/187 (i.e., 6.95%) true alarms are suppressed (see Figure 7 (d)).
D.4 RESULTS ON ADDITIONAL REGRESSION DATASETS
We ran our algorithm on two small regression baselines—the Auto MPG dataset (Quinlan, 1993)
and the student grade dataset (Cortez & Silva, 2008). We show results in Figure 8. The parameters
we use are and
; as with the smaller classification datasets, we use larger choices
of and
since we cannot construct valid confidence sets for smaller choices. For the Auto MPG
dataset, the empirical confidence set error of our final model C + D is , so
these are empirically valid. For the student grade dataset, the error is , which is
slightly larger than desired; this failure is likely due to the fact that the failure probability
is somewhat large.
Figure 8: Confidence set sizes for two benchmarks focused on regression; for both, we use and
. Left: the Auto MPG dataset (Quinlan, 1993); here, n = 70, and the empirical confidence set error of our approach C + D is 0.1250. Right: The student grade dataset (Cortez & Silva, 2008); here, n = 100, and the empirical confidence set error of our approach is 0.0597.
D.5 ADDITIONAL RESULTS ON IMAGENET, HALF-CHEETAH, AND OBJECT TRACKING
Table 4 & 5 show examples of ResNet confidence set sizes for ImageNet images. Table 6 shows
results for varying on ResNet. Tables 7 & 8 show results for varying
on the Half-Cheetah.
Table 9 shows visualizations of the confidence sets predicted for our object tracking benchmark.
Table 4: ImageNet images with varying ResNet confidence set sizes. The confidence set sizes are on the top. The true label is on the left-hand side. Incorrectly labeled images are boxed in red.
� � � � � � � � � �
Table 5: Confidence sets of ImageNet images with varying ResNet confidence set sizes. The predicted confidence set is shown to the right of the corresponding input image. The true label is shown in red, and the predicted label is shown with a hat.
Table 6: Confidence set sizes for ResNet trained on ImageNet, for varying and for n = 20, 000. The plots are as in Figure 1 (a).
Table 7: Confidence set sizes for a neural network dynamics model trained on the half-cheetah environment, for varying and for n = 5000. The plots are as in Figure 3 (a).
Table 8: Confidence set sizes for a neural network dynamics model trained on the half-cheetah environment, for varying and for n = 5000. The plots are as in Figure 3 (b).
Table 9: Visualization of confidence sets for the tracking dataset (Wu et al., 2013), including the ground truth bounding box (white), the bounding box predicted by the original neural network (Held et al., 2016) (red), and the bounding box produced using our confidence set predictor (green). We have overapproximated the predicted ellipsoid confidence set with a box. Our bounding box contains the ground truth bounding box with high probability.