PAC Confidence Sets for Deep Neural Networks via Calibrated Prediction

2019·Arxiv

ABSTRACT

ABSTRACT

We propose an algorithm combining calibrated prediction and generalization bounds from learning theory to construct confidence sets for deep neural networks with PAC guarantees—i.e., the confidence set for a given input contains the true label with high probability. We demonstrate how our approach can be used to construct PAC confidence sets on ResNet for ImageNet, a visual object tracking model, and a dynamics model for the half-cheetah reinforcement learning problem. 1

1 INTRODUCTION

A key challenge facing deep neural networks is that they do not produce reliable confidence estimates, which are important for applications such as safe reinforcement learning (Berkenkamp et al., 2017), guided exploration (Malik et al., 2019), and active learning (Gal et al., 2017).

We consider the setting where the test data follows the same distribution as the training data (i.e., we do not consider adversarial examples designed to fool the network (Szegedy et al., 2014)); even in this setting, confidence estimates produced by deep neural networks are notoriously unreliable (Guo et al., 2017). One intuition for this shortcoming is that unlike traditional supervised learning algorithms, deep learning models typically overfit the training data (Zhang et al., 2017). As a consequence, the confidence estimates of deep neural networks are flawed even for test data from the training distribution since, by construction, they overestimate the likelihood of the training data.

A promising approach to addressing this challenge is temperature scaling (Platt, 1999). This approach takes as input a trained neural network —i.e., whose parameters have already been fit to a training dataset —which produces unreliable probabilities . Then, this approach rescales these confidence estimates based on a validation dataset to improve their “calibration”. More precisely, this approach fits confidence estimates of the form

where is a temperature scaling parameter that is fit based on the validation dataset. The goal is to choose to minimize calibration error, which roughly speaking measures the degree to which the reported error rate differs from the actual error rate.

The key insight is that in the temperature scaling approach, only a single parameter is fit to the validation data—thus, unlike fitting the original neural network, the temperature scaling algorithm comes with generalization guarantees based on traditional statistical learning theory.

Despite the improved generalization guarantees, these confidence estimates still do not come with theoretical guarantees. We are interested in producing confidence sets that satisfy statistical guarantees while being as small as possible. Given a test input , a confidence set

Table 1: ImageNet images with varying ResNet confidence set sizes. The confidence set sizes are on the top. The true label is on the left-hand side. Incorrectly labeled images are boxed in red.

(parameterized by ) should contain the true label y for at least a fraction of cases:

Since we are fitting a parameter T to based on , we additionally incur a probability of failure due to the randomness in . In other words, given , we aim to obtain probably approximately correct (PAC) confidence sets satisfying the guarantee

Indeed, techniques from statistical learning theory (Vapnik, 1999) can be used to do so (Vovk, 2013).

There are a number of reasons why confidence sets can be useful. First, they can be used to inform safety critical decision making. For example, consider a doctor who uses prediction tools to help perform diagnosis. Having a confidence set would both help the doctor estimate the confidence of the prediction (i.e., smaller confidence sets imply higher confidence), but also give a sense of the set of possible diagnoses. Second, having a confidence set can be useful for reasoning about safety since they contain the true outcome with high probability. For instance, robots may use a confidence set over predicted trajectories to determine whether it is safe to act with high probability. As a concrete example, consider a self-driving car that uses a deep neural network to predict the path that a pedestrian might take. We require that the self-driving car avoid the pedestrian with high probability, which it can do by avoiding all possible paths in the predicted confidence set.

Contributions. We propose an algorithm combining calibrated prediction and statistical learning theory to construct PAC confidence sets for deep neural networks (Section 3). We propose instantiations of this framework in the settings of classification, regression, and learning models for reinforcement learning (Section 3.6). Finally, we evaluate our approach on three benchmarks: ResNet (He et al., 2016) for ImageNet (Russakovsky et al., 2015), a model (Held et al., 2016) learned for a visual object tracking benchmark (Wu et al., 2013), and a probabilistic dynamics model (Chua et al., 2018) learned for the half-cheetah environment (Brockman et al., 2016) (Section 4). Examples of ImageNet images with different sized ResNet confidence sets are shown in Table 1. As can be seen, our confidence sets become larger and the images become more challenging to classify. In addition, we show predicted confidence sets for ResNet in Table 2, as well as predicted confidence sets for the visual object tracking model in Table 3.

Related work. There has been work on constructing confidence sets with theoretical guarantees. Oftentimes, these guarantees are asymptotic rather than finite sample (Steinberger & Leeb, 2016; 2018). Alternatively, there has been work focused on predicting confidence sets with a given expected size (Denis & Hebiri, 2017).

More relatedly, there has been recent work on obtaining PAC guarantees. For example, there has been some work specific prediction tasks such as binary classification (Lei, 2014; Wang & Qiao,

� � � � � � � � � �

Table 2: Confidence sets of ImageNet images with varying ResNet confidence set sizes. The predicted confidence set is shown to the right of the corresponding input image. The true label is shown in red, and the predicted label is shown with a hat. See Table 5 in Appendix D for more examples.

Table 3: Visualization of confidence sets for the tracking dataset (Wu et al., 2013), including the ground truth bounding box (white), the bounding box predicted by the original neural network (Held et al., 2016) (red), and the bounding box produced using our confidence set predictor (green). We have overapproximated the predicted ellipsoid confidence set with a box. Our bounding box contains the ground truth bounding box with high probability. See Table 9 in Appendix D for more examples.

2018). There has also been work in the setting of regression (Lei et al., 2018; Barber et al., 2019). However, in this case, the confidence sets are fixed in size—i.e., they do not depend on the input x (Barber et al., 2019). Furthermore, they make stability assumptions about the learning algorithm (though they achieved improved rates by doing so) (Lei et al., 2018; Barber et al., 2019).

The most closely related work is on conformal prediction (Papadopoulos, 2008; Vovk, 2013). Like our approach, this line of work provides a way to construct confidence sets from a given confidence predictor, and provided PAC guarantees for the validity of these confidence sets. Indeed, with some work, our generalization bound Theorem 1 can be shown to be equivalent to Theorem 1 in Vovk (2013). In contrast to their approach, we proposed to use calibrated prediction to construct confi-dence predictors that can suitably be used with deep neural networks. Furthermore, our approach makes explicit the connections to temperature scaling and as well as to generalization bounds from statistical learning theory (Vapnik, 1999). In addition, unlike our paper, they do not explicitly provide an efficient algorithm for constructing confidence sets. Finally, we also propose an extension to the case of learning models for reinforcement learning.

Finally, we build on a long line of work on calibrated prediction, which aims to construct “calibrated” probabilities (Murphy, 1972; DeGroot & Fienberg, 1983; Platt, 1999; Zadrozny & Elkan, 2001; 2002; Naeini et al., 2015; Kuleshov & Liang, 2015). Roughly speaking, probabilities are calibrated if events happen at rates equal to the predicted probabilities. This work has recently been applied to obtaining confidence estimates for deep neural networks (Guo et al., 2017; Kuleshov et al., 2018; Pearce et al., 2018), including for learned models for reinforcement learning (Malik et al., 2019). However, these approaches do not come with PAC guarantees.

2 PAC CONFIDENCE SETS

Our goal is to estimate confidence sets that are as small as possible, while simultaneously ensuring that they are probably approximately correct (PAC) (Valiant, 1984). Essentially, a confidence set is correct if it contains the true label. More precisely, let X be the inputs and Y be the labels, and let D be a distribution over . A confidence set predictor is a function such that is a set of labels; we denote the set of all confidence set predictors by C. For a given example , we say C is correct if . Then, the error of C is

Finally, consider an algorithm A that takes as input a validation set consisting of n i.i.d. samples , and outputs a confidence set predictor . Given , we say that A is probably approximately correct (PAC) if

Our goal is to design an algorithm A that satisfies (2) while constructing confidence sets C(x) that are as “small in size” as possible on average. The size of C(x) depends on the domain. For classifi-cation, we consider confidence sets that are arbitrary subsets of labels , and we measure the size by —i.e., the number of labels in C(x). For regression, we consider confidence sets that are intervals , and we measure size by —i.e., the length of the predicted interval. Note that there is an intrinsic tradeoff between satisfying (2) and average size of C(x)—larger confidence sets are more likely to satisfy (2).

3 PAC ALGORITHM FOR CONFIDENCE SET CONSTRUCTION

Our algorithm is formulated in the empirical risk framework. Typically, this framework refers to empirical risk minimization. In our setting, such an algorithm would take as input (i) a parametric family of confidence set predictors , where is the parameter space, and (ii) a training set of n i.i.d. samples , and output the confidence set predictor , where minimizes the empirical risk:

Here, is the indicator function, and the empirical risk in an estimate of the confidence set error (1) based on the validation set .

However, our algorithm does not minimize the empirical risk. Rather, recall that our goal is to minimize the size of the predicted confidence sets given a PAC constraint on the true risk based on the given PAC parameters and the number of available validation samples . Thus, the risk shows up as a constraint in the optimization problem, and the objective is instead to minimize the size of the predicted confidence sets:

At a high level, the value is chosen to enforce the PAC constraint, and is based on generalization bounds from statistical learning theory (Valiant, 1984). Furthermore, following the temperature scaling approach (Platt, 1999), the parameter space is chosen to be as small as possible (in particular, one dimensional) to enable good generalization. Finally, our choice of size metric S follows straightforwardly based on our choice of parameter space. In the remainder of this section, we describe the choices of (i) parameter space , (ii) size metric , and (iii) confidence level in more detail, as well as how to solve (3) given these choices.

3.1 CHOICE OF PARAMETER SPACE Θ

Probability forecasters. Our construction of the parameteric family of confidence set predictors assumes given a probability forecaster , where is a space of probability distributions over Y. Given such an f, we use f(y | x) to denote the probability of label y under distribution f(x). Intuitively, f(y | x) should be the probability (or probability density) that y is the true label for a given input x—i.e., . For example, in classification, we can choose to be the space of categorical distributions over Y, and f may be a neural network whose last layer is a softmax layer with |Y| outputs. Then, . Alternatively, in regression, we can choose to be the space of Gaussian distributions, and f may be a neural network whose last layer outputs the values of a Gaussian distribution. Then, , where , and is the Gaussian density function with mean and variance .

Training a probability forecaster. To train a probability forecaster, we use a standard approach to calibrated prediction that combines maximum likelihood estimation with temperature scaling. 2 First, we consider a parametric model family , where is the parameter space. Note that can be high-dimensional—e.g., the weights of a neural network model. Given a training set of m i.i.d. samples , the maximum likelihood estimate (MLE) of is

We could now use as the probability forecaster. However, the problem with directly using is that because may be high-dimensional, it often overfits the training data . Thus, the probabilities are typically overconfident compared to what they should be.

To reduce their confidence, we use the temperature scaling approach to calibrate the predicted probabilities (Platt, 1999; Guo et al., 2017). Intuitively, this approach is to train an MLE estimate using exactly the same approach used to train , but using a single new parameter . The key idea is that this time, the model family is based on the parameters from (4). In other words, the “shape” of the probabilities forecast by are preserved, but their exact values are shifted.

Then, we have the following MLE for :

Note that is estimated based on a second training set . Because we are only fitting a single parameter, this training set can be much smaller than the training set used to fit .

Parametric family of confidence set predictors. Finally, given a probability forecaster f, we consider one dimensional parameter space ; in an analogy to the temperature scaling technique for calibrated prediction, we denote this parameter by . In particular, we assume a confidence probability predictor f is given, and consider

In other words, is the set of y with high probability given x according to f. Considering this scalar parameter space, we denote the minimum of (3) by .

3.2 CHOICE OF SIZE METRIC S(T)

To choose the size metric S(T), we note that for our chosen parametric family of confidence set predictors, smaller values correspond to uniformly smaller confidence sets—i.e.,

Thus, we can simply choose the size metric to be

This choice minimizes the size of the confidence sets produced by our algorithm.

3.3 CHOICE OF CONFIDENCE LEVEL α(n, ϵ, δ)

Naive approach based on VC generalization bound. A naive approach to choosing is to do so based on the VC dimension generalization bound (Vapnik, 1999). It is not hard to show that the problem of estimating is equivalent to a binary classification problem, and that the VC dimension of for this problem is 1. Thus, the VC dimension bound implies that for all ,

The details of this equivalence are given in Appendix B.2. Then, suppose we choose

With this choice, for the solution of (3) with , the constraint in (3) ensures that . Together with the VC generalization bound (7), we have

Direct generalization bound. In fact, we can get better choices of by directly bounding generalization error. For instance, in the realizable setting (i.e., we always have ), we can get rates of instead of (Kearns & Vazirani, 1994); see Appendix A.2 for details. We can achieve these rates by choosing , but then, the PAC guarantees we obtain may actually be stronger than desired (i.e., for ). Intuitively, we can directly prove a bound that interpolates between the realizable setting and the VC generalization bound—in particular:

Theorem 1. For any , and , we have

where is the solution to (3) with . 3

We give a proof in Appendix B.2. Based on Theorem 1, we can choose

3.4 THEORETICAL GUARANTEES

We have the following guarantee, which follows straightforwardly from Theorem 1: Corollary 1. Let be the solution to (3) for chosen according to (8). Then, we have

In other words, our algorithm is probably approximately correct.

3.5 PRACTICAL IMPLEMENTATION

Our algorithm for estimating a confidence set predictor is summarized in Algorithm 1. The algorithm solves the optimization problem (3) using the choices of , and described in the preceding sections. There are two key implementation details that we describe here.

Computing . To compute , we need to solve (8). A straightforward approach is to enumerate all possible choices of . There are two optimizations. First, the objective is monotone increasing in k, so we can enumerate k in ascending order until the constraint no longer holds. Second, rather than re-compute the left-hand side of the constraint , we can accumulate the sum as we iterate over k. We can also incrementally compute, and . For numerical stability, we perform these computations in log space.

Solving (3). To solve (3), note that the constraint in (3) is equivalent to

Also, note that is an integer due to the definition of in (8). Thus, we can interpret (9) as saying that E(x, y; T) = 1 for at most of the points .

In addition, note that E(x, y; T) decreases monotonically as becomes larger. Thus, we can sort the points in ascending order of , and require that only the first points (x, y) in this list satisfy E(x, y; T) = 1. In particular, letting be the st point, (9) is equivalent to

In other words, this constraint says that T must satisfy . Finally, the solution to (3) is the smallest T that satisfies (10), which is the T that makes (10) hold with equality—i.e.,

We have assumed ; if not, we decrement until this holds.

3.6 PROBABILITY FORECASTERS FOR SPECIFIC TASKS

We briefly discuss the architectures we use for probability forecasters for various tasks. We give details, including how we measure the sizes of predicted confidence sets , in Appendix C. We consider three tasks: classification, regression, and model-based reinforcement learning. For classification, we use the standard approach of using a soft-max layer to predict label probabilities f(y | x). For regression, we also use a standard approach where the neural network predicts both the mean and covariance of a Gaussian distribution ; then, f(y | x) = is the probability density of y according to this Gaussian distribution.

Finally, for model-based reinforcement learning, our goal is to construct confidence sets over trajectories predicted using a learned model of the dynamics. We consider unknown dynamics

Figure 1: Results on ResNet for ImageNet with n = 20000. Default parameters are and . We plot the median and min/max confidence set sizes. (a) Ablation study; C is “calibrated predictor” (i.e., use instead of ), and D is “direct bound” (i.e., use Theorem 1 instead of the VC generalization bound). (b) Restricted to correctly vs. incorrectly labeled images. (c) Varying . (d) Varying .

mapping a state-action pair (x, u) to a distribution over states , and consider a known (and fixed) policy mapping a given state x to a distribution over actions . Then, we let denote the (unknown) closed-loop dynamics.

Next, we consider a forecaster of the form , and our goal is to construct confidence sets for the predictions of f. However, we want to do so for not just for one-step predictions, but for predictions over a time horizon . In particular, given initial state , we can sample by letting and sequentially sampling for each . Then, our goal is to construct a confidence set that contains with high probability (over both the randomness in an initial state distribution and the randomness in ).

To do so, we construct and use a forecaster based on f. In principle, this task is a special case of multivariate regression, where the inputs are X (i.e., the initial state ) and the outputs are (i.e., a predicted trajectory ). However, the variance predicted by our probability forecaster is only for a single step, and does not take into account the fact that x is itself uncertain. Thus, we use a simple heuristic where we accumulate variances over time. More precisely, we construct (i) the predicted mean by and for , and (ii) the predicted variances by

We use a probability forecaster to construct confidence sets.

4 EXPERIMENTS

We describe our experiments on ImageNet (a classification task), a visual object tracking benchmark (a regression task), and the half-cheetah environment (a model-based reinforcement learning task). We give additional results in Appendix D.

Figure 2: Confidence set sizes for an object tracking benchmark (Wu et al., 2013); we use n = , and . (a) Ablation study similar to Figure 3. In (b) and (c), we show how the confidence set sizes produced using our algorithm vary with respect to and , respectively.

ResNet for ImageNet. We use our algorithm to compute confidence sets for ResNet (He et al., 2016) on ImageNet (Russakovsky et al., 2015), for , and n = 20000 validation images. We show the results in Figure 1. In (a), we compare our approach to an ablation. In particular, C refers to performing an initial temperature scaling step to calibrate the neural network predictor (i.e., using instead of ), and (ii) D refers to using Theorem 1 instead of the VC generalization bound. Thus, C + D refers to our approach. As can be seen, using calibrated predictor produces a noticeable reduction in the maximum confidence set size.

We also compared to the ablation C—i.e., using the VC generalization bound. However, we were unable to obtain valid confidence sets for our choice of and —i.e., (3) is infeasible. That is, using Theorem 1 outperforms using the VC generalization bound since the VC bound is too loose to satisfy the PAC criterion for our choice of parameters. In addition, in Table 6 in Appendix D, we show results for larger choices of and ; these results show that our approach substantially outperforms the ablation based on the VC bound even when the VC bound produces valid confidence sets.

In (b), we show the confidence set sizes for images correctly vs. incorrectly labeled by ResNet. As expected, the sizes are substantially larger for incorrectly labeled images. Finally, in (c) and (d), we show how the sizes vary with and , respectively. As expected, the dependence on is much more pronounced (note that is log-scale).

Visual object tracking. We apply our confidence set prediction algorithm to a 2D visual singleobject tracking task, which is a multivariate regression problem. Specifically, the input space X consists of the previous image, the previous bounding box (in ), and the current image. The output space is a current bounding box. We use the regression-based tracker from Held et al. (2016), and retrain the regressor neural network to predict the mean and variance of a Gaussian distribution. More precisely, our object tracking model predicts the mean and variance of each bounding box parameter—i.e., . Given this bounding box forecaster , we calibrate and estimate a confidence set predictor as described in Section 3.6.

We use the visual object tracking benchmark from Wu et al. (2013) to train and evaluate our con-fidence set predictor. This benchmark consists of 99 video sequences labeled with ground truth bounding boxes. We randomly split these sequences to form the training set for calibration, validation set for confidence set estimation, and test set for evaluation. For each sequence, a pair of two adjacent frames constitute a single example. Our training dataset contains 20,882 labeled examples, each consisting of of a pair of consecutive images and ground truth bounding boxes. The validation set for confidence set estimation and test set contain 22,761 and 22,761 labeled examples, respectively. Figure 2 shows the sizes of the predicted confidence sets; the sizes are measured as described in Section 3.6 for regression tasks. As for ResNet, we omit results for the VC bound ablation since n is too small to get a bound. The trends are similar to the ones for ResNet.

Half-cheetah. We use our algorithm to compute confidence sets for a probabilistic neural network dynamics model (Chua et al., 2018) for the half-cheetah environment (Brockman et al., 2016), for time steps, and n = 5000 validation rollouts. When using temperature scaling to calibrate to obtain , we calibrate each dimension of time steps independently (i.e., we fit H parameters, where H is time horizon). We show the results in Figure 3.

Figure 3: Results on the dynamics model for the half-cheetah with n = 5000. Default parameters are and . (a) Ablation study; A is “accumulated variance” (i.e., for each {1, ..., 20}, use instead of ), and C and D are as for ResNet. We plot the median and min/max confidence set sizes (see Section 3.6), averaged across . (b) Same ablations, but with per time step size. We plot the average size of the confidence set for the predicted state on step t, as a function of . (c) Varying , and (d) varying .

In (a), we compare to two ablations. The labels C and D are as for ResNet; in addition, A refers to using the accumulated variance instead of the one-step predicted variances . Thus, A + C + D is our approach. As before, we omit results for the ablation using the VC generalization bound since n is so small that the bound does not hold for any k for the given and . In (b), we show the same ablations over the entire trajectory until t = 20. As can be seen, using the calibrated predictor produces a large gain; these gains are most noticeable in the tails. Using the accumulated confidence produces a smaller, but still significant, gain. In (c) and (d), we show how the sizes vary with and , respectively. The trends are similar those for ResNet.

5 CONCLUSION

We have proposed an algorithm for constructing PAC confidence sets for deep neural networks. Our approach leverages statistical learning theory to obtain theoretical guarantees on the predicted confidence sets. These confidence sets quantify the uncertainty of deep neural networks. For instance, they can be used to inform safety-critical decision-making, and to ensure safety with highprobability in robotics control settings that leverage deep neural networks for perception. Future work includes extending these results to more complex tasks (e.g., structured prediction), and handling covariate shift (e.g., to handle policy updates in reinforcement learning).

ACKNOWLEDGMENTS

This work was support in part by NSF CCF-1910769 and by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under Contract No. FA8750-18-C-0090. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Air Force Research Laboratory (AFRL), the Defense Advanced Research Projects Agency (DARPA), the Department of Defense, or the United States Government.

REFERENCES

Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928, 2019.

Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pp. 908–918, 2017.

Marko Bohanec and Vladislav Rajkovic. Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, 1988.

Christopher P Bonafide, A Russell Localio, John H Holmes, Vinay M Nadkarni, Shannon Stemler, Matthew MacMurchy, Miriam Zander, Kathryn E Roberts, Richard Lin, and Ron Keren. Video analysis of factors associated with response time to physiologic monitor alarms in a childrens hospital. JAMA pediatrics, 171(6):524–531, 2017.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learn- ing in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.

Paulo Cortez and Alice Maria Goncalves Silva. Using data mining to predict secondary school student performance. 2008.

Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.

Christophe Denis and Mohamed Hebiri. Confidence sets with expected sizes for multiclass classifi- cation. The Journal of Machine Learning Research, 18(1):3571–3598, 2017.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183– 1192. JMLR. org, 2017.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.

H Altay Guvenir, Burak Acar, Gulsen Demiroz, and Ayhan Cekin. A supervised machine learning algorithm for arrhythmia analysis. In Computers in Cardiology 1997, pp. 433–436. IEEE, 1997.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog- nition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

David Held, Sebastian Thrun, and Silvio Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, pp. 749–765. Springer, 2016.

Michael J Kearns and Umesh Virkumar Vazirani. An introduction to computational learning theory. MIT press, 1994.

Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pp. 3474–3482, 2015.

Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.

Jing Lei. Classification with confidence. Biometrika, 101(4):755–769, 2014.

Jing Lei, Max GSell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution- free predictive inference for regression. Journal of the American Statistical Association, 113 (523):1094–1111, 2018.

Ali Malik, Volodymyr Kuleshov, Jiaming Song, Danny Nemer, Harlan Seymour, and Stefano Er- mon. Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning, pp. 4314–4323, 2019.

Allan H Murphy. Scalar and vector partitions of the probability score: Part i. two-state situation. Journal of Applied Meteorology, 11(2):273–282, 1972.

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated prob- abilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. In Tools in artificial intelligence. IntechOpen, 2008.

T Pearce, M Zaki, A Brintrup, and A Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In 35th International Conference on Machine Learning, ICML 2018, volume 9, pp. 6473–6482, 2018.

John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likeli- hood methods. Advances in large margin classifiers, 1999.

J Ross Quinlan. Combining instance-based and model-based learning. In Proceedings of the Tenth International Conference on International Conference on Machine Learning, pp. 236–243. Morgan Kaufmann Publishers Inc., 1993.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.

Lukas Steinberger and Hannes Leeb. Leave-one-out prediction intervals in linear regression models with many variables. arXiv preprint arXiv:1602.05801, 2016.

Lukas Steinberger and Hannes Leeb. Conditional predictive inference for high-dimensional stable algorithms. arXiv preprint arXiv:1809.01412, 2018.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfel- low, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Du- mitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

Vladimir N Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.

Vladimir Vovk. Conditional validity of inductive conformal predictors. Machine learning, 92(2-3): 349–376, 2013.

Wenbo Wang and Xingye Qiao. Learning confidence sets using support vector machines. In Advances in Neural Information Processing Systems, pp. 4929–4938, 2018.

Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In In Proceedings of the Eighteenth International Conference on Machine Learning. Citeseer, 2001.

Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass proba- bility estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699. ACM, 2002.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

A DISCUSSION OF ALGORITHM DESIGN CHOICES

A.1 USEFULNESS OF TEMPERATURE SCALING

In this section, we discuss why temperature scaling can help improve the predicted confidence sets.

A concern is that temperature scaling does not change the ordering of label probabilities. Thus, we

may expect that temperature scaling does not affect the predicted confidence sets. However, this

fact only holds when considering a single input x—i.e., the ordering of the probabilities p(y | x)

for is not changed by temperature scaling. Indeed, the order of confidences for labels for

different inputs can change. For a concrete example, consider two inputs x and , and the case

Y = {0, 1, 2}. Assume that the label probabilities are

Now, if we take temperature very large, then the labels become roughly

As a consequence, there are confidence sets that are achievable when using that are not achievable

when using f. In particular, the confidence sets

can be achieved using (e.g., with ). However, it is impossible to achieve these

confidence sets using f for any choice of T, since if , then it must be the case that

. Intuitively, we expect calibrated prediction to improve the ordering of probabili-

ties across different inputs. Our experiments support this intuition, since they show that empirically,

using calibrated predictors produces confidence sets of smaller size.

A.2 USEFULNESS OF DIRECT BOUND

One key design choice is to use a specialized generalization bound that directly provides PAC guar-

antees on our confidence sets rather than simply applying the VC dimension bound. The easiest way

to determine which bound is better is to examine which one produces a smaller confidence set. In our

approach, the size of the confidence set decreases monotonically with the choice of in

(3). Thus, the bound that produces larger is better. Recall that the VC dimension bound produces

whereas our direct bound produces (for k = 0)

Directly comparing these two choices of is difficult, but our experiments show empirically that

using the direct bound outperforms using the indirect bound.

A more direct way to compare the two approaches is to instead ask how large n needs to be to

achieve . For , it is easy to check that we need

Thus, we need n to be at least (and possibly greater, to account for the log(2n)

term). In contrast, for our direct bound, corresponds to the case k = 0. To achieve k = 0, it

suffices to have n satisfying . Using , it suffices to have n satisfying

Figure 4: Sample complexity of different bounds; we fix . Left: Sample complexity of VC bound and direct bound when k = 0. Right: Sample complexity of direct bound for varying k.

In other words, n only needs to be . For small (e.g., ), we need

fewer samples to achieve the same size confidence set (i.e., with choice ). In Figure 4

(right), we compute the exact values of n needed to get as a function of for each

bound (fixing ). As expected, our bound requires substantially smaller n.

Finally, in Figure 4 (right), we compare the magnitude of n needed to achieve larger values of

using our direct bound; for simplicity, we actually consider larger values of k (where ), but

the qualitative insights are the same. As can be seen, even for large k, (e.g., k = 50), the number of

samples increases, but not substantially.

B THEORETICAL GUARANTEES

B.1 ASSUMPTIONS

We make two additional technical assumptions in Theorem 1, both of which are standard. First, we

assume that f is measurable; this assumption holds for all models used in practice, including neural

networks (e.g., it holds as long as f is continuous).

Second, letting , where , be defined by , we

assume that the distribution induced by on R has continuous cumulative distribution function

(CDF). More precisely, letting be the measure defining D, then is defined by the measure

where is the inverse of in the sense that for all . Then,

we assume that the CDF corresponding to is continuous. This second assumption is standard in

statistical learning theory (Kearns & Vazirani, 1994). Essentially, it says that for any , the

probability that must equal zero. This assumption should hold unless p(x, y) or

f(y | x) are degenerate in some way. Furthermore, we can detect this case. In particular, the failure

mode corresponds to the case that we see multiple points with the same value . Thus,

choosing would include all these points, so the realized error rate is larger

than desired for . In this case, we can simply choose a slightly larger to avoid this problem.

B.2 PROOF OF THEOREM 1

At a high level, our proof proceeds in three steps. First, we show that a confidence set predictor

can be encoded as a binary classifier . Second, we show that a PAC bound for implies a

PAC bound for (where in both cases, the unknown parameter is ). Third, we prove PAC

bounds on the error of ; by the second step, these bounds complete our proof.

Encoding as a binary classifier . We begin by showing how the problem of learning a PAC

confidence set predictor reduces to the problem of learning a PAC binary classifier . First,

we show that for any , the confidence set predictor can be encoded as a binary classifier

. Consider any parameter . Recall that we use the model f(y | x) to construct the

confidence set predictor

Now, define the map by , where , and define the

binary classifier by

Here, I[s] is the indicator function, which returns one if a statement s is true and zero otherwise. We

claim that

To see this claim, note that

as claimed.

PAC bound for implies PAC bound for . Next, we show that a PAC bound for implies

a PAC bound for . More precisely, we design a data distribution and loss , and show that (i)

the distribution of (trained to optimize ) is the same as the distribution of (constructed using

our algorithm), and (ii) a PAC bound for (where is trained on data from ) implies a PAC

bound for . We show that as a consequence, a PAC bound on implies a PAC bound on .

We begin by constructing and . To this end, recall that D is a given distribution over .

We define a data distribution over , where and , as follows. The

first component of is the distribution over induced by from D, and the second component

is the distribution over that places all probability mass on 1. Formally, exists assuming is

measurable, so the induced distribution exists; for all our choices of f (i.e., categorical or Gaussian),

this property is satisfied. Then,

where is the measure encoding , and is the measure encoding D. Furthermore, we define

to be the 0-1 loss . Finally, let be chosen using our

algorithm—i.e.,

for any , and let be chosen similarly for —i.e.,

Now, we show (i) above. In particular, we claim that has the same distribution as , where

and are random datasets. To this end, define by

Note that

from which it follows that

By construction of , the random variables and have the same distribution; thus, it follows

that the random variables and have the same distribution as well. Since

, it follows that has the same distribution as , as claimed.

Next, we show (ii) above. In particular, we claim that a PAC bound for —i.e.,

implies a PAC bound for —i.e.,

where the true losses are

Note that it suffices to show that the true loss for equals the true loss for —i.e.,

since this equation (together with the PAC bound for ) implies

as desired. To see the claim, note that

Now, using the change of variables , we have

Then, using (12), we have

as claimed.

Finally, combining (i) and (ii), we have

where the first equality follows since (i) says that (where ) has the same distribution

as (where ), and the second inequality follows by (ii).

Generalization bound. Finally, we prove the PAC bound

for , where ; for conciseness, we have dropped the dependence

of on . By the previous step, this bound implies the theorem statement. To this end, we first

simplify the left-hand side of the inequality (13). In particular, let be the smallest T for which

; such a exists by our assumption that has continuous density function.

First, we claim that implies . Assuming , then

Assuming , we can similarly show that . It follows that

As a consequence, (13) is equivalent to

Next, recall that must satisfy , where

Assuming , and using , it follows that

As a consequence, we have

By our definition of , the event in the final expression says that the sum of n i.i.d. Bernoulli

random variables Bernoulliis at most k. Thus, this event follows a distribution

Binomial, so

as claimed. The theorem statement follows.

C DETAILS ON PROBABILITY FORECASTERS FOR SPECIFIC TASKS

In this section, we describe architectures for probability forecasters for classification, regression,

and model-based reinforcement learning.

Classification. For the case Y = {1, ..., Y }, we choose the probability forecaster f to be a neural

network with a softmax output. Then, we can compute a given confidence set

by explicitly enumerating . We measure the size of as .

Regression. For the case Y = R, we choose the probability forecaster f to be a neural network that

outputs the parameters of a Gaussian distribution. Then, we have

This choice generalizes to by having f output the parameters (where

is the set of d dimensional symmetric positive definite matrices) of a d dimensional Gaussian

distribution. Note that is an ellipsoid , where and is

the unit sphere in ; in particular, , where is the eigendecomposition of

We measure the size of as , where is the Frobenius norm.

Model-based reinforcement learning. In model-based reinforcement learning, the goal is to predict

trajectories based on a model of the dynamics. We consider an MDP with states , actions

, an unknown distribution over initial states , and unknown dynamics

mapping a state-action pair to a distribution over states . We assume a fixed,

known policy , mapping a state to a distribution over actions . The (unknown)

closed-loop dynamics are .

Given initial state and time horizon , we can sample a trajectory

by setting and sequentially sampling for

. Our goal is to predict a confidence set that contains

with high probability (according to both the randomness in initial states and in f). This

problem is a multivariate regression problem with inputs X and outputs .

We assume given a probability forecaster trained to predict the

distribution over next states—i.e., . Given initial state and time

horizon , we construct the mean trajectory by setting and letting .

To account for the fact that the variances accumulate over time, we sum them together to obtain the

predicted variances —i.e.,

Then, we use the probability forecast (where we think of as

a vector in and as a block diagonal matrix in ) to construct confidence

sets.

Finally, we describe how we measure the size of a predicted confidence set . In

particular, note that has the form

Figure 5: Comparison to baselines that do not have theoretical guarantees. In (a) and (b), we show results for ImageNet, and in (c) and (d), we show results for the half-cheetah. In (a) and (c), we show the empirical error in the confidence set sizes; the dotted line denotes , our target confidence set error. In (b) and (d), we show the sizes of the constructed confidence sets.

i.e., is the confidence set for the state reached after t time steps. Then, we measure the

size of the confidence set for each component (for ) individually, and take

the average. As in the case of regression, is an ellipsoid ; then,

the size of is .

An additional detail is that when we calibrate this forecaster, we calibrate each component

individually—i.e., we use H calibration parameters .

D ADDITIONAL RESULTS

D.1 COMPARISON TO ADDITIONAL BASELINES

We compare to two baselines that do not have theoretical guarantees. We assume given a probability

forecaster f(y | x). Then, given an input , we construct the confidence set to satisfy

More precisely, we first rank the labels in decreasing order of f(y | x), to obtain a list

. Then, we choose the smallest k such that (14) holds for .

Intuitively, if the probabilities f(y | x) are correct (i.e., f(y | x) is the true probability of y given

x), then this confidence set should contain the true label y with high probability.

For regression, we cannot explicitly rank labels , but they are monotonically decreasing

away from the mean. Then, assuming is Gaussian, we take an

ellipsoid of shape around with minimum radius that captures of the probability

mass of f(y | x). More precisely, we choose

Figure 6: Confidence set sizes for two neural network architectures trained on ImageNet; for both, we use and . Left: AlexNet (Krizhevsky, 2014); here, the empirical confidence set error of our approach C + D is 0.0066. Right: GoogLeNet (Szegedy et al., 2015); here, the empirical confidence set error of our approach is 0.0061.

where as before. Note that unlike our algorithm, the threshold

is not a learned parameter, but is computed independently for each new input x. We can solve

for efficiently by changing basis to convert f(y | x) to a standard Gaussian distribution, and

then using the error function to compute the cutoff that includes the desired probability mass.

In Figure 5, we compare the confidence sets constructed using this approach with (i) the forecaster

without any calibration, and (ii) the calibrated forecaster . We plot both the

confidence set sizes and the empirical error rates. For the latter, recall that a confidence set predictor

C is correct if , where L(C) the true error rate. However, we cannot measure L(C);

instead, we approximate it on a held-out test set —i.e., , where

Intuitively, is the fraction of inputs such that the predicted confidence set

for x does not contain y. We say a confidence set C is empirically valid when .

Recall that our algorithm guarantees correctness with probability at least , where .

As can be seen, the baseline approaches are not empirically valid in all cases. In one case—namely,

the baseline with the calibrated forecaster on ImageNet—the confidence sets are almost empirically

valid. However, in this case, the confidence sets are much larger than those based on our approach,

despite the fact that the error rate of our confidence sets are empirically valid. Thus, our algorithms

outperform the baselines in all cases.

D.2 RESULTS ON ADDITIONAL IMAGENET NEURAL NETWORK ARCHITECTURES

We apply our approach to two additional neural network architectures for ImageNet: AlexNet

(Krizhevsky, 2014) and GoogLeNet (Szegedy et al., 2015). Our results are shown in Figure 6.

As can be seen, calibration reduces the confidence set sizes for AlexNet, but actually increases the

confidence set sizes for GoogleNet. Thus, both calibrated and uncalibrated models may need to be

considered when constructing confidence set predictors. Also, we find that confidence set sizes are

correlated with classification error—the test errors for AlexNet, GoogleNet, and ResNet are 47.83%,

29.41%, and 21.34%, respectively, and their confidence set sizes decrease in the same order.

D.3 RESULTS ON ADDITIONAL CLASSIFICATION DATASETS

We apply our approach to three small classification datasets: an Arrhythmia detection dataset (Gu-

venir et al., 1997), a car evaluation dataset (Bohanec & Rajkovic, 1988), and a medical alarm dataset

(Bonafide et al., 2017). The confidence set sizes are shown in Figure 7. We choose larger values of

and since we cannot obtain confidence sets that satisfy the PAC criterion with smaller and when

the number of validation examples n is too small. For all three datasets, the empirical confidence

set error is smaller than the specified error ; thus, the constructed confidence sets are empirically

valid. For these datasets, the confidence set sizes of our approach C + D and our approach without

calibration D are similar, most likely due to the small number of class labels.

Figure 7: Confidence set sizes for three additional classification benchmarks: (a) the arrhythmia detection dataset (Guvenir et al., 1997); here, , and the empirical confidence set error of our approach C + D is 0.0435, (b) the car evaluation dataset (Bohanec & Rajkovic, 1988); here, , and the empirical confidence set error of our approach C + D is 0.0172, and (c) the CHOP alarm dataset (Bonafide et al., 2017); here, n = 1000, , and the empirical confidence set error of our approach C+D is 0.0159. (d) The fractions of actionable and false alarms with a confidence set {0} (i.e., only contains false alarm).

We additionally ran our approach on a medical dataset where classification decisions are safety

critical; thus, correct predicted confidence sets are required. In particular, we use the Children’s

Hospital of Philadelphia (CHOP) alarm dataset (Bonafide et al., 2017). This dataset consists of vital

signs from 100 patients around one year of age. One of the vital signs is the oxygen level of the

blood, and a medical device generates an alarm if the oxygen level is below a specified level. The

labels indicate whether the generated alarm is true (y = 1) or false (y = 0). We use n = 1000,

, and . The empirical confidence set error of our approach is .

The key question is how many false alarms can be reliably detected using machine learning to help

reduce alarm fatigue. We consider an approach where we use the predicted confidence sets to detect

false alarms. In particular, we first train a probability forecaster , where