Isotropic Maximization Loss and Entropic Score: Fast, Accurate, Scalable, Unexposed, Turnkey, and Native Neural Networks Out-of-Distribution Detection

2019·Arxiv

Abstract

Current out-of-distribution (OOD) detection approaches require cumbersome procedures that add undesired side effects to the solution. In this paper, we argue that the low OOD detection performance of neural networks is due to the anisotropy of the cross-entropy SoftMax loss and its extreme propensity to produce low entropy (high confidence) posterior probability distributions, in frontal disagreement with the Principle of Maximum Entropy. Consequently, we propose IsoMax, a loss that is isotropic (distance-based) and produces high entropy (low confidence) posterior probability distributions despite still relying on cross-entropy minimization. Additionally, we propose a speedy Entropic Score for OOD detection. IsoMax loss works as a seamless SoftMax loss drop-in replacement that keeps the overall solution accurate, fast, efficient, scalable, and turnkey. Our experiments confirmed that the OOD detection performance of neural networks may be greatly improved without relying on techniques such as adversarial training or validation, data augmentation, ensemble methods, generative approaches, model architectural changes, metric learning, or additional classifiers or regressions. The results also showed that our straightforward approach is competitive against state-of-the-art solutions while avoiding the undesired drawbacks of previous methods.

Index Terms—Isotropic Maximization Loss, Entropic Score, Accurate, Fast, Efficient, Scalable, Turnkey, Neural Networks, Out-of-Distribution Detection, Principle of Maximum Entropy

1 INTRODUCTION

Neural networks have been used as classifiers in a wide range of applications. Their design usually assumes that the model receives an instance of one of the training classes at inference. If this holds, the neural network tends to present satisfactory performance. However, in real-world applications, this assumption is not usually fulfilled. Additionally, neural networks are known to present overconfident predictions even for objects they were not trained to recognize [1].

The ability to detect whether an input applied to a neural network cannot be reliably classified is essential to critical applications in medicine, finance, agriculture, and engineering. In such situations, it is better to have a system that acknowledges that it is unable to decide. The rapid adoption of neural networks in modern applications makes the development of such a capability a primary necessity from a practical point of view.

This problem has been studied from many similar points of view and under several names, such as Open Set Recognition [2], [3] and Open World Recognition [4], [5]. Recently, [6] defined out-of-distribution (OOD) detection as the task of evaluating whether a sample belongs to the in-distribution on which a neural network was trained. [6] also established baseline datasets and metrics for OOD detection. Additionally, they established the baseline performance for this task by proposing an OOD detection approach that uses the maximum predicted probability as the score to detect whether an example belongs to the in-distribution.
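The maximum-probability baseline described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation of [6]; the function names, the example logits, and the threshold value are assumptions introduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_probability_score(logits):
    """Baseline OOD score: the largest predicted class probability."""
    return max(softmax(logits))


def is_in_distribution(logits, threshold):
    """Accept the input as in-distribution when the score reaches the threshold."""
    return max_probability_score(logits) >= threshold


# Hypothetical logits for two inputs of a three-class classifier.
confident_logits = [8.0, 0.5, -1.0]  # one class clearly dominates
ambiguous_logits = [0.2, 0.1, 0.0]   # nearly uniform prediction

print(max_probability_score(confident_logits))
print(max_probability_score(ambiguous_logits))
print(is_in_distribution(ambiguous_logits, threshold=0.9))
```

The score uses only the largest output, which is the limitation the Entropic Score later addresses by using all outputs.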

Although OOD detection is a fundamental task, current approaches are based on ad hoc techniques that produce severe side effects on the solution. ODIN [7] and Mahalanobis [8] require input preprocessing, which makes inference slow and increases computational cost and energy consumption. Additionally, these solutions have hyperparameters that must be tuned using unrealistic access to out-of-distribution samples or the cumbersome process of generating adversarial examples. Adversarial training methods such as ACET [9] usually imply longer training times and reduced scalability for large images.

Another major drawback present in recent OOD detection approaches is the classification accuracy drop [10], [11], which is indeed a harmful undesired side effect because classification is usually the primary aim of the system. In contrast, OOD detection is an auxiliary task [12].

In some cases, OOD detection proposals require undesired model structural changes [13] or even ensemble methods [14], [15]. Finally, there are solutions based on uncertainty or confidence estimation/calibration [16], [17], [18], [19], [20]. Despite additional complexity, slower inference, and higher energy/computation required, they may additionally present OOD detection performance worse than ODIN [11], [21].

In this paper, we argue that the low OOD performance of neural networks is mainly due to two factors. First, the SoftMax anisotropy, which does not concentrate high-level representations in the feature space, making OOD detection difficult [9]. Second, the propensity of the cross-entropy loss to generate extremely overconfident (low entropy) posterior probability distributions, which is in direct conflict with the Principle of Maximum Entropy. Throughout this work, we further develop those claims with both theoretical motivations and experimental results.

Hence, we propose IsoMax, a loss that is isotropic (distance-based) and generates inferences with high mean entropy posterior probability distributions in agreement with the Principle of Maximum Entropy. Our principled approach is accurate, fast, efficient, scalable, and turnkey, besides producing competitive performance. Additionally, our solution works as a seamless SoftMax loss drop-in replacement that facilitates its incorporation into current and future projects. Unlike most contemporary approaches, our proposal is viable from an economical and environmental point of view. Furthermore, the detection is a speedy procedure that can be achieved by a straightforward negative entropy calculation.

2 BACKGROUND

ODIN was proposed in [7] by combining SoftMax input preprocessing and temperature calibration. Despite significantly outperforming the baseline, the input preprocessing introduced in ODIN considerably increases the inference delay by requiring a backpropagation operation and a second inference to perform the final prediction on a single sample. Considering that backpropagation is typically slower than inference, input preprocessing makes ODIN inference at least three times slower. Additionally, input preprocessing also makes the inference power consumption at least three times higher, which is a severe limitation from an economical and environmental perspective [22]. Several subsequent OOD detection proposals incorporated input preprocessing and its associated drawbacks [7], [8], [11], [23]. Temperature calibration consists of changing the scale of the logits of a pretrained model. Both input preprocessing and temperature calibration require hyperparameter tuning.
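Temperature calibration as described above, rescaling the logits of a pretrained model before the softmax, can be sketched as follows. The function name, the example logits, and the temperature value are assumptions chosen for illustration; the temperature is exactly the kind of hyperparameter that has to be validated.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 flattens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [6.0, 2.0, -1.0]  # hypothetical logits from a pretrained model
plain = softmax_with_temperature(logits, temperature=1.0)
flattened = softmax_with_temperature(logits, temperature=1000.0)  # illustrative value only

print(max(plain), max(flattened))
```

The predicted class is unchanged by the rescaling; only the confidence assigned to it moves.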

ODIN required unrealistic access to out-of-distribution samples to validate the hyperparameters. Even if some supposed OOD samples are indeed available during design, using those examples to validate hyperparameters makes the solution overfit to this particular type of out-distribution [21]. In real-world applications, the system will probably operate under a different, novel, or unknown out-distribution, and the OOD detection performance could degrade significantly. Therefore, using design-time out-of-distribution samples to validate hyperparameters generates unrealistic OOD detection performance expectations.

The seminal work introduced by [8], which we call the Mahalanobis method, overcomes the need for access to out-of-distribution samples by validating the required hyperparameters on adversarial examples, producing more realistic performance estimates. Hence, in this work, we only consider validation on adversarial samples for competing methods. As our approach is turnkey, it does not require hyperparameter validation.

However, validation using adversarial examples has the disadvantage of adding a cumbersome procedure to the process. Even worse, the generation of adversarial samples itself requires the definition of hyperparameters such as the maximum adversarial perturbation. For research datasets, we may know those, but it may be hard to find them for novel real-world data. The same drawbacks apply to methods based on adversarial training such as ACET [9], which also implies slower training. Solutions based on adversarial training may also present scalability problems when used in applications dealing with large real-world images.

Moreover, the Mahalanobis approach still requires input preprocessing, which brings to this solution the drawbacks previously associated with that technique. Feature ensembles introduced in Mahalanobis also present limitations. Since they require training and inference of ad hoc classification/regression models on features produced in many neural network layers, this approach may not scale to applications using large images, as it would require using those shallow models in spaces of thousands of dimensions.

Contributions. In this paper, we develop an OOD detection approach that avoids all previously mentioned requirements and side effects. For this work, we follow the “SoftMax loss” expression as defined in [24]. The first component of our solution is the Isotropic Maximization (IsoMax) loss, which works as a drop-in replacement for the SoftMax loss: swapping SoftMax loss for IsoMax loss requires no modification to the model, the data, or the training procedure. IsoMax uses distance-based logits to fix the SoftMax loss anisotropy caused by its affine transformations. Moreover, we introduce what we call the Entropic Scale ($E_s$), a multiplicative factor applied to the logits throughout training that is, nevertheless, removed during inference to achieve high entropy posterior probability distributions. This scheme allows us to build high entropy (low confidence) posterior probability distributions committed to our prior knowledge as stated by the Principle of Maximum Entropy.

The second part of our proposal is the speedy Entropic Score, which is defined as the negative entropy of the output probabilities and is used for OOD detection. Furthermore, we provide theoretical motivation, based on the Principle of Maximum Entropy, to explain why the solution works. Finally, we present substantial experiments that confirm our theoretical assumptions and show that the overall solution is competitive with approaches that operate under more favorable and less restrictive conditions.

Indeed, the principled way we construct our approach allowed it to be accurate (no classification accuracy drop), fast, and to present energy/computational efficiency (no input preprocessing, no adversarial training). Additionally, it is also scalable (no feature ensemble, no adversarial training) and turnkey (no post-processing for hyperparameter validation, no access to out-of-distribution or the generation of adversarial examples is required). Modern loss enhancement techniques such as outlier exposure [25], [26] may be readily adapted to also work with IsoMax loss.

Fig. 1. (a) Cross-entropy SoftMax loss simultaneously minimizes both the cross-entropy and the entropy of the posterior probabilities. (b) IsoMax loss produces low entropy posterior probabilities for low Entropic Scale ($E_s = 1$). (c) IsoMax loss produces medium mean entropy for intermediate Entropic Scale ($E_s = 3$). (d) In agreement with the Principle of Maximum Entropy, IsoMax loss can minimize the cross-entropy while producing high mean entropies for high Entropic Scale ($E_s = 10$). An Entropic Scale equal to ten is enough to produce extremely high entropy posterior probability distributions. (e) Higher values of the Entropic Scale correlate with higher mean entropy and increased OOD detection performance regardless of the out-distribution under consideration. Isotropy enables IsoMax loss to produce higher OOD performance than SoftMax loss even for a unitary value of the Entropic Scale. IsoMax loss classification accuracies are similar to SoftMax ones and insensitive to

3 ISOMAX LOSS AND ENTROPIC SCORE

3.1 Isotropy and Distance-based Logits

Let $x$ represent the input applied to a neural network and $f_\theta(x)$ represent the high-level feature vector produced by it. For this work, the underlying structure of the neural network does not matter. Letting $k$ be the correct class for a particular training example $x$, we can write the SoftMax loss associated with this specific training sample as:

$$\mathcal{L}_S(\hat{y}^{(k)} \mid x) = -\log\left(\frac{\exp(w_k^\top f_\theta(x) + b_k)}{\sum_j \exp(w_j^\top f_\theta(x) + b_j)}\right) \tag{1}$$

In equation (1), $w_j$ and $b_j$ represent, respectively, the weights and biases associated with class $j$. From a geometric perspective, the term $w_j^\top f_\theta(x) + b_j = 0$ represents a hyperplane in the high-level feature space. It divides the feature space into two subspaces that we call the positive and negative subspaces. The deeper inside the positive subspace the features are located, the more likely the example belongs to the considered class. Therefore, training neural networks with SoftMax loss does not encourage the representations of the examples of a particular class to agglomerate in a limited region of the hyperspace. The immediate consequence is the propensity of networks trained with SoftMax loss to make highly confident predictions on examples that lie in regions very far away from the training examples, which explains their low out-of-distribution detection performance [9].
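A small numerical sketch of this geometric argument, using hypothetical weights and feature vectors that are not taken from the paper: moving a feature vector farther along the same direction pushes it deeper into the positive subspace of one class, and the confidence of an affine (inner product plus bias) last layer grows accordingly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def affine_logits(features, weights, biases):
    """SoftMax-loss logits: one inner product plus bias per class."""
    return [
        sum(w * f for w, f in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]


# Hypothetical two-class last layer in a 2-D feature space.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]

near_features = [1.0, 0.5]
far_features = [100.0, 50.0]  # same direction, 100 times farther from the origin

near_confidence = max(softmax(affine_logits(near_features, weights, biases)))
far_confidence = max(softmax(affine_logits(far_features, weights, biases)))
print(near_confidence, far_confidence)
```

The far point is nowhere near any plausible training region, yet it receives the more confident prediction.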

The main characteristic of the Mahalanobis distance used in [8] is that it is locally isotropic around the produced prototypes. The fact that it achieved high OOD detection performance indicates that deploying locally isotropic spaces around class prototypes improves OOD detection. Neural networks trained with SoftMax loss are based on affine transformations in the last layer, which are essentially inner products. Consequently, the last layer representations of such networks tend to align with the direction of the weight vectors, producing a preferential direction in space and, therefore, anisotropy. Designing a loss that depends only on the distances of high-level representations to class prototypes is a possible way to avoid this anisotropy. A distance-based loss prevents the network from learning preferred directions in the feature space and enforces local isotropy during training, avoiding metric learning post-processing or hyperparameter validation.

Distance-based losses have been studied in the context of few-shot learning. [27] used metric and transfer learning on pretrained features, while [28] proposed an offline procedure to calculate prototypes. In both cases, prototypes are not calculated seamlessly during the network backpropagation training. Additionally, while [27] used the Mahalanobis distance, [28] proposed the squared Euclidean one.

In IsoMax, to build a straightforward procedure to perform OOD detection, distance-based logits are incorporated directly into the loss used to train the neural network. Therefore, the prototypes are treated as usual weights and learned during the regular backpropagation procedure. We experimentally observed that using the regular non-squared Euclidean distance performed best. Therefore, IsoMax loss is constructed with the negative of the non-squared Euclidean distance, which is given by the expression $-\lVert f_\theta(x) - p_\phi^j \rVert$, where $p_\phi^j$ represents the seamlessly learnable prototype associated with class $j$. The class prototypes have the same dimension as the last layer representations. As there is no bias, IsoMax loss has fewer parameters than SoftMax loss.
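A minimal sketch of distance-based logits, assuming hypothetical 2-D prototypes (in the actual method the prototypes are learned by backpropagation). Because each logit depends only on the distance to a prototype, two feature vectors at the same distance from it receive the same logit regardless of direction.

```python
import math


def distance_logits(features, prototypes):
    """Negative non-squared Euclidean distance to each class prototype."""
    return [-math.dist(features, prototype) for prototype in prototypes]


# Hypothetical 2-D prototypes for three classes.
prototypes = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]

# Two feature vectors at the same distance from prototype 0, in different directions.
east = distance_logits([1.0, 0.0], prototypes)
north = distance_logits([0.0, 1.0], prototypes)
print(east)
print(north)
```

There is one prototype per class and no bias term, matching the parameter count noted above.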

3.2 The Principle of Maximum Entropy

The Principle of Maximum Entropy, formulated by E. T. Jaynes to unify the statistical mechanics and information theory entropy concepts [29], [30], states that when estimating probability distributions, we should choose the one which produces the maximum entropy consistent with the given constraints [31]. Following this principle, we avoid introducing additional assumptions or bias. In other words, from a set of trial probability distributions that satisfactorily describe the prior knowledge available, the one that presents the maximal information entropy (the least informative option) represents the best choice.

The Principle of Maximum Entropy has been studied as a regularization factor in classification tasks [32], [33]. In some cases, it has also been used in classification tasks as a direct optimization procedure without connection to the mechanism of cross-entropy minimization or backpropagation. For example, in [34], [35], [36], the maximization of the entropy subject to a constraint on the expected classification error is shown to be equivalent to solving an unconstrained Lagrangian. Despite being theoretically well-grounded [37], [38], [39], [40], direct entropy maximization presents high computational complexity as it is an NP-complete problem [37], [39].

Alternatively, modern neural networks are trained using the computationally efficient cross-entropy minimization. However, this procedure does not prioritize posterior probability distributions with high entropy. In fact, the opposite is true: the minimization of cross-entropy has the undesired side effect of producing overconfident, low mean entropy posterior probability distributions. Hence, we use the Principle of Maximum Entropy as motivation to construct high entropy posterior probabilities while still relying on computationally efficient cross-entropy minimization. Additionally, we present substantial experimental evidence to show that increased posterior probability distribution entropy correlates with improved OOD detection performance.

Unlike the previously mentioned works, we are neither using the Principle of Maximum Entropy to motivate the construction of regularization mechanisms (such as label smoothing or confidence penalty) nor performing direct Maximum Entropy optimization. Indeed, the entropy is not even calculated during IsoMax loss training. Since our approach does not directly maximize the entropy, we cannot state that the proposed method produces the highest available mean entropy for the posterior probability distribution. Nevertheless, the experiments show that the average entropy of our approach is high enough to improve the OOD detection performance significantly. Thus, our approach may be seen as a computationally efficient procedure to obtain high entropy posterior distributions that avoids the extremely high computational cost of performing a direct entropy maximization.

Equation (2) explains the behavior of the cross-entropy and the entropy for the SoftMax loss in terms of the logits associated with each class j and with the correct class k. When minimizing the first term of the mentioned equation, extremely high probabilities are generated for the correct class. Consequently, very low entropy posterior probability distributions are produced. The usual cross-entropy loss minimization tends to generate unrealistically overconfident (low entropy) probability distributions. Therefore, there is an opposition between cross-entropy loss minimization and the Principle of Maximum Entropy.
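The chain of implications described in this paragraph can be sketched as follows. The notation is an assumption made for this illustration ($L_j(x)$ for the logit of class $j$, $\mathcal{H}$ for the Shannon entropy) and is not necessarily the notation of the original equation (2).

```latex
\mathcal{L}_S(\hat{y}^{(k)} \mid x)
  = -\log \frac{\exp(L_k(x))}{\sum_j \exp(L_j(x))}
  = -\log p(y^{(k)} \mid x),
\qquad
\mathcal{L}_S \to 0
\;\Longrightarrow\; p(y^{(k)} \mid x) \to 1
\;\Longrightarrow\; \mathcal{H}\big(p(\cdot \mid x)\big) \to 0 .
```

Driving the loss to zero forces the probability of the correct class toward one, so the remaining classes share a vanishing probability mass and the entropy of the distribution collapses.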

The IsoMax loss straightforwardly reconciles these apparently contradictory objectives by multiplying the logits by what we call the Entropic Scale, which is present during training but removed for inference. Equation (3) demonstrates how the Entropic Scale allows the production of high entropy posterior distributions while relying on cross-entropy minimization. The Entropic Scale present during training allows the arguments of the exponential functions to become high enough to produce a low loss without producing extremely high probabilities for the correct classes, as these probabilities are calculated with the Entropic Scale removed. Hence, it is possible to build posterior probability distributions with high mean entropy, in agreement with the fundamental Principle of Maximum Entropy, despite using cross-entropy minimization. We can therefore define the IsoMax loss as:

$$\mathcal{L}_I(\hat{y}^{(k)} \mid x) = -\log\left(\frac{\exp(-E_s \lVert f_\theta(x) - p_\phi^k \rVert)}{\sum_j \exp(-E_s \lVert f_\theta(x) - p_\phi^j \rVert)}\right)$$
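The effect of the Entropic Scale can be checked numerically. The sketch below assumes hypothetical distances from one feature vector to four prototypes; the function names are illustrative. The training loss is computed with the scale applied, while the inference probabilities are computed with the scale removed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def isomax_train_loss(distances, correct_class, entropic_scale):
    """Cross-entropy on distance-based logits multiplied by the Entropic Scale."""
    probs = softmax([-entropic_scale * d for d in distances])
    return -math.log(probs[correct_class])


def isomax_inference_probs(distances):
    """Inference probabilities: same logits with the Entropic Scale removed."""
    return softmax([-d for d in distances])


# Hypothetical distances to four class prototypes; class 0 is the correct one.
distances = [0.5, 1.5, 1.5, 1.5]

train_loss = isomax_train_loss(distances, correct_class=0, entropic_scale=10.0)
inference_probs = isomax_inference_probs(distances)
print(train_loss, inference_probs[0], entropy(inference_probs))
```

With these numbers the training loss is already close to zero, while the probability assigned to the correct class at inference stays below one half and the entropy remains close to its maximum of $\log 4$.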

Table 1. Accurate, fast, efficient, scalable, and turnkey out-of-distribution detection approaches (neither classification accuracy drop, adversarial training, input preprocessing, temperature calibration, feature ensemble, nor ad hoc post-processing classification/regression). Since there is no hyperparameter to tune, no access to out-of-distribution or adversarial examples is required. SoftMax+MPS means training with SoftMax loss and performing OOD detection using the Maximum Probability Score (MPS) [6]. SoftMax+ES means training with SoftMax loss and performing OOD detection using the Entropic Score. IsoMax+ES means training with IsoMax loss and performing OOD detection using the Entropic Score. The best results are in bold. To the best of our knowledge, IsoMax+ES presents state-of-the-art performance under these assumptions.

Table 2. Unfair comparison of approaches with different requirements and side effects. ODIN uses input preprocessing, temperature calibration, and adversarial validation. Mahalanobis uses input preprocessing, feature ensemble, ad hoc post-processing classification/regression models, and adversarial validation. Input preprocessing makes ODIN and Mahalanobis inference three times slower and three times less energy/computationally efficient. ACET uses adversarial training, which implies slower training and reduced scalability for large images. ODIN, Mahalanobis, and ACET present hyperparameters that need to be validated for each dataset. IsoMax+ES neither uses those techniques nor presents hyperparameters to tune for novel datasets. The best results are in bold (2% tolerance).

Table 3. Performance metrics of neural networks trained using SoftMax and IsoMax losses for a combination of in-distributions and models. IsoMax loss produces very similar train and test classification accuracy while presenting much higher OOD detection performance (Table 1).

In the previous equation, k is the correct class. Experimentally, we observed that using Xavier [41] or Kaiming [42] initialization for prototypes made OOD detection performance oscillate: sometimes it improved, sometimes it decreased. Additionally, we experimentally observed a classification accuracy drop when using with affine transformations used in SoftMax loss. Hence, we decided always to initialize all prototypes to zero and to use non-squared Euclidean distance-based logits.

To calculate the cross-entropy loss, deep learning libraries usually combine the logarithm and probability calculations into a single computation. However, we experimentally observed that computing these sequentially as stand-alone operations significantly improves IsoMax performance. Since prototypes are regular learnable network weights, weight decay was applied to them. Finally, we can define the inference probabilities as:

$$p_I(y^{(i)} \mid x) = \frac{\exp(-\lVert f_\theta(x) - p_\phi^i \rVert)}{\sum_j \exp(-\lVert f_\theta(x) - p_\phi^j \rVert)}$$

3.3 The Entropic Score

Out-of-distribution detection approaches typically define a score to be used during inference to evaluate whether an example should be considered out-of-distribution. In a seminal work, [43] demonstrated that the entropy is the optimum measure of the randomness of a source of symbols. More broadly, we currently understand entropy as a measure of the uncertainty we have about a random variable. Therefore, considering that the uncertainty in classifying a specific sample should be an optimum metric to evaluate whether that example is out-of-distribution, we define our score to perform OOD detection, called the Entropic Score (ES), as the negative entropy of the output probabilities:

$$\mathrm{ES}(x) = -\mathcal{H}\big(p(y \mid x)\big) = \sum_i p(y^{(i)} \mid x) \log p(y^{(i)} \mid x)$$
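As a minimal sketch, the Entropic Score is a single pass over the output probabilities. The function name and the two example distributions are assumptions made for illustration; the natural logarithm is used here.

```python
import math


def entropic_score(probs):
    """Negative Shannon entropy (in nats) of a probability vector."""
    return sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical output distributions of a four-class classifier.
peaked = [0.97, 0.01, 0.01, 0.01]  # low entropy: likely in-distribution
flat = [0.25, 0.25, 0.25, 0.25]    # maximum entropy: likely out-of-distribution

print(entropic_score(peaked), entropic_score(flat))
```

Higher (less negative) scores indicate lower uncertainty, so a sample would be flagged as out-of-distribution when its score falls below a chosen threshold.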

By using the negative entropy as a score to evaluate whether a particular sample is out-of-distribution, we consider the information provided by all available outputs rather than relying on a single network output, for example, the maximum probability (as in the baseline, ODIN, and ACET) or the distance to the nearest prototype (as in Mahalanobis). Additionally, from a practical perspective, using this a priori score avoids the need to train an additional ad hoc regression model to detect out-of-distribution samples, which is required, for example, in Mahalanobis. Even more important, since no regression model needs to be trained, there is no need for unrealistic access to out-of-distribution samples or for generating adversarial examples for hyperparameter validation. Since ES is a predefined non-trainable score, it is available as soon as the neural network training finishes.

4 EXPERIMENTAL RESULTS

The source code to reproduce all the results is available as supplementary material. Considering that outlier exposure may be integrated and benefit both SoftMax and IsoMax losses, all experiments were performed without relying on outlier data [25], [26]. Similar arguments hold for background samples based approaches [44]. All datasets, models, training procedures, and metrics followed the baseline established in [6] and subsequently used in major OOD detection papers [7], [8], [9]. In this paper, only approaches that do not present classification accuracy drop were compared.

In our experiments, we trained from scratch 100-layer DenseNets [45] and 34-layer ResNets [46] on the CIFAR10 [47], CIFAR100 [47], and SVHN [48] datasets using SoftMax and IsoMax losses, following the same protocol presented in [8] (300 epochs, initial learning rate of 0.1 decayed by a factor of ten at epochs 150, 200, and 250, and a weight decay of 0.0001).
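The step schedule in this protocol can be written as a small function. This is a sketch under two assumptions that the text does not settle: epochs are counted from zero, and each drop takes effect at the start of the listed epoch.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(150, 200, 250), factor=10.0):
    """Step schedule: divide the learning rate by `factor` at each milestone epoch."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr / (factor ** drops)


for epoch in (0, 149, 150, 200, 250, 299):
    print(epoch, learning_rate(epoch))
```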

We used resized images from the TinyImageNet dataset [49] and the Large-scale Scene UNderstanding dataset (LSUN) [50], following the same protocol used in [8] to create out-distribution data.

To evaluate the OOD detection performance, we added these out-of-distribution images to the validation images present in the CIFAR10, CIFAR100, and SVHN datasets to form the composed test set. The performance was evaluated using three detection metrics. First, we calculated the True Negative Rate at 95% True Positive Rate (TNR@TPR95). In addition, we evaluated the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Detection Accuracy (DTACC), which corresponds to the maximum classification probability over all possible thresholds $\delta$:

$$1 - \min_{\delta} \big\{ P_{in}(o(x) \le \delta)\, P(x \text{ is from } P_{in}) + P_{out}(o(x) > \delta)\, P(x \text{ is from } P_{out}) \big\}$$

where $o(x)$ is a given OOD detection score. It is assumed that both positive and negative samples have equal probability of being in the test set, i.e., $P(x \text{ is from } P_{in}) = P(x \text{ is from } P_{out})$. All the above metrics follow the calculation detailed in [8].
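The three metrics can be sketched in plain Python as follows. This is an illustrative implementation, not the evaluation code of [8]: it assumes in-distribution samples are the positive class, that higher scores mean "more in-distribution", and it uses small made-up score lists. The quadratic-time AUROC is only suitable for small examples.

```python
def tnr_at_tpr95(in_scores, out_scores):
    """True negative rate at a threshold that accepts at least 95% of in-distribution samples."""
    ranked = sorted(in_scores)
    threshold = ranked[int(0.05 * len(ranked))]  # samples scoring >= threshold are accepted
    rejected = sum(1 for s in out_scores if s < threshold)
    return rejected / len(out_scores)


def auroc(in_scores, out_scores):
    """Probability that an in-distribution sample outscores an out-distribution one (ties count half)."""
    wins = 0.0
    for i in in_scores:
        for o in out_scores:
            if i > o:
                wins += 1.0
            elif i == o:
                wins += 0.5
    return wins / (len(in_scores) * len(out_scores))


def detection_accuracy(in_scores, out_scores):
    """1 - min over thresholds of the balanced error, with equal priors of 0.5."""
    candidates = sorted(set(in_scores) | set(out_scores))
    candidates = [candidates[0] - 1.0] + candidates  # also allow accepting everything
    best_error = 1.0
    for delta in candidates:
        missed = sum(1 for s in in_scores if s <= delta) / len(in_scores)
        false_alarms = sum(1 for s in out_scores if s > delta) / len(out_scores)
        best_error = min(best_error, 0.5 * missed + 0.5 * false_alarms)
    return 1.0 - best_error


# Made-up Entropic Scores: values closer to zero mean lower entropy.
in_scores = [-0.1, -0.2, -0.3, -0.9]
out_scores = [-1.0, -1.2, -0.25, -1.3]

print(tnr_at_tpr95(in_scores, out_scores))
print(auroc(in_scores, out_scores))
print(detection_accuracy(in_scores, out_scores))
```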

To define the global hyperparameter $E_s$, we trained DenseNets on SVHN using $E_s$ equal to 1, 3, and 10. We validated these possible values using the TNR@TPR95 metric and CIFAR100 as out-distribution (Fig. 1). It is important to emphasize that CIFAR100 was never used as an out-distribution in the subsequent experiments.

Fig. 1(a) shows that the SoftMax loss minimizes both the cross-entropy and the entropy of the posterior distribution. Fig. 1(d) shows that IsoMax loss is capable of minimizing the cross-entropy while keeping a high average posterior probability entropy, as recommended by the Principle of Maximum Entropy. Fig. 1(e) shows that OOD detection performance increases for higher Entropic Scales.

It was possible to define the Entropic Scale hyperparameter because the experiments showed that, once validated on a single metric, model, in-distribution, and out-distribution, the global value generalized well to all other metrics, models, in-distributions, and out-distributions (Fig. 1(e)). Considering that $E_s = 10$ already produces very high entropy probability distributions, we see no reason to increase it even more. Therefore, all experiments in this paper used $E_s = 10$, and no validation was performed for each new dataset, making our proposal turnkey. The value $E_s = 10$ presented the best OOD detection performance. It generalizes well to unseen out-distributions, as required for a satisfactory global hyperparameter candidate. Consequently, this same value was used for all other experiments (combinations of models, in-distributions, and out-distributions in Tables 1 and 2). With this value confirmed as an adequate global hyperparameter, the experiments showed that networks trained with IsoMax loss present classification accuracy extremely similar to SoftMax ones for all other datasets and models (Table 3).

4.1 OOD Detection Performance: Fair Scenario.

In Table 1, SoftMax+MPS presents the worst results. The Entropic Score produces a small positive effect when applied to networks trained with SoftMax loss. However, the combination of IsoMax loss with the same Entropic Score significantly improves the OOD detection performance across almost all metrics, in-distributions, and out-distributions.

4.2 OOD Detection Performance: Unfair Scenario.

Table 2 shows an unfair comparison of approaches that present different requirements and side effects. Input preprocessing (and consequently slower inference with higher power consumption) and validation on adversarial samples are used in both ODIN and Mahalanobis, while temperature calibration is required only in ODIN. Feature ensemble and ad hoc classification/regression models, which may imply limited scalability, are mandatory in Mahalanobis. ACET requires adversarial training, which may restrict its use to small images. ODIN, Mahalanobis, and ACET have hyperparameters tuned for each in-distribution. IsoMax+ES presents neither the mentioned drawbacks nor side effects.

Regardless of the previous considerations, the table shows that IsoMax+ES significantly outperforms ODIN in all evaluated scenarios. Additionally, IsoMax+ES usually outperforms ACET (sometimes by a large margin). Moreover, in more than half of the cases, even operating under much more favorable conditions, Mahalanobis surpasses IsoMax+ES by less than 2%. In some scenarios, the latter even outperforms the former despite avoiding hyperparameter validation, being native, scalable, and straightforward to implement and use, and presenting at least three times faster and more power-efficient inference. IsoMax+ES performs particularly well in one of the CIFAR100 cases, which may suggest that using all outputs to decide, as ES does, works even better when many classes are present. We speculate that recent advances in data augmentation techniques may help to improve IsoMax+ES OOD detection performance even further [51], [52].

5 CONCLUSION

In this paper, we proposed the IsoMax loss and the Entropic Score to show that the OOD detection performance of neural networks can be significantly improved in an accurate, fast, efficient, scalable, and turnkey way. This is achieved simply by replacing the SoftMax loss and using a predefined, meaningful, and information-theoretically well-founded score, without relying on ad hoc techniques and therefore avoiding their associated drawbacks, requirements, and side effects. However, if the abovementioned limitations are not a concern for a particular application, those techniques may be combined with IsoMax loss to achieve even higher OOD detection performance. Another promising approach could be the use of recent specialized data augmentation techniques. In future work, we intend to make the Entropic Scale a learnable parameter.

REFERENCES

[1] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” International Conference on Machine Learning, 2017.

[2] W. J. Scheirer, A. Rocha, A. Sapkota, and T. E. Boult, “Towards open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[3] W. J. Scheirer, L. P. Jain, and T. E. Boult, “Probability models for open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[4] A. Bendale and T. Boult, “Towards open world recognition,” Computer Vision and Pattern Recognition, 2015.

[5] E. Rudd, L. P. Jain, W. J. Scheirer, and T. Boult, “The extreme value machine,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[6] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” International Conference on Learning Representations, 2017.

[7] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” International Conference on Learning Representations, 2018.

[8] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” Neural Information Processing Systems, 2018.

[9] M. Hein, M. Andriushchenko, and J. Bitterwolf, “Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem,” Computer Vision and Pattern Recognition, 2018.

[10] E. Techapanurak, M. Suganuma, and T. Okatani, “Hyperparameter-free out-of-distribution detection using softmax of scaled cosine similarity,” arXiv preprint arXiv:1905.10628, 2019.

[11] Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira, “Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data,” arXiv preprint arXiv:2002.11297, 2020.

[12] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin, “On evaluating adversarial robustness,” arXiv preprint arXiv:1902.06705, 2019.

[13] Q. Yu and K. Aizawa, “Unsupervised out-of-distribution detection by maximum classifier discrepancy,” International Conference on Computer Vision, 2019.

[14] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke, “Out-of-distribution detection using an ensemble of self supervised leave-out classifiers,” European Conference on Computer Vision, 2018.

[15] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” Neural Information Processing Systems, 2017.

[16] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” Neural Information Processing Systems, 2017.

[17] C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl, “Leveraging uncertainty information from deep neural networks for disease detection,” Scientific Reports, 2017.

[18] A. Malinin and M. Gales, “Predictive uncertainty estimation via prior networks,” Neural Information Processing Systems, 2018.

[19] V. Kuleshov, N. Fenner, and S. Ermon, “Accurate uncertainties for deep learning using calibrated regression,” arXiv preprint arXiv:1807.00263, 2018.

[20] A. Subramanya, S. Srinivas, and R. V. Babu, “Confidence estimation in deep neural networks via density modelling,” arXiv preprint arXiv:1707.07013, 2017.

[21] A. Shafaei, M. Schmidt, and J. J. Little, “A less biased evaluation of out-of-distribution sample detectors,” British Machine Vision Conference, 2019.

[22] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, “Green Artificial Intelligence,” arXiv preprint arXiv:1907.10597, 2019.

[23] T. DeVries and G. W. Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.

[24] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” International Conference on Machine Learning, 2016.

[25] D. Hendrycks, M. Mazeika, and T. Dietterich, “Deep anomaly detection with outlier exposure,” International Conference on Learning Representations, 2019.

[26] A.-A. Papadopoulos, M. R. Rajati, N. Shaikh, and J. Wang, “Outlier exposure with confidence control for out-of-distribution detection,” arXiv preprint arXiv:1906.03509, 2019.

[27] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Distance-based image classification: Generalizing to new classes at near-zero cost,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[28] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical networks for few-shot learning,” Neural Information Processing Systems, 2017.

[29] E. T. Jaynes, “Information Theory and Statistical Mechanics,” Phys. Rev., 1957.

[30] ——, “Information Theory and Statistical Mechanics. II,” Phys. Rev., 1957.

[31] T. M. Cover and J. A. Thomas, “Elements of Information Theory,” Wiley Series in Telecommunications and Signal Processing, 2006.

[32] A. Dubey, O. Gupta, R. Raskar, and N. Naik, “Maximum-entropy fine grained classification,” Neural Information Processing Systems, 2018.

[33] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, “Regularizing neural networks by penalizing confident output distributions,” arXiv preprint arXiv:1701.06548, 2017.

[34] D. Miller, A. Rao, K. Rose, and A. Gersho, “A maximum entropy approach for optimal statistical classification,” IEEE Workshop on Neural Networks for Signal Processing, 1995.

[35] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, 1996.

[36] J. Shawe-Taylor and D. Hardoon, “PAC-Bayes analysis of maximum entropy classification,” International Conference on Artificial Intelligence and Statistics, 2009.

[37] J. Pearl, “Probabilistic reasoning in intelligent systems: Networks of plausible inference,” Morgan Kaufmann Publishers Inc., 1988.

[38] J. Williamson, “Objective bayesian nets,” We Will Show Them!, 2005.

[39] ——, “Philosophies of probability,” Philosophy of Mathematics: Handbook of the Philosophy of Science, 2009.

[40] ——, “In defence of objective bayesianism,” Oxford University Press, 2013.

[41] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” International Conference on Artificial Intelligence and Statistics, 2010.

[42] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” International Conference on Computer Vision, 2016.

[43] C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, 1948.

[44] A. R. Dhamija, M. Günther, and T. Boult, “Reducing network agnostophobia,” Neural Information Processing Systems, 2018.

[45] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” Computer Vision and Pattern Recognition, 2017.

[46] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” Lecture Notes in Computer Science, 2016.

[47] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Science Department, University of Toronto, 2009.

[48] Y. Netzer and T. Wang, “Reading digits in natural images with unsupervised feature learning,” Neural Information Processing Systems, 2011.

[49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” Computer Vision and Pattern Recognition, 2009.

[50] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, “LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.

[51] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, and S. Michalak, “On mixup training: Improved calibration and predictive uncertainty for deep neural networks,” Neural Information Processing Systems, 2019.

[52] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization strategy to train strong classifiers with localizable features,” International Conference on Computer Vision, 2019.
