GIM: Gaussian Isolation Machines

2020·Arxiv

Abstract

In many cases, neural network classifiers are likely to be exposed to input data that lies outside their training distribution. Softmax-based classifiers may assign such samples to an existing class with high probability; these incorrect classifications affect the performance of the classifiers and of the applications and systems that depend on them. Previous research aimed at distinguishing training distribution data from out-of-distribution (OOD) data has proposed detectors that are external to the classification method. We present the Gaussian isolation machine (GIM), a novel hybrid (generative-discriminative) classifier aimed at solving the problem that arises when OOD data is encountered. The GIM is based on a neural network and utilizes a new loss function that imposes a distribution on each of the trained classes in the neural network’s output space, which can be approximated by a Gaussian. The GIM’s novelty lies in its combination of discriminative performance and generative capabilities, characteristics not usually seen together in a single classifier. The GIM achieves state-of-the-art classification results on image recognition and sentiment analysis benchmarking datasets and can also deal with OOD inputs.

Index Terms—Deep Neural Networks, Confidence metric, Gaussians, Generative modeling, Out Of Distribution Data Detection, Regularization, Representation Learning

I. INTRODUCTION

In recent years, neural networks have been used successfully for classification tasks in numerous domains, including computer vision [1]–[3], natural language processing [4]–[6], voice recognition [7], and even in the domain of

The use of softmax classification layers (a dense layer paired with a softmax activation function) [12], [13] is common practice in neural network classifiers. The softmax classification layer is a linear classifier with respect to the output of the previous layers of the neural network. Softmax layers are used because of their probabilistic outputs. According to [12], pairing the softmax layer with the cross-entropy loss provides improvements in convergence speed that are not seen when the output layer is paired with other types of loss functions. The softmax layer draws linear decision boundaries between the classes, whereas all of the other layers extract features. These decision boundaries divide the space into areas, such that each class has its own area. Given a new sample, the network determines which area the sample belongs to and classifies the sample accordingly. This study is aimed at better understanding the extent to which the area assigned to a class is “actually the class itself.”

Figure 1 illustrates the input space of a neural network trained on data sampled from three two-dimensional Gaussians. The colors in the figure indicate the network’s confidence in the prediction; the darker the color, the more confident the network is regarding the predicted class. The instances used for training the network appear in each class region. Three observations can be made based on the figure. First, large regions in the decision space do not contain input data from any of the three classes, yet the probability output in these regions is high (dark blue). Second, when the network is presented with a fourth class (the group of points in green, termed the “untrained class”), whose samples come from an unknown distribution, the network classifies it with high confidence as one of the trained classes, instead of issuing an alert stating that it has encountered data from an unknown distribution. Third, no area between the classes is reserved for instances of other classes; even on the decision boundary, the probability of belonging to one of these classes is 0.3. These three observations demonstrate that neural networks do not actually capture the distribution of each class in the training set but rather learn how to differentiate between classes, without considering the possibility that the input instance may (mistakenly or even intentionally) not belong to any of the classes in the training set. The main problem with this type of learning, i.e., softmax-based learning, is that the classifier does not consider the possibility that other types of data might exist. As shown in Figure 1, the fourth class is recognized as class 3 with high confidence.
In real-world scenarios such phenomena occur frequently (e.g., when an unexpected object appears in front of an autonomous vehicle, or when a facial recognition system tries to recognize a person whose head is tilted, causing the classification probability to be indecisive, i.e., different people receive a similar probability). Some methods have aimed to solve this problem; however, most of them do so by adding external components to the classifier. Our goal was to create a classifier that is intrinsically capable of detecting OOD data and whose performance is comparable to that of discriminative models.
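The overconfidence described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a hypothetical linear softmax layer over a two-dimensional feature space; the weights and test points are illustrative values and are not taken from the paper or from Figure 1.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear layer with three classes: each row is (w1, w2, bias).
WEIGHTS = [(1.0, 0.0, 0.0), (-0.5, 0.87, 0.0), (-0.5, -0.87, 0.0)]

def class_probabilities(point):
    """Softmax probabilities of the linear layer for a 2-D point."""
    x1, x2 = point
    logits = [w1 * x1 + w2 * x2 + b for w1, w2, b in WEIGHTS]
    return softmax(logits)

# A point near the region where class 0 samples might lie.
near = class_probabilities((1.0, 0.0))
# A point far away from any plausible training data, in the same direction.
far = class_probabilities((50.0, 0.0))

print("max probability near the data:", max(near))
print("max probability far from the data:", max(far))
```

Because the logits grow linearly with the distance from the decision boundary, the far point receives a higher maximum probability than the near one, even though no training data would support it.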

In this paper, we propose a hybrid classification method that is based on neural networks and a new loss function and that aims to solve the aforementioned problem caused by the softmax layer. Our approach utilizes concepts of generative and discriminative modeling [14] to create a hybrid classification method with a built-in confidence metric that enables it to deal with data from other distributions.

We evaluate our method on four datasets (three computer vision benchmarking datasets and one sentiment analysis dataset) and various neural network architectures, and show that it can achieve accuracy comparable to that of standard neural networks and is capable of dealing with data outside of the trained distribution, without employing additional anomaly detection algorithms or input preprocessing.

The main contribution of this paper is a neural network-based hybrid classifier with state-of-the-art accuracy. The classifier’s accuracy is similar to that of discriminative models, it is inherently capable of identifying data from other distributions, and, like generative models, it calculates a confidence score for its predictions.

II. RELATED WORK

A. Generative Classifiers vs Discriminative Classifiers

Machine learning classifiers are often divided into two families: generative and discriminative. The difference between the two is the information produced when calculating the prediction.

Formally, let x be a sample and y be a label. Generative classifiers learn a model of the joint probability p(x, y) and make a prediction by using Bayes’ rule to calculate p(y|x). This enables p(x|y) to be calculated as well. Discriminative classifiers do not calculate p(x|y); instead, they predict the posterior probability p(y|x) directly [14]. The term p(x|y) can be interpreted as a confidence rate for the prediction, i.e., the likelihood of observing x from the input space given a specific label y, which is used as a measure of y being the correct prediction for x. Although the confidence rate can be useful in various applications, in practice, generative classifiers are not usually used because they are outperformed by discriminative classifiers.

B. Identification of Out-of-Distribution Data with Neural Networks

A classifier that can identify whether a sample is not from the same distribution as the training data is capable of handling unpredictable inputs. The presence of unpredictable inputs can be intentional or accidental. Technically, identifying out-of-distribution (OOD) data means that the model labels the input as OOD instead of assigning it to a specific class. Classifiers based exclusively on softmax do not inherently have this capability, as softmax assigns every input to some class. Some research has been performed on unsupervised means of OOD detection, such as [15]–[17]. However, because the proposed methods use components that are external to the classifier, they require the training of an additional component/model for each class. Hendrycks et al. [18] established a baseline method which is based on softmax. Later, Liang et al. [19] introduced ODIN. ODIN uses a temperature-scaled softmax, as in distillation [20], and combines it with adversarial-like perturbations [21] of the input in order to predict whether the input is in or out of the distribution. This method does not require additional training, but it does require two feedforward passes and one backpropagation pass, which makes it impractical for real-time use. The most recent research on OOD detection was conducted by DeVries et al. [22]. Their method adds an external functionality to the existing neural network design: it is based on a neural network with softmax output, with the addition of an output neuron that serves as an OOD detector, and the added neuron must be trained using a novel method which is described in their paper. In contrast, we introduce a method inherently capable of detecting OOD inputs. We compare our results with the baseline proposed by Hendrycks et al. [18], which is the most closely related study performed on the subject, as it decides whether the input is OOD by relying solely on representations learned by the classifier.
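The softmax-based baseline [18] can be sketched in a few lines. This is a simplified illustration, assuming that the detection score is the maximum softmax probability and that a hypothetical threshold has already been chosen on held-out in-distribution data; the logits below are made-up values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_score(logits):
    """Detection score of the baseline: the largest class probability."""
    return max(softmax(logits))

def is_out_of_distribution(logits, threshold):
    """Flag the input as OOD when the score falls below the threshold."""
    return max_softmax_score(logits) < threshold

THRESHOLD = 0.5  # hypothetical value, for illustration only

confident_logits = [4.0, 0.5, 0.2]   # one class clearly dominates
flat_logits = [1.0, 0.9, 1.1]        # no class dominates

print(is_out_of_distribution(confident_logits, THRESHOLD))
print(is_out_of_distribution(flat_logits, THRESHOLD))
```

The score is computed from the classifier's own output, so no extra model is trained, but an OOD input that happens to produce one large logit will still pass the check.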

C. Multivariate Gaussians in Neural Networks

Gaussians and neural networks are both known for their ability to approximate functions and data distributions, and the literature proposes combining them in many domains. Some articles use Gaussians as part of the neural network itself, such as [23], where Gaussians are used as activation functions, and [24], where the layers of the neural network follow a Gaussian mixture model (GMM). In general, a notable trait of a multivariate Gaussian is that knowing its parameters (mean vector and covariance matrix) allows easy generation of new samples. For example, in variational autoencoders [25], neural networks are used to estimate the parameters of a multivariate Gaussian; the parameters are then used to produce samples from the distribution of the data in the input space. Another issue related to the combination of multivariate Gaussians and neural networks is how to integrate GMMs into neural networks in order to perform classification. In a recent study, Tüske et al. [26] took an approach similar to ours, proposing the integration of latent variables into the last layer of the neural network. In their work, the last layer represents the parameters of the GMM; a GMM for the sample is calculated, and the prediction is made accordingly. In our method, we approximate each class representation using a multivariate Gaussian distribution (a special type of GMM), but in contrast to [26], we do not incorporate the parameters of the Gaussians in the neural network, meaning that the parameters do not need to be learned explicitly, which makes training easier. Another important difference is that Tüske et al. [26] use the same covariance matrix for all of the classes, while we approximate a different covariance matrix for each class.

D. Large Margin Algorithms in Neural Networks

As shown by Boser et al. [27], one of the desired qualities of a classifier is to have large margins between the representations of the classes. Liang et al. [28] explored the effects of large margins between the classes in neural networks. Cross-entropy loss is currently the loss function most frequently used by neural network-based classifiers. Sun et al. [29] empirically showed that the cross-entropy loss does not encourage a large margin. Neural networks contain many representations of the data, and hence, a reasonable approach for achieving the large margin effect in neural networks is to form large margins in some of the network’s inner representations [28], [30], [31]. In [32], Elsayed et al. proposed a new optimization target aimed at replacing the softmax/cross-entropy combination completely. In our method, we also replace the softmax/cross-entropy combination with a new optimization target, but our method enforces a more general margin definition. The algorithms proposed by Elsayed et al. aim to maximize the distance between the closest points of different classes, while our method tries to maximize the distance between the means of the classes, which represents a relaxation of the formal definition of a large margin between the classes.

III. METHODOLOGY

Fig. 2. Gaussian isolation machine trained on three classes predicting the fourth class.

In this paper, we propose a means of improving the design of neural network classifiers so that they can cope with input that does not belong to any of the classes in the training set. The proposed classifier takes advantage of the fact that neural networks can model complex probability distributions to transform the input of the neural network into a vector space in which each class has an approximately known probability distribution: a Gaussian. More specifically, we train the neural network to produce an output space where the model’s output has a simple distribution, thus forcing the samples from each class to behave like dense, isolated clusters in the output space. Forcing each class to be dense improves its approximation by a multivariate Gaussian with a diagonal covariance matrix, and separating the clusters from one another creates large distances between the classes, which is similar to creating large margins. In contrast to softmax-based neural networks, which take a discriminative approach and directly model p(y|x), the Gaussian isolation machine models p(y|x) as follows:

where f(x) is the network output, and y is the class label. We use the likelihood probability component as our confidence rate, and this allows us to deal with data from untrained distributions. The modeling technique used is similar to that of a generative model, but we consider it a hybrid model, because it is a discriminative model that is generative towards the output of the neural network and not toward the input space. We formulated two ways of controlling the distribution of each class representation: the first one controls the class representation density (CTV loss), and the second one controls the class representation spread (CH loss).
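The decomposition described in this paragraph can be written out explicitly. The following is one form consistent with the prose, assuming that Bayes’ rule is applied to the network output f(x) rather than to the input x; the normalizing sum over all labels y' is an assumption made here to turn the product into a probability.

```latex
p(y \mid x) \;\approx\; p\bigl(y \mid f(x)\bigr)
  \;=\; \frac{p\bigl(f(x) \mid y\bigr)\, p(y)}
             {\sum_{y'} p\bigl(f(x) \mid y'\bigr)\, p(y')}
```

In this form, the likelihood p(f(x)|y) is the component used as the confidence rate, and p(y) is the prior over the labels.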

Consider a classification problem with |C| classes, such that c ∈ C, where c is the class, and a function f: R^n → R^d, such that d is not necessarily equal to |C|, which represents a neural network. We define the following metrics:

1) Class Mean Vector:

The mean vector μ_c of class c in the neural network output space.

2) Class Neighborhood Probability:

where c_i and c_j are classes, and μ_{c_i} and μ_{c_j} are the mean vectors of the classes (Equation (2)). This metric corresponds to the unnormalized probability of class c_j being near class c_i, assuming a Gaussian distribution on class c_i with a diagonal covariance matrix whose diagonal elements are all equal to σ^2. For optimization purposes, the class neighborhood probability equation (Equation (3)) is insufficient for separating the classes and achieving the large margin effect, because when the probability is low, there is no need for the network to separate the classes further. Therefore, for optimization, we use the following modified version of the equation to ensure class separation and the desired large margin effect:

where α is a large constant, compared to the actual class covariance matrix diagonal elements. This is similar to the method used by Hinton et al. [20], but it is used for a different purpose. In our method, during optimization this constant forces the assumed Gaussians to (1) cover more space, and (2) always include the other classes.
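The first two metrics can be written out under some assumptions. The following is a reconstruction from the prose and not necessarily the authors’ exact notation: N_c (the number of training samples of class c), σ^2 (the shared diagonal variance), and α (the large constant) are symbol names assumed here, and the Gaussian is taken to be isotropic and unnormalized, as the text describes.

```latex
% Eq. (2), assumed form: mean of the network outputs over the samples of class c
\mu_c = \frac{1}{N_c} \sum_{x \in c} f(x)

% Eq. (3), assumed form: unnormalized Gaussian centered on one class mean,
% evaluated at the mean of the other class
P(c_i, c_j) = \exp\!\left( -\frac{\lVert \mu_{c_i} - \mu_{c_j} \rVert^2}{2\sigma^2} \right)

% Eq. (4), assumed form: the variance is replaced by a large constant
P_\alpha(c_i, c_j) = \exp\!\left( -\frac{\lVert \mu_{c_i} - \mu_{c_j} \rVert^2}{2\alpha} \right),
\qquad \alpha \gg \sigma^2
```

Under these assumptions, a large α keeps the value well above zero even for distant class means, so the term continues to provide a useful gradient for pushing the means apart.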

3) Center Distance:

where μ_c is the class mean vector (Equation (2)). This metric is the distance of a sample’s output from the class mean.

4) Center loss [33] as the Class Total Variance:

The class total variance (CTV) is the first moment of the center distance (Equation (5)) over the samples of class c:

The class total variance is equal to the sum of the diagonal elements of the covariance matrix of class c:
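The equality between the class total variance and the trace of the covariance matrix follows directly if the center distance is the squared Euclidean distance, which is assumed here. In the short derivation below, N_c is the number of samples of class c, f_k(x) is the k-th output coordinate, and Σ_c is the covariance matrix of the class estimated with the 1/N_c normalization; these symbol names are assumptions.

```latex
\mathrm{CTV}(c)
  = \frac{1}{N_c} \sum_{x \in c} \lVert f(x) - \mu_c \rVert^2
  = \frac{1}{N_c} \sum_{x \in c} \sum_{k=1}^{d} \bigl( f_k(x) - \mu_{c,k} \bigr)^2
  = \sum_{k=1}^{d} \underbrace{\frac{1}{N_c} \sum_{x \in c} \bigl( f_k(x) - \mu_{c,k} \bigr)^2}_{(\Sigma_c)_{kk}}
  = \operatorname{tr}(\Sigma_c)
```

Minimizing the CTV therefore shrinks the sum of the per-coordinate variances without constraining the off-diagonal covariance terms.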

The class homogeneity (CH) is the second moment of the center distance (Equation (5)) over the samples of class c, and it defines the variance of the distances from the center of the class.

The minimization of Equation (6) will effectively shrink the diagonal elements of the covariance matrix of the class in the output space, thus making its representation small and dense. This minimization allows us to assume a multivariate Gaussian distribution for each class with a diagonal covariance matrix; these diagonal covariance matrices are employed when making predictions.

Combining Equations (6) and (4) yields the first of the two optional loss functions for the GIM, the CTV loss. This loss function both controls the class representation size in the output space and ensures that the representations of the classes will be far apart from one another.

Combining the class neighborhood probability (Equation (4)) and the class homogeneity (Equation (7)) yields the second loss function for the GIM, the CH loss. This loss function controls the class spread by ensuring that the variance of the distances from the class mean vector is small.
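The building blocks of the two losses can be illustrated numerically. The sketch below assumes that the center distance is the squared Euclidean distance, that the class homogeneity is the variance of those distances, and that the neighborhood probability is an unnormalized isotropic Gaussian; the function names, the `scale` parameter, and the toy outputs are hypothetical. How the terms are weighted when combined into a loss is not specified here, so no combined loss is shown.

```python
import math

def class_mean(outputs):
    """Mean vector of a class in the output space."""
    n, d = len(outputs), len(outputs[0])
    return [sum(o[k] for o in outputs) / n for k in range(d)]

def center_distance(output, mean):
    """Squared Euclidean distance from the class mean (assumed form)."""
    return sum((o - m) ** 2 for o, m in zip(output, mean))

def class_total_variance(outputs):
    """First moment of the center distance over the class samples."""
    mean = class_mean(outputs)
    return sum(center_distance(o, mean) for o in outputs) / len(outputs)

def class_homogeneity(outputs):
    """Variance of the center distances over the class samples."""
    mean = class_mean(outputs)
    dists = [center_distance(o, mean) for o in outputs]
    avg = sum(dists) / len(dists)
    return sum((d - avg) ** 2 for d in dists) / len(dists)

def neighborhood_probability(mean_i, mean_j, scale):
    """Unnormalized isotropic Gaussian; `scale` plays the role of the variance."""
    return math.exp(-center_distance(mean_i, mean_j) / (2.0 * scale))

# Toy network outputs for two classes in a 2-D output space.
class_a = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
class_b = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)]

ctv_a = class_total_variance(class_a)
ch_a = class_homogeneity(class_a)
mean_a, mean_b = class_mean(class_a), class_mean(class_b)
p_small = neighborhood_probability(mean_a, mean_b, scale=1.0)
p_large = neighborhood_probability(mean_a, mean_b, scale=100.0)

print(ctv_a, ch_a, p_small, p_large)
```

With a scale close to the actual class variance, the neighborhood probability is already vanishingly small and offers almost no signal for further separation, while the large constant keeps it at a usable magnitude. In this toy class every sample lies at the same distance from the mean, so the class homogeneity is zero even though the total variance is not.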

We trained individual neural networks to minimize Equations (7) and (9), using different neural network architectures in which the last layer is of arbitrary size (e.g., 24, 32, 64). As we hoped, the distribution of the network’s output is similar to that of a GMM, with a large Euclidean distance between the clusters (classes).

Knowing that the output space is approximately distributed as a multivariate Gaussian gives us access to the likelihood term p(f(x)|c), which can be thought of as a confidence metric for the neural network’s predictions. The likelihood is the probability density of the output f(x) of a sample x under a certain class in the neural network’s output space, and it is defined by the multivariate Gaussian probability density function as follows:

In practice, we use the log() of the probability. We use the log-likelihood as a confidence metric that allows us to differentiate between in-distribution data and out-of-distribution data by setting a threshold on its value, so that inputs that result in a value lower than this confidence threshold are labeled as out-of-distribution.
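The thresholding step can be sketched as follows. This is a minimal illustration, assuming a diagonal covariance matrix per class and taking the confidence of an input to be the highest class log-likelihood; the class parameters and the threshold are hypothetical values, not results from the paper.

```python
import math

def diagonal_gaussian_log_pdf(output, mean, variances):
    """Log density of a multivariate Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
        for o, m, v in zip(output, mean, variances)
    )

def confidence(output, class_params):
    """Highest class log-likelihood of a network output (assumed definition)."""
    return max(
        diagonal_gaussian_log_pdf(output, mean, variances)
        for mean, variances in class_params
    )

def is_out_of_distribution(output, class_params, threshold):
    """Label the input as OOD when its confidence is below the threshold."""
    return confidence(output, class_params) < threshold

# Hypothetical per-class (mean, diagonal variances) in a 2-D output space.
CLASS_PARAMS = [((0.0, 0.0), (1.0, 1.0)), ((10.0, 10.0), (1.0, 1.0))]
THRESHOLD = -10.0  # hypothetical log-likelihood threshold

inside = (0.5, -0.5)   # close to the first class mean
outside = (5.0, 5.0)   # halfway between the clusters, far from both

print(confidence(inside, CLASS_PARAMS), confidence(outside, CLASS_PARAMS))
print(is_out_of_distribution(inside, CLASS_PARAMS, THRESHOLD))
print(is_out_of_distribution(outside, CLASS_PARAMS, THRESHOLD))
```

Unlike a softmax score, the log-likelihood drops for points that are far from every class, including points that lie between the clusters.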

To make a prediction, the GIM follows an approach similar to that of many generative classifiers: it combines the confidence (log-likelihood) term with a prior over the class labels and uses Bayes’ rule to approximate p(y|x) as follows:

Prior over the labels:

Posterior probability for classification:

In practice, we use the log() of the whole expression, in order to avoid numerical errors. As can be seen in Figure 2, our method has a different decision boundary shape, so that rather than splitting the input space into three areas (as seen in Figure 1), a heat map is created for each class probability distribution.
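The prediction rule can be sketched in log space. The example below assumes that the prior of a class is its relative frequency in the training set and that the posterior is normalized with the log-sum-exp trick to avoid underflow; the class parameters and counts are hypothetical.

```python
import math

def diagonal_gaussian_log_pdf(output, mean, variances):
    """Log density of a multivariate Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
        for o, m, v in zip(output, mean, variances)
    )

def log_posteriors(output, class_params, priors):
    """Log of p(class | output) for every class, via Bayes' rule."""
    joint = [
        diagonal_gaussian_log_pdf(output, mean, variances) + math.log(prior)
        for (mean, variances), prior in zip(class_params, priors)
    ]
    # Log-sum-exp: subtract the largest term before exponentiating.
    peak = max(joint)
    log_evidence = peak + math.log(sum(math.exp(j - peak) for j in joint))
    return [j - log_evidence for j in joint]

def predict(output, class_params, priors):
    """Index of the class with the highest log posterior."""
    scores = log_posteriors(output, class_params, priors)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical per-class (mean, diagonal variances) and training-set counts.
CLASS_PARAMS = [((0.0, 0.0), (1.0, 1.0)), ((4.0, 0.0), (1.0, 1.0))]
CLASS_COUNTS = [50, 50]
PRIORS = [count / sum(CLASS_COUNTS) for count in CLASS_COUNTS]

log_post = log_posteriors((1.0, 0.0), CLASS_PARAMS, PRIORS)
predicted = predict((1.0, 0.0), CLASS_PARAMS, PRIORS)
print(predicted, [math.exp(lp) for lp in log_post])
```

Working with log values keeps the computation stable when the likelihoods are extremely small, which is common in high-dimensional output spaces.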

IV. EVALUATION

A. Experimental Settings

In this section, we evaluate the performance of the GIM on three tasks and compare its performance to that of standard neural networks. Our evaluation shows that the Gaussian isolation machine achieves similar classification results and has a similar convergence speed to that of standard neural networks, while possessing the inherent ability to detect OOD data with high accuracy. We evaluate the GIM on two classic classification tasks and three out-of-distribution data detection tasks. For classification, we chose three standard image recognition benchmark datasets and one sentiment analysis benchmark dataset. In all of the classification experiments, we compare our method to state-of-the-art neural network classifiers with the same architecture, except that we removed the last layer (weights and softmax activation) of the GIMs; as a result, they have fewer parameters. We created two scenarios for the experiments on identifying untrained distribution (OOD) data. In these experiments, we trained a GIM on several classes of a dataset and determined whether data from the remaining classes is classified by the GIM as one of the trained classes. In addition, we performed an experiment similar to the one presented by Liang et al. [19] to compare the GIM’s detection abilities to the baseline detector [18]. To measure the classification accuracy and convergence speed, we used the architectures presented in Table I. For the CIFAR 10 dataset, we trained a ResNet20v1, like the one presented in [1], which is a very compact ResNet with under 300K trainable parameters. We used Keras [34] to implement the neural network, but because our implementation of ResNet20v1 uses random data augmentation, we do not obtain exactly the same results as those presented in [1]. However, after several trials we were able to achieve 91.2% accuracy, matching the results in that paper. In addition to the ResNet20v1, we also trained a VGG16 [35] without its final layer.

TABLE I NEURAL NETWORK ARCHITECTURES

B. Classification Accuracy

In this section, we determine the accuracy of the GIM and compare it to that of standard neural networks. All of the neural networks were created using the architectures presented in Table I and were initialized using the same random seed. We test our method on the MNIST character recognition dataset [36], the Fashion-MNIST clothing recognition dataset [37], the CIFAR 10 object recognition dataset [38], and the IMDB sentiment analysis dataset [39]. In Table II, we provide a comparison of the results, presenting the average accuracy achieved by each method with each dataset. It is clear from the results presented in Table II that our method does not compromise the classification accuracy and that in some cases it even improves it. When examining other generative and hybrid models, such as Bayesian neural networks [40], KNN, Naïve Bayes, and ClassRBM [41], the accuracy level is usually low when the datasets contain high dimensional data, as is the case with the MNIST and CIFAR 10 images. The novelty of our work is that we were able to retain the classification accuracy of the fully discriminative neural network, while creating a hybrid model.

C. Convergence Speed

Figure 3 compares the convergence speed of the GIM to that of a standard neural network, where both methods use the ResNet20v1 architecture. All training was performed using a single NVIDIA RTX 2080 Ti GPU. Note that both formulations of the Gaussian isolation machine converge more slowly than a standard neural network, although they achieve nearly the same final accuracy (see Table II). We hypothesize that the slower convergence is due to the fact that the GIM must separate the representations of the classes, as well as isolate them and force them into a Gaussian form.

D. Identifying Out Of Distribution Data

Anomalous data and data from outside the trained distributions can appear in a variety of applications. The proposed method can detect data from other distributions, i.e., data classes that the model did not train on. In this section, we empirically evaluate the proposed method’s ability to distinguish between data from the trained distribution and data from outside the trained distribution. To accomplish this, we designed two experiments in which we trained a GIM on a portion of the classes in a dataset and evaluated its detection ability on the remainder of the classes in the dataset. In the first experiment, we used the Caltech101 dataset [42], and in the second experiment, we used the Hand-Gestures [43] dataset. In both experiments, we also trained standard neural networks with the same architecture and compared the GIM’s performance to theirs.

TABLE II GAUSSIAN ISOLATION MACHINE VS STANDARD NEURAL NETWORKS

Fig. 3. The Gaussian isolation machine vs a standard neural network on the CIFAR 10 dataset.

The difference between the confidence values for the GIM’s predictions for Caltech101 data from the trained distribution (first 10 classes) and Caltech101 data from the untrained distribution (classes 10-40) can be seen in Figure 4. There is a large difference between the two graphs presented in the figure: data from classes that our model trained on has much higher confidence than data from classes it did not train on. This observation makes it possible to set a threshold for the confidence values and, given an input, determine whether it belongs to the trained distribution or not. In this case, the threshold value was set at , creating an almost perfect separation of 99.8% between out-of-distribution data and the trained distribution. In the second experiment, we loaded the Hand-Gestures dataset and resized it to . The GIM was trained on the first three classes. The confidence values can be seen in Figure 5. Here we used a confidence threshold of . Again, the GIM achieved an almost perfect detection rate of .

Fig. 4. Caltech101 dataset confidence metric for the trained and untrained class data.

Fig. 5. Hand-Gesture dataset confidence metric for the trained and untrained class data.

TABLE III BASELINE METHOD VS GAUSSIAN ISOLATION MACHINE DETECTION OF OOD DATA (THE CIFAR 10 TEST SET SERVES AS THE IN-DISTRIBUTION DATA)

To perform a fair comparison to other OOD detectors, we implemented the baseline method introduced by Hendrycks and Gimpel [18] and compared it to the GIM. The threshold values for the softmax score (baseline) and the log-likelihood (GIM) were set such that the TPR, TP/(TP + FN), would be 97%, i.e., 97% of the neural networks’ predictions on the training set are above the thresholds.
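Choosing a threshold for a target TPR amounts to taking a quantile of the in-distribution scores. The sketch below assumes that higher scores indicate in-distribution data and that a score equal to the threshold is accepted; the function name and the synthetic scores are hypothetical.

```python
import math

def threshold_for_tpr(in_scores, target_tpr):
    """Largest threshold that accepts at least target_tpr of the scores."""
    ranked = sorted(in_scores, reverse=True)
    # Number of in-distribution samples that must remain above the threshold.
    # The small tolerance guards against floating-point round-off in the product.
    keep = math.ceil(target_tpr * len(ranked) - 1e-9)
    keep = max(1, min(keep, len(ranked)))
    return ranked[keep - 1]

def true_positive_rate(in_scores, threshold):
    """Fraction of in-distribution scores at or above the threshold."""
    accepted = sum(1 for score in in_scores if score >= threshold)
    return accepted / len(in_scores)

# Synthetic confidence scores for 100 in-distribution samples.
scores = [float(value) for value in range(1, 101)]

threshold = threshold_for_tpr(scores, 0.97)
print(threshold, true_positive_rate(scores, threshold))
```

The same routine applies to both detectors, since only the score (maximum softmax probability or log-likelihood) differs.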

Table III presents an evaluation similar to that presented by Liang et al. [19]. A comparison is made between two VGG13 neural networks trained on the CIFAR 10 dataset, to test accuracy. The CIFAR 10 test set serves as the in-distribution data, and the out-of-distribution data comes from the following datasets: Tiny ImageNet, LSUN, and iSUN. The results of this comparison appear in Table III.

1) Evaluation Metrics:

• TPR and FPR: Measure the true positive rate and false positive rate. Let TP, FP, TN, and FN respectively represent the true positives, false positives, true negatives, and false negatives. The true positive rate is calculated as TPR = TP/(TP + FN), and the false positive rate is calculated as FPR = FP/(FP + TN).

• AUROC: Measures the area under the ROC curve. The receiver operating characteristic (ROC) curve plots the relationship between the TPR and FPR. The area under the ROC curve can be interpreted as the probability that a positive example (in-distribution) will have a higher detection score than a negative example (out-of-distribution).

• AUPR: Measures the area under the precision-recall (PR) curve. The PR curve is created by plotting precision = TP/(TP + FP) against recall = TP/(TP + FN). In our tests, AUPR In treats in-distribution data as the positive class, and AUPR Out treats out-of-distribution examples as the positive class.
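The metrics above can be computed directly from two lists of detection scores. The following is a small sketch, assuming that in-distribution samples are the positive class and that a sample is accepted when its score is at or above the threshold; the AUROC uses the pairwise-ranking interpretation given above, with ties counted as one half. The scores are made-up values.

```python
def detection_rates(in_scores, out_scores, threshold):
    """Return (TPR, FPR, precision) with in-distribution as the positive class."""
    tp = sum(1 for s in in_scores if s >= threshold)
    fn = len(in_scores) - tp
    fp = sum(1 for s in out_scores if s >= threshold)
    tn = len(out_scores) - fp
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Convention chosen here: precision is 0.0 when nothing is accepted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fpr, precision

def auroc(in_scores, out_scores):
    """Probability that a positive example outscores a negative example."""
    wins = 0.0
    for pos in in_scores:
        for neg in out_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(in_scores) * len(out_scores))

in_scores = [0.9, 0.8, 0.7, 0.4]    # in-distribution detection scores
out_scores = [0.6, 0.3, 0.2, 0.1]   # out-of-distribution detection scores

tpr, fpr, precision = detection_rates(in_scores, out_scores, threshold=0.5)
area = auroc(in_scores, out_scores)
print(tpr, fpr, precision, area)
```

The pairwise computation is quadratic in the number of samples, which is acceptable for an illustration; a sort-based implementation is preferable for large test sets.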

2) Test Sets for OOD Detection:

• Tiny ImageNet: This is a subset of the original ImageNet dataset containing 200 classes. For testing purposes we used two datasets that were created from the Tiny ImageNet test set, which contains 10,000 images: ImageNet (resize) and ImageNet (crop).

• LSUN: The Large-scale Scene Understanding dataset contains 10,000 test images, which were used to create two datasets: LSUN (resize) and LSUN (crop).

• iSUN: The iSUN dataset is a subset of the SUN dataset, and it contains 8,925 images. All images in this dataset were used, resized to 32 × 32 pixels.

• Gaussian and Uniform Noise: These datasets were created by sampling 10,000 pixel images from a uniform distribution and 10,000 pixel images i.i.d. from a 2D multivariate Gaussian distribution with a mean of 0.5 and a standard deviation of one.

V. SUMMARY

In this paper, we presented the Gaussian isolation machine, a new neural network-based classification method. The GIM is based on a neural network that is trained to transfer the inputs to a vector space where the data distribution can be approximated using a multivariate Gaussian. The approach integrates principles from generative and discriminative models to form a hybrid classification method that can classify data with high accuracy and identify data from untrained distributions. In the process of creating the Gaussian isolation machine, we also experimented with new regularization terms that improve the classification ability of cross-entropy/softmax-based neural networks. The main contribution of this paper is the ability of the GIM to identify whether the input data is from the training set distribution or not, without the need for any preprocessing or external detection measures. In future work, we intend to add a sampling capability to the GIM (i.e., giving it the ability to produce new samples) and to make changes to the loss function to enable it to perform multi-label classification. In our experiments we tried using the full covariance matrix of each class and found that although the classification results were better, the run-time was much longer. We believe that additional research in this area will lead to better classification results.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[2] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.

[3] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: A convolutional neural-network approach,” IEEE transactions on neural networks, vol. 8, no. 1, pp. 98–113, 1997.

[4] S. Chopra, M. Auli, and A. M. Rush, “Abstractive sentence summarization with attentive recurrent neural networks,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 93–98. [Online]. Available: https://www.aclweb.org/anthology/N16-1012

[5] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.

[6] C. dos Santos and M. Gatti, “Deep convolutional neural networks for sentiment analysis of short texts,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin, Ireland: Dublin City University and Association for Computational Linguistics, Aug. 2014, pp. 69–78. [Online]. Available: https://www.aclweb.org/anthology/C14-1008

[7] G. K. Venayagamoorthy, V. Moonasar, and K. Sandrasegaran, “Voice recognition using neural networks,” in Proceedings of the 1998 South African Symposium on Communications and Signal Processing-COMSIG’98 (Cat. No. 98EX214). IEEE, 1998, pp. 29–32.

[8] I. Rosenberg, A. Shabtai, L. Rokach, and Y. Elovici, “Generic black-box end-to-end attack against state of the art api call based malware classifiers,” in International Symposium on Research in Attacks, Intrusions, and Defenses. Springer, 2018, pp. 490–510.

[9] A. Khan, A. Sohail, U. Zahoora, and A. Saeed, “A survey of the recent architectures of deep convolutional neural networks,” 01 2019.

[10] Y. Bengio, P. Simard, P. Frasconi et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.

[11] B. A. Pearlmutter, “Learning state space trajectories in recurrent neural networks,” Neural Computation, vol. 1, pp. 263–269, 1989.

[12] R. A. Dunne and N. A. Campbell, “On the pairing of the softmax activation and cross-entropy penalty functions and the derivation of the softmax activation function,” in Proc. 8th Aust. Conf. on the Neural Networks, Melbourne, vol. 181. Citeseer, 1997, p. 185.

[13] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, “Activation functions: Comparison of trends in practice and research for deep learning,” arXiv preprint arXiv:1811.03378, 2018.

[14] A. Y. Ng and M. I. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes,” in Advances in neural information processing systems, 2002, pp. 841–848.

[15] R. Chalapathy, A. K. Menon, and S. Chawla, “Anomaly detection using one-class neural networks,” arXiv preprint arXiv:1802.06360, 2018.

[16] J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” Special Lecture on IE, vol. 2, no. 1, 2015.

[17] R. Chalapathy, A. K. Menon, and S. Chawla, “Robust, deep and inductive anomaly detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2017, pp. 36–51.

[18] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv preprint arXiv:1610.02136, 2016.

[19] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” arXiv preprint arXiv:1706.02690, 2017.

[20] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[21] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.

[22] T. DeVries and G. W. Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.

[23] K. Watanabe, J. Tang, M. Nakamura, S. Koga, and T. Fukuda, “A fuzzy-gaussian neural network and its application to mobile robot control,” IEEE transactions on control systems technology, vol. 4, no. 2, pp. 193–199, 1996.

[24] C. Viroli and G. J. McLachlan, “Deep gaussian mixture models,” Statistics and Computing, vol. 29, no. 1, pp. 43–51, 2019.

[25] C. Doersch, “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016.

[26] Z. Tüske, M. A. Tahir, R. Schlüter, and H. Ney, “Integrating gaussian mixtures into deep neural networks: Softmax layer with hidden variables,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4285–4289.

[27] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992, pp. 144–152.

[28] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li, “Soft-margin softmax for deep classification,” in International Conference on Neural Information Processing. Springer, 2017, pp. 413–421.

[29] S. Sun, W. Chen, L. Wang, X. Liu, and T.-Y. Liu, “On the depth of deep neural networks: A theoretical view,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[30] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” in ICML, vol. 2, no. 3, 2016, p. 7.

[31] J. Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues, “Robust large margin deep neural networks,” IEEE Transactions on Signal Processing, vol. 65, no. 16, pp. 4265–4280, 2017.

[32] G. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio, “Large margin deep networks for classification,” in Advances in neural information processing systems, 2018, pp. 842–852.

[33] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision. Springer, 2016, pp. 499–515.

[34] F. Chollet et al., “Keras,” https://keras.io, 2015.

[35] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[36] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/

[37] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.

[38] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.

[39] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, June 2011, pp. 142–150. [Online]. Available: http://www.aclweb.org/anthology/P11-1015

[40] Y. Gal and Z. Ghahramani, “Bayesian convolutional neural networks with bernoulli approximate variational inference,” arXiv preprint arXiv:1506.02158, 2015.

[41] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio, “Learning algorithms for the classification restricted boltzmann machine,” Journal of Machine Learning Research, vol. 13, no. Mar, pp. 643–669, 2012.

[42] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in 2004 conference on computer vision and pattern recognition workshop. IEEE, 2004, pp. 178–178.

[43] T. Mantecón, C. R. del Blanco, F. Jaureguizar, and N. García, “Hand gesture recognition using infrared imagery provided by leap motion controller,” in International Conference on Advanced Concepts for Intelligent Vision Systems. Springer, 2016, pp. 47–57.