Discriminative Multi-level Reconstruction under Compact Latent Space for One-Class Novelty Detection

2020·arXiv

Abstract

I. INTRODUCTION

Novelty detection is the task of detecting an incoming signal that deviates from the underlying regularity of a known class [1]. One-class novelty detection, in particular, imposes the additional constraint that only the known, in-class samples are available for training [2]. In the inference stage, the trained system needs to detect out-class samples, differentiating them from the in-class data. Due to the absence of out-class knowledge, one-class novelty detection is an unsupervised learning problem and is highly challenging. Its applications range from medical data processing [3]–[5] to intruder detection [6], [7], abnormality detection [8], and fraud detection [9]. Moreover, novelty detection has deep roots in neuroscience [10], as it is one of the core neural mechanisms of intelligent beings.

Many successful methods in novelty detection follow one of two approaches. In the first, a density function of the in-class data is modeled, and a query located in a low-density region is classified as out-class [11]–[15]. The second comprises reconstruction-based methods [2], [16]–[18], whose core principle is to design a mapping that is invertible exclusively over the in-class manifold. To differentiate in-class and out-class samples, these models usually come with a score function that measures the novelty of a query, which can be the sample-wise reconstruction loss used in training [18], a score derived by an independent module [2], [19], or a mixture of the two [1].

As to the reconstruction-based approach, most models follow the paradigm of compact representation learning [20] to acquire a function that reconstructs the in-class data only. The latent representations are learned to be compact in the sense that they are so condensed as to represent the in-class data exclusively. For example, principal component analysis (PCA)-based methods [16], [17], [21] select a minimal number of eigen-axes with which to reconstruct the in-class data. Recent advances in deep learning [22], [23] have enabled reconstruction-based methods to learn compact representations in more diverse ways. The deep autoencoder (AE) achieves this goal by making its middle layer much lower-dimensional than its input, thereby creating a bottleneck. Moreover, progress in generative adversarial learning [24] has enabled AEs to learn compact latent representations [2], showing promising results.
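The PCA-based reconstruction principle mentioned above can be illustrated with a minimal sketch. This is not the method of [16], [17], or [21]; the toy data, the function names `fit_pca` and `reconstruction_error`, and the choice of a single eigen-axis are assumptions made purely for illustration.

```python
import numpy as np

def fit_pca(X, k):
    """Return the mean and the top-k principal axes of in-class data X (n, d)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(x, mean, axes):
    """Squared L2 error between x and its projection onto the PCA subspace."""
    centered = x - mean
    recon = centered @ axes.T @ axes
    return float(np.sum((centered - recon) ** 2))

rng = np.random.default_rng(0)
# Hypothetical in-class data lying near a 1-D line in 3-D space.
direction = np.array([1.0, 2.0, 0.5])
t = rng.normal(size=(200, 1))
in_class = t * direction + 0.01 * rng.normal(size=(200, 3))

mean, axes = fit_pca(in_class, k=1)
err_in = reconstruction_error(1.5 * direction, mean, axes)             # on the line
err_out = reconstruction_error(np.array([2.0, -1.0, 0.0]), mean, axes)  # off the line
```

A query on the in-class line is reconstructed almost perfectly, whereas the off-line query keeps a large residual, which is the quantity a reconstruction-based detector would use as its score.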

In OCGAN [2] in particular, any input is encoded to a bounded region by applying tanh activation, and all points in the region are decoded to in-class samples. Out-class instances are thus reconstructed as in-class samples, resulting in large reconstruction error. However, the bounding constraint imposed by tanh activation may cause collapse in the latent representations of in-class samples, degrading the quality of their reconstruction (Figure 1(b)). This issue is reported in [2], and OCGAN seeks to resolve it with a complicated sampling technique, which, however, is insufficient to make OCGAN excel on complex datasets such as CIFAR-10, leaving room for improvement. Moreover, to infer the novelty of a given query in the inference stage, OCGAN resorts to an extra classifier on top of the dual GANs it employs, making the model heavy.

To improve over the aforementioned issues, we propose the Discriminative Compact Autoencoder, abbreviated as DCAE, that exclusively reconstructs the in-class data by learning their latent representations to be compact and collapse-free. DCAE utilizes its own internal module that captures class semantics of the in-class data for both effective training and inference.

Fig. 1. (a) vanilla AE, (b) AE with its encoder output improperly constrained, (c) ours. In this schematic diagram, the in-class data consists of plane images while a horse image is given as an out-class instance. For the vanilla AE in (a), the reconstruction quality is similar for both in-class and out-class samples. In (b), the AE poorly reconstructs both the in-class and out-class instances. In both (a) and (b), effective novelty detection is not possible as the reconstruction error does not differentiate between in-class and out-class samples. In (c), however, fine reconstruction is performed exclusively for the in-class data, allowing the model to successfully detect out-class instances by the reconstruction error.

Our contributions are summarized as follows:

1) We propose to learn both compact and collapse-free latent representations of the in-class data so as to reconstruct them both finely and exclusively.

2) For inference, a novel measure of reconstruction error is proposed. The proposed measure evaluates the error between an input and its reconstruction by projecting onto the penultimate layer of the internal adversarial discriminator of DCAE. The projection provides the class semantics of an input query, allowing the measure to effectively differentiate the in-class from out-class. Moreover, we theoretically show that, due to Lipschitz continuity, reconstructing through multiple hidden layers of the discriminator during the training of DCAE improves the effectiveness of using the penultimate layer in inference.

3) Extensive experiments on public image datasets validate the effectiveness of DCAE not only on the novelty detection problem but also on the task of detecting adversarial examples, delivering state-of-the-art results.

We highlight that the problem addressed in this work is unsupervised one-class novelty detection. There are other, different settings for novelty detection: for example, semi-supervised novelty detection [25], [26] allows training with out-class data, and self-supervised novelty detection [27], [28] allows a model to exploit supervisory signals inferred from a simple rule. Both settings require some amount of expert knowledge and/or human prior on the given training data. (For further discussion, see Supplementary V-A.) In our unsupervised one-class setting, we only assume that a given training dataset is one-class (i.e., the known class).

II. METHOD

Our method is divided into two parts: one for training the one-class model and the other for inference by measuring reconstruction error.

Fig. 2. Heat maps and histograms of scalar along the dimension index . For (a) and (b), are in-class test samples, and for (c) and (d) are out-class. The figures verify that the encoder outputs of not only constrained into

A. Training

The proposed model reconstructs the in-class data exclusively by (a) learning collapse-free (i.e., bijective) latent representations of the in-class data within a compact latent space and (b) constraining the latent representations of out-class instances into the same compact latent space. (a) ensures fine reconstruction of the in-class data, while (b) forces the latent representations of out-class instances to be decoded to in-class samples, resulting in large reconstruction error for them (as shown in Figure 1(c)). For (a), the in-class data is first bidirectionally represented by the compact latent space under dual GANs. Then, for fine reconstruction, the in-class samples are reconstructed as projected onto multiple layers of the input discriminator. To prevent any collapse in the latent space, we apply linear activation to the output of the encoder and reconstruct the latent vectors. For (b), we specify our encoder to be deep so that it is vulnerable to open set risk and thus unable to differentiate between in-class samples and out-class instances, thereby encoding both into the same compact latent space. Due to the aforementioned bidirectional modeling of the in-class data, the latent representations of out-class instances are decoded to in-class samples, resulting in large reconstruction error for out-class instances.

The following are detailed steps to realize the desired mechanism.

1) Bidirectional Representations of the In-Class Data by a Compact Latent Space: To allow a given compact latent space M, in our case a hypercube , to bidirectionally represent the in-class data , we employ dual GANs with latent and input discriminators and parametrized by and , respectively. In particular, the following adversarial loss is optimized:

where and are the weights parametrizing the encoder and decoder . In Eq. (1), the latent adversarial loss is defined as

where and are batches sampled from the uniform prior and the in-class dataset , respectively. Under this latent adversarial loss , the latent representations E(x) of in-class samples are mapped to and thus constrained in the compact latent space . On the other hand, optimizing the input adversarial loss

forces every latent vector to represent an in-class sample through the decoder G; that is, for .

To ensure that our encoder output is collapse-free over the in-class data, we apply linear activation rather than a bounded activation such as tanh. Overall, the in-class data is bidirectionally represented by the compact latent space M.
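The two adversarial games described in this subsection can be sketched as follows. The displayed equations are not reproduced here, so this sketch assumes the standard GAN log-loss of [24]; the linear stand-ins for E, G and the two discriminators, the dimensions, and the hypercube range [-1, 1]^d are all hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gan_log_loss(d_real, d_fake, eps=1e-12):
    """Standard GAN value: E[log D(real)] + E[log(1 - D(fake))]."""
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

rng = np.random.default_rng(0)
d_latent, d_input, batch = 4, 8, 16

# Hypothetical linear stand-ins for the encoder E, decoder G and discriminators.
W_e = rng.normal(scale=0.1, size=(d_input, d_latent))
W_g = rng.normal(scale=0.1, size=(d_latent, d_input))
w_dz = rng.normal(size=d_latent)
w_dx = rng.normal(size=d_input)

x = rng.normal(size=(batch, d_input))                # stand-in for an in-class batch
z = rng.uniform(-1.0, 1.0, size=(batch, d_latent))   # uniform prior; range is an assumption

# Latent game: prior samples z are "real", encoder outputs E(x) are "fake".
loss_latent = gan_log_loss(sigmoid(z @ w_dz), sigmoid((x @ W_e) @ w_dz))
# Input game: in-class samples x are "real", decoded samples G(z) are "fake".
loss_input = gan_log_loss(sigmoid(x @ w_dx), sigmoid((z @ W_g) @ w_dx))
```

Note that the encoder output `x @ W_e` is left linear (no tanh), mirroring the linear activation discussed above; the discriminators maximize these values while the encoder and decoder minimize them.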

2) Exclusive Representations of the In-Class Data: To perform novelty detection effectively, the given AE must reconstruct out-class instances poorly while maintaining the reconstruction quality of the in-class samples. To this end, we specify our encoder E to be a deep neural net (DNN), and claim that the deep encoder enables our DCAE to fulfill the desired objective.

Due to the vulnerability of the deep encoder E to open set risk [29], the encoder does not distinguish between in-class samples and out-class instance , thereby constraining both and into the same compact latent space M. To see this, note that under the latent adversarial loss in Eq. (1), the encoder learns to minimize the distance between and M

Due to DNN’s vulnerability to adversarial attack and open set risk (as empirically reported in [29], [30]), the loss l(x) is reduced over out-class instances as well. In other words, the open set with small loss

is significantly large and lies near the in-class data . Therefore, the encoder outputs of out-class instances tend to be placed inside the compact latent space M (as shown in Figure 2). As every point in the compact latent space M represents an in-class sample, then is decoded to an in-class sample, resulting in poor reconstruction of .

Remark. Both OCGAN and our DCAE poorly reconstruct out-class instances by constraining the range of the latent representation E(x) into a bounded space. In OCGAN, E(x) is constrained explicitly by applying a bounded tanh activation, possibly causing deterioration of in-class reconstruction (as depicted in [2]). In our case, we constrain it in a learning-based way, preventing such deterioration. The contrast is schematically depicted in Figure 1.
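As a small illustration of the distance between an encoder output and the compact latent space M discussed above, the following sketch computes the Euclidean distance from a point to a hypercube. The bounds [-1, 1] and the function name `distance_to_hypercube` are assumptions; the loss l(x) in the text is not reproduced here.

```python
import numpy as np

def distance_to_hypercube(z, low=-1.0, high=1.0):
    """Euclidean distance from z to the box [low, high]^d (zero if z is inside)."""
    z = np.asarray(z, dtype=float)
    # Clipping gives the closest point of the box to z.
    return float(np.linalg.norm(z - np.clip(z, low, high)))

inside = distance_to_hypercube([0.5, -0.3])    # latent code already in the box
outside = distance_to_hypercube([2.0, -3.0])   # latent code outside the box
```

A latent code that an encoder places inside the box has zero distance regardless of whether the input was in-class or out-class, which is the situation the paragraph above describes for out-class instances.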

3) Discriminative Multi-Level Reconstruction of the In-Class Data: As our model is a reconstruction-based one, the in-class samples must be reconstructed finely. A conventional way to realize this is to minimize either or based distance between x and its reconstruction . However, as x is high-dimensional in our case, the minimization results in blurry reconstruction, which is ineffective for our purpose.

To obtain robust reconstruction, we propose to exploit the multiple hidden layers of the internal discriminator . In particular, we minimize the distance between multi-level projections of an in-class sample and its reconstruction:

Here, , and L > 0 is the number of the hidden layers selected in . As a result, the reconstruction preserves the multi-level semantics [31] of captured by the input discriminator, necessitating fine reconstruction of the in-class data.

Moreover, multi-level reconstruction by during the training improves the effectiveness of using the penultimate layer for novelty detection in the inference stage as will be shown in the later subsection (Proposition 4).
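The idea of reconstructing through several discriminator layers can be sketched as below. The toy ReLU MLP standing in for the input discriminator, the use of an L1 distance, and the choice of the last `num_levels` hidden layers are assumptions made for illustration; they are not taken from Eq. (6).

```python
import numpy as np

def hidden_features(x, weights):
    """Return the activations of every hidden layer of a toy ReLU MLP."""
    feats, h = [], x
    for W in weights:
        h = np.maximum(h @ W, 0.0)
        feats.append(h)
    return feats

def multi_level_recon_loss(x, x_hat, weights, num_levels):
    """Sum of mean L1 distances between features of x and x_hat
    over the last num_levels hidden layers (the final layer is always used)."""
    assert 1 <= num_levels <= len(weights)
    fx = hidden_features(x, weights)[-num_levels:]
    fr = hidden_features(x_hat, weights)[-num_levels:]
    return float(sum(np.mean(np.abs(a - b)) for a, b in zip(fx, fr)))

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), rng.normal(size=(16, 4))]
x = rng.normal(size=(5, 8))                    # stand-in for in-class samples
x_hat = x + 0.1 * rng.normal(size=(5, 8))      # stand-in for their reconstructions

loss_1 = multi_level_recon_loss(x, x_hat, weights, num_levels=1)
loss_3 = multi_level_recon_loss(x, x_hat, weights, num_levels=3)
```

Because every level contributes a non-negative term, using more levels can only add constraints on the reconstruction, which matches the intuition that multi-level matching demands a finer reconstruction than matching a single layer.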

4) Surjective Encoding: To fully prevent collapse in the latent representations of the in-class data, and thus deterioration of their reconstruction, the encoder must be surjective. Otherwise, the encoder outputs of the in-class data will be confined to a limited range of M and thus will not fully represent the in-class data. Hence, we ensure the surjectivity of by minimizing

where is the reconstruction of z. Our proposition below validates that minimizing Eq. (7) forces E to be surjective.

Proposition 1. Assume and . If for every , then is surjective.

On the other hand, minimizing ensures the reconstruction of the inferred in-class samples G(z) (i.e., the in-class samples generated by G) on account of the proposition below:

Proposition 2. For any where is the reconstruction of G(z).

Thus, if the generator G is poor and the generated samples do not properly mimic real in-class samples (i.e., ), then would harm the performance as it forces the AE to learn to reconstruct out-class samples G(z). On the contrary, if the generator is good, it allows the AE to learn to reconstruct unseen in-class samples, improving novelty detection performance in the test environment.

Based on these observations, we regulate the contribution of by hyper-parametrizing .
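The link between latent reconstruction and surjectivity in this subsection can be checked numerically on a toy case: if E(G(z)) = z for every z in M, then every z is the image under E of the point G(z), so E covers M. The linear decoder, the pseudo-inverse encoder, the squared L2 loss and the range [-1, 1] are hypothetical choices for illustration, not the networks or the loss of Eq. (7).

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_input = 3, 6
A = rng.normal(size=(d_latent, d_input))   # toy linear decoder: G(z) = z A
A_pinv = np.linalg.pinv(A)                 # toy linear encoder: E(x) = x A^+

def G(z):
    return z @ A

def E(x):
    return x @ A_pinv

def latent_recon_loss(z):
    """Mean squared L2 error between z and its latent reconstruction E(G(z))."""
    return float(np.mean(np.sum((z - E(G(z))) ** 2, axis=1)))

z = rng.uniform(-1.0, 1.0, size=(32, d_latent))   # samples from the assumed hypercube
loss = latent_recon_loss(z)
preimages = G(z)                                   # points that E maps back onto z
```

Here the loss is zero up to floating-point error, and `preimages` exhibits, for each sampled z, an input that the encoder maps to it.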

5) Full Objective: The full objective of DCAE is to adversarially optimize

Here, the coefficient controls the contribution of . Unless specified otherwise, is fixed to be 1.

The overall architecture of DCAE is depicted in Figure 3, and its detailed algorithm is given in Supplementary.

B. Inference

In the inference stage, a given query x is classified as out-class if its novelty score s(x) exceeds a threshold and as in-class otherwise. For reconstruction-based methods, s(x) is defined based on the sample-wise reconstruction error.
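The decision rule above amounts to a single comparison, sketched below. How the threshold is chosen is not specified in the text; picking a high quantile of held-out in-class scores is an assumption made only for this example, and the score values are made up.

```python
import numpy as np

def classify(scores, threshold):
    """Return True for queries flagged as out-class (score above threshold)."""
    return np.asarray(scores, dtype=float) > threshold

# Hypothetical novelty scores of held-out in-class samples.
val_scores = np.array([0.10, 0.12, 0.08, 0.15, 0.11])
tau = float(np.quantile(val_scores, 0.95))   # assumed threshold rule

flags = classify([0.09, 0.50], tau)          # one low-score and one high-score query
```

Threshold-free metrics such as AUC sweep over all possible values of the threshold, so the particular rule used here does not affect them.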

As shown in Figure 4, the reconstruction of DCAE preserves class semantics for in-class samples but not for out-class instances. On account of this evidence, a distance metric capturing such a semantic error needs to be built to sharply classify a given query.

We claim that the penultimate layer of can be an effective building block. To see this, note that on the projected space , the in-class data is linearly separated from incorrectly generated samples:

Proposition 3. is linearly separated from the projections of incorrectly generated samples with and

Observed in Figure 5, constitutes a valid part of out-class space. The projection thus captures the class semantics of the in-class data accordingly. Motivated by this, we define a novelty score based on the -based error

between x and under the projection . In our ablation study, we empirically validate the effectiveness of compared to usage of early layers (Figure 7(b)).

1) Multi-level Reconstruction Improves Discriminativeness of : As in-class samples should have a low novelty score , achieving minimal values of on in-class samples x is crucial. Based on the following proposition, reconstructing through a larger number of layers during the training of DCAE drives the model to find a smaller local minimum of on the in-class samples x.

Proposition 4. The reconstruction error over the final layer is tightly bounded by that of every previous layer:

for some .

Overall, multi-level reconstruction by during training improves the inference capability of novelty score defined by . The hypothesis is experimentally verified in Figure 7(a).
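The Lipschitz argument behind this kind of layer-wise bound can be checked numerically for a single layer. The sketch below assumes a layer of the form ReLU(W h), for which the spectral norm of W is a valid Lipschitz constant; the layer sizes and perturbation are hypothetical, and this is not the exact statement of Proposition 4.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))          # toy layer mapping 32-dim features to 16-dim
lipschitz = np.linalg.norm(W, 2)       # spectral norm = largest singular value

def layer(h):
    # ReLU is 1-Lipschitz, so the whole layer is Lipschitz with constant ||W||_2.
    return np.maximum(W @ h, 0.0)

a = rng.normal(size=32)                # features of an input at the previous layer
b = a + 0.1 * rng.normal(size=32)      # features of its reconstruction

err_prev = np.linalg.norm(a - b)                 # error at the previous layer
err_next = np.linalg.norm(layer(a) - layer(b))   # error at the next layer
```

The error at the next layer never exceeds `lipschitz * err_prev`, so driving the error down at earlier layers also limits how large the error at deeper layers can be.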

2) Centered Co-activation Novelty Score: The score sc based on the -based error might be insufficient to capture the class relation between x and since it is simply an additive ensemble of element-wise errors. In biometric models [32], angular distance is known to capture such a relation well. Motivated by this, we propose the centered co-activation novelty score

where is the cosine similarity, and m(x) is the element-wise mean of , i.e., is similarly defined for . The proposed score is increased if the activations in the centered feature co-activate with those in the corresponding parts of . For this score to be low, not only needs to be high but also needs to be small. The latter term is governed by the -based error:

Proposition 5. For any

Thus, if a query x has a small -based reconstruction error , it is reflected in the score . Overall, captures both the cosine and -distance based similarities between the input and its reconstruction.
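The ingredients named in this subsection can be computed separately as follows. How the paper combines them into the final score is not recoverable from the text here, so the sketch only returns the pieces; the L1 norm, the function names, and the random feature vectors are assumptions. The bound checked at the end, that the gap between element-wise means is at most the L1 error divided by the dimension, is elementary and is only in the spirit of Proposition 5.

```python
import numpy as np

def l1_error(f_x, f_rec):
    """Additive ensemble of element-wise errors between two feature vectors."""
    return float(np.sum(np.abs(f_x - f_rec)))

def mean_gap(f_x, f_rec):
    """Absolute difference between the element-wise means m(x) of the features."""
    return float(abs(f_x.mean() - f_rec.mean()))

def centered_cosine(f_x, f_rec, eps=1e-12):
    """Cosine similarity between the mean-centered feature vectors."""
    u = f_x - f_x.mean()
    v = f_rec - f_rec.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

rng = np.random.default_rng(0)
f_x = rng.normal(size=64)                      # stand-in for penultimate features of x
f_good = f_x + 0.05 * rng.normal(size=64)      # features of a faithful reconstruction
f_bad = rng.normal(size=64)                    # features of an unrelated reconstruction

cos_good = centered_cosine(f_x, f_good)
cos_bad = centered_cosine(f_x, f_bad)
bound_holds = mean_gap(f_x, f_bad) <= l1_error(f_x, f_bad) / f_x.size + 1e-12
```

A faithful reconstruction yields a centered cosine close to 1 and a small mean gap, while an unrelated one yields a much lower cosine.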

III. EXPERIMENTS

In this section, we assess the effectiveness of the proposed model DCAE. The set of experiments we conduct can be divided into three parts:

Fig. 3. A schematic diagram for the training of DCAE.

Fig. 4. Here, is the truck class in CIFAR-10, and is the rest of the classes. (a) test in-class , (b) test in-class reconstructions , (c) out-class , (d) out-class reconstructions

Fig. 5. Here, is the ship class in CIFAR-10. (a) incorrectly generated samples , (b) generated samples G(z), (c) real in-class . The figures show that exist (i.e., can be sampled) and represent out-class, but are not too distant from

(1) Novelty detection performance of DCAE is evaluated on well-known benchmark data sets: MNIST [33], FMNIST [34], and CIFAR-10 [35],

(2) DCAE is applied to detect adversarial examples, tested upon GTSRB stop sign dataset [36],

(3) Ablation study is conducted to analyze the contribution of each component in DCAE.

We remark that our problem is one-class unsupervised novelty detection. Thus, we do not compare with novelty detectors trained in other settings, for example, semi-supervised [25], [26] and self-supervised [27], [28] novelty detectors, which generally outperform unsupervised novelty detectors.

A. Novelty Detection

1) Evaluation Protocol: To assess the effectiveness of the proposed method, we test it on three well-known multi-class object recognition datasets. Following [1], [2], we conduct our experiment in a one-class setting by regarding each class in turn as the known class (in-class). The network of the model is trained using only the known class samples. In the inference stage, the remaining classes are used as out-class samples. Following previous works tested in the same one-class setting, we assess performance using the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. To this end, we follow two protocols widely used in the novelty detection literature [1], [2], [39], [40]:

Protocol A: Given in-class and out-class sets, 80% of the in-class samples are used for training. The remaining 20% are reserved for testing. The out-class samples for testing are randomly collected from the out-class set so that their number equals that of the in-class test samples.

Protocol B: We follow the training-testing splits provided from the dataset. For training, all samples in the known class in the training set are employed. For testing, all samples in the test set are used by regarding any other class as an out-class.
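The split described in Protocol A can be sketched as below. The function name `protocol_a_split`, the fixed seed, and the integer arrays standing in for image sets are hypothetical; the sketch also assumes the out-class set is at least as large as the in-class test split.

```python
import numpy as np

def protocol_a_split(in_class, out_class, train_frac=0.8, seed=0):
    """Split in-class data 80/20 and draw an equal number of out-class test samples."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(in_class))
    n_train = int(round(train_frac * len(in_class)))
    train = in_class[perm[:n_train]]
    test_in = in_class[perm[n_train:]]
    # Sample without replacement so that no out-class test sample is repeated.
    out_idx = rng.choice(len(out_class), size=len(test_in), replace=False)
    test_out = out_class[out_idx]
    return train, test_in, test_out

in_class = np.arange(100)            # stand-in for 100 in-class samples
out_class = np.arange(1000, 1500)    # stand-in for 500 out-class samples
train, test_in, test_out = protocol_a_split(in_class, out_class)
```

Protocol B needs no such sampling, since it reuses the training and test splits that ship with each dataset.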

TABLE I COMPARISON OF NOVELTY DETECTION PERFORMANCE ON MNIST USING PROTOCOL B.

TABLE II COMPARISON OF NOVELTY DETECTION PERFORMANCE ON F-MNIST USING PROTOCOL A.

2) Results: Here, we present our results together with a brief description of the hyperparameters we used. The detailed architecture setting is given in Supplementary.

MNIST. For the MNIST dataset, we tested our model under Protocol B. We found that the generator generalizes too well, learning samples at the extreme (i.e., near the boundary) of the in-class manifold. For this reason, we reduced the coefficient of in (7) to , which is known to disentangle the latent code [42]. Our result, shown in Table I, indicates that the performance is comparable to that of the state-of-the-art models OCGAN and LSA. Note that OCGAN needs an extra classifier module on top of dual GANs and a careful sampling technique to achieve the reported performance, while LSA is heavy with autoregressive density estimation and requires sensitive architectural configuration. In contrast, our DCAE consists of dual GANs only, is trained with a simple end-to-end loss, and performs inference with its own internal module.

F-MNIST. The model performance on F-MNIST is assessed using Protocol A. Based on the MNIST experiment, we set . The F-MNIST dataset is not easy, as there is a fair amount of intra-class variation while the inter-class dissimilarity between some classes is not so significant (for example, the ’T-shirt’ and ’Pullover’ classes). Our result, shown in Table II, outperforms the state-of-the-art OCGAN.

CIFAR-10. CIFAR-10 is a difficult dataset for one-class unsupervised novelty detection. The reasons include that the dataset is fairly sparse (i.e., samples are not continuous), that it has high intra-class variation (i.e., diverse samples), and that the images are of low resolution while containing real objects. Table III shows that our model outperforms the state-of-the-art OCGAN by a large margin under Protocol B.

B. Detection of Adversarial Example

In many practical scenarios such as security systems and autonomous driving, it is vital to detect adversarial attacks [44]. In this experiment, we test our model DCAE on the task of adversarial example detection. Following the protocol proposed by [39], we use the ‘stop sign’ class of the German Traffic Sign Recognition Benchmark (GTSRB) dataset [36]. The training set consists of 780 stop sign images of spatial size . The test set is composed of 270 stop sign images and 20 adversarial examples, which are generated by applying Boundary Attack [45] on randomly drawn test stop sign images.

Fig. 6. Reconstructed images of test normal samples and adversarial examples.

Fig. 7. The novelty detection performances of DCAE by (a) varying L in and (b) varying , respectively.

To measure the performance of our method over the task of adversarial example detection, we measured AUC over the test dataset. The model is trained solely using the training set as its in-class set. As shown in Table IV, our model performs effectively over this task, outperforming all baselines.
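The AUC used here can be computed directly from novelty scores via its rank interpretation: it is the probability that a randomly chosen out-class sample scores higher than a randomly chosen in-class sample, with ties counted as one half. The sketch below uses synthetic Gaussian scores purely as placeholders; only the counts of 270 normal and 20 adversarial test samples come from the text, and the resulting value has no relation to the results in Table IV.

```python
import numpy as np

def roc_auc(scores_in, scores_out):
    """AUC with out-class as the positive class, via pairwise comparisons."""
    s_in = np.asarray(scores_in, dtype=float)[None, :]
    s_out = np.asarray(scores_out, dtype=float)[:, None]
    greater = np.sum(s_out > s_in)     # out-class sample ranked above in-class sample
    ties = np.sum(s_out == s_in)       # ties count as half a correct ranking
    return float((greater + 0.5 * ties) / (s_in.size * s_out.size))

rng = np.random.default_rng(0)
normal_scores = rng.normal(loc=0.0, size=270)       # placeholder scores, normal stop signs
adversarial_scores = rng.normal(loc=2.0, size=20)   # placeholder scores, adversarial examples
auc = roc_auc(normal_scores, adversarial_scores)
```

Because AUC depends only on the ranking of scores, it is unaffected by the imbalance between the 270 normal and 20 adversarial test samples.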

To qualitatively assess our model, we visualized the reconstructed images of the test samples. Figure 6 shows that our model denoises adversarial examples as it reconstructs them, resulting in their poor reconstruction. As for normal samples, DCAE reconstructs them finely, except for noisy samples in the training data. This validates the effectiveness of DCAE in detecting adversarial examples.

C. Ablation Study

For all experiments below, we test upon CIFAR-10 based on the same protocol used above in Sec. III-A2.

1) Ablation study on model components: We conduct an ablation study to assess the effectiveness of each component in DCAE. Our model can be decomposed into three parts that correspond to the bidirectional modeling by , multi-level reconstruction by , and surjective encoding by . According to this decomposition, we consider three models: (a) DCAE without multi-level reconstruction and surjective encoding, (b) DCAE without surjective encoding, and (c) full DCAE. Additionally, we test (d) DCAE without and but with its encoder output bounded by tanh activation. This model is equivalent to OCGAN without the extra classifier and sampling technique, thus explicitly bounding the range of the latent representations and lacking any of our novel approaches to resolving the latent collapse issue.

TABLE III COMPARISON OF NOVELTY DETECTION PERFORMANCE ON CIFAR-10 USING PROTOCOL B.

TABLE IV THE PERFORMANCE COMPARISON OVER THE TASK OF DETECTING ADVERSARIAL EXAMPLES GENERATED BY BOUNDARY ATTACK.

TABLE V ABLATION STUDY OF THE MODEL COMPONENTS OVER CIFAR-10.

To measure the novelty detection performance of each model, we employ three different novelty scores: the score by the conventional per-pixel reconstruction error and the proposed novel score functions and .

The results in Table V show that each component of our method contributes to improving the performance of our model. An important aspect to note is that the performance improvement is not captured when the detection is performed by the per-pixel score , showing both the importance and effectiveness of a proper novelty score function. Moreover, the result on tanh-DCAE w/o shows that explicitly bounding the latent representation by applying tanh activation to the encoder output indeed degrades the performance significantly, as it collapses the latent representations of the in-class data, thereby deteriorating the in-class reconstruction.

2) On multi-Level reconstruction: The multi-level reconstruction loss in (6) has been analyzed by varying the number L of ensemble components. We note that the final layer is always used in all cases. The result in Figure 7(a) shows that the performance improves as we use a larger L for , as suggested by Proposition 4.

3) Choice of hidden layer for novelty score: We experimentally studied how the novelty detection performance changes as we define the novelty score by another hidden layer in . Specifically, we replaced by in both and , defining and with and defined similarly by replacing by in m(x) and . The comparison is shown in Figure 7(b), depicting a clear monotonic relation between the performance and the layer depth l. The trend validates that deeper layers of capture semantics that are more effective at differentiating the in-class data from out-class instances.

IV. CONCLUSION

We proposed a reconstruction-based novelty detector, DCAE, that induces both fine and exclusive reconstruction of the in-class data by learning compact and collapse-free latent representations. DCAE attains the desired mechanism by exploiting multi-level reconstruction based on its internal discriminator and the vulnerability of the deep encoder to open set risk. Moreover, the penultimate discriminative layer and the proposed novelty score functions based on it have been validated, both theoretically and experimentally, to be effective for novelty detection inference. Extensive experiments on public image datasets exhibited the strong capability of DCAE on both novelty detection and adversarial example detection tasks.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NO. NRF-2019R1A2C1003306), and by NVIDIA GPU grant program.

REFERENCES

[1] D. Abati, A. Porrello, S. Calderara, and R. Cucchiara, “Latent space autoregression for novelty detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 481–490.

[2] P. Perera, R. Nallapati, and B. Xiang, “Ocgan: One-class novelty detection using gans with constrained latent representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906.

[3] S. J. Roberts, “Novelty detection using extreme value statistics,” IEE Proceedings-Vision, Image and Signal Processing, vol. 146, no. 3, pp. 124–129, 1999.

[4] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 146–157.

[5] T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth, “f-anogan: Fast unsupervised anomaly detection with generative adversarial networks,” Medical image analysis, vol. 54, pp. 30–44, 2019.

[6] P. Perera and V. M. Patel, “Dual-minimax probability machines for one-class mobile active authentication,” in 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE, 2018, pp. 1–8.

[7] P. Oza and V. M. Patel, “Active authentication using an autoencoder regularized cnn-based one-class classifier,” arXiv preprint arXiv:1903.01031, 2019.

[8] B. Saleh, A. Farhadi, and A. Elgammal, “Object-centric anomaly detection by attribute-based reasoning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 787–794.

[9] P. Zheng, S. Yuan, X. Wu, J. Li, and A. Lu, “One-class adversarial nets for fraud detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 1286–1293.

[10] R. T. Knight, “Contribution of human hippocampal region to novelty detection,” Nature, vol. 383, no. 6597, p. 256, 1996.

[11] R. El-Yaniv and M. Nisenson, “Optimal single-class classification strategies,” in Advances in Neural Information Processing Systems, 2007, pp. 377–384.

[12] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, “Robust real-time unusual event detection using multiple fixed-location monitors,” IEEE transactions on pattern analysis and machine intelligence, vol. 30, no. 3, pp. 555–560, 2008.

[13] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, “Anomaly detection in crowded scenes,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 1975–1981.

[14] J. Kim and K. Grauman, “Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 2921–2928.

[15] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, and R. Klette, “Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes,” Computer Vision and Image Understanding, vol. 172, pp. 88–97, 2018.

[16] H. Hoffmann, “Kernel pca for novelty detection,” Pattern recognition, vol. 40, no. 3, pp. 863–874, 2007.

[17] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.

[18] M. Sakurada and T. Yairi, “Anomaly detection using autoencoders with nonlinear dimensionality reduction,” in Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis. ACM, 2014, p. 4.

[19] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, “Adversarially learned one-class classifier for novelty detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3379–3388.

[20] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. Mar, pp. 723–773, 2012.

[21] C. M. Bishop, Pattern recognition and machine learning. Springer Science+ Business Media, 2006.

[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.

[25] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft, “Deep semi-supervised anomaly detection,” arXiv preprint arXiv:1906.02694, 2019.

[26] D. Hendrycks, M. Mazeika, and T. Dietterich, “Deep anomaly detection with outlier exposure,” arXiv preprint arXiv:1812.04606, 2018.

[27] I. Golan and R. El-Yaniv, “Deep anomaly detection using geometric transformations,” in Advances in Neural Information Processing Systems, 2018, pp. 9758–9769.

[28] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song, “Using self-supervised learning can improve model robustness and uncertainty,” in Advances in Neural Information Processing Systems, 2019, pp. 15 637–15 648.

[29] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 7, pp. 1757–1772, 2012.

[30] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple and accurate method to fool deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2574–2582.

[31] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.

[32] H. V. Nguyen and L. Bai, “Cosine similarity metric learning for face verification,” in Asian conference on computer vision. Springer, 2010, pp. 709–720.

[33] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/

[34] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.

[35] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.

[36] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “The german traffic sign recognition benchmark: A multi-class classification competition,” in IJCNN, vol. 6, 2011, p. 7.

[37] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, “Support vector method for novelty detection,” in Advances in neural information processing systems, 2000, pp. 582–588.

[38] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.

[39] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, “Deep one-class classification,” in International Conference on Machine Learning, 2018, pp. 4393–4402.

[40] S. Pidhorskyi, R. Almohsen, and G. Doretto, “Generative probabilistic novelty detection with adversarial autoencoders,” in Advances in Neural Information Processing Systems, 2018, pp. 6822–6833.

[41] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742.

[42] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in neural information processing systems, 2016, pp. 2172–2180.

[43] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.

[44] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia conference on computer and communications security. ACM, 2017, pp. 506–519.

[45] W. Brendel, J. Rauber, and M. Bethge, “Decision-based adversarial attacks: Reliable attacks against black-box machine learning models,” arXiv preprint arXiv:1712.04248, 2017.

[46] I. Arganda-Carreras, S. C. Turaga, D. R. Berger, D. Cireşan, A. Giusti, L. M. Gambardella, J. Schmidhuber, D. Laptev, S. Dwivedi, J. M. Buhmann et al., “Crowdsourcing the creation of image segmentation algorithms for connectomics,” Frontiers in neuroanatomy, vol. 9, p. 142, 2015.

[47] R. K. Cowen, S. Sponaugle, K. Robinson, and J. Luo, “Planktonset 1.0: Plankton imagery data collected from fg walton smith in straits of florida from 2014–06-03 to 2014–06-06 and used in the 2015 national data science bowl (ncei accession 0127422),” NOAA National Centers for Environmental Information, 2015.

[48] V. Ciesielski and V. P. Ha, “Texture detection using neural networks trained on examples of one class,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2009, pp. 140–149.

[49] R. E. Sánchez-Yáñez, E. V. Kurmyshev, and A. Fernández, “One-class texture classifier in the ccr feature space,” Pattern Recognition Letters, vol. 24, no. 9-10, pp. 1503–1511, 2003.

[50] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.

[51] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.

[52] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.

[53] T. Miyato and M. Koyama, “cgans with projection discriminator,” arXiv preprint arXiv:1802.05637, 2018.

[54] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.

V. SUPPLEMENTARY

Here, we provide supplementary materials, including proofs for all the propositions in the main paper.

A. Unsupervised vs. Self-Supervised One-Class Novelty Detectors

Although the gap between unsupervised and self-supervised one-class novelty detectors may not seem large, there is a clear distinction between them. Self-supervised novelty detectors differ from their unsupervised counterparts in that they require prior knowledge of the given data set. To elaborate on this, we show that RotNet [27], [28] (which achieves the top performance on object recognition image data sets for novelty detection in the self-supervised setting) can solve only specific data sets. In particular, we theoretically prove that RotNet completely fails as a novelty detector on in-class data sets that are closed under rotation and translation. We note that other RotNet-based variants, which share the same core mechanism as RotNet, fail on these in-class data sets in the same manner.

To prove our claim, we denote to be a geometric transformation corresponding to rotation and/or translation with . (In RotNet of [27], is the identity and K = 72.) Then RotNet minimizes the following cross entropy loss:

TABLE VI THE ARCHITECTURE OF THE INPUT DISCRIMINATOR

where is sampled from the in-class set and is the posterior given . In inference, the novelty score is defined as the sum of maximal posteriors on

(There are other ways to define the novelty score but basically those variants are equivalent to this.)

Our claim is that RotNet completely fails if the set is a group that acts on , i.e., is closed in

Proposition. If is a group that acts on , then s(x) is maximal for every .

Proof. Since is closed in , for every , the posteriors p(y = k|x) are minimized for all . This forces p(y|x) to be uniform and thus induces maximal novelty score on every in-class sample x.

Therefore, RotNet fails on geometrically closed data sets [46], [47]. For this reason, RotNet might not be suitable for applications where the data sets contain various geometrical symmetries. On the other hand, unsupervised novelty detectors (such as reconstruction-based detectors and one-class classifiers) are not constrained by such priors and thus produce consistent performance [48], [49].
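The intuition behind the proposition can be checked on a toy example. The following sketch is only an illustration under our own assumptions: the transformation set is restricted to the four 90-degree rotations (not the K = 72 transformations of [27]), the images are hypothetical 3x3 patterns, and the function names are ours. It builds an in-class set that is closed under rotation and shows that, for every transformed input, all transformation labels occur equally often, so that the best achievable posterior is uniform. It does not reproduce the novelty score s(x) itself.

```python
from collections import defaultdict


def rot90(img):
    """Rotate a square image (tuple of row tuples) by 90 degrees clockwise."""
    return tuple(zip(*img[::-1]))


def rotate(img, k):
    """Apply the 90-degree rotation k times (k = 0 is the identity)."""
    for _ in range(k % 4):
        img = rot90(img)
    return img


K = 4  # toy transformation set: rotations by 0, 90, 180, and 270 degrees

# Two arbitrary 3x3 "images"; the pixel values carry no meaning.
seeds = [
    ((1, 0, 0), (0, 0, 0), (0, 0, 0)),
    ((1, 1, 0), (0, 0, 0), (0, 0, 1)),
]

# Close the in-class set under the rotation group.
in_class = {rotate(s, k) for s in seeds for k in range(K)}

# Build the self-supervised pairs (transformed image, label k) and count
# how often each label occurs for each distinct transformed image.
label_counts = defaultdict(lambda: [0] * K)
for x in in_class:
    for k in range(K):
        label_counts[rotate(x, k)][k] += 1

# Empirical posterior over labels for each distinct transformed image.
posteriors = {
    img: [c / sum(counts) for c in counts]
    for img, counts in label_counts.items()
}

for p in posteriors.values():
    print(p)
```

Because rotation is a bijection on the closed set, every transformed image is produced exactly once by each label, which is why no classifier can do better than a uniform guess on such data.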

One more aspect to note is that the application of self-supervised RotNet-based models is confined to image data sets where rotation and translation are properly defined. On the other hand, most of the reconstruction-based models including ours are applicable to general data sets.

B. Algorithm of DCAE

The detailed algorithm of DCAE is given in Algorithm 1. Note that in the actual training, we want the contribution of the reconstruction losses and to linearly increase. Thus, we multiply the losses by that linearly increases from 0 to 1. This is to synchronize the reconstruction losses with the adversarial losses; adversarial learning is relatively slower than learning reconstruction.
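The linear schedule described above can be sketched as follows. This is a minimal illustration, not the schedule of Algorithm 1: the names `ramp_weight`, `scale_reconstruction_losses`, and the parameter `ramp_steps` are hypothetical, and the number of iterations over which the weight rises is an assumption, as it is not stated here.

```python
def ramp_weight(step, ramp_steps):
    """Weight that grows linearly from 0 to 1 over `ramp_steps` iterations.

    `ramp_steps` is a hypothetical parameter; after the ramp the weight
    stays at 1.
    """
    if ramp_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / ramp_steps))


def scale_reconstruction_losses(rec_losses, step, ramp_steps):
    """Multiply each reconstruction loss value by the current ramp weight."""
    w = ramp_weight(step, ramp_steps)
    return [w * loss for loss in rec_losses]


# Example: two reconstruction loss values halfway through a 1000-step ramp.
print(scale_reconstruction_losses([0.8, 0.4], step=500, ramp_steps=1000))
```

Early in training the reconstruction terms are almost switched off, so the adversarial terms dominate until the weight approaches 1.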

C. Architectures and Hyperparameters in Detail

We provide the detailed description of the architectures and hyperparameters here. All our networks are residual CNNs except the latent discriminator , which is a multilayer perceptron with two fully connected hidden layers. To define the ensemble loss , we pick from the first convolution layer and and from the residual block outputs.

Fig. 8. The residual blocks with (a) average pooling, (b) atrous convolution, and (c) nearest neighbor upsampling.

TABLE VII THE ARCHITECTURE OF THE INPUT DISCRIMINATOR

All networks are spectral-normalized [50]. To train the networks, we use the Adam optimizer with and . For the learning rates, we follow TTUR [51], thereby setting learning rates for and (G, E) differently: lrlrand lrlr. The input images are scaled to . For all experiments, the total number of training iterations is 500K, which is relatively long but necessary to stabilize adversarial learning.

In the implementation, the hinge version [50] of the adversarial loss is adopted. The architectures of our networks are described in Tables VI, VII, VIII, and IX, with their residual blocks in Figure 8.

TABLE VIII THE ARCHITECTURE OF THE ENCODER

TABLE IX THE ARCHITECTURE OF THE ENCODER
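For reference, the standard hinge form of the adversarial loss can be sketched as below. This is a generic illustration on plain lists of discriminator outputs, not the DCAE training code: the function names are hypothetical, and how the loss is applied to each of the discriminators in DCAE is not specified by this sketch.

```python
def relu(v):
    """Rectified linear unit on a scalar."""
    return max(0.0, v)


def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss for a discriminator.

    d_real: discriminator outputs on real samples.
    d_fake: discriminator outputs on generated samples.
    """
    real_term = sum(relu(1.0 - s) for s in d_real) / len(d_real)
    fake_term = sum(relu(1.0 + s) for s in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_hinge_loss(d_fake):
    """Generator loss in the hinge formulation: raise the scores of fakes."""
    return -sum(d_fake) / len(d_fake)


# Example with hypothetical discriminator scores.
print(discriminator_hinge_loss([2.0, 0.5], [-2.0, 0.0]))
print(generator_hinge_loss([-2.0, 0.0]))
```

Real samples scored above 1 and generated samples scored below -1 contribute nothing to the discriminator loss, which is what distinguishes the hinge form from the original cross-entropy formulation.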

As to the atrous convolution in the residual block, it is a 2-dilated convolution with no padding. Its kernel size is chosen so that the spatial size of the output map is half that of the input map.
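The kernel size implied by this rule can be worked out from the usual output-size formula for a dilated convolution. The sketch below assumes stride 1 (our assumption, as the stride is not stated here) and uses hypothetical helper names.

```python
def conv_output_size(in_size, kernel_size, dilation=2, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (in_size + 2 * padding - effective_kernel) // stride + 1


def halving_kernel_size(in_size, dilation=2):
    """Kernel size that halves the spatial size with stride 1 and no padding.

    Solves in_size - dilation * (k - 1) = in_size / 2 for k.
    """
    if in_size % (2 * dilation) != 0:
        raise ValueError("in_size must be divisible by 2 * dilation")
    return in_size // (2 * dilation) + 1


for size in (32, 16, 8):
    k = halving_kernel_size(size)
    print(size, k, conv_output_size(size, k))
```

Under these assumptions the kernel size depends on the input resolution: a 32-pixel map needs a 9-tap kernel, a 16-pixel map a 5-tap kernel, and an 8-pixel map a 3-tap kernel.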

We adopt atrous convolution for downsampling in E as it makes encoding less degenerate than downsampling by average pooling. We experimentally observed that downsampling by atrous convolution in E gives slightly better performance than downsampling by average pooling. However, for the discriminator D, average pooling was much better at stabilizing the adversarial optimization.

As to the nearest neighbor upsampling used in the residual block, its upsampling rate is fixed at 2.
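As a small illustration of this operation, nearest neighbor upsampling with rate 2 repeats every value twice along each spatial axis. The function name below is hypothetical and the sketch works on a single-channel map stored as nested lists.

```python
def upsample_nearest(feature_map, rate=2):
    """Nearest neighbor upsampling of a 2-D map given as a list of rows."""
    out = []
    for row in feature_map:
        # Repeat each value `rate` times along the width.
        wide = [v for v in row for _ in range(rate)]
        # Repeat the widened row `rate` times along the height.
        out.extend(list(wide) for _ in range(rate))
    return out


for row in upsample_nearest([[1, 2], [3, 4]]):
    print(row)
```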

To initialize the weights of the networks, we adopt He initialization [52]. We note that GANs are particularly sensitive to the choice of weight initialization. We use the Adam optimizer with and . The choice of these hyperparameters is motivated by recent spectral-normalized GANs [53], [54]. For each iteration, we take a sample batch of size N = 100 for both x and z.
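He initialization [52] draws weights from a zero-mean Gaussian whose standard deviation is sqrt(2 / fan_in). The sketch below illustrates this for a square convolution kernel; the function names are hypothetical, and the use of the fan-in, normal-distribution variant is our assumption about which variant is meant.

```python
import math
import random


def he_std(fan_in):
    """Standard deviation prescribed by He initialization for ReLU layers."""
    return math.sqrt(2.0 / fan_in)


def he_normal_conv(out_channels, in_channels, kernel_size, seed=0):
    """Sample a convolution weight tensor as nested lists.

    Shape: [out_channels][in_channels][kernel_size][kernel_size].
    """
    rng = random.Random(seed)
    fan_in = in_channels * kernel_size * kernel_size
    std = he_std(fan_in)
    return [
        [
            [[rng.gauss(0.0, std) for _ in range(kernel_size)]
             for _ in range(kernel_size)]
            for _ in range(in_channels)
        ]
        for _ in range(out_channels)
    ]


weights = he_normal_conv(out_channels=8, in_channels=3, kernel_size=3)
print(he_std(3 * 3 * 3))
```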

For the penultimate layer , we assumed that it is the final activated feature map in the derivation of our method and theory in the main paper. In the practical implementation, however, we adopted to be its pre-activated part , which satisfies . Since the activation function a is ReLU, contains all information of in a linear format where is the negative parts of . As the pre-activated part contains more information, it gives slightly better performance (about 1% higher in AUC, for example, over CIFAR-10 experiments).
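The relation between a pre-activated feature and its ReLU-activated version can be illustrated numerically. In the sketch below, the names `h`, `positive`, and `negative` are hypothetical, and the sign convention for the negative part (stored as a non-negative magnitude) is our assumption. It shows that the pre-activation equals the activated feature minus the negative part, so the activated feature can always be recovered from the pre-activation but not the other way round.

```python
def relu(v):
    """Rectified linear unit on a scalar."""
    return max(0.0, v)


def split_pre_activation(h):
    """Split a pre-activation vector into its positive and negative parts."""
    positive = [relu(v) for v in h]   # the ReLU-activated feature
    negative = [relu(-v) for v in h]  # magnitude of what ReLU discards
    return positive, negative


h = [1.5, -0.5, 0.0, -2.0]
positive, negative = split_pre_activation(h)
reconstructed = [p - n for p, n in zip(positive, negative)]
print(positive, negative, reconstructed)
```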

D. Proofs

Here, we provide proofs of the propositions given in the main text.

Proof of Proposition 1. Let . Then, z = E(G(z)) for any . Since , we obtain the desired result.

Proof of Proposition 2. This is trivial because G is a neural network with ReLU activations and thus Lipschitz continuous.

Proof of Proposition 3. Note that with sigmoid activation function , a weight vector , and a bias . By the given assumption, for all , thus is linearly separable from .

Proof of Proposition 4. For each layer ,

where g is the function such that . Here, the bound is tight. Thus, simply cascading the inequality relation through finishes the proof.

Proof of Proposition 5. Observe

by the triangle inequality of norm, finishing the proof.
