NFAD: Fixing anomaly detection using normalizing flows

2019·Arxiv

ABSTRACT

Anomaly detection is a challenging task that frequently arises in practically all areas of industry and science, from fraud detection and data quality monitoring to finding rare cases of diseases and searching for new physics. Most conventional approaches to anomaly detection, such as one-class SVM and Robust Auto-Encoder, are one-class classification methods, i.e., they focus on separating normal data from the rest of the space. Such methods assume that the normal and anomalous classes are separable and, consequently, do not take into account any available samples of anomalies. Nonetheless, in practical settings, some anomalous samples are often available, although usually in amounts far lower than required for a balanced classification task, and the separability assumption might not always hold. This leads to an important task: incorporating known anomalous samples into the training procedures of anomaly detection models. In this work, we propose a novel model-agnostic training procedure to address this task. We reformulate one-class classification as a binary classification problem in which normal data are distinguished from pseudo-anomalous samples. The pseudo-anomalous samples are drawn from low-density regions of a normalizing flow model by feeding tails of the latent distribution into the model. This approach makes it easy to include known anomalies in the training process of an arbitrary classifier. We demonstrate that our approach shows comparable performance on one-class problems and, most importantly, achieves comparable or superior results on tasks with variable amounts of known anomalies.

Subjects Artificial Intelligence, Computer Vision, Data Mining and Machine Learning, Data Science

Keywords Anomaly detection, Deep learning, Semi-supervised learning, Normalizing flows

INTRODUCTION

The anomaly detection (AD) problem is one of the important tasks in the analysis of real-world data. Possible applications range from data-quality certification (for example, Borisyak et al., 2017) to finding rare specific cases of diseases in medicine (Spence, Parra & Sajda, 2001). The technique can also be used in credit card fraud detection (Aleskerov, Freisleben & Rao, 1997), failure prediction in complex systems (Xu & Li, 2013), and novelty detection in time series data (Schmidt & Simic, 2019).

Formally, AD is a classification problem with a representative set of normal samples and a small, non-representative or empty set of anomalous examples. In such a setting, conventional binary classification methods tend to overfit and are not robust w.r.t. novel anomalies (Görnitz et al., 2012). In contrast, conventional one-class classification (OC) methods (Breunig et al., 2000; Liu, Ting & Zhou, 2012) are typically robust against all types of outliers. However, OC methods do not take known anomalies into account, which often results in suboptimal performance when the normal and anomalous classes are not perfectly separable (Campos et al., 2016; Pang, Shen & Van den Hengel, 2019). Research in the area addresses several challenges (Pang et al., 2021): increasing precision, generalizing to unknown anomaly classes, and tackling multi-dimensional data. Several reviews of classical (Zimek, Schubert & Kriegel, 2012; Aggarwal, 2016; Boukerche, Zheng & Alfandi, 2020; Belhadi et al., 2020) and deep-learning methods (Pang et al., 2021) describe the literature in detail. With the advancement of neural generative modeling, methods based on generative adversarial networks (Schlegl et al., 2017), variational autoencoders (Xu et al., 2018), and normalizing flows (Pathak, 2019) have been introduced for the AD task.

How to cite this article Ryzhikov A, Borisyak M, Ustyuzhanin A, Derkach D. 2021. NFAD: fixing anomaly detection using normalizing flows. PeerJ Comput. Sci. 7:e757 http://doi.org/10.7717/peerj-cs.757

We propose addressing the class-imbalanced classification task by modifying the learning procedure so that anomaly detection methods become suitable for two-class classification. Our approach relies on augmenting the imbalanced dataset with surrogate anomalies sampled from normalizing flow-based generative models.

PROBLEM STATEMENT

Classical AD methods consider anomalies to be a priori significantly different from the normal samples (Aggarwal, 2016). In practice, while such samples are, indeed, most likely to be anomalous, some anomalies are often not distinguishable from normal samples (Hunziker et al., 2017; Pol et al., 2019; Borisyak et al., 2017). This provides a strong motivation to include known anomalous samples in the training procedure to improve the performance of the model on these ambiguous samples. Technically, this leads to a binary classification problem, which is typically solved by minimizing the cross-entropy loss function $\mathcal{L}^{BCE}$:

$$f^* = \arg\min_f \mathcal{L}^{BCE}(f); \tag{1}$$

$$\mathcal{L}^{BCE}(f) = -P(C^+)\,\mathbb{E}_{x \sim C^+} \log f(x) - P(C^-)\,\mathbb{E}_{x \sim C^-} \log\left(1 - f(x)\right); \tag{2}$$

where $f$ is an arbitrary model (e.g., a neural network), and $C^+$ and $C^-$ denote the normal and anomalous classes. In this case, the solution $f^*$ approaches the optimal Bayesian classifier:

$$f^*(x) = P(C^+ \mid x) = \frac{P(x \mid C^+)\,P(C^+)}{P(x \mid C^+)\,P(C^+) + P(x \mid C^-)\,P(C^-)}. \tag{3}$$

Notice that $f^*$ implicitly relies on the estimation of the probability densities $P(x \mid C^+)$ and $P(x \mid C^-)$. A good estimation of these densities is possible only when a sufficiently large and representative sample is available for each class. In practical settings, this assumption certainly holds for the normal class. However, the anomalous dataset is rarely large or representative, often consisting of only a few samples or covering only a portion of all possible anomaly types. With only a small number of examples (or a non-representative sample) to estimate the second term of Eq. (2), $\mathcal{L}^{BCE}$ effectively does not depend on $f(x)$ for $x$ in regions that the available samples do not cover, which leads to solutions with arbitrary predictions in those regions, i.e., to classifiers that are not robust to novel anomalies.
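As a small illustration of the optimal Bayesian classifier discussed above, the following sketch evaluates the posterior of the normal class for two one-dimensional Gaussian class-conditional densities and computes an empirical binary cross-entropy. The Gaussian densities, their parameters, and the function names are assumptions chosen for the example, not part of the method.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1D normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def bayes_posterior(x, prior_normal, normal=(0.0, 1.0), anomalous=(3.0, 1.0)):
    """P(C+ | x) for two Gaussian class-conditional densities (toy assumption)."""
    joint_normal = gaussian_pdf(x, *normal) * prior_normal
    joint_anomalous = gaussian_pdf(x, *anomalous) * (1.0 - prior_normal)
    return joint_normal / (joint_normal + joint_anomalous)

def bce(predictions, labels):
    """Empirical binary cross-entropy; labels are 1 for normal, 0 for anomalous."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for f, y in zip(predictions, labels):
        total -= y * math.log(f + eps) + (1 - y) * math.log(1.0 - f + eps)
    return total / len(labels)

# Halfway between the class means the two classes are equally likely.
posterior_mid = bayes_posterior(1.5, prior_normal=0.5)
# Far on the normal side the posterior of the normal class is close to one.
posterior_far = bayes_posterior(-2.0, prior_normal=0.5)
```

With only a handful of anomalous samples, the second density in `bayes_posterior` cannot be estimated reliably, which is the failure mode the paragraph above describes.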

One-class classifiers avoid this problem by aiming to explicitly separate the normal class from the rest of the space (Liu, Ting & Zhou, 2008; Scholkopf & Smola, 2018). As discussed above, this approach, however, ignores available anomalous samples, potentially leading to incorrect predictions on ambiguous samples.

Recently, semi-supervised AD algorithms such as the (1 + ε)-class classification method (Borisyak et al., 2020), the Deep Semi-supervised AD method (Ruff et al., 2019), Feature Encoding with AutoEncoders for Weakly-supervised Anomaly Detection (Zhou et al., 2021) and Deep Weakly-supervised Anomaly Detection (Pang et al., 2019) were put forward. They aim to combine the main properties of both unsupervised (one-class) and supervised (binary classification) approaches: the proper posterior probability estimates of binary classification and the robustness against novel anomalies of one-class classification.

In this work, we propose a method that extends the (1 + ε)-class classification scheme (Borisyak et al., 2020) by exploiting normalizing flows. The method is based on sampling surrogate anomalies to augment the existing set of anomalies.

NORMALIZING FLOWS

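The normalizing flows (NF) generative model (Rezende & Mohamed, 2015) aims to fit the exact probability distribution of the data. It represents a set of invertible transformations $\{f_i(\cdot; \theta_i)\}$ with parameters $\theta_i$, used to obtain a bijection between the given distribution of training samples and some domain distribution with a known probability density function (PDF). However, in the case of a non-trivial bijection $z_0 \leftrightarrow z_k$, the distribution density at the final point $z_k$ (training sample) differs from the density at the point $z_0$ (domain). This is because each non-trivial transformation $f_i(\cdot; \theta_i)$ changes the infinitesimal volume at some points. Thus, the task is not only to find a flow of invertible transformations, but also to know how the distribution density changes at each point after each transformation $f_i(\cdot; \theta_i)$.

Consider the multivariate transformation of a variable $z_i = f_i(z_{i-1}; \theta_i)$ with parameters $\theta_i$ for $i > 0$. Then the Jacobian of the transformation $f_i(z_{i-1}; \theta_i)$ at a given point $z_{i-1}$ has the following form:

$$J(f_i \mid z_{i-1}) = \begin{pmatrix} \dfrac{\partial f_i^{1}}{\partial z_{i-1}^{1}} & \cdots & \dfrac{\partial f_i^{1}}{\partial z_{i-1}^{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_i^{n}}{\partial z_{i-1}^{1}} & \cdots & \dfrac{\partial f_i^{n}}{\partial z_{i-1}^{n}} \end{pmatrix}$$

Then the distribution density at the point $z_i$ after the transformation $f_i$ of the point $z_{i-1}$ can be written in the following common way:

$$p(z_i) = \frac{p(z_{i-1})}{\left| \det J(f_i \mid z_{i-1}) \right|},$$

where $\det J(f_i \mid z_{i-1})$ is the determinant of the Jacobian matrix $J(f_i \mid z_{i-1})$ (Rezende & Mohamed, 2015).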

Thus, given a flow of invertible transformations $f$ and a domain distribution of $z_0$ with known PDF $p(z_0)$, we obtain the likelihood $p(x)$ for each object $x = z_k$. This way, the parameters of the NF model $f$ can be fitted by explicitly maximizing the likelihood $p(x)$ of the training objects $x$. In practice, a Monte Carlo estimate of $\log p(X)$, computed over sampled training objects, is optimized, which is an equivalent optimization procedure. The likelihood $p(X)$ can also be used as a metric of how well the NF model $f$ fits the given data $X$.
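The change-of-variables rule can be checked numerically on a toy flow. The sketch below composes two one-dimensional affine layers, for which the Jacobian determinant of each layer is simply its slope, and compares the resulting log-density with the closed-form Gaussian density. The affine layers and their parameter values are assumptions made for illustration; practical flows use far richer transformations.

```python
import math
from statistics import NormalDist

def flow_log_density(x, layers):
    """log p(x) for x = f_k(...f_1(z_0)) with affine layers f_i(z) = a_i * z + b_i.

    Inverts the flow layer by layer and accumulates log|det J(f_i)| = log|a_i|,
    which is then subtracted from the base log-density.
    """
    z = x
    log_det_sum = 0.0
    for a, b in reversed(layers):
        z = (z - b) / a                      # inverse of f_i
        log_det_sum += math.log(abs(a))
    log_base = math.log(NormalDist(0.0, 1.0).pdf(z))  # log p(z_0)
    return log_base - log_det_sum

layers = [(2.0, 1.0), (-1.5, 3.0)]           # hypothetical (a_i, b_i) pairs
x = 2.2
log_p_flow = flow_log_density(x, layers)

# Closed form: x = a2 * (a1 * z + b1) + b2 is Gaussian with
# mean a2 * b1 + b2 and standard deviation |a1 * a2|.
exact = NormalDist(-1.5 * 1.0 + 3.0, abs(2.0 * -1.5))
log_p_exact = math.log(exact.pdf(x))
```

The two values agree up to floating-point error, since the per-layer log-determinants sum to the log of the overall scale.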

The main bottleneck of this scheme is the computation of $\det J(\cdot)$, which is $O(n^3)$ in the general case ($n$ is the dimension of the variable $z$). To deal with this problem, normalizing flows with specific families of transformations $f$ are used, for which the Jacobian determinant is much cheaper to compute (Rezende & Mohamed, 2015; Papamakarios, Pavlakou & Murray, 2017; Kingma et al., 2016; Chen et al., 2019).
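One common way to avoid the cubic cost is to restrict the transformations so that the Jacobian is triangular, as autoregressive flows do; the log-determinant then reduces to a sum over the diagonal. The sketch below contrasts a general elimination-based computation with the triangular shortcut. The matrix values and function names are illustrative assumptions.

```python
import math

def log_abs_det_general(matrix):
    """log|det| via Gaussian elimination with partial pivoting: O(n^3)."""
    a = [row[:] for row in matrix]
    n = len(a)
    log_det = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return float("-inf")             # singular matrix
        a[col], a[pivot] = a[pivot], a[col]
        log_det += math.log(abs(a[col][col]))
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return log_det

def log_abs_det_triangular(matrix):
    """log|det| of a triangular matrix: sum of log|diagonal|, O(n)."""
    return sum(math.log(abs(matrix[i][i])) for i in range(len(matrix)))

# A lower-triangular Jacobian, as produced by an autoregressive transformation.
jacobian = [
    [2.0, 0.0, 0.0],
    [0.3, 0.5, 0.0],
    [-1.2, 0.7, 4.0],
]
general = log_abs_det_general(jacobian)
fast = log_abs_det_triangular(jacobian)
```

Both functions return log 4 for this matrix, but the second touches only the n diagonal entries.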

ALGORITHM

The suggested NF-based AD method (NFAD) is a two-step procedure. In the first step, we train a normalizing flow on normal samples so that it can be used to sample new surrogate anomalies. Here, we assume that anomalies differ from normal samples and that their likelihood $p_{NF}(x^-)$ is less than the likelihood of normal samples $p_{NF}(x^+)$. In the second step, we sample new surrogate anomalies from the tails of the normal samples distribution using the NF and train an arbitrary binary classifier on normal samples and a mixture of real and sampled surrogate anomalies.

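Step 1. Training normalizing flow

We train a normalizing flow on normal samples. It can be trained using the standard scheme for normalizing flows, i.e., maximization of the log-likelihood (see ‘Normalizing flows’):

$$\max_{\theta} \mathcal{L}_{NF} = \mathbb{E}_{x \sim C^+} \log p_{NF}(x) = \mathbb{E}_{x \sim C^+} \left[ \log p(z_0) - \sum_{i=1}^{k} \log \left| \det J(f_i \mid z_{i-1}) \right| \right], \tag{8}$$

where $z_0$ is the domain point that the flow maps to $x$.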

Once the NF is trained, it can be used to sample new anomalies. To produce new anomalies, we sample z from the tails of the normal domain distribution, where the p-value of the tails is a hyperparameter (see Fig. 1).
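A minimal sketch of tail sampling in the domain space follows. It assumes a two-dimensional standard normal domain and defines the tail as the region outside the central disc that holds a fraction 1 − p of the probability mass; this definition of the tail, the function name, and the closed-form radius (valid only in two dimensions) are assumptions made for the example.

```python
import math
import random

def sample_gaussian_tail_2d(n_samples, p_value, rng):
    """Draw z ~ N(0, I_2) conditioned on lying in the outer region of mass p_value.

    For a 2D standard normal, P(||z|| > r) = exp(-r^2 / 2), so the tail radius
    can be drawn by inverse-CDF sampling; the direction is uniform on the circle.
    """
    samples = []
    for _ in range(n_samples):
        v = 1.0 - rng.random()                       # uniform on (0, 1]
        radius = math.sqrt(-2.0 * math.log(p_value * v))
        angle = rng.uniform(0.0, 2.0 * math.pi)
        samples.append((radius * math.cos(angle), radius * math.sin(angle)))
    return samples

rng = random.Random(0)
p_value = 0.05
tail = sample_gaussian_tail_2d(1000, p_value, rng)
# Radius of the disc that contains 1 - p_value of the probability mass.
threshold = math.sqrt(-2.0 * math.log(p_value))
```

Every sampled point lies at or beyond `threshold`; passing such points through a trained flow would yield the surrogate anomalies.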

Figure 1 NF bijection between tails of the standard normal domain distribution (left) and 2D Moon dataset (Pedregosa et al., 2011) samples (right). Rows represent different choices of the tail p-value. The ROC AUC of the anomaly classifier is shown on the right side. The classifier is trained on the mixture of $C^+$ samples from the Moon dataset and surrogate anomalies sampled from the tails.

Here, we assume that test-time anomalies are either represented in the given anomalous training set or are novelties w.r.t. the normal class. In other words, p(x) of novelties x must be relatively small. Nevertheless, p(x) obtained by the NF might be drastically different from the corresponding domain point likelihood p(z) because of the non-unit Jacobian of the NF transformations, Eq. (8). Such distribution density distortion is illustrated in Fig. 2 and makes the proposed anomaly sampling scheme incomplete. Because of this distortion, some points in the tails of the domain can correspond to normal samples, and some points in the body of the domain distribution can correspond to anomalies. To fix this, we propose Jacobian regularization of normalizing flows (Fig. 2) by introducing an extra regularization term. It penalizes the model for a non-unit Jacobian:

$$\max_{\theta} \; \mathcal{L}_{NF} - \lambda \mathcal{L}_{J}, \tag{9}$$

where $\lambda$ denotes the regularization hyperparameter. We estimate the regularization term $\mathcal{L}_J$ in Eq. (9) by direct sampling of z from the domain distribution N(0, I) to cover the whole sampling space. The theorem below proves that any level of expected distortion can be obtained with such a regularization.

Figure 2 Density distortion of normalizing flows on the Moon dataset (Pedregosa et al., 2011). Without extra regularization, the density of the domain distribution (A) significantly differs from the target distribution (B) because of the non-unit Jacobian. To preserve the distribution density after the NF transformations, Jacobian regularization, Eq. (9), can be used (C and D, respectively).

Theorem 4.1 Let there be a sample space with probability (domain) distribution D, $C^+$ a class of normal samples, and $f(\cdot; \theta)$ a set of invertible transformations parametrized by $\theta$ …

Proof. Suppose the opposite. Let … Let … (the minimum exists since the negative log-likelihood is lower bounded by 0). Then … But this leads to a contradiction.

In this work, we use Neural Spline Flows (NSF, Durkan et al., 2019) and Inverse Autoregressive Flows (IAF, Kingma et al., 2016) for tabular anomaly sampling. We also use Residual Flow (ResFlow, Chen et al., 2019) for anomaly sampling on image datasets.

All the flows satisfy the conditions of Theorem 4.1. The proposed algorithms are called ‘nfad-nsf’, ‘nfad-iaf’ and ‘nfad-resflow’, respectively.
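The effect of Jacobian regularization can be illustrated on a one-parameter toy flow x = a·z with z ~ N(0, 1). The sketch below assumes a squared-log penalty, λ·(log a)², as the regularization term; this specific penalty, the function name, and the data second moment are assumptions for illustration and are not necessarily the form used in Eq. (9).

```python
import math

def regularized_scale(second_moment, lam):
    """Scale a of the 1D flow x = a * z (z ~ N(0, 1)) that minimizes
    mean NLL + lam * (log a)^2, found by bisection on u = log a.

    Up to a constant, the mean NLL is 0.5 * second_moment * exp(-2u) + u.
    """
    def gradient(u):
        return -second_moment * math.exp(-2.0 * u) + 1.0 + 2.0 * lam * u

    low, high = -10.0, 10.0          # the gradient is increasing in u and changes sign here
    for _ in range(200):
        mid = 0.5 * (low + high)
        if gradient(mid) > 0.0:
            high = mid
        else:
            low = mid
    return math.exp(0.5 * (low + high))

second_moment = 9.0                  # hypothetical data with E[x^2] = 9
scales = {lam: regularized_scale(second_moment, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
```

Without regularization the fitted scale equals the data standard deviation (3 here), so the Jacobian is far from one; increasing λ pulls the scale toward one at the cost of a lower likelihood, which is the trade-off the regularization hyperparameter controls.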

Once normalizing flow for anomaly sampling is trained, a classifier can be trained on normal samples and a mixture of real and surrogate anomalies sampled from NF (Fig. 3).

In our experiments, we use the binary cross-entropy objective, Eq. (2), to train the classifier. We do not focus on the classifier configuration, since any classification model can be used at this step.

Final algorithm

The final scheme of the algorithm is shown in Fig. 3, accompanied by the pseudocode in Algorithm 1. All training details are given in Appendix A.

Algorithm 1
Input: normal samples $C^+$, anomalous samples $C^-$ (may be empty), p-value of tail p, number of epochs for NF $E_{NF}$, number of epochs for classifier $E_{CLF}$
Output: anomaly classifier g
for epoch from 1 to $E_{NF}$ do
    …
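The two-step procedure can be sketched end to end on a toy one-dimensional problem. In this sketch an affine map fitted by maximum likelihood stands in for a trained normalizing flow, and a logistic regression on a single squared feature stands in for the arbitrary classifier; both choices, the data, and all function names are simplifying assumptions, not the configuration used in the experiments.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def sigmoid(t):
    """Numerically safe logistic function."""
    t = max(-60.0, min(60.0, t))
    return 1.0 / (1.0 + math.exp(-t))

def fit_affine_flow(normal_samples):
    """Maximum-likelihood affine 'flow' x = mu + sigma * z; a stand-in for a real NF."""
    return mean(normal_samples), pstdev(normal_samples)

def sample_surrogate_anomalies(mu, sigma, p_value, n_samples, rng):
    """Draw z from the two-sided tails of N(0, 1) with total mass p_value, map to x."""
    base = NormalDist(0.0, 1.0)
    anomalies = []
    for _ in range(n_samples):
        u = min(rng.uniform(1.0 - p_value / 2.0, 1.0), 1.0 - 1e-12)
        z = base.inv_cdf(u) * rng.choice((-1.0, 1.0))
        anomalies.append(mu + sigma * z)
    return anomalies

def train_classifier(normals, anomalies, mu, sigma, epochs, lr=0.05):
    """Logistic regression on ((x - mu) / sigma)^2; the output estimates P(normal | x)."""
    data = [(((x - mu) / sigma) ** 2, 1.0) for x in normals]
    data += [(((x - mu) / sigma) ** 2, 0.0) for x in anomalies]
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for feature, y in data:
            error = sigmoid(w * feature + b) - y     # gradient of BCE w.r.t. the logit
            grad_w += error * feature
            grad_b += error
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return lambda x: sigmoid(w * ((x - mu) / sigma) ** 2 + b)

rng = random.Random(0)
normals = [rng.gauss(5.0, 1.0) for _ in range(500)]    # class C+
known_anomalies = [9.5, 0.2]                           # class C- (may be empty)
mu, sigma = fit_affine_flow(normals)                   # step 1: fit the flow
surrogates = sample_surrogate_anomalies(mu, sigma, 0.01, 500, rng)
classifier = train_classifier(                         # step 2: train the classifier
    normals, known_anomalies + surrogates, mu, sigma, epochs=500)
```

The resulting `classifier` assigns a high normal-class probability near the bulk of the data and a low one far from it, even though only two real anomalies were supplied.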

RESULTS

We evaluate the proposed method on the following tabular and image datasets: KDD-99 (Stolfo et al., 1999), SUSY (Whiteson, 2014), HIGGS (Baldi, Sadowski & Whiteson, 2014), MNIST (LeCun et al., 1998a), Omniglot (Lake, Salakhutdinov & Tenenbaum, 2015) and CIFAR (Krizhevsky & Hinton, 2009). To reflect the typical AD cases that motivate the approach, we derive multiple tasks from each dataset by varying the size of the anomalous dataset.

As the proposed method targets problems that are intermediate between one-class and two-class problems, we compare the proposed approach with the following algorithms:

• one-class methods: Robust AutoEncoder (RAE-OC, Chalapathy, Krishna Menon & Chawla, 2017) and Deep SVDD (Ruff et al., 2018);

• conventional two-class classification;

• semi-supervised methods: dimensionality reduction by a Deep AutoEncoder followed by two-class classification (DAE), Feature Encoding with AutoEncoders for Weakly-supervised Anomaly Detection (FEAWAD, Zhou et al., 2021), DevNet (Pang, Shen & Van den Hengel, 2019), (1 + ε)-class classification (Borisyak et al., 2020) (‘*ope’), Deep SAD (Ruff et al., 2019) and Deep Weakly-supervised Anomaly Detection (PRO, Pang et al., 2019).

Figure 3 Normalizing flows for anomaly detection (NFAD). Surrogate anomalies are sampled from the tails of a Gaussian distribution and transformed by the NF to be mixed with real samples. Then, any classifier can be trained on that mixture.

We compare the algorithms using the ROC AUC metric to avoid unnecessary optimization for threshold-dependent metrics like accuracy, precision, or F1. Tables 1, 2 and 3 show the experimental results on tabular data. Tables 4, 5 and 6 show the experimental results on image data. Some of the aforementioned algorithms, like DevNet, are applicable only to tabular data and are not reported on image data. In these tables, columns represent tasks with a varying number of negative samples present in the training set: numbers in the header indicate either the number of classes that form the negative class (in the case of the KDD, CIFAR, OMNIGLOT and MNIST datasets) or the number of negative samples used (HIGGS and SUSY); ‘one-class’ denotes the absence of known anomalous samples. As one-class algorithms do not take negative samples into account, their results are identical for the tasks with any number of known anomalies. The best score in each column is highlighted in bold font.
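For reference, ROC AUC can be computed without choosing any threshold, as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The pairwise sketch below assumes that label 1 marks the positive class and label 0 the negative class; the function name is illustrative, and library implementations use faster rank-based formulas.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive sample outscores a negative one.

    Ties count as one half. The cost is O(n_pos * n_neg), which is adequate
    for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```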

DISCUSSION

Our tests suggest that the best results are achieved when the normal class distribution has a single mode and convex borders. These requirements are data-specific and cannot be effectively addressed in our algorithm. The effects can be seen in Fig. 2, where two modes result in a ‘bridge’ in the reconstructed standard class shape, and the non-convexity of the borders leads to a worse description of the separation line.

Table 1 ROC AUC on the KDD-99 dataset. ‘nfad*’ is our algorithm.

Table 2 ROC AUC on the HIGGS dataset. ‘nfad*’ is our algorithm.

Also, hyperparameters such as the Jacobian regularization weight and the tail size p must be chosen carefully. This is illustrated in Figs. 1 and 2, which show the sample quality and the performance of our algorithm for different hyperparameter values. To find suitable values, some heuristics can be used. For instance, the optimal tail location p can be estimated from known anomalies in the training dataset, whereas the Jacobian regularization weight can be linearly scheduled during NF training, like the KL factor in Hasan et al. (2020).
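A linear schedule for the regularization weight might look like the sketch below. The ramp direction (from zero up to a maximum value), the warm-up length, and the function name are assumptions for illustration; the text above only states that a linear schedule can be used.

```python
def linear_schedule(step, warmup_steps, max_value):
    """Ramp a regularization weight linearly from 0 to max_value, then hold it."""
    if warmup_steps <= 0:
        return max_value
    return max_value * min(1.0, step / warmup_steps)

# Weight used at each of the first seven training steps.
lambdas = [linear_schedule(step, warmup_steps=4, max_value=2.0) for step in range(7)]
# lambdas == [0.0, 0.5, 1.0, 1.5, 2.0, 2.0, 2.0]
```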

On tabular data (Tables 1, 2 and 3), the proposed NFAD method shows statistically significant improvement over other AD algorithms in many experiments, where the amount of anomalous samples is extremely low.

Table 3 ROC AUC on the SUSY dataset. ‘nfad*’ is our algorithm.

Table 4 ROC AUC on the MNIST dataset. ‘nfad*’ is our algorithm.

On image data (Tables 4, 5 and 6), the proposed method shows competitive quality along with other state-of-the-art AD methods, significantly outperforming the existing algorithms on CIFAR dataset.

Our experiments suggest that the main reason for the lower performance of the proposed method relative to others on image data is the tendency of normalizing flows to estimate the likelihood of images from their local features instead of their common semantics, as described by Kirichenko, Izmailov & Wilson (2020). We also find that overfitting of the classifier must be carefully monitored and addressed, as it might degrade the performance of the algorithm.

However, the results obtained on the HIGGS, KDD, SUSY and CIFAR-10 datasets demonstrate the considerable potential of the proposed method compared with previous AD algorithms.

Table 5 ROC AUC on the CIFAR-10 dataset. ‘nfad*’ is our algorithm.

Table 6 ROC AUC on the Omniglot dataset. Note that for this task only Greek, Futurama and Braille alphabets were considered as normal classes. ‘nfad*’ is our algorithm.

With the advancement of new ways of NF application to images, the results are expected to improve for this class of datasets as well. In particular, we believe our method to be widely applicable in the industrial environment, where the task of AD can take advantage of both tabular and image-like datasets.

It also should be emphasized that unlike state-of-the-art AD algorithms (Pang et al., 2019; Zhou et al., 2021; Ruff et al., 2019), we propose a model-agnostic data augmentation algorithm that does not modify AD model training scheme and architecture. It enriches the input training anomalies set requiring only normal samples in the augmentation process (Fig. 3).

Figure 4. Tabular data classifier architecture.

CONCLUSION

In this work, we present a new model-agnostic anomaly detection training scheme that deals efficiently with problems that are hard to address with either one-class or two-class methods. The solution combines the best features of one-class and two-class approaches. In contrast to one-class approaches, the proposed method lets the classifier effectively utilize any number of known anomalous examples, but, unlike conventional two-class classification, it does not require an extensive number of anomalous samples. The proposed algorithm significantly outperforms the existing anomaly detection algorithms in most realistic anomaly detection cases. This approach is especially beneficial for anomaly detection problems in which anomalous data is non-representative or might drift over time.

The proposed method is fast, stable and flexible in both the training and inference stages; unlike previous methods, any classifier can be used in the scheme with any number of anomalies in the training dataset. Such a universal augmentation scheme opens wide prospects for further anomaly detection studies and makes it possible to use any classifier on any kind of data. Also, the results on image datasets can be improved as new normalizing flow techniques become available.

APPENDIX A. TRAIN AND IMPLEMENTATION DETAILS

All the code is implemented using the PyTorch (Paszke et al., 2019) framework. For augmentation, ResFlow (Chen et al., 2019), NSF (Durkan et al., 2019) and IAF (Kingma et al., 2016) are trained with default parameters. As a classifier, a dense classifier with three layers is used for tabular data (see Fig. 4), and the built-in ResFlow classification head is used for images. The tabular data classifier is trained for 10 epochs with batch size 100 using the AdamW (Loshchilov & Hutter, 2017) optimizer with default PyTorch parameters. For image data, the ResFlow classification head is trained for 8 epochs with batch size 40 using the Adam (Kingma & Ba, 2014) optimizer with default PyTorch parameters.

The research leading to these results has received funding from the Russian Science Foundation under grant agreement no. 19-71-30020. The research was also supported through computational resources of HPC facilities at NRU HSE. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author Contributions

• Artem Ryzhikov conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

• Maxim Borisyak conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the paper, and approved the final draft.

• Andrey Ustyuzhanin and Denis Derkach conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The data is available at:
- Moons dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_
- SUSY dataset: https://archive.ics.uci.edu/ml/datasets/SUSY
- HIGGS dataset: https://archive.ics.uci.edu/ml/datasets/HIGGS
- MNIST dataset: http://yann.lecun.com/exdb/mnist/
- CIFAR dataset: https://www.cs.toronto.edu/~kriz/cifar.html
- KDD dataset: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
- OMNIGLOT dataset: https://github.com/brendenlake/omniglot.

Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.757#supplemental-information.

REFERENCES

Aggarwal CC. 2016. Outlier analysis. 2nd edition. Luxembourg: Springer Publishing Company, Incorporated.

Aleskerov E, Freisleben B, Rao B. 1997. Cardwatch: a neural network based database mining system for credit card fraud detection. In: Proceedings of the IEEE/IAFE 1997 computational intelligence for financial engineering (CIFEr). Piscataway: IEEE, 220–226.

Baldi P, Sadowski P, Whiteson D. 2014. Searching for exotic particles in high-energy physics with deep learning. Nature Communications 5:4308 DOI 10.1038/ncomms5308.

Belhadi A, Djenouri Y, Lin JC-W, Cano A. 2020. Trajectory outlier detection: algorithms, taxonomies, evaluation, and open challenges. ACM Transactions on Management Information Systems 11(3):1–29 DOI 10.1145/3399631.

Borisyak M, Ratnikov F, Derkach D, Ustyuzhanin A. 2017. Towards automation of data quality system for CERN CMS experiment. Journal of Physics: Conference Series 898(9):092041.

Borisyak M, Ryzhikov A, Ustyuzhanin A, Derkach D, Ratnikov F, Mineeva O. 2020. (1+ epsilon)-class classification: an anomaly detection method for highly imbalanced or incomplete data sets. Journal of Machine Learning Research 21(72):1–22.

Boukerche A, Zheng L, Alfandi O. 2020. Outlier detection: methods, models, and classification. ACM Computing Surveys 53(3):1–37 DOI 10.1145/3381028.

Breunig M, Kriegel H-P, Ng RT, Sander J. 2000. LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM sigmod international conference on management of data. New York: ACM, 93–104.

Campos G, Zimek A, Sander J, Campello R, Micenková B, Schubert E, Assent I, Houle M. 2016. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30:891–927 DOI 10.1007/s10618-015-0444-8.

Chalapathy R, Krishna Menon A, Chawla S. 2017. Robust, deep and inductive anomaly detection. ArXiv preprint. arXiv:1704.06743.

Chen RTQ, Behrmann J, Duvenaud D, Jacobsen J-H. 2019. Residual flows for invertible generative modeling. ArXiv preprint. arXiv:1906.02735.

Durkan C, Bekasov A, Murray I, Papamakarios G. 2019. Neural spline flows. ArXiv preprint. arXiv:1906.04032.

Görnitz N, Kloft M, Rieck K, Brefeld U. 2012. Toward supervised anomaly detection. Journal of Artificial Intelligence Research (JAIR) 45:235–262 DOI 10.1613/jair.3623.

Hasan A, Pereira JM, Farsiu S, Tarokh V. 2020. Learning latent stochastic differential equations with variational auto-encoders. ArXiv preprint. arXiv:2007.06075.

Hunziker S, Gubler S, Calle J, Moreno I, Andrade M, Velarde F, Ticona L, Carrasco G, Castellón Y, Oria C, Croci-Maspoli M, Konzelmann T, Rohrer M, Brönnimann S. 2017. Identifying, attributing, and overcoming common data quality issues of manned station observations. International Journal of Climatology 37:4131–4145 DOI 10.1002/joc.5037.

Kingma DP, Ba J. 2014. Adam: a method for stochastic optimization. ArXiv preprint. arXiv:1412.6980.

Kingma DP, Salimans T, Jozefowicz R, Chen X, Sutskever I, Welling M. 2016. Improved variational inference with inverse autoregressive flow. In: Advances in neural information processing systems. New York: ACM, 4743–4751.

Kirichenko P, Izmailov P, Wilson AG. 2020. Why normalizing flows fail to detect out-of-distribution data. ArXiv preprint. arXiv:2006.08545.

Krizhevsky A, Hinton G. 2009. Learning multiple layers of features from tiny images. Available at https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf .

Lake BM, Salakhutdinov R, Tenenbaum JB. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338 DOI 10.1126/science.aab3050.

LeCun Y, Bottou L, Bengio Y, Haffner P. 1998a. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324 DOI 10.1109/5.726791.

Liu FT, Ting KM, Zhou Z-H. 2008. Isolation forest. In: 2008 Eighth IEEE international conference on data mining. Piscataway: IEEE, 413–422.

Liu FT, Ting KM, Zhou Z-H. 2012. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data 6(1):1–39 DOI 10.1145/2133360.2133363.

Loshchilov I, Hutter F. 2017. Decoupled weight decay regularization. ArXiv preprint. arXiv:1711.05101.

Pang G, Shen C, Cao L, Hengel AVD. 2021. Deep learning for anomaly detection. ACM Computing Surveys 54(2):138 DOI 10.1145/3439950.

Pang G, Shen C, Van den Hengel A. 2019. Deep anomaly detection with deviation networks. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. New York: ACM, 353–362.

Pang G, Shen C, Jin H, Van den Hengel A. 2019. Deep weakly-supervised anomaly detection. ArXiv preprint. arXiv:1910.13601.

Papamakarios G, Pavlakou T, Murray I. 2017. Masked autoregressive flow for density estimation. In: Advances in Neural Information Processing Systems. 2338–2347.

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. 2019. PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in neural information processing systems 32. New York: Curran Associates, Inc, 8024–8035.

Pathak C. 2019. Exploring normalizing flow for anomaly detection. dissertation, TU Delft Electrical Engineering.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Müller A, Nothman J, Louppe G, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12(Oct):2825–2830.

Pol A, Azzolini V, Cerminara G, Guio F, Franzoni G, Pierini M, Sirok F, Vlimant J-R. 2019. Anomaly detection using Deep Autoencoders for the assessment of the quality of the data acquired by the CMS experiment. EPJ Web of Conferences 214:06008 DOI 10.1051/epjconf/201921406008.

Rezende DJ, Mohamed S. 2015. Variational inference with normalizing flows. ArXiv preprint. arXiv:1505.05770.

Ruff L, Vandermeulen RA, Görnitz N, Binder A, Müller E, Müller K-R, Kloft M. 2019. Deep semi-supervised anomaly detection. ArXiv preprint. arXiv:1906.02694.

Ruff L, Vandermeulen RA, Görnitz N, Deecke L, Siddiqui SA, Binder A, Müller E, Kloft M. 2018. Deep one-class classification. In: Proceedings of the 35th international conference on machine learning, volume 80. 4393–4402.

Schlegl T, Seeböck P, Waldstein SM, Schmidt-Erfurth U, Langs G. 2017. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: Niethammer M, ed. Information Processing in Medical Imaging. IPMI 2017. Lecture notes in Computer Science, vol. 10265. Cham: Springer, 146–157 DOI 10.1007/978-3-319-59050-9_12.

Schmidt M, Simic M. 2019. Normalizing flows for novelty detection in industrial time series data. ArXiv preprint. arXiv:1906.06904.

Scholkopf B, Smola AJ. 2018. Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.

Spence C, Parra L, Sajda P. 2001. Detection, synthesis and compression in mammographic image analysis with a hierarchical image probability model. In: Proceedings IEEE workshop on mathematical methods in biomedical image analysis (MMBIA 2001). Piscataway: IEEE, 3–10.

Stolfo S, Fan W, Lee W, Prodromidis A, Chan P. 1999. KDD Cup 1999 dataset. Available at https://archive.ics.uci.edu/ml/datasets/kdd+cup+1999+data.

Whiteson D. 2014. SUSY dataset. Available at http://archive.ics.uci.edu/ml/datasets/SUSY .

Xu H, Feng Y, Chen J, Wang Z, Qiao H, Chen W, Zhao N, Li Z, Bu J, Li Z, et al. 2018. Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In: Proceedings of the 2018 world wide web conference on world wide web - WWW 18. DOI 10.1145/3178876.3185996.

Xu J, Li H. 2013. The failure prediction of cluster systems based on system logs. In: Wang M, ed. Knowledge Science, Engineering and Management. KSEM 2013. Lecture Notes in Computer Science, vol. 8041. Berlin, Heidelberg: Springer DOI 10.1007/978-3-642-39787-5_44.

Zhou Y, Song X, Zhang Y, Liu F, Zhu C, Liu L. 2021. Feature encoding with autoencoders for weakly-supervised anomaly detection. IEEE Transactions on Neural Networks and Learning Systems PP:1–12 Available at https://ieeexplore.ieee.org/document/9465358.

Zimek A, Schubert E, Kriegel H-P. 2012. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal 5(5):363–387 DOI 10.1002/sam.11161.