One-Class Classification by Ensembles of Regression models -- a detailed study

2019·arXiv

Abstract

1 INTRODUCTION

The one-class classification (OCC) problem is a special class of classification problems in which only the data points of one class (the target set) are available [1]. The task in one-class classification is to make a model of a target set of data points and to predict if a testing data point is similar to the target set. Point which are not similar to the target set are called outlier. OCC algorithms have applications in various domains [2] including anomaly detection, fraud detection, machine fault detection, spam detection, etc. [2].

The OCC problem is generally considered to be more difficult problem than the two-class classification problem as the training data has only data points belonging to one class [1]–[3], and traditional classifiers need training data from more than one class to learn decision boundaries. Therefore, standard classifiers cannot be applied directly to OCC problems. Various algorithms have been proposed to address OCC problems [1]–[3].

There are two main approaches to handle OCC problems [2], [3]. In the first approach, artificial data points for the non-target class (outlier) are generated and combined with the target data points and then a binary classifier is trained on this new data. In the second approach, target data points are used to create the OCC models [4].

Gaussian models [3], reconstruction-based methods [1]– [3], nearest neighbours [1], [5], support vector machines [6], [7] clustering based methods [1] and convex hull [8] are some examples of the second approach.

Ensembles of accurate and diverse models generally perform better than individual members of ensembles [9]. Ensembles of classification models have been developed to improve the performance of one-class classification models [5], [10]–[12].

Regression analysis is a well-established research area [13]. In a regression problem, the relationship between a dependent variable and other independent variables is investigated. Various regression methods have been developed [13].

Nato et al. [14] propose OCCER that uses ensembles of regression models to address anomaly detection problem. They test OCCER only on UCI datasets [15]. In this paper, we carried out an extensive study of OCCER on various kinds of datasets. We also studied the performance of OCCER on image datasets [16] by creating latent features using autoencoders. The effect of ensemble size on the performance of OCCER was also investigated.

The paper is organised as follows. Section 2 discusses about related work. The OCCER algorithm is presented in Section 3. Section 4 presents experiments and discussion. Section 5 discusses the conclusion and suggests future developments.

2 LITERATURE SURVEY

As previously discussed there are two types of OCC algorithms, and OCCER belongs to the second type, which will be discussed in this section.

Generative methods are useful for OCC as the target class may directly be modelled from the available training target data points. Density-based methods, such as Gaussian, kernel density estimators, Parzen windows and mixture models are widely used for OCC problems [3], [17],. Density-based methods estimate the probability density function of the underlying distribution of the training target data points. Then, these methods determine if a new data point comes from the same distribution. The selection of appropriate models and large-scale training data are the problems of this approach.

Nearest neighbour-based (NN-based) approaches are other widely used methods to address OCC problems [1], [3], [5]. This approach assumes that an outlier point will be far away from neighbour target points as compared to a target point from other neighbour target points [1], [5].

The local outlier factor (LOF) method is a density-based scheme for OCC [18], in which a LOF is computed for each data point by taking the ratios of the local density of the point and the local densities of its neighbours. An outlier point has a large LOF score.

Tax and Duin [6] propose the support vector domain description method for OCC. The method finds a hypersphere with a minimum volume around the target class data such that it encloses almost all the points in the target class dataset. Scholkopf et al. [7] propose the use of support vector machines for one class classification. A hyperplane is constructed such that it separates all the data points from the origin and the hyperplanes distance from the origin is maximised.

In reconstruction-based methods [1]–[3], a model like autoencoder is trained on the given target class data. The reconstruction error which depends on a testing data point and the system output is used to define the outlier score. An outlier point is likely to have more reconstruction error.

Clustering-based approaches use a clustering method, like k-means clustering to create clusters [1]. The distance between a data point and its nearest cluster centre is used as the outlier score. The number of clusters and cluster initialization are the problem of k-means type clustering algorithms.

Arashloo and Kittler [19] present a nonlinear one-class classifier formulated as the Rayleigh quotient criterion optimisation that projects the target class points to a single point in a new feature space, the distance between the projection of a testing point to that point is the outlier score of the testing point. Leng et al. [20] use a similar approach,but uses extreme learning machines for the projection.

Ensembles have also been developed for the OCC problems. There are two approaches for creating ensembles. In the first approach, one OCC algorithm is employed and a randomisation process is used to create diverse OCC models. Lazarevic and Kumar [21] propose the creation of multiple datasets by using feature bagging. The LOF algorithm is then used on these multiple datasets, hence multiple OCC models are created. The outputs of these models are combined to get the final output. Khan and Ahmad [5] use random projection to create multiple datasets. A NN-based OCC algorithm is applied to these multiple datasets. Zimek et al. [22] introduce noise to the dataset to create multiple datasets. Experiments with different OCC algorithms show the effectiveness of the proposed approach. Chen et al. [11] use randomly connected autoencoders to create ensembles of autoencoders. These ensembles outperformed other state-of-the art OCC methods. Khan and Taati [23] train different autoencoders using different features to create ensembles of autoencoders. They show that ensembles perform better than single autoencoders. Isolation forests consist of many decision trees [10]. These trees are created by using random partitioning. The authors argue that anomalies are susceptible to isolation and therefore have short path lengths. The method has produced excellent results on various datasets. Nato et al. [14] propose OCCER that uses an ensemble of regression models to produce outlier score. In the next

section, we will discuss this model in detail.

3 OCCER ALGORITHM

In this section, we discuss OCCER [14], an OCC algorithm, that uses an ensemble of regression models to compute the outlier score of a data point . For an m-dimensional dataset, OCCER converts the OCC problem into m regression problems. Each regression problem generates one regression model. Each regression model produces an outlier score to a data point. m outlier scores are combined to produce an outlier score for the data point. Firstly, we will present the motivation of the OCCER algorithm.

D is a training dataset consisting of n data points. Each data point is defined by m features , and all data points belong to the target class. The rules of the target class are defined by using the following expression

Classifiers like decision trees can learn these rules if the dataset has more than one class, however, with one class dataset, this is not possible. OCCER algorithm uses the information that all data points belong to the target class to covert a OCC problem into m regression problems. The rules to represent the class are defined by m features, and these features are related to each other. OCCER uses regression models to learn these relations. One of the m features is taken as the dependent variable, and the other features are taken as independent variables, and a regression model is trained on it. The regression function is presented as follows:

Since m features exist, m different regression models can be trained. For a given data point, each regression model predicts one of the feature (which is used as the output feature) value. The predicted values and actual values are used to compute the outlier score (O) of the data point. The average of the absolute difference between the actual values and the predicted value for all the m features is taken as the outlier score of the point.

It is expected that data points which do not belong to the target class have larger outlier scores. The relationships between the features for the outlier data points will be different from the relationships between the features for the data points that belong to the target class (for which regression models are trained). Regression models will be less accurate for an outlier data point. It will lead to larger outlier scores for points that do not belong to the target class. Nota et al. [14] propose to use accuracy of the prediction in computing outlier score. However, in this paper, we are using the Eq. 1 as the outlier score.

We explain the mechanism of the OCCER algorithm using the example dataset presented in Table 1. The dataset has points belonging to a target class, and has four features: . Four regression models will be created from the dataset, and each regression model uses one of

TABLE 1 An example dataset.

the features as the dependent variable and all other three features as independent variables. Note that different features may have different ranges, so it is important to bring them on the same scale. Data normalisation is a necessary preprocessing step of OCCER.

These functions compute the outlier score for a given data point. For example, the outlier score of data point 1 can be computed using Eq. 3,

The OCEM algorithm is presented as Algorithm 1.

4 EXPERIMENTS

We conducted experiments by using the scikit-learn python package (https://scikit-learn.org/stable/) and PyOD (a Python toolbox for scalable outlier detection) [24]. As OCCER can work with any regression method, experiments were carried out using the following: Ridge regression, Lasso regression, Elastic net regression [13] and Random Forests regression [25]. In this paper, we named the OCCER with a regression method as OCCER(Regression method). scikit-learn’s regression APIs are used for these regression methods. Different standard OCC algorithms, Isolation Forests (IF), One-class SVM (OCSVM), LOF and autoencoders, were used for the comparative study. PyOD was used for these methods. The default parameters for these methods given in PyOD were used in the experiments. -fold cross-validation was used for the experiments. Strat-ified k-fold was implemented using scikit-learn to ensure that the folds were made by preserving the percentage of the samples for each class. Only the target class points in the training data was used to train the OCC algorithms. Z-normalisation was used to normalise the data. As classi-fication accuracy is not a correct performance matrix due to the highly-imbalanced testing data, we used the average

area under the curve (AUC) for the receiver operating characteristics(ROC) curve as it is generally used to measure the performance of the OCC algorithms [10], [11].

4.1 Standard datasets and domain datasets

Various kinds of datasets were used in the experiments [26]– [31], Some datasets are created as imbalanced datasets [26], [27]. Information on these datasets is presented in Table 2. The domain datasets [28]–[31] belong to two different domains datasets: normal activity-fall activity datasets and software engineering-related datasets. The domain datasets [28]–[31] are naturally imbalanced datasets. Mobilfall data [28] was collected using Samsung Galaxy S3 mobile employing the integrated 3D accelerometer and gyroscope. The data has two classes normal activity and fall activity. We used the data collected from 11 subjects who performed various normal and fall activities. We grouped the German Aerospace Centre (DLR) data [29] into; normal activity and fall activity, and only used the data from the accelerometer and gyroscope. Only data from those subjects who performed both the activities were used. Coventry dataset (COV) [30] also has two classes normal activity and fall activity, and the complete information of these domain datasets is presented in detail in [5]. We also carried out experiments on software engineering realted datasets [31], the information of which is presented in Table 3.

The results (average AUC) for datasets presented in Table 2 are provided in Table 4, which suggest that out of 29 datasets, OCCER algorithms performed best for 18 datasets. and among the OCCER algorithms, the OCCER(RF) algorithm performed best. OCCER(RF) algorithm performed best for 10 datasets. The results (average AUC) for domain datasets (presented in Table 3) are provided in Table 5. The OCCER(RF) algorithm performed best for six datasets, whereas other non-OCC algorithms were best for four datasets, suggesting that OCCER(RF) was the best among the OCCER algorithms.

The Friedman post-hoc test with the Bergmann and Hommels correction [32], [33] was used to compare the OCCER(RF) algorithm with four standard OCC algorithms for all the datasets presented in Tables 2 and 3 (39 datasets). In this test, the p-values for each pair of classifier are computed and then corrected by using the Bergmann and Hommel’s procedure. Two classifiers are significantly different, if the p value is less than 0.05. The results are presented in Table 6, which indicate that OCCER(RF) is statistically better than IF, OCSVM, and autoencoders, whereas no statistically significant difference exists between OCCER(RF) and LOF. However, LOF does not exhibit any statistically significant improvement on OCSVM and IF. This analysis reveals that the OCCER(RF) algorithm is the best among these algorithms.

4.2 MNIST image datasets

In OCCER algorithms, we will have to train m regression models. For some datasets, such as image datasets the number of features can be very large. We conducted experiments to study the performance of OCCER algorithms in a new feature space. We used the MNIST dataset [16] for this study, which contains images of 2828 greyscale handwritten digits. Autoencoders with five hidden layers (512, 128, 64, 128, 512) were used to create latent representation of images, these latent features were used as new features. The outputs from the nodes of the middle layer were used as the new features. Hence, 64 features were created for each image. Datasets were created by selecting data points for a digit which act as target class data points and the data points for other digits were used as outlier data points. For example, mnist-0 data contains 2000 randomly selected images of digit 0 (target class), whereas 25 images of each of the other digits (225 images) are outlier data points. Ten datasets, one for each digit as the target class were created. A setup similar to that discussed in the previous section was used in the experiment. We also used the autoencoders, which were utilised to create latent representations as OCC models for the experiments, and we named them Autoencoder(I). The results (average AUC) are presented in Table 7, which suggest that OCCER algorithms performed best for seven out of ten datasets. Autoencoder(I) performed best for two datasets, whereas LOF performed best for one dataset. This is an interesting result as OCCER algorithms were trained on the latent representation of the images whereas Autoencoder(I) was trained on the image data. This indicates that OCCER algorithms can also be used for the datasets in a new feature space.

4.3 Effect of the ensemble size

In the experiments, the size of the ensembles for OCCER is equal to the number of features. In other words, each feature is used as the output in one of the regression models. It is possible that some of the regression models may not be very accurate. Hence, the question arises whether

TABLE 2 Information on the datasets that were taken from [26], [27]. The datasets presented before the separating line in the Table is taken from [26] whereas the datasets presented after the separating line is taken from [27]

OCCER can perform better with fewer number of accurate regression models. We conducted experiments to investigate this issue and selected the most accurate regression models to compute the outlier scores. The root-mean-square error for the training dataset was used to determine the accuracy of a regression model. We conducted experiments with the top 25%, 50%, and 75% most accurate models of the total regression models.

The results for OCCER(ridge), OCCER(lasso), OCCER(elastic) and OCCER(RF) are presented in Table 7, which show that there is no consistent performance advantage for any ensemble size. For example, for OCCER(RF) using all the regression models generated the best results for six out of ten datasets. For example, for OCCER(RF), using all the regression models generated the best results for 6 out of 10 datasets. For two datasets, the top 75% most accurate regression models gave the best results, whereas for the remaining two datasets, the top 50% most accurate regression models gave the best performance. Similar results were observed with other datasets presented in Table 1. On the basis of our observations, we can suggest that it is better to use all the regression models in computing the outlier scores. However, it will be interesting to investigate the best ensemble size for OCCER.

TABLE 3 Information on the domain datasets.

TABLE 4 Average AUC of various OCC algorithms against the OCCER algorithm with different regression methods on various datasets [26], [27] presented in Table 2. Bold numbers indicate the best performance.

TABLE 5 Average AUC of various OCC algorithms against the OCCER algorithm with different regression techniques on various domain datasets presented in Table 3. Bold numbers indicate the best performance.

TABLE 6 Statistical comparison between different pairs of OCC algorithms. Bold numbers indicate the statistically significant difference (p < 0.05) between these two OCC algorithms.

5 CONCLUSION

OCC is a challenging task due to the absence of the outlier class data points in the training dataset. In this paper, we extensively studied OCCER. OCCER creates m regression models that are used to find outlier scores. As these m regression models are trained only using data belonging to the target class, these models generate more outlier scores for outlier data points. Experiments were conducted to show the effectiveness of the OCCER, and OCCER performed better than or similar to other OCC methods. OCCER can work well with the latent features created by autoencoders, as shown by the MNIST image data.

OCCER suggests that regression methods can be used for OCC, and combines the results of m regression models by averaging them. The other combination techniques will be investigated in the future. The combination of OCCER with other ensemble approaches, such as bagging [34] (to create different training datasets), is another future research direction. In this paper, we investigated the performance of OCCER for image datasets by using the latent features created by autoencoders. Random projections and principal component analysis are some of the approaches to create new features. We will study the performance of OCCER in the feature space created by random projections and principal component analysis.

REFERENCES

[1] D. M. J. Tax, “One-class classification: Concept learning in the absence of counter-examples,” PhD thesis, Technische Universiteit Delft, 2001.

[2] S. S. Khan and M. G. Madden, “One-class classification: taxonomy of study and review of techniques,” Knowledge Eng. Review, vol. 29, no. 3, pp. 345–374, 2014. [Online]. Available: https://doi.org/10.1017/S026988891300043X

[3] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “Review: A review of novelty detection,” Signal Process., vol. 99, pp. 215–249, Jun. 2014. [Online]. Available: http://dx.doi.org/10.1016/j.sigpro.2013.12.026

[4] C. Desir, S. Bernard, C. Petitjean, and L. Heutte, “One class random forests,” Pattern Recognition, vol. 46, no. 12, pp. 3490 – 3506, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S003132031300246X

[5] S. S. Khan and A. Ahmad, “Relationship between variants of one-class nearest neighbors and creating their accurate ensembles,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 9, pp. 1796–1809, 2018. [Online]. Available: https://doi.org/10.1109/TKDE.2018.2806975

[6] D. M. J. Tax and R. P. W. Duin, “Support vector domain description,” Pattern Recognition Letters, vol. 20, pp. 1191–1199, 1999.

[8] P. Casale, O. Pujol, and P. Radeva, “Approximate polytope ensemble for one-class classification,” Pattern Recognition, vol. 47, no. 2, pp. 854 – 864, 2014.

[9] T. G.Dietterich, “Ensemble Methods in Machine Learning,” in Proc. of Conf. Multiple Classifier Systems, vol. 1857, 2000, pp. 1–15.

[10] F. T. Liu, k. M. Ting, and Z. Zhou, “Isolation forest,” in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ser. ICDM ’08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 413–422. [Online]. Available: http://dx.doi.org/10.1109/ICDM.2008.17

[11] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga, Outlier Detection with Autoencoder Ensembles, 2017, pp. 90–98. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611974973.11

[12] B. Krawczyk, M. Wozniak, and B. Cyganek, “Clustering-based ensembles for one-class classification,” Information Sciences, vol. 264, pp. 182 – 195, 2014, serious Games. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025513008694

[13] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.

[14] K. Noto, C. Brodley, and D. Slonim, “Anomaly detection using an ensemble of feature models,” in 2010 IEEE International Conference on Data Mining, Dec 2010, pp. 953–958.

[15] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml

[16] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/

[17] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surv., vol. 41, no. 3, pp. 15:1–15:58, Jul. 2009. [Online]. Available: http://doi.acm.org/10.1145/1541880.1541882

[18] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander, “Lof: Identify- ing density-based local outliers,” SIGMOD Rec., vol. 29, no. 2, pp. 93–104, May 2000.

[19] S. R. Arashloo and J. Kittler, “One-class kernel spectral regression for outlier detection,” CoRR, vol. abs/1807.01085, 2018. [Online]. Available: http://arxiv.org/abs/1807.01085

[20] Q. Leng, H. Qi, J. Miao, W. Zhu, and G. Su, “One-class classifi- cation with extreme learning machine,” Mathematical Problems in Engineering, vol. 2015, no. 412957, 2015.

[21] A. Lazarevic and V. Kumar, “Feature bagging for outlier detection,” in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ser. KDD ’05. New York, NY, USA: ACM, 2005, pp. 157–166. [Online]. Available: http://doi.acm.org/10.1145/1081870.1081891

[22] A. Zimek, R. J. G. B. Campello, and J. Sander, “Data perturbation for outlier detection ensembles,” in Proceedings of the 26th International Conference on Scientific and Statistical Database Management, ser. SSDBM ’14. New York, NY, USA: ACM, 2014, pp. 13:1–13:12. [Online]. Available: http://doi.acm.org/10.1145/2618243.2618257

[23] S. S. Khan and B. Taati, “Detecting unseen falls from wearable devices using channel-wise ensemble of autoencoders,” Expert Syst. Appl., vol. 87, no. C, pp. 280–290, Nov. 2017. [Online]. Available: https://doi.org/10.1016/j.eswa.2017.06.011

[24] Y. Zhao, Z. Nasrullah, and L. Zheng, “Pyod: A python toolbox for scalable outlier detection,” Journal of Machine Learning Research, vol. 20, no. 96, pp. 1–7, 2019.

[25] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, Oct 2001.

[26] J. Alcal-Fdez, A. Fernndez, J. Luengo, J. Derrac, and S. Garca, “Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework.” MultipleValued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.

[27] M. Goldstein and S. Uchida, “A comparative evaluation of un- supervised anomaly detection algorithms for multivariate data,” PLOS ONE, vol. 11, no. 4, pp. 1–31, 04 2016.

[28] G. Vavoulas, M. Pediaditis, E. G. Spanakis, and M. Tsiknakis, “The mobifall dataset: An initial evaluation of fall detection algorithms using smartphones,” in 13th IEEE International Conference on BioInformatics and BioEngineering, Nov 2013, pp. 1–4.

[29] M. J. V. Nadales, “Recognition of human motion related activities from sensors,” Master’s thesis, University of Malaga and German Aerospace Center, 2010.

[30] O. Ojetola, E. Gaura, and J. Brusey, “Data set for fall events and daily activities from inertial sensors,” in Proceedings of the

6th ACM Multimedia Systems Conference, ser. MMSys ’15. New York, NY, USA: ACM, 2015, pp. 243–248. [Online]. Available: http://doi.acm.org/10.1145/2713168.2713198

[31] J. Sayyad Shirabad and T. Menzies, “The PROMISE Repository of Software Engineering Databases.” School of Information Technology and Engineering, University of Ottawa, Canada, 2005. [Online]. Available: http://promise.site.uottawa.ca/SERepository

[32] J. Demˇsar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1248547.1248548

[33] S. Garca and F. Herrera, “An extension on ”statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons,” Journal of Machine Learning Research, vol. 9, pp. 2677–2694, 2009.

[34] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, Aug 1996.

Designed for Accessibility and to further Open Science