b

DiscoverSearch
About
My stuff
Outlier Detection Ensemble with Embedded Feature Selection
2020·arXiv
Abstract
Abstract

Feature selection places an important role in improving the performance of outlier detection, especially for noisy data. Existing methods usually perform feature selection and outlier scoring separately, which would select feature subsets that may not optimally serve for outlier detection, leading to unsatisfying performance. In this paper, we propose an outlier detection ensemble framework with embedded feature selection (ODEFS), to address this issue. Specifically, for each random sub-sampling based learning component, ODEFS unifies feature selection and outlier detection into a pairwise ranking formulation to learn feature subsets that are tailored for the outlier detection method. Moreover, we adopt the thresholded self-paced learning to simultaneously optimize feature selection and example selection, which is helpful to improve the reliability of the training set. After that, we design an alternate algorithm with proved convergence to solve the resultant optimization problem. In addition, we analyze the generalization error bound of the proposed framework, which provides theoretical guarantee on the method and insightful practical guidance. Comprehensive experimental results on 12 real-world datasets from diverse domains validate the superiority of the proposed ODEFS.

Outlier detection has been intensively studied and widely used in various applications, such as medical diagnosis (Wang et al. 2019a), fraud detection (Wang et al. 2018; Wang et al. 2013), and information security (Kang et al. 2019a), to name just a few. In such real-world applications, it is not uncommon to see that there are many irrelevant or redundant features among data when performing outlier detection (Kang et al. 2019b; Liu and Motoda 2007). It has been shown that the performance of outlier detection can be significantly improved by only using the informative feature subsets (Pang et al. 2018a; Keller, Muller, and Bohm 2012). Therefore, feature/subspace selection, which can help to remove noisy features, places an important role in improving the performance of outlier detection, especially for noisy data. Feature selection based outlier detection methods select relevant feature subsets for the subsequent outlier detection

method, with the aim to alleviate the negative effect brought by noisy features. Many works on this regard have been developed. Early attempts often separate the feature searching from the subsequent outlier scoring methods (Dang et al. 2014; Keller, Muller, and Bohm 2012; Noto, Brodley, and Slonim 2012; Lazarevic and Kumar 2005). Consequently, they may retain features that do not optimally serve for the outlier scoring method and the performance of the subsequent outlier detection may be sufficiently biased. The recent work in (Pang et al. 2018a) involves the outlier scoring methods when searching the relevant feature subset. It builds the sequential ensembles to refine feature selection and outlier scoring by iterative sparse modeling with outlier scores as the pseudo target feature. Though demonstrating promising performance, we observe that the used outlier detector and feature selection method are still performed in an iterative manner, which may lead to a suboptimal solution. A question naturally raised is that can we take a step forward to integrate outlier detection and feature selection into a joint framework?

To address the aforementioned issue, this paper introduces a novel outlier detection ensemble framework with embedded feature selection, termed ODEFS. Specifically, ODEFS uses a given outlier scoring method to compute initial outlier scores of data objects, and then defines an outlier thresholding function to identify a set of outlier candidates. Considering that diverse outliers may have different discriminative feature subsets (Wang et al. 2019b; Wang and Li 2006), ODEFS builds an ensemble framework to obtain multiple feature subsets by bagging. It randomly chooses examples from both outlier candidates and the unlabeled objects. They are fed into the objective function which embeds feature selection into outlier detection to learn customized feature subsets for such outlier scoring methods. The pairwise ranking loss is adopted in the objective to encourage outliers having higher ranks than the inliers.

Notice that outlier thresholding function may produce unreliable outlier candidates since the initial outlier scores are computed using all the original features. To improve the reliability of the training set, we propose to adopt thresholded self-paced learning to simultaneously implement example selection and feature weighting. It selects ”easy” examples, i.e., the ones with small loss values which are more likely to be outliers, as the training set. After that, we design an alternate optimization algorithm with proved convergence to obtain reliable and informative feature subsets. Finally, ODEFS applies the same given outlier detector to the data with the selected feature subsets in a weighted aggregating manner to produce a reliable outlier scoring.

The pairwise ranking loss function involves a large number of interactive terms between the outlier examples and the unlabeled examples, leading to a high computation complexity. To reduce the number of interactive terms, we theoretically analyze the generalization error bound of the proposed framework, which provides valuable insights into relationships between some important parameters and the detection performance. Those insights lead to some useful practical guidance. For example, we find that the improvement on the error bound is quite limited by increasing the number of unlabeled examples when it is more than the number of outlier examples. Based on this finding, we can significantly reduce the time complexity by including a moderate number of examples without decreasing the detection performance.

The main contributions of this paper are three folds.

• We introduce the ODEFS framework for identifying outliers in noise data. Different from existing methods that separate feature selection from subsequent outlier detectors, ODEFS unifies the two tasks in a joint formulation.

• We derive a thresholded self-paced learning algorithm to eliminate the negative effect of the unreliable outlier candidates. To solve the resultant optimization problem, we design an alternate algorithm and prove its convergence.

• We theoretically analyze the generalization error bound of the proposed framework, which provides valuable insight into the theoretical performance of the method and helps to reduce the computation complexity.

The ODEFS framework is instantiated on one state-of-the-art distance-based method LeSiNN (Pang, Kai, and Al- brecht 2016). It is also worthy of mentioning that the proposed framework can be easily extended to other formulations. Extensive empirical results on 12 real-world data sets show that ODEFS (i) reduces a large proportion of features and improves the performance of the original bare method; (ii) performs substantially better and more stably than the state-of-the-art competitors; (iii) has much better resilience to noisy features than its competitors; (iv) has linear time complexity w.r.t. data size and feature size.

Outlier Detection in Noisy Data

Subspace-based methods (Aggarwal and Philip 2005; M¨uller, Schiffer, and Seidl 2011; Keller, Muller, and Bohm 2012; Dang et al. 2014) are popular solutions for outlier detection in noisy data. They search for a set of feature subspaces and use them in an ensemble framework to avoid the negative effect of noise features, but the subspace searching is often costly as it requires extensive search in identifying the feature subspaces in high-dimensional data. Random subspaces generation is a widely used solution to address this efficiency issue (Lazarevic and Kumar 2005; Nguyen, Ang, and Gopalkrishnan 2010), but it may include many noisy features into subspaces.

Alternatively, feature selection-based methods aim to identify optimal feature subset(s) that reveals the exceptional behaviors of outliers. Although feature selection has been well investigated in clustering and classification (Xu, Wang, and Lai 2016; Li et al. 2018; Nie, Zhu, and Li 2016), there exists limited work on outlier detection because it is challenging to define feature relevance to outlier detection given its unsupervised nature. RegFS in (Paulheim and Meusel 2015; Noto, Brodley, and Slonim 2012) defines the relevance of features by their correlation to the other features. The assumption is that independent features are not useful in capturing the violation in outliers. This assumption may be invalid since some features can be strongly relevant to outlier detection but not correlated to other features. CINFO in (Pang et al. 2018a) firstly generates outlier scores via a given outlier detector, and then feed the scores into sparse learning based supervised feature selection to choose relevant features. These two steps are iteratively performed to build a sequential ensemble outlier detection framework. Such an iterative manner may result in feature subset(s) that are suboptimal to the outlier detectors.

Most of the above methods generate multiple feature subsets and work in an ensemble framework. They combine the results calculated on these feature subsets to obtain a reliable detection result. In recent years, there are also some other outlier ensemble learning works that construct a set of independent base models (Sugiyama and Borgwardt 2013; Zhang et al. 2017; Rayana, Zhong, and Akoglu 2016). Since they work on the full feature space, their performance is still largely biased by noisy features. It is also worth noting that there are some successful works on joint feature selection and outlier detection for categorical data (Pang et al. 2017; Pang et al. 2016). Using popular unsupervised discretization methods like equal-width and equal-frequency to adopt these methods to numeric data perform poorly (Pang et al. 2018a). We therefore focus on comparing ODEFS with numeric data-based methods in our experiments. There are also some works on representation learning for outlier detection (Pang et al. 2018b). Although feature selection can be viewed as an approximate of representation learning, the method is customized for a given outlier detection method only. Thus it is not added to the competitors.

Self-paced Learning

Self-paced learning (Kumar, Packer, and Koller 2010) is motivated by the procedure of human learning: from easy to hard. In machine learning problems, the value of loss function is used to measure ”easiness”. How easy examples should be used for training is controlled by a threshold  λ. Formally, given training examples {(x1, y1), (x2, y2), . . . , (xn, yn)}and learning model f, the self-paced learning problem is:

image

where  v = [v1, v2, . . . , vn]are binary parameters that denote the weights of examples, w is the learning parameters, and  −λviis called self-paced regularization term.

image

Figure 1: The framework of ODEFS. ODEFS builds a par- allel ensemble framework which consists of l feature learning components. It defines an outlier thresholding function to identify a set of outlier candidates  X⋆(size  n⋆) based on the outlier scores computed by s. For each feature learning component, ODEFS randomly chooses m unlabeled examples ( ˆX = {ˆx1, ˆx2, . . . , ˆxm}) from X and  m⋆outlier examples ( ˆX⋆ = {ˆx⋆1, ˆx⋆2, . . . , ˆx⋆m⋆}) from  X⋆. These ex- amples are fed into a pairwise ranking formulation that embeds feature selection into outlier detection. In the training process, thresholded self-paced learning is proposed to simultaneously learn example weights v and feature weights w. With the l groups of scores computed on the obtained feature subsets, ODEFS finally performs a weighted aggregating based on the learning loss to obtain the final outlier scores.

When v is given, the minimization over w is a weighted loss minimization problem. And when w is fixed, the optimal  viis determined by the closed form:

image

where  Liis the loss of  xiand  λincreases at each iteration by step  δ. All examples will be added into the training set at the end of training when  λis large enough.

Kumar et al. (Kumar, Packer, and Koller 2010) demonstrate that self-paced learning algorithm outperforms the state-of-the-art methods for learning a latent structural SVM on several applications. To the best of our knowledge, self-paced learning has not been used in unsupervised outlier detection yet. Maybe this is because it is hard to define the loss and set the hyper-parameters of self-paced learning for unsupervised outlier detection. We fill this gap by introducing a variant version of self-paced learning into outlier detection to select reliable outlier examples.

We consider outlier detection problems defined over a set of n data objects  X = {x1, x2, . . . , xn}, where each data object is described as a d-dimensional real-valued vector xi = {xi1, xi2, . . . , xid}. There is an unknown partition that divides X into a set of outliers  X+ = {x+1 , x+2 , . . . , x+n+}and a set of inliers  X− = {x−1 , x−2 , . . . , x−n−}, so that X = X+ ∪ X−. n+and  n−are the number of outliers and inliers, respectively.  π = n+/nis the outlier percentage of the data set. Outlier detector  s(·) : xi → Rassigns outlier scores to objects in X to yield an overall outlier ranking, with the goal of having the outliers to be higher ranked than the inliers. We will assume that the associated outlier detector has a particular form that the feature weights can be embedded into the scoring function:  s(x) → s(x, w).

The framework of ODEFS is illustrated in Figure 1, ODEFS tries to encourage  s(x⋆, w)to be larger than s(x−, w). Since the feature selection is guided by s, the chosen features only attain the information that is the most important to distinguish outliers from inliers. The feature subsets obtained by ODEFS are therefore tailored for the outlier detector. It is a chicken-and-egg problem because we do not know which part of the outlier candidates are true outliers at the beginning. Fortunately, we can tackle this problem by iteratively selecting reliable examples according to the learning loss value in thresholded self-paced learning. This learning strategy enables ODEFS to get more reliable discriminative feature subsets. We detail the key steps of ODEFS in the following.

Outlier Thresholding with Cantelli’s Inequality

The outlier thresholding function is to identify a set of most likely outliers. We adopt the outlier thresholding function proposed in (Pang et al. 2018a; Pang et al. 2018b) which is based on Cantelli’s inequality to obtain the outlier candidates:

image

where  µand  σ2are the average value and variance of all the initial outlier scores computed by s with all features, and a ≥ 0is user-defined thresholding rate based on a desired false positive bound.

The reason we adopt Eq. (3) as initial outlier threshold is two folds: (i) it provides an upper bound which can be used to study the theoretical performances of the proposed model (see our theoretical foundation); (ii) it is simple but useful as shown in the experiments.

Pairwise Ranking Loss for Outlier Detection with Embedded Feature Selection

Outlier scoring method s yields an overall outlier ranking, with the goal of having the outliers to be higher ranked than the inliers. It tries to maximize:

image

where  φis an indicator function that returns 1 if the condition satisfies and 0 otherwise.

Since ODEFS is an unsupervised framework, it is impossible to directly obtain labels for both outliers and inliers. Inspired by (Ren et al. 2018), we propose a relaxed pairwise

ranking:

image

here the whole unlabeled object set X is used instead of  X−. The above equation indicates that  J′(s)depends on J(s) linearly. That is, maximizing  J′(s)essentially maximizes J(s). We therefore consider  J′(s)in the objective function rather than J(s). To approximate  X+, we use outlier candidates set  X⋆in the objective function. Further, as ODEFS is a random sub-sampling based ensemble framework, each feature learning component feeds the randomly chosen examples (i.e., ˆXand ˆX⋆) into the objective function:

image

To put feature selection in the objective function, we embed feature weights  w = {wi}di=1in which  widenotes the weight of ith feature and add sparsity constraints (i.e., l1−norm) on them. Since the indicator function  φ(·)is not continuous, the common treatment is to use convex and continuous surrogate function h to approximate it. There are several loss functions that can be used here, such as sigmoid loss, hinge loss and logistic loss function. Without loss of generality, here, we focus on the logistic loss, which is de-fined as:  h(x) = 11+exp(−x). Then we get the new objective function:

image

where  Lw(ˆx⋆i ) = 1m�mj=1 11+exp(s(ˆx⋆i ,w)−s(ˆxj,w))is the loss of  ˆx⋆i,  θ = 10−4is a small constant.

Thresholded Self-paced Learning

Since the scores in the outlier thresholding are calculated using all the features, the outlier candidates may be not reliable. Here we use proposed thresholded self-paced learning to select the most confident examples. We combine (1) into (7) to get the final objective:

image

where  v = [v1, v2, . . . , vm⋆]⊤are weights of training examples, and  λis the age parameter which controls the number of selected examples.

In traditional self-paced learning, there are two hyper-parameters: the age parameter  λfor controlling the learning pace and step size  δfor increasing  λ. λincreases by a step  δevery several iterations. All examples should be added to the training set at the end of training when  λis large enough. But in our problem, not all the outlier candidates are reliable. According to the definition of the loss function, the examples with lower losses value are more likely to be true outliers than the ones with higher losses. Therefore we shall prevent the self-paced learning from selecting examples with high loss values, even at the end of the training.

We propose to constrain  λaccording to the statistics of losses during the training:

image

where  Lwt−1denotes losses for all examples in the  (t−1)th iteration,  µ(·)and  σ2(·)are average value and variance of losses.

λnow is thresholded by the changing losses of examples (i.e.,  λ ≤ maxt µ(Lwt) + σ(Lwt)), while it keeps a nondecreasing trend as the traditional version. Examples with high loss are filtered by this setting. Besides, this setting also ensures that at least half example are fed into the training process according to the Cantelli’s inequality.

Final Outlier Scoring Using a single component may produce high detection errors when diverse outliers may have diverse discriminative feature subsets. We therefore further aggregate a set of sub-sampling based detection results to address this issue.

With the l groups of sub-samples involved in the training, we obtain a set of l feature weight vectors {w1, w2, . . . , wl}, and their associated loss {loss1, loss2, . . . , lossl}as defined in Eq. (8). We follow the literature (Nie, Zhu, and Li 2016; Guo and Zhu 2018) to select the features with weights larger than a given threshold:

image

where  ϵ = 0.05is a small constant, and  max(wj)denotes the maximum value in  wj.

After that, we borrow the idea of boosting (Freund and Schapire 1997) to combine the outlier score vectors with associated loss as weights, and define the final outlier score for each data object in the ensemble as follows:

image

where  uiis a normalized weight by  ui = exp(−lossi)�lj=1 exp(−lossi), s(x, F i)denotes outlier scoring with  F i, and  τ(s(x, F i)) =

�nj=1 s(xj,F i)is a vector normalization function that normal- izes the vector into an unit norm to address the heterogeneity of the outlier scores from heterogeneous feature subsets.

Optimization and Convergence Analysis

There are two parameters in Eq. (8), w, v, corresponding to feature learning and reliable examples selection, respectively. Motivated by the literatures (Kumar, Packer, and Koller 2010; Wang et al. 2017; Wang and Ma 2014), we use an alternative search strategy to optimize v and w.

Update v with w fixed With w fixed, we first compute λby Eq. (9). Then the optimal v can be easily obtained in closed form:

image

Update w with v fixed With v fixed, Eq. (8) is simplified to

image

which is consistent with  l1-norm optimization problem. Thus we can resort to SGD.

image

the literature (Kumar, Packer, and Koller 2010), we obtained an estimate of  w0by initially setting  vi = 1for all outlier examples. Then it goes to the normal 1st iteration.

Convergence analysis In  t−th(t ≥ 1) iteration, we first update  λto get  Lt1:

image

thus  Lt1 < Lt−1as  λt ≥ λt−1.

image

we have:

image

According the computation of v in Eq. (12): (i) if  vti = 1, we have  vti − vt−1i ≥ 0and  Lwt−1(ˆx⋆i ) − λt < 0; (ii) if vti = 0, we have  vti − vt−1i ≤ 0and  Lwt−1(ˆx⋆i ) − λt ≥ 0. Thus,  Lt2 − Lt1 ≤ 0.Lastly, when we update the feature weights w, we have a closed form solution. The objective function is guaranteed to decrease, that is  Lt < Lt2. Then we have  Lt < Lt2 ≤ Lt1 ≤Lt−1. And  L = 1m⋆�m⋆i=1(viLw(ˆx⋆i ) − λvi) + θl1(w) >−λ ≥ −(maxt µ(Lwt) + σ(Lwt)) ≥ −2, thus the convergence of the optimization is proved.

Time Complexity Analysis

The whole algorithm is summarized in Algorithm 1. In each learning component, the main computation cost involves the alternative optimization process in which optimization of w is the most complex part. Therefore we focus on calculating the complexity of optimizing w. The pairwise ranking loss function involves a huge number of interactive terms between outlier examples and unlabeled examples. Specifi-cally, the computation complexity of l learning components can be represented as  O(lmm⋆d). Fortunately, from the proposed Theorem 1 in the following section, we can see that when the number of outlier examples (i.e.,  m⋆) is fixed, the marginal gain by including more unlabeled examples (i.e., increasing m) is decreasing. Thus we set  m = 6m⋆according to the theoretical analysis and empirical validation. Then the total time complexity is  O(l(m⋆)2d). m⋆is a given parameter, the ensemble size  l = 2⌈ n⋆m⋆ ⌉tries to involve at least equal number of outlier candidates into training, so the overall time complexity can also be represented as  O(dn⋆). Since  n⋆is linear to n, the proposed ODEFS is linear w.r.t. data size and feature size.

image

In this section, we study the theoretical performances of the proposed ODEFS, which provides practical guidance of parameters setting.

Theorem 1. Assume that all data objects in X are i.i.d. samples, with probability at least  1 − δwe get the upper error bound of ODEFS with:

image

where ˆJ⋆, E( ˆJ⋆)and  E(J′)are respectively the adopted empirical loss, thresholding based expected loss, and ideal expected loss, and  κm′is defined as  κm′ = d′ log(dm′/d′) +log 1δin which  d′is the number of selected features.

Proof. To give a detailed proof, we first give a brief introduction of two lemmas that come from the references. lemma 1. The outlier thresholding function  ϕ(s, x) =s(x) − µ − aσhas a false positive upper bound of 1(1+a2)(Pang et al. 2018b).

The above lemma is a variant of Cantelli’s inequality, which implies that the probability that we could wrongly identify inliers as outliers is up to 11+a2when we define the threshold as  µ + aσ. lemma 2. Assume that all data objects in X are i.i.d. samples, with probability at least  1 − δwe have (Ren et al. 2018):

image

where �BAUCand BAUC are defined as:

image

here  D+and D are respectively the distribution of outliers and the whole dataset. And  κn′is defined as

image

where  d′is the number of selected features. Based on these two lemmas, we give below that ODEFS can obtain an upper error bound for its learning process as follows. We can take the expectation of ˆJ⋆from its definition

image

where ˆD⋆and ˆDare respectively the distribution of outlier examples and unlabeled examples.

Since the objects in X are i.i.d. samples, then the objects in ˆXand ˆX⋆are also i.i.d. samples. Based on lemma 1, we can easily have

image

holds with probability at least  1 − δ.

Then we come to the second inequation in the theorem. As ˆXand ˆX⋆are respectively randomly sampled from X and  X⋆, we have ˆD = Dand ˆD⋆ = D⋆. Thus

image

where  D⋆and D denote the distributions for outlier candidates and the whole dataset, respectively.

Suppose the outlier candidate set is composed of outlier set  X⋆+and inlier set  X⋆−, that is,  X⋆ = X⋆+ ∪ X⋆−. n⋆+and n⋆−are respectively the number of outliers and inliers in  X⋆.

image

where  D+and  D−denote the distributions for the outliers and inliers, respectively.  D⋆+and  D⋆−denote the distribu- tions for the outliers and inliers in outlier candidates, respectively.  p = n⋆+/n⋆is the outlier percentage in  X⋆. The term Ex⋆−∈D⋆−Ex−∈D−φ(s(x⋆−) ≥ s(x−))is a constant, because the probability that a randomly chosen inlier is ranked higher than another randomly chosen inlier should always be 12. So we have

image

the probability that a chosen outlier is ranked higher than a chosen unlabeled object should be larger than 12; (ii)  0 ≤∆ ≤ 12, this is because the probability that a chosen inlier is ranked higher than a chosen outlier should be smaller than

2. Based on lemma 1, we have  a2/(1 + a2) ≤ p ≤ 1. Thus we get:

image

and

image

The proof is completed.

This theorem provides the upper bound of the difference between ˆJ⋆and the ideal expected  E(J′). It contains three parameters: the thresholding rate a, the number of sampled outliers  m⋆and the number of sampled unlabeled objects m. This leads to the following interesting observations:

(i) On one hand,  E( ˆJ⋆)gets close to  E(J′)when a increases; on the other hand, increasing a will decrease  n⋆, thus the number of outlier candidates is limited;

(ii) When  m⋆is fixed, the improvement on this bound by increasing m is quite limited. Thus it is not necessary to set a large value for m.

According to these observations, we therefore set the parameters a = 2 (i.e.,  E( ˆJ⋆) > 0.8E(J′)) and  m = 6m⋆in our experiments. The experimental results in the following section have demonstrated the effectiveness of the parameters settings.

Experiment Setup

Application to Distance-based Outlier Detection There are a number of outlier detectors, where different criterions are used in different algorithms. Here we choose one state-of-the-art distance-based outlier detector LeSiNN (Pang, Kai, and Albrecht 2016) as representative to motivate our model. It is worthy mentioning that we also get similar results on another tree-based outlier detector iForest (Liu, Ting, and Zhou 2012).

Table 1: The details of 12 used datasets, the performance of bare LeSiNN (denoted by AUC, p@k) and ODEFS-enabled LeSiNN (denoted by AUC’, p@k’). d’ is the average number of features retained by ODEFS-enabled LeSiNN.

image

Given a dataset X of vector-valued objects, LeSiNN analyzes X to construct a set of random subsets. The outlier score of an object x is assigned as the average value of its nearest distances to the subsets.

image

where c is the number of subsets and  nn disti(x)returns the nearest neighbor distance of x to the ith subset.

In particular, the weight of each feature in the distance calculation is 1 (constant). It is easy to embed the feature weight into the distance calculation. Take Squared Euclidean Distance as an example, the weighted distance is: dist(xi, xj, w) = �dk=1 wk ∗ (xik − xjk)2. Thus, we can get the weighted version of LeSiNN by substituting this formulation into Eq. (28).

Datasets As shown in Table 1, 12 real-world datasets are used, which cover diverse domains, i.e., medical, social, security and nature1. They are described with four data factors, i.e., n - the number of objects, d - the number of features, on - the number of outliers and or - the outlier percentage. Some datasets like AD, AID362, Probe, and U2R contain semantically real outliers. For the other datasets, we follow the literature (Pang et al. 2018b; Paulheim and Meusel 2015) to treat rare classes as outliers and the largest class as the normal class.

Parameters setting ODEFS and its competitors are implemented in Python 3.4. All the experiments are executed at a PC in a 3.6GHz CPU with 16GB memory. In our experiments, ODEFS uses  m⋆ = 32for small datasets (i.e., n ≤ 104) and  m⋆ = 64for large datasets (i.e.,  n > 104). Other parameters setting, i.e.,  a = 2, m = 6m⋆, l = 2⌈ n⋆m⋆ ⌉, has been explained in the above sections. The parameters of LeSiNN are set as the recommended settings.

Evaluation Methods Following the literature (Campos et al. 2016; Cheng, Wang, and Ma 2019), we evaluate the outlier detection performance by AUC and p@k. Their values range from 0 to 1 and higher value indicates better feature subset. The Wilcoxon signed rank test is used to examine the significance of the performance of ODEFS against its competitors. We repeat each experiment 20 times and average the results to get a convincing evaluation.

Empirical Validation of Theorem 1

Experimental setting This section conducts empirical experiments to study how the number of unlabeled examples affects AUC. We follow the literature (Zimek, Schubert, and Kriegel 2012) to create a 100-dimensional synthetic dataset of size 10000 with 20 relevant features. Inliers are from a Gaussian distribution N(1, 0.2), while outliers are from another Gaussian distribution N(1.2, 0.2) in relevant features, and the other features are from a same Gaussian distribution N(1, 0.2) and used as noisy features. The number of outlier examples is fixed as  m⋆ = 32. We gradually increase the number of unlabeled examples from  m = m⋆to  m = 12m⋆by a step  m⋆.

image

Figure 2: The trends of AUC and runtime regarding different sizes of the unlabeled examples. After  m/m⋆is larger than 6, AUC stays stable.

Results According to the experimental results shown in Figure 2, it is observed that: It indicates that when the number of unlabeled examples is more than 6 times of the number of outlier examples, the improvement on AUC becomes quite minor. In the meanwhile, the runtime grows linearly w.r.t the size of the examples. This observation is consistent

Table 2: The performance of ODEFS and its competitors. The best results are in bold. AVG is the averaged performance of a method over all datasets. p-value of Wilcoxon signed rank test is reported in the bottom.

image

with our analysis in Theorem 1. It essentially suggests that it is not necessary to include all the unlabeled examples in the training process when the unlabeled data points are substantially more than the outlier examples.

Improvement to The Bare Method

Experimental Setting We compare the ODEFS-enabled LeSiNN with its bare version to evaluate whether ODEFS can remove noisy features and improve the performance.

Results Table 1 shows the feature reduction and detection performance of ODEFS-enabled LeSiNN, compared to LeSiNN performing in the original feature space. ODEFSenabled LeSiNN works with only 5% (e.g., on Advertisements) to less than 50% (e.g., on Optdigits) of the original features, while its performance is substantially better than, or roughly the same as, its bare version. ODEFS enables LeSiNN to gain more than 7% and 16% improvement on average in terms of AUC and p@k, respectively. Our significance test shows that ODEFS enables LeSiNN to achieve significantly better performance at the 95% confidence level.

ODEFS embeds feature learning into outlier scoring by a joint framework, which enables ODEFS to safely remove noisy features in these datasets. As a result, ODEFSenabled LeSiNN works on much cleaner datasets and thus can achieve significant performance improvement.

Comparing to State-of-the-art Methods

Experimental Settings Four state-of-the-art methods are used as competitors, they are: RandFS (Lazarevic and Ku- mar 2005), RegFS (Paulheim and Meusel 2015), DisFS (Dang et al. 2014), and CINFO (Pang et al. 2018a) from four different but relevant researches. Besides, one variant of ODEFS (named ODEFS∗) is also evaluated to show the contribution of self-paced learning.

• RandFS: RandFS is a random subspace-based method in which the features are randomly selected;

• RegFS: RegFS is a relevance analysis based method which returns a feature relevance ranking and selects the top-ranked features;

• DisFS: DisFS is chosen as a representative subspace-based algorithm that uncovers outliers in subspaces of reduced dimensionality in which they are well discriminated from regular objects;

• CINFO: CINFO is a representative of feature-selection method that performs lasso-based sparse regression by treating the outlier scores as the targets to obtain feature subsets. It iteratively refines their performance by sequential ensemble;

• ODEFS∗: ODEFS∗is a variant of ODEFS in which self-paced learning is removed. It feeds all the outlier examples into the training process.

Results Table 2 shows the performance of LeSiNN with each method over all datasets. According to the experimental results, ODEFS gets the best performance on ten and eleven of twelve datasets in terms of AUC and p@k respectively, while the performance on the other ones is close to the best. ODEFS averagely performs better than four competitors RandFS, RegFS, DisFS, CINFO and ODEFS∗by 4% − 7%in term of AUC. And in term of p@k, the average improvements are  9% − 15%. The small p-values show that the improvement is significant at a high confidence level (i.e., 95%).

Different from RandFS, RegFS and DisFS that ignore the outlier scoring methods when they perform feature selection, ODEFS couples these two tasks in a joint formulation. This enables ODEFS to substantially reduce its detection errors and obtain significant AUC improvement, especially in noisy datasets like Advertisements, aPascal, and Census, which likely contain a large proportion of noisy features. Since there is no thresholded self-paced learning in ODEFS∗, the quality of the training set may be reduced, leading to worse detection performance.

ODEFS and CINFO are two different feature selection based methods. CINFO iteratively performs lasso-based sparse regression by treating the outlier scores as the target and the original features as the predictors on the outlier candidates to obtain a few feature subsets. The outlier scoring method and lasso-based feature selection are still in a sequential order. And it adopts all the outlier candidates which may contain inliers, leading to a poor feature learning performance. In contrast, ODEFS works in a unified framework, which embeds feature selection into outlier scoring method. Besides, thresholded self-paced learning is proposed to improve the quality of the training set. Thus ODEFS has better performance than CINFO.

Capability of Handling Noisy Features

Experimental settings We create a few 100-dimensional synthetic datasets with different percentages of relevant features (or noisy features). Inliers are from a Gaussian distribution N(1, 0.2), while outliers are from another Gaussian distribution N(1.2, 0.2) in relevant features, and the other features are from a same Gaussian distribution N(1, 0.2) and used as noisy features.

image

Figure 3: Detection performance on datasets with differ- ent levels of relevant features. ODEFS persistently performs better than its competitors. All the methods obtain AUC of nearly one with more than 35% relevant features.

Results The performance on the synthetic datasets is shown in Figure 3. ODEFS-enabled LeSiNN performs consistently better than five other versions in a wide range of noise levels. The better performance of the ODEFS-enabled LeSiNN over the competitors shows its stronger capability of handling noisy features.

Scalability Experimental settings We follow the literature (Pang et al. 2018a) to generate datasets by varying the feature size w.r.t. to a fixed data size (i.e., 1000), as well as varying the data size while fixing the feature size (i.e., 100), respectively.

image

Figure 4: Scalability test w.r.t. data size and feature size on ODEFS and its competitors.

Results The runtime of the six versions of LeSiNN is shown in Figure 4. ODEFS has time complexity linear w.r.t. both data size and feature size, which justifies our complexity analysis. In the left sub-figure, ODEFS is comparably fast to RegFS, CINFO and DisFS. These four methods are slower than RandFS and the bare LeSiNN, since they incorporate more sophisticated components to enhance the performance of LeSiNN. In the right sub-figure, RegFS is the slowest one since it has quadratic complexity while the other methods have linear time complexity. The runtime of subspace-based method DisFS grows very fast because the subspace searching is often costly in high-dimensional data.

In this paper, we propose an outlier detection ensemble framework, called ODEFS, which directly embeds feature selection into outlier detection. We propose thresholded self-paced learning to improve the reliability of the training set and design an alternative algorithm to address the optimization problem. Experimental results on various datasets demonstrate the effectiveness of ODEFS.

This work was supported by the National Key Research and Development Program of China (Grant No.2016YFB1000101), the National Natural Science Foundation of China (Grant No.61379052, No.61872371 and No.61872377), the Science Foundation of Ministry of Education of China (Grant No.2018A02002), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No.14JJ1026).

[Aggarwal and Philip 2005] Aggarwal, C. C., and Philip, S. Y. 2005. An effective and efficient algorithm for high-dimensional outlier detection. The VLDB journal 14(2):211–221.

[Campos et al. 2016] Campos, G. O.; Zimek, A.; Sander, J.; Campello, R. J.; Micenkov´a, B.; Schubert, E.; Assent, I.; and Houle, M. E. 2016. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30(4):891–927.

[Cheng, Wang, and Ma 2019] Cheng, L.; Wang, Y.; and Ma, X. 2019. A neural probabilistic outlier detection method for categorical data. Neurocomputing.

[Dang et al. 2014] Dang, X. H.; Assent, I.; Ng, R. T.; Zimek, A.; and Schubert, E. 2014. Discriminative features for identifying and interpreting outliers. In ICDE, 88–99. IEEE.

[Freund and Schapire 1997] Freund, Y., and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119–139.

[Guo and Zhu 2018] Guo, J., and Zhu, W. 2018. Dependence guided unsupervised feature selection. In AAAI.

[Kang et al. 2019a] Kang, Z.; Lu, Y.; Su, Y.; Li, C.; and Xu, Z. 2019a. Similarity learning via kernel preserving embedding. arXiv preprint arXiv:1903.04235.

[Kang et al. 2019b] Kang, Z.; Pan, H.; Hoi, S. C.; and Xu, Z. 2019b. Robust graph learning from noisy data. IEEE Transactions on Cybernetics.

[Keller, Muller, and Bohm 2012] Keller, F.; Muller, E.; and Bohm, K. 2012. Hics: high contrast subspaces for densitybased outlier ranking. In ICDE, 1037–1048. IEEE.

[Kumar, Packer, and Koller 2010] Kumar, M. P.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In NIPS, 1189–1197.

[Lazarevic and Kumar 2005] Lazarevic, A., and Kumar, V. 2005. Feature bagging for outlier detection. In SIGKDD, 157–166. ACM.

[Li et al. 2018] Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R. P.; Tang, J.; and Liu, H. 2018. Feature selection: A data perspective. ACM Computing Surveys (CSUR) 50(6):94.

[Liu and Motoda 2007] Liu, H., and Motoda, H. 2007. Computational methods of feature selection. CRC Press.

[Liu, Ting, and Zhou 2012] Liu, F. T.; Ting, K. M.; and Zhou, Z.-H. 2012. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data 6(1):3.

[M¨uller, Schiffer, and Seidl 2011] M¨uller, E.; Schiffer, M.; and Seidl, T. 2011. Statistical selection of relevant subspace projections for outlier ranking. In ICDE, 434–445. IEEE.

[Nguyen, Ang, and Gopalkrishnan 2010] Nguyen, H. V.; Ang, H. H.; and Gopalkrishnan, V. 2010. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In DASFAA, 368–383. Springer.

[Nie, Zhu, and Li 2016] Nie, F.; Zhu, W.; and Li, X. 2016. Unsupervised feature selection with structured graph optimization. In AAAI.

[Noto, Brodley, and Slonim 2012] Noto, K.; Brodley, C.; and Slonim, D. 2012. Frac: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery 25(1):109–133.

[Pang et al. 2016] Pang, G.; Cao, L.; Chen, L.; and Liu, H. 2016. Unsupervised feature selection for outlier detection by modelling hierarchical value-feature couplings. In ICDM, 410–419. IEEE.

[Pang et al. 2017] Pang, G.; Cao, L.; Chen, L.; and Liu, H. 2017. Learning homophily couplings from non-iid data for joint feature selection and noise-resilient outlier detection. In IJCAI, 2585–2591.

[Pang et al. 2018a] Pang, G.; Cao, L.; Chen, L.; Lian, D.; and Liu, H. 2018a. Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data. In AAAI.

[Pang et al. 2018b] Pang, G.; Cao, L.; Chen, L.; and Liu, H. 2018b. Learning representations of ultrahigh-dimensional data for random distance-based outlier detection. In SIGKDD, 2041–2050. ACM.

[Pang, Kai, and Albrecht 2016] Pang, G.; Kai, M. T.; and Al- brecht, D. 2016. Lesinn: Detecting anomalies by identifying least similar nearest neighbours. In ICDMW.

[Paulheim and Meusel 2015] Paulheim, H., and Meusel, R. 2015. A decomposition of the outlier detection problem into a set of supervised learning problems. Machine Learning 100(2-3):509–531.

[Rayana, Zhong, and Akoglu 2016] Rayana, S.; Zhong, W.; and Akoglu, L. 2016. Sequential ensemble learning for outlier detection: A bias-variance perspective. In ICDM, 1167– 1172. IEEE.

[Ren et al. 2018] Ren, K.; Yang, H.; Zhao, Y.; Chen, W.; Xue, M.; Miao, H.; Huang, S.; and Liu, J. 2018. A robust auc maximization framework with simultaneous outlier detection and feature selection for positive-unlabeled classi-fication. IEEE Transactions on Neural Networks and Learning Systems.

[Sugiyama and Borgwardt 2013] Sugiyama, M., and Borg- wardt, K. 2013. Rapid distance-based outlier detection via sampling. In NIPS, 467–475.

[Wang and Li 2006] Wang, Y., and Li, S. 2006. Research and performance evaluation of data replication technology in distributed storage systems. Computers & Mathematics with Applications 51(11):1625–1632.

[Wang and Ma 2014] Wang, Y., and Ma, X. 2014. A gen- eral scalable and elastic content-based publish/subscribe service. IEEE Transactions on Parallel and Distributed Systems 26(8):2100–2113.

[Wang et al. 2013] Wang, Y.; Li, X.; Li, X.; and Wang, Y. 2013. A survey of queries over uncertain data. Knowledge and Information Systems 37(3):485–530.

[Wang et al. 2017] Wang, Y.; Pei, X.; Ma, X.; and Xu, F. 2017. Ta-update: An adaptive update scheme with treestructured transmission in erasure-coded storage systems. IEEE Transactions on Parallel and Distributed Systems 29(8):1893–1906.

[Wang et al. 2018] Wang, C.; Deng, Z.; Lai, J.; and Philip, S. Y. 2018. Serendipitous recommendation in e-commerce using innovator-based collaborative filtering. IEEE Transactions on Cybernetics 49(7):2678–2692.

[Wang et al. 2019a] Wang, H.; Yang, Y.; Liu, B.; and Fujita, H. 2019a. A study of graph-based system for multi-view clustering. Knowledge-Based Systems 163:1009–1019.

[Wang et al. 2019b] Wang, H.; Yang, Y.; Zhang, X.; and Peng, B. 2019b. Parallel multi-view concept clustering in distributed computing. Neural Computing and Applications 1–11.

[Xu, Wang, and Lai 2016] Xu, Y.; Wang, C.; and Lai, J. 2016. Weighted multi-view clustering with feature selection. Pattern Recognition 53:25–35.

[Zhang et al. 2017] Zhang, X.; Dou, W.; He, Q.; Zhou, R.; Leckie, C.; Kotagiri, R.; and Salcic, Z. 2017. Lshiforest: A generic framework for fast tree isolation based ensemble anomaly analysis. In ICDE.

[Zimek, Schubert, and Kriegel 2012] Zimek, A.; Schubert, E.; and Kriegel, H.-P. 2012. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal 5(5):363–387.


Designed for Accessibility and to further Open Science