Outlier Detection and Data Clustering via Innovation Search

2019·Arxiv

Abstract

Abstract

The idea of Innovation Search was proposed as a data clustering method in which the directions of innovation were utilized to compute the adjacency matrix and it was shown that Innovation Pursuit can notably outperform the self representation based subspace clustering methods. In this paper, we present a new discovery that the directions of innovation can be used to design a provable and strong robust (to outlier) PCA method. The proposed approach, dubbed iSearch, uses the direction search optimization problem to compute an optimal direction corresponding to each data point. An outlier by definition is a data point which does not participate in forming a low dimensional structure with a large number of data points in the data. In other word, an outlier carries innovation with respect to most of the other data points. iSearch utilizes the directions of innovation to measure the innovation of the data points and it identifies the outliers as the most innovative data points. Analytical performance guarantees are derived for the proposed robust PCA method under different models for the distribution of the outliers including randomly distributed outliers, clustered outliers, and linearly dependent outliers. In addition, we study the problem of outlier detection in a union of subspaces and it is shown that iSearch provably recovers the span of the inliers when the inliers lie in a union of subspaces. Moreover, we present theoretical studies which show that the proposed measure of innovation remains stable in the presence of noise and the performance of iSearch is robust to noisy data. In the challenging scenarios in which the outliers are close to each other or they are close to the span of the inliers, iSearch is shown to remarkably outperform most of the existing methods. The presented method shows that the directions of innovation [36], [37] are useful representation of the data which can be used to perform both data clustering and outlier detection.

Index Terms—Robust PCA, Outlier Detection, Convex Optimization, Innovation Search, Data Clustering

I. INTRODUCTION

Principal Component Analysis (PCA) has been extensively used to reduce dimensionality by finding linear projections of high-dimensional data into lower dimensional subspaces. PCA finds an r-dimensional linear subspace by solving

where is the given data, is an orthonormal basis for the r-dimensional subspace, I is the identity matrix, denotes the Frobenius norm, is the number of data points, and is the dimensionality of the ambient space. While PCA is useful when the data has low intrinsic dimension, it is sensitive to outliers in the sense that the solution to (1) can arbitrarily deviate from the true underlying subspace if a small portion of the data is not contained in this low-dimensional subspace.

In addition to the sensitivity of PCA to outliers, detecting the outlying data points is also an important research problem in unsupervised machine learning. Outliers are associated with important rare events such as malignant tissues [17], the failures of a system [11], [13], [38], web attacks [19], and misclassified data points [10], [35]. In this paper, the proposed outlier detection method is introduced as a robust Principal Component Analysis (PCA) algorithm, i.e., the inliers lie in a low dimensional subspace.

In the literature of robust PCA, two main models for the data corruption are considered: the element-wise data corruption model and the column-wise data corruption model. These two models are corresponding to two completely different robust PCA problems. In the element-wise model, it is assumed that a small subset of the elements of the data matrix are corrupted and the support of the corrupted elements is random. Thus, the corrupted elements are not concentrated in any column/row of the data. This problem is known as the low rank plus sparse matrix decomposition problem [1], [3], [29], [30]. In the column-wise model, a subset of the columns of the data are affected by the data corruption [4], [8], [9], [12], [21], [26], [31], [32], [43]–[46]. We solely focus on the column-wise data corruption model. Specifically, it is assumed that the given data matrix D follows Data Model 1.

Data Model 1. The matrix D can be expressed as D = [B A] T , where is an arbitrary permutation matrix, and [B A] represents the concatenation of matrices A and B. The columns of A lie in an r-dimensional subspace U. The columns of B do not lie entirely in U, i.e., the columns of A are the inliers and the columns of B are the outliers. The orthonormal matrix is a basis for U.

The output of a robust PCA method is a basis for U. If U is estimated accurately, the outliers can be located using a simple subspace projection [28].

Summary of Contributions: The main contributions can be summarized as follows.

• iSearch introduces a new idea to the robust PCA problem. iSearch utilizes the directions of innovation to measure the Innovation of the data points. It is shown that the pro-

posed approach mostly outperforms the exiting methods in handling close outliers and noisy data.

• The proposed approach and the CoP method presented in [35], to the best of our knowledge, are the only robust PCA methods which are supported with analytical performance guarantees under different models for the distributions of the outliers including the randomly distributed outliers, the clustered outliers, and the linearly dependent outliers.

• In addition to considering several models for the distribution of the outliers, we provide analytical performance guarantees under different models for the distribution of the inliers too. The presumed models include the union of subspaces and the uniformly at random distribution on where denotes the unit -norm sphere in . Moreover, the stability of iSearch in the presence of additive noisy is studied and it is proved that iSearch is robust to noisy data.

A. Notation

Given a matrix denotes its spectral norm. For a vector denotes its -norm and a(i) its element. Given two matrices and with an equal number of rows, the matrix is the matrix formed by concatenating their columns. For a matrix denotes its column. The subspace is the complement of U. The cardinality of set I is defined as |I|. Also, for any positive integer n, the index set {1, ..., n} is denoted [n]. The coherence between vector a and subspace H with orthonormal basis H is defined as .

II. RELATED WORKS

In this section, we review some of the important related works in robust (to outlier) PCA and the Innovation Pursuit method. We refer the reader to [22], [35] for a more comprehensive review on the topic. Some of the earliest approaches to robust PCA relied on robust estimation of the data covariance matrix, such as the minimum covariance determinant, the minimum volume ellipsoid, and the StahelDonoho estimator [16]. They mostly compute a full SVD or eigenvalue decomposition in each iteration and generally have no explicit performance guarantees. The performance of these approaches greatly degrades when . To enhance the robustness to the presence of the outliers, another approach is to replace the Frobenius Norm in the cost function of PCA with -norm [18], [25] because -norm were shown to be robust to the presence of the outliers [2], [18]. However, -norm does not leverage the column-wise structure of the corruption matrix. The authors of [6] modified the -norm minimization problem used in [18] and replaced it with an -norm minimization problem. In [24] and [46], the optimization problem used in [6] was relaxed to two different convex optimization problems. The authors of [24] / [46] provided sufficient conditions under which the optimal point of the convex optimization problem proposed in [24] / [46] is guaranteed to be equal to a projection matrix whose column-space is equal to . The approach presented in [41] focused on the scenario in which the data is predominantly unstructured outliers and the number of outliers is larger than . In [41], it is essential to assume that the outliers are randomly distributed on and the inliers are distributed randomly on the intersection of and U. In [43], a convex optimization problem was proposed which decomposes the data into a low rank component and a column sparse component. The approach presented in [43] is provable but it requires to be significantly smaller than . The outlier detection method proposed in [39] assumes that the outliers are randomly distributed on and a small number of them are not linearly dependent which means that [39] is not able to detect the linearly dependent outliers and the outliers which are close to each other.

A. Review of Innovation Pursuit for Data Clustering

In contrast to most of the spectral clustering based subspace clustering methods which utilize self-representation to compute the affinity matrix, the authors of [36], [37] proposed Innovation Pursuit which utilized the directions of innovation to compute the affinity matrix. A convex optimization problem was introduced which finds an optimal direction corresponding to each data point. The Direction of Innovation corresponding to data point d is closely aligned with d but it has the minimum projection on the other data points. In [36], [37], it was shown that Innovation Pursuit can significantly outperform the self representation based methods (e.g. Sparse Subspace Clustering [7]) specifically in the scenarios in which the subspaces intersect. In this paper, we present a new discovery that the Directions of Innovation can also be utilized to devise a novel, strong, and provable robust PCA method. The table of Algorithm 1 shows the procedure to use the directions of innovation for both data clustering and outlier detection.

B. Connection and Contrast to Coherence Pursuit

In [35], the Coherence Pursuit (CoP) method was proposed as an outlier detection method. CoP computes the Coherence Values for all the data points to rank the data points. The Coherence value corresponding to data column d is a measure of resemblance between d and the rest of the data columns. CoP uses the inner product between d and the rest of the data points to measure the resemblance between d and the rest of data. In sharp contrast, iSearch leverages the directions of innovation to measure the innovation of each data point and ranks the data points accordingly. We show through theoretical studies and numerical experiments that finding the optimal directions makes iSearch notably stronger than CoP in detecting outliers which carry weak innovation.

III. PROPOSED APPROACH

Algorithm 1 presents the proposed outlier detection method along with the definition of the used symbols. In order to show the connection of proposed robust PCA method with the Innovation Search based clustering method [36], [37], Table 1 demonstrates both algorithms. iSearch consists of three steps. In the next subsections, Step 2 and Step 3 are discussed. In this

paper, we use an ADMM solver to solve (2). The computation complexity of the solver is . If PCA is used in the prepossessing step to reduce the dimensionality of the data to , the computation complexity of the solver is .

A. An Illustrative Example for Innovation Value

A synthetic numerical example is presented to explain the main idea behind iSearch. Assume , and suppose that D follows Assumption 1.

Assumption 1. The columns of A are drawn uniformly at random from . The columns of B are drawn uniformly at random from . To simplify the exposition and notation, it is assumed without loss of generality that T in Data Model 1 is the identity matrix, i.e, D = [B A].

Define d as a column of D and define as the optimal point of

and define the Innovation Value corresponding to d as . The main idea of iSearch is that shows two completely different behaviours with respect to U (when d is an outlier and when d is an inlier). First assume that d is an outlier. The optimization problem (3) searches for a direction whose projection on d is non-zero and it has the minimum projection on the rest of the data points. Since d is an outlier, d has a non-zero projection on . In addition, since is large, (3) searches for a direction in the ambient whose projection on U is as weak as possible. Therefore, lies in or it is close to . The left plot of Fig. 1 shows when d is an outlier. In this case, is orthogonal to all the inliers. Accordingly, when d is an outliers, is approximately equal to . On the other hand, when d is an inlier, the linear constraint strongly discourages to lie in or to be close to . Inliers lie in a low dimensional subspace and mostly they are close to each other. Since has a strong projection on d, it has strong projections on many of the inliers. Accordingly, the value of is much larger when d is an inlier. Therefore, the Innovation Value corresponding to an inlier is smaller than the Innovation Value corresponding to an outlier because is much larger when d is an inliers. Fig. 1 compares the vector when d is an outliers with the same vector when d is an inlier. In addition, it shows the vector of Innovation Values (right plot). One can observe that the Innovation Values make the outliers clearly distinguishable.

Fig. 1. In this example, the first 50 columns are outliers. The upper left panel shows vector is an outlier. The upper right panel depicts is an inlier. The bottom left plot shows the Innovation Values corresponding to all the data points (vector x was defined in Algorithm 1). The bottom right plot also shows the Innovation Values when the inliers are noisy (denotes the additive noise).

B. Building the Basis Matrix

The data points corresponding to the least Innovation Values are used to construct the basis matrix Y. If the data follows Assumption 1, the r data points corresponding to the r smallest Innovation Values span U with overwhelming probability [42]. In practise, the algorithm should continue adding new columns to Y until the columns of Y spans an r-dimensional subspace. This approach requires to check the singular values of Y several times. We propose two techniques to avoid this extra steps. The first approach is based on the side information that we mostly have about the data. In many applications, we can have an upper-bound on because outliers are mostly associated with rare events. If we know that the number of outliers is less than y percent of the data, matrix Y can be constructed using percent of the data columns which are corresponding to the least Innovation Values. The second approach is the adaptive column sampling method proposed in [35]. The adaptive column sampling method avoids sampling redundant columns.

C. Robustness to noise

The performance of iSearch is robust to the presence of noise because even if the inliers are noisy, the optimal direction of (3) corresponding to an outlier remains incoherent with U. The reason is that if the optimal direction is not incoherent with U and the inliers are not dominated by noise, is large because the inliers lie in a low dimensional subspace and they are close to each other. In Section IV-E, we provide a guarantee for the performance of iSearch with noisy data, an upper-bound for is established, and it is shown that

the value of the upper-bound is proportional to. The right plot in Fig. 1 shows the innovation values for the example described in Section III-A. In this plot, the inliers are noisy, D = [B (A + E)] where E denotes the additive noise, and . One can observe that Innovation Values clearly distinguishes the outliers from the inliers.

IV. THEORETICAL STUDIES

In this section, we analyze the performance of the proposed approach with three different models for the distribution of the outliers: unstructured outliers, clustered outliers, and linearly dependent outliers. Moreover, we analyze iSearch with two different models for the distribution of the inliers. These models include the union of subspaces and uniformly at random distribution on . In addition, the performance of iSearch in the presence of noise is studied and it is shown that the direction of innovation is a reliable tool when the data is noisy. The theoretical results are followed by short discussions which highlight the important aspects of the theorems and a short description of the proof. The proofs of all the presented theoretical results are included in Appendix.

A. Randomly Distributed Outliers

In this part, it is assumed that D follows Assumption 1. In order to guarantee the performance of the proposed approach, it is enough to show that the Innovation Values corresponding to the outliers are greater than the Innovation Values corresponding to the inliers. In other word, it suffices to show

Before we state the theorem, let us provide the following definitions and remarks.

Definition 1. Define

define , and where 0} and is the column of B. The value is the number of outliers which are orthogonal to .

Remark 1. In Assumption 1, the outliers are randomly distributed. Thus, if is significantly larger than is significantly smaller than with overwhelming probability.

Theorem 1. Suppose D follows Assumption 1. Define A =

A >

then (4) holds and U is recovered exactly with probability at least where

Proof Sketch: In order to prove Theorem 1, we leveraged this key feature of the direction of innovation: when d is an outlier, is highly incoherent with U. Specifically, Theorem 1 is proved in two steps. First, we derived the sufficient conditions to guarantee that all the optimal directions correspond to the outliers, , lie in . The exact orthogonality is not a necessarily condition but it simplifies the analysis. In fact, the first two sufficient conditions in (5) guarantee it with high probability. Second, the fact that are orthogonal to U was leveraged to derive the final conditions to guarantee that (4) holds with high probability.

Theorem 1 indicates that when is sufficiently larger than , the proposed approach is guaranteed to detect the randomly distributed outliers exactly. It is important to note that in (5), is scaled with 1/r but is scaled with . It means that if r is sufficiently smaller than , iSearch provably detects the unstructured outliers even if is much larger than . The numerical experiments presented in Section V confirms this feature of iSearch.

B. Structured Outliers

This section provides the analysis of iSearch with structured outliers. In contrast to the unstructured outliers, structured outliers can form a low dimensional structure different from the structure of the majority of the data points. Structured outliers are associated with important rare events such as malignant tissues [17] or web attacks [19]. In this section, we assume that the outliers form a cluster outside of U. The following assumption specifies the presumed model for the distribution of the structured outliers.

Assumption 2. A column of B is formed as . The unit -norm vector q does not lie in are drawn uniformly at random from , and is a positive number.

According to Assumption 2, the outliers cluster around vector q where . In Algorithm 1, if the dimensionality reduction step is performed, the direction search optimization problem is applied to . Thus, (3) is equivalent to

where and . The subspace Q is the column-space of D. In this section, we are interested in studying the performance of iSearch in identifying tightly clustered outliers because some of the existing outlier detection algorithms fail if the outliers form a tight cluster. For instance, the thresholding based method [14] and the sparse representation based algorithm [39] fail when the outliers are close to each other. Therefore, we assume that the span of Q is approximately equal to the column-space of [U q]. The following Theorem shows that even if the outliers are close to each other, iSearch successfully identifies the outliers provided that is sufficiently larger than .

Theorem 2. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-2. Assume that Q is equal to the column-space of [U q]. Define , define , define as the optimal point of (6) with , and assume that . In addition, Define A =

then (4) holds and U is recovered exactly with probability at least .

Proof Sketch: Similar to the proof of Theorem 1, first we derive the sufficient conditions to guarantee that the optimal directions corresponding to the outliers are orthogonal to U (the exact orthogonality is not a necessarily condition but it simplifies the analysis). Subsequently, we leverage the structure of the optimal directions to establish an upper-bound/lower-bound for when d is an outliers/inlier.

In sharp contrast to (5), in (7) is not scaled with . Theorem 2 indicates that in contrast to the unstructured outliers, the number of the structured outliers should be sufficiently smaller than the number of the inliers for the small values of . This is consistent with our intuition regarding the detection of structured outliers. If the columns of B are highly structured and most of the data points are outliers, it violates the definition of outlier to label the columns of B as outliers. The presence of parameter emphasizes that the closer the outliers are to U, the harder it is to distinguish them. In Section V, it is shown that iSearch significantly outperforms the existing methods when the outliers are close to U. The main reason is that even if an outlier is close to U, its corresponding optimal direction obtained by (3) is highly incoherent with U.

C. Linearly Dependent Outliers

In some applications, the outliers are linearly dependent. For instance, in [10], it was shown that a robust PCA algorithm can be used to reduce the clustering error of a subspace segmentation method. In this application, a small subset of the outliers can be linearly dependent. The following assumption specifies the presumed model for matrix B.

Assumption 3. Define subspace with dimension such that and . The outliers are randomly distributed on . The orthonormal matrix is a basis for .

The following theorem establishes the sufficient conditions to guarantee the performance of iSearch with linearly dependent outliers. The procedure to proof the following theorem is similar to the previous theorems. The main difference was the techniques used to bound the value of .

Theorem 3. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-3. Define

then (4) holds and U is recovered exactly with probability at least where

Theorem 3 indicates that should be sufficiently larger than . If is comparable to r, it is in fact a necessary condition because we can not label the columns of B as outliers if is also comparable with . If is large, the sufficient condition is similar to the sufficient conditions of Theorem 1 in which the outliers are distributed randomly on . It is also informative to compare the requirements of iSearch with the requirements of CoP. With iSearch, should be sufficiently larger than to guarantee that the algorithm distinguishes the outliers successfully. With CoP, should be sufficiently larger than [10], [35]. The reason that CoP requires a stronger condition is that iSearch finds a direction for each outlier which is highly incoherent with U.

D. Outlier Detection When the Inliers are Clustered

In the analysis of the robust PCA methods, mostly it is assumed that the inliers are randomly distributed in U. In practise the inliers form several clusters in U. In this section, it is assumed that the inliers form m clusters. The following assumption specifies the presumed model and Theorems 4 provides the sufficient conditions.

Assumption 4. The matrix of inliers can be written as A = where , , and is an arbitrary permutation matrix. The columns of are drawn uniformly at random from the intersection of subspace and where is a d-dimensional subspace. In other word, the columns of A lie in a union of subspaces and where denotes the direct sum operator.

Theorem 4. Suppose the distribution of the outliers/inliers follows Assumption-1/Assumption-4. Define

If the sufficient conditions in (5) are satisfied, then the inequality (4) holds and U is recovered exactly with probability at least .

Proof sketch: The difference between Theorem 4 and Theorem 1 is in their presumed model for the distribution of the inliers. The procedure used to prove Theorem 4 is similar to the procedure used in the proof of Theorem 1 but it required new techniques to bound .

In Assumption 4, the dimensions of the subspaces are equal and the distribution of the inliers inside these subspace are similar. Therefore, we can roughly say g = [24]. Thus, the sufficient conditions indicate that the population of the smallest cluster scaled by should be sufficiently larger than . The parameter

introduced in [24]. It shows how well the inliers are distributed in U. Evidently, if the inliers populate all the directions inside U, a subspace recovery algorithm is more likely to recover U correctly. However, having a large value of permeance statistic is not a necessary condition. The reason that permeance statistic appears in the sufficient conditions is that we establish the sufficient conditions to guarantee the performance of iSearch in the worst case scenarios. In fact, if the inliers are close to each other or the subspaces are close to each other, generally the performance of iSearch improves because the more inliers are close to each other, the smaller their Innovation Values are.

E. Noisy Data

The key feature of the proposed approach is that when d is an outlier, is highly incoherent with U. In the proof of the presented theorems, we derived sufficient conditions to guarantee that is orthogonal to U. It is not a necessarily condition but it simplified the analysis. When the inliers are noisy, we can not find a direction which is orthogonal to all of them because they do not lie in a low dimensional subspace. In this section, first we focus on showing that even if the data is noisy, the direction of innovation corresponding to an outlier remains incoherent with U. In the following theoretical results, it is assumed that the data follows Assumption 5. According to Assumption 5, each inlier is added with a random direction and each data column has an expected squared norm equal to 1.

Assumption 5. The matrix D can be expressed as D = The matrix represents the presence of noise and it can be written as where the columns of are drawn uniformly at random from and is a positive number which controls the power of the added noise.

Lemma 5. Suppose D follows Assumption 5 and assume that A and B follow Assumption 1. Define , and

where R is an orthonormal basis for . Then

with probability at least .

Lemma 5 establishes a upper-bound for . In practise, is much smaller than the value of the upper-bound because in the proof we consider the worst case scenarios. Lemma 5 indicates that the direction of innovation corresponding to an outlier stays incoherent with U provided that is sufficiently large. The following theorem provides the sufficient conditions to guarantee the performance of iSearch with noisy data.

Theorem 6. Suppose D follows Assumption 5 and assume that A and B follow Assumption 1. Define and . If

then (4) holds with probability at least .

In Theorem 6, the outliers are unstructured and its suffi-cient conditions are similar to the sufficient conditions of Theorem 1. Roughly, Theorem 6 states that if is sufficiently larger than , Innovation Values can reliably be used to distinguish the outliers. For clustered outliers and linearly dependent outliers, similar guarantees can be established.

V. NUMERICAL EXPERIMENTS

In this section, a set of experiments with synthetic data and real data are presented to study the performance and the properties of the iSearch algorithm. In the presented experiments, iSearch is compared with the existing methods including FMS [21], GMS [46], CoP [35], OP [43], and R1-PCA [6].

A. Phase Transition

In this experiment, the phase transition of the proposed approach is studied. First, it is shown that if is sufficiently large, iSearch can successfully recover the span of the inliers even in the dominant presence of the unstructured outliers. Define as an orthonormal basis for the recovered subspace. A trial is considered successful if

The data follows Assumption 1 with r = 4 and . Fig. 2 shows the phase transition of iSearch versus and . White indicates correct subspace recovery and black designates incorrect recovery. Theorem 1 indicated that if is sufficiently large, iSearch yields exact recovery even if is larger than . This experiment confirms the theoretical result. According to Fig. 2, even when , 40 inliers are enough to guarantee exact subspace recovery. In the second plot of Fig. 2, the outliers are structured and the distribution of outliers follows Assumption 2 with . One can observe that even when is small, a sufficiently large value of guarantees the performance of iSearch.

Fig. 2. Left plot: the phase transition of iSearch in presence of the un- structured outliers versus . White indicates correct subspace recovery and black designates incorrect recovery. Right plot: phase transition with structured outliers versus . In both plots,

Fig. 3. The probability of accurate subspace recovery versus the number of structured outliers (trial is considered successful iforthonormal basis for the recovered subspace.

B. Structured Outliers

In this experiment, we consider structured outliers. The distribution of the outliers follows Assumption 2 with and . In addition, the inliers are clustered and they lie in a union of 5 2-dimensional linear subspaces. There are 20 data points in each subspace (i.e., ) and r = 10. A successful trial is defined similar to Section V-A. We are interested in investigating the performance of iSearch in identifying structured outliers when they are close to U. Therefore, we generate vector q, the center of the cluster of the outliers, close to U. Vector q is constructed as

where the unit -norm vector is generated as a random direction on and the elements of are sampled independently from N(0, 1). The generated vector q is close to U with high probability because the column-space of [U p] is close to the column-space of U. Fig. 3 shows the probability of accurate subspace recovery versus the number of outliers. The number of evaluation runs was 50. One can observe that in contrast to the unstructured outliers, the robust PCA methods tolerate few number of structured outliers and iSearch exhibits higher robustness to the presence of the structured outliers.

C. Noisy Data

In this section, we consider the simultaneous presence of noise, the structured outliers and the unstructured outliers. In this experiment, , and . The data contains 300 unstructured and 10 structured outliers. The distribution of the structured outliers follow Assumption 2 with . The vector q, the center of the cluster of the structured outliers, is generated as a random direction on . The generated data in this experiment can be expressed as D = [B (A + E)] (the matrix E represents the additive Gaussian noise). Since the data is noisy, the algorithms can not achieve exact subspace recovery. Therefore, we examine the probability that an algorithm distinguishes all the outliers correctly. Define vector such that . A trial is considered successful if

Fig. 4 shows the probability of exact outlier detection versus SNR which is defined as SNR . It shows that iSearch robustly distinguishes the outliers in the strong presence of noise. The number of evaluation runs was 50.

Fig. 4. The probability of exact outlier detection versus SNR. The data contains 10 structured outliers and 300 unstructured outliers (

D. Innovation Value vs Coherence Value

We simulate a scenario in which the outliers are randomly distributed but they are close to the span of the inliers. Suppose , and . The outliers are generated as [U H] G where spans a random 2-dimensional subspace and the elements of are sampled independently from N(0, 1). Fig. 5 compares Innovation Values with Coherence Values. The last 20 columns are the outliers. One can observe that the Coherence Values can not make the outliers distinguishable from the inliers. The main reason is that in contrast to Coherence Pursuit, iSearch finds a direction corresponding to each outlier which is strongly incoherent with U.

Fig. 5. This figure compares the vector of the Coherence Values (right plot) with the vector of the Innovation Values (left plot). The last 20 columns are the outliers.

E. Clustered Inliers

In practice, the inliers are not necessarily distributed uniformly at random and they are mostly close to each other and they form one or multiple clusters. In this experiment, it is shown that in contrast to most of the existing methods, if the inliers form a cluster in U, the performance of iSearch improves. It is assumed that the distribution of inliers follows Assumption 6.

Assumption 6. Each column of matrix A is formed as where . The vector is a unit -norm vector and are drawn uniformly at random from .

‘

Fig. 6. The Log-Recovery Error of the robust PCA methods versus parameter is defined in Assumption 6. Log-Recovery Error is defined as is small, the inliers are close to each other and when it is larger, the distribution of the inliers is similar to the distribution of the inliers in Assumption 1 (i.e., uniformly at random on

The parameter controls how close the inliers are to each other. If goes to infinity, the distribution of the inliers converges to the distribution of the inliers in Assumption 1. In this experiment, , and . The distribution of the outliers follows Assumption 3 with and the dimension of intersection between and U is equal to 5. This robust PCA algorithms are used to obtain a basis for the dominant subspace (the span of the inliers). Fig. 6 shows Log-Recovery-Error defined equal to

versus parameter . The number of evaluation runs was 100. One can observe that the performance of most of the robust PCA methods degrade when is small. The reason is that if the value of increases, the distribution of the inliers becomes closer to the uniformly at random distribution

Fig. 7. Some of the frames of the Waving Tree video file. The highlighted frames are detected as outliers by R1-PCA.

on ). In contrast, Coherence Pursuit and iSearch yield better performance when the inliers form a cluster (the smaller values of ). The reason is that if the inliers become closer to each other, their Coherence-Values/Innovation-Values increase/decrease because their similarities increase and they are less innovative.

F. Detecting Outliers in Real Data

An application of the outlier detection methods is to identify the misclassified data points of a clustering method [10], [35]. In each identified cluster, the misclassified data points can be considered as outliers. In this experiment, we assume an imaginary clustering method whose clustering error is 25 %. The robust PCA method is applied to each cluster to find the misclassified data points. The clustering is re-evaluated after identifying the misclassified data points. Algorithm 2 adapted from [35] shows how a robust PCA algorithm is used to detect the misclassified data points and update the identified clusters. We use the Hopkins155 dataset [40], which contains data matrices with 2 or 3 clusters. In this experiment, 27 matrices with 3 clusters are used (i.e., the columns of each data matrix lie in 3 clusters). The outliers are linearly dependent and they are very close to the span of the inliers since the clusters in the Hopkins155 dataset are close to each other. In addition, the inliers form a tight cluster. Evidently, the robust PCA methods which assume that the outliers are randomly distributed fail in this task. This experiment with real data contains most of the challenges that a robust PCA method can encounter. For more details about this experiment, we refer the reader to [10], [35]. Table I shows the average clustering error after applying the robust PCA methods to the output of the clustering method. One can observe that iSearch significantly outperforms the other methods. The main reason is that iSearch is robust against outliers which are closed to U. In addition, the coherency between the inliers enhances the performance of iSearch.

TABLE I CLUSTERING ERROR AFTER USING THE ROBUST PCA METHODS TO DETECT THE MISCLASSIFIED DATA POINTS.

G. Activity Detection in Real Noisy Data

In this experiment, we use the robust PCA methods to identify a rare event in a video file. We use the Waving Tree video file [27]. In this video, a tree is smoothly waving and in the middle of the video a person crosses the frame. The

frames which only contain the background (the tree and the environment) are inliers and the few frames corresponding to the event, the presence of the person, are the outliers. Since the tree is waving, the inliers are noisy and we use r = 3 for all the methods. In addition, we identify column d as outlier if where is the recovered subspace. In this experiments, the outliers are very similar to each other since the consecutive frames are quite similar to each other. We use iSearch, CoP, FMS, and R1-PCA to detect the outlying frames. iSearch, CoP, and FMS identified all the outlying frames correctly. R1-PCA could not identify those frames in which the person does not move. The reason is that those frames are exactly similar to each other. Fig. 7 shows some of the outlying frames which were missed by R1-PCA.

VI. CONCLUSION

A new robust (to outlier) PCA method, termed iSearch, was proposed which uses a convex optimization problem to measure the innovation of the data points. The proposed approach recovers the span of the inliers using the least innovative data points. It was shown that iSearch can provably recover the span of the inliers with different models for the distribution of the outliers. In addition, analytical performance guarantees for iSearch with clustered inliers were presented. It was shown that finding the optimal directions makes iSearch significantly robust to the outliers which carry weak innovation. Moreover, the experiments with real and synthetic data demonstrated the robustness of the proposed method against the strong presence of noise.

REFERENCES

[1] Emmanuel J Cand`es, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.

[2] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[3] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

[4] Vartan Choulakian. L1-norm projection pursuit principal component analysis. Computational Statistics & Data Analysis, 50(6):1441–1451, 2006.

[5] Philip J Davis. Leonhard euler’s integral: A historical profile of the gamma function: In memoriam: Milton abramowitz. The American Mathematical Monthly, 66(10):849–869, 1959.

[6] Chris Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 281–288, Pittsburgh, PA, 2006.

[7] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.

[8] Jiashi Feng, Huan Xu, and Shuicheng Yan. Robust PCA in highdimension: A deterministic approach. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, UK, 2012.

[9] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[10] Andrew Gitlin, Biaoshuai Tao, Laura Balzano, and John Lipor. Improv- ing k-subspaces via coherence pursuit. IEEE Journal of Selected Topics in Signal Processing, 2018.

[11] Yoshiyuki Harada, Yoriyuki Yamagata, Osamu Mizuno, and Eun-Hye Choi. Log-based anomaly detection of cps using a statistical method. arXiv:1701.03249, 2017.

[12] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust subspace recovery. In The 26th Annual Conference on Learning Theory (COLT), pages 354–375, Princeton, NJ, 2013.

[13] Milos Hauskrecht, Iyad Batal, Michal Valko, Shyam Visweswaran, Gregory F Cooper, and Gilles Clermont. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics, 46(1):47– 55, 2013.

[14] Reinhard Heckel and Helmut B¨olcskei. Robust subspace clustering via thresholding. IEEE Transactions on Information Theory, 61(11):6320– 6342, 2015.

[15] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13– 30, 1963.

[16] Peter J Huber. Robust statistics. Springer, 2011.

[17] Seppo Karrila, Julian Hock Ean Lee, and Greg Tucker-Kellogg. A comparison of methods for data-driven cancer outlier discovery, and an application scheme to semisupervised predictive biomarker discovery. Cancer Informatics, 10:109, 2011.

[18] Qifa Ke and Takeo Kanade. Robust L1 norm factorization in the pres- ence of outliers and missing data by alternative convex programming. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 739–746, San Diego, CA, 2005.

[19] Christopher Kruegel and Giovanni Vigna. Anomaly detection of web- based attacks. In Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS), pages 251–261, Washington, DC, 2003.

[20] Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2001.

[21] Gilad Lerman and Tyler Maunu. Fast, robust and non-convex subspace recovery. Information and Inference: A Journal of the IMA, 7(2):277– 336, 2017.

[22] Gilad Lerman and Tyler Maunu. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380–1410, 2018.

[23] Gilad Lerman, Michael McCoy, Joel A Tropp, and Teng Zhang. Robust computation of linear models, or how to find a needle in a haystack. Technical report, DTIC Document, 2012.

[24] Gilad Lerman, Michael B McCoy, Joel A Tropp, and Teng Zhang. Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.

[25] Gilad Lerman, Teng Zhang, et al. Robust recovery of multiple subspaces by geometric lp minimization. The Annals of Statistics, 39(5):2686– 2715, 2011.

[26] Guoying Li and Zhonglian Chen. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and monte carlo. Journal of the American Statistical Association, 80(391):759–766, 1985.

[27] Liyuan Li, Weimin Huang, Irene Yu-Hua Gu, and Qi Tian. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 13(11):1459–1472, 2004.

[28] Xingguo Li and Jarvis Haupt. Identifying outliers in large matrices via randomized adaptive compressive sampling. IEEE Transactions on Signal Processing, 63(7):1792–1807, 2015.

[29] Guangcan Liu and Ping Li. Recovery of coherent data via low-rank dictionary pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 1206–1214, Montreal, Canada, 2014.

[30] Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623–5633, 2016.

[31] Panos P Markopoulos, George N Karystinos, and Dimitris A Pados. Op- timal algorithms for l1-subspace signal processing. IEEE Transactions on Signal Processing, 62(19):5046–5058, 2014.

[32] Michael McCoy, Joel A Tropp, et al. Two proposals for robust PCA using semidefinite programming. Electronic Journal of Statistics, 5:1123–1160, 2011.

[33] Vitali D Milman and Gideon Schechtman. Asymptotic theory of finite dimensional normed spaces: Isoperimetric inequalities in Riemannian manifolds, volume 1200. Springer, 2009.

[34] Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy subspace clustering. In Advances in Neural Information Processing Systems (NIPS), pages 2753–2761, 2014.

[35] Mostafa Rahmani and George K Atia. Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23):6260–6275, 2017.

[36] Mostafa Rahmani and George K Atia. Innovation pursuit: A new ap- proach to subspace clustering. IEEE Transactions on Signal Processing, 65(23):6276–6291, 2017.

[37] Mostafa Rahmani and George K Atia. Subspace clustering via optimal direction search. IEEE Signal Processing Letters, 24(12):1793–1797, 2017.

[38] Benjamin Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413–3430, 2011.

[39] Mahdi Soltanolkotabi and Emmanuel J Candes. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, pages 2195–2238, 2012.

[40] Roberto Tron and Ren´e Vidal. A benchmark for the comparison of 3-d motion segmentation algorithms. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, Minneapolis, MN, 2007.

[41] Manolis C Tsakiris and Ren´e Vidal. Dual principal component pursuit. In 2015 IEEE International Conference on Computer Vision Workshop, ICCV Workshops, pages 10–18, Santiago, Chile, 2015.

[42] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.

[43] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 2496–2504, Vancouver, Canada, 2010.

[44] Chong You, Daniel P Robinson, and Ren´e Vidal. Provable self-representation based outlier detection in a union of subspaces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4323–4332,