b

DiscoverSearch
About
My stuff
Outlier Detection and Data Clustering via Innovation Search
2019·arXiv
Abstract
Abstract

The idea of Innovation Search was proposed as a data clustering method in which the directions of innovation were utilized to compute the adjacency matrix and it was shown that Innovation Pursuit can notably outperform the self representation based subspace clustering methods. In this paper, we present a new discovery that the directions of innovation can be used to design a provable and strong robust (to outlier) PCA method. The proposed approach, dubbed iSearch, uses the direction search optimization problem to compute an optimal direction corresponding to each data point. An outlier by definition is a data point which does not participate in forming a low dimensional structure with a large number of data points in the data. In other word, an outlier carries innovation with respect to most of the other data points. iSearch utilizes the directions of innovation to measure the innovation of the data points and it identifies the outliers as the most innovative data points. Analytical performance guarantees are derived for the proposed robust PCA method under different models for the distribution of the outliers including randomly distributed outliers, clustered outliers, and linearly dependent outliers. In addition, we study the problem of outlier detection in a union of subspaces and it is shown that iSearch provably recovers the span of the inliers when the inliers lie in a union of subspaces. Moreover, we present theoretical studies which show that the proposed measure of innovation remains stable in the presence of noise and the performance of iSearch is robust to noisy data. In the challenging scenarios in which the outliers are close to each other or they are close to the span of the inliers, iSearch is shown to remarkably outperform most of the existing methods. The presented method shows that the directions of innovation [36], [37] are useful representation of the data which can be used to perform both data clustering and outlier detection.

Index Terms—Robust PCA, Outlier Detection, Convex Optimization, Innovation Search, Data Clustering

Principal Component Analysis (PCA) has been extensively used to reduce dimensionality by finding linear projections of high-dimensional data into lower dimensional subspaces. PCA finds an r-dimensional linear subspace by solving

image

where  D ∈ RM1×M2is the given data, ˆUis an orthonormal basis for the r-dimensional subspace, I is the identity matrix, ∥ · ∥Fdenotes the Frobenius norm,  M2is the number of data points, and  M1is the dimensionality of the ambient space. While PCA is useful when the data has low intrinsic dimension, it is sensitive to outliers in the sense that the solution to (1) can arbitrarily deviate from the true underlying subspace if a small portion of the data is not contained in this low-dimensional subspace.

In addition to the sensitivity of PCA to outliers, detecting the outlying data points is also an important research problem in unsupervised machine learning. Outliers are associated with important rare events such as malignant tissues [17], the failures of a system [11], [13], [38], web attacks [19], and misclassified data points [10], [35]. In this paper, the proposed outlier detection method is introduced as a robust Principal Component Analysis (PCA) algorithm, i.e., the inliers lie in a low dimensional subspace.

In the literature of robust PCA, two main models for the data corruption are considered: the element-wise data corruption model and the column-wise data corruption model. These two models are corresponding to two completely different robust PCA problems. In the element-wise model, it is assumed that a small subset of the elements of the data matrix are corrupted and the support of the corrupted elements is random. Thus, the corrupted elements are not concentrated in any column/row of the data. This problem is known as the low rank plus sparse matrix decomposition problem [1], [3], [29], [30]. In the column-wise model, a subset of the columns of the data are affected by the data corruption [4], [8], [9], [12], [21], [26], [31], [32], [43][46]. We solely focus on the column-wise data corruption model. Specifically, it is assumed that the given data matrix D follows Data Model 1.

Data Model 1. The matrix D can be expressed as D = [B A] T , where  A ∈ RM1×ni, B ∈ RM1×no, Tis an arbitrary permutation matrix, and [B A] represents the concatenation of matrices A and B. The columns of A lie in an r-dimensional subspace U. The columns of B do not lie entirely in U, i.e., the  nicolumns of A are the inliers and the  nocolumns of B are the outliers. The orthonormal matrix U ∈ RM1×ris a basis for U.

The output of a robust PCA method is a basis for U. If U is estimated accurately, the outliers can be located using a simple subspace projection [28].

Summary of Contributions: The main contributions can be summarized as follows.

iSearch introduces a new idea to the robust PCA problem. iSearch utilizes the directions of innovation to measure the Innovation of the data points. It is shown that the pro-

posed approach mostly outperforms the exiting methods in handling close outliers and noisy data.

The proposed approach and the CoP method presented in [35], to the best of our knowledge, are the only robust PCA methods which are supported with analytical performance guarantees under different models for the distributions of the outliers including the randomly distributed outliers, the clustered outliers, and the linearly dependent outliers.

In addition to considering several models for the distribution of the outliers, we provide analytical performance guarantees under different models for the distribution of the inliers too. The presumed models include the union of subspaces and the uniformly at random distribution on U ∩SM1−1where  SM1−1denotes the unit  ℓ2-norm sphere in  RM1. Moreover, the stability of iSearch in the presence of additive noisy is studied and it is proved that iSearch is robust to noisy data.

A. Notation

Given a matrix  A, ∥A∥denotes its spectral norm. For a vector  a, ∥a∥pdenotes its  ℓp-norm and a(i) its  ithelement. Given two matrices  A1and  A2with an equal number of rows, the matrix  A3 = [A1 A2]is the matrix formed by concatenating their columns. For a matrix  A, aidenotes its ithcolumn. The subspace  U⊥is the complement of U. The cardinality of set I is defined as |I|. Also, for any positive integer n, the index set {1, ..., n} is denoted [n]. The coherence between vector a and subspace H with orthonormal basis H is defined as  ∥aT H∥2.

In this section, we review some of the important related works in robust (to outlier) PCA and the Innovation Pursuit method. We refer the reader to [22], [35] for a more comprehensive review on the topic. Some of the earliest approaches to robust PCA relied on robust estimation of the data covariance matrix, such as the minimum covariance determinant, the minimum volume ellipsoid, and the StahelDonoho estimator [16]. They mostly compute a full SVD or eigenvalue decomposition in each iteration and generally have no explicit performance guarantees. The performance of these approaches greatly degrades when nino < 0.5. To enhance the robustness to the presence of the outliers, another approach is to replace the Frobenius Norm in the cost function of PCA with  ℓ1-norm [18], [25] because  ℓ1-norm were shown to be robust to the presence of the outliers [2], [18]. However, ℓ1-norm does not leverage the column-wise structure of the corruption matrix. The authors of [6] modified the  ℓ1-norm minimization problem used in [18] and replaced it with an  ℓ1,2-norm minimization problem. In [24] and [46], the optimization problem used in [6] was relaxed to two different convex optimization problems. The authors of [24] / [46] provided sufficient conditions under which the optimal point of the convex optimization problem proposed in [24] / [46] is guaranteed to be equal to a projection matrix whose column-space is equal to  U / U⊥. The approach presented in [41] focused on the scenario in which the data is predominantly unstructured outliers and the number of outliers is larger than M1. In [41], it is essential to assume that the outliers are randomly distributed on  SS−1and the inliers are distributed randomly on the intersection of  SS−1and U. In [43], a convex optimization problem was proposed which decomposes the data into a low rank component and a column sparse component. The approach presented in [43] is provable but it requires  noto be significantly smaller than  ni. The outlier detection method proposed in [39] assumes that the outliers are randomly distributed on  SM1−1and a small number of them are not linearly dependent which means that [39] is not able to detect the linearly dependent outliers and the outliers which are close to each other.

A. Review of Innovation Pursuit for Data Clustering

In contrast to most of the spectral clustering based subspace clustering methods which utilize self-representation to compute the affinity matrix, the authors of [36], [37] proposed Innovation Pursuit which utilized the directions of innovation to compute the affinity matrix. A convex optimization problem was introduced which finds an optimal direction corresponding to each data point. The Direction of Innovation corresponding to data point d is closely aligned with d but it has the minimum projection on the other data points. In [36], [37], it was shown that Innovation Pursuit can significantly outperform the self representation based methods (e.g. Sparse Subspace Clustering [7]) specifically in the scenarios in which the subspaces intersect. In this paper, we present a new discovery that the Directions of Innovation can also be utilized to devise a novel, strong, and provable robust PCA method. The table of Algorithm 1 shows the procedure to use the directions of innovation for both data clustering and outlier detection.

B. Connection and Contrast to Coherence Pursuit

In [35], the Coherence Pursuit (CoP) method was proposed as an outlier detection method. CoP computes the Coherence Values for all the data points to rank the data points. The Coherence value corresponding to data column d is a measure of resemblance between d and the rest of the data columns. CoP uses the inner product between d and the rest of the data points to measure the resemblance between d and the rest of data. In sharp contrast, iSearch leverages the directions of innovation to measure the innovation of each data point and ranks the data points accordingly. We show through theoretical studies and numerical experiments that finding the optimal directions makes iSearch notably stronger than CoP in detecting outliers which carry weak innovation.

Algorithm 1 presents the proposed outlier detection method along with the definition of the used symbols. In order to show the connection of proposed robust PCA method with the Innovation Search based clustering method [36], [37], Table 1 demonstrates both algorithms. iSearch consists of three steps. In the next subsections, Step 2 and Step 3 are discussed. In this

image

paper, we use an ADMM solver to solve (2). The computation complexity of the solver is  O(max(M1M 22 , M 21 M2)). If PCA is used in the prepossessing step to reduce the dimensionality of the data to  rd, the computation complexity of the solver is O(max(rdM 22 , r2dM2)) 1.

A. An Illustrative Example for Innovation Value

A synthetic numerical example is presented to explain the main idea behind iSearch. Assume  D ∈ R40×250, ni = 200, no = 50, r = 5and suppose that D follows Assumption 1.

Assumption 1. The columns of A are drawn uniformly at random from  U ∩ SM1−1. The columns of B are drawn uniformly at random from  SM1−1. To simplify the exposition and notation, it is assumed without loss of generality that T in Data Model 1 is the identity matrix, i.e, D = [B A].

Define d as a column of D and define  c∗as the optimal point of

image

and define the Innovation Value corresponding to d as 1/∥DT c∗∥1. The main idea of iSearch is that  c∗shows two completely different behaviours with respect to U (when d is an outlier and when d is an inlier). First assume that d is an outlier. The optimization problem (3) searches for a direction whose projection on d is non-zero and it has the minimum projection on the rest of the data points. Since d is an outlier, d has a non-zero projection on  U⊥. In addition, since  niis large, (3) searches for a direction in the ambient whose projection on U is as weak as possible. Therefore,  c∗lies in  U⊥or it is close to  U⊥. The left plot of Fig. 1 shows DT c∗when d is an outlier. In this case,  c∗is orthogonal to all the inliers. Accordingly, when d is an outliers,  ∥DT c∗∥1is approximately equal to  ∥BT c∗∥1. On the other hand, when d is an inlier, the linear constraint strongly discourages  c∗to lie in  U⊥or to be close to  U⊥. Inliers lie in a low dimensional subspace and mostly they are close to each other. Since  c∗has a strong projection on d, it has strong projections on many of the inliers. Accordingly, the value of  ∥AT c∗∥1is much larger when d is an inlier. Therefore, the Innovation Value corresponding to an inlier is smaller than the Innovation Value corresponding to an outlier because  ∥AT c∗∥1is much larger when d is an inliers. Fig. 1 compares the vector  DT c∗when d is an outliers with the same vector when d is an inlier. In addition, it shows the vector of Innovation Values (right plot). One can observe that the Innovation Values make the outliers clearly distinguishable.

image

Fig. 1. In this example, the first 50 columns are outliers. The upper left panel shows vector  |DT c∗| when dis an outlier. The upper right panel depicts |DT c∗| when dis an inlier. The bottom left plot shows the Innovation Values corresponding to all the data points (vector x was defined in Algorithm 1). The bottom right plot also shows the Innovation Values when the inliers are noisy (∥A∥2F /∥E∥2F = 4 where E ∈ RM1×nidenotes the additive noise).

B. Building the Basis Matrix

The data points corresponding to the least Innovation Values are used to construct the basis matrix Y. If the data follows Assumption 1, the r data points corresponding to the r smallest Innovation Values span U with overwhelming probability [42]. In practise, the algorithm should continue adding new columns to Y until the columns of Y spans an r-dimensional subspace. This approach requires to check the singular values of Y several times. We propose two techniques to avoid this extra steps. The first approach is based on the side information that we mostly have about the data. In many applications, we can have an upper-bound on  nobecause outliers are mostly associated with rare events. If we know that the number of outliers is less than y percent of the data, matrix Y can be constructed using  (1 − y)percent of the data columns which are corresponding to the least Innovation Values. The second approach is the adaptive column sampling method proposed in [35]. The adaptive column sampling method avoids sampling redundant columns.

C. Robustness to noise

The performance of iSearch is robust to the presence of noise because even if the inliers are noisy, the optimal direction of (3) corresponding to an outlier remains incoherent with U. The reason is that if the optimal direction is not incoherent with U and the inliers are not dominated by noise,  ∥AT c∗∥1is large because the inliers lie in a low dimensional subspace and they are close to each other. In Section IV-E, we provide a guarantee for the performance of iSearch with noisy data, an upper-bound for  ∥UT c∗∥2is established, and it is shown that

the value of the upper-bound is proportional to�r/(M1n2i ). The right plot in Fig. 1 shows the innovation values for the example described in Section III-A. In this plot, the inliers are noisy, D = [B (A + E)] where E denotes the additive noise, and  ∥A∥2F /∥E∥2F = 4. One can observe that Innovation Values clearly distinguishes the outliers from the inliers.

In this section, we analyze the performance of the proposed approach with three different models for the distribution of the outliers: unstructured outliers, clustered outliers, and linearly dependent outliers. Moreover, we analyze iSearch with two different models for the distribution of the inliers. These models include the union of subspaces and uniformly at random distribution on  U ∩ SM1−1. In addition, the performance of iSearch in the presence of noise is studied and it is shown that the direction of innovation is a reliable tool when the data is noisy. The theoretical results are followed by short discussions which highlight the important aspects of the theorems and a short description of the proof. The proofs of all the presented theoretical results are included in Appendix.

A. Randomly Distributed Outliers

In this part, it is assumed that D follows Assumption 1. In order to guarantee the performance of the proposed approach, it is enough to show that the Innovation Values corresponding to the outliers are greater than the Innovation Values corresponding to the inliers. In other word, it suffices to show

image

Before we state the theorem, let us provide the following definitions and remarks.

Definition 1. Define  c∗j = arg min

define  χ = max�{∥c∗j∥2}noi=1�, χ′ = max�{∥c∗j∥2}M2i=no+1�, and  n′z = max�{|Ii0|}noi=1�where  Ii0 = {i ∈ [no] : c∗iT bi =0} and  biis the  ithcolumn of B. The value  |Ii0|is the number of outliers which are orthogonal to  c∗i.

Remark 1. In Assumption 1, the outliers are randomly distributed. Thus, if  nois significantly larger than  M1, n′zis significantly smaller than  nowith overwhelming probability.

Theorem 1. Suppose D follows Assumption 1. Define A =

image

A > 2n′z

image

A >

image

then (4) holds and U is recovered exactly with probability at least  1 − 7δwhere

image

Proof Sketch: In order to prove Theorem 1, we leveraged this key feature of the direction of innovation: when d is an outlier, c∗is highly incoherent with U. Specifically, Theorem 1 is proved in two steps. First, we derived the sufficient conditions to guarantee that all the optimal directions correspond to the outliers,  {c∗j}noj=1, lie in  U⊥. The exact orthogonality is not a necessarily condition but it simplifies the analysis. In fact, the first two sufficient conditions in (5) guarantee it with high probability. Second, the fact that  {c∗j}noj=1are orthogonal to U was leveraged to derive the final conditions to guarantee that (4) holds with high probability.

Theorem 1 indicates that when  ni/ris sufficiently larger than  no/M1, the proposed approach is guaranteed to detect the randomly distributed outliers exactly. It is important to note that in (5),  niis scaled with 1/r but  nois scaled with 1/M1. It means that if r is sufficiently smaller than  M1, iSearch provably detects the unstructured outliers even if  nois much larger than  ni. The numerical experiments presented in Section V confirms this feature of iSearch.

B. Structured Outliers

This section provides the analysis of iSearch with structured outliers. In contrast to the unstructured outliers, structured outliers can form a low dimensional structure different from the structure of the majority of the data points. Structured outliers are associated with important rare events such as malignant tissues [17] or web attacks [19]. In this section, we assume that the outliers form a cluster outside of U. The following assumption specifies the presumed model for the distribution of the structured outliers.

Assumption 2. A column of B is formed as  bi = 1√1+η2 (q+ηvi). The unit  ℓ2-norm vector q does not lie in  U, {vi}noi=1are drawn uniformly at random from  SM1−1, and  ηis a positive number.

According to Assumption 2, the outliers cluster around vector q where  q ̸∈ U. In Algorithm 1, if the dimensionality reduction step is performed, the direction search optimization problem is applied to  QT D. Thus, (3) is equivalent to

image

where  c ∈ RM1×1and  D ∈ RM1×M2. The subspace Q is the column-space of D. In this section, we are interested in studying the performance of iSearch in identifying tightly clustered outliers because some of the existing outlier detection algorithms fail if the outliers form a tight cluster. For instance, the thresholding based method [14] and the sparse representation based algorithm [39] fail when the outliers are close to each other. Therefore, we assume that the span of Q is approximately equal to the column-space of [U q]. The following Theorem shows that even if the outliers are close to each other, iSearch successfully identifies the outliers provided that  ni/√ris sufficiently larger than  no.

Theorem 2. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-2. Assume that Q is equal to the column-space of [U q]. Define  q⊥ = (I−UUT )q∥(I−UUT )q∥2, define β = max�{1/|dTi q⊥| : di ∈ B}�, define  c∗ias the optimal point of (6) with  d = di, and assume that  η < |qT q⊥|. In addition, Define A =

image

then (4) holds and U is recovered exactly with probability at least  1 − 5δ.

Proof Sketch: Similar to the proof of Theorem 1, first we derive the sufficient conditions to guarantee that the optimal directions corresponding to the outliers are orthogonal to U (the exact orthogonality is not a necessarily condition but it simplifies the analysis). Subsequently, we leverage the structure of the optimal directions to establish an upper-bound/lower-bound for  ∥DT c∗∥1when d is an outliers/inlier.

In sharp contrast to (5), in (7)  nois not scaled with 1/√M1. Theorem 2 indicates that in contrast to the unstructured outliers, the number of the structured outliers should be sufficiently smaller than the number of the inliers for the small values of  η. This is consistent with our intuition regarding the detection of structured outliers. If the columns of B are highly structured and most of the data points are outliers, it violates the definition of outlier to label the columns of B as outliers. The presence of parameter  βemphasizes that the closer the outliers are to U, the harder it is to distinguish them. In Section V, it is shown that iSearch significantly outperforms the existing methods when the outliers are close to U. The main reason is that even if an outlier is close to U, its corresponding optimal direction obtained by (3) is highly incoherent with U.

C. Linearly Dependent Outliers

In some applications, the outliers are linearly dependent. For instance, in [10], it was shown that a robust PCA algorithm can be used to reduce the clustering error of a subspace segmentation method. In this application, a small subset of the outliers can be linearly dependent. The following assumption specifies the presumed model for matrix B.

Assumption 3. Define subspace  Uowith dimension  rosuch that  Uo /∈ Uand  U /∈ Uo. The outliers are randomly distributed on  SM1−1 ∩ Uo. The orthonormal matrix  Uo ∈ RM1×rois a basis for  Uo.

The following theorem establishes the sufficient conditions to guarantee the performance of iSearch with linearly dependent outliers. The procedure to proof the following theorem is similar to the previous theorems. The main difference was the techniques used to bound the value of  ∥DT c∗∥1.

Theorem 3. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-3. Define  A = �2πni√r −

2√ni −�

image

then (4) holds and U is recovered exactly with probability at least  1−5δwhere  η′δ = max�43 log 2(ro)/δ ,�4 noro log 2rdδ �

image

Theorem 3 indicates that  ni/rshould be sufficiently larger than  no/ro. If  rois comparable to r, it is in fact a necessary condition because we can not label the columns of B as outliers if  nois also comparable with  ni. If  rois large, the sufficient condition is similar to the sufficient conditions of Theorem 1 in which the outliers are distributed randomly on  SM1−1. It is also informative to compare the requirements of iSearch with the requirements of CoP. With iSearch,  ni/rshould be sufficiently larger than noro ∥UoU⊥∥to guarantee that the algorithm distinguishes the outliers successfully. With CoP,  ni/rishould be sufficiently larger than no/ro+∥UTo U∥ni/ri[10], [35]. The reason that CoP requires a stronger condition is that iSearch finds a direction for each outlier which is highly incoherent with U.

D. Outlier Detection When the Inliers are Clustered

In the analysis of the robust PCA methods, mostly it is assumed that the inliers are randomly distributed in U. In practise the inliers form several clusters in U. In this section, it is assumed that the inliers form m clusters. The following assumption specifies the presumed model and Theorems 4 provides the sufficient conditions.

Assumption 4. The matrix of inliers can be written as A = [A1 ... Am]TAwhere  Ak ∈ RM1×nik, �mk=1 nik = ni, and TAis an arbitrary permutation matrix. The columns of  Akare drawn uniformly at random from the intersection of subspace Ukand  SM1−1where  Ukis a d-dimensional subspace. In other word, the columns of A lie in a union of subspaces  {Uk}mk=1and  (U1 ⊕ ... ⊕ Um) = Uwhere  ⊕denotes the direct sum operator.

Theorem 4. Suppose the distribution of the outliers/inliers follows Assumption-1/Assumption-4. Define

image

If the sufficient conditions in (5) are satisfied, then the inequality (4) holds and U is recovered exactly with probability at least  1 − 7δ.

Proof sketch: The difference between Theorem 4 and Theorem 1 is in their presumed model for the distribution of the inliers. The procedure used to prove Theorem 4 is similar to the procedure used in the proof of Theorem 1 but it required new techniques to bound  ∥AT c∗∥.

In Assumption 4, the dimensions of the subspaces  {Uk}mk=1are equal and the distribution of the inliers inside these subspace are similar. Therefore, we can roughly say g = arg mink nik[24]. Thus, the sufficient conditions indicate that the population of the smallest cluster scaled by  1/√dshould be sufficiently larger than  no/M1. The parameter

image

introduced in [24]. It shows how well the inliers are distributed in U. Evidently, if the inliers populate all the directions inside U, a subspace recovery algorithm is more likely to recover U correctly. However, having a large value of permeance statistic is not a necessary condition. The reason that permeance statistic appears in the sufficient conditions is that we establish the sufficient conditions to guarantee the performance of iSearch in the worst case scenarios. In fact, if the inliers are close to each other or the subspaces  {Ui}mi=1are close to each other, generally the performance of iSearch improves because the more inliers are close to each other, the smaller their Innovation Values are.

E. Noisy Data

The key feature of the proposed approach is that when d is an outlier,  c∗is highly incoherent with U. In the proof of the presented theorems, we derived sufficient conditions to guarantee that  c∗is orthogonal to U. It is not a necessarily condition but it simplified the analysis. When the inliers are noisy, we can not find a direction which is orthogonal to all of them because they do not lie in a low dimensional subspace. In this section, first we focus on showing that even if the data is noisy, the direction of innovation corresponding to an outlier remains incoherent with U. In the following theoretical results, it is assumed that the data follows Assumption 5. According to Assumption 5, each inlier is added with a random direction and each data column has an expected squared norm equal to 1.

Assumption 5. The matrix D can be expressed as D = [B 11+σ2n (A+E)]T.The matrix  E ∈ RM1×nirepresents the presence of noise and it can be written as  E = σnNwhere the columns of  N ∈ RM1×niare drawn uniformly at random from  SM1−1and  σnis a positive number which controls the power of the added noise.

Lemma 5. Suppose D follows Assumption 5 and assume that A and B follow Assumption 1. Define  ϱ =maxi{∥UT c∗i ∥2}noi=1, υ = max{ 1∥dT R∥2 }, and

image

where R is an orthonormal basis for  U⊥. Then

image

with probability at least  1 − 4δ.

Lemma 5 establishes a upper-bound for ϱ =maxi{∥UT c∗i ∥2}noi=1. In practise,  ϱis much smaller than the value of the upper-bound because in the proof we consider the worst case scenarios. Lemma 5 indicates that the direction of innovation corresponding to an outlier stays incoherent with U provided that  ni(M1/r)is sufficiently large. The following theorem provides the sufficient conditions to guarantee the performance of iSearch with noisy data.

Theorem 6. Suppose D follows Assumption 5 and assume that A and B follow Assumption 1. Define  A =�12πni√r −√ni −�ni log 1δ2r−2and  υ = max{ 1∥dT R∥2 }. If

image

then (4) holds with probability at least  1 − 5δ.

In Theorem 6, the outliers are unstructured and its suffi-cient conditions are similar to the sufficient conditions of Theorem 1. Roughly, Theorem 6 states that if ni(1+σ2n)√ris sufficiently larger than no√M1, Innovation Values can reliably be used to distinguish the outliers. For clustered outliers and linearly dependent outliers, similar guarantees can be established.

In this section, a set of experiments with synthetic data and real data are presented to study the performance and the properties of the iSearch algorithm. In the presented experiments, iSearch is compared with the existing methods including FMS [21], GMS [46], CoP [35], OP [43], and R1-PCA [6].

A. Phase Transition

In this experiment, the phase transition of the proposed approach is studied. First, it is shown that if  ni/ris sufficiently large, iSearch can successfully recover the span of the inliers even in the dominant presence of the unstructured outliers. Define ˆUas an orthonormal basis for the recovered subspace. A trial is considered successful if

image

The data follows Assumption 1 with r = 4 and  M1 = 100. Fig. 2 shows the phase transition of iSearch versus  ni/rand no/M1. White indicates correct subspace recovery and black designates incorrect recovery. Theorem 1 indicated that if  ni/ris sufficiently large, iSearch yields exact recovery even if  nois larger than  ni. This experiment confirms the theoretical result. According to Fig. 2, even when  no = 3000, 40 inliers are enough to guarantee exact subspace recovery. In the second plot of Fig. 2, the outliers are structured and the distribution of outliers follows Assumption 2 with  no = 25. One can observe that even when  ηis small, a sufficiently large value of  niguarantees the performance of iSearch.

image

Fig. 2. Left plot: the phase transition of iSearch in presence of the un- structured outliers versus  ni/r and no/M1. White indicates correct subspace recovery and black designates incorrect recovery. Right plot: phase transition with structured outliers versus  ni and η. In both plots,  M1 = 100 and r = 4.

image

Fig. 3. The probability of accurate subspace recovery versus the number of structured outliers (ni = 100, η = 0.1, M1 = 100, and r = 10). Atrial is considered successful if∥(I−UUT ) ˆU∥F∥U∥F < 10−2 where ˆU is anorthonormal basis for the recovered subspace.

B. Structured Outliers

In this experiment, we consider structured outliers. The distribution of the outliers follows Assumption 2 with  η = 0.1and  M1 = 100. In addition, the inliers are clustered and they lie in a union of 5 2-dimensional linear subspaces. There are 20 data points in each subspace (i.e.,  ni = 100) and r = 10. A successful trial is defined similar to Section V-A. We are interested in investigating the performance of iSearch in identifying structured outliers when they are close to U. Therefore, we generate vector q, the center of the cluster of the outliers, close to U. Vector q is constructed as

image

where the unit  ℓ2-norm vector  p ∈ RM1×1is generated as a random direction on  SM1−1and the elements of  h ∈ R(r+1)×1are sampled independently from N(0, 1). The generated vector q is close to U with high probability because the column-space of [U p] is close to the column-space of U. Fig. 3 shows the probability of accurate subspace recovery versus the number of outliers. The number of evaluation runs was 50. One can observe that in contrast to the unstructured outliers, the robust PCA methods tolerate few number of structured outliers and iSearch exhibits higher robustness to the presence of the structured outliers.

C. Noisy Data

In this section, we consider the simultaneous presence of noise, the structured outliers and the unstructured outliers. In this experiment,  M1 = 100, r = 5, and  ni = 100. The data contains 300 unstructured and 10 structured outliers. The distribution of the structured outliers follow Assumption 2 with  η = 0.1. The vector q, the center of the cluster of the structured outliers, is generated as a random direction on  SM1−1. The generated data in this experiment can be expressed as D = [B (A + E)] (the matrix E represents the additive Gaussian noise). Since the data is noisy, the algorithms can not achieve exact subspace recovery. Therefore, we examine the probability that an algorithm distinguishes all the outliers correctly. Define vector  f ∈ RM2×1such that f(k) = ∥(I − ˆU ˆUT )dk∥2. A trial is considered successful if

image

Fig. 4 shows the probability of exact outlier detection versus SNR which is defined as SNR  = ∥A∥2F∥E∥2F. It shows that iSearch robustly distinguishes the outliers in the strong presence of noise. The number of evaluation runs was 50.

image

Fig. 4. The probability of exact outlier detection versus SNR. The data contains 10 structured outliers and 300 unstructured outliers (ni = 100,no = 310, r = 5, and M1 = 100).

D. Innovation Value vs Coherence Value

We simulate a scenario in which the outliers are randomly distributed but they are close to the span of the inliers. Suppose  r = 8, ni = 180, and  no = 20. The outliers are generated as [U H] G where  H ∈ RM1×2spans a random 2-dimensional subspace and the elements of  G ∈ R10×20are sampled independently from N(0, 1). Fig. 5 compares Innovation Values with Coherence Values. The last 20 columns are the outliers. One can observe that the Coherence Values can not make the outliers distinguishable from the inliers. The main reason is that in contrast to Coherence Pursuit, iSearch finds a direction corresponding to each outlier which is strongly incoherent with U.

image

Fig. 5. This figure compares the vector of the Coherence Values (right plot) with the vector of the Innovation Values (left plot). The last 20 columns are the outliers.

E. Clustered Inliers

In practice, the inliers are not necessarily distributed uniformly at random and they are mostly close to each other and they form one or multiple clusters. In this experiment, it is shown that in contrast to most of the existing methods, if the inliers form a cluster in U, the performance of iSearch improves. It is assumed that the distribution of inliers follows Assumption 6.

Assumption 6. Each column of matrix A is formed as  ai =Usi∥Usi∥2where  si = w + γzi. The vector  w ∈ Rr×1is a unit ℓ2-norm vector and  {zi}noi=1are drawn uniformly at random from  Sr−1.

image

Fig. 6. The Log-Recovery Error of the robust PCA methods versus  γ. Theparameter  γis defined in Assumption 6. Log-Recovery Error is defined as log10� ∥(I−UUT ) ˆU∥F∥U∥F �. When γis small, the inliers are close to each other and when it is larger, the distribution of the inliers is similar to the distribution of the inliers in Assumption 1 (i.e., uniformly at random on  U ∩ SM1−1).

The parameter  γcontrols how close the inliers are to each other. If  γgoes to infinity, the distribution of the inliers converges to the distribution of the inliers in Assumption 1. In this experiment,  ni = 20, M1 = 100, and  no = 20. The distribution of the outliers follows Assumption 3 with  ro = 20and the dimension of intersection between  Uoand U is equal to 5. This robust PCA algorithms are used to obtain a basis for the dominant subspace (the span of the inliers). Fig. 6 shows Log-Recovery-Error defined equal to  log10�∥(I−UUT ) ˆU∥F∥U∥F

versus parameter  γ. The number of evaluation runs was 100. One can observe that the performance of most of the robust PCA methods degrade when  γis small. The reason is that if the value of  γincreases, the distribution of the inliers becomes closer to the uniformly at random distribution

image

Fig. 7. Some of the frames of the Waving Tree video file. The highlighted frames are detected as outliers by R1-PCA.

on  U ∩ SM1−1). In contrast, Coherence Pursuit and iSearch yield better performance when the inliers form a cluster (the smaller values of  γ). The reason is that if the inliers become closer to each other, their Coherence-Values/Innovation-Values increase/decrease because their similarities increase and they are less innovative.

F. Detecting Outliers in Real Data

An application of the outlier detection methods is to identify the misclassified data points of a clustering method [10], [35]. In each identified cluster, the misclassified data points can be considered as outliers. In this experiment, we assume an imaginary clustering method whose clustering error is 25 %. The robust PCA method is applied to each cluster to find the misclassified data points. The clustering is re-evaluated after identifying the misclassified data points. Algorithm 2 adapted from [35] shows how a robust PCA algorithm is used to detect the misclassified data points and update the identified clusters. We use the Hopkins155 dataset [40], which contains data matrices with 2 or 3 clusters. In this experiment, 27 matrices with 3 clusters are used (i.e., the columns of each data matrix lie in 3 clusters). The outliers are linearly dependent and they are very close to the span of the inliers since the clusters in the Hopkins155 dataset are close to each other. In addition, the inliers form a tight cluster. Evidently, the robust PCA methods which assume that the outliers are randomly distributed fail in this task. This experiment with real data contains most of the challenges that a robust PCA method can encounter. For more details about this experiment, we refer the reader to [10], [35]. Table I shows the average clustering error after applying the robust PCA methods to the output of the clustering method. One can observe that iSearch significantly outperforms the other methods. The main reason is that iSearch is robust against outliers which are closed to U. In addition, the coherency between the inliers enhances the performance of iSearch.

TABLE I CLUSTERING ERROR AFTER USING THE ROBUST PCA METHODS TO DETECT THE MISCLASSIFIED DATA POINTS.

image

G. Activity Detection in Real Noisy Data

In this experiment, we use the robust PCA methods to identify a rare event in a video file. We use the Waving Tree video file [27]. In this video, a tree is smoothly waving and in the middle of the video a person crosses the frame. The

image

frames which only contain the background (the tree and the environment) are inliers and the few frames corresponding to the event, the presence of the person, are the outliers. Since the tree is waving, the inliers are noisy and we use r = 3 for all the methods. In addition, we identify column d as outlier if  ∥d − ˆU ˆUd∥2/∥d∥2 ≥ 0.2where ˆUis the recovered subspace. In this experiments, the outliers are very similar to each other since the consecutive frames are quite similar to each other. We use iSearch, CoP, FMS, and R1-PCA to detect the outlying frames. iSearch, CoP, and FMS identified all the outlying frames correctly. R1-PCA could not identify those frames in which the person does not move. The reason is that those frames are exactly similar to each other. Fig. 7 shows some of the outlying frames which were missed by R1-PCA.

A new robust (to outlier) PCA method, termed iSearch, was proposed which uses a convex optimization problem to measure the innovation of the data points. The proposed approach recovers the span of the inliers using the least innovative data points. It was shown that iSearch can provably recover the span of the inliers with different models for the distribution of the outliers. In addition, analytical performance guarantees for iSearch with clustered inliers were presented. It was shown that finding the optimal directions makes iSearch significantly robust to the outliers which carry weak innovation. Moreover, the experiments with real and synthetic data demonstrated the robustness of the proposed method against the strong presence of noise.

[1] Emmanuel J Cand`es, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.

[2] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[3] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

[4] Vartan Choulakian. L1-norm projection pursuit principal component analysis. Computational Statistics & Data Analysis, 50(6):1441–1451, 2006.

[5] Philip J Davis. Leonhard euler’s integral: A historical profile of the gamma function: In memoriam: Milton abramowitz. The American Mathematical Monthly, 66(10):849–869, 1959.

[6] Chris Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 281–288, Pittsburgh, PA, 2006.

[7] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.

[8] Jiashi Feng, Huan Xu, and Shuicheng Yan. Robust PCA in highdimension: A deterministic approach. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, UK, 2012.

[9] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[10] Andrew Gitlin, Biaoshuai Tao, Laura Balzano, and John Lipor. Improv- ing k-subspaces via coherence pursuit. IEEE Journal of Selected Topics in Signal Processing, 2018.

[11] Yoshiyuki Harada, Yoriyuki Yamagata, Osamu Mizuno, and Eun-Hye Choi. Log-based anomaly detection of cps using a statistical method. arXiv:1701.03249, 2017.

[12] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust subspace recovery. In The 26th Annual Conference on Learning Theory (COLT), pages 354–375, Princeton, NJ, 2013.

[13] Milos Hauskrecht, Iyad Batal, Michal Valko, Shyam Visweswaran, Gregory F Cooper, and Gilles Clermont. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics, 46(1):47– 55, 2013.

[14] Reinhard Heckel and Helmut B¨olcskei. Robust subspace clustering via thresholding. IEEE Transactions on Information Theory, 61(11):6320– 6342, 2015.

[15] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13– 30, 1963.

[16] Peter J Huber. Robust statistics. Springer, 2011.

[17] Seppo Karrila, Julian Hock Ean Lee, and Greg Tucker-Kellogg. A comparison of methods for data-driven cancer outlier discovery, and an application scheme to semisupervised predictive biomarker discovery. Cancer Informatics, 10:109, 2011.

[18] Qifa Ke and Takeo Kanade. Robust L1 norm factorization in the pres- ence of outliers and missing data by alternative convex programming. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 739–746, San Diego, CA, 2005.

[19] Christopher Kruegel and Giovanni Vigna. Anomaly detection of web- based attacks. In Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS), pages 251–261, Washington, DC, 2003.

[20] Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2001.

[21] Gilad Lerman and Tyler Maunu. Fast, robust and non-convex subspace recovery. Information and Inference: A Journal of the IMA, 7(2):277– 336, 2017.

[22] Gilad Lerman and Tyler Maunu. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380–1410, 2018.

[23] Gilad Lerman, Michael McCoy, Joel A Tropp, and Teng Zhang. Robust computation of linear models, or how to find a needle in a haystack. Technical report, DTIC Document, 2012.

[24] Gilad Lerman, Michael B McCoy, Joel A Tropp, and Teng Zhang. Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.

[25] Gilad Lerman, Teng Zhang, et al. Robust recovery of multiple subspaces by geometric lp minimization. The Annals of Statistics, 39(5):2686– 2715, 2011.

[26] Guoying Li and Zhonglian Chen. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and monte carlo. Journal of the American Statistical Association, 80(391):759–766, 1985.

[27] Liyuan Li, Weimin Huang, Irene Yu-Hua Gu, and Qi Tian. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 13(11):1459–1472, 2004.

[28] Xingguo Li and Jarvis Haupt. Identifying outliers in large matrices via randomized adaptive compressive sampling. IEEE Transactions on Signal Processing, 63(7):1792–1807, 2015.

[29] Guangcan Liu and Ping Li. Recovery of coherent data via low-rank dictionary pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 1206–1214, Montreal, Canada, 2014.

[30] Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623–5633, 2016.

[31] Panos P Markopoulos, George N Karystinos, and Dimitris A Pados. Op- timal algorithms for l1-subspace signal processing. IEEE Transactions on Signal Processing, 62(19):5046–5058, 2014.

[32] Michael McCoy, Joel A Tropp, et al. Two proposals for robust PCA using semidefinite programming. Electronic Journal of Statistics, 5:1123–1160, 2011.

[33] Vitali D Milman and Gideon Schechtman. Asymptotic theory of finite dimensional normed spaces: Isoperimetric inequalities in Riemannian manifolds, volume 1200. Springer, 2009.

[34] Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy subspace clustering. In Advances in Neural Information Processing Systems (NIPS), pages 2753–2761, 2014.

[35] Mostafa Rahmani and George K Atia. Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23):6260–6275, 2017.

[36] Mostafa Rahmani and George K Atia. Innovation pursuit: A new ap- proach to subspace clustering. IEEE Transactions on Signal Processing, 65(23):6276–6291, 2017.

[37] Mostafa Rahmani and George K Atia. Subspace clustering via optimal direction search. IEEE Signal Processing Letters, 24(12):1793–1797, 2017.

[38] Benjamin Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413–3430, 2011.

[39] Mahdi Soltanolkotabi and Emmanuel J Candes. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, pages 2195–2238, 2012.

[40] Roberto Tron and Ren´e Vidal. A benchmark for the comparison of 3-d motion segmentation algorithms. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, Minneapolis, MN, 2007.

[41] Manolis C Tsakiris and Ren´e Vidal. Dual principal component pursuit. In 2015 IEEE International Conference on Computer Vision Workshop, ICCV Workshops, pages 10–18, Santiago, Chile, 2015.

[42] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.

[43] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 2496–2504, Vancouver, Canada, 2010.

[44] Chong You, Daniel P Robinson, and Ren´e Vidal. Provable self-representation based outlier detection in a union of subspaces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4323–4332, 2017.

[45] Teng Zhang. Robust subspace recovery by Tyler’s m-estimator. Information and Inference: A Journal of the IMA, 5(1):1–21, 2016.

[46] Teng Zhang and Gilad Lerman. A novel m-estimator for robust PCA. The Journal of Machine Learning Research, 15(1):749–808, 2014.

Appendix

In this section, the proofs for Theorem 1 and Theorem 2 are provided. Before we present the proofs, we review few useful lemmas adapted from [15], [20], [23], [33][35].

Lemma 7. [20], [33], [34] Let the columns of  F ∈ RN×rbe an orthonormal basis for an r-dimensional random subspace drawn uniformly at random in an ambient N-dimensional space. For a unit  ℓ2-norm vector  c ∈ RN×1

image

Lemma 8. [15] For a given vector  a ∈ RM1×1

image

where  {ϵi}M1i=1are independent Rademacher random variables.

Lemma 9. [23] Suppose  g1, ..., gnare i.i.d. random vectors distributed uniformly on the unit sphere  SN−1in  RN. If N > 2, then

image

with probability at least  1 − δ.

Lemma 10. [35] Suppose  g1, ..., gnare i.i.d. random vectors distributed uniformly on the unit sphere  SN−1in  RN. If N > 2, then

image

with probability at least  1 − δ.

Lemma 11. [35] Suppose  g1, ..., gnare i.i.d. random vectors distributed uniformly on the unit sphere  SN−1in  RN. If N > 2, then

image

with probability at least 1 − δwhere η =max�43 log 2Nδ ,�4 nN log 2Nδ �.

Lemma 12. Suppose  F ∈ RN×rspans a random r-dimensional subspace. For a given vector  c ∈ RN×1

image

where  c1and  c2are constant real numbers and  ¯r =max(r, log N).

Lemma 13. Suppose  g1, ..., gnare i.i.d. random vectors distributed uniformly on the unit sphere  SN−1in  RNand define µN =√2 Γ((N+1)/2)N/2where  Γ(·)is the Gamma function [5]. If N > 2, then

image

A. Proof of Theorem 1

Theorem 1 is proved in two steps. First we provide the sufficient conditions to guarantee that when d is an outlier, the optimal point of (3) is highly incoherent with U with high probability. Second, it is shown that if  ni/ris sufficiently large and  c∗is close to  U⊥, then (4) holds with high probability. The algorithm does not require the optimal vectors  {c∗i }noi=1to be orthogonal to U. As long as they are sufficiently close to  U⊥, iSearch works well. However, in order to simplify the analysis, in the first step of the proof we derive the sufficient conditions to guarantee that the optimal vectors  {c∗i }noi=1lie in U⊥. The following lemma provides the sufficient conditions to guarantee that

image

Note that in this Lemma we do not make any assumption about the distribution of the inliers and the distribution of the outliers. Lemma 14 only uses Data Model 1 which is a deterministic model.

Lemma 14. Suppose D follows Data Model 1. Define  I0 ={i ∈ [M2] : ˆcT di = 0, di ∈ B}, where  ˆcis the optimal point of

image

Suppose d is one of the columns of B and define  d⊥ = (I −UUT )d/∥(I − UUT )d/∥2. If

image

then the equality (10) holds where  o = �di∈B sgn(ˆcT di) di.

In Theorem 1, the distribution of the inliers and the distribution of the outliers follow Assumption 1. The following lemma shows that if  ni/ris sufficiently large, the sufficient conditions of Lemma 14 are satisfied with high probability.

image

and

image

then the optimal point of (3) is equal to the optimal point of (11) for all  {d = di : di ∈ B}with probability at least 1 − 7δwhere √cδ = 3 max�1,� 8M1π(M1−1)r,�

image

The following Lemma assumes that the vectors  {c∗i }noi=1liein  U⊥and provides the sufficient condition to guarantee that the Innovation Values corresponding to the outliers are greater than the Innovation Values corresponding to the inliers.

Lemma 16. Suppose D follows Assumption 1. If the optimal vectors  {c∗i }noi=1lie in  U⊥and

image

then

image

with probability at least  1 − 3δ.

According to Lemma 15 and Lemma 16 and according to the analysis presented in the Proof of Lemma 15 and Lemma 16, if the sufficient conditions of Theorem 1 are satisfied then the inequality (4) holds and U is recovered exactly with probability at least  1 − 7δ.

B. Proof of Theorem 2

The iSearch method does not require the optimal vectors {c∗i }noi=1to be orthogonal to U. If they are sufficiently close to  U⊥, iSearch works well. However, in order to simplify the analysis, in the first step of the proof we derive the sufficient conditions to guarantee that the optimal vectors  {c∗i }noi=1lie in U⊥.

Lemma 17. Suppose the distribution of the inliers follows Assumption 1, the distribution of the outliers follows Assumption 2,  η < |qT q⊥|, and Q is equal to the column-space of [U q]. Define  q⊥ = (I−UUT )q∥(I−UUT )q∥2and define

image

and

image

then the optimal point of (6) is equal to q⊥dT q⊥for all {d = di : di ∈ B}with probability at least  1 − 5δwhere�c′′δ =

image

The following lemma shows that if  ni/ris sufficiently large and the optimal directions corresponding to the outliers are aligned with  q⊥, then (4) holds with high probability.

Lemma 18. Suppose the distribution of the data is similar to the presumed data distribution in Lemma 17 and define β = max�{ 1|dTi q⊥| : di ∈ B}�. If the optimal point of (6) is equal to 1dT q⊥ q⊥for all  {d = di : di ∈ B}and if

image

then (4) holds with probability at least  1 − 2δ.

According to Lemma 17, Lemma 18, and the analysis presented in the proof of these lemmas, if the sufficient conditions of Theorem 2 are satisfied, iSearch yields exact subspace recovery with probability at least  1 − 5δ.

C. Proof of Theorem 3

The procedure we use to prove Theorem 3 is similar to procedure we employed to prove Theorem 1 in Section VII-A. First we provide the sufficient conditions to guarantee that when d is an outlier, the optimal point of (3) is highly incoherent with U. Second, it is shown that if  ni/ris sufficiently large and  c∗is close to  U⊥, then (4) holds. The algorithm does not require the optimal vectors  {c∗i }noi=1to be orthogonal to U. As long as they are sufficiently close to  U⊥, iSearch works well. However, in order to simplify the analysis, in the first step of the proof we derive the sufficient conditions to guarantee that the optimal vectors  {c∗i }noi=1lie in  U⊥.

Lemma 19. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-3. If

image

then the optimal vectors  {c∗i }noi=1lie in  U⊥with probability at least  1 − 5δwhere  η′δ = max�43 log 2ro/δ ,�4 noro log 2rdδ �

image

Next Lemma assumes that the optimal vectors  {c∗i }noi=1liein  U⊥and shows that if the number of inliers is sufficiently larger than the number of outliers, the inequality (4) holds with high probability.

Lemma 20. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-3 and suppose the optimal vectors  {c∗i }noi=1lie in  U⊥. If

image

then the inequality (4) holds with probability at least  1 − 2δ.

According to Lemma 19, Lemma 20, and the analysis presented in the proof of these lemmas, if the sufficient conditions of Theorem 3 are satisfied, then (4) holds and U is recovered exactly with probability at least  1 − 5δ.

D. Proof of Theorem 4

The only difference between Theorem 1 and Theorem 4 is the difference between the presumed distribution of the inliers. In order to derive the sufficient conditions, we only need to establish a new lower-bound for inf

image

According to the clustering structure of the inliers,

image

Therefore, according to (22) and the definition of integer g,

image

image

parameter  ρshows how well the columns of A are distributed in U. According to the definition of  ρand using Lemma 9, (22) can be upper-bounded as follows

image

E. Proof of Lemma 5

The vector  c∗was defined as the optimal direction. If d is an outlier, then by definition d has non-zero projection on U⊥. Define R as an orthonormal basis for  U⊥and define d⊥ = RRT d∥dT R∥22. Accordingly to the definition of  d⊥, dT d⊥ =1 which implies that when d is an outlier we can conclude that

image

with probability at least  1 − 2δ. Next, we can derive the following lower-bound

image

which is valid with probability at least  1 − 3δ. The value of the lower-bound is smaller than the value of the upper-bound and √M1 − 0.5 < µM1[24]. Accordingly, we can conclude that (9) holds with probability at least  1 − 4δ.

F. Proof of Theorem 6

Suppose d in (3) is an inlier and  d = 11+σ2n (a + e)wherea is a clean inlier and e is the added noise. Since  dT c∗ = 1, then according to the presumed noisy data model,  ∥UT c∗∥2 ≥

(1+σ2n)/(1+σn). In addition, we can conclude that  ∥c∗∥2 ≥(1 + σ2n)/(1 + σn). Therefore,

image

with probability at least  1 − 3δ. If d is an outlier, then similar to the proof of Lemma 5,

image

with probability at least  1 − 2δ. Accordingly, if the sufficient conditions of Theorem 6 are satisfied, Innovation Values of the outliers are larger than the Innovation Values of the inliers with probability at least  1 − 5δ.

Proof of Lemma 14

The optimization problem (3) is a convex optimization problem. In order to prove that the equality (10) holds, it is enough to show that the cost function of (3) increases if we move away from  ˆcalong any feasible direction where  ˆcis the optimal point of (11). Define

image

Therefore, we only need to prove that  ω(δ) ≥ 0for any sufficiently small  δsuch that  δT d = 0. The deviation vector δshould be orthogonal to d to ensure that  ˆc − δstays in the feasible set of (3).

The vector  ˆcis the optimal point of (11). Therefore, for any vector  δosuch that

image

we have

image

The vectors  ˆcand  δoare orthogonal to U. Thus, the inequality (24) is equivalent to

image

When  δo → 0, we can rewrite (25) as

image

where the last identity follows from the Taylor expansion of the square root function. Thus,

image

has to be greater than zero for any small  δowhich satis-fies (23). The function  ω(δ)can be expanded as

image

We decompose  δin (28) as  δ = δi +δowhere  δilies in U and δolies in  U⊥. Thus,  ∥(ˆc − δ)T A∥1 − ∥ˆcT A∥1 = ∥δTi A∥1. Similar to (26), as  δ → 0,

image

Accordingly, in order to show that (10) holds, it is enough to guarantee that for any sufficiently small  δsuch that  δT d = 0,

image

Let us define vector o as

image

In order to show that the inequality (29) holds for any  δsuch that  δT d = 0, it is sufficient to guarantee that

image

We derive the sufficient conditions to guarantee that both of the inequalities in (30) hold. For the first inequality, it suffices to guarantee that

image

which can be simplified to

image

For the second inequality of (30), it suffices to show that for any pair of  δiand  δosuch that  δTi d = −δTo d, the inequality holds. We can scale both  δiand  δosuch that  δTi d = −δTo d =1. Therefore, it is enough to guarantee that

image

Define vector  d⊥ = (I − UUT )d/∥(I − UUT )d/∥2, i.e.,  d⊥is a unit  ℓ2-norm vector which is aligned with the projection of d onto  U⊥. Let us decompose  δointo two components which one component is orthogonal to d. Thus, we expand  δoas  δo = δoa + δodwhere

image

In order to show that (32) holds, it suffices to ensure that the LHS of (32) is greater than

image

Note that  δoa ∈ U⊥and  δToad = 0. Therefore, according to (27), we can conclude that  δToao − �di∈Bi∈I0��δToadi�� ≤ 0. Therefore, it suffices to show that the LHS of (32) is greater than

image

In addition, we can simplify the LHS of (32) as

image

Therefore if (31) is satisfied and the RHS of (34) is greater than the (33), then the equality (10) holds.

Proof of Lemma 15

First we provide the sufficient conditions to guarantee that the first inequality of (12) holds with high probability. The distribution of the inliers follow Assumption 1. According to Lemma 9,

image

with probability at least  1−δ. Note that (35) does not depend on any d in  {d = di : di ∈ B}. Moreover, according to Lemma 7,

image

Thus, we can rewrite o as

image

Since the columns of B are drawn uniformly at random from SM1−1, the distribution of the vectors  biin the subspace orthogonal to  ˆcis independent of  sgn(ˆcT bi). Accordingly, the distribution of o is equivalent to the distribution of

image

dependent Rademacher random variables. The optimal vector ˆcis orthogonal to U. Thus,

image

Note that in (40),  δT dˆc⊥i = δT di ≤ ∥UT di∥2. Suppose (36) is true. We use Lemma 8 to bound (40) as

image

According to (35), (37), and (41) if (13) is satisfied, the first inequality of (12) holds with probability at least  1 − 3δfor all  {d = bi}noi=1(because we guarantee that (10) holds for all {d = bi}noi=1).

In order to guarantee that the second inequality of (12) holds, we need to bound  o′T d⊥which can be expanded as

image

Note that

image

Thus, according to Lemma 10 the first component of the RHS of (42) is bounded as

image

The second component of the RHS of (42) can be expanded as

image

The vectors  d⊥and  ˆc/∥ˆc∥2are unit  ℓ2-norm vectors. Thus,

image

According to Lemma 7

image

with probability at least  1 − δfor all  {d = bi}noi=1where

image

true. Lemma 8 is used to bound the first part of the RHS of (45) as

image

In order to bound the second part of the RHS of (45), first we define vector a such that  a(i) = ˆcT di∥ˆc∥2. Note that

image

Thus, according to Lemma 11,

image

where  ηδ = max�43 log 2M1δ ,�4 noM1 log 2M1δ �. Note that (48) is true for all  {ˆc = c∗i }noi=1.

If (48) is true, Lemma 8 can be used to conclude that

image

If (46) is true, then

image

If (36) is true, then

image

According to (43),(47),(49), (50), and (51), if (14) is satisfied, the second inequality of (12) is satisfied with probability at least  1 − 4δfor all  {d = bi}noi=1.

Proof of Lemma 16

Suppose d in (3) is an inlier. According to the linear constraint of (3),  ∥UT c∗∥2 ≥ 1where  c∗is the optimal point of (3). Therefore,

image

Lemma 9 can be used to bound the RHS of (52). Therefore,

image

Note that (53) holds for all  {d = di : di ∈ A}.

Next we bound the inverse of the Innovation Value corresponding to an outlying column. Suppose d is an outlier. If the sufficient conditions of Lemma 15 are satisfied,  c∗lies in U⊥with high probability. If  c∗ ∈ U⊥, c∗T D = c∗T B. In addition,

image

Lemma 10 is used to establish the following bound on the inverse of the Innovation Value corresponding to an outlying column

image

Note that (54) holds for all  {d = di : di ∈ B}. Thus if the inequality (15) is satisfied, (4) holds with probability at least 1 − 3δ.

image

the first inequality of (30), we need to bound sup

image

can be expanded as follows

image

Rademacher random variables. In addition,  vTi δ ≤ ∥UT δ∥2. If (36) is true, we can use Lemma 8 to establish the following bound

image

In order to derive the sufficient conditions to guarantee that the second inequality of (30) holds, we need to bound

image

Lemma 7,

image

The LHS of the first inequality of (30) is bounded similar to (35). The LHS of the second inequality of (30) can be bounded similar to (34). Therefore, according to (55),(57),(35),and (34) if (16) and (17) are satisfied, then the inequalities of (30) hold for all  {d = di : di ∈ B}with probability at least  1−5δ, i.e., the optimal point of (6) is equal to 1dT q⊥ q⊥with probability at least  1 − 5δfor all  {d = di : di ∈ B}.

Proof of Lemma 18

First, we derive a lower-bound for  ∥c∗D∥1when d is an inlier. According to the first linear constraint of (6),  ∥c∗∥2 ≥ 1. In addition,  ∥c∗T D∥1 ≥ ∥c∗T A∥1and

image

We use Lemma 9 to conclude that if  d ∈ A

image

P�∥c∗TD∥1 <

image

Note that (58) holds for all the inliers with probability at least 1 − δ.

If d is an outlier and the sufficient conditions of Lemma 17 are satisfied,  c∗ ∈ U⊥ ∩ Qwith high probability. If  c∗ ∈U⊥∩Q, then  ∥c∗T D∥1 = ∥c∗T B∥1. We can expand  ∥c∗T B∥1as

image

Thus, we can use Lemma 7 to conclude that

image

(18) is satisfied the outliers are guaranteed to have greater Innovation Values with probability at least  1 − 2δ.

Proof of Lemma 19 Lemma 14 does not make any assumption about the distribution of the outliers. Accordingly, we only need to guarantee that the sufficient conditions of Lemma (14) are satisfied. First we bound the RHS of the first inequality in (12). Since the columns of B lie in  Uo, the  ℓ2-norm of the projection of the columns of B into U is bounded by  ∥UT Uo∥. Accordingly,

image

where  nz = |I0|and  I0 = {i ∈ [no] : c∗T bi = 0}. Similar to (39), o can be expanded as

image

The distribution of o is equivalent to the distribution of

image

independent Rademacher random variables. The vector  dˆc⊥iis defined similar to (38). Since  ˆcis orthogonal to U

image

image

The LHS of the first inequality of (12) is bounded similar to (35). Thus, if (19) is satisfied, the first inequality of (12) holds for all  {d = di : di ∈ B}with probability at least 1 − 2δ.

image

Define  ˆco = UoUTo ˆc∥UoUTo ˆc∥2. In addition, 1∥ˆc∥2 ∥UTo ˆc∥2 ≤∥UTo U⊥∥. Accordingly,

image

Moreover,

image

Accordingly, we can use Lemma 10 to bound the first component of the RHS of (60) as follows

image

Note that (62) is true for all  {d = di : di ∈ B}. Similar to (44) and (45), the second component in the RHS of (60) can be expanded as follows

image

Define vector a such that  a(i) = ˆcT di∥ˆc∥2. According to Lemma 11,

image

where  η′δ = max�43 log 2ro/δ ,�4 noro log 2rdδ �. If (63) is true, Lemma 8 can be used to conclude that

image

Similar to  ˆc/∥ˆc∥, d⊥lies in  U⊥and  ∥d⊥∥ = 1. Thus, we can bound the second component of the RHS of (60) similarly

image

The last component of the RHS of the second inequality of (12) can be bounded as

image

According to the definition of  ξ, the coefficient

image

can be bounded as

image

Using (62), (64), (65), (66), and (67) we can establish an upper-bound on the RHS of the second inequality of (12). Thus if (20) is satisfied, the second inequality of (12) holds with probability at least  1 − 4δfor all  {d = di : di ∈ B}.

Proof of Lemma 20

If the sufficient conditions of Lemma (19) are satisfied and if d ∈ B, then  c∗lies in  U⊥with high probability. If  c∗ ∈ U⊥, then  ∥DT c∗∥1 = ∥BT c∗∥1. Assuming that  c∗ ∈ U⊥, we use Lemma 10 and inequality (61) to conclude that

image

If  d ∈ A, the linear constraint of (3) ensures that ∥UT c∗∥2 ≥ 1. In addition,  ∥DT c∗∥1 ≥ ∥AT c∗∥1. Accordingly, when  d ∈ A, we can use Lemma 9 to derive the following bound on  ∥DT c∗∥1

image

Thus, if (21) is satisfied, the Innovation Values corresponding to the outliers are lager than the corresponding values for the inliers.


Designed for Accessibility and to further Open Science