Principal Component Analysis (PCA) has been extensively used to reduce dimensionality by finding linear projections of high-dimensional data into lower dimensional subspaces. PCA finds an r-dimensional linear subspace by solving
where is the given data,
is an orthonormal basis for the r-dimensional subspace, I is the identity matrix,
denotes the Frobenius norm,
is the number of data points, and
is the dimensionality of the ambient space. While PCA is useful when the data has low intrinsic dimension, it is sensitive to outliers in the sense that the solution to (1) can arbitrarily deviate from the true underlying subspace if a small portion of the data is not contained in this low-dimensional subspace.
In addition to the sensitivity of PCA to outliers, detecting the outlying data points is also an important research problem in unsupervised machine learning. Outliers are associated with important rare events such as malignant tissues [17], the failures of a system [11], [13], [38], web attacks [19], and misclassified data points [10], [35]. In this paper, the proposed outlier detection method is introduced as a robust Principal Component Analysis (PCA) algorithm, i.e., the inliers lie in a low dimensional subspace.
In the literature of robust PCA, two main models for the data corruption are considered: the element-wise data corruption model and the column-wise data corruption model. These two models are corresponding to two completely different robust PCA problems. In the element-wise model, it is assumed that a small subset of the elements of the data matrix are corrupted and the support of the corrupted elements is random. Thus, the corrupted elements are not concentrated in any column/row of the data. This problem is known as the low rank plus sparse matrix decomposition problem [1], [3], [29], [30]. In the column-wise model, a subset of the columns of the data are affected by the data corruption [4], [8], [9], [12], [21], [26], [31], [32], [43]–[46]. We solely focus on the column-wise data corruption model. Specifically, it is assumed that the given data matrix D follows Data Model 1.
Data Model 1. The matrix D can be expressed as D = [B A] T , where is an arbitrary permutation matrix, and [B A] represents the concatenation of matrices A and B. The columns of A lie in an r-dimensional subspace U. The columns of B do not lie entirely in U, i.e., the
columns of A are the inliers and the
columns of B are the outliers. The orthonormal matrix
is a basis for U.
The output of a robust PCA method is a basis for U. If U is estimated accurately, the outliers can be located using a simple subspace projection [28].
Summary of Contributions: The main contributions can be summarized as follows.
• iSearch introduces a new idea to the robust PCA problem. iSearch utilizes the directions of innovation to measure the Innovation of the data points. It is shown that the pro-
posed approach mostly outperforms the exiting methods in handling close outliers and noisy data.
• The proposed approach and the CoP method presented in [35], to the best of our knowledge, are the only robust PCA methods which are supported with analytical performance guarantees under different models for the distributions of the outliers including the randomly distributed outliers, the clustered outliers, and the linearly dependent outliers.
• In addition to considering several models for the distribution of the outliers, we provide analytical performance guarantees under different models for the distribution of the inliers too. The presumed models include the union of subspaces and the uniformly at random distribution on where
denotes the unit
-norm sphere in
. Moreover, the stability of iSearch in the presence of additive noisy is studied and it is proved that iSearch is robust to noisy data.
A. Notation
Given a matrix denotes its spectral norm. For a vector
denotes its
-norm and a(i) its
element. Given two matrices
and
with an equal number of rows, the matrix
is the matrix formed by concatenating their columns. For a matrix
denotes its
column. The subspace
is the complement of U. The cardinality of set I is defined as |I|. Also, for any positive integer n, the index set {1, ..., n} is denoted [n]. The coherence between vector a and subspace H with orthonormal basis H is defined as
.
In this section, we review some of the important related works in robust (to outlier) PCA and the Innovation Pursuit method. We refer the reader to [22], [35] for a more comprehensive review on the topic. Some of the earliest approaches to robust PCA relied on robust estimation of the data covariance matrix, such as the minimum covariance determinant, the minimum volume ellipsoid, and the StahelDonoho estimator [16]. They mostly compute a full SVD or eigenvalue decomposition in each iteration and generally have no explicit performance guarantees. The performance of these approaches greatly degrades when . To enhance the robustness to the presence of the outliers, another approach is to replace the Frobenius Norm in the cost function of PCA with
-norm [18], [25] because
-norm were shown to be robust to the presence of the outliers [2], [18]. However,
-norm does not leverage the column-wise structure of the corruption matrix. The authors of [6] modified the
-norm minimization problem used in [18] and replaced it with an
-norm minimization problem. In [24] and [46], the optimization problem used in [6] was relaxed to two different convex optimization problems. The authors of [24] / [46] provided sufficient conditions under which the optimal point of the convex optimization problem proposed in [24] / [46] is guaranteed to be equal to a projection matrix whose column-space is equal to
. The approach presented in [41] focused on the scenario in which the data is predominantly unstructured outliers and the number of outliers is larger than
. In [41], it is essential to assume that the outliers are randomly distributed on
and the inliers are distributed randomly on the intersection of
and U. In [43], a convex optimization problem was proposed which decomposes the data into a low rank component and a column sparse component. The approach presented in [43] is provable but it requires
to be significantly smaller than
. The outlier detection method proposed in [39] assumes that the outliers are randomly distributed on
and a small number of them are not linearly dependent which means that [39] is not able to detect the linearly dependent outliers and the outliers which are close to each other.
A. Review of Innovation Pursuit for Data Clustering
In contrast to most of the spectral clustering based subspace clustering methods which utilize self-representation to compute the affinity matrix, the authors of [36], [37] proposed Innovation Pursuit which utilized the directions of innovation to compute the affinity matrix. A convex optimization problem was introduced which finds an optimal direction corresponding to each data point. The Direction of Innovation corresponding to data point d is closely aligned with d but it has the minimum projection on the other data points. In [36], [37], it was shown that Innovation Pursuit can significantly outperform the self representation based methods (e.g. Sparse Subspace Clustering [7]) specifically in the scenarios in which the subspaces intersect. In this paper, we present a new discovery that the Directions of Innovation can also be utilized to devise a novel, strong, and provable robust PCA method. The table of Algorithm 1 shows the procedure to use the directions of innovation for both data clustering and outlier detection.
B. Connection and Contrast to Coherence Pursuit
In [35], the Coherence Pursuit (CoP) method was proposed as an outlier detection method. CoP computes the Coherence Values for all the data points to rank the data points. The Coherence value corresponding to data column d is a measure of resemblance between d and the rest of the data columns. CoP uses the inner product between d and the rest of the data points to measure the resemblance between d and the rest of data. In sharp contrast, iSearch leverages the directions of innovation to measure the innovation of each data point and ranks the data points accordingly. We show through theoretical studies and numerical experiments that finding the optimal directions makes iSearch notably stronger than CoP in detecting outliers which carry weak innovation.
Algorithm 1 presents the proposed outlier detection method along with the definition of the used symbols. In order to show the connection of proposed robust PCA method with the Innovation Search based clustering method [36], [37], Table 1 demonstrates both algorithms. iSearch consists of three steps. In the next subsections, Step 2 and Step 3 are discussed. In this
paper, we use an ADMM solver to solve (2). The computation complexity of the solver is . If PCA is used in the prepossessing step to reduce the dimensionality of the data to
, the computation complexity of the solver is
.
A. An Illustrative Example for Innovation Value
A synthetic numerical example is presented to explain the main idea behind iSearch. Assume ,
and suppose that D follows Assumption 1.
Assumption 1. The columns of A are drawn uniformly at random from . The columns of B are drawn uniformly at random from
. To simplify the exposition and notation, it is assumed without loss of generality that T in Data Model 1 is the identity matrix, i.e, D = [B A].
Define d as a column of D and define as the optimal point of
and define the Innovation Value corresponding to d as . The main idea of iSearch is that
shows two completely different behaviours with respect to U (when d is an outlier and when d is an inlier). First assume that d is an outlier. The optimization problem (3) searches for a direction whose projection on d is non-zero and it has the minimum projection on the rest of the data points. Since d is an outlier, d has a non-zero projection on
. In addition, since
is large, (3) searches for a direction in the ambient whose projection on U is as weak as possible. Therefore,
lies in
or it is close to
. The left plot of Fig. 1 shows
when d is an outlier. In this case,
is orthogonal to all the inliers. Accordingly, when d is an outliers,
is approximately equal to
. On the other hand, when d is an inlier, the linear constraint strongly discourages
to lie in
or to be close to
. Inliers lie in a low dimensional subspace and mostly they are close to each other. Since
has a strong projection on d, it has strong projections on many of the inliers. Accordingly, the value of
is much larger when d is an inlier. Therefore, the Innovation Value corresponding to an inlier is smaller than the Innovation Value corresponding to an outlier because
is much larger when d is an inliers. Fig. 1 compares the vector
when d is an outliers with the same vector when d is an inlier. In addition, it shows the vector of Innovation Values (right plot). One can observe that the Innovation Values make the outliers clearly distinguishable.
Fig. 1. In this example, the first 50 columns are outliers. The upper left panel shows vector is an outlier. The upper right panel depicts
is an inlier. The bottom left plot shows the Innovation Values corresponding to all the data points (vector x was defined in Algorithm 1). The bottom right plot also shows the Innovation Values when the inliers are noisy (
denotes the additive noise).
B. Building the Basis Matrix
The data points corresponding to the least Innovation Values are used to construct the basis matrix Y. If the data follows Assumption 1, the r data points corresponding to the r smallest Innovation Values span U with overwhelming probability [42]. In practise, the algorithm should continue adding new columns to Y until the columns of Y spans an r-dimensional subspace. This approach requires to check the singular values of Y several times. We propose two techniques to avoid this extra steps. The first approach is based on the side information that we mostly have about the data. In many applications, we can have an upper-bound on because outliers are mostly associated with rare events. If we know that the number of outliers is less than y percent of the data, matrix Y can be constructed using
percent of the data columns which are corresponding to the least Innovation Values. The second approach is the adaptive column sampling method proposed in [35]. The adaptive column sampling method avoids sampling redundant columns.
C. Robustness to noise
The performance of iSearch is robust to the presence of noise because even if the inliers are noisy, the optimal direction of (3) corresponding to an outlier remains incoherent with U. The reason is that if the optimal direction is not incoherent with U and the inliers are not dominated by noise, is large because the inliers lie in a low dimensional subspace and they are close to each other. In Section IV-E, we provide a guarantee for the performance of iSearch with noisy data, an upper-bound for
is established, and it is shown that
the value of the upper-bound is proportional to. The right plot in Fig. 1 shows the innovation values for the example described in Section III-A. In this plot, the inliers are noisy, D = [B (A + E)] where E denotes the additive noise, and
. One can observe that Innovation Values clearly distinguishes the outliers from the inliers.
In this section, we analyze the performance of the proposed approach with three different models for the distribution of the outliers: unstructured outliers, clustered outliers, and linearly dependent outliers. Moreover, we analyze iSearch with two different models for the distribution of the inliers. These models include the union of subspaces and uniformly at random distribution on . In addition, the performance of iSearch in the presence of noise is studied and it is shown that the direction of innovation is a reliable tool when the data is noisy. The theoretical results are followed by short discussions which highlight the important aspects of the theorems and a short description of the proof. The proofs of all the presented theoretical results are included in Appendix.
A. Randomly Distributed Outliers
In this part, it is assumed that D follows Assumption 1. In order to guarantee the performance of the proposed approach, it is enough to show that the Innovation Values corresponding to the outliers are greater than the Innovation Values corresponding to the inliers. In other word, it suffices to show
Before we state the theorem, let us provide the following definitions and remarks.
Definition 1. Define
define , and
where
0} and
is the
column of B. The value
is the number of outliers which are orthogonal to
.
Remark 1. In Assumption 1, the outliers are randomly distributed. Thus, if is significantly larger than
is significantly smaller than
with overwhelming probability.
Theorem 1. Suppose D follows Assumption 1. Define A =
A >
then (4) holds and U is recovered exactly with probability at least where
Proof Sketch: In order to prove Theorem 1, we leveraged this key feature of the direction of innovation: when d is an outlier, is highly incoherent with U. Specifically, Theorem 1 is proved in two steps. First, we derived the sufficient conditions to guarantee that all the optimal directions correspond to the outliers,
, lie in
. The exact orthogonality is not a necessarily condition but it simplifies the analysis. In fact, the first two sufficient conditions in (5) guarantee it with high probability. Second, the fact that
are orthogonal to U was leveraged to derive the final conditions to guarantee that (4) holds with high probability.
Theorem 1 indicates that when is sufficiently larger than
, the proposed approach is guaranteed to detect the randomly distributed outliers exactly. It is important to note that in (5),
is scaled with 1/r but
is scaled with
. It means that if r is sufficiently smaller than
, iSearch provably detects the unstructured outliers even if
is much larger than
. The numerical experiments presented in Section V confirms this feature of iSearch.
B. Structured Outliers
This section provides the analysis of iSearch with structured outliers. In contrast to the unstructured outliers, structured outliers can form a low dimensional structure different from the structure of the majority of the data points. Structured outliers are associated with important rare events such as malignant tissues [17] or web attacks [19]. In this section, we assume that the outliers form a cluster outside of U. The following assumption specifies the presumed model for the distribution of the structured outliers.
Assumption 2. A column of B is formed as . The unit
-norm vector q does not lie in
are drawn uniformly at random from
, and
is a positive number.
According to Assumption 2, the outliers cluster around vector q where . In Algorithm 1, if the dimensionality reduction step is performed, the direction search optimization problem is applied to
. Thus, (3) is equivalent to
where and
. The subspace Q is the column-space of D. In this section, we are interested in studying the performance of iSearch in identifying tightly clustered outliers because some of the existing outlier detection algorithms fail if the outliers form a tight cluster. For instance, the thresholding based method [14] and the sparse representation based algorithm [39] fail when the outliers are close to each other. Therefore, we assume that the span of Q is approximately equal to the column-space of [U q]. The following Theorem shows that even if the outliers are close to each other, iSearch successfully identifies the outliers provided that
is sufficiently larger than
.
Theorem 2. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-2. Assume that Q is equal to the column-space of [U q]. Define , define
, define
as the optimal point of (6) with
, and assume that
. In addition, Define A =
then (4) holds and U is recovered exactly with probability at least .
Proof Sketch: Similar to the proof of Theorem 1, first we derive the sufficient conditions to guarantee that the optimal directions corresponding to the outliers are orthogonal to U (the exact orthogonality is not a necessarily condition but it simplifies the analysis). Subsequently, we leverage the structure of the optimal directions to establish an upper-bound/lower-bound for when d is an outliers/inlier.
In sharp contrast to (5), in (7) is not scaled with
. Theorem 2 indicates that in contrast to the unstructured outliers, the number of the structured outliers should be sufficiently smaller than the number of the inliers for the small values of
. This is consistent with our intuition regarding the detection of structured outliers. If the columns of B are highly structured and most of the data points are outliers, it violates the definition of outlier to label the columns of B as outliers. The presence of parameter
emphasizes that the closer the outliers are to U, the harder it is to distinguish them. In Section V, it is shown that iSearch significantly outperforms the existing methods when the outliers are close to U. The main reason is that even if an outlier is close to U, its corresponding optimal direction obtained by (3) is highly incoherent with U.
C. Linearly Dependent Outliers
In some applications, the outliers are linearly dependent. For instance, in [10], it was shown that a robust PCA algorithm can be used to reduce the clustering error of a subspace segmentation method. In this application, a small subset of the outliers can be linearly dependent. The following assumption specifies the presumed model for matrix B.
Assumption 3. Define subspace with dimension
such that
and
. The outliers are randomly distributed on
. The orthonormal matrix
is a basis for
.
The following theorem establishes the sufficient conditions to guarantee the performance of iSearch with linearly dependent outliers. The procedure to proof the following theorem is similar to the previous theorems. The main difference was the techniques used to bound the value of .
Theorem 3. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-3. Define
then (4) holds and U is recovered exactly with probability at least where
Theorem 3 indicates that should be sufficiently larger than
. If
is comparable to r, it is in fact a necessary condition because we can not label the columns of B as outliers if
is also comparable with
. If
is large, the sufficient condition is similar to the sufficient conditions of Theorem 1 in which the outliers are distributed randomly on
. It is also informative to compare the requirements of iSearch with the requirements of CoP. With iSearch,
should be sufficiently larger than
to guarantee that the algorithm distinguishes the outliers successfully. With CoP,
should be sufficiently larger than
[10], [35]. The reason that CoP requires a stronger condition is that iSearch finds a direction for each outlier which is highly incoherent with U.
D. Outlier Detection When the Inliers are Clustered
In the analysis of the robust PCA methods, mostly it is assumed that the inliers are randomly distributed in U. In practise the inliers form several clusters in U. In this section, it is assumed that the inliers form m clusters. The following assumption specifies the presumed model and Theorems 4 provides the sufficient conditions.
Assumption 4. The matrix of inliers can be written as A = where
,
, and
is an arbitrary permutation matrix. The columns of
are drawn uniformly at random from the intersection of subspace
and
where
is a d-dimensional subspace. In other word, the columns of A lie in a union of subspaces
and
where
denotes the direct sum operator.
Theorem 4. Suppose the distribution of the outliers/inliers follows Assumption-1/Assumption-4. Define
If the sufficient conditions in (5) are satisfied, then the inequality (4) holds and U is recovered exactly with probability at least .
Proof sketch: The difference between Theorem 4 and Theorem 1 is in their presumed model for the distribution of the inliers. The procedure used to prove Theorem 4 is similar to the procedure used in the proof of Theorem 1 but it required new techniques to bound .
In Assumption 4, the dimensions of the subspaces are equal and the distribution of the inliers inside these subspace are similar. Therefore, we can roughly say g =
[24]. Thus, the sufficient conditions indicate that the population of the smallest cluster scaled by
should be sufficiently larger than
. The parameter
introduced in [24]. It shows how well the inliers are distributed in U. Evidently, if the inliers populate all the directions inside U, a subspace recovery algorithm is more likely to recover U correctly. However, having a large value of permeance statistic is not a necessary condition. The reason that permeance statistic appears in the sufficient conditions is that we establish the sufficient conditions to guarantee the performance of iSearch in the worst case scenarios. In fact, if the inliers are close to each other or the subspaces are close to each other, generally the performance of iSearch improves because the more inliers are close to each other, the smaller their Innovation Values are.
E. Noisy Data
The key feature of the proposed approach is that when d is an outlier, is highly incoherent with U. In the proof of the presented theorems, we derived sufficient conditions to guarantee that
is orthogonal to U. It is not a necessarily condition but it simplified the analysis. When the inliers are noisy, we can not find a direction which is orthogonal to all of them because they do not lie in a low dimensional subspace. In this section, first we focus on showing that even if the data is noisy, the direction of innovation corresponding to an outlier remains incoherent with U. In the following theoretical results, it is assumed that the data follows Assumption 5. According to Assumption 5, each inlier is added with a random direction and each data column has an expected squared norm equal to 1.
Assumption 5. The matrix D can be expressed as D = The matrix
represents the presence of noise and it can be written as
where the columns of
are drawn uniformly at random from
and
is a positive number which controls the power of the added noise.
Lemma 5. Suppose D follows Assumption 5 and assume that A and B follow Assumption 1. Define , and
where R is an orthonormal basis for . Then
with probability at least .
Lemma 5 establishes a upper-bound for . In practise,
is much smaller than the value of the upper-bound because in the proof we consider the worst case scenarios. Lemma 5 indicates that the direction of innovation corresponding to an outlier stays incoherent with U provided that
is sufficiently large. The following theorem provides the sufficient conditions to guarantee the performance of iSearch with noisy data.
Theorem 6. Suppose D follows Assumption 5 and assume that A and B follow Assumption 1. Define and
. If
then (4) holds with probability at least .
In Theorem 6, the outliers are unstructured and its suffi-cient conditions are similar to the sufficient conditions of Theorem 1. Roughly, Theorem 6 states that if is sufficiently larger than
, Innovation Values can reliably be used to distinguish the outliers. For clustered outliers and linearly dependent outliers, similar guarantees can be established.
In this section, a set of experiments with synthetic data and real data are presented to study the performance and the properties of the iSearch algorithm. In the presented experiments, iSearch is compared with the existing methods including FMS [21], GMS [46], CoP [35], OP [43], and R1-PCA [6].
A. Phase Transition
In this experiment, the phase transition of the proposed approach is studied. First, it is shown that if is sufficiently large, iSearch can successfully recover the span of the inliers even in the dominant presence of the unstructured outliers. Define
as an orthonormal basis for the recovered subspace. A trial is considered successful if
The data follows Assumption 1 with r = 4 and . Fig. 2 shows the phase transition of iSearch versus
and
. White indicates correct subspace recovery and black designates incorrect recovery. Theorem 1 indicated that if
is sufficiently large, iSearch yields exact recovery even if
is larger than
. This experiment confirms the theoretical result. According to Fig. 2, even when
, 40 inliers are enough to guarantee exact subspace recovery. In the second plot of Fig. 2, the outliers are structured and the distribution of outliers follows Assumption 2 with
. One can observe that even when
is small, a sufficiently large value of
guarantees the performance of iSearch.
Fig. 2. Left plot: the phase transition of iSearch in presence of the un- structured outliers versus . White indicates correct subspace recovery and black designates incorrect recovery. Right plot: phase transition with structured outliers versus
. In both plots,
Fig. 3. The probability of accurate subspace recovery versus the number of structured outliers (trial is considered successful if
orthonormal basis for the recovered subspace.
B. Structured Outliers
In this experiment, we consider structured outliers. The distribution of the outliers follows Assumption 2 with and
. In addition, the inliers are clustered and they lie in a union of 5 2-dimensional linear subspaces. There are 20 data points in each subspace (i.e.,
) and r = 10. A successful trial is defined similar to Section V-A. We are interested in investigating the performance of iSearch in identifying structured outliers when they are close to U. Therefore, we generate vector q, the center of the cluster of the outliers, close to U. Vector q is constructed as
where the unit -norm vector
is generated as a random direction on
and the elements of
are sampled independently from N(0, 1). The generated vector q is close to U with high probability because the column-space of [U p] is close to the column-space of U. Fig. 3 shows the probability of accurate subspace recovery versus the number of outliers. The number of evaluation runs was 50. One can observe that in contrast to the unstructured outliers, the robust PCA methods tolerate few number of structured outliers and iSearch exhibits higher robustness to the presence of the structured outliers.
C. Noisy Data
In this section, we consider the simultaneous presence of noise, the structured outliers and the unstructured outliers. In this experiment, , and
. The data contains 300 unstructured and 10 structured outliers. The distribution of the structured outliers follow Assumption 2 with
. The vector q, the center of the cluster of the structured outliers, is generated as a random direction on
. The generated data in this experiment can be expressed as D = [B (A + E)] (the matrix E represents the additive Gaussian noise). Since the data is noisy, the algorithms can not achieve exact subspace recovery. Therefore, we examine the probability that an algorithm distinguishes all the outliers correctly. Define vector
such that
. A trial is considered successful if
Fig. 4 shows the probability of exact outlier detection versus SNR which is defined as SNR . It shows that iSearch robustly distinguishes the outliers in the strong presence of noise. The number of evaluation runs was 50.
Fig. 4. The probability of exact outlier detection versus SNR. The data contains 10 structured outliers and 300 unstructured outliers (
D. Innovation Value vs Coherence Value
We simulate a scenario in which the outliers are randomly distributed but they are close to the span of the inliers. Suppose , and
. The outliers are generated as [U H] G where
spans a random 2-dimensional subspace and the elements of
are sampled independently from N(0, 1). Fig. 5 compares Innovation Values with Coherence Values. The last 20 columns are the outliers. One can observe that the Coherence Values can not make the outliers distinguishable from the inliers. The main reason is that in contrast to Coherence Pursuit, iSearch finds a direction corresponding to each outlier which is strongly incoherent with U.
Fig. 5. This figure compares the vector of the Coherence Values (right plot) with the vector of the Innovation Values (left plot). The last 20 columns are the outliers.
E. Clustered Inliers
In practice, the inliers are not necessarily distributed uniformly at random and they are mostly close to each other and they form one or multiple clusters. In this experiment, it is shown that in contrast to most of the existing methods, if the inliers form a cluster in U, the performance of iSearch improves. It is assumed that the distribution of inliers follows Assumption 6.
Assumption 6. Each column of matrix A is formed as where
. The vector
is a unit
-norm vector and
are drawn uniformly at random from
.
‘
Fig. 6. The Log-Recovery Error of the robust PCA methods versus parameter
is defined in Assumption 6. Log-Recovery Error is defined as
is small, the inliers are close to each other and when it is larger, the distribution of the inliers is similar to the distribution of the inliers in Assumption 1 (i.e., uniformly at random on
The parameter controls how close the inliers are to each other. If
goes to infinity, the distribution of the inliers converges to the distribution of the inliers in Assumption 1. In this experiment,
, and
. The distribution of the outliers follows Assumption 3 with
and the dimension of intersection between
and U is equal to 5. This robust PCA algorithms are used to obtain a basis for the dominant subspace (the span of the inliers). Fig. 6 shows Log-Recovery-Error defined equal to
versus parameter . The number of evaluation runs was 100. One can observe that the performance of most of the robust PCA methods degrade when
is small. The reason is that if the value of
increases, the distribution of the inliers becomes closer to the uniformly at random distribution
Fig. 7. Some of the frames of the Waving Tree video file. The highlighted frames are detected as outliers by R1-PCA.
on ). In contrast, Coherence Pursuit and iSearch yield better performance when the inliers form a cluster (the smaller values of
). The reason is that if the inliers become closer to each other, their Coherence-Values/Innovation-Values increase/decrease because their similarities increase and they are less innovative.
F. Detecting Outliers in Real Data
An application of the outlier detection methods is to identify the misclassified data points of a clustering method [10], [35]. In each identified cluster, the misclassified data points can be considered as outliers. In this experiment, we assume an imaginary clustering method whose clustering error is 25 %. The robust PCA method is applied to each cluster to find the misclassified data points. The clustering is re-evaluated after identifying the misclassified data points. Algorithm 2 adapted from [35] shows how a robust PCA algorithm is used to detect the misclassified data points and update the identified clusters. We use the Hopkins155 dataset [40], which contains data matrices with 2 or 3 clusters. In this experiment, 27 matrices with 3 clusters are used (i.e., the columns of each data matrix lie in 3 clusters). The outliers are linearly dependent and they are very close to the span of the inliers since the clusters in the Hopkins155 dataset are close to each other. In addition, the inliers form a tight cluster. Evidently, the robust PCA methods which assume that the outliers are randomly distributed fail in this task. This experiment with real data contains most of the challenges that a robust PCA method can encounter. For more details about this experiment, we refer the reader to [10], [35]. Table I shows the average clustering error after applying the robust PCA methods to the output of the clustering method. One can observe that iSearch significantly outperforms the other methods. The main reason is that iSearch is robust against outliers which are closed to U. In addition, the coherency between the inliers enhances the performance of iSearch.
TABLE I CLUSTERING ERROR AFTER USING THE ROBUST PCA METHODS TO DETECT THE MISCLASSIFIED DATA POINTS.
G. Activity Detection in Real Noisy Data
In this experiment, we use the robust PCA methods to identify a rare event in a video file. We use the Waving Tree video file [27]. In this video, a tree is smoothly waving and in the middle of the video a person crosses the frame. The
frames which only contain the background (the tree and the environment) are inliers and the few frames corresponding to the event, the presence of the person, are the outliers. Since the tree is waving, the inliers are noisy and we use r = 3 for all the methods. In addition, we identify column d as outlier if where
is the recovered subspace. In this experiments, the outliers are very similar to each other since the consecutive frames are quite similar to each other. We use iSearch, CoP, FMS, and R1-PCA to detect the outlying frames. iSearch, CoP, and FMS identified all the outlying frames correctly. R1-PCA could not identify those frames in which the person does not move. The reason is that those frames are exactly similar to each other. Fig. 7 shows some of the outlying frames which were missed by R1-PCA.
A new robust (to outlier) PCA method, termed iSearch, was proposed which uses a convex optimization problem to measure the innovation of the data points. The proposed approach recovers the span of the inliers using the least innovative data points. It was shown that iSearch can provably recover the span of the inliers with different models for the distribution of the outliers. In addition, analytical performance guarantees for iSearch with clustered inliers were presented. It was shown that finding the optimal directions makes iSearch significantly robust to the outliers which carry weak innovation. Moreover, the experiments with real and synthetic data demonstrated the robustness of the proposed method against the strong presence of noise.
[1] Emmanuel J Cand`es, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
[2] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[3] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
[4] Vartan Choulakian. L1-norm projection pursuit principal component analysis. Computational Statistics & Data Analysis, 50(6):1441–1451, 2006.
[5] Philip J Davis. Leonhard euler’s integral: A historical profile of the gamma function: In memoriam: Milton abramowitz. The American Mathematical Monthly, 66(10):849–869, 1959.
[6] Chris Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 281–288, Pittsburgh, PA, 2006.
[7] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
[8] Jiashi Feng, Huan Xu, and Shuicheng Yan. Robust PCA in highdimension: A deterministic approach. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, UK, 2012.
[9] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[10] Andrew Gitlin, Biaoshuai Tao, Laura Balzano, and John Lipor. Improv- ing k-subspaces via coherence pursuit. IEEE Journal of Selected Topics in Signal Processing, 2018.
[11] Yoshiyuki Harada, Yoriyuki Yamagata, Osamu Mizuno, and Eun-Hye Choi. Log-based anomaly detection of cps using a statistical method. arXiv:1701.03249, 2017.
[12] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust subspace recovery. In The 26th Annual Conference on Learning Theory (COLT), pages 354–375, Princeton, NJ, 2013.
[13] Milos Hauskrecht, Iyad Batal, Michal Valko, Shyam Visweswaran, Gregory F Cooper, and Gilles Clermont. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics, 46(1):47– 55, 2013.
[14] Reinhard Heckel and Helmut B¨olcskei. Robust subspace clustering via thresholding. IEEE Transactions on Information Theory, 61(11):6320– 6342, 2015.
[15] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13– 30, 1963.
[16] Peter J Huber. Robust statistics. Springer, 2011.
[17] Seppo Karrila, Julian Hock Ean Lee, and Greg Tucker-Kellogg. A comparison of methods for data-driven cancer outlier discovery, and an application scheme to semisupervised predictive biomarker discovery. Cancer Informatics, 10:109, 2011.
[18] Qifa Ke and Takeo Kanade. Robust L1 norm factorization in the pres- ence of outliers and missing data by alternative convex programming. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 739–746, San Diego, CA, 2005.
[19] Christopher Kruegel and Giovanni Vigna. Anomaly detection of web- based attacks. In Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS), pages 251–261, Washington, DC, 2003.
[20] Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2001.
[21] Gilad Lerman and Tyler Maunu. Fast, robust and non-convex subspace recovery. Information and Inference: A Journal of the IMA, 7(2):277– 336, 2017.
[22] Gilad Lerman and Tyler Maunu. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380–1410, 2018.
[23] Gilad Lerman, Michael McCoy, Joel A Tropp, and Teng Zhang. Robust computation of linear models, or how to find a needle in a haystack. Technical report, DTIC Document, 2012.
[24] Gilad Lerman, Michael B McCoy, Joel A Tropp, and Teng Zhang. Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.
[25] Gilad Lerman, Teng Zhang, et al. Robust recovery of multiple subspaces by geometric lp minimization. The Annals of Statistics, 39(5):2686– 2715, 2011.
[26] Guoying Li and Zhonglian Chen. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and monte carlo. Journal of the American Statistical Association, 80(391):759–766, 1985.
[27] Liyuan Li, Weimin Huang, Irene Yu-Hua Gu, and Qi Tian. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 13(11):1459–1472, 2004.
[28] Xingguo Li and Jarvis Haupt. Identifying outliers in large matrices via randomized adaptive compressive sampling. IEEE Transactions on Signal Processing, 63(7):1792–1807, 2015.
[29] Guangcan Liu and Ping Li. Recovery of coherent data via low-rank dictionary pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 1206–1214, Montreal, Canada, 2014.
[30] Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623–5633, 2016.
[31] Panos P Markopoulos, George N Karystinos, and Dimitris A Pados. Op- timal algorithms for l1-subspace signal processing. IEEE Transactions on Signal Processing, 62(19):5046–5058, 2014.
[32] Michael McCoy, Joel A Tropp, et al. Two proposals for robust PCA using semidefinite programming. Electronic Journal of Statistics, 5:1123–1160, 2011.
[33] Vitali D Milman and Gideon Schechtman. Asymptotic theory of finite dimensional normed spaces: Isoperimetric inequalities in Riemannian manifolds, volume 1200. Springer, 2009.
[34] Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy subspace clustering. In Advances in Neural Information Processing Systems (NIPS), pages 2753–2761, 2014.
[35] Mostafa Rahmani and George K Atia. Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23):6260–6275, 2017.
[36] Mostafa Rahmani and George K Atia. Innovation pursuit: A new ap- proach to subspace clustering. IEEE Transactions on Signal Processing, 65(23):6276–6291, 2017.
[37] Mostafa Rahmani and George K Atia. Subspace clustering via optimal direction search. IEEE Signal Processing Letters, 24(12):1793–1797, 2017.
[38] Benjamin Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413–3430, 2011.
[39] Mahdi Soltanolkotabi and Emmanuel J Candes. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, pages 2195–2238, 2012.
[40] Roberto Tron and Ren´e Vidal. A benchmark for the comparison of 3-d motion segmentation algorithms. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, Minneapolis, MN, 2007.
[41] Manolis C Tsakiris and Ren´e Vidal. Dual principal component pursuit. In 2015 IEEE International Conference on Computer Vision Workshop, ICCV Workshops, pages 10–18, Santiago, Chile, 2015.
[42] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.
[43] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 2496–2504, Vancouver, Canada, 2010.
[44] Chong You, Daniel P Robinson, and Ren´e Vidal. Provable self-representation based outlier detection in a union of subspaces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4323–4332, 2017.
[45] Teng Zhang. Robust subspace recovery by Tyler’s m-estimator. Information and Inference: A Journal of the IMA, 5(1):1–21, 2016.
[46] Teng Zhang and Gilad Lerman. A novel m-estimator for robust PCA. The Journal of Machine Learning Research, 15(1):749–808, 2014.
Appendix
In this section, the proofs for Theorem 1 and Theorem 2 are provided. Before we present the proofs, we review few useful lemmas adapted from [15], [20], [23], [33]–[35].
Lemma 7. [20], [33], [34] Let the columns of be an orthonormal basis for an r-dimensional random subspace drawn uniformly at random in an ambient N-dimensional space. For a unit
-norm vector
Lemma 8. [15] For a given vector
where are independent Rademacher random variables.
Lemma 9. [23] Suppose are i.i.d. random vectors distributed uniformly on the unit sphere
in
. If N > 2, then
with probability at least .
Lemma 10. [35] Suppose are i.i.d. random vectors distributed uniformly on the unit sphere
in
. If N > 2, then
with probability at least .
Lemma 11. [35] Suppose are i.i.d. random vectors distributed uniformly on the unit sphere
in
. If N > 2, then
with probability at least where
.
Lemma 12. Suppose spans a random r-dimensional subspace. For a given vector
where and
are constant real numbers and
max(r, log N).
Lemma 13. Suppose are i.i.d. random vectors distributed uniformly on the unit sphere
in
and define
where
is the Gamma function [5]. If N > 2, then
A. Proof of Theorem 1
Theorem 1 is proved in two steps. First we provide the sufficient conditions to guarantee that when d is an outlier, the optimal point of (3) is highly incoherent with U with high probability. Second, it is shown that if is sufficiently large and
is close to
, then (4) holds with high probability. The algorithm does not require the optimal vectors
to be orthogonal to U. As long as they are sufficiently close to
, iSearch works well. However, in order to simplify the analysis, in the first step of the proof we derive the sufficient conditions to guarantee that the optimal vectors
lie in
. The following lemma provides the sufficient conditions to guarantee that
Note that in this Lemma we do not make any assumption about the distribution of the inliers and the distribution of the outliers. Lemma 14 only uses Data Model 1 which is a deterministic model.
Lemma 14. Suppose D follows Data Model 1. Define , where
is the optimal point of
Suppose d is one of the columns of B and define . If
then the equality (10) holds where .
In Theorem 1, the distribution of the inliers and the distribution of the outliers follow Assumption 1. The following lemma shows that if is sufficiently large, the sufficient conditions of Lemma 14 are satisfied with high probability.
and
then the optimal point of (3) is equal to the optimal point of (11) for all with probability at least
where
The following Lemma assumes that the vectors liein
and provides the sufficient condition to guarantee that the Innovation Values corresponding to the outliers are greater than the Innovation Values corresponding to the inliers.
Lemma 16. Suppose D follows Assumption 1. If the optimal vectors lie in
and
then
with probability at least .
According to Lemma 15 and Lemma 16 and according to the analysis presented in the Proof of Lemma 15 and Lemma 16, if the sufficient conditions of Theorem 1 are satisfied then the inequality (4) holds and U is recovered exactly with probability at least .
B. Proof of Theorem 2
The iSearch method does not require the optimal vectors to be orthogonal to U. If they are sufficiently close to
, iSearch works well. However, in order to simplify the analysis, in the first step of the proof we derive the sufficient conditions to guarantee that the optimal vectors
lie in
.
Lemma 17. Suppose the distribution of the inliers follows Assumption 1, the distribution of the outliers follows Assumption 2, , and Q is equal to the column-space of [U q]. Define
and define
and
then the optimal point of (6) is equal to for all {d =
with probability at least
where
The following lemma shows that if is sufficiently large and the optimal directions corresponding to the outliers are aligned with
, then (4) holds with high probability.
Lemma 18. Suppose the distribution of the data is similar to the presumed data distribution in Lemma 17 and define . If the optimal point of (6) is equal to
for all
and if
then (4) holds with probability at least .
According to Lemma 17, Lemma 18, and the analysis presented in the proof of these lemmas, if the sufficient conditions of Theorem 2 are satisfied, iSearch yields exact subspace recovery with probability at least .
C. Proof of Theorem 3
The procedure we use to prove Theorem 3 is similar to procedure we employed to prove Theorem 1 in Section VII-A. First we provide the sufficient conditions to guarantee that when d is an outlier, the optimal point of (3) is highly incoherent with U. Second, it is shown that if is sufficiently large and
is close to
, then (4) holds. The algorithm does not require the optimal vectors
to be orthogonal to U. As long as they are sufficiently close to
, iSearch works well. However, in order to simplify the analysis, in the first step of the proof we derive the sufficient conditions to guarantee that the optimal vectors
lie in
.
Lemma 19. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-3. If
then the optimal vectors lie in
with probability at least
where
Next Lemma assumes that the optimal vectors liein
and shows that if the number of inliers is sufficiently larger than the number of outliers, the inequality (4) holds with high probability.
Lemma 20. Suppose the distribution of the inliers/outliers follows Assumption-1/Assumption-3 and suppose the optimal vectors lie in
. If
then the inequality (4) holds with probability at least .
According to Lemma 19, Lemma 20, and the analysis presented in the proof of these lemmas, if the sufficient conditions of Theorem 3 are satisfied, then (4) holds and U is recovered exactly with probability at least .
D. Proof of Theorem 4
The only difference between Theorem 1 and Theorem 4 is the difference between the presumed distribution of the inliers. In order to derive the sufficient conditions, we only need to establish a new lower-bound for inf
According to the clustering structure of the inliers,
Therefore, according to (22) and the definition of integer g,
parameter shows how well the columns of A are distributed in U. According to the definition of
and using Lemma 9, (22) can be upper-bounded as follows
E. Proof of Lemma 5
The vector was defined as the optimal direction. If d is an outlier, then by definition d has non-zero projection on
. Define R as an orthonormal basis for
and define
. Accordingly to the definition of
1 which implies that when d is an outlier we can conclude that
with probability at least . Next, we can derive the following lower-bound
which is valid with probability at least . The value of the lower-bound is smaller than the value of the upper-bound and
[24]. Accordingly, we can conclude that (9) holds with probability at least
.
F. Proof of Theorem 6
Suppose d in (3) is an inlier and wherea is a clean inlier and e is the added noise. Since
, then according to the presumed noisy data model,
. In addition, we can conclude that
. Therefore,
with probability at least . If d is an outlier, then similar to the proof of Lemma 5,
with probability at least . Accordingly, if the sufficient conditions of Theorem 6 are satisfied, Innovation Values of the outliers are larger than the Innovation Values of the inliers with probability at least
.
Proof of Lemma 14
The optimization problem (3) is a convex optimization problem. In order to prove that the equality (10) holds, it is enough to show that the cost function of (3) increases if we move away from along any feasible direction where
is the optimal point of (11). Define
Therefore, we only need to prove that for any sufficiently small
such that
. The deviation vector
should be orthogonal to d to ensure that
stays in the feasible set of (3).
The vector is the optimal point of (11). Therefore, for any vector
such that
we have
The vectors and
are orthogonal to U. Thus, the inequality (24) is equivalent to
When , we can rewrite (25) as
where the last identity follows from the Taylor expansion of the square root function. Thus,
has to be greater than zero for any small which satis-fies (23). The function
can be expanded as
We decompose in (28) as
where
lies in U and
lies in
. Thus,
. Similar to (26), as
,
Accordingly, in order to show that (10) holds, it is enough to guarantee that for any sufficiently small such that
,
Let us define vector o as
In order to show that the inequality (29) holds for any such that
, it is sufficient to guarantee that
We derive the sufficient conditions to guarantee that both of the inequalities in (30) hold. For the first inequality, it suffices to guarantee that
which can be simplified to
For the second inequality of (30), it suffices to show that for any pair of and
such that
, the inequality holds. We can scale both
and
such that
1. Therefore, it is enough to guarantee that
Define vector , i.e.,
is a unit
-norm vector which is aligned with the projection of d onto
. Let us decompose
into two components which one component is orthogonal to d. Thus, we expand
as
where
In order to show that (32) holds, it suffices to ensure that the LHS of (32) is greater than
Note that and
. Therefore, according to (27), we can conclude that
. Therefore, it suffices to show that the LHS of (32) is greater than
In addition, we can simplify the LHS of (32) as
Therefore if (31) is satisfied and the RHS of (34) is greater than the (33), then the equality (10) holds.
Proof of Lemma 15
First we provide the sufficient conditions to guarantee that the first inequality of (12) holds with high probability. The distribution of the inliers follow Assumption 1. According to Lemma 9,
with probability at least . Note that (35) does not depend on any d in
. Moreover, according to Lemma 7,
Thus, we can rewrite o as
Since the columns of B are drawn uniformly at random from , the distribution of the vectors
in the subspace orthogonal to
is independent of
. Accordingly, the distribution of o is equivalent to the distribution of
dependent Rademacher random variables. The optimal vector is orthogonal to U. Thus,
Note that in (40), . Suppose (36) is true. We use Lemma 8 to bound (40) as
According to (35), (37), and (41) if (13) is satisfied, the first inequality of (12) holds with probability at least for all
(because we guarantee that (10) holds for all
).
In order to guarantee that the second inequality of (12) holds, we need to bound which can be expanded as
Note that
Thus, according to Lemma 10 the first component of the RHS of (42) is bounded as
The second component of the RHS of (42) can be expanded as
The vectors and
are unit
-norm vectors. Thus,
According to Lemma 7
with probability at least for all
where
true. Lemma 8 is used to bound the first part of the RHS of (45) as
In order to bound the second part of the RHS of (45), first we define vector a such that . Note that
Thus, according to Lemma 11,
where . Note that (48) is true for all
.
If (48) is true, Lemma 8 can be used to conclude that
If (46) is true, then
If (36) is true, then
According to (43),(47),(49), (50), and (51), if (14) is satisfied, the second inequality of (12) is satisfied with probability at least for all
.
Proof of Lemma 16
Suppose d in (3) is an inlier. According to the linear constraint of (3), where
is the optimal point of (3). Therefore,
Lemma 9 can be used to bound the RHS of (52). Therefore,
Note that (53) holds for all .
Next we bound the inverse of the Innovation Value corresponding to an outlying column. Suppose d is an outlier. If the sufficient conditions of Lemma 15 are satisfied, lies in
with high probability. If
. In addition,
Lemma 10 is used to establish the following bound on the inverse of the Innovation Value corresponding to an outlying column
Note that (54) holds for all . Thus if the inequality (15) is satisfied, (4) holds with probability at least
.
the first inequality of (30), we need to bound sup
can be expanded as follows
Rademacher random variables. In addition, . If (36) is true, we can use Lemma 8 to establish the following bound
In order to derive the sufficient conditions to guarantee that the second inequality of (30) holds, we need to bound
Lemma 7,
The LHS of the first inequality of (30) is bounded similar to (35). The LHS of the second inequality of (30) can be bounded similar to (34). Therefore, according to (55),(57),(35),and (34) if (16) and (17) are satisfied, then the inequalities of (30) hold for all with probability at least
, i.e., the optimal point of (6) is equal to
with probability at least
for all
.
Proof of Lemma 18
First, we derive a lower-bound for when d is an inlier. According to the first linear constraint of (6),
. In addition,
and
We use Lemma 9 to conclude that if
D
Note that (58) holds for all the inliers with probability at least .
If d is an outlier and the sufficient conditions of Lemma 17 are satisfied, with high probability. If
, then
. We can expand
as
Thus, we can use Lemma 7 to conclude that
(18) is satisfied the outliers are guaranteed to have greater Innovation Values with probability at least .
Proof of Lemma 19 Lemma 14 does not make any assumption about the distribution of the outliers. Accordingly, we only need to guarantee that the sufficient conditions of Lemma (14) are satisfied. First we bound the RHS of the first inequality in (12). Since the columns of B lie in , the
-norm of the projection of the columns of B into U is bounded by
. Accordingly,
where and
. Similar to (39), o can be expanded as
The distribution of o is equivalent to the distribution of
independent Rademacher random variables. The vector is defined similar to (38). Since
is orthogonal to U
The LHS of the first inequality of (12) is bounded similar to (35). Thus, if (19) is satisfied, the first inequality of (12) holds for all with probability at least
.
Define . In addition,
. Accordingly,
Moreover,
Accordingly, we can use Lemma 10 to bound the first component of the RHS of (60) as follows
Note that (62) is true for all . Similar to (44) and (45), the second component in the RHS of (60) can be expanded as follows
Define vector a such that . According to Lemma 11,
where . If (63) is true, Lemma 8 can be used to conclude that
Similar to lies in
and
. Thus, we can bound the second component of the RHS of (60) similarly
The last component of the RHS of the second inequality of (12) can be bounded as
According to the definition of , the coefficient
can be bounded as
Using (62), (64), (65), (66), and (67) we can establish an upper-bound on the RHS of the second inequality of (12). Thus if (20) is satisfied, the second inequality of (12) holds with probability at least for all
.
Proof of Lemma 20
If the sufficient conditions of Lemma (19) are satisfied and if , then
lies in
with high probability. If
, then
. Assuming that
, we use Lemma 10 and inequality (61) to conclude that
If , the linear constraint of (3) ensures that
. In addition,
. Accordingly, when
, we can use Lemma 9 to derive the following bound on
Thus, if (21) is satisfied, the Innovation Values corresponding to the outliers are lager than the corresponding values for the inliers.