Suppose we have a data matrix with incomplete/missing entries, say, we have information about only a small number elements of M. The matrix completion problem ([2]) is to predict those missing entries as accurately as possible based on the observed entries. Such partiallyobserved data may appear in many application domains. For example, in a user-recommendation system (a.k.a collaborative filtering) we have incomplete user ratings for various products, and the goal is to make predictions about a user’s preferences for all the products (e.g., the Netflix problem). Also, the incomplete data could represent partial distance matrix in a sensor network, or missing pixels in digital images because of occlusion or tracking failures in a video surveillance system ([3]).
More mathematically, we have information about the entries Ω, where Ω
[
] is a sampled subset of all entries, and [n] denotes the list {1, ..., n}. The problem is to recover the unknown matrix M in a computationally tractable way from as few observed entries as possible. However, without further assumption on M it is impossible to predict the unobserved elements from a limited number of known entries. One popular assumption is that M has low-rank, say rank
. Such matrices have degrees of freedom (
, i.e., this many parameters control all other entries. This implies, if the number of observed entries
there can be infinitely many matrices of rank at most
with exactly the same entries in Ω; therefore, exact recovery of unobserved entries is impossible. So, in general, we need at least (
many observed entries for exact matrix completion. The matrix M, with the observed entries, can be interpreted as an element in mn-dimensional linear space, with available information about
) coordinates. The set of matrices compatible with the observed entries forms a large affine space. Then, exact matrix completion problem is to specify an efficient algorithm which uniquely picks M from this high-dimensional affine space ([9]). Since our target matrix M is low-rank, a natural optimization problem for finding M would be to find a matrix with minimum rank complying with the observed entries. However, minimizing rank over an affine space is known to be NP-hard ([12]). [2] proposed to solve the heuristic optimization in (1) (surrogate for rank minimization, [7]) to recover the low-rank matrix M.
where the nuclear norm of a matrix X is defined as the sum of its singular values,
(1) is a convex optimization problem that is efficiently solvable via semi- definite programming. Exact matrix completion thus becomes proving that the nuclear norm restricted to the affine space has a strict and global minima at M. That is, if
matrix in the affine space in (1), we need to show
[2], [9], [13], [1] developed the sufficient conditions and probabilistic tools to recover M as a unique solution to (1).
One natural question is: which elements of M should we observe in (1), i.e., how should we construct the sample set Ω? We want to define some probabilities on the entries of M. Most of the existing work focused on the case when Ω in (1) is constructed by observing the entries of M uniformly randomly ([2, 9, 13, 1]). However, this data-oblivious sampling scheme has a cost. If the matrix is very sparse, it cannot be recovered using uniform sampling of its entries, unless we observe almost all the entries. This is because by observing only zeros it is impossible to predict non-zeros of a matrix. This suggests that M cannot be in the null space of the sampling operator (to be defined later) extracting the values of a subset of the entries. Matrices similar to the above example can be characterized by the structure of their singular vectors. The singular vectors are (closely) ‘aligned’ with the standard basis. Therefore, the components of singular vectors should be sufficiently spread to reduce the number of observations needed to recover a low-rank matrix ([13]). Such restrictions on the row and column spaces of a low-rank matrix are called the incoherence assumptions (to be defined later). [9, 13] showed that such restricted class of matrices of rank
can be recovered exactly, with high probability, by observing as small as
) entries sampled uniformly. Very recently, [5] proposed non-uniform probabilities proportional to the sum of row and column leverage scores of M to observe its entries (leveraged sampling). They eliminated the need for those ‘incoherence’ assumptions, and showed that any arbitrary
matrix of rank
can be recovered exactly, with high probability, from as few as
) observed elements.
Similar to [5], we also incorporate the row and column leverage scores of the recovering matrix M into our proposed probability of observing an entry. However, we use a relaxed notion of leverage score sampling. Specifically, we propose to observe an entry with probability proportional to the sum of the corresponding row and column leverage scores, minus their product. Theorem 1 shows that observing entries according to this relaxed leverage score sampling in (3), we can recover matrix of rank-
exactly, with high probability, from Θ(((
)) observed entries, via (1). This bound on the sample size is optimal (up to log
) factor) in the number of degrees of freedom of a rank-
matrix. Also, this can give us
)) improvement on the sample size in [5] for
For an whose column space is incoherent and row space is arbitrarily coherent, [5] gives a provable sampling scheme (using leveraged sampling) which requires no prior knowledge of the leverage scores of M. They show that such M can be recovered exactly, with high probability, using sample size Θ(
). We can incorporate our relaxed leverage scores in such setting, with no prior knowledge of leverage scores, to achieve improvement on the sample
Table 1: Summary of bound on sample size s for exact recovery of matrix
size obtained by [5] while recovering exactly with high probability. Finally, our notion of relaxation in sampling probabilities can also achieve an improvement on the sample size of [5] even in case of uniform sampling for incoherent matrices. Table 1 summarizes some of the recent results in the literature.
1.1 Notations and preliminaries
[n] denotes natural number {1, ..., n}. Natural logarithm of x is denoted by log(x). Matrices are bold uppercase, vectors are bold lowercase, and scalars are not bold. We denote the (i, j)-th entry of a matrix denotes the i-th standard basis vector whose dimension should be clear from the context.
denote the transpose of matrix X and vector x, respectively. Tr(X) denotes the trace of a square matrix X.
Spectral norm of X is denoted by . The inner product between two matrices is
Tr(
). Frobenius norm X is denoted by
. The maximum entry of X is denoted by
. For vectors Euclidean
norm is denoted by
Linear operators acting on matrices are denoted by calligraphic letters. The spectral norm (largest singular value) of such operator A will be denoted by Also, we denote
), for some positive universal constants
Our focus is to define a probabilities on the entries of M (i.e., to construct the sample set Ω in (1)) to reduce the sample size, such that M becomes the unique optimal solution to (1). Here we use the Bernoulli sampling model ([1]), where each entry (i, j) is observed independently with some probability . Before we state our main result and the distribution, we first need to define the normalized leverage scores ([2, 13, 5]).
left and right singular matrices, respectively, and is the diagonal matrix of singular values. Normalized leverage scores for i-th row (denoted by
-th column (denoted by
defined as follows:
Normalized leverage scores are non-negative, and they depend on the structure of row and column spaces of the matrix. Also, we have
orthonormal columns. We state our main result.
Suppose, we have a subset of observed entries Ω
[
, where each entry (i, j) is observed independently with probability
, such that,
where is some universal constant. Then, M is the unique opti- mal solution to (1) with probability at least 1
, for sufficiently large c > 3. Moreover, if the number of observed entries, according to (3), is Θ
then, M is the unique optimal solution to (1) with probability at least 1
for sufficiently large c > 3.
Row and column leverage scores measure the contribution of a row or column to the low-rank subspace ([6, 11]). Probabilities in (3) are biased towards the leverage score structure of the recovering matrix. This suggests that the elements in important rows and columns, indicated by high leverage scores , of a matrix should be observed more frequently in order to reduce the number of observations needed for exact matrix completion. [5] also noticed this, and they proposed to sum up
in the sampling probabilities. However, our probabilities in (3) reduce this bias by subtracting the term
while maintaining the leverage score pattern in
. This relaxation in probabilities can help us to reduce the number of observations comparing to [5], in additive sense, to recover the low-rank matrix exactly, via (1).
One simple intuition behind this relaxation comes from basic set theory. Let the indicators of row and column leverage scores of the recovering matrix, respectively, where the probabilities are
. Then, we want to sample indices (i, j) according to row or column leverage scores, i.e., sampling probabilities
to be proportional to
are independent quantities (an element with high row leverage may or may not have high column leverage, and vice versa). Thus,
and
A practical implication of such relaxation could be as follows. Note that, 0
-th column is important, i.e.,
is high (say close to 1) we want to observe all the elements along that column as
. This eliminates the need for row leverage information of the elements of a column with high leverage score (and vice versa). This reduction of information for exact completion can lead to a smaller sample set in (1).
As discussed earlier, we need a minimum of Θ(() elements to recover a matrix exactly, regardless of the choice of probabilities. Theorem 1 proves that if we observe elements according to our relaxed leverage scores, we match this lower bound, up to a factor of
n)). Here is a comparison of our bound with that of [5]. 1) We provide a proof for general
matrix of rank-
(as opposed to
case of [5]), 2) our form of probabilities establishes an upper bound on the expected sample size E[s]. To see this, note that
comparing to
in [5], 3) our sample size to solve (1) can give an improvement on that of [5] in terms of degrees of freedom (note that (
Also, using the relaxed leverage scores we can observe improvement on sample size even in case of uniform sampling for matrices with incoherence restrictions. Let be the rank-
reconstructing matrix with SVD
. [2, 3, 13, 9] use two incoherence parameters,
, for exact matrix completion using uniform sampling, where, (a) max
(b)
. A meaningful range of
. [13] showed that if the sampling probability is uniform, such that,
where
is a constant, then M is the unique optimal solution of (1) with high probability. The lower bound achieved on the sample size in [13] (sample-with-replacement model) was
, and it could create a suboptimal dependence of sample size on
, in the worst case. Theorem 2 of [5] implies that observing entries with uniform probability satisfying,
, for some constant
, would recover the matrix exactly, with high probability. In this case, the lower bound on sample size is
[5] eliminated the need for the parameter
, and consequently the suboptimal dependence on
It follows from Theorem 1 that we can recover the matrix exactly, with high probability, if each entry is sampled uniformly with probability
where is a constant. This can improve
)) on the sample size of [5].
2.1 Column-Space-Incoherent Matrix Completion
Here we discuss exact completion of a low-rank matrix whose column space is incoherent, and we have control over the sampling of matrix entries. This setting is interesting in application domains like recommendation systems and gene expression data analysis ([10]).
Algorithm 1, adopted from [5], performs exact completion of a matrix M with incoherent column space, without a priori knowledge of leverage scores of M. Step 3 of Algorithm 1 computes the column leverage scores of M exactly, from only a small number of (uniformly) observed rows. We construct an additional sample set Ω of observed entries using our relaxed leverage scores in Step 4. Step 5 solves the nuclear norm minimization problem in (1) with Ω to recover M exactly. Theorem 2 proves the correctness of Algorithm 1.
Theorem 2 Algorithm 1 computes the column leverage scores of M exactly (step 3), i.e., ˜. Using the sample set Ω, Algorithm 1 recovers M as the unique optimal solution of (1). The total number of samples required by Algorithm 1 is Θ(
The results hold with probability at least, 1
, for sufficiently large c > 3.
We compare the bound on sample size in Theorem 2 with a couple of existing results. Let us assume m = n for simplicity. Theorem 2 can achieve an additive improvement on the sample size of [5] while recovering M exactly, via (1). [10] proposed an adaptive sampling algorithm that recovers M exactly, with probability at least 1
), and a sample size Θ(
Assuming comparable failure probabilities, sample size in Theorem 2 is better when
is not too small.
2.2 Coherent Matrix Completion using Two-Phase-Sampling
In reality, we do not have knowledge about the leverage scores of when we have control over how to choose entries. [5] proposed a heuristic two-phase sampling procedure (Algorithm 1 of [5]) for exact matrix completion with no a priori knowledge about the leverage scores. Here is an informal description of it.
Let, the total budget of samples be 1] be a parameter. First, construct an initial set Ω
by sampling entries uniformly (without replacement), such that,
matrix with ˜
. Let the rank-
be ˜
. Compute the leverage scores of ˜M and use them as estimates for the leverage scores of M, i.e., use ˜
second phase, use these estimates to sample (without replacement) remaining (1
entries of M with probabilities proportional to (˜
), to form the sample set Ω
Then perform matrix completion using sample set Ω = Ω
This heuristic is shown to work well on synthetic data that are less coherent ([5]). For highly coherent data, e.g., only few entries are non-zeros and others are zeros, it works poorly, as expected. We can incorporate our notion of relaxed leverage scores into the second phase of the above procedure by observing (without replacement) the remaining (1 entries of M with probabilities
) to form sample set ˆΩ
, and perform nuclear norm minimization in (1) using Ω = Ω
. We expect our relaxed leverage score sampling to follow similar trend as above, although we do not evaluate this heuristic numerically.
Section 3 shows experimental results on real datasets to support the theoretical gain on the sample size using relaxed leverage score sampling. We give proof sketch of Theorem 1 and Theorem 2 in Section 4 and Section 5, respectively, closely following the proof outline of [5]. We highlighted the main technical differences between our result and [5] in Section 4.
We show experimental performance of the exact recovery of real data matrices via nuclear norm minimization in (1) using our relaxed leverage score sampling. We use the software ‘TFOCS’ v1.2, written by Stephen Becker, Emmanuel Candes, and Michael Grant, to solve (1).
Let M be the rank-data matrix. We form the sample set Ω
by observing (i, j)-th entry of M according to the relaxed leverage score probabilities in (5):
where is a universal constant. Similarly, we form the sample set Ω
by observing
according to the leverage score probabilities in (6):
where is a universal constant. We use Ω
in the optimization problem (1), separately, to recover
be the optimal solution to (1) using a sample set Ω. We say
recovers M exactly if is a tiny fraction. We set
001. We perform 10 independent trials (sampling and recovery) and declare success if M is recovered exactly at least 9 times. Let
be the average sample size for successful recovery of
and Ω
, respectively. We expect
, and our gain in sample size (
) to be strictly positive, as suggested by the theory. Further, we investigate how (
) behaves with respect to the rank
. For this, we define
We expect ∆to be close to
as the theory suggests (
). For fairness of comparison, we use the same random seed for both the sampling methods in (5) and (6).
3.1 Datasets
We use the following two datasets.
MovieLens: This collaborative filtering dataset contains 100,000 ratings in the range 1 and 5 by 943 users on 1682 movies. Each user has rated at least 20 movies. This dataset is numerically not low-rank. We perform rank truncation to create an explicit low-rank matrix to apply the theory in (1). We observe the singular value spectrum of this data to heuristically choose two values for rank:
TechTC: We use a dataset from the Technion Repository of Text Categorization Database (TechTC) ([8]). Here each row is a document describing a topic, and words (columns) are the features for the topics. The (i, j)-th entry of this matrix is the frequency of j-th word appearing in i-th document. We choose a dataset containing the topics with IDs 11346 ans 22294. We preprocessed the data by removing all words of length four or less. Then, each row is normalized to have unit norm. Also, we observe the singular value spectrum of this preprocessed 125data to heuristically choose two values for rank:
= 20, to make the data explicitly low-rank.
3.2 Results
Figures 1 and 2 plot the singular values and the normalized leverage scores for rank-10 approximation for MovieLens and TechTC data, respectively. Normalized leverage scores are close to 1 when they are incoherent in nature. MovieLens is reasonably coherent, and TechTC has ex- tremely high coherence. Table 2 shows the constants , and the normalized gain ∆
Figure 1: [MovieLens] Singular values and the leverage scores for
exact recovery of MovieLens data. We see , as expected. We observe similar results for TechTC data in Table 3. Overall, these results support the accuracy of the theoretical analysis on the gain in sample size using the relaxed leverage score sampling for exact recovery of a low-rank matrix via (1).
Figure 2: [TechTC] Singular values, and the leverage scores for
Table 2: [MovieLens] Gain in sample size for exact recovery using relaxed leverage score sampling.
Table 3: [TechTC] Gain in sample size for exact recovery using relaxed leverage score sampling.
The main proof strategy was outlined by [2, 13, 9]: it is sufficient to construct a dual certificate Y obeying specific sub-gradient inequalities in order to show that M is the unique optimal solution to (1) (see Section 6 for more detail). We give a proof of Theorem 1 closely following the proof strategy of [13, 5]. Before stating the optimality conditions we need additional notations.
tively ) denote the k-th column of U (respectively V). Let T be a linear space spanned by elements of the form
, for arbitrary
be its orthogonal complement, i.e.,
is spanned by the family (
(respectively y) is any vector orthogonal to the space spanned by the left singular vectors (respectively right singular vectors). Then, orthogonal projection onto T is given by the linear operator
as
Similarly, orthogonal projection onto
Note that any can be expressed as a sum of rank-one matrices as follows:
We define the sampling operator
where, ) being the indicator function. That is,
extracts the terms, corresponding to the indices (
Ω, from (8) to form a partial sum in (9). Let
matrix with (
Ω, and zero otherwise.
4.1 Optimality Conditions
Following the proof road map of [13, 5], we restate the sufficient conditions for M to be the unique optimal solution to (1) (Section 6 contains a proof of sufficiency).
is the unique optimal solution to (1) if the following conditions hold:
Condition 1 of Proposition 1 suggests should be nearly the identity operator on the subspace T. Next we discuss the construction of a dual certificate Y.
4.1.1 Constructing the Dual Certificate
We follow the so-called golfing scheme [9, 1, 5] to construct a matrix Y (the dual certificate) that satisfies Condition 2 in Proposition 1. Recall, we assume that the set of observed elements Ω follows the Bernoulli model with parameter , i.e., each index (i, j) is observed independently with
in eqn (3)). We denote this by Ω
). Further, we assume that Ω is generated from Ω =
, where for each
set
. Clearly, this implies
which is the original Bernoulli model for Ω. Note that,
because of overlapping of Ω
Then,
where 11. Starting with
= 0 and for each
, we recursively define
where the sampling operator is defined as
Since Ω) implies Ω
), for each
, we prove (in Lemma 1) Condition 1 of Proposition 1 using sample set ˜Ω in (12).
Before we validate Condition 2 in Proposition 1 using the Y constructed above, we claim the following results to hold with high probability. First, we borrow the following definitions of weighted infinity norms for a matrix
where denote the i-th row and j-th column of Z, respectively.
Lemma 2 bounds the spectral norm of the matrix () using the sample set ˜Ω.
be a fixed matrix. Let ˜Ω be a sample set in (12). Then, for any universal constant c > 1, we have
Next two results control the ) norms of the projection of a matrix after random sampling.
be a fixed matrix. Let ˜Ω be a sample set in (12). Then, for any universal constant c > 2, we have
be a fixed matrix. Let ˜Ω be a sample set in (12). Then, for any universal constant c > 3, we have
We now validate Condition 2 in Proposition 1 using the Y constructed above.
We set ∆. Then, from definition of
We used ). Using the independence of ∆
We can bound this by recursively applying Lemma 1 with Ω
The above result fails with probability at most (; thus, total probability of failure is at most 11(
By definition, Y can be written as
It follows that,
We apply Lemma 2 to each summand in the above inequality, with corresponding Ω, to obtain
We can derive the following, applying Lemma 4 k times, with Ω
holding with failure probability at most Similarly, applying Lemma 3 and Lemma 4 recursively, with Ω
, we can derive,
holding with failure probability at most . Using (15) and (16), it follows from (14),
We note that, for all (i, j),
n Thus,
Therefore,
by setting
Let, the total number of sampled entries be s. Expected number of observed entries required to solve (1) is Summing up the individual failure probabilities of Proposition 1 the total failure probability never exceeds 33
for sufficiently large c > 3.
Finally, we can apply Hoeffding’s inequality to show that s is sharply concentrated around its expectation, i.e., )) with probability at least 1
, for sufficiently large c > 3.
This completes the proof of Theorem 1.
We closely follow the proof given by [5]. We pick each row of M with some probability p and observe all the entries of this sampled row. Let Γ ] be the set of indices of the row picked, and
) be a matrix obtained from X by zeroing out the rows outside Γ. Recall, SVD of M is
. We use the following lemma (Lemma 14 of [5]).
for some universal constant
. Then, for any universal constant
Now, 2 implies that
) is invertible and
. Using SVD of M, we can write
, and this has full rank-
. Therefore,
have the same row space, and we conclude that ˜
]. Thus, using the sample set Ω in Algorithm 1 we can recover M exactly via nuclear norm minimization in (1), with high probability. Expected number of entries observed in Algorithm 1 is
where, as in (4). We apply standard Hoeffding inequality to bound the actual sample size, and Theorem 2 follows as a corollary of Theorem 1.
Let M be the low-rank target matrix with rank-. We want to show that any perturbation Z to M, such that, M + Z is a solution to (1), strictly increases the nuclear norm, unless Z = 0. Now, M + Z is feasible only if
), which implies
e.g., Z is in the null space of
operator. We can choose
such that [
[
] are unitary matrices for which
. Then it follows from standard inequality of trace norm, for some Y in the range of
Above, (a) follows from Von-Neumann trace inequality, and (b) follows from Lemma 6. Using max, and the conditions in Proposition 1, we derive the final in- equality. Note that, Condition 1 in Proposition 1 implies
is the identity operator on the elements of subspace T, therefore
) = 0 implies Z = 0.
The following lemma is similar to Lemma 13 of [5].
Proof: Let us define the operator
where we use
Combining (17) and (18), and using
The result follows.
Here we prove Lemmas 1 through 5 using the matrix Bernstein inequality of Lemma 8 as the main tool. Also, we frequently use the fact in (20) and the result in Lemma 7. Note that self-adjoint linear operator. Thus we can write the following for any
We can derive, for all i and j,
Also, we know for all i, j,
Also, The last inequality follows because 0
Lemma 8 ([14], [Theorem 16] of [5]) Let be independent, zero-mean random matrices. Suppose
We consider sampling probabilities of the form (10) to prove Lemmas 1 through 4.
Notation Overloading: For simplicity, we reuse some of the notations in Section 7.1 through 7.4. Specifically, we replace ˜Ω by Ω to denote a sample set in (12), and,
7.1 Proof of Lemma 1
For any matrix , we can write
Using )] = 0 for any Z. Thus, we conclude that
] = 0. Also,
’s are independent of each other. Using probabilities in (10) (
’s vanish when
= 1, for all Z and (i, j)), and (20), we derive
From definition of operator norm, . Also, we derive
We apply Matrix Bernstein inequality in Lemma 8 using
to obtain, for any
7.2 Proof of Lemma 2
We can write the matrix (as sum of independent matrices:
We note that, ’s are zero matrix when
= 1, for all (i, j). We have
. Moreover,
Clearly, when = 1 the above quantities are zero. Using
in (10), and Lemma 7, we have
Using in (10), and noting that
The lemma follows from Matrix Bernstein inequality in Lemma 8, with
7.3 Proof of Lemma 3
Let,
Clearly, ] = 0. We need bounds on
to apply Matrix Bernstein inequality. First, we need to bound
Above we use triangle inequality and definition of . Note that,
is a zero vector when
= 1, for all (i, j). Otherwise, for
= 1, we consider two cases. Using bounds in (22), we have for j = b,
Using
. Combining these two inequalities, we have
For
The above quantity is zero for = 1. Otherwise, for
= 1, we consider two cases. For j = b, using (22) we have,
Using in (10), we have,
where we use the following bound in the second inequality. For all (
For
We set
Similarly, we can bound
by the same quantity. We take a union bound over all rows a and all columns b (i.e., total (m + n) events) to obtain, for any c > 2,
holding with probability at least 1
7.4 Proof of Lemma 4
Let, . We write rescaled (a, b)-th element of X as
This is a sum of independent, zero-mean random variables. we seek to bound First, we need to bound
For
Similarly, for
For
Above we use
The above quantity is zero for = 1. We bound the above considering four cases for
For
Above we use1, for all i and j.
For
We now apply Bernstein inequality in Lemma 8 to obtain, for any
We take union bound over all (a, b) (i.e., total events) to conclude that the above result holds with probability at least
7.5 Proof of Lemma 5
Let ) is the indicator function. We can write,
We apply the matrix Bernstein inequality in Lemma 8 to derive the result.
Acknowledgment: I thank Prof Petros Drineas and Prof Malik Magdon-Ismail for helpful discussions on this topic. Prof Drineas encouraged me to work on this problem.
[1] E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? In Journal of the ACM, page 58(3):11, 2011.
[2] E. J. Candes and B. Recht. Exact matrix completion via convex optimization. In Foundations of Computational Mathematics, pages 717–772, 9(6), 2009.
[3] E. J. Candes and T. Tao. The power of convex relaxation: Near-optimal matrix completion. In IEEE Transactions on Information Theory, pages 2053–2080, 56(5), 2010.
[4] S. Chatterjee and A. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, pages 1(3):379–393, 1986.
[5] Y Chen, S Bhojanapalli, S Sanghavi, and R Ward. Completing Any Low-rank Matrix, Provably. http: // arxiv. org/ abs/ 1306. 2979 , 2014.
[6] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decomposi- tions. In SIAM Journal on Matrix Analysis and Applications, pages 844–881, 30(2), 2008.
[7] M. Fazel. Matrix rank minimization with applications. In PhD thesis, Stanford University, 2002.
[8] E. Gabrilovich and S. Markovitch. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of International Conference on Machine Learning, 2004.
[9] D. Gross. Recovering low-rank matrices from few coefficients in any basis. In IEEE Transactions on Information Theory, pages 1548–1566, 57(3), 2011.
[10] A. Krishnamurthy and A. Singh. Low-rank matrix and tensor completion via adaptive sampling. In Advances in Neural Information Processing Systems 26, pages 836–844. 2013.
[11] M.W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. In Proceedings of the National Academy of Sciences, pages 697–702, 106 (3), 2009.
[12] B.K. Natarajan. Sparse approximate solutions to linear systems. pages 227–234, 24, 1995.
[13] B. Recht. A simpler approach to matrix completion. In The Journal of Machine Learning Research, pages 3413–3430, 12, 2011.
[14] J. Tropp. User-friendly tail bounds for sums of random matrices. In Foundations of Computational Mathematics, pages 12(4):389–434, 2012.