A range of machine learning problems such as link prediction in graphs containing community structure (Richard et al., 2014), phase retrieval (Candès et al., 2013), subspace clustering (Wang et al., 2013) or dictionary learning for sparse coding (Mairal et al., 2010) amount to solve sparse matrix factorization problems, i.e., to infer a low-rank matrix that can be factorized as the product of two sparse matrices with few columns (left factor) and few rows (right factor). Such a factorization allows for more efficient storage, faster computation, more interpretable solutions, and, last but not least, it leads to more accurate estimates in many situations. In the case of interaction networks for example, the assumption that the network is organized as a collection of highly connected communities which can overlap implies that the adjacency matrix admits such a factorization. More generally, considering sparse low-rank matrices combines two natural forms of sparsity, in the spectrum and in the support, which can be motivated by the need to explain systems behaviors by a superposition of latent processes which only involve a few parameters. Landmark applications of sparse matrix factorization are sparse principal components analysis (SPCA, d’Aspremont et al., 2007; Zou et al., 2006) or sparse canonical correlation analysis (SCCA, Witten et al., 2009), which are widely used to analyze high-dimensional data such as genomic data. From a computational point of view, however, sparse matrix factorization is challenging since it typically leads to non-convex, NP-hard problems (Moghaddam et al., 2006). For instance, Berthet and Rigollet (2013) noted that solving sparse PCA with a single component is equivalent to the planted clique
problem (Jerrum, 1992), a notoriously hard problem when the size of the support is smaller than the square root of size of the matrix. Many heuristics and relaxations have therefore been proposed, with and without theoretical guaranties, to approximatively solve the problems leading to sparse low-rank matrices. A popular procedure is to alternatively optimize over the left and right factors in the factorization, formulating each step as a convex optimization problem (Lee et al., 2007; Mairal et al., 2010). Despite these worst case computational hardness, simple generalizations of the power method have been proposed by Journée et al. (2010); Luss and Teboulle (2013); Yuan and Zhang (2013) for the sparse PCA problem with a single component. These algorithms perform well empirically and have been proved to be efficient theoretically under mild conditions by Yuan and Zhang (2013). Several semidefinite programming (SDP) convex relaxations of the same problem have also been proposed (Amini and Wainwright, 2009; d’Aspremont et al., 2007, 2008). Based on the rank one approximate solutions, computing multiple principal components of the data is commonly done though successive deflations (Mackey, 2009) of the input matrix.
Recently, several authors have investigated the possibility to formulate sparse matrix factorization as a convex optimization problem. Bach et al. (2008) showed that the convex relaxation of a number of natural sparse factorization are too coarse too succeed, while Bach (2013) investigated several convex formulations involving nuclear norms (Jameson, 1987), similar to the ones we investigate in this paper, and their SDP relaxations. Several authors also investigated the performance of regularizing a convex loss with linear combinations of the norm and the trace norm, naturally leading to a matrix which is both sparse and low-rank (Doan and Vavasis, 2013; Oymak et al., 2012; Richard et al., 2012, 2013, 2014). This penalty term can be related to the SDP relaxations of d’Aspremont et al. (2007, 2008) that penalize the trace and the element-wise
norm of the positive semi-definite unknown. The statistical performance of these basic combinations of the two convex criteria has however been questioned by Krauthgamer et al. (2013); Oymak et al. (2012). Oymak et al. (2012) showed that for compressed sensing applications, no convex combination of the two norms improves over each norm taken alone. Krauthgamer et al. (2013) prove that the SDP relaxations fail at finding the sparse principal component outside the favorable regime where a simple diagonal thresholding algorithm (Amini and Wainwright, 2009) works. Moreover, these existing convex formulations either aim at finding only a rank one matrix, or a low rank matrix whose factors themselves are not necessarily guaranteed to be sparse.
In this work, we propose two new matrix norms which, when used as regularizer for various optimization problems, do yield estimates for low-rank matrices with multiple sparse factors that are provably more efficient statistically than the and trace norms. The price to pay for this statistical efficiency is that, although convex, the resulting optimization problems are NP-hard, and we must resort to heuristic procedures to solve them. Our numerical experiments however confirm that we obtain the desired theoretical gain to estimate low-rank sparse matrices.
1.1 Contributions and organization of the paper
More precisely, our contributions are:
• Two new matrix norms (Section 2). In order to properly define matrix factorization, given sparsity levels of the factors denoted by integers k and q, we first introduce in Section 2.1 the (k, q)-rank of a matrix as the minimum number of left and right factors, having respectively k and q nonzeros, required to reconstruct a matrix. This index is a more involved complexity measure for matrices than the rank in that it conditions on the number of nonzero elements of the left and right factors of a matrix. Using this index, we propose in Section 2.2 two new atomic norms for matrices (Chandrasekaran et al., 2012). (i) Considering the convex hull unit
operator norm matrices with (k, q)-rank = 1, we build a convex surrogate to low (k, q)-rank matrix estimation problem. (ii) We introduce a polyhedral norm built upon (k, q)-rank = 1 matrices with all non-zero entries of absolute value equal to 1. We provide in Section 2.3 an equivalent characterization of the norms as nuclear norms, in the sense of Jameson (1987), highlighting in particular a link to the k-support norm of Argyriou et al. (2012).
• Using these norms to estimate sparse low-rank matrices (Section 3). We show how several problems such as bilinear regression or sparse PCA can be formulated as convex optimization problems with our new norms, and clarify that the resulting problems can however be NP-hard.
• Statistical Analysis (Section 4). We study the statistical performance of the new norms and compare them with existing penalties. Our analysis goes first in Section 4.1 using slow rate type of upper bounds on the denoising error, which despite sub-optimality gives a first insight on the gap between the statistical performance of our (k, q)-trace norm and that of the and trace norms. Next we show in Section 4.2, using cone inclusions and estimates of statistical dimension, that our norms are superior to any convex combination of the trace norm and the
norm in a number of different tasks. However, our analysis also shows that the factors gained over the rivals to estimate sparse low-rank matrices vanishes when we use our norm to estimate sparse vectors.
• A working set algorithm (Section 5). While in the vector case the computation remains feasible in polynomial time, the norms we introduce for matrices can not be evaluated in polynomial time. We propose algorithmic schemes to approximately learn with the new norms. The same norms and meta-algorithms can be used as a regularizer in supervised problems such as bilinear and quadratic regression. Our algorithmic contribution does not consist in providing more efficient solutions to the rank-1 SPCA problem, but to combine atoms found by the rank-1 solvers in a principled way.
• Numerical experiments (Section 6). We numerically evaluate the performance of our new norms on simulated data, and confirm the theoretical results. While our theoretical analysis only focuses on the estimation of sparse matrices with (k, q)-rank one, our simulations allow us to conjecture that the statistical dimension scales linearly with the (k, q)-rank and decays with the overlap between blocks. We also show that our model is competitive with the state-of-the-art on the problem of sparse PCA.
Due to their length and technicality, all proofs are postponed to the appendices.
1.2 Notations
For any integers is the set of integers from
denotes the set of subsets of k indices in [1, p]. For a vector
is the number of non-zero coefficients in
is its Euclidean norm,
is its
norm and supp
is its support, i.e., the set of indices of the nonzero entries of w. For any
is the vector that is equal to w on I, and has 0 entries elsewhere. Given matrices A and B of the same size,
is the standard inner product of matrices. For any matrix
the notations
stand respectively for the number of nonzeros, entry-wise
norms, the standard
Frobenius) norm, the trace-norm (or nuclear norm, the sum of the singular values), the operator norm (the largest singular value) and the rank of Z, while supp
is the support of Z, i.e., the set of indices of nonzero elements of Z. When dealing with a matrix Z whose nonzero elements form a block of size
takes the form
a matrix Z and two subsets of indices
is the matrix having the same entries as Z inside the index subset
entries outside. This notation should not be confused with the notation
which we will sometimes use to denote a general matrix with support contained in
In this section we propose two new matrix norms allowing to formulate various sparse matrix factorization problems as convex optimization problems. We start by defining the (k, q)-rank of a matrix in Section 2.1, a useful generalization of the rank which also quantifies the sparseness of a matrix factorization. We then introduce two atomic norms defined as tight convex relaxations of the (k, q)-rank in Section 2.2: the (k, q)-trace norm, obtained by relaxing the (k, q)-rank over the operator norm ball, and the (k, q)-CUT norm, obtained by a similar construction with extra-constraints on the element-wise of factors. In Section 2.3 we relate these matrix norms to vector norms using the concept of nuclear norms, establishing in particular a connection of the (k, q)-trace norm for matrices with the k-support norm of Argyriou et al. (2012), and the (k, q)-CUT norm to the vector k-norm, defined as the sum of the k largest components in absolute value of a vector (Bhatia, 1997, Exercise II.1.15).
2.1 The (k, q)-rank of a matrix
The rank of a matrix is the minimum number of rank-1 matrices (i.e., outer products of vectors of the form
) needed to express Z as a linear combination of the form
. It is a versatile concept in linear algebra, central in particular to solve matrix factorization problems and low-rank approximations. The following definition generalizes this notion to incorporate constraints on the sparseness of the rank-1 elements:
Definition 1 ((k, q)-rank) For a matrix , we define its (k, q)-rank as the optimal value of the optimization problem:
where for any is the set of n- dimensional unit vectors with at most j non-zero components.
When , we recover the usual notion of rank of a matrix, and a particular solution to (1) is provided by the SVD, for which the vectors
form each a collection of orthonormal vectors.
In general, however, the (k, q)-rank does not share several important properties of the usual rank, as the following proposition shows:
1. The (k, q)-rank of a matrix can be strictly larger than
2. There might be no solution of (1) such that form a collection of orthonormal vectors.
For k = q = 1, the (1, 1)-SVD decomposes Z as a sum of matrices with only one non-zero element, showing that , we deduce from the expression of the (k, q)-rank as the optimal value of (1) that the following tight inequalities hold:
The (k, q)-rank is useful to formulate problems in which a matrix should be modeled as or approximated by a matrix with sparse low rank factors, with the assumption that the sparsity level of the factors is fixed and known. For example, the standard rank-1 SPCA problem consists in finding the symmetric matrix with (k, k)-rank equal to 1 and providing the best approximation of the sample covariance matrix (Zou et al., 2006).
2.2 Two convex relaxations for the (k, q)-rank
The (k, q)-rank is obviously a discrete, nonconvex index, like the rank or the cardinality, leading to computational difficulties when one wants to estimate matrices with small (k, q)-rank. In this section, we propose two convex relaxations of the (k, q)-rank aimed at mitigating these difficulties. They are both instances of the atomic norms introduced by Chandrasekaran et al. (2012), which we first review.
Definition 3 (Atomic norm) Given a centrally symmetric compact subset of elements called atoms, the atomic norm induced by
is the gauge function
, defined by
Chandrasekaran et al. (2012) show that the atomic norm induced by A is indeed a norm, which can be rewritten as
and whose dual norm satisfies
We can now define our first convex relaxation of the (k, q)-rank:
Definition 4 ((k, q)-trace norm) For a matrix -trace norm
atomic norm induced by the set of atoms:
In words, is the set of matrices
Plugging (5) into (3), we obtain an equivalent definition of the (k, q)-trace norm as the optimal value of the following optimization problem:
Comparing (6) to (1) shows that the (k, q)-trace norm is derived from the (k, q)-rank by replacing the non-convex pseudo-norm of c by its convex
norm in the optimization problem. In particular, in the case
-trace norm is the usual trace norm (equal to the
of singular values), i.e. the usual relaxation of the rank (which is the
-norm of the singular values). Similarly, when k = q = 1, the (k, q)-trace norm is simply the
norm. Just like the (k, q)-rank interpolates between the
pseudo-norm and the rank, the (k, q)-trace norm interpolates between the
norm and the trace norm. Indeed, since
, we deduce from the expression of
as the optimal value of (6) that the following tight inequalities hold for any
In the case of the trace norm, the optimal decomposition solving (6) is unique and is in fact the singular value decomposition of the matrix being respectively the left and right singular vectors and
the singular values. This suggest that we can use the (k, q)-trace norm to generalize the definition of the SVD to sparse SVDs as follows
Definition 5 ((k, q)-SVD) For a matrix -sparse singular value decomposition (or (k, q)-SVD) any decomposition
that solves (6) with
. In such a decomposition, we refer to vectors
as a set of left and right (k, q)-sparse singular vectors of
as the corresponding collection of (k, q)-sparse singular values.
Without surprise, the (k, q)-SVD does not share a number of usual properties of the SVD, when
2. The (k, q)-SVDs do not necessarily solve (1): the number of non-zero (k, q)-sparse singular values of a matrix can be strictly larger than its (k, q)-rank.
3. The (k, q)-sparse left or right singular vectors are not necessarily orthogonal to each other.
In addition to (6), the next lemma provides another explicit formulation for the (k, q)-trace norm, its dual and its sub differential:
Our second norm is again an atomic norm, but is obtained by focusing on a more restricted set of atoms. It is motivated by applications where we want to estimate matrices which, in addition to being sparse and low-rank, are constant over blocks, such as adjacency matrices of graphs with non-overlapping communities. For that purpose, consider first the subset of made of vectors whose nonzero entries are all equal in absolute value:
We can then define our second convex relaxation of the (k, q)-rank:
Definition 8 ((k, q)-CUT norm) We define the as the atomic norm induced by the set of atoms
In other words, the atoms in are the atoms of
whose nonzero elements all have the same amplitude.
Our choice of terminology is motivated by the following relation of our norm to the CUT-polytope: in the case , the unit ball of
coincides (up to a scaling factor of
the polytope known as the CUT polytope of the complete graph on n vertices (Deza and Laurent, 1997), defined by
The norm obtained as the gauge of the CUT polytope is therefore to the trace norm as
2.3 Equivalent nuclear norms built upon vector norms
In this section we show that the (k, q)-trace norm (Definition 4) and the (k, q)-CUT norm (Defini-tion 8), which we defined as atomic norms induced by specific atom sets, can alternatively be seen as instances of nuclear norms considered by Jameson (1987). For that purpose it is useful to recall the general definition of nuclear norms and the characterization of the corresponding dual norms as formulated in Jameson (1987, Propositions 1.9 and 1.11):
denote any vector norms on
respectively, then
where the infimum is taken over all summations of finite length, is a norm over nuclear norm induced by
. Its dual is given by
The following lemma shows that the nuclear norm induced by two atomic norms is itself an atomic norm.
are two atomic norms on
induced respectively by two atom sets
, then the nuclear norm on
is an atomic norm induced by the atom set:
Figure 1: Unit balls of 3 norms of interest for vectors of materialized by their sets of extreme points at which the norm is non-differentiable. Each unit ball is the convex hull of the corresponding sets. In green, the usual
-norm scaled by the factor
, in blue the norm
2-support norm), in red the norm
(see theorem 11). Vertices of the
unit ball constitute the
set (see definition 8). The set
belongs to the unit spheres of all three norms (see proposition 18).
We can deduce from it that the (k, q)-trace norm and (k, q)-CUT are nuclear norms, associated to particular vector norms:
Theorem 11 1. The (k, q)-trace norm is the nuclear norm induced by , where for any
-support norm introduced by Argyriou et al. (2012).
For the sake of completeness, let us recall the closed-form expression of the k-support norm by Argyriou et al. (2012). For any vector
be the vector obtained by sorting the entries of w by decreasing order of absolute values. Then it holds that
where is the unique integer such that
where by convention
Of course, Theorem 11 implies that in the vector case (-trace norm is simply equal to
-CUT norm is equal to
. A representation of the “sharp edges" of unit balls of
and a appropriately scaled
norm can be found in Figure 1 for the case
and k = 2. In addition, the following results shows that the dual norms of
have simple explicit forms:
To conclude this section, let us observe that nuclear norms provide a natural framework to construct matrix norms from vector norms, and that other choices beyond may lead to interesting norms for sparse matrix factorization. It is however known since Jameson (1987) (see also Bach, 2013; Bach et al., 2012) that the nuclear norm induced by vector
-norm is simply the
matrix which fails to induce low rank (except in the very sparse case). However Bach et al. (2012) proposed nuclear norms associated with vectors norms that are similar to the elastic net penalty.
In this section, we briefly discuss how the (k, q)-trace norm and (k, q)-CUT norm can be used to attack various problems involving estimation of sparse low-rank matrices.
3.1 Denoising
Suppose is a noisy observation of a low-rank matrix with sparse factors, assumed to have low (k, q)-rank. A natural convex formulation to recover the noiseless matrix is to solve:
where is a parameter to be tuned. Note that in the limit when
, one simply obtains a (k, q)-SVD of X.
3.2 Bilinear regression
More generally, given some empirical risk L(Z), it is natural to consider formulations of the form
to learn matrices that are a priori assumed to have a low (k, q)-rank. A particular example is bilinear regression, where, given two inputs , one observes as output a noisy version of
. Assuming that Z has low (k, q)-rank means that the noiseless response is a sum of a small number of terms, each involving only a small number of features from either of the input vectors. To estimate such a model from observations
, one can consider the following convex formulation:
where is a loss function. A particular instance of (16) of interest is the quadratic regression problem, where
Quadratic regression combined with additional constraints on Z is closely related to phase retrieval (Candès et al., 2013). It should be noted that if
is the least-square loss, (16) can be rewritten in the form
where X(Z) is a linear transformation of Z, so that the problem is from the point of view of the parameter Z a linear regression with a well chosen feature map.
3.3 Subspace clustering
In subspace clustering, one assumes that the data can be clustered in such a way that the points in each cluster belong to a low dimensional space. If we have a design matrix row corresponding to an observation, then the previous assumption means that if
a matrix formed by the rows of cluster j, there exist a low rank matrix
. This means that there exists a block-diagonal matrix Z such that ZX = X with low-rank diagonal blocks. This idea, exploited recently by Wang et al. (2013) implies that Z is a sum of low rank sparse matrices; and this property still holds if the clustering is unknown. We therefore suggest that if all subspaces are of dimension k, Z may be estimated via
3.4 Sparse PCA
In sparse PCA (d’Aspremont et al., 2007; Witten et al., 2009; Zou et al., 2006), one tries to approximate an empirical covariance matrix by a low-rank matrix with sparse factors. Although this is similar to the denoising problem discussed in Section 3.1, one may wish in addition that the estimated sparse low-rank matrix be symmetric and positive semi-definite (PSD), in order to represent a plausible covariance matrix. This suggests to formulate sparse PCA as follows:
where k is the maximum number of non-zero coefficient allowed in each principal direction. In contrast to sequential approaches that estimate the principal components one by one (Mackey, 2009), this formulation requires to find simultaneously a set of factors which are complementary to one another in order to explain as much variance as possible. A natural convex relaxation of (17) is
where is a parameter that controls in particular the rank of the approximation.
However, although the solution of (18) is always PSD, its (k, k)-SVD leading to may not be composed of symmetric matrices (if
, and even if
the corresponding
may be negative, as the following proposition shows:
Proposition 13 1. There might be no decomposition of a PSD matrix attaining its (k, q)-rank (i.e. no solution of (1)) which decomposes it as a sum of symmetric terms.
2. The (k, k)-SVD of a PSD matrix is itself not necessarily a sum of symmetric terms.
3. Some PSD matrices cannot be written as a positive combination of rank one (k, k)-sparse matrices, even for k > 1.
This may be unappealing, as one would like to interpret the successive rank-1 matrices as covariance matrices over a subspace that explain some of the total variance. One may therefore prefer a decomposition with less sparse or more factors, potentially capturing less variance.
One solution is to replace ) by another penalty which directly imposes symmetric factors with non-negative weights. This is easily obtained by replacing the set of atoms
in Definition 4 by
, and considering the corresponding atomic norm which we denote by
To be precise,
is not a norm but only a gauge because the set
is not centrally symmetric. Instead of (18), it possible to use the following convex formulation of sparse PCA:
By construction, the solution of (19) is not only PSD, but can be expanded as a sum of matrices , where for all i = 1, . . . , r, the factor
-sparse and the coefficient
is positive. This formulation is therefore particularly relevant if
is believed to be a noisy matrix of this form. It should be noted however that, by Proposition 13,
is infinite for some PSD matrices
implies that some PSD matrices cannot be approximated well with this formulation.
3.5 NP-hard convex problems
Although the (k, q)-trace norm and related norms allow us to formulate several problems of sparse low-rank matrix estimation as convex optimization problems, it should be pointed out that this does not guarantee the existence of efficient computational procedures to solve them. Here we illustrate this with the special case of the best (k, q)-sparse and rank 1 approximation to a matrix, which turns out to be a NP-hard problem. Indeed, let us consider the three following optimization problems, which are equivalent since they return the same rank one subspace spanned by
In particular, if is an empirical covariance matrix, then the symmetric solutions of the problem considered are the solution to the following rank 1 SPCA problem
which it is known to be NP-hard (Moghaddam et al., 2008). This shows that, in spite of being a convex formulation involving the (k, q)-trace norm, the third formulation in (20) is actually NPhard. In practice, we will propose heuristics in Section 6 to approximate the solution of convex optimization problems involving the (k, q)-trace norm.
In this section we study theoretically the benefits of using the new penalties low-rank matrices with sparse factors, as suggested in Section 3, postponing the discussion of how to do it in practice to Section 5. Building upon techniques proposed recently to analyze the statistical properties of sparsity-inducing penalties, such as the
penalty or more general atomic norms, we investigate two approaches to derive statistical guarantees. In Section 4.1 we study the expected dual norm of some noise process, from which we can deduce upper bounds on the learning rate for least squares regression and a simple denoising task. In Section 4.2 we estimate the statistical dimension of objects of interest both in the matrix and vector cases and compare the asymptotic rates, which shed light on the power of the norms we study when used as convex penalties. The results in Section 4.1 are technically easier to derive and contain bounds for a matrix of arbitrary (k, q)-rank. The results provided in Section 4.2 rely on a more involved set of tools, they provide more powerful bounds but we do not derive results for matrices of arbitrary (k, q)-rank.
4.1 Performance of the (k, q)-trace norm in denoising
In this Section we consider the simple denoising setting (Section 3.1) where we wish to recover a low-rank matrix with sparse factors from a noisy observation
by additive Gaussian noise:
where is a random matrix with entries i.i.d. from N(0, 1). Given a convex penalty
, we consider, for any
, the estimator
The following result, valid for any norm , provides a general control of the estimation error in this setting, involving the dual norm of the noise:
Lemma 14
This suggests to study the dual norm of a random noise matrix in order to derive a upper bound on the estimation error. The following result provides such upper bounds, in expectation, for the (k, q)-trace norm as well as the standard
and trace norms:
be a random matrix with entries i.i.d. from N(0, 1). The expected dual norm of G for the (k, q)-trace norm, the
norm and the trace norm is respectively bounded by:
To derive an upper bound in estimation errors from these inequalities, we consider for simplicitythe oracle estimate
. From Lemma 14 we immediately get the following control of the mean estimation error of the oracle estimator, for any penalty
We can now derive upper bounds in estimation errors for the different penalties in the so-called single spike model, where the signal consists of an atom
, and we observed a noisy matrix
. Since for an atom
, we immediately get the following by plugging the upper bounds of Proposition 15 into (23):
Table 1: Various norms mean square error in denoising an atom corrupted with unit variance Gaussian noise. The column “
” corresponds to the order of magnitudes in the regime where
is an atom, the expected errors of the oracle estimators using respectively the (k, q)-trace norm, the
norm and the trace norm are respectively upper bounded by:
Remark 17 It is straightforward to see that if in the latter Corollary 16 the matrix is the convex combination of r > 1 atoms, a factor
appears in the upper bounds. This suggests a (sub-)linear dependence of the denoising error in the (k, q)-rank.
To make the comparison easy, orders of magnitudes of these upper bounds are gathered in Table 1 for the case where , and for the case where
. In the later case, we see in particular that the (k, q)-trace norm has a better rate than the
and trace norms, in
instead of
(up to logarithmic terms). Note that the largest value of
is reached when
and equals
. By contrast, when
gets far from
elements then
the expected error norm diminishes for the -penalized denoiser
on
while not changing for the two other norms.
Obviously the comparison of upper bounds is not enough to conclude to the superiority of (k, q)-trace norm and, admittedly, the problem of denoising considered here is a special instance of linear regression in which the design matrix is the identity, and, since this is a case in which the design is trivially incoherent, it is possible to obtain fast rates for decomposable norms such as the or trace norm (Negahban et al., 2012); however, slow rates are still valid in the presence of an incoherent design, or when the signal to recover is only weakly sparse, which is not the case for the fast rates. Moreover, the result proved here is valid for matrices of rank greater than 1. We present in the next section more involved results, based on lower and upper bounds on the so-called statistical dimension of the different norms (Amelunxen et al., 2013), a measure which is closely related to Gaussian widths.
4.2 Performance through the statistical dimension
Powerful results from asymptotic geometry have recently been used by Amelunxen et al. (2013); Chandrasekaran et al. (2012); Foygel and Mackey (2014); Oymak et al. (2013) to quantify the statistical power of a convex nonsmooth regularizer used as a constraint or penalty. These results rely essentially on the fact that if the tangent coneof the regularizer at a point of interest Z is thiner, then the regularizer is more efficient at solving problems of denoising, demixing and compressed sensing of Z. The gain in efficiency can be quantified by appropriate measures of width of the tangent cone such as the Gaussian width of its intersection with a unit Euclidean ball (Chandrasekaran et al., 2012), or the closely related concept of statistical dimension of the cone, proposed by Amelunxen et al. (2013). In this section, we study the statistical dimensions induced by different matrix norms in order to compare their theoretical properties for exact or approximate recovery of sparse low-rank matrices. In particular, we will consider the norms
and linear combinations of the
and trace norms, which have been used in the literature to infer sparse low-rank matrices (Oymak et al., 2012; Richard et al., 2012). For convenience we therefore introduce the notation
for the norm that linearly interpolates between the trace norm and the (scaled)
norm:
so that is the trace norm and
norm up to a constant
4.2.1 The statistical dimension and its properties
Let us first briefly recall what the statistical dimension of a convex regularizer refers to, and how it is related to efficiency of the regularizer to recover a matrix
that purpose, we first define the tangent cone
as the closure of the cone of descent directions, i.e.,
The statistical dimension can then be formally defined as
where G is a random matrix with i.i.d. standard normal entries and is the orthogonal projection of G onto the cone
. The statistical dimension is a powerful tool to quantify the statistical performance of a regularizer in various contexts, as the following non-exhaustive list of results shows.
• Robust recovery with random measurements. Suppose we observe where X is again a random linear map, and in addition the observation is corrupted by a
random noise . If the noise is bounded as
Chandrasekaran et al. (2012, Corollary 3.3) show that
• Denoising. Assume a collection of noisy observations available where
entries, and let
denote their average. Chandrasekaran and Jordan (2013, Proposition 4) prove that
• Demixing. Given two matrices , suppose we observe
where
is a random orthogonal operator. Given two convex functions
Amelunxen et al. (2013, Theorem III) show that
Conversely if , the demixing fails with proba- bility at least
4.2.2 Some cone inclusions and their consequences
In this and subsequent sections, we wish to compare the behavior of , as defined in (25). Before estimating and comparing the statistical dimensions of these norms, which requires rather technical proofs, let us first show through simple geometric arguments that for a number of matrices, the tangent cones of the different norms are actually nested. This will allow us to derive deterministic improvement in performance when a norm is used as regularizer instead of another, which should be contrasted with the kind of guarantees that will be derived from bounds on the statistical dimension and which are typically statements holding with very high probability. The results in this section are proved in Appendix C.
Put informally, the unit balls of and of all convex combinations of the trace norm and the scaled
-norm are nested and meet for matrices in
. This property is illustrated in the vector case (for
) on Figure 1. In fact
is a subset of the extreme points of the unit norms of all those norms except for the scaled
-norm (corresponding to the case
). Given that the unit balls meet on
and are nested, their tangent cones on
must also be nested:
As reviewed in Section 4.2.1, statistical dimensions provide estimates for the performance of the different norms in different contexts. Plugging (32) in these results shows that to estimate an atom in is at least as good as using
which itself is at least as good as using any convex combination of the
and trace norms.
Note that the various statements in Section 4.2.1 provide upper bounds on the performance of the different norms, with are guarantees that are either probabilistic or hold in expectation. In fact, the inclusion of the tangent cones (31) and a fortiori the tangential inclusion of the unit balls imply much stronger results since it can also lead some deterministic statements, such as the following:
Corollary 20 (Improvement in exact recovery) Consider the problem of exact recovery of a matrix from random measurements
by solving (28) with the different norms. For any realization of the random measurements, exact recovery with
exact recovery with
which itself implies exact recovery with
Note that in the vector case (), where the (k, q)-trace norm
boils down to the k-support norm
, the tangent cone inclusion (31) is not always strict:
Proposition 21
In words, the tangent cone of the norm and of the the k-support norm are equal on k-sparse vectors with constant non-zero entries, which can be observed in Figure 1. This suggests that, in the vector case, the k-support norm is not better than the
norm to recover such constant sparse k-vectors.
4.2.3 Bounds on the statistical dimensions
The results presented in Section 4.2.2 apply only to a very specific set of matrices ( not characterize quantitatively the relative performance of the different norms. In this Section, we turn to more explicit estimations of the statistical dimension of the different norms at atoms in
and
We consider first the statistical dimension of the on its atoms
unit ball of
is a vertex-transitive polytope with
vertices. As a consequence, it follows immediately from Corollary 3.14 in Chandrasekaran et al. (2012) and from the upper bound
Table 2: Order of magnitude of the statistical dimension of different matrix norms for elements of (left) and of their vector norms counterpart for elements of
(right). The
norm here is the element-wise
norm. The column “
” corresponds to the case of the planted clique problem where
. We use usual Landau notation with
. The absence of Landau notation means that the computation is exact.
Upper bounding the statistical dimension of the (k, q)-trace norm on its atoms requires more work. First, atoms with very small coefficients are likely to be more difficult to estimate than atoms with large coefficients only. In the vector case, for example, it is known that the recovery of a sparse vector
with support
depends on its smallest coefficient
The ratio between
and the noise level can be thought of as the worst signal-to-noise ratio for the signal
. We generalize this idea to atoms in
as follows.
Denote
. The atom strength
Note that the atoms with maximal strength value 1 are the elements of . With this notion in hand we can now formulate an upper bound on the statistical dimension of
Note that the upper bounds obtained on atoms of (Proposition 22) and
sition 24, with
) have the same rate up to k log k + q log q which is negligible compared to
. Note that once the support is specified, the number of degrees of freedom for elements of
, which is matched up to logarithmic terms. It is interesting to compare these estimates to the statistical dimension of the
norm, the trace norm, and their combinations
summarizes the main results. The statistical dimension the
norm on atoms in
is of order
, which is worse than the statistical dimensions of
by a factor
, though, the statistical dimension of
increases when the atom strength decreases, while the statistical dimension of the
independent of it and even decreases when the size of the support decreases. As for the trace norm alone, its statistical dimension is at least of order
, which is unsurprisingly much worse that the statistical dimensions of
since it does not exploit the sparsity of the atoms. Finally, regarding the combination
norm and of the trace norm, Oymak et al. (2012) has shown that it does not improve rates up to constants over the best of the two norms. More precisely, we can derive from Oymak et al. (2012, Theorem 3.2) the following result
Proposition 25 There exists M > 0 and C > 0 such that for any and
and for any
, the following holds:
Note that with equality if either
, so in particular
for
. In that case, we see that, as stated by Oymak et al. (2012),
does not bring any improvement over the
and trace norms taken imdividually, and in particular has a worse statistical dimension than
4.2.4 The vector case
We have seen in Section 4.2.3 that the statistical dimension of the (k, q)-trace norm and of the (k, q)-CUT norm were smaller than that of the and the trace norms, and of their combinations, meaning that theoretically they are more efficient regularizers to recover rank-one sparse matrices. In this section, we look more precisely at these properties in the vector case (
show that, surprisingly, the benefits are lost in this case.
Remember that, in the vector case, boils down to the k-support norm
down to the norm
). For the later, we can upper bound the statistical dimension at a k-sparse vector by specializing Proposition 22 to the vector case, and also derive a specific lower bound as follows:
From the explicit formulation of ) we can derive an upper bound of the statistical dimension of
on any sparse vector with at least k non-zero coefficients:
, the statistical dimension of the k-support norm
vector
is bounded by
where denotes the vector with the same entries as w sorted by decreasing absolute values, r is as defined in equation (14),
. In particular, when s = k, the following holds for any atom
with strength
We note that (35) has the same rate but tighter constants than the general upper bound (33) specialized to the vector case. In particular, this suggests that the ) may not be required. In the lasso case (k = 1), we recover the standard bound (Chandrasekaran et al., 2012):
which is also reached by on an atom
because in that case
). On the other hand, for general atoms in
the upper bound (35) is always worse than the upper bound for the standard Lasso (36), and more generally the upper bound for general sparse vectors (34) is also never better than the one for the Lasso. Although these are only upper bounds, this raises questions on the utility of the k-support norm compared to the lasso to recover sparse vectors.
The statistical complexities of the different regularizers in the vector case are summarized in Table 2. We note that, contrary to the low-rank sparse matrix case, the -support norm, and the norm
all have the same statistical dimension up to constants. Note that the tangent cone of the elastic net equals the tangent cone of the
-norm in any point (because the tangent cone of the
norm is a half space that always contains the tangent cone of the
-norm) so that the elastic net has always the exact same statistical dimension as the
As seen in Section 3, many problems involving sparse low-rank matrix estimation can be formulated as optimization problems of the form:
Unfortunately, although convex, this problem may be computationally challenging (Section 3.5). In this section, we present a working set algorithm to approximately solve such problems in practice when L is differentiable.
5.1 A working set algorithm
Given a set of pairs of row and column subsets, let us consider the optimization problem:
Let be a solution of this optimization problem. Then, by the characterization of
is the solution of (37) when
. Clearly, it is still the solution of (37) if S is reduced to the set of non-zero matrices
at optimality often called active components.
We propose to solve problem (37) using a so-called working set algorithm which solves a sequence of problems of the form () for a growing sequence of working sets S, so as to keep a small number of non-zero matrices
throughout. Working set algorithms (Bach et al., 2011, Chap. 6) are typically useful to speed up algorithm for sparsity inducing regularizer; they have been used notably in the case of the overlapping group Lasso of Jacob et al. (2009) which is also naturally formulated via latent components.
To derive the algorithm we write the optimality condition for (
From the characterization of the subdifferential of the trace norm (Watson, 1992), writing , this is equivalent to, for all (I, J) in S,
The principle of the working set algorithm is to solve problem () for the current set S so that (38) and (39) are (approximately) satisfied for (I, J) in S, and to check subsequently if there are any components not in S which violate (39). If not, this guarantees that we have found a solution to problem (37), otherwise the new pair (I, J) corresponding to the most violated constraint is added to S and problem (
) is initialized with the previous solution and solved again. The resulting algorithm is Algorithm 1 (where the routine SSVDTPI is described in the next section). Problem (
) is solved easily using the approximate block coordinate descent of Tseng and Yun (2009) (see also Bach et al., 2011, Chap. 4), which consists in iterating proximal operators. The modifications to the algorithm to solve problems regularized by the norm
are relatively minor (they amount to replace the trace norms by penalization of the trace of the matrices
and by positive definite cone constraints) and we therefore do not describe them here.
Determining efficiently which pair (I, J) possibly violates condition (39) is in contrast a more difficult problem that we discuss next.
5.2 Finding new active components
Once () is solved for a given set S, (38) and (39) are satisfied for all
. Note that (38) implies in particular that
at optimality. Therefore, (39) is also satisfied for all
if and only if
and if this is not the case then any (I, J) that violates this condition is a candidate to be included in S. This corresponds to solving the following sparse singular value problem
This problem is unfortunately NP-hard since rank 1 sparse PCA problem is a particular instance of it (when is replaced by a covariance matrix), and we therefore cannot hope to solve it exactly with efficient algorithms. Still, sparse PCA has been the object of a significant amount of research, and several relaxations and other heuristics have been proposed to solve it approximately. In our numerical experiments we use a truncated power iteration (TPI) method, also called TPower, GPower or CongradU in the PSD case (Journée et al., 2010; Luss and Teboulle, 2013; Yuan and Zhang, 2013), which has been proved recently by Yuan and Zhang (2013) to provide accurate solution in reasonable computational time under RIP type of conditions. Algorithm 2 provides a natural generalization of this algorithm to the non-PSD case. The algorithm follows the steps of a power method, the standard method for computing leading singular vectors of a matrix, with the difference that at each iteration a truncation step is use. We denote the truncation operator by
. It consists of keeping the k largest components (in absolute value) and setting the others to 0. Note that Algorithm 2 may fail to find a new active component for Algorithm 1 if it
finds a local maximum of ((k, q)-linRank-1) smaller than , and therefore result in the termination of Algorithm 1 on a suboptimal solution. On the positive side, note that Algorithm 1 is robust to some errors of Algorithm 2. For instance, if an incorrect component is added to S at some iteration, but the correct components are identified later, Algorithm 1 will eventually shrink the incorrect components to 0. One of the causes of failure of TPI type of methods is the presence of a large local maximum in the sparse PCA problem corresponding to a suboptimal component; incorporating this component in S will reduce the size of that local maximum, thereby increasing the chance of selecting a correct component the next time around.
5.3 Computational cost
Note that when are large, solving
involves the minimizations of trace norms of matrices of size
which, when k and q are small compared to
have low computational cost. The bottleneck for providing a computational complexity of the algorithm is the (k, q)-linRank-1 step. It has been proved by Yuan and Zhang (2013) that under some conditions the problem can be solved in linear time. If the conditions hold at every step of gradient, the overall cost of an iteration can be cast into the cost of evaluating the gradient and the evaluation of thin SVDs:
. Evaluating the gradient has a cost dependent on the risk function L. This cost for usual applications is
. So assuming the RIP conditions required by Yuan and Zhang (2013) hold, the cost of Algorithm 2 is dominated by matrix-vector multiplications so of the order
The total cost of the algorithm for reaching a
-accurate solution is therefore
However the worst case complexity of the algorithm is non-polynomial as (k, q)-linRank-1 is non-polynomial in general. We would like to point out that in our numerical experiments a warm start with singular vectors and multiple runs of the algorithm (k, q)-linRank-1 keeping track of the highest found variance has provided us a very fast and reliable solver. Further discussion on this step go beyond the scope of this work.
In this section we report experimental results to assess the performance of sparse low-rank matrix estimation using different techniques. We start in Section 6.1 with simulations aiming at validating the theoretical results on statistical dimension of and assessing how they generalize to matrices with (k, q)-rank larger than 1. In Section 6.2 we compare several techniques for sparse PCA on simulated data.
6.1 Empirical estimates of the statistical dimension.
In order to numerically estimate the statistical dimension of a regularizer
at a matrix Z, we add to Z a random Gaussian noise matrix and observe
has normal i.i.d. entries following N(0, 1). We then denoise Y using (30) to form an estimate
. For small
the normalized mean-squared error (NMSE) defined as
is a good estimate of the statistical dimension, since Oymak and Hassibi (2013) show that
Numerically, we therefore estimate and measuring the empirical NMSE averaged over 20 repeats. We consider square matrices with
, and estimate the statistical dimension of
and the trace norms at different matrices Z. The constrained denoiser (30) has a simple close-form for the
and the trace norm. For
, it can be obtained by a series of proximal projections (15) with different parameters
has the correct value
. Since the noise is small, we found that it was sufficient and faster to perform a (k, q)-SVD of Y by solving (15) with a small
, and then apply the
constrained denoiser to the set of (k, q)-sparse singular values.
We first estimate the statistical dimensions of the three norms at an atom , for different values of k = q. Figure 2 (top left) shows the results, which confirm the theoretical bounds summarized in Table 2. The statistical dimension of the trace norm does not depend on k, while that of the
norm increases almost quadratically with k and that of
increases linearly with k. As expected,
interpolates between the
) and the trace norm (for
outperforms both norms for intermediary values of k. This experiments therefore confirms that our upper bound (33) on
captures the correct order in k, although the constants can certainly be much improved, and that Algorithm 1 manages, in this simple setting, to correctly approximate the solution of the convex minimization problem.
Second, we estimate the statistical dimension of on matrices with (k, q)-rank larger than 1, a setting for which we proved no theoretical result. Figure 2 (top left) shows the numerical estimate of
for matrices Z which are sums of
with non-overlapping support, for k = 10 and varying r. We observe that the increase in statistical dimension is roughly linear in the (k, q)-rank. For a fixed (k, q)-rank of 3, the bottom plots of Figure 2 compare the estimated statistical dimensions of the three regularizers on matrices Z which are sums of
with non-overlapping (bottom left) or overlapping (bottom right) supports. The shapes of the different curves are overall similar to the rank 1 case, although the performance of
degrades as the supports of atoms overlap. In both cases,
consistently outperforms the two other norms. Overall these experiments suggest that the statistical dimension of
at a linear combination of r atoms increases as
where the coefficient C increases with the overlap among the supports of the atoms.
6.2 Comparison of algorithms for sparse PCA
In this section we compare the performance of different algorithms in estimating a sparsely factored covariance matrix that we denote . The observed sample consists of n random vector vectors generated i.i.d. according to
. The matrix
by adding 3 blocks of rank 1,
, having all the same sparsity
overlaps and nonzero entries equal to
. See Figure 3, bottom right plot for a representation of the ground truth
. The noise level
is set in order to make the signal to noise ratio below the level
where a spectral gap appears and makes the spectral baseline (penalizing the trace of the PSD matrix) work. In our experiments the number of variables is p = 200 and n = 80 points are observed. To estimate the true covariance matrix from the noisy observation, first the sample covariance matrix is formed as
and given as input to various algorithms which provide a new estimate . The methods we compared are the following:
• Raw sample covariance. The most basic is to output as the estimate of the covariance, which is not accurate due to presence of noise and underdeterminedness n < p.
• Trace penalty on the PSD cone. This spectral algorithm solves the following optimization problem in the cone of PSD matrices:
• In order to approximate the sample covariance
by a sparse matrix a basic idea is to soft-threshold it element-by-element. This is equivalent to solving the following convex optimization problem:
Figure 2: Estimates of the statistical dimensions of the , trace and
norms at a matrix
in different setting. Top left: Z is an atom in
for different values of k. Top right: Z is a sum of
with non-overlapping support, with k = 10 and varying r. Bottom left: Z is a sum of
with non-overlapping support, for varying k. Bottom right: Z is a sum of
with overlapping support, for varying k.
Figure 3: Sparse PCA example. The first row shows the supports found by our method (left) by sequential sparse PCA (middle) and element wise thresholding of the sample covariance matrix. Other plots contain heatmaps of the estimated covariance matrix using different methods, and the ground truth in the lower right hand side.
Table 3: Relative error of covariance estimation with different methods.
• The restriction of
to the PSD cone, which is equivalent to solving the following SDP min
1 2
• Sequential sparse PCA. This is the standard way of estimating multiple sparse principal components which consists of solving the problem for a single component at each step t = , and deflate to switch to the next (t + 1)st component. The deflation step used in this algorithm is the orthogonal projection
The tuning parameters for this approach are the sparsity level k and the number of principal components r.
• The following optimization problem, which is a proximal operator computation, is solved using the active set algorithm:
with the gauge associated with
already introduced in Section 3.4. The two parameters of this method are
We report the relative errorsover 10 runs of our experiments in Table 3, and a representation of the estimated matrices can be found in Figure 3. We observe that sparse PCA methods using
and also the sequential method using deflation steps outperform spectral and
baselines. In addition, penalizing
is superior to the sequential approach. This was expected since our algorithm minimizes a loss function that is close to the test errors reported, whereas the sequential scheme does not optimize a well-defined objective.
In this work, we proposed two new convex penalties, the (k, q)-trace norm and the (k, q)-CUT norm, specifically tailored to the estimation of low-rank matrices with sparse factors. Our motivation for proposing such convex formulations for sparse low-rank matrix inference was twofold. First, it allowed us to consider algorithmic schemes that are better understood when a problem is formulated as a convex optimization problem, even though the complexity of solving the problem exactly remains super-polynomial. Second, using convex geometry allowed us to provide sample complexity and statistical guarantees, and notably to show that the proposed estimators have much better statistical dimension than more standard convex combinations of the and trace norms. We observed that the improvement exists only for matrices: for sparse vectors, using our penalty (which boils down to the k-support norm in this case) does not improve over the standard
norm, in terms of statistical dimension increase rate.
One limitation of this work is that we assume that the sparsity of the factors is known and fixed. Lifting this constraint and investigating procedures that can adapt to the size of the blocks (like the norm adapts to the size of the support) is an interesting direction for future research. Another interesting direction is to use the nuclear norm formulation of the (k, q)-trace norm as in Lemma 10 to optimize the regularized problem.
We would like to thank Francis Bach for interesting discussions related to this work. This work was supported by the European Research Council (SMAC-ERC-280032) and by by Agence Nationale de la Recherche (ANR-13-MONU-005-10)
D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: Phase transitions in convex programs with random data. Technical Report 1303.6672, arXiv, Mar 2013. URL http://arxiv.org/abs/1303.6672.
A. A. Amini and M. J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Stat., 37(5B):2877–2921, 2009. URL http://dx.doi.org/10.1214/08-AOS664.
A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Adv. Neural. Inform. Process Syst., volume 25, pages 1457–1465. Curran Associates, Inc., 2012. URL http://books.nips.cc/papers/files/nips25/NIPS2012_0698.pdf.
F. Bach. Convex relaxations of structured matrix factorizations. Technical Report 1309.3117, arXiv, 2013. URL http://arxiv.org/pdf/1309.3117v1.pdf.
F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. Technical Report 0812.1869, arXiv, 2008. URL http://arxiv.org/abs/0812.1869.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. 4(1):1–106, 2011. URL http://dx.doi.org/10.1561/2200000015.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization. Stat. Sci., 27(4):450–468, 2012. URL http://dx.doi.org/10.1214/12-STS394.
Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In S. Shalev-Shwartz and I. Steinwart, editors, COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, volume 30 of JMLR Proceedings, pages 1046–1066. JMLR.org, 2013. URL http://jmlr.org/proceedings/papers/v30/Berthet13.html.
R. Bhatia. Matrix analysis. Springer, 1997.
E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013. URL http://dx.doi.org/10.1002/cpa.21432.
V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. Proc. Natl. Acad. Sci. USA, 110(13):E1181–E1190, Mar 2013. URL http://dx.doi.org/10.1073/pnas.1302293110.
V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Found. Comput. Math., 12(6):805–849, 2012. URL http://dx.doi.org/10.1007/s10208-012-9135-7.
J. T. Chu. On bounds for the normal integral. Biometrika, 42(1/2):263–265, 1954. URL http://dx.doi.org/10.2307/2333443.
A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007. URL http://dx.doi.org/10.1137/050645506.
A. d’Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res., 9:1269–1294, 2008. URL http://jmlr.org/papers/v9/aspremont08a.html.
K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In W. B. Johnson and J. Lindenstrauss, editors, Handbook of the Geometry of Banach Spaces, volume 1, pages 317 – 366. Elsevier Science B.V., 2001. URL http://dx.doi.org/10.1016/S1874-5849(01)80010-3.
M. M. Deza and M. Laurent. Geometry of Cuts and Metrics, volume 15 of Algorithms and Combinatorics. Springer Berlin Heidelberg, 1997.
X. V. Doan and S. A. Vavasis. Finding approximately rank-one submatrices with the nuclear norm and 23(4):2502–2540, 2013. URL http://dx.doi.org/10.1137/100814251.
R. Foygel and L. Mackey. Corrupted sensing: Novel guarantees for separating structured signals. IEEE Trans. Inform. Theory, 60(2):1223–1247, 2014. URL http://dx.doi.org/10.1109/TIT.2013.2293654.
L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ing, pages 433–440, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. URL http://dx.doi.org/10.1145/1553374.1553431.
G. J. O. Jameson. Summing and Nuclear Norms in Banach Space Theory. Number 8 in London Mathematical Society Student Texts. Cambridge University Press, 1987. URL http://dx.doi.org/10.1017/CBO9780511569166.
M. Jerrum. Large cliques elude the Metropolis process. Random Struct. Alg., 3(4):347–359, 1992. URL http://dx.doi.org/10.1002/rsa.3240030402.
M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11:517–553, 2010. URL http://jmlr.org/papers/volume11/journee10a/journee10a.pdf.
V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear norm penalization and optimal rates for noisy matrix completion. Ann. Stat., 39(5):2302–2329, 2011. URL http://dx.doi.org/10.1214/11-AOS894.
R. Krauthgamer, B. Nadler, and D. Vilenchik. Do semidefinite relaxations really solve sparse PCA? Technical Report 1306:3690, arXiv, 2013. URL http://arxiv.org/abs/1306.3690.
H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Adv. Neural. Inform. Process Syst., volume 19, pages 801–808. MIT Press, 2007. URL http://papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.
R. Luss and M. Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Rev., 55(1):65–98, 2013. URL http://dx.doi.org/10.1137/110839072.
L. W. Mackey. Deflation methods for sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Adv. Neural. Inform. Process Syst., volume 21, pages 1017–1024. Curran Associates, Inc., 2009. URL http://papers.nips.cc/paper/3575-deflation-methods-for-sparse-pca.pdf.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19–60, 2010. URL http://jmlr.csail.mit.edu/papers/v11/mairal10a.html.
B. Moghaddam, Y. Weiss, and Sh. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. In Neural Information Processing Systems (NIPS), volume 18, page 915. MIT Press, 2006.
B. Moghaddam, A. Gruber, Y. Weiss, and S. Avidan. Sparse regression as a sparse eigenvalue problem. In Information Theory and Applications Workshop, 2008, pages 121–127. IEEE, Jan 2008. URL http://dx.doi.org/10.1109/ITA.2008.4601036.
S. N Negahban, P. Ravikumar, M. J Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators. Statistical Science, 27(4):538–557, 2012.
S. Oymak and B. Hassibi. Sharp mse bounds for proximal denoising. Technical Report 1305.2714, arXiv, 2013. URL http://arxiv.org/abs/1305.2714.
S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. Technical Report 1212.3753, arXiv, 2012. URL http://arxiv.org/abs/1212.3753.
S. Oymak, C. Thrampoulidis, and B. Hassibi. The squared-error of generalized LASSO: A pre- cise analysis. In 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton Park & Retreat Center, Monticello, IL, USA, October 2-4, 2013, pages 1002–1009. IEEE, 2013. URL http://dx.doi.org/10.1109/Allerton.2013.6736635.
E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress, 2012. URL http://icml.cc/discuss/2012/674.html.
E. Richard, F. Bach, and J.-P. Vert. Intersecting singularities for multi-structured estimation. In S. Dasgupta and D. Mcallester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1157–1165, Atlanta, Georgia, USA, may 2013. JMLR Workshop and Conference Proceedings. URL http://jmlr.org/proceedings/papers/v28/richard13.html.
E. Richard, S. Gaïffas, and N. Vayatis. Link prediction in graphs with autoregressive features. J. Mach. Learn. Res., 15(1):565–593, jan 2014. URL http://jmlr.org/papers/v15/richard14a.html.
R.T. Rockafellar. Convex Analysis. Princeton University Press, 1997.
P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Program., 117(1-2):387–423, 2009. URL http://dx.doi.org/10.1007/s10107-007-0170-0.
R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutinyok, editors, Compressed Sensing, Theory and Applications, pages 210–268. Cambridge University Press, 2012. URL http://dx.doi.org/10.1017/CBO9780511794308.006.
M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inform. Theory, 55(12):5728–5741, 2009. URL http://dx.doi.org/10.1109/TIT.2009.2032816.
Y.-X. Wang, H. Xu, and C. Leng. Provable subspace clustering: When LRR meets SSC. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Adv. Neural. Inform. Process Syst., volume 26, pages 64–72. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4865-provable-subspace-clustering-when-lrr-meets-ssc.
G. A. Watson. Characterization of the subdifferential of some matrix norms. Lin. Alg. Appl., 170: 1039–1053, 1992. URL http://dx.doi.org/10.1016/0024-3795(92)90407-2.
D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, Jul 2009. URL http://dx.doi.org/10.1093/biostatistics/kxp008.
X.-T. Yuan and T. Zhang. Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res., 14:889–925, 2013. URL http://www.jmlr.org/papers/volume14/yuan13a/yuan13a.pdf.
H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. J. Comput. Graph. Stat., 15(2):265–286, 2006. URL http://dx.doi.org/10.1198/106186006X113430.
Proof [Proposition 2] To prove the first claim, note that a matrix of the form most kq non-zero terms. Therefore, the decomposition of a matrix with no null entries as a linear combination of such sparse matrices must count at least
terms, which is larger than
when
To prove second claim, consider for the matrix
the problem of finding a decomposition of Z which attains the (2, 2)-rank. It is impossible to write Z as the sum of two (2, 2)-sparse matrices, because it would then have at most 8 non-zero coefficients. But we have the decomposition.
which shows that the (2, 2)-rank of Z is 3. Note that this decomposition is not unique: given that Z is invariant by any of the 6 permutations of the rows and any of the 6 permutations of the columns, Z admits at least 36 different decompositions attaining the (2, 2)-rank.
Now, observe that the decomposition proposed above for left- and right- (2, 2)-sparse factors that are obviously not orthogonal. It can actually be shown by systematic enumeration of all possible cases that it is impossible to find any (2, 2)-sparse decomposition of Z with left or right factors that are orthogonal.
Proof [Proposition 6]
To prove the first claim, let us consider the matrix . We showed in the proof of Proposition 2 above that its (2, 2)-rank is equal to 3. We now show that the number of its (k, q)-sparse singular value is 9, and thus much larger than 3. For that purpose, we express any (2, 2)-SVD of Z as a minimizer of (8), and write the corresponding Lagrangian:
where are the primal and dual variables. It is easy to check that the dual solution is the unique subgradient of
which is equal to
. But any primal solution must satisfy
. This implies that any primal solution
Then, one can check that
forms a basis of
so that any matrix Z admits a unique set of decomposition coefficients on that basis. This proves that the unique solution of (8) is the one such that
for all pairs
. This unique (k, q)-SVD is composed of 9 terms which is strictly larger than its (k, q)-rank, the latter being equal to 3.
To prove the second claim, let us consider the . By proposition 18,
which shows that
. Considering that there are 3 ways to partition {1, 2, 3, 4} into sets of cardinality 2, Z admits at least 9 different optimal decompositions in the sense of the (2, 2)-SVD since Z can be written in 9 different ways as the sum of four matrices of
with disjoint supports. Each of these decompositions attains the (2, 2)-rank which is equal to 4. Note also that by convexity any convex combination of these decompositions is also an optimal decomposition in the sense of the (2, 2)-SVD, but can contain up to 36 terms!
To prove the third claim, let us consider
As are all positive semidefinite we have
equality (7),
which proves that the decomposition
is optimal:
. So this decomposition is a decomposition of Z onto linear combination of atoms
which are not orthogonal.
where the second equality follows from the fact that the maximization of a linear form over a bounded convex set is attained at one of the extreme points of the set. Given this closed-form expression of the dual norm, we prove the variational formulation (8) for the primal norm Consider the function
Since is defined as the infimum of a jointly convex function of
obtained by minimizing w.r.t. to the latter variables, it is a an elementary fact from convex analysis that
is a convex function of Z. It is also symmetric and positively homogeneous, which together with convexity prove that
defines a norm. We can compute its dual norm as
This proves that since a norm is uniquely characterized by its dual norm. Finally, to show (10) we use the general characterization of the subdifferential of a norm (e.g., Watson, 1992):
Let us denote a subgradient by is an atom, we have
In addition,
, therefore the condition
boils down to
. Given the characterization of the dual norm (9), we therefore get:
Let now
Since , it is clear that
Conversely, let
, and therefore, by Pythagorean equality applied to the orthogonal vectors
but since we must have
. This shows that
. The same reasoning starting with the orthogonal vectors
shows that we also have
, implying that
. This concludes the proof that
, as claimed in (10).
Let be the nuclear norm induced by two atomic norms
, induced themselves respectively by the two atom sets
then the key argument is to note that we have
Indeed, if , then with
. The inclusion is then proved by density. By (12) the dual norm of
where the middle equality is due to the fact that the maximum of a linear function on a convex set is attained at a vertex. We therefore have ), this shows that
is the atomic norm induced by A.
Proof [Theorem 11]
Since the (k, q)-trace norm is the atomic norm induced by the atom set (5), Lemma 10 tells us that it is also the nuclear norm induced by the two atomic norms with atom sets correspond exactly to the so-called k- and q-support norms of Argyriou et al. (2012).
To prove the second statement, we proceed similarly to get that the (k, q)-CUT norm is the nuclear norm induced by the two atomic norms with atom sets
norms, we obtain an explicit formulation as follows:
where denotes the the ith largest element of s in absolute value. This norm is proportional to a norm known as the vector k-norm or 1-k symmetric norm gauge.
Proof [Proposition 13]
To prove the first claim, we show a counterexample for the J = {3, 4}. The matrix
can be written as
, and so its (2, 2)-rank is less than 4. But its (2, 2)-rank must be at least 4, because the matrix has 16 non-zeros coefficients and the sum of three (2, 2)-sparse matrices has at most 12 non-zero coefficients. Its (2, 2)-rank is thus equal to 4.
However, it is not possible to write it as a sum of less than 6 symmetric (2, 2)-sparse matrices, because each of these matrices can only make one coefficient above the non-diagonal non-zero.
For the second claim, we have shown in the proof of proposition 6 that the decomposition above is a (2, 2)-SVD.
To prove the third claim, note first that the case k = 1 is peculiar and not representative of the general case because the span of the PSD matrices of sparsity 1 are only the diagonal matrices, while the span of rank one PSD matrices of sparsity is all the symmetric matrices. Now, we claim that it is not possible to write
as a sum of PSD matrices that are (2, 2)-sparse and PSD. Indeed, if this was the case, this would imply the existence of a non zero vector v with a support of size at most
. Since the only eigenvector of Z associated with a non-zero eigenvalue is the constant vector this is impossible.
Proof [Lemma 14]
We prove a more general result than Lemma 14. Let be any matrix norm, and
be a linear map. We denote by
-th design matrix defined by
. For a given matrix
, assume we observe:
where is a centered random noise vector. We consider the following estimator of
for some value of the parameter . The following result generalizes standard results known for the
and trace norms (e.g., Koltchinskii et al., 2011, Theorem 1) to any norm
Theorem 28
Lemma 14 is then a simple consequence of Theorem 28 by taking for X the identity map, upper bounding the right-hand side of (43) by the value it takes for
, and replacing
which after developing the squared norm and replacing Y by (41) gives
and therefore
Now, using the fact (true for any norm) that for any vectors
taking
, we can upper bound the second term of the right-hand side of (44) by:
Plugging this bound back in (44) finally gives
the last inequality being due to the triangle inequality.
Before proving Propositon 15, let us first derive an intermediary results useful to obtain an upper bound on the dual (k, q)-trace norm of a random matrix with i.i.d. normal entries.
For a random matrix with i.i.d. standard normal entries, we have the following concentration inequality (e.g., Davidson and Szarek, 2001): for
Denoting , we have the sequence of inequalities
where the change of variable used in (47) is ) is true for any
(46) follows from the property of the inverse
and sandwiched via
where in the last inequality we simply used
The upper bounds for the and trace norms are standard. See Vershynin (2012, Theorem. 5.32) for the tight upper bound on the operator norm
, and for the upper bound on the element-wise
, use Jensen inequality followed by upper bounding the maximum of nonnegative scalars by their sum:
Taking in the logarithms of the last inequality gives
Let us start with a simple result useful to prove inclusions of tangent cones.
Lemma 30 Let f and g two convex functions from
From the definition (26) of the tangent cone we deduce, by taking the union over closure of this inclusion, that
We can now prove the results in Section 4.2.2
Proof [Proposition 18]
Consider a matrix Since A is an atom of both the norm
and the norm
that, for any
By Fenchel duality, we therefore have for any
Proof [Corollary 19] Combining Proposition 18 with Lemma 30 directly gives (31). (32) is then a direct consequence of the definition of the statistical dimension (27).
Proof [Corollary 20]
A necessary and sufficient condition for exact recovery is the so called null space property which is the event that , where Ker(X) is the kernel of the linear transformation X (Chandrasekaran et al., 2012, Proposition 2.1). The result therefore follows from the inclusion of the cones stated in Corollary 19.
Proof [Proposition 21]
Let , meaning that
differential of the scaled
From (10), we get that the subdifferential of
The first condition is equivalent to , which implies that the second is equivalent to
. We deduce that
if and only if
This shows that the subdifferentials of have the same conic hull, and Proposition 21 follows by noting that the tangent cone is the polar cone of the conic hull of the subdifferential (Rockafellar, 1997, Theorem 23.7).
The aim of this appendix is to prove the upper bound on the statistical dimension Proposition 24. Given its level of technicality, we split the proof in several parts. We start with preliminaries and notations in Section D.1, before proving Proposition 24 in Section D.2. The proofs of several technical results needed in Section D.2 are postponed to Section D.3, D.4 and D.5.
D.1 Preliminaries and notations
Let us start with some notations used throughout Appendix D. is an atom of
with
refers to the atom strength of A (Definition 23). For any
. Note that while
subvector of a, the notation
does not refer to a subvector of some vector u and that therefore
To analyze the statistical dimension (27) of
, it is useful to express it as follows (Chandrasekaran et al., 2012, Proposition 3.6):
where is the normal cone of
, the conic hull of the subdifferential of
at
denotes the Frobenius distance of the Gaussian matrix G with i.i.d. standard normal entries to
. In order to upper bound this quantity, it is therefore important to characterize precisely the normal cone
For that purpose, let us introduce further notations. We consider the following subspace of
and denote by the orthogonal projectors onto span
respectively. Since
, we have the closed-form expressions
For any , consider now the subspace
and its orthogonal
Note that spanis related to the subdifferential of
, since according to (10) we can write it as
It is possible to estimate the dimension of spanas follows:
Proof [Lemma 31]
For , the range of
equals the range of
which has dimension
. By the same token, the range of
has dimension q. By definition of span
we therefore have
and therefore by the inclusion-exclusion principle
Finally we denote by the projector onto span
the projector onto span
. They satisfy respectively
D.2 Proof of Proposition 24
Proof [Proposition 24]
In order to upper bound the statistical dimension of , we associate to any matrix G a matrix
belonging to the normal cone
is measurable. From the characterization of the statistical dimension (50), since dist
will then get the upper bound:
The main steps in the proof are then (i) to define the mapping , (ii) to show that
for all G, and (iiii) to upper bound
in order to derive an upper bound on
by (52).
Given a measurable function , let us therefore consider the mapping
The following lemma provides a mapping to ensure that
Then, for every , the matrix
defined in (53) belongs to the normal cone of
at A.
By choosing as in Lemma 32, the upper bound (52) because
. Using the decomposition
where (55) is due to and the fact that
follows a chi-square distribution with
degrees of freedom, since by Lemma 31 this is the dimension of span
to upper bound
we need the following two lemmata in addition to Lemma 29.
Lemma 34
Combining Lemmata 29, 56 and 34 with the definition of ) we deduce
Plugging this upper bound into (55) finally proves Proposition 24.
D.3 The scaling factor ensures that
(proof of Lemma 32)
so that (53) becomes . To prove that
belongs to the normal cone of
at A, it is sufficient to prove that
is a subgradient of
the characterization of the subgradient in (51), and since
, this is equivalent to
, which itself is equivalent to
There thus remains to prove the first inequality of (57). Note that the matrix has rank 2, so its Frobenius norm is larger than its operator norm by at most a factor of
with the Frobenius norm is more convenient, so knowing that
we will establish an upper bound on the latter quantity which we denote by . Noting that
we get
The following Lemma provides upper bounds on the different terms.
Lemma 35 We have
This yields
Since
These inequalities yield
, so that factorizing
in the previous expression, we obtain
which concludes the proof.
D.4 Proof of Lemma 35
Let us first start with a few useful technical lemmas.
Proof [Lemma 36]
which proves the first equality. The second one is proved similarly.
Lemma 38
The largest singular value is attained on the span of both on the left and on the right. Given that
, it is therefore also the largest eigenvalue of the matrix of the linear operator restricted to this span which is equal to
for . Tedious but simple calculations show that the squared operator norm of this matrix is equal to
which takes its maximum value 4/3 for x = 1/3.
because as a result of the fact that by lemma 37,
have disjoint supports.
Now (see Lemma 38 for a proof). This shows the second inequality and the third follows by symmetry.
D.5 Upper bounds for (Proofs of Lemmata 33 and 34)
Proof [Lemma 33]
Using (45) and the fact that
It follows that
As the sets are disjoint, and
of unit length, the random variable
follows a chi-square distribution with i + j degrees of freedom, where Using Chernoff’s inequality and the form of the chi-square moment generating function, we have that for any fixed real number
and fixed index sets
Taking the maximum over index sets I and J with the same intersection sizes with respectively, and using a union bound on the independent choices of I and J, yields
Taking , we have for any t < 1/2 (assuming w.l.o.g.
Let us introduce
where
As a consequence, we have
It follows that
Let us start with a technical lemma:
a linear map from the standard Gaussian ensemble and
and further
then, with probability , solving formulation (28) with the norm
fails to recover
simultaneously for any values of
are universal constants.
Proof [Lemma 39]
The proof consists in applying theorem 3.2 in Oymak et al. (2012) for the combination of the norm with the trace norm. We adapt slightly the notations of that paper to reflect the fact that we are working with matrices. Since we consider conic combinations of the
and trace norms, the number of norms is therefore
. To apply the theorem we need to specify
in the notations of that paper.
For each decomposable norm -norm and and
the trace norm, given a point
(which corresponds to the point
Oymak et al., 2012), the authors define
• the supporting subspaces and
in the paper), the orthogonal projection of any subgradient of the norm in
(Definition 2.1),
• the Lipschitz constant of
with respect to the Euclidean norm (Definition 2.2),
•
Let with support
the element of the canonical basis of
• so that dim
• Tso that dim
By definition
We then have is the projection of
The situation is less simple for
. Some calculations lead to
hence the definition of . Theorem 3.2 in Oymak et al. (2012) offers the possi- bility of constraining the estimator to lie in a cone C. In our case,
, given the definition of
we therefore have
. The result follows from applying the theorem with
using
Proof [Proposition 25]
Take M such that when is large enough to ensure
. Then, according to Lemma 39, solving (28) with the norm
to recover
with probability at least
. On the other hand, Amelunxen et al. (2013, Theorem 7.1) shows that, when
, then solving (28) with the norm
correctly recovers A with probability at least
where , then using the fact that
we get that the probability (59) is smaller than
. This implies that
F.1 Lower bound on the statistical dimension of (Proof of Proposition 26)
Let us start with two technical lemmata.
denote the kth order statistics of an i.i.d. sample
whose common distribution has a cdf F. Assume that
is a convex function
with . Assuming that
is a convex function, we have by Jensen’s inequality
Proof [Lemma 41] Denote by F the cdf of the absolute value of a standard normal variable. Then,
where is the cdf of a standard Gaussian and erf denotes the error function. We use the following inequality due to Chu (1954):
to deduce that
By definition, we have is a vector of independent standard normal variables. It can easily be checked that
is a convex function. This implies, using Lemma 40, that
Proof [Proposition 26] We will denote the squared Gaussian width of the tangent cone intersected with a Euclidean unit ball by
where denotes a standard Gaussian vector. We have
Chandrasekaran et al., 2012, Proposition 3.6). We thus seek a lower bound of
Since the tangent cone is polar to the normal cone, we have that
Given a random Gaussian vector the support of
the indices of the k largest coefficients of G in absolute value outside of
. Denote by
, the vector whose entries are zero outside of
and equal to the sign of the corresponding coefficient of G otherwise. Define
. By construction
. Let now consider
so that whence the result using Lemma 41 and
F.2 Upper bound on the statistical dimension of (Proof of Proposition 27)
Proof [Proposition 27]
Without loss of generality, let us assume that is a fixed vector having nonincreasing – in absolute value – coordinates, the first s of which are assumed to be nonzero. We compute the subdifferential of
directly by using (14). Remember that one characterization of the subdifferential is
Letting being the unique integer such that
let us partition the set of entries
(where each set may be empty). Then we can rewrite the expression of the k-support norm (14) as
As for , the coefficients
do not impact
so they should also not impact
this implies
this means
, and in that case
. With the convention
, we finally get the following expression for the subdifferential:
In the case . In that case
, showing that
is differentiable at
is useless to recover w.
where g is a p-dimensional random vector with i.i.d. normal entries and is the conic hull of
. We then get:
where following Chandrasekaran et al. (2012, Annex C), for (61) we used the fact that for a standard normal random variable
while (62) is obtained by taking
For the lasso case (. Plugging this into (62) we recover the standard bound (Chandrasekaran et al., 2012):
In the general case remember that, by definition of r,
and therefore
Plugging the left-hand inequality of (64) into (62) and remembering that shows that the bound (62) obtained for
, is never better than the bound (63) obtained for the lasso case k = 1. In the case s = k, the right-hand inequality of (64) applied to an atom
with atom strength
norm leads to
from which we deduce by (62) the upper bound on the statistical dimension